Custom Integration for Data Classification

The custom integration lets you extend Data Classification scanning to your own systems. This is especially useful for scanning internal home-brew systems or for avoiding external connections to internal databases. It is straightforward to implement with your own API endpoint (such as a cloud function).

The custom integration works by letting you register an API endpoint that is called as part of the Data Classification workflow and used to iterate over your data store. Just like Mine's built-in integrations, the custom integration for Data Classification supports full scans as well as smart sampling.

The custom integration supports the following operations. Each operation requires an endpoint URL in order to be used:

Operation   Description
Metadata    An endpoint designed to get metadata information about the underlying data store.
Data        An endpoint designed to query for data on the underlying data store.

Metadata Endpoint

The Metadata endpoint is used at the start of the scan to obtain information about the data source being scanned. The information includes a collection of data containers (e.g. tables) with their names and number of records.

HTTP Method   URL
GET           /metadata?isTest=false

Expected response:

{
  "objects": [
    {"name": "table1", "count": 100000000},
    {"name": "table2", "count": 400},
    {"name": "table3", "count": 200000000},
    ...
  ]
}

Query Parameters:

Name     Description
isTest   Optional. Signals that the request was triggered as a test and not a real scan. See “Test Integration” below for more information. Default value: false.
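
As an illustration of the response shape above, here is a minimal Python sketch of the two pieces a Metadata endpoint needs: reading the optional isTest parameter and serializing the object list. The TABLE_COUNTS dictionary and both function names are hypothetical stand-ins for your own data store and web framework.

```python
import json
from urllib.parse import urlparse, parse_qs

# Hypothetical stand-in for a real data store: object name -> record count.
TABLE_COUNTS = {"table1": 100000000, "table2": 400, "table3": 200000000}


def parse_is_test(url):
    """Read the optional isTest query parameter; it defaults to false."""
    query = parse_qs(urlparse(url).query)
    return query.get("isTest", ["false"])[0].lower() == "true"


def build_metadata_response(table_counts):
    """Serialize object names and record counts in the expected JSON shape."""
    objects = [
        {"name": name, "count": count}
        for name, count in table_counts.items()
    ]
    return json.dumps({"objects": objects})
```

In a real deployment these helpers would sit behind whatever HTTP framework or cloud function runtime you already use.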

Data Endpoint

The Data endpoint is used to get a batch of records for classification.

HTTP Method   URL
GET           /data/{objectName}?offset=0&limit=100&isTest=false

Expected response:

{
  "data": [
    {"id": "1ab2", "header1": "value11", "header2": "value21", ...},
    {"id": "2b5rc", "header1": "value12", "header2": "value22", ...},
    ...
  ]
}

Query Parameters:

Name     Description
isTest   Optional. Signals that the request was triggered as a test and not a real scan. See “Test Integration” below for more information. Default value: false.
offset   Required. An integer in the range [0, count-1], where count is the object's record count as reported by /metadata.
limit    Required. An integer in the range [1, maxBatchSize].
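
The following Python sketch shows one way a Data endpoint might validate offset and limit and slice out a page. It rests on several assumptions: MAX_BATCH_SIZE is a made-up value (the actual maxBatchSize is not stated here), the rows are held in an in-memory list, invalid parameters are answered with a 400, and a request past the last record is answered with a 204, in line with the stopping condition described under “Full Scan”.

```python
MAX_BATCH_SIZE = 1000  # hypothetical; the real maxBatchSize is not stated here


def get_data_page(rows, offset, limit, max_batch_size=MAX_BATCH_SIZE):
    """Return (http_status, payload) for one /data/{objectName} request."""
    if not isinstance(offset, int) or offset < 0:
        return 400, {"Error": "offset must be a non-negative integer"}
    if not isinstance(limit, int) or not 1 <= limit <= max_batch_size:
        return 400, {"Error": f"limit must be between 1 and {max_batch_size}"}
    if offset >= len(rows):
        # Assumption: nothing left to return, signalled with 204 and no body.
        return 204, None
    return 200, {"data": rows[offset:offset + limit]}
```

Against a real database, the slice would typically become an ordered query with the equivalent of OFFSET and LIMIT, so that pages stay stable across calls.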

Error Handling

The following responses received from any endpoint will fail the scan:

  • HTTP Status 4xx
  • HTTP Status 3xx
  • Invalid response structure/format

The following responses will cause MineOS to retry the request:

  • HTTP Status 5xx
  • Timeouts (more than 30s)

Any non-2xx response may include a JSON payload with an error message for MineOS:

{
  "Error": "error message description"
}
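
To make these rules easier to test against your own endpoint, the Python sketch below restates them as a small classifier. It illustrates the rules listed in this section and is not MineOS's actual implementation; the function names are hypothetical, and treating any 2xx JSON object as structurally valid is a simplification.

```python
import json


def classify_response(status, body_text, timed_out=False):
    """Map an endpoint response to 'ok', 'retry', or 'fail'."""
    if timed_out or 500 <= status <= 599:
        return "retry"  # 5xx and timeouts are retried
    if 300 <= status <= 499:
        return "fail"  # 3xx and 4xx fail the scan
    if status == 204:
        return "ok"  # no body expected
    try:
        payload = json.loads(body_text)
    except (TypeError, ValueError):
        return "fail"  # invalid response format
    return "ok" if isinstance(payload, dict) else "fail"


def error_message(body_text):
    """Extract the optional "Error" field from a non-2xx JSON body."""
    try:
        payload = json.loads(body_text)
    except (TypeError, ValueError):
        return None
    return payload.get("Error") if isinstance(payload, dict) else None
```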

MineOS Retry policy:

  • The first 3 retries use a short exponential backoff (seconds).
  • If these retries fail, MineOS applies a longer exponential backoff (minutes to hours) and restarts the scan.
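
The short-backoff phase can be pictured with the Python sketch below. The exact delays are not documented here, so the 1-second base and the doubling factor are assumptions chosen only to show the shape of an exponential backoff; send_request is a hypothetical callable that returns an HTTP status code.

```python
import time


def short_backoff_delays(base_seconds=1.0, factor=2.0, retries=3):
    """Delays before each retry; base and factor are assumed values."""
    return [base_seconds * factor ** attempt for attempt in range(retries)]


def call_with_short_retries(send_request, sleep=time.sleep):
    """Call send_request(); on a 5xx status, retry up to 3 times."""
    status = send_request()
    for delay in short_backoff_delays():
        if status < 500:
            break  # success or a non-retryable status
        sleep(delay)
        status = send_request()
    return status
```

Knowing that a failed call may be repeated several times within seconds is useful when sizing rate limits on your endpoint.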

Security & Authentication

MineOS offers several methods for securing the integration:

  • IP Whitelisting: all API calls to the customer’s endpoints are made from static IPs that can be whitelisted in the customer’s firewall. The IPs are listed in Custom Integration for DSR.
  • API Key: MineOS allows setting a custom HTTP header that can contain an API Key / Access Token that will be verified by the customer.
  • VPN Tunnel: MineOS allows making the HTTP calls over a VPN tunnel. Note: when using a VPN there is no need to whitelist any IPs.
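
For the API Key option, the check on the customer side can be as small as the Python sketch below. The header name X-API-Key is a hypothetical choice (you configure the actual header in MineOS), and the comparison uses hmac.compare_digest so that it runs in constant time.

```python
import hmac

API_KEY_HEADER = "X-API-Key"  # hypothetical header name; use the one you configured


def is_authorized(headers, expected_key):
    """Check the API key header using a constant-time comparison."""
    # HTTP header names are case-insensitive, so normalize before the lookup.
    normalized = {name.lower(): value for name, value in headers.items()}
    supplied = normalized.get(API_KEY_HEADER.lower(), "")
    return hmac.compare_digest(
        supplied.encode("utf-8"), expected_key.encode("utf-8")
    )
```

Keep in mind that 4xx responses fail the scan, so a rejected key surfaces as a failed scan rather than a retry.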

Example Flows

Sampling Scan

In this example we can see the network calls made during a sampling scan on a database using the Custom Integration:

  1. The operation starts with a /metadata call that returns the names of the objects and the number of items in each. MineOS determines the sample size for each object.
  2. The operation continues with a sequence of /data calls for each of the objects, to get pages of decrypted data for classification.
  3. The operation ends when enough data is returned to satisfy the sampling requirements.
    Note: sampling a table is not guaranteed to complete in a single call. This depends on whether the sample size fits within maxPageSize.
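
The note above comes down to simple arithmetic, sketched in Python below. How MineOS actually chooses sample sizes and offsets is not described here; the sketch assumes contiguous pages starting at offset 0 purely to show why a sample larger than the page size needs several /data calls. The function name is hypothetical.

```python
def plan_sample_calls(sample_size, max_page_size):
    """List the (offset, limit) pairs needed to fetch sample_size records."""
    if sample_size < 0 or max_page_size < 1:
        raise ValueError("sample_size must be >= 0 and max_page_size >= 1")
    calls = []
    offset = 0
    while offset < sample_size:
        # The last page may be smaller than the maximum page size.
        limit = min(max_page_size, sample_size - offset)
        calls.append((offset, limit))
        offset += limit
    return calls
```

For example, a sample of 250 records with a page size of 100 takes three calls, while a sample of 80 records fits in one.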

Full Scan

In this example we can see the network calls made during a full scan on a database using the Custom Integration:

  1. The operation starts with a single /metadata call that returns the number of records to scan.
  2. The operation continues with a sequence of /data calls to get pages of decrypted data for classification.
  3. The operation ends with a 204 status code from the customer endpoint.
    A sample scan will be similar, except that the stopping condition is not required.
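
Seen from the calling side, the full scan flow is a paging loop that stops on a 204. The Python sketch below simulates that loop so you can exercise your endpoint logic locally. It is an illustration, not MineOS code: fetch_page is a hypothetical callable returning (status, payload), and stopping on an empty page is an extra guard added here to avoid an endless loop.

```python
def full_scan(fetch_page, page_size):
    """Collect all records by paging until the endpoint answers 204."""
    records = []
    offset = 0
    while True:
        status, payload = fetch_page(offset, page_size)
        if status == 204:
            break  # stopping condition: no more data
        page = payload["data"]
        if not page:
            break  # guard against an empty 200 page (assumption)
        records.extend(page)
        offset += len(page)
    return records
```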

Test Integration

In this example, we can see the network calls made during a test integration flow. This is triggered by MineOS during setup, to verify the integration is set up correctly:

  1. The operation starts with a call to /metadata.
  2. The operation continues with a single call to the /data endpoint with the first object returned from /metadata. Only 1 small page is requested.

Note: The customer can use the isTest parameter to return partial or mock data, to avoid doing a heavy query on the database.
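
One way to act on this note is to short-circuit the database query when isTest is true, as in the Python sketch below. MOCK_ROWS and the function names are hypothetical; query_rows stands for whatever function runs the real query in your system.

```python
# Hypothetical mock record shaped like a row of the /data response.
MOCK_ROWS = [{"id": "mock-1", "header1": "sample value"}]


def rows_for_request(is_test, query_rows, limit):
    """Return mock rows for test calls; run the real query otherwise."""
    if is_test:
        # Test calls never touch the database.
        return MOCK_ROWS[:limit]
    return query_rows(limit)
```

Mock rows that mirror the real column names keep the test flow representative of what a real scan will return.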