Custom Integration for Data Classification
The custom integration lets you extend Data Classification scanning to your own systems. This is especially useful for scanning internal, home-grown systems, or for avoiding external connections to internal databases. You implement it by exposing your own API endpoint (for example, a cloud function).
The custom integration works by letting you register an API endpoint that is called as part of the Data Classification workflow and is used to iterate over your data store. Just like Mine's built-in integrations, the custom integration for Data Classification supports full scans as well as smart sampling.
The custom integration supports the following operations. Each operation requires an endpoint URL in order to be used:
Operation | Description |
---|---|
Metadata | An endpoint designed to get metadata information about the underlying data store. |
Data | An endpoint designed to query for data on the underlying data store. |
Metadata Endpoint
The Metadata endpoint is used at the start of the scan to obtain information about the data source being scanned. The information includes a collection of data containers (e.g. tables) with their names and number of records.
HTTP Method | URL |
---|---|
GET | /metadata?isTest=false |
Expected response:
{
  "objects": [
    {"name": "table1", "count": 100000000},
    {"name": "table2", "count": 400},
    {"name": "table3", "count": 200000000},
    ...
  ]
}
Query Parameters:
Name | Description |
---|---|
isTest | Optional. Signals that the request was triggered as a test and not as a real scan. See "Test Integration" below for more information. Default value: false. |
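As an illustration, the response body above could be assembled as follows. This is a minimal sketch, not a reference implementation: the in-memory `TABLES` dictionary and the `build_metadata_response` helper are hypothetical stand-ins for a real data store and a real HTTP handler.

```python
import json

# Hypothetical in-memory stand-in for the underlying data store:
# object name -> list of records.
TABLES = {
    "table1": [{"id": str(i), "header1": f"value{i}"} for i in range(250)],
    "table2": [{"id": str(i), "header1": f"value{i}"} for i in range(40)],
}


def build_metadata_response(tables):
    """Build the /metadata body: one entry per data container with its record count."""
    return {
        "objects": [
            {"name": name, "count": len(records)}
            for name, records in tables.items()
        ]
    }


if __name__ == "__main__":
    # Serialize the body the way an HTTP handler would before sending it.
    print(json.dumps(build_metadata_response(TABLES), indent=2))
```

In a real deployment the counts would typically come from the data store itself (for example a row-count query) rather than from loading the records.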
Data Endpoint
The Data endpoint is used to get a batch of records for classification.
HTTP Method | URL |
---|---|
GET | /data/{objectName}?offset=0&limit=100&isTest=false |
Expected response:
{
  "data": [
    {"id": "1ab2", "header1": "value11", "header2": "value21", ...},
    {"id": "2b5rc", "header1": "value12", "header2": "value22", ...},
    ...
  ]
}
Query Parameters:
Name | Description |
---|---|
isTest | Optional. Signals that the request was triggered as a test and not as a real scan. See "Test Integration" below for more information. Default value: false. |
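offset | Required. The index of the first record to return: an integer in the range [0, count-1], where count is the record count reported by /metadata for the object. |
limit | Required. The maximum number of records to return: an integer in the range [1, maxBatchSize]. |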
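The paging contract above can be sketched as a small helper that validates offset and limit and slices one page. The names `get_data_page` and `MAX_BATCH_SIZE`, and the value 500, are assumptions made for illustration; the actual maxBatchSize is not stated here, and how a server maps an out-of-range request to an HTTP status is left open.

```python
MAX_BATCH_SIZE = 500  # assumed value; the real maxBatchSize is not stated here


def get_data_page(records, offset, limit, max_batch_size=MAX_BATCH_SIZE):
    """Return the /data body for one page of `records`.

    Raises ValueError when offset or limit fall outside the documented ranges.
    """
    count = len(records)
    if not isinstance(offset, int) or not 0 <= offset <= count - 1:
        raise ValueError(f"offset must be an integer in [0, {count - 1}]")
    if not isinstance(limit, int) or not 1 <= limit <= max_batch_size:
        raise ValueError(f"limit must be an integer in [1, {max_batch_size}]")
    # Slicing past the end is safe: the last page is simply shorter.
    return {"data": records[offset:offset + limit]}
```

For example, with ten records, `get_data_page(rows, 8, 5)` returns only the last two rows, because the slice stops at the end of the list.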
Error Handling
The following responses received from any endpoint will fail the scan:
- HTTP Status 4xx
- HTTP Status 3xx
- Invalid response structure/format
The following responses will cause MineOS to retry the request:
- HTTP Status 5xx
- Timeouts (more than 30s)
Any non-2xx response may include a JSON payload that carries an error message back to MineOS:
{
  "Error": "error message description"
}
MineOS Retry policy:
- The first 3 retries are with a short exponential backoff (seconds).
- If these retries fail, there is a longer exponential backoff (minutes to hours) and the scan is restarted.
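The fail-versus-retry rules above can be summarized in a small classifier. This is a sketch for reasoning about which status to return from your endpoint, not MineOS's actual code; the `Outcome` enum and the `classify_response` and `error_body` helpers are hypothetical names, and the handling of status classes the text does not mention (such as 1xx) is an assumption.

```python
from enum import Enum


class Outcome(Enum):
    SUCCESS = "success"
    RETRY = "retry"
    FAIL_SCAN = "fail_scan"


def classify_response(status=None, timed_out=False, body_is_valid=True):
    """Map an endpoint response to the outcome described in the rules above."""
    if timed_out:
        return Outcome.RETRY  # timeouts (more than 30s) are retried
    if status is None:
        raise ValueError("status is required unless the request timed out")
    if 500 <= status <= 599:
        return Outcome.RETRY
    if 300 <= status <= 499:
        return Outcome.FAIL_SCAN
    if 200 <= status <= 299 and body_is_valid:
        return Outcome.SUCCESS
    # Invalid structure/format fails the scan; treating any other
    # unlisted status the same way is an assumption of this sketch.
    return Outcome.FAIL_SCAN


def error_body(message):
    """Build the optional JSON error payload for a non-2xx response."""
    return {"Error": message}
```

One practical consequence: returning a 4xx for a transient problem ends the scan, while a 5xx gives MineOS the chance to retry.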
Security & Authentication
MineOS offers several ways for customers to secure the integration:
- IP Whitelisting: all API calls to the customer's endpoints are made from static IPs that can be whitelisted in the customer's firewall. The IPs are listed here: https://docs.mineos.ai/knowledge/ip-whitelisting
- API Key: MineOS allows setting a custom HTTP header that can contain an API Key / Access Token that will be verified by the customer.
- VPN Tunnel: MineOS allows making the HTTP calls over a VPN tunnel. Note: when using a VPN there is no need to whitelist any IPs.
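For the API Key option, the customer endpoint is responsible for verifying the header value. A minimal sketch of that check follows, assuming a header named `X-Api-Key`; the header name is hypothetical, since it is whatever custom header you configure, and `is_authorized` is an illustrative helper rather than part of any MineOS SDK.

```python
import hmac

API_KEY_HEADER = "X-Api-Key"  # hypothetical; use the header name you configured


def is_authorized(headers, expected_key):
    """Return True if the request carries the expected API key.

    Header names are matched case-insensitively, and the comparison uses
    hmac.compare_digest to avoid leaking information through timing.
    """
    supplied = ""
    for name, value in headers.items():
        if name.lower() == API_KEY_HEADER.lower():
            supplied = value
            break
    return hmac.compare_digest(supplied.encode("utf-8"), expected_key.encode("utf-8"))
```

A request that fails this check would normally be rejected with a 4xx status, which fails the scan rather than triggering retries.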
Example Flows
Sampling Scan
In this example we can see the network calls made during a sampling scan on a database using the Custom Integration:
- The operation starts with a /metadata call that returns the names of the objects and the number of items in each. MineOS determines the sample size for each object.
- The operation continues with a sequence of /data calls for each of the objects, to get pages of decrypted data for classification.
- The operation ends when enough data has been returned to satisfy the sampling requirements.
Note: sampling a table is not guaranteed to complete in a single call. This depends on whether the sample size fits within maxPageSize.
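To make the note above concrete, the sketch below splits a sample into the /data calls needed to cover it. It assumes contiguous pages starting at a given offset; which offsets MineOS actually requests for a sample is not documented here, and `plan_sample_pages` is a hypothetical helper.

```python
def plan_sample_pages(sample_size, max_page_size, start_offset=0):
    """Split a sample into (offset, limit) pairs, one per /data call."""
    if sample_size < 0 or max_page_size < 1:
        raise ValueError("sample_size must be >= 0 and max_page_size >= 1")
    pages = []
    offset = start_offset
    remaining = sample_size
    while remaining > 0:
        limit = min(remaining, max_page_size)  # never ask for more than one page
        pages.append((offset, limit))
        offset += limit
        remaining -= limit
    return pages
```

With a page size of 100, a sample of 80 records fits in one call, while a sample of 250 records needs three.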
Full Scan
In this example we can see the network calls made during a full scan on a database using the Custom Integration:
- The operation starts with a single /metadata call that returns the number of records to scan.
- The operation continues with a sequence of /data calls to get pages of decrypted data for classification.
- The operation ends with a 204 status code from the customer endpoint.
A sampling scan is similar, except that the stopping condition is not required.
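The full scan flow can be simulated locally to check that an endpoint terminates paging correctly. In this sketch, `make_fake_endpoint` and `full_scan` are hypothetical helpers, and returning 204 once the offset is past the last record is an assumption about how the stopping condition is signaled.

```python
def make_fake_endpoint(records):
    """Return a stand-in for the /data endpoint plus a log of the calls it received."""
    calls = []

    def fetch(offset, limit):
        calls.append((offset, limit))
        page = records[offset:offset + limit]
        if not page:
            return 204, None  # assumed stopping condition: no more records
        return 200, {"data": page}

    return fetch, calls


def full_scan(fetch, page_size):
    """Page through the endpoint until it answers 204, collecting every record."""
    collected = []
    offset = 0
    while True:
        status, body = fetch(offset, page_size)
        if status == 204:
            break
        if status != 200:
            raise RuntimeError(f"unexpected status {status}")
        collected.extend(body["data"])
        offset += page_size
    return collected
```

With 25 records and a page size of 10, the loop makes four calls: three that return data and a final one that receives the 204.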
Test Integration
In this example, we can see the network calls made during a test integration flow. MineOS triggers this flow during setup to verify that the integration is set up correctly:
- The operation starts with a call to /metadata.
- The operation continues with a single call to the /data endpoint with the first object returned from /metadata. Only 1 small page is requested.
Note: The customer can use the isTest parameter to return partial or mock data, to avoid doing a heavy query on the database.
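One way to honor isTest is to short-circuit before touching the database. The sketch below parses the query string with the standard library and returns mock rows for test requests; `MOCK_ROWS`, `is_test_request`, and `handle_data` are hypothetical names, and treating only the literal value "true" (case-insensitive) as a test is an assumption.

```python
from urllib.parse import parse_qs, urlsplit

# Hypothetical canned rows returned for test requests.
MOCK_ROWS = [{"id": "mock-1", "header1": "sample value"}]


def is_test_request(url):
    """Return True when the isTest query parameter is 'true' (default: false)."""
    query = parse_qs(urlsplit(url).query)
    return query.get("isTest", ["false"])[0].lower() == "true"


def handle_data(url, load_rows):
    """Serve mock rows for test requests; otherwise call the real (heavy) loader."""
    if is_test_request(url):
        return {"data": MOCK_ROWS}
    return {"data": load_rows()}
```

Passing the loader as a callable means the expensive query is never executed for a test request.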