Library to predict info types for DataHub

datahub-classify

Predict InfoTypes for DataHub.

Installation

python3 -m pip install --upgrade acryl-datahub-classify

API predict_infotypes

This API populates infotype proposals for each input column using the column's metadata, its values, and a confidence level threshold. The input and output contracts are described below.

API Input

The API expects the following parameters as input:

  • column_infos - A list of ColumnInfo objects. Each ColumnInfo object contains the metadata (col_name, description, datatype, etc.) and the values of a column.
  • confidence_level_threshold - If the confidence of an infotype prediction is greater than this threshold, the prediction is considered a proposal. A single threshold applies to all infotypes.
  • global_config - A dictionary containing the configuration of all supported infotypes. Refer to the Infotype Configuration section for more information.
  • infotypes - A list of the infotypes to be processed. This argument is optional; if specified, it overrides the default list of all supported infotypes. Use it when only a few infotypes are of interest. Infotype names are case sensitive and must match exactly.
  • minimum_values_threshold - The minimum number of column values required for processing. This argument is optional; the default is 50.
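
The interaction between confidence_level_threshold and infotypes can be sketched as follows. This is only an illustration of the rules stated above (strictly greater than the threshold, optional override list, case-sensitive names); select_proposals is a hypothetical helper, not part of the library, and the infotype names and scores are made up.

```python
from typing import Dict, Iterable, Optional


def select_proposals(
    confidences: Dict[str, float],
    confidence_level_threshold: float,
    infotypes: Optional[Iterable[str]] = None,
) -> Dict[str, float]:
    """Keep infotypes whose confidence is strictly greater than the threshold."""
    # An explicit infotypes list overrides the default of "all infotypes".
    # Membership is an exact string comparison, so names are case sensitive.
    wanted = set(confidences) if infotypes is None else set(infotypes)
    return {
        name: score
        for name, score in confidences.items()
        if name in wanted and score > confidence_level_threshold
    }


# Illustrative scores for one column (made-up numbers and names).
scores = {"Email_Address": 0.9, "Age": 0.7, "Gender": 0.2}

all_types = select_proposals(scores, 0.7)
only_age = select_proposals(scores, 0.5, infotypes=["Age", "Gender"])
wrong_case = select_proposals(scores, 0.5, infotypes=["age"])
print(all_types, only_age, wrong_case)
```

Note that a score equal to the threshold is not selected, and a name with the wrong case selects nothing.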

API Output

The API returns a list of ColumnInfo objects of the same length as the input list. Any infotype proposals are added to the ColumnInfo object itself, in a list attribute named infotype_proposals. The infotype_proposals list contains InfotypeProposal objects, each of which holds the following information:

  • infotype - A proposed infotype name.
  • confidence_level - Overall confidence of the infotype proposal.
  • debug_info - Confidence score of each prediction factor involved in the overall confidence score calculation. Refer to the Debug Information section for more information.

Convention: If the infotype_proposals list is non-empty, at least one infotype proposal has a confidence greater than confidence_level_threshold.
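
A minimal sketch of consuming this output. The two dataclasses below are hypothetical, simplified stand-ins that model only the fields named in this section; they are not the library's own ColumnInfo and InfotypeProposal classes. best_proposal is an illustrative helper, and the infotype names are made up.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class InfotypeProposal:  # hypothetical stand-in, not the library class
    infotype: str
    confidence_level: float
    debug_info: Dict[str, float] = field(default_factory=dict)


@dataclass
class ColumnInfo:  # hypothetical stand-in; metadata and values are omitted
    col_name: str
    infotype_proposals: List[InfotypeProposal] = field(default_factory=list)


def best_proposal(column: ColumnInfo) -> Optional[InfotypeProposal]:
    """Return the highest-confidence proposal, or None if the list is empty."""
    if not column.infotype_proposals:
        return None
    return max(column.infotype_proposals, key=lambda p: p.confidence_level)


columns = [
    ColumnInfo(
        "email",
        [InfotypeProposal("Email_Address", 0.9), InfotypeProposal("Full_Name", 0.7)],
    ),
    ColumnInfo("notes"),  # empty list: nothing exceeded the threshold
]

summary = {}
for column in columns:
    proposal = best_proposal(column)
    summary[column.col_name] = proposal.infotype if proposal else None
print(summary)
```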

Infotype Configuration

The infotype configuration is a dictionary whose root-level keys are the infotype names. Each infotype has the following configurable parameters (the value of each parameter is a dictionary):

  • Prediction_Factors_and_Weights - A dictionary that specifies the weight of each prediction factor used in the final confidence calculation. The prediction factors are:
    1. Name
    2. Description
    3. Datatype
    4. Values
  • ExcludeName - optional list of column names, matched exactly, to exclude from classification for this infotype
  • Name - list of regexes to be matched against the column name
  • Description - list of regexes to be matched against the column description
  • Datatype - list of datatypes to be matched against the column datatype
  • Values - a dictionary containing the following information
    1. prediction_type - how column values are evaluated (regex or library)
    2. regex - list of regexes to be matched against the column values
    3. library - name of the library to be used to evaluate the column values

Sample Infotype Configuration Dictionary

{
    '<Infotype1>': {
        'Prediction_Factors_and_Weights': {
            'Name': 0.4,
            'Description': 0,
            'Datatype': 0,
            'Values': 0.6
        },
        'Name': { 'regex': [<regex patterns>] },
        'Description': { 'regex': [<regex patterns>] },
        'Datatype': { 'type': [<list of datatypes>] },
        'Values': {
            'prediction_type': 'regex/library',
            'regex': [<regex patterns>],
            'library': [<library name>]
        }
    },
    '<Infotype2>': {
    ..
    ..
    ..
    }
}

Debug Information

Debug information is associated with each infotype proposal. It gives the confidence score of each prediction factor involved in the overall confidence score calculation. It is a dictionary with the following four prediction factors as keys:

  • Name
  • Description
  • Datatype
  • Values

Sample Debug Information Dictionary

{
    'Name': 0.4,
    'Description': 0.2,
    'Values': 0.6,
    'Datatype': 0.3
}
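
To illustrate how per-factor scores and Prediction_Factors_and_Weights could combine into an overall confidence, here is a sketch that assumes a plain weighted sum. The source only says that the weights are "used in the final confidence calculation", so the exact formula is an assumption, and overall_confidence is a hypothetical helper rather than a library function.

```python
from typing import Dict


def overall_confidence(weights: Dict[str, float], scores: Dict[str, float]) -> float:
    """Weighted sum of per-factor scores (assumed formula).

    Factors missing from `scores` contribute 0.
    """
    return sum(weight * scores.get(factor, 0.0) for factor, weight in weights.items())


# Weights from the sample configuration and scores from the sample debug info.
weights = {"Name": 0.4, "Description": 0, "Datatype": 0, "Values": 0.6}
debug_info = {"Name": 0.4, "Description": 0.2, "Values": 0.6, "Datatype": 0.3}

confidence = overall_confidence(weights, debug_info)
print(round(confidence, 2))  # 0.4*0.4 + 0.6*0.6 = 0.52
```

Under this assumption, factors with a weight of 0 (Description and Datatype here) still appear in the debug information but do not affect the overall confidence.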

Supported Infotypes

The following infotypes are supported out of the box:

  1. Age
  2. Gender
  3. Person Name / Full Name
  4. Email Address
  5. Phone Number
  6. Street Address
  7. Credit-Debit Card Number
  8. International Bank Account Number
  9. Vehicle Identification Number
  10. US Social Security Number
  11. Ipv4 Address
  12. Ipv6 Address
  13. Swift Code
  14. US Driving License Number

Regex-based custom infotypes are supported. Specify the custom infotype configuration in the format described in the Infotype Configuration section.
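
As a sketch of what a regex-based custom infotype entry might look like, the dictionary below follows the sample configuration layout. The infotype name Employee_ID, the regex patterns, and the datatype name "str" are all placeholders invented for illustration. The small compiled_patterns helper is also hypothetical; it only checks that every pattern is valid Python regex syntax, and it says nothing about how the library itself applies the patterns (for example full match versus search, or case sensitivity).

```python
import re

custom_config = {
    "Employee_ID": {  # hypothetical custom infotype name
        "Prediction_Factors_and_Weights": {
            "Name": 0.4,
            "Description": 0,
            "Datatype": 0,
            "Values": 0.6,
        },
        "Name": {"regex": [r".*emp.*id.*", r"^employee$"]},
        "Description": {"regex": [r".*employee.*"]},
        "Datatype": {"type": ["str"]},  # placeholder datatype name
        "Values": {
            "prediction_type": "regex",
            "regex": [r"EMP-\d{6}"],
            "library": [],
        },
    }
}


def compiled_patterns(config):
    """Compile every regex in the config; re.error is raised on a bad pattern."""
    compiled = {}
    for infotype, params in config.items():
        patterns = []
        for factor in ("Name", "Description", "Values"):
            patterns += params.get(factor, {}).get("regex", [])
        compiled[infotype] = [re.compile(p) for p in patterns]
    return compiled


patterns = compiled_patterns(custom_config)
value_regex = custom_config["Employee_ID"]["Values"]["regex"][0]
print(len(patterns["Employee_ID"]), bool(re.fullmatch(value_regex, "EMP-004217")))
```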

Assumptions

  • If the weight of the Values prediction factor is non-zero (indicating that values should be used for infotype inspection), at least 50 non-null column values should be present.
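
A caller can check this assumption before invoking the API. The helper below is a hypothetical sketch: has_enough_values is not a library function, and it treats only None as null, which may differ from how the library counts nulls (for example NaN or empty strings).

```python
from typing import Any, Sequence


def has_enough_values(values: Sequence[Any], minimum_values_threshold: int = 50) -> bool:
    """Return True if the column has at least the required number of non-null values."""
    non_null_count = sum(1 for value in values if value is not None)
    return non_null_count >= minimum_values_threshold


sparse = ["a@example.com"] * 30 + [None] * 40  # 30 non-null values
dense = ["a@example.com"] * 60 + [None] * 10  # 60 non-null values
print(has_enough_values(sparse), has_enough_values(dense))
```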

Development

Set up your Python environment

cd datahub-classify
../gradlew :datahub-classify:installDev # OR pip install -e ".[dev]"
source venv/bin/activate

Running tests

pytest tests/ --capture=no --log-cli-level=DEBUG

Sanity check code before committing

# Assumes: pip install -e ".[dev]" and venv is activated
black src/ tests/
isort src/ tests/
flake8 src/ tests/
mypy src/ tests/

Build and Test

../gradlew :datahub-classify:build

You can also run the lint and test steps via the Gradle build:

../gradlew :datahub-classify:lint
../gradlew :datahub-classify:lintFix
../gradlew :datahub-classify:testQuick
