Detect PII columns in your database and warehouse

🔍 Detect PII

Detect PII is a library, inspired by piicatcher and CommonRegex, that detects table columns likely to contain PII. It runs regex matches against column names and column values and flags the columns that match.
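
The idea of matching both column names and column values can be illustrated with the standard `re` module. This is a minimal sketch, not detectpii's implementation: the patterns and the names `NAME_PATTERNS`, `VALUE_PATTERNS`, and `flag_column` are hypothetical and chosen only for illustration.

```python
import re

# Hypothetical patterns for illustration; not the ones detectpii ships.
NAME_PATTERNS = {
    "email": re.compile(r"e[-_]?mail", re.IGNORECASE),
    "phone": re.compile(r"phone|mobile", re.IGNORECASE),
}
VALUE_PATTERNS = {
    "email": re.compile(r"^[\w.+-]+@[\w-]+(\.[\w-]+)+$"),
}


def flag_column(name, sample_values):
    """Return the sorted PII types suggested by a column's name or values."""
    # Name-based hits: the column name itself looks like a PII field.
    hits = {kind for kind, rx in NAME_PATTERNS.items() if rx.search(name)}
    # Value-based hits: any sampled string value looks like PII.
    for value in sample_values:
        for kind, rx in VALUE_PATTERNS.items():
            if isinstance(value, str) and rx.match(value):
                hits.add(kind)
    return sorted(hits)
```

A column such as `contact` would not be flagged by its name alone, but a sampled value like `alice@example.com` would flag it as a possible email column, which is why checking both names and values can catch more than either check alone.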

Usage

Installation

Install the package with the extra for your database or warehouse, e.g.:

pip install detectpii[postgres]

See all supported databases and data warehouses.

Scan tables for PII

from detectpii.catalog import PostgresCatalog
from detectpii.pipeline import PiiDetectionPipeline
from detectpii.scanner import DataScanner, MetadataScanner
from detectpii.util import print_columns

# -- Create a catalog to connect to a database / warehouse
pg_catalog = PostgresCatalog(
    host="localhost",
    user="postgres",
    password="my-secret-pw",
    database="postgres",
    port=5432,
    schema="public"
)

# -- Create a pipeline to detect PII in tables using an English dictionary
pipeline = PiiDetectionPipeline(
    catalog=pg_catalog,
    scanners=[
        MetadataScanner(),
        DataScanner(),
    ],
    times=1,
    percentage=20,
)

# -- Scan for PII columns.
pii_columns = pipeline.scan()

# -- Print them to the console
print_columns(pii_columns)
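
The example above hardcodes the connection settings. One possible alternative, sketched here under assumptions, is to read them from environment variables and pass them to `PostgresCatalog` as keyword arguments. The helper `postgres_settings_from_env` and the `PG_*` variable names are hypothetical; only the keyword names (`host`, `user`, `password`, `database`, `port`, `schema`) come from the example above.

```python
import os


def postgres_settings_from_env(env=os.environ):
    """Build PostgresCatalog keyword arguments from environment variables.

    The PG_* variable names are hypothetical; the defaults mirror the
    values used in the example above.
    """
    return {
        "host": env.get("PG_HOST", "localhost"),
        "user": env.get("PG_USER", "postgres"),
        "password": env["PG_PASSWORD"],  # no default: fail loudly if unset
        "database": env.get("PG_DATABASE", "postgres"),
        "port": int(env.get("PG_PORT", "5432")),
        "schema": env.get("PG_SCHEMA", "public"),
    }


# Usage, assuming detectpii is installed:
# pg_catalog = PostgresCatalog(**postgres_settings_from_env())
```

Leaving the password without a default means a missing variable raises a `KeyError` immediately instead of silently connecting with a placeholder credential.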

Persist the pipeline

import json
from detectpii.pipeline import pipeline_to_dict

# -- Create a pipeline
pipeline = ...

# -- Convert it into a dictionary
dictionary = pipeline_to_dict(pipeline)

# -- Print it
print(json.dumps(dictionary, indent=4))

# {
#     "catalog": {
#         "tables": [],
#         "resolver": {
#             "name": "PlaintextResolver",
#             "_type": "PlaintextResolver"
#         },
#         "user": "postgres",
#         "password": "my-secret-pw",
#         "host": "localhost",
#         "port": 5432,
#         "database": "postgres",
#         "schema": "public",
#         "_type": "PostgresCatalog"
#     },
#     "scanners": [
#         {
#             "_type": "MetadataScanner"
#         },
#         {
#             "_type": "DataScanner"
#         }
#     ],
#     "times": 1,
#     "percentage": 10
# }

Load the pipeline

from detectpii.pipeline import dict_to_pipeline

# -- Load the persisted pipeline as a dictionary
dictionary: dict = ...

# -- Convert it back to a pipeline object
pipeline = dict_to_pipeline(dictionary=dictionary)
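
Because the persisted form is a plain dictionary, it can be written to and read from a JSON file with the standard library. The sketch below shows one way to do that round trip. The helpers `save_pipeline_dict` and `load_pipeline_dict` are hypothetical, and a small stand-in dictionary is used in place of the real output of `pipeline_to_dict`.

```python
import json
import tempfile
from pathlib import Path


def save_pipeline_dict(dictionary, path):
    """Write a pipeline dictionary to a JSON file."""
    Path(path).write_text(json.dumps(dictionary, indent=4), encoding="utf-8")


def load_pipeline_dict(path):
    """Read a pipeline dictionary back from a JSON file."""
    return json.loads(Path(path).read_text(encoding="utf-8"))


# Stand-in for the output of pipeline_to_dict(pipeline).
example = {
    "scanners": [{"_type": "MetadataScanner"}, {"_type": "DataScanner"}],
    "times": 1,
    "percentage": 10,
}

with tempfile.TemporaryDirectory() as tmp:
    target = Path(tmp) / "pipeline.json"
    save_pipeline_dict(example, target)
    restored = load_pipeline_dict(target)

# `restored` can then be passed to dict_to_pipeline(dictionary=restored).
```

Note that the serialized dictionary shown earlier includes the database password in plain text, so a file written this way should be treated as a secret.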

For more detailed documentation, please see the docs folder.

Supported databases / warehouses

| Database / Warehouse | Package |
| --- | --- |
| Hive | detectpii[hive] |
| Postgres | detectpii[postgres] |
| Snowflake | detectpii[snowflake] |
| Trino | detectpii[trino] |
| Yugabyte | detectpii[yugabyte] |
| BigQuery | detectpii[bigquery] |

Available languages

The following languages are available for metadata detection:

| Language | Detector |
| --- | --- |
| English | EnglishColumnNameRegexDetector |
| Spanish | SpanishColumnNameRegexDetector |
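
To give a sense of what language-specific column-name detection could look like, here is a minimal sketch with one regex per language. The patterns, `COLUMN_NAME_PATTERNS`, and `matching_languages` are hypothetical and are not taken from `EnglishColumnNameRegexDetector` or `SpanishColumnNameRegexDetector`.

```python
import re

# Hypothetical per-language column-name patterns, for illustration only.
COLUMN_NAME_PATTERNS = {
    "english": re.compile(r"e[-_]?mail|phone|address|birth", re.IGNORECASE),
    "spanish": re.compile(
        r"correo|tel[eé]fono|direcci[oó]n|nacimiento", re.IGNORECASE
    ),
}


def matching_languages(column_name):
    """Return the languages whose pattern matches the column name."""
    return [
        language
        for language, pattern in COLUMN_NAME_PATTERNS.items()
        if pattern.search(column_name)
    ]
```

Under these assumed patterns, a column named `fecha_nacimiento` matches only the Spanish pattern, which illustrates why a schema with non-English column names may need a detector for its own language.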
