Detect PII columns in your database and warehouse

🔍 Detect PII

Detect PII is a library, inspired by piicatcher and CommonRegex, that detects table columns likely to contain PII. It runs regex matches against column names and column values and flags the columns that match.
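
To make the idea concrete, here is a minimal, standalone sketch of regex-based flagging on column names and sampled values. It does not use detectpii itself: the patterns, the function names (`match_column_name`, `match_column_values`, `flag_columns`), and the sample table are all hypothetical and only illustrate the general technique, not the library's actual rules.

```python
import re

# Hypothetical patterns for illustration only; not the library's actual rules.
NAME_PATTERNS = {
    "email": re.compile(r"e?mail", re.IGNORECASE),
    "phone": re.compile(r"phone|mobile", re.IGNORECASE),
}
VALUE_PATTERNS = {
    "email": re.compile(r"^[\w.+-]+@[\w-]+(\.[\w-]+)+$"),
    "phone": re.compile(r"^\+?\d[\d\s()-]{6,}\d$"),
}


def match_column_name(name):
    """Return the PII types whose name pattern matches the column name."""
    return sorted(t for t, p in NAME_PATTERNS.items() if p.search(name))


def match_column_values(values):
    """Return the PII types whose value pattern matches at least one non-null value."""
    present = [str(v) for v in values if v is not None]
    return sorted(
        t for t, p in VALUE_PATTERNS.items() if any(p.match(v) for v in present)
    )


def flag_columns(table):
    """table maps a column name to a list of sampled values."""
    flagged = {}
    for column, values in table.items():
        types = set(match_column_name(column)) | set(match_column_values(values))
        if types:
            flagged[column] = sorted(types)
    return flagged


# Hypothetical sample: one column flagged by its values, one by its name.
sample = {
    "id": [1, 2, 3],
    "contact": ["alice@example.com", "bob@example.org"],
    "phone_number": [None, None],
    "notes": ["called on Monday", "no answer"],
}
flagged = flag_columns(sample)
print(flagged)
```

In this sketch, `contact` is flagged because of its values and `phone_number` because of its name, which is why checking both names and values can catch columns that either check alone would miss.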

Usage

Installation

Packages can be installed by specifying extras, e.g.:

pip install detectpii[postgres]

See all supported databases and data warehouses.

Scan tables for PII

from detectpii.catalog import PostgresCatalog
from detectpii.pipeline import PiiDetectionPipeline
from detectpii.scanner import DataScanner, MetadataScanner
from detectpii.util import print_columns

# -- Create a catalog to connect to a database / warehouse
pg_catalog = PostgresCatalog(
    host="localhost",
    user="postgres",
    password="my-secret-pw",
    database="postgres",
    port=5432,
    schema="public"
)

# -- Create a pipeline to detect PII in the tables
pipeline = PiiDetectionPipeline(
    catalog=pg_catalog,
    scanners=[
        MetadataScanner(),
        DataScanner(),
    ],
    times=1,
    percentage=20,
)

# -- Scan for PII columns.
pii_columns = pipeline.scan()

# -- Print them to the console
print_columns(pii_columns)

Persist the pipeline

import json
from detectpii.pipeline import pipeline_to_dict

# -- Create a pipeline
pipeline = ...

# -- Convert it into a dictionary
dictionary = pipeline_to_dict(pipeline)

# -- Print it
print(json.dumps(dictionary, indent=4))

# {
#     "catalog": {
#         "tables": [],
#         "resolver": {
#             "name": "PlaintextResolver",
#             "_type": "PlaintextResolver"
#         },
#         "user": "postgres",
#         "password": "my-secret-pw",
#         "host": "localhost",
#         "port": 5432,
#         "database": "postgres",
#         "schema": "public",
#         "_type": "PostgresCatalog"
#     },
#     "scanners": [
#         {
#             "_type": "MetadataScanner"
#         },
#         {
#             "_type": "DataScanner"
#         }
#     ],
#     "times": 1,
#     "percentage": 10
# }

Load the pipeline

from detectpii.pipeline import dict_to_pipeline

# -- Load the persisted pipeline as a dictionary
dictionary: dict = ...

# -- Convert it back to a pipeline object
pipeline = dict_to_pipeline(dictionary=dictionary)
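
The persist and load steps can be joined by writing the dictionary to a file. The following is a minimal sketch using only the standard library. The `dictionary` here is a hand-written stand-in shaped like the example output above, and the helper names (`save_pipeline_dict`, `load_pipeline_dict`) and the file name are hypothetical. Note that the example output includes the catalog password in plaintext, so a file like this may need to be stored with care.

```python
import json
import os
import tempfile

# Stand-in for the result of pipeline_to_dict(pipeline); keys mirror the
# example output shown earlier. Assumption: the real dictionary is JSON-serializable.
dictionary = {
    "catalog": {"_type": "PostgresCatalog", "host": "localhost", "port": 5432},
    "scanners": [{"_type": "MetadataScanner"}, {"_type": "DataScanner"}],
    "times": 1,
    "percentage": 10,
}


def save_pipeline_dict(data, path):
    """Write the pipeline dictionary to a JSON file (hypothetical helper)."""
    with open(path, "w", encoding="utf-8") as f:
        json.dump(data, f, indent=4)


def load_pipeline_dict(path):
    """Read the pipeline dictionary back from a JSON file (hypothetical helper)."""
    with open(path, encoding="utf-8") as f:
        return json.load(f)


# Hypothetical location; any writable path works.
path = os.path.join(tempfile.mkdtemp(), "pipeline.json")
save_pipeline_dict(dictionary, path)
restored = load_pipeline_dict(path)

# `restored` could then be passed to dict_to_pipeline(dictionary=restored).
```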

For more detailed documentation, please see the docs folder.

Supported databases / warehouses

  • Hive in detectpii[hive]
  • Postgres in detectpii[postgres]
  • Snowflake in detectpii[snowflake]
  • Trino in detectpii[trino]
  • Yugabyte in detectpii[yugabyte]

Download files

Source Distribution

detectpii-0.1.6.tar.gz (16.0 kB view details)

Uploaded Source

Built Distribution

detectpii-0.1.6-py3-none-any.whl (22.7 kB view details)

Uploaded Python 3

File details

Details for the file detectpii-0.1.6.tar.gz.

File metadata

  • Download URL: detectpii-0.1.6.tar.gz
  • Size: 16.0 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: poetry/1.8.3 CPython/3.11.9 Darwin/23.4.0

File hashes

Hashes for detectpii-0.1.6.tar.gz
  • SHA256: 0d045ef3b3bff6ff453f0ff4100aebf1d800cb58b64002bb4423fdf4d45fc7ae
  • MD5: d898f0d89f7af7c7ca5cddf229f10220
  • BLAKE2b-256: 66bd23d390686f179dfdc5bd18cd6c794a4d09368bd71dca247ef47e4ba8ed40

File details

Details for the file detectpii-0.1.6-py3-none-any.whl.

File metadata

  • Download URL: detectpii-0.1.6-py3-none-any.whl
  • Size: 22.7 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: poetry/1.8.3 CPython/3.11.9 Darwin/23.4.0

File hashes

Hashes for detectpii-0.1.6-py3-none-any.whl
  • SHA256: 3876e4997ce79b8317194d1dcf348ac3335f7b98d74f12b27f713906d8a8a2f6
  • MD5: 226abb4bbe0ce752e4e3ad3e7dc28ba2
  • BLAKE2b-256: 84a229aa32f84ef24c08d885009ef59a6ad8a693a52af1e80a7873e31a7dcf6f
