Detect PII columns in your database and warehouse
🔍 Detect PII
Detect PII is a library, inspired by piicatcher and CommonRegex, that detects table columns which may contain PII. It runs regex matches against column names and column values and flags the columns that match.
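To make the idea concrete, here is a minimal sketch of regex matching on column names and on sampled column values. It is not Detect PII's implementation: the patterns, the `match_column_name` and `match_column_values` helpers, and the sample table are all hypothetical and exist only for illustration.

```python
import re

# Hypothetical patterns for illustration only; not detectpii's actual rule set.
NAME_PATTERNS = {
    "email": re.compile(r"e[-_]?mail", re.IGNORECASE),
    "phone": re.compile(r"phone|mobile", re.IGNORECASE),
}
VALUE_PATTERNS = {
    "email": re.compile(r"^[\w.+-]+@[\w-]+\.[\w.-]+$"),
    "phone": re.compile(r"^\+?\d[\d\s().-]{6,}\d$"),
}


def match_column_name(column_name):
    """Return the PII types whose name pattern matches the column name."""
    return sorted(t for t, p in NAME_PATTERNS.items() if p.search(column_name))


def match_column_values(values):
    """Return the PII types whose value pattern matches at least one value."""
    found = set()
    for value in values:
        for pii_type, pattern in VALUE_PATTERNS.items():
            if pattern.match(str(value)):
                found.add(pii_type)
    return sorted(found)


# Made-up sample: column name -> a few sampled values.
sample_table = {
    "contact_email": ["alice@example.com", "bob@example.org"],
    "notes": ["call +1 555 010 9999 after noon", "n/a"],
    "ref": ["+1 (555) 010-9999", "unknown"],
}

flagged = {}
for column, values in sample_table.items():
    hits = sorted(set(match_column_name(column)) | set(match_column_values(values)))
    if hits:
        flagged[column] = hits

print(flagged)
```

In this sketch a column is flagged when either its name or any sampled value matches. The value patterns are anchored to the whole value, so the free-text `notes` column is not flagged even though one note contains a phone number.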
Usage
Installation
Install the package with the extra for your database or warehouse, e.g.:
pip install detectpii[postgres]
See all supported databases and data warehouses.
Scan tables for PII
from detectpii.catalog import PostgresCatalog
from detectpii.pipeline import PiiDetectionPipeline
from detectpii.scanner import DataScanner, MetadataScanner
from detectpii.util import print_columns

# -- Create a catalog to connect to a database / warehouse
pg_catalog = PostgresCatalog(
    host="localhost",
    user="postgres",
    password="my-secret-pw",
    database="postgres",
    port=5432,
    schema="public",
)

# -- Create a pipeline to detect PII in the tables
pipeline = PiiDetectionPipeline(
    catalog=pg_catalog,
    scanners=[
        MetadataScanner(),
        DataScanner(),
    ],
    times=1,
    percentage=20,
)

# -- Scan for PII columns.
pii_columns = pipeline.scan()

# -- Print them to the console
print_columns(pii_columns)
Persist the pipeline
import json
from detectpii.pipeline import pipeline_to_dict
# -- Create a pipeline
pipeline = ...
# -- Convert it into a dictionary
dictionary = pipeline_to_dict(pipeline)
# -- Print it
print(json.dumps(dictionary, indent=4))
# {
#     "catalog": {
#         "tables": [],
#         "resolver": {
#             "name": "PlaintextResolver",
#             "_type": "PlaintextResolver"
#         },
#         "user": "postgres",
#         "password": "my-secret-pw",
#         "host": "localhost",
#         "port": 5432,
#         "database": "postgres",
#         "schema": "public",
#         "_type": "PostgresCatalog"
#     },
#     "scanners": [
#         {
#             "_type": "MetadataScanner"
#         },
#         {
#             "_type": "DataScanner"
#         }
#     ],
#     "times": 1,
#     "percentage": 10
# }
Load the pipeline
from detectpii.pipeline import dict_to_pipeline
# -- Load the persisted pipeline as a dictionary
dictionary: dict = ...
# -- Convert it back to a pipeline object
pipeline = dict_to_pipeline(dictionary=dictionary)
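The persist and load steps can be joined through a JSON file. The sketch below assumes the dictionary is plain JSON-serializable data, as the printed output above suggests. The `save_pipeline_dict` and `load_pipeline_dict` helpers and the `pipeline.json` file name are hypothetical, and a trimmed stand-in dictionary is used in place of a real `pipeline_to_dict(pipeline)` result so the snippet runs without a database.

```python
import json
import tempfile
from pathlib import Path


def save_pipeline_dict(dictionary, path):
    """Write a pipeline dictionary to a JSON file."""
    Path(path).write_text(json.dumps(dictionary, indent=4), encoding="utf-8")


def load_pipeline_dict(path):
    """Read a pipeline dictionary back from a JSON file."""
    return json.loads(Path(path).read_text(encoding="utf-8"))


# Stand-in for pipeline_to_dict(pipeline); trimmed from the printed example.
dictionary = {
    "catalog": {"_type": "PostgresCatalog", "host": "localhost", "port": 5432},
    "scanners": [{"_type": "MetadataScanner"}, {"_type": "DataScanner"}],
    "times": 1,
    "percentage": 10,
}

with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "pipeline.json"  # hypothetical file name
    save_pipeline_dict(dictionary, path)
    restored = load_pipeline_dict(path)

# `restored` could then be passed to dict_to_pipeline(dictionary=restored).
print(restored == dictionary)
```

Note that the serialized dictionary shown earlier contains the database password in plain text, so a file written this way should be handled like any other secret.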
For more detailed documentation, please see the docs folder.
Supported databases / warehouses
- Hive in detectpii[hive]
- Postgres in detectpii[postgres]
- Snowflake in detectpii[snowflake]
- Trino in detectpii[trino]
- Yugabyte in detectpii[yugabyte]
Download files
File details
Details for the file detectpii-0.1.6.tar.gz.
File metadata
- Download URL: detectpii-0.1.6.tar.gz
- Upload date:
- Size: 16.0 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: poetry/1.8.3 CPython/3.11.9 Darwin/23.4.0
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 0d045ef3b3bff6ff453f0ff4100aebf1d800cb58b64002bb4423fdf4d45fc7ae |
| MD5 | d898f0d89f7af7c7ca5cddf229f10220 |
| BLAKE2b-256 | 66bd23d390686f179dfdc5bd18cd6c794a4d09368bd71dca247ef47e4ba8ed40 |
File details
Details for the file detectpii-0.1.6-py3-none-any.whl.
File metadata
- Download URL: detectpii-0.1.6-py3-none-any.whl
- Upload date:
- Size: 22.7 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: poetry/1.8.3 CPython/3.11.9 Darwin/23.4.0
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 3876e4997ce79b8317194d1dcf348ac3335f7b98d74f12b27f713906d8a8a2f6 |
| MD5 | 226abb4bbe0ce752e4e3ad3e7dc28ba2 |
| BLAKE2b-256 | 84a229aa32f84ef24c08d885009ef59a6ad8a693a52af1e80a7873e31a7dcf6f |