Find PII data in databases
Pii Catcher for Files and Databases
Overview
PiiCatcher finds PII (personally identifiable information) in your databases. It scans every column in a database and detects the following types of PII:
- PHONE
- CREDIT_CARD
- ADDRESS
- PERSON
- LOCATION
- BIRTH_DATE
- GENDER
- NATIONALITY
- IP_ADDRESS
- SSN
- USER_NAME
- PASSWORD
PiiCatcher uses two types of scanners to detect PII:
- CommonRegex uses a set of regular expressions for common types of information.
- spaCy Named Entity Recognition uses Natural Language Processing to detect named entities. Only English is currently supported.
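The regular-expression approach can be illustrated with a minimal sketch. The patterns and the `scan_value` helper below are hypothetical and written only for this example; they are not the expressions shipped with CommonRegex or the code PiiCatcher runs.

```python
import re

# Hypothetical patterns for illustration only; these are NOT the expressions
# used by CommonRegex or PiiCatcher.
PATTERNS = {
    "IP_ADDRESS": re.compile(r"\b(?:\d{1,3}\.){3}\d{1,3}\b"),
    "SSN": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "PHONE": re.compile(r"\b\d{3}[-.\s]\d{3}[-.\s]\d{4}\b"),
}


def scan_value(text):
    """Return the sorted PII type names whose pattern matches the text."""
    return sorted(
        name for name, pattern in PATTERNS.items() if pattern.search(text)
    )
```

A scanner of this kind is cheap to run, but it only recognizes values with a fixed shape. Free-form values such as person names or locations are the kind of data the named entity recognition scanner is meant to cover.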
Supported Technologies
PiiCatcher supports the following filesystems:
- POSIX
- AWS S3 (for files that are part of tables in AWS Glue and AWS Athena)
- Google Cloud Storage (Coming Soon)
- ADLS (Coming Soon)
PiiCatcher supports the following databases:
- SQLite 3.24.0 or greater
- MySQL 5.6 or greater
- PostgreSQL 9.4 or greater
- AWS Redshift
- SQL Server
- Oracle
- AWS Glue/AWS Athena
Installation
PiiCatcher is available as a command-line application.
To install, use pip in a virtual environment:
python3 -m venv .env
source .env/bin/activate
pip install piicatcher
Or clone the repo and install from source:
git clone https://github.com/vrajat/piicatcher.git
cd piicatcher
python3 -m venv .env
source .env/bin/activate
python setup.py install
Install spaCy Language Model
python -m spacy download en_core_web_sm
Install Oracle Client
PiiCatcher requires a working Oracle client to scan Oracle databases. Refer to the cx_Oracle documentation for more information.
Usage
Relational Databases:
# Print usage to scan databases
piicatcher db -h
usage: piicatcher db [-h] -s HOST [-R PORT] [-u USER] [-p PASSWORD]
                     [-t {sqlite,mysql,postgres}] [-c {deep,shallow}]
                     [-o OUTPUT] [-f {ascii_table,json,orm}]
optional arguments:
  -h, --help            show this help message and exit
  -s HOST, --host HOST  Hostname of the database. File path if it is SQLite
  -R PORT, --port PORT  Port of database.
  -u USER, --user USER  Username to connect database
  -p PASSWORD, --password PASSWORD
                        Password of the user
  -t {sqlite,mysql,postgres}, --connection-type {sqlite,mysql,postgres}
                        Type of database
  -c {deep,shallow}, --scan-type {deep,shallow}
                        Choose deep(scan data) or shallow(scan column names
                        only)
  -o OUTPUT, --output OUTPUT
                        File path for report. If not specified, then report is
                        printed to sys.stdout
  -f {ascii_table,json,orm}, --output-format {ascii_table,json,orm}
                        Choose output format type
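The difference between the two `--scan-type` choices can be sketched against an in-memory SQLite database. This is only an illustration under stated assumptions: `NAME_RULE`, `DATA_RULE`, `list_columns`, `shallow_scan` and `deep_scan` are hypothetical names, and the rules are far simpler than whatever PiiCatcher applies internally.

```python
import re
import sqlite3

# Illustrative rules only; PiiCatcher's real column-name and data rules differ.
NAME_RULE = re.compile(r"ssn|phone|password|birth", re.IGNORECASE)
DATA_RULE = re.compile(r"\b\d{3}-\d{2}-\d{4}\b")  # SSN-like values


def list_columns(conn):
    """Return (table, column) pairs for every table in a SQLite database."""
    tables = conn.execute(
        "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name"
    ).fetchall()
    columns = []
    for (table,) in tables:
        for row in conn.execute(f'PRAGMA table_info("{table}")').fetchall():
            columns.append((table, row[1]))  # row[1] is the column name
    return columns


def shallow_scan(conn):
    """Flag columns by name only; no rows are read."""
    return {(t, c): bool(NAME_RULE.search(c)) for t, c in list_columns(conn)}


def deep_scan(conn, limit=100):
    """Flag columns by inspecting up to `limit` stored values per column."""
    result = {}
    for table, column in list_columns(conn):
        rows = conn.execute(
            f'SELECT "{column}" FROM "{table}" LIMIT ?', (limit,)
        ).fetchall()
        result[(table, column)] = any(
            isinstance(value, str) and DATA_RULE.search(value)
            for (value,) in rows
        )
    return result
```

In this sketch, a column named `ssn` that holds the text `redacted` is flagged only by the shallow scan, while a column named `notes` that holds an SSN-like string is flagged only by the deep scan. The trade-off is that a shallow scan never reads rows, whereas a deep scan has to query the data.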
AWS S3 files backed by tables in AWS Glue & AWS Athena
piicatcher aws -h
usage: piicatcher aws [-h] -a ACCESS_KEY -s SECRET_KEY -d STAGING_DIR -r
                      REGION
                      [-t {sqlite,mysql,postgres,redshift,oracle,sqlserver}]
                      [-c {deep,shallow}] [-o OUTPUT]
                      [-f {ascii_table,json,orm}] [--list-all]
optional arguments:
  -h, --help            show this help message and exit
  -a ACCESS_KEY, --access-key ACCESS_KEY
                        AWS Access Key
  -s SECRET_KEY, --secret-key SECRET_KEY
                        AWS Secret Key
  -d STAGING_DIR, --staging-dir STAGING_DIR
                        S3 Staging Directory for Athena results
  -r REGION, --region REGION
                        AWS Region
  -c {deep,shallow}, --scan-type {deep,shallow}
                        Choose deep(scan data) or shallow(scan column names
                        only)
  -o OUTPUT, --output OUTPUT
                        File path for report. If not specified, then report is
                        printed to sys.stdout
  -f {ascii_table,json,orm}, --output-format {ascii_table,json,orm}
                        Choose output format type
  --list-all            List all columns. By default only columns with PII
                        information is listed
Files in a POSIX Filesystem
# Print usage to scan files
piicatcher files -h
usage: piicatcher files [-h] [--path PATH] [--output OUTPUT]
                        [--output-format {ascii_table,json,orm}]
optional arguments:
  -h, --help            show this help message and exit
  --path PATH           Path to file or directory
  --output OUTPUT       File path for report. If not specified, then report is
                        printed to sys.stdout
  --output-format {ascii_table,json,orm}
                        Choose output format type
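As a rough picture of what scanning a `--path` involves, the sketch below walks a file or directory and reports, per file, whether any text matches a single rule. The `scan_path` helper and the `IP_ADDRESS` pattern are assumptions made for this example; PiiCatcher's own file scanner may read and classify files differently.

```python
import os
import re

# Hypothetical single rule for illustration; not PiiCatcher's detection logic.
IP_ADDRESS = re.compile(r"\b(?:\d{1,3}\.){3}\d{1,3}\b")


def scan_path(path):
    """Return {file_path: has_pii} for a file or every file under a directory."""
    if os.path.isfile(path):
        files = [path]
    else:
        files = [
            os.path.join(root, name)
            for root, _dirs, names in os.walk(path)
            for name in sorted(names)
        ]
    report = {}
    for file_path in files:
        # Undecodable bytes are skipped so binary files do not raise errors.
        with open(file_path, "r", encoding="utf-8", errors="ignore") as handle:
            report[file_path] = bool(IP_ADDRESS.search(handle.read()))
    return report
```

The function accepts either a single file or a directory, mirroring the "Path to file or directory" wording of the help text.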
Example
# Run piicatcher on a SQLite database and print the report to the console
piicatcher db -s '/db/sqlqb'
╭─────────────┬─────────────┬─────────────┬─────────────╮
│   schema    │    table    │   column    │   has_pii   │
├─────────────┼─────────────┼─────────────┼─────────────┤
│    main     │  full_pii   │      a      │      1      │
│    main     │  full_pii   │      b      │      1      │
│    main     │   no_pii    │      a      │      0      │
│    main     │   no_pii    │      b      │      0      │
│    main     │ partial_pii │      a      │      1      │
│    main     │ partial_pii │      b      │      0      │
╰─────────────┴─────────────┴─────────────┴─────────────╯
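To try a scan locally, a SQLite file with the same table layout as the report above (`full_pii`, `no_pii` and `partial_pii`, each with columns `a` and `b`) can be created as follows. The `build_sample_db` helper and the row values are made up for this sketch; they are not the project's test data, and whether PiiCatcher flags these exact values is not guaranteed.

```python
import sqlite3


def build_sample_db(path):
    """Create full_pii, no_pii and partial_pii tables, each with columns a and b."""
    conn = sqlite3.connect(path)
    with conn:  # commits on success
        for table in ("full_pii", "no_pii", "partial_pii"):
            conn.execute(f"CREATE TABLE {table} (a TEXT, b TEXT)")
        # Made-up rows: PII-like text in both columns, neither, or only column a.
        conn.execute("INSERT INTO full_pii VALUES ('Jane Doe', '123-45-6789')")
        conn.execute("INSERT INTO no_pii VALUES ('abc', 'def')")
        conn.execute("INSERT INTO partial_pii VALUES ('Jane Doe', 'abc')")
    conn.close()
```

The resulting file path is what the `-s` option of `piicatcher db` expects for SQLite, according to the help text.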