PII desensitization + AES-256-GCM encryption + compliance reporting for cross-border data transfers (PIPL / PDPA / GDPR)
Project description
CloakPII
Moving customer data across borders without leaking PII — or failing a PIPL / PDPA audit — is hard. CloakPII does it in one command.
If your data leaves China or Singapore, you're on the hook for PIPL and PDPA: personal data has to be desensitized, the transfer encrypted, and the paperwork filed. Rolling that yourself means stitching together PII detection, a masking scheme, AES encryption, and a compliance checklist — and getting every one of them right.
CloakPII is a single command-line tool that takes a folder of files, masks (or reversibly tokenizes) the PII inside, encrypts every file with AES-256-GCM, and generates the PIPL / PDPA compliance report — so cross-border data is safe to move and you have the audit trail to prove it.
pip install cloakpii
export CLOAKPII_PASSWORD=... # keep secrets out of ps / shell history
cloakpii migrate --source ./data --output ./safe \
--compliance-profile pipl --compliance-report
# before # after (masked + encrypted)
name,email,phone name,email,phone
Wei,wei@corp.cn,138-1234-5678 W***,w***@c******.cn,138-****-**78
Masked plaintext lands in ./safe/desensitized/, AES-256-GCM copies in ./safe/encrypted/, and a PIPL report (JSON + Markdown) next to them. Restore the whole tree any time with cloakpii decrypt-all.
Why CloakPII
Most teams reach for a general PII library, a separate encryption step, and a hand-written compliance doc. CloakPII collapses all three into one auditable pipeline, with a deliberate focus on China ⇄ Singapore cross-border transfers.
| CloakPII | Microsoft Presidio | Roll-your-own | gpg / openssl | |
|---|---|---|---|---|
| PII detection (regex + field names) | ✅ | ✅ | ⚠️ DIY | ❌ |
| Chinese IDs and Chinese column names | ✅ | ⚠️ partial | ⚠️ DIY | ❌ |
| Irreversible masking | ✅ | ✅ | ⚠️ DIY | ❌ |
| Reversible, join-preserving tokenization | ✅ | ❌ | ⚠️ DIY | ❌ |
| Built-in AES-256-GCM encryption | ✅ | ❌ | ⚠️ DIY | ✅ |
| PIPL / PDPA / GDPR compliance reports | ✅ | ❌ | ❌ | ❌ |
| Multi-format (CSV/JSON/Excel/Parquet/XML/TSV/SQLite/text) | ✅ | ⚠️ text-first | ⚠️ DIY | ⚠️ opaque blobs |
| One command, folder in → safe folder out | ✅ | ❌ | ❌ | ❌ |
Presidio is a great detection engine, but it stops at detection — you still bolt on encryption and compliance yourself. gpg encrypts but never sees the PII. CloakPII is the opinionated, end-to-end path for the cross-border-compliance use case.
Two modes: mask (irreversible) or tokenize (reversible)
# Reversible, join-preserving pseudonyms — masked data stays usable
cloakpii migrate --source ./data --output ./safe --mode tokenize
In --mode tokenize, every PII value is replaced by a stable token, and the same input always maps to the same token (even across separate runs with the same password):
email,city email,city
wei@corp.cn,SH ──tokenize──▶ tkz_p6dk3s7…,SH ┐ same value →
wei@corp.cn,BJ tkz_p6dk3s7…,BJ ┘ same token (joins work)
li@corp.cn,SH tkz_cx5kz36…,SH
So you can still join, GROUP BY, and de-duplicate the protected data — and recover the originals with the password:
cloakpii decrypt-all --input ./safe/encrypted --output ./restored --password "$CLOAKPII_PASSWORD"
cloakpii detokenize --input ./restored --output ./original
Use mask (the default) when the data should never be recoverable; use tokenize when downstream systems still need referential integrity.
What this is — and what it isn't
Use it to turn a directory of files containing PII into a desensitized, encrypted copy that is safe to move across borders (the design focus is China ⇄ Singapore, i.e. PIPL + PDPA), together with the paperwork those regimes expect.
Three things to understand before you rely on it:
- Masking (the default mode) is irreversible. Masked values (
alice@x.com→a***@x******.com) cannot be recovered — even after you decrypt. If you need the data to stay usable (joins, dedup) and recoverable, use--mode tokenize. - Compliance output is documentation, not legal sign-off. The
profiles,assessment, and--compliance-reportfeatures generate checklists and declaration templates to help you prepare a filing. They are not legal advice — have counsel review actual cross-border filings. - Detection is not exhaustive. The built-in detector is regex + column-name keywords. By design it does not catch: phone numbers written as bare digit runs with no separators (e.g.
13812345678— masked only if the column name signals PII), old 15-digit Chinese IDs, IPv6 addresses, free-text personal names, or PII glued directly to surrounding letters. Enable the optional ML backend (see ML_SETUP.md) for names and broader coverage, and spot-check the output on a sample before trusting a new dataset.
Features
- Two modes: irreversible masking or reversible, join-preserving tokenization
- 8 file formats: CSV, JSON, Excel, Parquet, XML, TSV, SQLite, plain text
- 11 PII types: email, phone, SSN, credit card, IP, Chinese ID, passport, bank account, IBAN, MAC address, date of birth
- 5 compliance profiles: GDPR (EU), PDPA (Singapore), CCPA (California), LGPD (Brazil), PIPL (China)
- AES-256-GCM encryption with PBKDF2 key derivation (480k iterations)
- Parallel processing, progress bar, resume interrupted runs
- Integrity verification via SHA-256 manifests, audit trail (JSON Lines)
- YAML configuration with CLI overrides, gzip compression, Docker support
Quick Start
pip install cloakpii
Or from source:
git clone https://github.com/Hellotravisss/cloakpii.git
cd cloakpii
pip install -e .
export CLOAKPII_PASSWORD=mypassword # used by every command
# Migrate a directory (desensitize + encrypt)
cloakpii migrate --source data/ --output output/
# Preview what would happen (dry run)
cloakpii migrate --source data/ --dry-run
# Encrypt / decrypt a single file
cloakpii encrypt input.csv output.csv.enc
cloakpii decrypt output.csv.enc decrypted.csv
# Restore an entire migration output tree
cloakpii decrypt-all --input output/encrypted --output restored/
Tip: prefer
CLOAKPII_PASSWORD(or--key-file) over--password— a password on the command line is visible inpsand your shell history.
CLI Reference
| Command | Description |
|---|---|
migrate |
Run the full migration pipeline |
encrypt |
Encrypt a single file |
decrypt |
Decrypt a single file |
decrypt-all |
Decrypt a whole migration output tree |
detokenize |
Reverse --mode tokenize back to originals |
scan |
Scan a directory for PII without migrating |
assessment |
Generate a PIPL Security Assessment template |
init |
Initialize project configuration |
verify |
Verify file integrity against a manifest |
status |
Show status of a previous migration |
profiles |
List available compliance profiles |
migrate options
--source DIR Source directory (default: examples)
--output DIR Output directory (default: output)
--mode MODE mask (irreversible, default) | tokenize (reversible)
--target NAME Target jurisdiction (default: singapore)
--password PW Encryption password (prefer CLOAKPII_PASSWORD / --key-file)
--key-file FILE Read the password from a file
--config FILE Path to YAML config file
--dry-run Preview without modifying files
--workers N Number of parallel workers (default: 1)
--batch-size N Max files to process (0 = all)
--no-progress Disable progress bar
--compliance-profile P Validate against profile (gdpr/pdpa/ccpa/lgpd/pipl)
--compliance-report Generate a detailed compliance report (JSON + Markdown)
--compress Compress encrypted output with gzip
--resume Skip already-processed files
--no-manifest Skip SHA-256 manifest generation
--audit FILE Path for audit log (JSON Lines)
--skip-patterns PAT... Glob patterns for files to skip
--verbose Enable debug logging
--log-file FILE Write logs to file
Examples
# Parallel processing with 4 workers
cloakpii migrate --source data/ --output out/ --workers 4
# PIPL compliance check + report
cloakpii migrate --source data/ --compliance-profile pipl --compliance-report
# Resume an interrupted migration
cloakpii migrate --source data/ --output out/ --resume
# With audit log and compression
cloakpii migrate --source data/ --audit out/audit.jsonl --compress
# Scan only — find PII without migrating
cloakpii scan --source data/ --output scan_report.json
Configuration File
Create a migration.yaml for reusable settings (CLI arguments override it):
source: /path/to/data
output: /path/to/output
target: singapore
compliance_profile: pdpa
workers: 4
show_progress: true
audit_log: true
generate_manifest: true
compress_output: false
skip_patterns:
- "*.tmp"
- "test_*"
custom_pii_patterns: []
cloakpii migrate --config migration.yaml
For security, CloakPII will not write
password/key_fileto a config file — provide the password viaCLOAKPII_PASSWORD,--key-file, or the interactive prompt.
Supported File Formats
| Format | Extension | Notes |
|---|---|---|
| CSV | .csv |
Comma-separated values |
| JSON | .json |
Nested structures, numbers included |
| Excel | .xlsx |
All sheets |
| Parquet | .parquet |
Apache Parquet columnar format |
| XML | .xml |
Text, tails, and attributes |
| TSV | .tsv |
Tab-separated values |
| SQLite | .db, .sqlite |
All tables (incl. WITHOUT ROWID) |
| Text | .txt, .log, .md |
Plain text |
Supported PII Types
| PII Type | Example | Masked Output |
|---|---|---|
user@example.com |
u***@e******.com |
|
| Phone | 555-123-4567 |
555-***-**67 |
| SSN | 123-45-6789 |
***-**-6789 |
| Credit Card | 4111111111111111 |
4111****1111 |
| IP Address | 192.168.1.100 |
192.168.*.* |
| Chinese ID | 110101199001011234 |
1101***********234 |
| Passport | AB1234567 |
AB***4567 |
| Bank Account | 1234567890123456 |
1234********3456 |
| IBAN | GB29NWBK60161331926819 |
GB29****6819 |
| MAC Address | 00:1B:44:11:3A:B7 |
00:1B:**:**:**:B7 |
| Date of Birth | 1990-01-15 |
****-**-15 |
Column names containing keywords like name, email, phone, 身份证, 手机号 (English and Chinese) are masked even when the value doesn't match a regex.
Compliance Profiles
cloakpii profiles
| Profile | Jurisdiction | Key Requirements |
|---|---|---|
| PIPL | China | Data localization, cross-border security assessment |
| PDPA | Singapore | DPO required, 30-day access requests |
| GDPR | EU | Explicit consent, 72h breach notice, right to erasure |
| CCPA | California | Right to know / delete / opt-out |
| LGPD | Brazil | Legal basis required, ANPD reporting |
With --compliance-report, PIPL and PDPA runs emit a compliance_report_<profile>.json + .md containing a security-assessment checklist and cross-border transfer notes.
Security
CloakPII is a security tool, so it's built and audited like one:
- AES-256-GCM authenticated encryption; random per-file nonce; random per-run salt stored in the wire header.
- PBKDF2-HMAC-SHA256, 480,000 iterations.
- Tokenization uses AES-GCM-SIV (nonce-misuse-resistant) for stable, reversible pseudonyms.
- Secrets are never written to disk; audit logs and decrypted output are created
0o600. - XML parsed with
defusedxml(XXE-safe); SQLite handlers use parameterized identifiers (no SQL injection);decrypt-allcaps decompression to resist zip bombs.
See CHANGELOG.md for the full security history. Found an issue? Please open a GitHub issue.
Docker
docker run --rm -v $(pwd)/data:/data -v $(pwd)/output:/output \
-e CLOAKPII_PASSWORD=mypassword \
cloakpii migrate --source /data --output /output
Incremental Migration & Resume
With --resume, each processed file (path + SHA-256) is recorded in <output>/.migration_state.db. Later runs skip files whose path and hash are unchanged; modified files are re-processed. Delete the state file to force a full re-run.
Development
git clone https://github.com/Hellotravisss/cloakpii.git
cd cloakpii
pip install -e .
pip install pytest ruff
make test # run the test suite
make lint # ruff
make build # build sdist + wheel
License
MIT License. See LICENSE.
Keywords: PIPL compliance tool · PDPA data masking · cross-border data transfer compliance · PII anonymization / desensitization in Python · redact PII in CSV, JSON, Parquet, SQLite · Chinese ID & Chinese column-name masking · AES-256-GCM file encryption · reversible tokenization / pseudonymization · GDPR / CCPA / LGPD data protection.
Project details
Download files
Download the file for your platform. If you're not sure which to choose, learn more about installing packages.
Source Distribution
Built Distribution
Filter files by name, interpreter, ABI, and platform.
If you're not sure about the file name format, learn more about wheel file names.
Copy a direct link to the current filters
File details
Details for the file cloakpii-1.5.0.tar.gz.
File metadata
- Download URL: cloakpii-1.5.0.tar.gz
- Upload date:
- Size: 83.9 kB
- Tags: Source
- Uploaded using Trusted Publishing? Yes
- Uploaded via: twine/6.1.0 CPython/3.13.12
File hashes
| Algorithm | Hash digest | |
|---|---|---|
| SHA256 |
96088bea132caf53c80af6de3d9128c0f8c24cfd95ec962c7041768c0ac25289
|
|
| MD5 |
316dbf857d03bc10675895ad1bc688a1
|
|
| BLAKE2b-256 |
a3c4c16849b1f8616df2f148babe15e69d140a7f4cc9743f16af86e6b81a5faa
|
Provenance
The following attestation bundles were made for cloakpii-1.5.0.tar.gz:
Publisher:
release.yml on Hellotravisss/cloakpii
-
Statement:
-
Statement type:
https://in-toto.io/Statement/v1 -
Predicate type:
https://docs.pypi.org/attestations/publish/v1 -
Subject name:
cloakpii-1.5.0.tar.gz -
Subject digest:
96088bea132caf53c80af6de3d9128c0f8c24cfd95ec962c7041768c0ac25289 - Sigstore transparency entry: 1892632830
- Sigstore integration time:
-
Permalink:
Hellotravisss/cloakpii@c949ac4ce82b23f0e1f93b40fb3dd9ce08b98a16 -
Branch / Tag:
refs/tags/v1.5.0 - Owner: https://github.com/Hellotravisss
-
Access:
public
-
Token Issuer:
https://token.actions.githubusercontent.com -
Runner Environment:
github-hosted -
Publication workflow:
release.yml@c949ac4ce82b23f0e1f93b40fb3dd9ce08b98a16 -
Trigger Event:
release
-
Statement type:
File details
Details for the file cloakpii-1.5.0-py3-none-any.whl.
File metadata
- Download URL: cloakpii-1.5.0-py3-none-any.whl
- Upload date:
- Size: 60.9 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? Yes
- Uploaded via: twine/6.1.0 CPython/3.13.12
File hashes
| Algorithm | Hash digest | |
|---|---|---|
| SHA256 |
2d5c4e6f2b8640bb405d6a6d3d6502d26e0a4dc8bddee5c34832bf332b4bf339
|
|
| MD5 |
92f500f523d3f2150e02f9707e12ba70
|
|
| BLAKE2b-256 |
8313c5f304550250b1e13b16ce15e3e539ea930e8a5c268e6e1c5480aaf147f6
|
Provenance
The following attestation bundles were made for cloakpii-1.5.0-py3-none-any.whl:
Publisher:
release.yml on Hellotravisss/cloakpii
-
Statement:
-
Statement type:
https://in-toto.io/Statement/v1 -
Predicate type:
https://docs.pypi.org/attestations/publish/v1 -
Subject name:
cloakpii-1.5.0-py3-none-any.whl -
Subject digest:
2d5c4e6f2b8640bb405d6a6d3d6502d26e0a4dc8bddee5c34832bf332b4bf339 - Sigstore transparency entry: 1892633010
- Sigstore integration time:
-
Permalink:
Hellotravisss/cloakpii@c949ac4ce82b23f0e1f93b40fb3dd9ce08b98a16 -
Branch / Tag:
refs/tags/v1.5.0 - Owner: https://github.com/Hellotravisss
-
Access:
public
-
Token Issuer:
https://token.actions.githubusercontent.com -
Runner Environment:
github-hosted -
Publication workflow:
release.yml@c949ac4ce82b23f0e1f93b40fb3dd9ce08b98a16 -
Trigger Event:
release
-
Statement type: