PII desensitization + AES-256-GCM encryption + compliance reporting for cross-border data transfers (PIPL / PDPA / GDPR)

CloakPII

Mask the PII in your data files, encrypt the result with AES-256-GCM, and generate the compliance paperwork for moving it across borders — in one command.

CloakPII auto-detects 11 kinds of personal data (emails, phones, national IDs, cards, IBANs… including Chinese ID numbers and Chinese column names), masks them irreversibly, encrypts every file, and produces a compliance report. Built for the strictest regimes — PIPL (China) and PDPA (Singapore) — plus GDPR, CCPA, and LGPD.

pip install cloakpii
# Provide the password via the environment, not the command line
# (--password would leak into `ps` and your shell history)
export CLOAKPII_PASSWORD=...
# Desensitize + encrypt a folder, and emit a PIPL compliance report
cloakpii migrate --source ./data --output ./safe \
  --compliance-profile pipl --compliance-report
# before                              # after (in ./safe/desensitized)
name,email,phone                      name,email,phone
Wei,wei@corp.cn,138-1234-5678         W***,w***@c******.cn,138-****-**78

Encrypted copies land in ./safe/encrypted/ (AES-256-GCM). Restore the whole tree any time with cloakpii decrypt-all.

Two modes: mask (irreversible) or tokenize (reversible)

# Reversible, join-preserving pseudonyms — masked data stays usable
cloakpii migrate --source ./data --output ./safe --password "$PW" --mode tokenize

In --mode tokenize, every PII value is replaced by a stable token, and the same input always maps to the same token (even across separate runs with the same password):

email,city                              email,city
wei@corp.cn,SH        ──tokenize──▶     tkz_p6dk3s7…,SH    ┐ same value →
wei@corp.cn,BJ                          tkz_p6dk3s7…,BJ    ┘ same token (joins work)
li@corp.cn,SH                           tkz_cx5kz36…,SH

So you can still join, GROUP BY, and de-duplicate the protected data — and recover the originals with the password:

cloakpii decrypt-all  --input ./safe/encrypted --output ./restored --password "$PW"
cloakpii detokenize   --input ./restored       --output ./original  --password "$PW"

Use mask (the default) when the data should never be recoverable; use tokenize when downstream systems still need referential integrity.
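
To make the join-preserving property concrete, here is a minimal sketch of deterministic tokenization. It illustrates the idea only and is not CloakPII's actual implementation: the `tkz_` prefix is taken from the example above, while the HMAC construction, the 11-character token body, the SHA-256 stand-in for key derivation, and the in-memory `vault` used for reversal are all assumptions.

```python
import base64
import hashlib
import hmac


def make_token(value: str, key: bytes) -> str:
    """Map a value to a stable pseudonym: same key + same value -> same token."""
    digest = hmac.new(key, value.encode("utf-8"), hashlib.sha256).digest()
    # base32 keeps the token alphanumeric; 11 characters is an arbitrary choice.
    body = base64.b32encode(digest).decode("ascii").lower()[:11]
    return "tkz_" + body


def tokenize_rows(rows, column, key, vault):
    """Replace `column` in each row with its token and remember the mapping."""
    out = []
    for row in rows:
        token = make_token(row[column], key)
        vault[token] = row[column]  # hypothetical reverse lookup for detokenization
        out.append({**row, column: token})
    return out


key = hashlib.sha256(b"demo-password").digest()  # stand-in for a real KDF
vault = {}
rows = [
    {"email": "wei@corp.cn", "city": "SH"},
    {"email": "wei@corp.cn", "city": "BJ"},
    {"email": "li@corp.cn", "city": "SH"},
]
protected = tokenize_rows(rows, "email", key, vault)
restored = [{**r, "email": vault[r["email"]]} for r in protected]
```

Because the token depends only on the key and the value, the first two rows receive the same token, so a join or GROUP BY on the tokenized column behaves as it would on the original data.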

What this is — and what it isn't

Use it to turn a directory of files containing PII into a desensitized, encrypted copy that is safe to move across borders (the design focus is China ⇄ Singapore, i.e. PIPL + PDPA), together with the paperwork those regimes expect.

Three things to understand before you rely on it:

  • Masking (the default mode) is irreversible. Masked values (alice@x.com → a***@x******.com) cannot be recovered, even after you decrypt. If you need the data to stay usable (joins, dedup) and recoverable, use --mode tokenize instead (see above).
  • Compliance output is documentation, not legal sign-off. The profiles, assessment, and --compliance-report features generate checklists and declaration templates to help you prepare a filing. They do not constitute legal advice or a guarantee of compliance — have counsel review actual cross-border filings.
  • Detection is not exhaustive. The built-in detector is regex + column-name keywords. By design it does not catch: phone numbers written as bare digit runs with no separators (e.g. 13812345678 — masked only if the column name signals PII), old 15-digit Chinese IDs, IPv6 addresses, free-text personal names, or PII glued directly to surrounding letters. Enable the optional ML backend (see ML_SETUP.md) for names and broader coverage, and spot-check the output on a sample before trusting a new dataset.
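
The detection limits described above can be reproduced with a small sketch. The regexes here are illustrative assumptions (the README does not show the real patterns), and `is_pii` and `spot_check` are hypothetical helper names. The column keywords are the ones this README lists for field-name matching.

```python
import re

# Illustrative patterns only; the real detector's regexes are not shown in this README.
EMAIL_RE = re.compile(r"\b[\w.+-]+@[\w-]+(?:\.[\w-]+)+\b")
PHONE_RE = re.compile(r"\b\d{3}[-. ]\d{3,4}[-. ]\d{4}\b")  # requires separators
PII_COLUMN_KEYWORDS = ("name", "email", "phone", "ssn", "address", "passport", "bank_account")


def is_pii(column: str, value: str) -> bool:
    """Flag a cell if its content matches a pattern or its column name signals PII."""
    if EMAIL_RE.search(value) or PHONE_RE.search(value):
        return True
    return any(keyword in column.lower() for keyword in PII_COLUMN_KEYWORDS)


def spot_check(rows):
    """Return the (column, value) cells that were not flagged, for manual review."""
    return [(c, v) for row in rows for c, v in row.items() if not is_pii(c, v)]


sample = [
    {"phone": "13812345678", "note": "call 138-1234-5678"},
    {"contact": "13812345678"},
]
missed = spot_check(sample)
```

In this sketch the bare digit run is flagged under the `phone` column but slips through under `contact`, which is the kind of gap a spot-check on a sample is meant to surface.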

Features

  • Two modes: irreversible masking or reversible, join-preserving tokenization
  • 8 file formats: CSV, JSON, Excel, Parquet, XML, TSV, SQLite, plain text
  • 11 PII types: email, phone, SSN, credit card, IP, Chinese ID, passport, bank account, IBAN, MAC address, date of birth
  • 5 compliance profiles: GDPR (EU), PDPA (Singapore), CCPA (California), LGPD (Brazil), PIPL (China)
  • AES-256-GCM encryption with PBKDF2 key derivation (480k iterations)
  • Parallel processing with configurable worker threads
  • Progress bar for real-time feedback
  • Integrity verification via SHA-256 manifests
  • Audit trail logging (JSON Lines)
  • YAML configuration files with CLI overrides
  • Compression support (gzip)
  • Resume interrupted migrations
  • Docker support
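
As an illustration of the key-derivation item in the list above, the sketch below stretches a password into a 256-bit key using Python's standard library. Only PBKDF2 and the 480k iteration count come from this README. The SHA-256 hash, the 16-byte salt, and the `derive_key` name are assumptions.

```python
import hashlib
import os

ITERATIONS = 480_000  # iteration count stated in the feature list
KEY_LEN = 32          # 32 bytes = 256 bits, the key size AES-256 needs


def derive_key(password: str, salt: bytes) -> bytes:
    """Stretch a password into a 256-bit key with PBKDF2-HMAC-SHA256 (hash is assumed)."""
    return hashlib.pbkdf2_hmac(
        "sha256", password.encode("utf-8"), salt, ITERATIONS, dklen=KEY_LEN
    )


salt = os.urandom(16)  # assumed salt size; a salt must be kept with the ciphertext
key = derive_key("demo-password", salt)
same_key = derive_key("demo-password", salt)
other_key = derive_key("demo-password", os.urandom(16))
```

The same password and salt always yield the same key, which is what makes decryption possible later. A fresh salt yields an unrelated key even for an identical password.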

Quick Start

Installation

pip install cloakpii

Or from source:

git clone https://github.com/Hellotravisss/cloakpii.git
cd cloakpii
pip install -e .

Basic Usage

# Migrate a directory (desensitize + encrypt)
cloakpii migrate --source data/ --output output/ --password mypassword

# Preview what would happen (dry run)
cloakpii migrate --source data/ --dry-run

# Encrypt a single file
cloakpii encrypt input.csv output.csv.enc --password mypassword

# Decrypt a file
cloakpii decrypt output.csv.enc decrypted.csv --password mypassword

# Restore an entire migration output tree (desensitized plaintext)
cloakpii decrypt-all --input output/encrypted --output restored/ --password mypassword

Using Environment Variables

export CLOAKPII_PASSWORD=mypassword
cloakpii migrate --source data/ --output output/

CLI Reference

Commands

Command       Description
migrate       Run full migration pipeline
encrypt       Encrypt a single file
decrypt       Decrypt a single file
decrypt-all   Decrypt a whole migration output tree
detokenize    Reverse --mode tokenize back to originals
init          Initialize project configuration
verify        Verify file integrity against a manifest
status        Show status of a previous migration
profiles      List available compliance profiles

migrate

cloakpii migrate [OPTIONS]

Options:
  --source DIR            Source directory (default: examples)
  --output DIR            Output directory (default: output)
  --mode MODE             mask (irreversible, default) | tokenize (reversible)
  --target NAME           Target jurisdiction (default: singapore)
  --password PW           Encryption password (or use CLOAKPII_PASSWORD env var)
  --config FILE           Path to YAML config file
  --dry-run               Preview without modifying files
  --workers N             Number of parallel workers (default: 1)
  --batch-size N          Max files to process (0 = all)
  --no-progress           Disable progress bar
  --compliance-profile P  Validate against profile (gdpr/pdpa/ccpa/lgpd/pipl)
  --compress              Compress encrypted output with gzip
  --resume                Skip already-processed files
  --no-manifest           Skip SHA-256 manifest generation
  --audit FILE            Path for audit log (JSON Lines)
  --skip-patterns PAT...  Glob patterns for files to skip
  --verbose               Enable debug logging
  --log-file FILE         Write logs to file

Examples

# Parallel processing with 4 workers
cloakpii migrate --source data/ --output out/ --workers 4

# GDPR compliance check
cloakpii migrate --source data/ --compliance-profile gdpr

# Process only first 10 files
cloakpii migrate --source data/ --batch-size 10

# Resume interrupted migration
cloakpii migrate --source data/ --output out/ --resume

# With audit log and compression
cloakpii migrate --source data/ --audit out/audit.jsonl --compress

# Skip test files
cloakpii migrate --source data/ --skip-patterns "test_*" "*.tmp"

Configuration File

Create a migration.yaml for reusable settings:

source: /path/to/data
output: /path/to/output
target: singapore
compliance_profile: pdpa
workers: 4
batch_size: 0
show_progress: true
encrypt_method: aes-256-gcm
audit_log: true
generate_manifest: true
compress_output: false
skip_patterns:
  - "*.tmp"
  - "test_*"
custom_pii_patterns: []
field_mappings: {}

Use it:

cloakpii migrate --config migration.yaml

CLI arguments override config file values.
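
The precedence rule can be pictured as three layers merged in order. This is a sketch under assumptions, not the tool's code: `resolve_settings` is a hypothetical helper, and treating an unset flag as `None` is one common way to tell "not passed" apart from an explicit value. The defaults shown are the ones listed in the migrate options.

```python
DEFAULTS = {"source": "examples", "output": "output", "workers": 1, "batch_size": 0}


def resolve_settings(config: dict, cli_args: dict) -> dict:
    """Layer settings: defaults < config file < CLI flags the user actually passed."""
    passed = {k: v for k, v in cli_args.items() if v is not None}
    return {**DEFAULTS, **config, **passed}


config = {"source": "/path/to/data", "workers": 4}  # as loaded from migration.yaml
cli_args = {"workers": 8, "output": None, "source": None}  # only --workers 8 was given
settings = resolve_settings(config, cli_args)
```

Here `workers` ends up as 8 (CLI wins), `source` comes from the config file, and `output` falls back to the default.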

Supported File Formats

Format    Extension          Description
CSV       .csv               Comma-separated values
JSON      .json              JSON files (nested structures)
Excel     .xlsx, .xls        Excel workbooks (all sheets)
Parquet   .parquet           Apache Parquet columnar format
XML       .xml               XML documents
TSV       .tsv               Tab-separated values
SQLite    .db, .sqlite       SQLite databases (all tables)
Text      .txt, .log, .md    Plain text files

Supported PII Types

PII Type        Example                   Masked Output
Email           user@example.com          u***@e******.com
Phone           555-123-4567              555-***-****
SSN             123-45-6789               ***-**-6789
Credit Card     4111111111111111          4111****1111
IP Address      192.168.1.100             192.168.*.*
Chinese ID      110101199001011234        1101***********234
Passport        AB1234567                 AB***4567
Bank Account    1234567890123456          1234********3456
IBAN            GB29NWBK60161331926819    GB29****6819
MAC Address     00:1B:44:11:3A:B7         00:1B:**:**:**:B7
Date of Birth   1990-01-15                ****-**-15

Field names containing keywords like name, email, phone, ssn, address, passport, bank_account are automatically masked even if content doesn't match a regex pattern.
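
Two of the masked outputs shown in this README can be reproduced with a few lines of Python. This is a sketch inferred from the documented examples only. `mask_email` and `mask_ssn` are hypothetical names, and the real masking rules may differ for inputs not shown here.

```python
def mask_email(email: str) -> str:
    """Keep the first character of the local part and of the domain, plus the TLD."""
    local, _, domain = email.partition("@")
    name, dot, tld = domain.rpartition(".")
    # Fixed-width star runs, so the mask does not reveal the original lengths.
    return f"{local[:1]}***@{name[:1]}******{dot}{tld}"


def mask_ssn(ssn: str) -> str:
    """Hide every digit except the last four, keeping the separators."""
    head, tail = ssn[:-4], ssn[-4:]
    return "".join("*" if ch.isdigit() else ch for ch in head) + tail


masked_email = mask_email("user@example.com")
masked_ssn = mask_ssn("123-45-6789")
```

The star runs have a fixed width rather than one star per hidden character, which matches the documented outputs (wei and user both mask to three stars).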

Compliance Profiles

Route A Focus (v1.1.0+): China & Singapore Compliance

CloakPII is now optimized for PIPL (China) and PDPA (Singapore) — two of the strictest data protection regimes for cross-border transfers.

Quick Start - PIPL (China)

Running migrate with --compliance-profile pipl --compliance-report generates:

  • Full PII desensitization + AES-256-GCM encryption
  • Security assessment checklist
  • Cross-border transfer legal path documentation

Quick Start - PDPA (Singapore)

The PDPA profile includes DPO requirements and notes on handling access requests within 30 days.

cloakpii profiles
Profile   Jurisdiction   Key Requirements
GDPR      EU             Explicit consent, 72h breach notification, right to erasure
PDPA      Singapore      DPO required, 30-day access requests
CCPA      California     Right to know/delete/opt-out
LGPD      Brazil         Legal basis required, ANPD reporting
PIPL      China          Data localization, cross-border assessment required

Docker

# Build
docker build -t cloakpii .

# Run
docker run --rm -v $(pwd)/data:/data -v $(pwd)/output:/output \
  -e CLOAKPII_PASSWORD=mypassword \
  cloakpii migrate --source /data --output /output

Or with docker-compose:

CLOAKPII_PASSWORD=mypassword docker-compose run migrator

Architecture

cloakpii/
├── __init__.py        # Version
├── cli.py             # CLI entry point (argparse)
├── crypto.py          # AES-256-GCM encryption
├── pii.py             # PII detection & desensitization (8 formats)
├── migrate.py         # Migration pipeline orchestration
├── compliance.py      # Jurisdiction compliance profiles
├── integrity.py       # SHA-256 manifest verification
├── config.py          # YAML configuration support
└── audit.py           # Audit trail logging

Pipeline flow:

Source files → Classify → Desensitize PII → Encrypt (AES-256-GCM) → Manifest → Output
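
The Manifest step in the flow above can be pictured with the following sketch, which records a SHA-256 digest per file and later reports any file whose digest changed. The function names and the dict-based manifest layout are assumptions. The format CloakPII actually writes is not described in this README.

```python
import hashlib
import tempfile
from pathlib import Path


def sha256_file(path: Path) -> str:
    """Hash a file in chunks so large files do not need to fit in memory."""
    digest = hashlib.sha256()
    with path.open("rb") as fh:
        for chunk in iter(lambda: fh.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_manifest(root: Path) -> dict:
    """Map each file's path (relative to root) to its SHA-256 digest."""
    return {
        p.relative_to(root).as_posix(): sha256_file(p)
        for p in sorted(root.rglob("*"))
        if p.is_file()
    }


def verify_manifest(root: Path, manifest: dict) -> list:
    """Return the relative paths that are missing or whose digest differs."""
    current = build_manifest(root)
    return [name for name, digest in manifest.items() if current.get(name) != digest]


with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "a.csv.enc").write_bytes(b"ciphertext-1")
    manifest = build_manifest(root)
    clean = verify_manifest(root, manifest)
    (root / "a.csv.enc").write_bytes(b"tampered")
    changed = verify_manifest(root, manifest)
```

An empty result means every listed file still matches its recorded digest.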

Development

# Clone and install
git clone https://github.com/Hellotravisss/cloakpii.git
cd cloakpii
pip install -e .
pip install pytest ruff

# Run tests
make test

# Lint
make lint

# Build
make build

License

MIT License. See LICENSE for details.

Route A Quickstart (PIPL + PDPA) — v1.1.0

# 1. List enhanced compliance profiles
cloakpii profiles

# 2. Run migration with compliance report (PIPL)
CLOAKPII_PASSWORD=yourpass cloakpii migrate \
  --source examples \
  --output output/pipl \
  --compliance-profile pipl \
  --compliance-report

# 3. Same for PDPA (Singapore)
CLOAKPII_PASSWORD=yourpass cloakpii migrate \
  --source examples \
  --output output/pdpa \
  --compliance-profile pdpa \
  --compliance-report

# Reports will be generated:
# - compliance_report_pipl.json + .md
# - compliance_report_pdpa.json + .md

New in v1.1.0 (Route A)

New Commands

# Scan a directory for PII without migrating
cloakpii scan --source data/ --output scan_report.json

# Generate PIPL Security Assessment template
cloakpii assessment --output security_assessment.json

Enhanced migrate command

# Generate professional compliance report (JSON + Markdown)
cloakpii migrate \
  --source examples \
  --compliance-profile pipl \
  --compliance-report

Configuration

You can now store the password in your migration.yaml:

password: "your-password-here"

Incremental Migration & Resume

CloakPII supports incremental/resume migrations using a local SQLite state database.

How it works

  • When you run with --resume, the tool records each successfully processed file (path + SHA-256 hash) in .migration_state.db inside the output directory.
  • On subsequent runs with --resume, files with the same path and hash are automatically skipped.
  • If a file is modified after being processed, its hash changes and it will be re-processed.
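
The three rules above amount to a lookup keyed on path and content hash. A minimal sketch with the standard `sqlite3` module follows. The `processed` table, its columns, and the helper names are assumptions, since the real schema of .migration_state.db is not documented here.

```python
import hashlib
import sqlite3


def open_state(db_path: str) -> sqlite3.Connection:
    """Open (or create) the state database with a hypothetical one-table schema."""
    conn = sqlite3.connect(db_path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS processed (path TEXT PRIMARY KEY, sha256 TEXT NOT NULL)"
    )
    return conn


def needs_processing(conn: sqlite3.Connection, path: str, content: bytes) -> bool:
    """True if the path is new or its content hash differs from the recorded one."""
    digest = hashlib.sha256(content).hexdigest()
    row = conn.execute("SELECT sha256 FROM processed WHERE path = ?", (path,)).fetchone()
    return row is None or row[0] != digest


def mark_processed(conn: sqlite3.Connection, path: str, content: bytes) -> None:
    """Record a successfully processed file, replacing any earlier entry."""
    digest = hashlib.sha256(content).hexdigest()
    conn.execute(
        "INSERT OR REPLACE INTO processed (path, sha256) VALUES (?, ?)", (path, digest)
    )
    conn.commit()


conn = open_state(":memory:")
first_run = needs_processing(conn, "data/users.csv", b"v1")
mark_processed(conn, "data/users.csv", b"v1")
second_run = needs_processing(conn, "data/users.csv", b"v1")
after_edit = needs_processing(conn, "data/users.csv", b"v2")
```

The file is processed on the first run, skipped on the second, and picked up again once its content (and therefore its hash) changes.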

Usage

# First run (processes everything)
cloakpii migrate --source data/ --output out/ --resume

# Later runs (only processes new or changed files)
cloakpii migrate --source data/ --output out/ --resume

State Database Location

The state file is stored at:

<output_directory>/.migration_state.db

You can safely delete this file to force a full re-processing.

Corruption Recovery

If the state database becomes corrupted (e.g. interrupted write), the migrator will automatically delete it and start fresh on the next run.
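
One way such a recovery can work is sketched below: try to use the file, and if SQLite rejects it, delete it and create a fresh database. This is an assumption about the general approach, not a copy of the migrator's code. `open_or_reset` and the `processed` table are hypothetical.

```python
import os
import sqlite3
import tempfile

SCHEMA = "CREATE TABLE IF NOT EXISTS processed (path TEXT PRIMARY KEY, sha256 TEXT)"


def open_or_reset(db_path: str) -> sqlite3.Connection:
    """Open the state DB; if SQLite rejects the file, delete it and start fresh."""
    conn = sqlite3.connect(db_path)
    try:
        conn.execute(SCHEMA)  # the first real statement is where a bad file fails
        return conn
    except sqlite3.DatabaseError:
        conn.close()
        os.remove(db_path)
        conn = sqlite3.connect(db_path)
        conn.execute(SCHEMA)
        return conn


with tempfile.TemporaryDirectory() as tmp:
    db_path = os.path.join(tmp, ".migration_state.db")
    with open(db_path, "wb") as fh:
        fh.write(b"not a sqlite file" * 64)  # simulate a corrupted state file
    conn = open_or_reset(db_path)
    row_count = conn.execute("SELECT COUNT(*) FROM processed").fetchone()[0]
    conn.close()
```

After a reset the state is empty, so the next run re-processes every file, the same effect as deleting the state file by hand.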

Advanced: Custom State Location

For advanced use cases, you can manage the state manually via the Python API:

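import os
from pathlib import Path

from cloakpii.migrate import run_migration
from cloakpii.state import MigrationState

# Keep resume state in a custom file instead of <output_directory>/.migration_state.db
state = MigrationState(Path("custom_state.db"))

report = run_migration(
    source_dir=Path("data"),
    output_dir=Path("out"),
    password=os.environ["CLOAKPII_PASSWORD"],  # read from the environment, not hardcoded
    resume=True,
    state=state,
)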
