Production-grade PII masking CLI tool for data engineers and analysts.

Iki_PII_Masker

Do one thing well: mask PII data.

A production-grade, pipe-friendly CLI tool and Python library for data engineers and analysts who need to sanitize datasets fast — without wrestling with config files or heavyweight frameworks.

pii_masker mask data.csv --auto --strategy fake -o clean.csv


Features

| Feature | Details |
| --- | --- |
| 10 masking strategies | fake, redact, hash, null, partial, keep, tokenize, pseudonymize, generalize, mask_format |
| Reversible masking | AES-256-GCM — restore originals anytime with --key |
| Dual PII detection | Column-name heuristics + cell-value scanning (detect_pii_by_value) |
| Multi-engine | Polars, Pandas, DuckDB, SQLAlchemy (live DB), XML, JSONPath |
| 6 file formats | CSV, Parquet, JSON, NDJSON, Excel, XML |
| Pipe-friendly | stdin → stdout, zero config required |
| Reproducible fakes | --seed for deterministic output in CI/testing |
| Dry run + report | Preview masking plan before touching any data |
| PII detector | detect subcommand scans columns and cell values, prints sample values |
| Profile-driven config | ProfileConfig + ColumnRuleMap — load masking rules from YAML or Python dict |
| Python façade API | Import by feature — no internal sub-packages exposed |

Installation

# from PyPI
pip install iki-pii-masker

Requirements: Python 3.9+

Core dependencies: rich, polars, pandas, faker, cryptography, pyarrow, openpyxl, duckdb

Optional extras:

pip install sqlalchemy psycopg2-binary   # SQLAlchemy adapter (live database)
pip install jsonpath-ng                  # JSONPath adapter (nested JSON)
pip install pyyaml                       # ProfileConfig YAML support
# XML uses stdlib xml.etree — no install needed (lxml optional for speed)

CLI framework: argparse (stdlib — no extra install needed)


Subcommands

| Command | Purpose |
| --- | --- |
| mask | Apply a masking strategy to one or more columns |
| unmask | Decrypt AES-GCM masked columns back to originals |
| detect | Scan a file and suggest which columns contain PII |
| examples | Print a cheat-sheet of usage patterns |

Quick Start

Step 0 — Detect PII first

Before masking anything, run detect to see what the tool finds and review sample values:

pii_masker detect data.csv
┌─────────────┬─────────────┬──────────────────────────────────────────────────┐
│ Column      │ PII Type    │ Sample Values                                    │
├─────────────┼─────────────┼──────────────────────────────────────────────────┤
│ id          │ —           │ 1, 2, 3                                          │
│ full_name   │ name        │ Alice Smith, Bob Jones, Carol White              │
│ email       │ email       │ alice@example.com, bob@corp.org, carol@test.net  │
│ phone       │ phone       │ +1-555-0100, +1-555-0101, +1-555-0102            │
│ credit_card │ credit_card │ 4111111111111234, 5500005555555559               │
│ revenue     │ —           │ 1200.50, 980.00, 750.00                          │
└─────────────┴─────────────┴──────────────────────────────────────────────────┘

Suggested: pii_masker mask data.csv --columns full_name:email:phone:credit_card --strategy fake

Mask with realistic fake data

pii_masker mask data.csv --columns email:full_name:phone --strategy fake -o masked.csv

Auto-detect and redact (Parquet, Polars engine)

pii_masker mask data.parquet --auto --strategy redact --engine polars -o clean.parquet

Reversible masking

Encrypt columns so they can be restored later with the same key:

# Mask
pii_masker mask data.csv \
  --columns user_id:email \
  --reversible \
  --key "my-secret-key-2024" \
  -o masked.csv

# Restore
pii_masker unmask masked.csv \
  --columns user_id:email \
  --key "my-secret-key-2024" \
  -o restored.csv

Encrypted values are stored as ENC:<base64-token> — safe to round-trip through CSV, Parquet, and JSON.

Pipe-friendly

cat raw.csv | pii_masker mask --format csv --strategy fake > clean.csv

cat data.csv \
  | pii_masker mask --format csv --columns email --strategy redact \
  | gzip > masked.csv.gz
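
To make the stdin → stdout pattern concrete, here is a minimal sketch of a streaming CSV filter built on the standard library alone. It is not the tool's implementation; `redact_column` and the `[REDACTED]` label are hypothetical names chosen for illustration.

```python
import csv
import io


def redact_column(src, dst, column, label="[REDACTED]"):
    """Copy CSV rows from src to dst, replacing one column with a fixed label."""
    reader = csv.DictReader(src)
    writer = csv.DictWriter(dst, fieldnames=reader.fieldnames, lineterminator="\n")
    writer.writeheader()
    for row in reader:  # rows are streamed one at a time, never held in memory together
        row[column] = label
        writer.writerow(row)


# In a real pipe, src and dst would be sys.stdin and sys.stdout.
raw = io.StringIO("id,email\n1,alice@example.com\n2,bob@corp.org\n")
out = io.StringIO()
redact_column(raw, out, "email")
print(out.getvalue())
```

Because the function only needs file-like objects, the same code works on pipes, files, and in-memory buffers.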

Partial masking

Keep the last N characters, mask the rest with *:

pii_masker mask data.csv \
  --columns credit_card:phone \
  --strategy partial \
  --partial-keep 4 \
  --partial-side right \
  -o masked.csv

Example results:

4111111111111234  →  ************1234
+1-555-867-5309   →  ***********5309
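
The per-value transformation can be sketched as follows. This is an illustrative stand-in, not the library's code: `partial_mask` is a hypothetical helper, and returning short values unchanged when `keep` covers the whole string is an assumption.

```python
def partial_mask(value: str, keep: int = 4, side: str = "right", fill: str = "*") -> str:
    """Keep `keep` characters on one side and replace every other character with `fill`."""
    if keep <= 0:
        return fill * len(value)
    if keep >= len(value):
        return value  # assumption: nothing left to hide, so return the value as-is
    hidden = fill * (len(value) - keep)
    if side == "right":
        return hidden + value[-keep:]
    return value[:keep] + hidden


print(partial_mask("4111111111111234"))              # keeps the trailing 1234
print(partial_mask("+1-555-867-5309"))               # keeps the trailing 5309
print(partial_mask("4111111111111234", side="left")) # keeps the leading 4111
```

Note that separators such as `-` and `+` are masked too in this sketch, so the output length always equals the input length.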

Dry run with report

Preview exactly what would be masked before writing anything:

pii_masker mask data.csv --auto --strategy fake --dry-run --report

Reproducible fake data (CI / snapshot tests)

pii_masker mask data.csv --columns email:name --strategy fake --seed 42 -o masked.csv
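
Why a seed makes fake output reproducible can be shown with a small standard-library sketch. The tool lists faker as a dependency; this example swaps in `random.Random` so it stays self-contained, and `fake_emails` plus the word lists are hypothetical.

```python
import random

FIRST = ["alex", "sam", "jo", "kim"]
DOMAINS = ["example.com", "example.org", "example.net"]


def fake_emails(count: int, seed: int) -> list[str]:
    """Generate `count` fake addresses from a private RNG seeded with `seed`."""
    rng = random.Random(seed)  # private generator, unaffected by global random state
    return [
        f"{rng.choice(FIRST)}{rng.randint(1, 999)}@{rng.choice(DOMAINS)}"
        for _ in range(count)
    ]


first_run = fake_emails(3, seed=42)
second_run = fake_emails(3, seed=42)
print(first_run == second_run)  # True: same seed, same sequence
```

Snapshot tests can then compare masked output byte for byte across CI runs.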

Hash with salt

pii_masker mask data.csv \
  --columns user_id \
  --strategy hash \
  --salt "pepper_$(date +%Y)" \
  -o hashed.csv
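
As a rough sketch of salted hashing: the option reference says the salt is prepended before hashing, and the strategy table shows output shaped like `SHA:3d7a2c1e9b4f`. The use of SHA-256 here and the 12-character truncation are assumptions inferred from that example, and `salted_hash` is a hypothetical helper.

```python
import hashlib


def salted_hash(value: str, salt: str = "", length: int = 12) -> str:
    """Hash salt + value with SHA-256 and keep a short hex prefix."""
    digest = hashlib.sha256((salt + value).encode("utf-8")).hexdigest()
    return "SHA:" + digest[:length]


a = salted_hash("user-1001", salt="pepper_2024")
b = salted_hash("user-1001", salt="pepper_2024")
c = salted_hash("user-1001", salt="pepper_2025")
print(a == b)  # True: same salt and value give the same digest, so joins still work
print(a == c)  # False: changing the salt changes every digest
```

The second comparison is the practical caveat of a salt that changes over time: hashes produced with different salts cannot be joined against each other.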

Null out sensitive columns

pii_masker mask report.xlsx \
  --columns ssn:dob \
  --strategy null \
  --engine pandas \
  -o clean.xlsx

All Strategies

| Strategy | Output example | Reversible? | Best for |
| --- | --- | --- | --- |
| fake | alice@fake.com | No | Realistic test/dev data |
| redact | [EMAIL] | With --reversible | Audit logs, shared reports |
| hash | SHA:3d7a2c1e9b4f | With --reversible | Join keys, deduplication |
| null | null | No | Dropping PII for analytics |
| partial | ****1234 | No | Card numbers, phone numbers |
| keep | original value | N/A | Whitelisting non-PII columns |
| tokenize | TOK-3d7a2c1e | Via token table | Stable opaque tokens; cross-run lookup possible |
| pseudonymize | Barbara Clark | Via mapping dict | Consistent fakes — same input → same fake output |
| generalize | 30-40 / 1990 | No | Analytics bucketing — ages, dates, zip codes |
| mask_format | xxxx@xxxxxxx.xxx | No | Format-preserving masking; keeps separators |

New strategy details

tokenize — replaces each value with a stable TOK-<hex> token. The same input always maps to the same token within a run. Access the lookup table via TokenizeStrategy.token_table or reverse a token with .detokenize(token).
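
A minimal sketch of the idea, assuming tokens are a truncated SHA-256 digest (the performance table describes tokenize as SHA-256 based; the 8-character truncation is inferred from the `TOK-3d7a2c1e` example). `Tokenizer` is a hypothetical class, not the library's `TokenizeStrategy`.

```python
import hashlib


class Tokenizer:
    """Map values to stable TOK-<hex> tokens and remember the reverse mapping."""

    def __init__(self):
        self.token_table = {}  # token -> original value

    def tokenize(self, value: str) -> str:
        token = "TOK-" + hashlib.sha256(value.encode("utf-8")).hexdigest()[:8]
        self.token_table[token] = value
        return token

    def detokenize(self, token: str) -> str:
        return self.token_table[token]


t = Tokenizer()
tok = t.tokenize("user-1001")
print(tok == t.tokenize("user-1001"))  # True: same input, same token
print(t.detokenize(tok))               # user-1001
```

With only 8 hex characters, two different inputs can in principle share a token on very large datasets; a production design would need a longer token or a collision check.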

pseudonymize — like fake but consistent: the same real name always becomes the same fake name. This preserves referential integrity across tables — a user_id that appears in five tables will map to the same fake ID in all five after masking.
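
One way to get this consistency is to derive the fake value from a hash of the real one, so no shared state is needed between tables. The sketch below uses that swapped-in technique with hypothetical name lists and a hypothetical `pseudonym` helper; the library itself may work differently (for example with a seeded generator and a mapping dict).

```python
import hashlib

FIRST_NAMES = ["Barbara", "Daniel", "Priya", "Tomas", "Mei", "Omar"]
LAST_NAMES = ["Clark", "Nguyen", "Okafor", "Silva", "Berg", "Haddad"]


def pseudonym(real_name: str, secret: str = "") -> str:
    """Pick a fake name deterministically from a hash of the real name."""
    digest = hashlib.sha256((secret + real_name).encode("utf-8")).digest()
    first = FIRST_NAMES[digest[0] % len(FIRST_NAMES)]
    last = LAST_NAMES[digest[1] % len(LAST_NAMES)]
    return f"{first} {last}"


orders = [pseudonym(n) for n in ["Alice Smith", "Bob Jones"]]
invoices = [pseudonym(n) for n in ["Bob Jones", "Alice Smith"]]
print(orders[0] == invoices[1])  # True: "Alice Smith" maps to the same fake in both tables
```

With small name lists, different real names can collide on the same fake name; larger pools reduce that risk.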

generalize — coarsens precise values into broader ranges. Numerics become range buckets (34 → 30-40), dates are truncated to year or month (1990-07-15 → 1990), and strings are prefix-masked (SW1A2AA → SW1****).
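
The three coarsening rules can be sketched as small functions. The bucket width of 10 and the prefix length of 3 are assumptions taken from the examples in this document, and the function names are hypothetical.

```python
from datetime import date


def generalize_number(n: int, width: int = 10) -> str:
    """Replace a number with the bucket that contains it, e.g. 34 -> 30-40."""
    low = (n // width) * width
    return f"{low}-{low + width}"


def generalize_date(iso_date: str, keep: str = "year") -> str:
    """Truncate an ISO date to its year, or to year-month when keep='month'."""
    d = date.fromisoformat(iso_date)  # raises ValueError on malformed input
    return f"{d.year:04d}" if keep == "year" else f"{d.year:04d}-{d.month:02d}"


def generalize_string(s: str, prefix: int = 3) -> str:
    """Keep a short prefix and mask the remainder with asterisks."""
    return s[:prefix] + "*" * max(len(s) - prefix, 0)


print(generalize_number(34))                  # 30-40
print(generalize_date("1990-07-15"))          # 1990
print(generalize_date("1990-07-15", "month")) # 1990-07
print(generalize_string("SW1A2AA"))           # SW1****
```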

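mask_format — replaces alphanumeric characters with a mask character while keeping structural separators (., -, @, spaces, brackets) in place. In the examples in this document, letters become x and digits become *: an email like john@corp.com becomes xxxx@xxxx.xxx, and 4111-1234-5678-9000 becomes ****-****-****-****. The shape is preserved, so format-sensitive downstream systems can still parse the value.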
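
A minimal sketch of format-preserving masking follows. The split between `x` for letters and `*` for digits is an assumption inferred from the examples in this document, and `mask_format` here is a standalone hypothetical function rather than the library's strategy class.

```python
def mask_format(value: str, letter: str = "x", digit: str = "*") -> str:
    """Mask letters and digits but leave separators and whitespace untouched."""
    out = []
    for ch in value:
        if ch.isdigit():
            out.append(digit)
        elif ch.isalpha():
            out.append(letter)
        else:
            out.append(ch)  # separators such as @ . - ( ) and spaces keep their place
    return "".join(out)


print(mask_format("john@corp.com"))        # xxxx@xxxx.xxx
print(mask_format("4111-1234-5678-9000"))  # ****-****-****-****
```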


Full Option Reference

pii_masker mask

Arguments:
  [INPUT_FILE]              Input file path. Omit to read from stdin.

Options:
  -o, --output PATH         Output file path. Omit to write to stdout.
  -c, --columns TEXT        Colon-separated column names. e.g. email:name:phone
  -s, --strategy STRATEGY   fake|redact|hash|null|partial|keep|
                            tokenize|pseudonymize|generalize|mask_format
                            [default: redact]
  -e, --engine ENGINE       polars|pandas|duckdb  [default: polars]
  -f, --format FORMAT       csv|parquet|json|ndjson|excel|xml
                            (auto-detected from extension)
      --auto                Auto-detect PII columns by name heuristics
      --reversible          Use AES-256-GCM reversible encryption
      --key TEXT            Secret key for reversible masking
      --salt TEXT           Salt prepended before hashing  [default: ""]
      --seed INTEGER        RNG seed for reproducible fake data
      --partial-keep INT    Number of characters to keep  [default: 4]
      --partial-side TEXT   Which side to keep: right|left  [default: right]
      --dry-run             Preview masking plan without writing output
      --report              Print a masking summary table after processing
      --no-progress         Disable the progress bar

pii_masker unmask

Arguments:
  [INPUT_FILE]              Input file path. Omit to read from stdin.

Options:
  -o, --output PATH         Output file path. Omit to write to stdout.
  -c, --columns TEXT        Colon-separated columns to decrypt  [required]
      --key TEXT            Secret key used during masking  [required]
  -e, --engine ENGINE       polars|pandas|duckdb  [default: polars]
  -f, --format FORMAT       csv|parquet|json|ndjson|excel

pii_masker detect

Arguments:
  [INPUT_FILE]              Input file path. Omit to read from stdin.

Options:
  -f, --format FORMAT       csv|parquet|json|ndjson|excel
  -e, --engine ENGINE       polars|pandas|duckdb  [default: polars]
      --samples INTEGER     Sample values to show per column  [default: 3]

Python API

Every feature is accessible through the façade module. Import only what you need — no internal sub-packages, no internal classes.

from Iki_PII_Masker.facade import detect_pii              # column-name PII detection
from Iki_PII_Masker.facade import detect_pii_by_value     # cell-value PII detection
from Iki_PII_Masker.facade import mask_dataframe           # apply any strategy
from Iki_PII_Masker.facade import unmask_dataframe         # reverse AES masking
from Iki_PII_Masker.facade import load_data, save_data     # file I/O
from Iki_PII_Masker.facade import make_context, make_reversible_context
from Iki_PII_Masker.facade import derive_encryption_key
from Iki_PII_Masker.facade import create_adapter           # polars / pandas / duckdb
from Iki_PII_Masker.facade import create_sql_adapter       # live relational database
from Iki_PII_Masker.facade import create_xml_adapter       # XML documents
from Iki_PII_Masker.facade import create_jsonpath_adapter  # nested JSON
from Iki_PII_Masker.facade import report_detection, report_masking
from Iki_PII_Masker.facade import ProfileConfig, ColumnRuleMap
from Iki_PII_Masker.facade import Strategy, Engine, FileFormat

Façade feature reference

| Feature | What it does |
| --- | --- |
| detect_pii(columns) | Scan column names → {col: PIIType} for every PII match |
| detect_pii_by_value(adapter, sample_rows, threshold) | Scan actual cell values — catches generic column names like col_7 |
| mask_dataframe(adapter, columns, strategy, context) | Apply any of 10 strategies to named columns; returns elapsed seconds |
| unmask_dataframe(adapter, columns, key) | Reverse AES-256-GCM masking in-place |
| load_data(adapter, source, fmt) | Load a file, path, BytesIO, or None (stdin) into an adapter |
| save_data(adapter, dest, fmt) | Write adapter data to a file, BytesIO, or None (stdout) |
| make_context(**kwargs) | Build a plain MaskingContext (salt, seed, partial options) |
| make_reversible_context(secret) | Build a context that AES-encrypts every value; key derived from secret |
| derive_encryption_key(secret) | Derive 32-byte AES key from a secret string |
| create_adapter(engine) | Instantiate a Polars, Pandas, or DuckDB adapter |
| create_sql_adapter(url, table) | Mask a live database table via SQLAlchemy |
| create_xml_adapter(xpath, fields) | Mask XML documents by XPath row selector |
| create_jsonpath_adapter(paths) | Mask nested JSON by JSONPath expressions |
| ProfileConfig.from_yaml(path) | Load masking rules from a YAML file |
| ProfileConfig.from_dict(data) | Build masking rules from a Python dict |
| ColumnRuleMap({col: Strategy}) | Per-column strategy map with a single .apply(adapter) call |
| report_detection(adapter, detected, file) | Print Rich PII detection table with sample values |
| report_masking(adapter, col_map, strategy, elapsed) | Print Rich masking summary table |

Detection

Column-name detection (fast, zero I/O):

from Iki_PII_Masker.facade import detect_pii, report_detection
from Iki_PII_Masker.facade import create_adapter, load_data, Engine
from pathlib import Path

adapter  = create_adapter(Engine.polars)
load_data(adapter, Path("data.csv"))

detected = detect_pii(adapter.columns)
report_detection(adapter, detected, Path("data.csv"), samples=3)

Cell-value detection (catches generic column names like col_7):

from Iki_PII_Masker.facade import detect_pii, detect_pii_by_value

name_hits  = detect_pii(adapter.columns)
value_hits = detect_pii_by_value(adapter, sample_rows=100, existing=name_hits)
all_found  = {**name_hits, **value_hits}

Masking strategies

Fake data (reproducible):

from Iki_PII_Masker.facade import mask_dataframe, make_context, Strategy

mask_dataframe(adapter, "email:full_name:phone", Strategy.fake, make_context(seed=42))

Pseudonymize — consistent fakes (preserves referential integrity):

# Same "Alice Smith" in every table → same fake name everywhere
mask_dataframe(adapter, "full_name:email", Strategy.pseudonymize, make_context(seed=1))

Tokenize — stable opaque tokens:

# user_id → TOK-3d7a2c1e  (same input = same token within the run)
mask_dataframe(adapter, "user_id", Strategy.tokenize)

Generalize — coarsen to ranges / year buckets:

# 34 → "30-40",  1990-07-15 → "1990",  SW1A2AA → "SW1****"
mask_dataframe(adapter, "age:dob:zip", Strategy.generalize)

MaskFormat — preserve structural separators:

# john@corp.com → xxxx@xxxx.xxx,  4111-1234-5678-9000 → ****-****-****-****
mask_dataframe(adapter, "email:credit_card", Strategy.mask_format)

Hash with salt:

mask_dataframe(adapter, "user_id:email", Strategy.hash, make_context(salt="pepper_2024"))

Partial masking — keep last 4 digits:

mask_dataframe(adapter, "credit_card:phone", Strategy.partial,
               make_context(partial_keep=4, partial_side="right"))

Null out sensitive columns:

mask_dataframe(adapter, "ssn:dob:password", Strategy.null)

Reversible masking — mask then restore:

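from pathlib import Path

from Iki_PII_Masker.facade import (
    create_adapter, load_data, save_data,
    mask_dataframe, unmask_dataframe,
    make_reversible_context, derive_encryption_key, Strategy, Engine,
)

SECRET = "my-production-secret-2024"

# `adapter` is an adapter that already holds the source data (see Detection above)
mask_dataframe(adapter, "email:user_id", Strategy.redact,
               make_reversible_context(SECRET))
save_data(adapter, Path("masked.csv"))

# Restore into a fresh adapter with the key derived from the same secret
key = derive_encryption_key(SECRET)
adapter2 = create_adapter(Engine.polars)
load_data(adapter2, Path("masked.csv"))
unmask_dataframe(adapter2, ["email", "user_id"], key)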

Multi-strategy pipeline on one adapter:

mask_dataframe(adapter, "email:full_name",  Strategy.pseudonymize, make_context(seed=42))
mask_dataframe(adapter, "credit_card",      Strategy.mask_format)
mask_dataframe(adapter, "dob:age",          Strategy.generalize)
mask_dataframe(adapter, "user_id",          Strategy.tokenize)
mask_dataframe(adapter, "password:ssn",     Strategy.null)

Adapters

Standard adapters (Polars / Pandas / DuckDB):

from Iki_PII_Masker.facade import create_adapter, Engine

adapter = create_adapter(Engine.polars)   # fastest general-purpose
adapter = create_adapter(Engine.pandas)   # use for Excel I/O
adapter = create_adapter(Engine.duckdb)   # use for files larger than RAM

SQLAlchemy adapter — mask a live database table:

from Iki_PII_Masker.facade import create_sql_adapter, mask_dataframe, Strategy

# Requires: pip install sqlalchemy psycopg2-binary
adapter = create_sql_adapter(
    url="postgresql+psycopg2://user:pass@localhost/mydb",
    table="users",
    id_column="id",
    chunk_size=500,
)
adapter.load()   # fetches all rows into memory
mask_dataframe(adapter, "email:phone", Strategy.fake)
adapter.save()   # writes batched UPDATEs back to the database

Supported databases: PostgreSQL, MySQL, MariaDB, SQLite, MS SQL Server, Oracle (anything with a SQLAlchemy driver).

XML adapter — mask XML documents by XPath:

from pathlib import Path

from Iki_PII_Masker.facade import (
    create_xml_adapter, load_data, save_data, mask_dataframe, Strategy,
)

# Requires no extra install — uses stdlib xml.etree (or lxml if installed)
adapter = create_xml_adapter(
    xpath="//user",                      # repeating row element
    pii_fields=["email", "phone", "name"],
)
load_data(adapter, Path("users.xml"))
mask_dataframe(adapter, "email:phone:name", Strategy.fake)
save_data(adapter, Path("masked.xml"))

JSONPath adapter — mask nested JSON:

from pathlib import Path

from Iki_PII_Masker.facade import (
    create_jsonpath_adapter, load_data, save_data, mask_dataframe, Strategy,
)

# Requires: pip install jsonpath-ng
adapter = create_jsonpath_adapter({
    "email": "$.users[*].contact.email",
    "phone": "$.users[*].contact.phone",
})
load_data(adapter, Path("data.json"))
mask_dataframe(adapter, "email:phone", Strategy.redact)
save_data(adapter, Path("masked.json"))

Profile-driven masking

ColumnRuleMap — apply per-column strategies in a single call:

from Iki_PII_Masker.facade import ColumnRuleMap, Strategy, make_context

rules = ColumnRuleMap({
    "email":       Strategy.fake,
    "full_name":   Strategy.pseudonymize,
    "credit_card": Strategy.partial,
    "ssn":         Strategy.null,
    "user_id":     Strategy.hash,
})
rules.apply(adapter, make_context(seed=42))

ProfileConfig — load rules from a YAML file:

# masking_profile.yaml
engine: polars
strategy: redact # default for auto-detected columns
seed: 42
auto: true # also auto-detect any PII not listed below
columns:
  email: fake
  full_name: pseudonymize
  credit_card: partial
  ssn: "null" # quoted so YAML reads the strategy name, not a null value
  user_id: tokenize
  dob: generalize
  phone: mask_format

Then load and apply the profile from Python:

from pathlib import Path

from Iki_PII_Masker.facade import ProfileConfig, create_adapter, load_data, save_data

profile = ProfileConfig.from_yaml("masking_profile.yaml")
adapter = create_adapter(profile.engine)
load_data(adapter, Path("data.csv"))
profile.apply(adapter)
save_data(adapter, Path("masked.csv"))

Or build a profile in Python without a file:

profile = ProfileConfig.from_dict({
    "engine":   "polars",
    "strategy": "redact",
    "seed":     42,
    "auto":     True,
    "columns": {
        "email":     "fake",
        "ssn":       "null",
        "user_id":   "tokenize",
        "full_name": "pseudonymize",
    },
})
profile.apply(adapter)

Save a profile back to YAML for reuse:

profile.to_yaml("masking_profile.yaml")

In-memory pipe (BytesIO)

import io
from pathlib import Path

from Iki_PII_Masker.facade import create_adapter, load_data, save_data
from Iki_PII_Masker.facade import mask_dataframe, make_context, Strategy, Engine, FileFormat

buf_in  = io.BytesIO(Path("data.csv").read_bytes())  # read_bytes() closes the file handle
adapter = create_adapter(Engine.polars)
load_data(adapter, buf_in, FileFormat.csv)
mask_dataframe(adapter, "email:full_name", Strategy.fake, make_context(seed=99))

buf_out = io.BytesIO()
save_data(adapter, buf_out, FileFormat.csv)

PII Auto-Detection

Column-name detection

The --auto flag, detect command, and detect_pii() match column names against regex heuristics for ten built-in PII types:

| PII Type | Matched column names (examples) |
| --- | --- |
| email | email, email_address, mail |
| phone | phone, mobile, cell, telephone, contact_number |
| name | full_name, first_name, last_name, username, name |
| address | address, street, city, state, zip, postal_code |
| ssn | ssn, social_security, national_id |
| dob | dob, date_of_birth, birthdate, birthday |
| ip | ip_address, ip, ipv4, ipv6 |
| credit_card | credit_card, card_number, cc_number, pan |
| user_id | user_id, userid, account_id, customer_id |
| password | password, passwd, pwd |
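
A rough sketch of how name-based heuristics like these can work, using a small subset of the types above. The patterns and the `detect_by_name` helper are hypothetical; the library's actual regexes are not shown in this document.

```python
import re

# Hypothetical subset of name patterns, keyed by PII type.
NAME_PATTERNS = {
    "email": r"e?mail",
    "phone": r"phone|mobile|cell|telephone|contact_number",
    "ssn": r"ssn|social_security|national_id",
    "credit_card": r"credit_card|card_number|cc_number|pan",
}


def detect_by_name(columns):
    """Return {column: pii_type} for columns whose name matches a pattern."""
    hits = {}
    for col in columns:
        for pii_type, pattern in NAME_PATTERNS.items():
            # Anchor on start/end or underscores so "pan" does not match "company".
            if re.search(rf"(?:^|_)(?:{pattern})(?:_|$)", col.lower()):
                hits[col] = pii_type
                break
    return hits


print(detect_by_name(["id", "email_address", "mobile", "company", "cc_number"]))
```

Anchoring each pattern on underscores or string boundaries is a design choice that trades a few missed columns for far fewer false positives on names that merely contain a PII keyword.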

Cell-value detection

detect_pii_by_value() scans actual cell values with regex patterns — it catches columns with generic names (col_7, field_2) that still contain Social Security numbers, credit card numbers, emails, and so on.

from Iki_PII_Masker.facade import detect_pii, detect_pii_by_value

# Step 1 — fast name-based scan
name_hits  = detect_pii(adapter.columns)

# Step 2 — deeper value scan for anything missed
value_hits = detect_pii_by_value(adapter, sample_rows=100, threshold=0.3)

# Combined results
all_found  = {**name_hits, **value_hits}

threshold is the fraction of sampled non-null values that must match a pattern before a column is flagged (default 0.3 = 30 %).
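
The threshold rule can be sketched on plain Python rows. The two regexes, the `detect_by_value` helper, and the choice of "greater than or equal to" for the comparison are assumptions made for illustration.

```python
import re

# Hypothetical value patterns; real detectors would cover more PII types.
VALUE_PATTERNS = {
    "email": re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$"),
    "ssn": re.compile(r"^\d{3}-\d{2}-\d{4}$"),
}


def detect_by_value(rows, threshold=0.3):
    """Flag a column when enough of its sampled non-null values match a pattern."""
    hits = {}
    columns = rows[0].keys() if rows else []
    for col in columns:
        values = [str(r[col]) for r in rows if r.get(col) is not None]
        if not values:
            continue
        for pii_type, pattern in VALUE_PATTERNS.items():
            share = sum(1 for v in values if pattern.match(v)) / len(values)
            if share >= threshold:
                hits[col] = pii_type
                break
    return hits


sample = [
    {"col_7": "alice@example.com", "note": "hello"},
    {"col_7": "bob@corp.org", "note": "x@y.z"},
    {"col_7": None, "note": "n/a"},
    {"col_7": "n/a", "note": "ok"},
]
print(detect_by_value(sample))  # {'col_7': 'email'}
```

Here `col_7` has 2 matches out of 3 non-null values and is flagged, while `note` has 1 out of 4 (25 %) and stays below the 30 % threshold.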

Register a custom PII type

from Iki_PII_Masker.facade import PIIRegistry, PIIType

PIIRegistry.register(PIIType(
    name="api_key",
    patterns=[r"\bapi_key\b", r"\btoken\b", r"\baccess_key\b"],
    redact_label="[TOKEN]",
    faker_method="uuid4",
))

Reversible Masking — How It Works

When --reversible --key <secret> is passed (or make_reversible_context(secret) in Python):

  1. A 32-byte AES key is derived from your secret using SHA-256.
  2. Each value is encrypted with AES-256-GCM using a random 96-bit nonce.
  3. The nonce + ciphertext + GCM tag are base64-encoded as ENC:<token> and stored in place of the original value.
  4. pii_masker unmask --key <same-secret> (or unmask_dataframe) reverses the process: it base64-decodes the token, splits off the nonce, and decrypts with the same derived key.

Because each value gets a fresh random nonce, identical inputs produce different ciphertext — preventing frequency analysis on the masked dataset.
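
The key derivation and token layout from the steps above can be sketched with the standard library. The AES-256-GCM encryption itself is deliberately left out (it would come from the cryptography package); the 16-byte tag length, the standard base64 alphabet, and all function names are assumptions.

```python
import base64
import hashlib
import os

NONCE_LEN = 12  # 96-bit nonce, as described in step 2
TAG_LEN = 16    # assumption: the common 128-bit GCM tag


def derive_key(secret: str) -> bytes:
    """Step 1: SHA-256 of the secret yields a 32-byte key."""
    return hashlib.sha256(secret.encode("utf-8")).digest()


def pack_token(nonce: bytes, ciphertext: bytes, tag: bytes) -> str:
    """Step 3: concatenate nonce + ciphertext + tag and base64-encode."""
    return "ENC:" + base64.b64encode(nonce + ciphertext + tag).decode("ascii")


def unpack_token(token: str):
    """Inverse of pack_token: recover (nonce, ciphertext, tag)."""
    if not token.startswith("ENC:"):
        raise ValueError("not an encrypted token")
    raw = base64.b64decode(token[4:])
    return raw[:NONCE_LEN], raw[NONCE_LEN:-TAG_LEN], raw[-TAG_LEN:]


key = derive_key("my-secret-key-2024")
nonce = os.urandom(NONCE_LEN)  # fresh random nonce per value
# Placeholder bytes stand in for real AES-GCM output in this sketch.
token = pack_token(nonce, b"ciphertext-bytes", b"\x00" * TAG_LEN)
print(len(key), token.startswith("ENC:"))  # 32 True
```

Plain SHA-256 is fast to brute-force, so a scheme like this depends on a long, high-entropy secret rather than a human-chosen password.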

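Security note — key handling: A literal key typed after --key ends up in shell history, and the argument is visible in ps output while the command runs. In production, read the key from a secret store into an environment variable so the literal secret never appears in scripts or history. The expanded value is still part of the process's argument list for the duration of the command: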

export MASK_KEY=$(vault kv get -field=key secret/pii-key)
pii_masker mask data.csv --columns email --reversible --key "$MASK_KEY" -o out.csv

Performance

Benchmarked on a 10M-row, 500 MB CSV with 5 PII columns:

| Engine | Strategy | Time | Notes |
| --- | --- | --- | --- |
| Polars | redact | ~4s | Best all-rounder |
| Polars | hash | ~5s | |
| Polars | fake | ~18s | |
| Polars | pseudonymize | ~19s | Slightly slower than fake |
| Polars | tokenize | ~6s | Fast — SHA-256 based |
| Polars | generalize | ~5s | |
| Polars | mask_format | ~6s | |
| DuckDB | redact | ~4s | Handles files larger than RAM |
| DuckDB | fake | ~19s | |
| Pandas | redact | ~9s | Use for Excel I/O |
| Pandas | fake | ~35s | |

Polars is the default for speed. Use DuckDB when your file is too large to fit in memory. Use Pandas only when you need Excel I/O or tight ecosystem integration. Use SQLAlchemy for masking data directly in a live database without exporting to files first.


Architecture

pii_masker is built around five design patterns that keep it easy to extend without touching existing code:

Strategy — each masking algorithm is an independent class. Adding a new algorithm means adding one file; no existing code changes.

Registry — PIIRegistry is the single source of truth for all PII metadata. Adding a new PII type is one entry in one place.

Adapter — all engines expose an identical interface to the rest of the codebase. Swapping or adding an engine requires one new class.

Factory — StrategyFactory, AdapterFactory, and FormatRegistry centralise all object creation so CLI functions contain zero branching logic.

Façade — facade.py is the single public door into the Python API. Every capability is exposed as a named action function so callers never import from internal sub-packages directly.
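
How the Strategy and Factory patterns fit together can be sketched in a few lines. Every name below (`MaskingStrategy`, `STRATEGIES`, `register`, `create_strategy`) is hypothetical and only mirrors the idea; it is not the package's actual class hierarchy.

```python
from abc import ABC, abstractmethod


class MaskingStrategy(ABC):
    """Hypothetical base class: one subclass per masking algorithm."""

    def apply(self, value):
        # Leave missing values untouched; delegate everything else.
        return None if value is None else self._apply(str(value))

    @abstractmethod
    def _apply(self, value: str) -> str:
        ...


STRATEGIES = {}  # registry: strategy name -> class


def register(name):
    """Class decorator that adds a strategy to the registry."""
    def wrap(cls):
        STRATEGIES[name] = cls
        return cls
    return wrap


@register("redact")
class Redact(MaskingStrategy):
    def _apply(self, value):
        return "[REDACTED]"


@register("stars")
class Stars(MaskingStrategy):
    def _apply(self, value):
        return "*" * len(value)


def create_strategy(name):
    """Factory: callers pass a name and never branch on strategy type."""
    try:
        return STRATEGIES[name]()
    except KeyError:
        raise ValueError(f"unknown strategy: {name!r}") from None


print(create_strategy("redact").apply("alice@example.com"))  # [REDACTED]
print(create_strategy("stars").apply("secret"))              # ******
```

Adding a new algorithm in this sketch means writing one decorated class; `create_strategy` and its callers stay unchanged.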

Package layout

src/Iki_PII_Masker/
├── facade.py                  ← public Python API (import from here)
├── service.py                 ← MaskingService orchestrator
├── reporter.py                ← Rich terminal output
├── cli.py                     ← argparse CLI entry point
├── app.py                     ← CLI command implementations
├── config/
│   ├── enums.py               ← Strategy, Engine, FileFormat
│   ├── registry.py            ← PIIType, PIIRegistry
│   ├── crypto.py              ← AES-256-GCM helpers
│   ├── io.py                  ← load/save routing
│   ├── value_detector.py      ← ValuePatternDetector (cell-value PII scan)
│   ├── xml_io.py              ← XMLAdapter
│   ├── jsonpath_io.py         ← JSONPathAdapter
│   ├── profile.py             ← ProfileConfig, ColumnRuleMap
│   └── utils.py               ← exit_error helper
├── strategies/
│   ├── base.py                ← BaseMaskingStrategy, MaskingContext
│   ├── redact.py
│   ├── fake.py
│   ├── hash.py
│   ├── partial.py
│   ├── null.py
│   ├── keep.py
│   ├── tokenize.py            ← TokenizeStrategy
│   ├── pseudonymize.py        ← PseudonymizeStrategy
│   ├── generalize.py          ← GeneralizeStrategy
│   ├── mask_format.py         ← MaskFormatStrategy
│   └── factory.py             ← StrategyFactory, FormatRegistry
└── adapters/
    ├── base.py                ← BaseDataFrameAdapter
    ├── polars_adapter.py
    ├── pandas_adapter.py
    ├── duckdb_adapter.py
    ├── sqlalchemy_adapter.py  ← SQLAlchemyAdapter
    └── factory.py             ← AdapterFactory

Integration Examples

dbt post-hook

dbt run --select sensitive_model && \
  pii_masker mask target/run/sensitive_model.csv \
    --auto --strategy fake \
    -o exports/masked_sensitive_model.csv

Apache Airflow

from airflow.operators.bash import BashOperator

mask_pii = BashOperator(
    task_id="mask_pii",
    bash_command=(
        "pii_masker mask {{ params.input }} "
        "--auto --strategy redact "
        "--engine polars "
        "-o {{ params.output }}"
    ),
    params={"input": "/data/raw.parquet", "output": "/data/masked.parquet"},
)

GitHub Actions — sanitize test fixtures

- name: Mask PII in test fixtures
  run: |
    pii_masker mask tests/fixtures/users.csv \
      --columns email:phone:full_name \
      --strategy fake \
      --seed 42 \
      -o tests/fixtures/users_masked.csv

Profile-driven CI masking

# .github/workflows/mask.yml
- name: Apply masking profile
  run: |
    python - <<'EOF'
    from Iki_PII_Masker.facade import ProfileConfig, create_adapter, load_data, save_data
    from pathlib import Path

    profile = ProfileConfig.from_yaml("masking_profile.yaml")
    adapter = create_adapter(profile.engine)
    load_data(adapter, Path("data/raw.csv"))
    profile.apply(adapter)
    save_data(adapter, Path("data/masked.csv"))
    EOF

Pre-commit hook — block raw PII from being committed

# .pre-commit-config.yaml
- repo: local
  hooks:
    - id: mask-pii
      name: Mask PII in fixture files
      language: system
      entry: pii_masker mask --auto --strategy redact --dry-run --report
      files: tests/fixtures/.*\.(csv|parquet)$

Mask a PostgreSQL table directly

from Iki_PII_Masker.facade import (
    create_sql_adapter, mask_dataframe, Strategy, make_context
)

adapter = create_sql_adapter(
    url="postgresql+psycopg2://user:pass@localhost/prod",
    table="customers",
)
adapter.load()
mask_dataframe(adapter, "email:phone:full_name", Strategy.fake, make_context(seed=42))
adapter.save()

Testing

The test suite lives in tests/ and covers all layers.

# Install dev dependencies
pip install -e ".[dev]"
pip install sqlalchemy jsonpath-ng pyyaml    # optional adapters

# Run all 207 tests
python -m pytest

# Run with coverage report
python -m pytest --cov=Iki_PII_Masker --cov-report=term-missing

# Run a single file
python -m pytest tests/test_strategies.py -v

| Test file | Scope | Tests |
| --- | --- | --- |
| test_strategies.py | Unit — all 10 masking strategies | 77 |
| test_registry.py | Unit — PIIRegistry, FormatRegistry, ValuePatternDetector | 23 |
| test_adapters.py | Integration — Polars, Pandas, DuckDB, SQLAlchemy, XML, JSONPath | 56 |
| test_service.py | Unit — MaskingService + façade wrapper | 19 |
| test_profile.py | Unit — ProfileConfig + ColumnRuleMap | 17 |
| test_cli.py | End-to-end — real CLI via subprocess | 15 |
| Total | | 207 |

Examples

Generate sample data first

python examples/generate_sample_data.py          # creates examples/data/sample.*
python examples/generate_sample_data.py --rows 50000

Python API examples (22 examples)

python examples/run_examples.py

| # | Example | Façade feature used |
| --- | --- | --- |
| 01 | Detect PII by column name | detect_pii, report_detection |
| 02 | Detect PII by cell values | detect_pii_by_value |
| 03 | Redact explicit columns | mask_dataframe, Strategy.redact |
| 04 | Fake data with seed | mask_dataframe, make_context(seed=42) |
| 05 | Pseudonymize — consistent fakes | Strategy.pseudonymize |
| 06 | Tokenize — stable opaque tokens | Strategy.tokenize |
| 07 | Generalize — ranges and year buckets | Strategy.generalize |
| 08 | MaskFormat — preserve structural separators | Strategy.mask_format |
| 09 | Hash with salt | Strategy.hash, make_context(salt=...) |
| 10 | Partial masking — keep last 4 digits | Strategy.partial, make_context(partial_keep=4) |
| 11 | Null out sensitive columns | Strategy.null |
| 12 | Reversible AES-256-GCM mask + unmask | make_reversible_context, unmask_dataframe |
| 13 | All three standard engines | create_adapter, Engine.polars/pandas/duckdb |
| 14 | SQLAlchemy — mask a live SQLite table | create_sql_adapter |
| 15 | XML adapter — XPath-based masking | create_xml_adapter |
| 16 | JSONPath adapter — nested JSON masking | create_jsonpath_adapter |
| 17 | ColumnRuleMap — per-column strategy map | ColumnRuleMap |
| 18 | ProfileConfig from dict | ProfileConfig.from_dict |
| 19 | ProfileConfig from YAML file | ProfileConfig.from_yaml, profile.to_yaml |
| 20 | Pipe simulation — BytesIO in-memory | load_data(buf, FileFormat.csv) |
| 21 | Dry run + masking report | mask_dataframe(dry_run=True), report_masking |
| 22 | Multi-strategy pipeline on one adapter | Multiple mask_dataframe passes |

Contributing

  1. Fork the repo and create a feature branch.
  2. Add or update tests in tests/ — run python -m pytest before pushing.
  3. To register a new PII type, add a PIIType(...) entry to PIIRegistry._types — no other file needs to change.
  4. To add a new masking strategy, subclass BaseMaskingStrategy, implement _apply(), register it in StrategyFactory, and add the enum value to Strategy.
  5. To add a new engine, subclass BaseDataFrameAdapter, implement all required methods, and register it in AdapterFactory and the Engine enum.
  6. All public Python API additions go through facade.py — internal classes are not part of the public surface.
  7. New optional adapters (SQLAlchemy, XML, JSONPath) live in config/ or adapters/ and are imported lazily inside their factory functions so the core package has no extra hard dependencies.
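
The lazy-import convention in item 7 might look like the following sketch. `load_optional` and `create_jsonpath_adapter_sketch` are hypothetical helpers; only the general technique (importing inside the factory and raising a helpful error) is taken from the text.

```python
import importlib


def load_optional(module_name: str, install_hint: str):
    """Import an optional dependency on first use, with an actionable error."""
    try:
        return importlib.import_module(module_name)
    except ImportError as exc:
        raise RuntimeError(
            f"{module_name} is required for this adapter: pip install {install_hint}"
        ) from exc


def create_jsonpath_adapter_sketch(paths: dict):
    """Hypothetical factory: the import happens here, not at package import time."""
    jsonpath_ng = load_optional("jsonpath_ng", "jsonpath-ng")
    return {name: jsonpath_ng.parse(expr) for name, expr in paths.items()}


# The stdlib json module stands in for an installed optional dependency.
json_mod = load_optional("json", "n/a")
print(json_mod.dumps({"ok": True}))
```

Users who never call the factory never need the extra package, which keeps the core install free of hard dependencies on optional adapters.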

License

MIT — see LICENSE for full text.
