Static anonymizer for plain SQL dumps (PostgreSQL, SQLite, SQL Server).
Dumpling
Sanitize SQL dumps before they go anywhere.
Turn huge pg_dump / SQLite / SQL Server exports into shareable, test-friendly snapshots — no DB connection, no secrets left by accident.
Disclaimer: This project is entirely vibe-coded, but with strong human guidance, review, and attention to quality and safety.
Dumpling reads plain-text SQL dumps (PostgreSQL pg_dump, SQLite .dump, SQL Server / MSSQL scripts) and rewrites sensitive columns using rules you define in TOML. Everything runs offline on files — ideal for CI, staging share-outs, and compliance-minded workflows.
Why Dumpling?
- Offline by design — works on dump files only; nothing connects to your database.
- Streams giant files — line-by-line processing keeps multi‑GB dumps reasonable on modest hardware.
- Fails loud, not silent — missing config exits non-zero and lists where Dumpling looked; use `--allow-noop` only when you mean it.
- Stable pseudonyms — optional domain mappings keep the same source value as the same fake value across tables (foreign keys stay consistent).
- Pipeline-ready — `--check`, strict coverage, JSON reports, and residual PII scans fit pre-merge gates and release automation.
- Configure once — `.dumplingconf` or `[tool.dumpling]` in `pyproject.toml`; install via Rust (`cargo`) or `pip install dumpling-cli`.
Install
Rust (from source)
cargo build --release
./target/release/dumpling --help
Python / pip (dumpling-cli)
Dumpling is also published as a pip-installable CLI package:
pip install dumpling-cli
Or install from local source (requires maturin as PEP 517 backend):
pip install .
After install the CLI command is the same:
dumpling --help
Usage
dumpling -i dump.sql -o sanitized.sql # read from file, write to file
dumpling -i dump.sql --in-place # overwrite the input file (atomic swap)
cat dump.sql | dumpling > sanitized.sql # stream from stdin to stdout
dumpling -i dump.sql -c .dumplingconf # use explicit config path
dumpling --check -i dump.sql # exit 1 if changes would occur, no output
dumpling --stats -i dump.sql -o out.sql # print summary to stderr
dumpling --report report.json -i dump.sql # write detailed JSON report of changes/drops
dumpling --strict-coverage --report report.json -i dump.sql --check # fail on uncovered sensitive columns
dumpling --scan-output --report report.json -i dump.sql # scan transformed output for residual PII-like patterns
dumpling --scan-output --fail-on-findings --report report.json -i dump.sql --check # fail if scan thresholds are exceeded
dumpling --include-table '^public\.' -i dump.sql -o out.sql
dumpling --exclude-table '^audit\.' -i dump.sql -o out.sql
dumpling --allow-ext dmp -i data.dmp # restrict processing to specific extensions
dumpling --allow-noop -i dump.sql -o out.sql # explicitly allow no-op when config is missing
dumpling --format sqlite -i data.db.sql -o out.sql # process a SQLite .dump file
dumpling --format mssql -i backup.sql -o out.sql # process a SQL Server plain-SQL dump
dumpling --security-profile hardened -i dump.sql -o sanitized.sql # hardened CSPRNG + HMAC mode
dumpling lint-policy # lint the anonymization policy config
dumpling lint-policy --config .dumplingconf # lint with explicit config path
Configuration is loaded in this order:
1. `--config <path>` if provided
2. `.dumplingconf` in the current directory
3. `pyproject.toml` `[tool.dumpling]` section
If no configuration is found, Dumpling fails closed by default and exits non-zero.
The error output lists every checked location. Use --allow-noop to explicitly
permit no-op behavior.
Dump seal (always on)
Every successful run that writes output prefixes the stream with a single-line SQL comment:
-- dumpling-seal: v=2 version=<semver> profile=<standard|hardened> sha256=<64 hex chars>
The sha256 is over canonical JSON that includes the Dumpling version, the active security profile, a stable encoding of the resolved policy (rules, row filters, column cases, sensitive columns, output scan, global salt), and runtime options that affect transforms: --format, sorted --include-table / --exclude-table patterns, and the effective --seed / DUMPLING_SEED value in standard profile (null in hardened, where seeds are ignored).
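The exact field layout Dumpling hashes is internal, but the "canonical JSON" idea is easy to illustrate: serialize with sorted keys and fixed separators so the digest depends only on the resolved policy, not on key ordering. A minimal sketch (field names here are illustrative, not Dumpling's real schema):

```python
import hashlib
import json

def seal_digest(version, profile, policy, runtime_opts):
    """Illustrative: hash canonical JSON (sorted keys, compact separators)
    so the same resolved policy always yields the same sha256."""
    canonical = json.dumps(
        {"version": version, "profile": profile,
         "policy": policy, "runtime": runtime_opts},
        sort_keys=True, separators=(",", ":"),
    )
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()

# Same inputs -> same digest, regardless of how the dicts were built:
a = seal_digest("0.6.0", "standard",
                {"rules": {"users": ["email"]}}, {"format": "postgres"})
b = seal_digest("0.6.0", "standard",
                {"rules": {"users": ["email"]}}, {"format": "postgres"})
```

Any change to the profile, policy, or transform-affecting flags produces a different digest, which is why a stale seal line is detected and reprocessed.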
If the input already begins with a seal line and it matches the current run, Dumpling copies the rest of the file through unchanged. If the line looks like a seal but does not match (stale policy, different flags, or older v=), that line is dropped and the dump is re-processed so you do not end up with two seal lines. --strict-coverage cannot be combined with a matching seal (table definitions are not scanned in passthrough mode). --check writes no output and therefore emits no seal line.
Configuration (TOML)
Both .dumplingconf and [tool.dumpling] inside pyproject.toml use the same schema:
# Optional global salt for strategies that support it (e.g. hash)
# Prefer env-backed secret references over plaintext.
salt = "${DUMPLING_GLOBAL_SALT}"
# Rules are keyed by either "table" or "schema.table"
[rules."public.users"]
email = { strategy = "email", domain = "customer_identity", unique_within_domain = true }
name = { strategy = "name", locale = "de_de" } # German-locale name
ssn = { strategy = "hash", salt = "${env:DUMPLING_USERS_SSN_SALT}", as_string = true } # SHA-256 of original (salted)
age = { strategy = "int_range", min = 18, max = 90 }
[rules."orders"]
credit_card = { strategy = "redact", as_string = true }
# Optional explicit sensitive columns policy list (for strict coverage)
[sensitive_columns]
"public.users" = ["employee_number", "tax_id"]
[output_scan]
# optional allowlist; if omitted, all built-in categories are enabled
enabled_categories = ["email", "ssn", "pan", "token"]
default_threshold = 0
default_severity = "high"
fail_on_severity = "low"
sample_limit_per_category = 5
[output_scan.thresholds]
email = 0
ssn = 0
pan = 0
token = 0
[output_scan.severities]
email = "medium"
ssn = "high"
pan = "critical"
token = "high"
Anonymization strategies
| Strategy | Description |
|---|---|
| `null` | Set field to SQL `NULL` |
| `redact` | Replace with `REDACTED` (string) |
| `uuid` | Random UUIDv4-like string |
| `hash` | SHA-256 hex of original value; supports per-column salt and global salt |
| `email` | Safe email address (same generator as `faker = "internet::SafeEmail"`); supports `locale` |
| `name` | Full name (same as `faker = "name::Name"`); supports `locale` |
| `first_name` | First name (same as `faker = "name::FirstName"`); supports `locale` |
| `last_name` | Last name (same as `faker = "name::LastName"`); supports `locale` |
| `phone` | Locale-aware fake phone number (configurable via `locale`); defaults to English format |
| `faker` | Values from the Rust `fake` crate (docs.rs, faker modules), chosen by a string identifier only (`faker = "module::Type"`, e.g. `internet::SafeEmail`). Config is data only: nothing from TOML is compiled or executed as Rust at runtime. Use `locale` for locale-aware generators; optional `min`/`max`, `length`, `format` as documented. Unsupported targets fail at config load. New generators require a new Dumpling release (or your own fork), not config-side code. |
| `int_range` | Random integer in `[min, max]` |
| `string` | Random alphanumeric string (`length = 12` by default) |
| `date_fuzz` | Shifts a date by a random number of days in `[min_days, max_days]` (defaults: -30..30) |
| `time_fuzz` | Shifts a time-of-day by a random number of seconds in `[min_seconds, max_seconds]` with 24h wraparound (defaults: -300..300) |
| `datetime_fuzz` | Shifts a timestamp/timestamptz by a random number of seconds in `[min_seconds, max_seconds]` (defaults: -86400..86400) |
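The fuzz strategies are simple bounded random shifts. A rough sketch of the `date_fuzz` and `time_fuzz` behavior described above (this is an illustration of the semantics, not Dumpling's implementation):

```python
import random
from datetime import date, timedelta

def date_fuzz(value, min_days=-30, max_days=30, rng=random):
    # Shift a date by a random day offset within [min_days, max_days].
    return value + timedelta(days=rng.randint(min_days, max_days))

def time_fuzz_seconds(seconds_of_day, min_seconds=-300, max_seconds=300, rng=random):
    # Shift a time-of-day by a random second offset, wrapping at midnight
    # (24h = 86400 seconds), matching the documented wraparound behavior.
    return (seconds_of_day + rng.randint(min_seconds, max_seconds)) % 86400

d = date_fuzz(date(2024, 6, 15))
```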
faker reference (upstream `fake` crate): Dumpling’s `faker = "module::Type"` strings mirror the Rust `fake` crate’s `faker` module layout. Use these when picking or extending generators:
- docs.rs — `fake` crate root (overview, `Fake`/`Dummy` traits, locales)
- docs.rs — `fake::faker` module index (per-domain submodules: `address`, `internet`, `name`, …)
- GitHub — `cksac/fake-rs` (source, README with the CLI’s generator name list)
Secret references
Dumpling resolves secret references in string config fields so plaintext salts/keys never need to be committed to version control.
| Syntax | Description |
|---|---|
| `${ENV_VAR}` | Value of environment variable `ENV_VAR` |
| `${env:ENV_VAR}` | Value of environment variable `ENV_VAR` (explicit provider prefix) |
| `${file:/path/to/secret}` | Contents of a file (trailing newlines stripped); works with Docker Swarm secrets, Kubernetes mounted secrets, and Vault Agent injected files |
- Missing env references and unreadable/empty files fail fast with a non-zero startup error that includes the config path.
- Plaintext `salt` values still work for backwards compatibility, but Dumpling prints a startup warning because plaintext secrets are insecure.
# .dumplingconf — keep salts out of source control
salt = "${DUMPLING_GLOBAL_SALT}"
[rules."public.users"]
ssn = { strategy = "hash", salt = "${env:DUMPLING_USERS_SSN_SALT}" }
email = { strategy = "hash", salt = "${file:/run/secrets/dumpling_email_salt}" }
# Local dev
export DUMPLING_GLOBAL_SALT='local-dev-salt'
export DUMPLING_USERS_SSN_SALT='users-ssn-salt'
dumpling --input dump.sql --check
# CI (injected from your secret store)
export DUMPLING_GLOBAL_SALT="$CI_DUMPLING_GLOBAL_SALT"
export DUMPLING_USERS_SSN_SALT="$CI_DUMPLING_USERS_SSN_SALT"
dumpling --input dump.sql --check --strict-coverage --report coverage.json
# Docker / Kubernetes (file-mounted secrets)
# salt = "${file:/run/secrets/dumpling_hmac_key}" in .dumplingconf
# secret mounted at /run/secrets/dumpling_hmac_key by the orchestrator
dumpling --security-profile hardened --input dump.sql --check
Common column options
- `as_string`: if true, forces the anonymized value to be rendered as a quoted SQL string literal. By default Dumpling preserves the original quoting where possible.
- `domain`: deterministic mapping domain. When set, the same source value always maps to the same pseudonym inside that domain (across tables/columns). SQL `NULL` inputs are always preserved as `NULL` — a null FK reference has no source value to map, so no pseudonym is fabricated.
- `unique_within_domain`: when true, different source values are assigned unique pseudonyms within the configured `domain`. NULL values are unaffected and always remain NULL.
- `min_days` / `max_days`: used by `date_fuzz`.
- `min_seconds` / `max_seconds`: used by `time_fuzz` and `datetime_fuzz`.
- `locale`: selects the language/regional format for `email`, `name`, `first_name`, `last_name`, `faker`, and `phone`. Supported values: `en`, `fr_fr`, `de_de`, `it_it`, `pt_br`, `pt_pt`, `ar_sa`, `zh_cn`, `zh_tw`, `ja_jp`, `cy_gb`. Defaults to `en` when not specified.
- `faker`: required when `strategy = "faker"`. A plain string `"module::Type"` (case-insensitive) that maps to a built-in generator compiled into Dumpling — not arbitrary Rust or expressions. Names follow `fake::faker` (e.g. `internet::SafeEmail` → `faker::internet::SafeEmail` in the crate).
- `format`: used with `faker = "number::NumberWithFormat"`; the pattern uses `#` (0–9) and `^` (1–9) per the `fake` crate docs.
Note: `table_options` are no longer supported; use explicit `rules` and optional `column_cases`.
Strict coverage
--strict-coverage enforces that all detected sensitive columns have an explicit anonymization rule.
Sensitive columns are detected via:
- Built-in column-name heuristics (the same patterns used by auto-detection).
- Explicit lists under `[sensitive_columns]`.
A column is considered covered only when it has an explicit rules entry or at least one column_cases entry. When strict coverage fails, Dumpling exits non-zero and reports the uncovered columns.
Coverage reporting
When --report <file> is used, the JSON output includes:
- `sensitive_columns_detected`
- `sensitive_columns_covered`
- `sensitive_columns_uncovered`
- `deterministic_mapping_domains` (columns configured with deterministic domain mapping)
- `output_scan` (when `--scan-output` is enabled), including category counts and sample locations
CI gate pattern
dumpling --input dump.sql --check --strict-coverage --report coverage.json
This command exits non-zero if:
- Data changes/drops are detected (`--check` semantics), or
- Strict coverage finds uncovered sensitive columns.
Residual PII scan
dumpling \
--input dump.sql \
--check \
--scan-output \
--fail-on-findings \
--report scan-report.json
--scan-output scans the transformed output for built-in detector categories:
- `email`: email-address-like strings
- `ssn`: U.S. SSN-like values
- `pan`: payment-card-like numbers (Luhn validated)
- `token`: common secret/token formats (JWT, AWS access key IDs, GitHub PAT prefixes, etc.)
When --fail-on-findings is set, Dumpling exits non-zero if any configured category exceeds its threshold and meets the configured severity gate.
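The `pan` category's Luhn validation is what keeps arbitrary 16-digit numbers from flooding the report with false positives. The standard Luhn checksum looks like this (a generic sketch of the algorithm, not Dumpling's detector code):

```python
def luhn_valid(digits: str) -> bool:
    """Luhn checksum: doubles every second digit from the right,
    subtracts 9 from doubles above 9, and checks the sum mod 10."""
    total = 0
    for i, ch in enumerate(reversed(digits)):
        d = int(ch)
        if i % 2 == 1:
            d *= 2
            if d > 9:
                d -= 9
        total += d
    return total % 10 == 0
```

For example, the well-known test number `4111111111111111` passes the check, while changing its last digit makes it fail.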
Input format
Dumpling processes plain-text SQL dump files from multiple sources. Use --format to select the dialect (default: postgres).
PostgreSQL (--format postgres)
Produced by pg_dump --format=plain. Handles:
- `INSERT INTO schema.table (col1, col2, ...) VALUES (...), (...), ...;`
- `COPY schema.table (col1, col2, ...) FROM stdin; ... \.` (tab-delimited with `\N` as NULL)
- `"double-quoted"` identifiers
- `''`-escaped string literals
Binary, custom, and directory formats from pg_dump are not parsed directly — Dumpling’s SQL pipeline expects plain text. Use either:
- `pg_dump --format=plain` when you control capture, or
- `dumpling --dump-decode` with `--input` set to a custom-format (`.dump`) or directory-format folder: Dumpling runs `pg_restore -f -` and streams the resulting SQL (same as a manual `pg_restore` “script” output, no database required). Requires PostgreSQL client tools on `PATH` (`pg_restore`), or set `--pg-restore-path`. Use `--dump-decode-arg` to pass extra flags (e.g. `--no-owner --no-acl`). By default the archive is removed after a fully successful run; pass `--dump-decode-keep-input` to retain it. `--check` requires `--dump-decode-keep-input` so the archive still exists if changes would be detected.
Example (e.g. after heroku pg:backups:download):
dumpling --dump-decode -i latest.dump -c .dumplingconf -o anonymized.sql
SQLite (--format sqlite)
Produced by the SQLite CLI .dump command or equivalent. Handles:
- Standard `INSERT INTO table (col1, ...) VALUES (...);`
- `INSERT OR REPLACE INTO table (...) VALUES (...);`
- `INSERT OR IGNORE INTO table (...) VALUES (...);`
- `"double-quoted"` identifiers
- `''`-escaped string literals
The OR REPLACE / OR IGNORE variant keyword is preserved verbatim in the output.
SQL Server / MSSQL (--format mssql)
Produced by SSMS "Script Table as → INSERT To", mssql-scripter, or similar tools. Handles:
- `INSERT INTO [schema].[table] ([col1], [col2], ...) VALUES (...), ...;`
- `[bracket]`-quoted identifiers (stripped to unquoted names in output)
- `N'...'` Unicode string literals (the `N` prefix is transparently discarded; the value is preserved)
- `nvarchar(n)` and `nchar(n)` column-length declarations (used to truncate generated values)
- `''`-escaped string literals
Row filtering
You can retain or delete rows for specific tables using explicit predicate lists.
- If `retain` is non-empty, a row is kept only if it matches at least one predicate.
- Regardless of `retain`, a row is dropped if it matches any predicate in `delete`.
Supported predicate operators:
| Operator | Description |
|---|---|
| `eq` / `neq` | String compare (case-insensitive if `case_insensitive = true`) |
| `in` / `not_in` | List of values (string compare) |
| `like` / `ilike` | SQL-like patterns (`%` and `_`) |
| `regex` / `iregex` | Rust regex (`iregex` is case-insensitive) |
| `lt` / `lte` / `gt` / `gte` | Numeric compare (values parsed as numbers) |
| `is_null` / `not_null` | No value needed |
Predicates can target nested JSON values using dot notation (payload.profile.tier) or Django-style notation (payload__profile__tier). For JSON arrays, path segments are evaluated against each element, so list-of-dicts structures can be matched naturally.
JSON path list targeting
JSON list/array traversal is automatic once a path segment resolves to an array.
- All elements in an array: use the next field name directly.
  - `payload.items.kind` or `payload__items__kind` matches/rewrites `kind` for every object in `items`.
- Specific array index: use a numeric segment.
  - `payload.items.0.kind` or `payload__items__0__kind` targets only the first element.
- Nested arrays: combine field and index segments as needed.
  - `payload.groups.members.email`
  - `payload.groups.1.members.0.email`
This path behavior is shared by both row_filters predicates and JSON-path anonymization rules in [rules].
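The traversal rules above can be sketched as a small recursive walk. This is an illustrative model of the documented path semantics (dot and dunder notation, array fan-out, numeric indices), not Dumpling's parser:

```python
def walk(value, path):
    """Yield all values matched by a dot/dunder path.
    Non-numeric segments fan out over array elements;
    numeric segments index a specific element."""
    segments = path.replace("__", ".").split(".")

    def rec(node, segs):
        if not segs:
            yield node
            return
        head, rest = segs[0], segs[1:]
        if isinstance(node, list):
            if head.isdigit():  # explicit array index
                if int(head) < len(node):
                    yield from rec(node[int(head)], rest)
            else:  # fan out: apply the same segments to every element
                for item in node:
                    yield from rec(item, segs)
        elif isinstance(node, dict) and head in node:
            yield from rec(node[head], rest)

    return list(rec(value, segments))

doc = {"items": [{"kind": "a"}, {"kind": "b"}]}
```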
[row_filters."public.users"]
retain = [
{ column = "country", op = "eq", value = "US" },
{ column = "email", op = "ilike", value = "%@myco.com" },
{ column = "profile.flags.plan", op = "eq", value = "gold" }
]
delete = [
{ column = "is_admin", op = "eq", value = "true" },
{ column = "email", op = "ilike", value = "%@example.com" },
{ column = "devices__platform", op = "eq", value = "android" }
]
Row filtering works for both INSERT ... VALUES (...) and COPY ... FROM stdin rows.
Conditional per-column cases
Define default strategies in rules."<table>" and add ordered per-column cases in column_cases."<table>"."<column>". For each row and column, Dumpling applies the first matching case; if none match, it falls back to the default from rules.
[rules."public.users"]
email = { strategy = "hash", as_string = true } # default
name = { strategy = "name" }
[[column_cases."public.users".email]]
when.any = [{ column = "is_admin", op = "eq", value = "true" }]
strategy = { strategy = "redact", as_string = true }
[[column_cases."public.users".email]]
when.any = [{ column = "country", op = "in", values = ["DE","FR","GB"] }]
strategy = { strategy = "hash", salt = "eu-salt", as_string = true }
- `when.any` is OR, `when.all` is AND; you can use either or both. If both are empty, the case matches unconditionally.
- First-match-wins per column; there is no merge or fallthrough.
- Row filtering (`row_filters`) is evaluated before cases; deleted rows are not transformed.
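The first-match-wins evaluation can be modeled in a few lines. This sketch supports only `eq`-style predicates to keep it short; the real tool supports the full operator table from the row-filtering section:

```python
def pick_strategy(row, cases, default):
    """First matching case wins; otherwise fall back to the default rule.
    when_any is OR, when_all is AND; empty predicate lists match everything."""
    def matches(pred):
        # Simplified: case-insensitive string equality only.
        return str(row.get(pred["column"], "")).lower() == str(pred["value"]).lower()

    for case in cases:
        any_ok = not case.get("when_any") or any(matches(p) for p in case["when_any"])
        all_ok = all(matches(p) for p in case.get("when_all", []))
        if any_ok and all_ok:
            return case["strategy"]
    return default

# Mirrors the TOML example: admins are redacted before the EU hash case is tried.
cases = [
    {"when_any": [{"column": "is_admin", "value": "true"}], "strategy": "redact"},
    {"when_any": [{"column": "country", "value": "DE"}], "strategy": "hash"},
]
```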
Hardened security profile
For adversarial risk environments — where an internal or external actor may have partial auxiliary data — use --security-profile hardened:
dumpling --security-profile hardened -i dump.sql -o sanitized.sql
What changes in hardened mode
| Aspect | Standard | Hardened |
|---|---|---|
| Random generation | xorshift64* seeded from system time | OS CSPRNG (`getrandom`) — non-predictable |
| `hash` strategy | SHA-256(salt \|\| input) | HMAC-SHA-256(key=salt, data=input) |
| Deterministic domain byte stream | SHA-256 CTR-mode | HMAC-SHA-256 CTR-mode |
| Report `security_profile` field | `"standard"` | `"hardened"` |
| `--seed` / `DUMPLING_SEED` | Seeds the PRNG | Ignored (warning emitted) |
Why this matters
- Non-predictable output: xorshift64* is seeded from system time, which is guessable. The OS CSPRNG cannot be predicted from timing alone.
- Proper keyed hashing: `SHA-256(key || data)` is vulnerable to length-extension attacks and weak as a MAC. HMAC-SHA-256 uses the salt as a genuine cryptographic key, providing provable PRF security.
- Domain separation: the HMAC construction ensures outputs derived from one salt/key cannot be confused with another.
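The two constructions differ only in how the key is mixed in, but the security properties diverge sharply. A side-by-side sketch using Python's standard library (illustrating the primitives, not Dumpling's internals):

```python
import hashlib
import hmac

salt, value = b"per-env-secret", b"123-45-6789"

# Standard profile: plain salted hash. As a MAC this is length-extension-prone:
# knowing SHA-256(salt || value) lets an attacker extend the message.
standard = hashlib.sha256(salt + value).hexdigest()

# Hardened profile: HMAC treats the salt as a real cryptographic key,
# giving PRF security and immunity to length extension.
hardened = hmac.new(salt, value, hashlib.sha256).hexdigest()
```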
Key management guidance
Configure a per-environment secret via an env-backed reference to prevent key leakage:
# .dumplingconf
salt = "${DUMPLING_HMAC_KEY}"
[rules."public.users"]
ssn = { strategy = "hash", as_string = true }
email = { strategy = "email", domain = "users" }
export DUMPLING_HMAC_KEY="$(openssl rand -base64 32)"
dumpling --security-profile hardened -i dump.sql -o sanitized.sql
Key rotation: Changing DUMPLING_HMAC_KEY will produce entirely different pseudonyms for all salted/domain-mapped columns. If you rely on referential consistency across separately-processed dumps (e.g., snapshots over time), keep the same key or re-anonymize all related dumps together. Rotate keys when:
- A key may have been compromised.
- You intentionally want to break prior referential linkability.
Report metadata
The JSON report always includes the active security profile:
{
"security_profile": "hardened",
"total_rows_processed": 1000,
...
}
Policy linting
The lint-policy subcommand statically analyses your configuration and flags common issues before they affect a production pipeline.
dumpling lint-policy # auto-discover config
dumpling lint-policy --config .dumplingconf # explicit config path
| Check | Severity | Description |
|---|---|---|
| `empty-rules-table` | warning | A `[rules]` entry has no column rules |
| `empty-column-cases-table` | warning | A `[column_cases]` entry has no column cases |
| `unsalted-hash` | warning | `hash` strategy used without any salt — reversible for low-entropy inputs |
| `inconsistent-domain-strategy` | error | Same `domain` name used with different strategies — breaks referential integrity |
| `uncovered-sensitive-column` | error | A column in `[sensitive_columns]` has no matching rule or case |
Exits 0 if no violations are found, 1 if any violations exist. Plug it into CI as a pre-merge gate:
- run: ./target/release/dumpling lint-policy
See the CI guardrails documentation for full pipeline recipes including strict-coverage enforcement, residual PII scan gating, and report diffing.
Notes
- This is a streaming transformer; memory usage stays small even for large dumps.
- For CI/CD and production-like workflows, prefer the default fail-closed mode and avoid `--allow-noop` unless a no-op run is intentional.
- For best results, configure strategies compatible with column data types. If you hash an integer column, Dumpling will render a string; most databases can coerce this, but explicit `as_string = false` may help in some cases.
- For length-restricted text columns (`varchar(n)`, `character varying(n)`, `char(n)`, `character(n)`), Dumpling reads `CREATE TABLE` definitions and truncates generated text values to fit within the declared limit.
- Deterministic anonymization for tests: pass `--seed <u64>` or set env `DUMPLING_SEED` to make fuzz strategies reproducible across runs. Note: `--seed` has no effect in `--security-profile hardened`.
- Domain mappings (`domain = "..."`) are deterministic by source value + domain (+ optional salt), so referential joins stay stable across tables within the same dump.
Full documentation
Detailed docs, including the configuration reference and release process, are available at the project's GitHub Pages site (built from docs/src/).