Skip to main content

Static anonymizer for Postgres plain SQL dumps produced by pg_dump.

Project description

Dumpling

Static anonymizer for Postgres plain SQL dumps produced by pg_dump. It scans INSERT and COPY FROM stdin statements and replaces sensitive row data based on configurable rules.

Install / Build

cargo build --release
./target/release/dumpling --help

Python package build (maturin)

This repository now includes Python distribution metadata so Dumpling can be published as a pip-installable CLI package (distribution name: dumpling-cli).

# Build wheel/sdist locally
maturin build --release

# Install from local source (requires maturin as PEP 517 backend)
pip install .

After install, the CLI command remains:

dumpling --help

Project automation

  • Lint: .github/workflows/ci.yml runs cargo fmt and cargo clippy only (fast signal).
  • Test: .github/workflows/tests.yml runs cargo test --all-targets --all-features.
  • Platform compatibility (latest): .github/workflows/platform-compat-latest.yml runs cross-platform build checks on latest runner images.
  • Platform compatibility (matrix): .github/workflows/platform-compat-matrix.yml is a manual, explicit-version matrix for legacy compatibility checks over time.
  • Docs: .github/workflows/docs.yml builds this repo's mdBook docs and deploys them from main to GitHub Pages.
  • Publish: .github/workflows/publish.yml builds wheels/sdist via maturin, publishes to PyPI from tags, and supports manual TestPyPI publication.
  • Release: .github/workflows/release.yml publishes tagged releases (v*.*.*) with checksummed Linux artifacts.

Docs

mdbook build

Primary docs live under docs/src/, including the release process.

Usage

dumpling -i dump.sql -o sanitized.sql           # read from file, write to file
dumpling -i dump.sql --in-place                 # overwrite the input file (atomic swap)
cat dump.sql | dumpling > sanitized.sql         # stream from stdin to stdout
dumpling -i dump.sql -c .dumplingconf           # use explicit config path
dumpling --check -i dump.sql                    # exit 1 if changes would occur, no output
dumpling --stats -i dump.sql -o out.sql         # print summary to stderr
dumpling --report report.json -i dump.sql       # write detailed JSON report of changes/drops
dumpling --include-table '^public\\.' -i dump.sql -o out.sql
dumpling --exclude-table '^audit\\.' -i dump.sql -o out.sql
dumpling --allow-ext dmp -i data.dmp            # restrict processing to specific extensions

Configuration is loaded in this order:

  1. --config <path> if provided
  2. .dumplingconf in the current directory
  3. pyproject.toml [tool.dumpling] section

If no configuration is found, Dumpling performs a no-op transformation.

Configuration (TOML)

Both .dumplingconf and [tool.dumpling] inside pyproject.toml use the same schema:

# Optional global salt for strategies that support it (e.g. hash)
salt = "mysalt"

# Rules are keyed by either "table" or "schema.table"
[rules."public.users"]
email = { strategy = "email" }
name  = { strategy = "name" }
ssn   = { strategy = "hash", as_string = true }   # SHA-256 of original (salted)
age   = { strategy = "int_range", min = 18, max = 90 }

[rules."orders"]
credit_card = { strategy = "redact", as_string = true }

Supported strategies:

  • null: set field to SQL NULL
  • redact: replace with REDACTED (string)
  • uuid: random UUIDv4-like string
  • hash: SHA-256 hex of original value; supports per-column salt and global salt
  • email: random-looking email at example.com
  • name, first_name, last_name: simple placeholder names
  • phone: simple US-like phone number (xxx) xxx-xxxx
  • int_range: random integer in [min, max]
  • string: random alphanumeric string, length = 12 by default
  • date_fuzz: shifts a date by a random number of days in [min_days, max_days] (defaults: -30..30)
  • time_fuzz: shifts a time-of-day by a random number of seconds in [min_seconds, max_seconds] with 24h wraparound (defaults: -300..300)
  • datetime_fuzz: shifts a timestamp/timestamptz by a random number of seconds in [min_seconds, max_seconds] (defaults: -86400..86400)

Common option:

  • as_string: if true, forces the anonymized value to be rendered as a quoted SQL string literal. By default Dumpling preserves the original quoting where possible.
  • min_days/max_days: used by date_fuzz
  • min_seconds/max_seconds: used by time_fuzz and datetime_fuzz

Input format

This tool targets the plain-text SQL format from pg_dump, handling:

  • INSERT INTO schema.table (col1, col2, ...) VALUES (...), (...), ...;
  • COPY schema.table (col1, col2, ...) FROM stdin; ... \. (tab-delimited with \N as NULL)

Other pg_dump formats (custom/binary/directory) are not supported.

Row filtering (retain/delete)

You can retain or delete rows for specific tables using explicit predicate lists. Semantics:

  • If retain is non-empty, a row is kept only if it matches at least one of its predicates.
  • Regardless of retain, a row is dropped if it matches any predicate in delete.

Predicates support these operators on a column:

  • eq, neq (string compare; case-insensitive if case_insensitive = true)
  • in, not_in (list of values, string compare)
  • like, ilike (SQL-like: % and _)
  • regex, iregex (Rust regex; iregex is case-insensitive)
  • lt, lte, gt, gte (numeric compare; values parsed as numbers)
  • is_null, not_null (no value needed)

Example:

[row_filters."public.users"]
retain = [
  { column = "country", op = "eq",  value = "US" },
  { column = "email",   op = "ilike", value = "%@myco.com" }
]
delete = [
  { column = "is_admin", op = "eq", value = "true" },
  { column = "email",    op = "ilike", value = "%@example.com" }
]

Row filtering works for both INSERT ... VALUES (...) and COPY ... FROM stdin rows.

Conditional per-column cases (first-match-wins)

Define default strategies in rules."<table>" and add ordered per-column cases in column_cases."<table>"."<column>". For each row, for each column, Dumpling applies the first matching case; if none match, it uses the default from rules.

Example:

[rules."public.users"]
email = { strategy = "hash", as_string = true }   # default
name  = { strategy = "name" }

[[column_cases."public.users".email]]
when.any = [{ column = "is_admin", op = "eq", value = "true" }]
strategy = { strategy = "redact", as_string = true }

[[column_cases."public.users".email]]
when.any = [{ column = "country", op = "in", values = ["DE","FR","GB"] }]
strategy = { strategy = "hash", salt = "eu-salt", as_string = true }

Notes:

  • when.any is OR, when.all is AND; you can use either or both. If both are empty, the case matches unconditionally.
  • First-match-wins per column; there is no merge/replace or fallthrough flag.
  • Row filtering (row_filters) is evaluated before cases; deleted rows are not transformed.

Notes

  • This is a streaming transformer; memory usage stays small even for big dumps.
  • For best results, configure strategies compatible with column data types. If you hash an integer column, Dumpling will render a string which Postgres can usually coerce, but explicit as_string = false may help in some cases.
  • If you switch runtimes/branches frequently and see test DB migration issues in your project, remember you can run tests with pytest --create-db (project convention).
  • Deterministic anonymization for tests: pass --seed <u64> or set env DUMPLING_SEED to make fuzz strategies reproducible across runs.

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

dumpling_cli-0.1.0.tar.gz (33.3 kB view details)

Uploaded Source

Built Distributions

If you're not sure about the file name format, learn more about wheel file names.

dumpling_cli-0.1.0-py3-none-win_amd64.whl (1.3 MB view details)

Uploaded Python 3Windows x86-64

dumpling_cli-0.1.0-py3-none-manylinux_2_34_x86_64.whl (1.6 MB view details)

Uploaded Python 3manylinux: glibc 2.34+ x86-64

dumpling_cli-0.1.0-py3-none-macosx_11_0_arm64.whl (1.4 MB view details)

Uploaded Python 3macOS 11.0+ ARM64

File details

Details for the file dumpling_cli-0.1.0.tar.gz.

File metadata

  • Download URL: dumpling_cli-0.1.0.tar.gz
  • Upload date:
  • Size: 33.3 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.1.0 CPython/3.13.7

File hashes

Hashes for dumpling_cli-0.1.0.tar.gz
Algorithm Hash digest
SHA256 da29739d99ed896f65230ed6332119465fcef24351466c3391546ef70d98b7d6
MD5 006f95daf5b64a67404303fda45b4e23
BLAKE2b-256 4cadf1c13c463a408ddb6196e90aa79c8701c31d3a8cf9948f82683968a6fc97

See more details on using hashes here.

File details

Details for the file dumpling_cli-0.1.0-py3-none-win_amd64.whl.

File metadata

  • Download URL: dumpling_cli-0.1.0-py3-none-win_amd64.whl
  • Upload date:
  • Size: 1.3 MB
  • Tags: Python 3, Windows x86-64
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.1.0 CPython/3.13.7

File hashes

Hashes for dumpling_cli-0.1.0-py3-none-win_amd64.whl
Algorithm Hash digest
SHA256 b302bc7c509beb69e0e10c013db0190c0d5c243ea011b6ad24f82ff14deffa40
MD5 6e219606224ca38e36a2141d921bd38e
BLAKE2b-256 20f12961a1a7d6001abc8b62184fc26e707701ec006aa7457c41ef8bb6b7f0ce

See more details on using hashes here.

File details

Details for the file dumpling_cli-0.1.0-py3-none-manylinux_2_34_x86_64.whl.

File metadata

File hashes

Hashes for dumpling_cli-0.1.0-py3-none-manylinux_2_34_x86_64.whl
Algorithm Hash digest
SHA256 5858d825276ba4486ccb5d6f4cdb773075f6c14fc7935ab45671697862cbe2c3
MD5 0ca3c1153041260dfb4319839ec6ed9c
BLAKE2b-256 11a8a6d45a587db01b356fe0aeaf7dcac442c686830a1a50b4eced9c3cf69f93

See more details on using hashes here.

File details

Details for the file dumpling_cli-0.1.0-py3-none-macosx_11_0_arm64.whl.

File metadata

File hashes

Hashes for dumpling_cli-0.1.0-py3-none-macosx_11_0_arm64.whl
Algorithm Hash digest
SHA256 8368ac23afb8bd558b290a7269591cf3532fb32b853192949443b9b99f14bec6
MD5 84cd99e3032860a4e06acc12d2d70316
BLAKE2b-256 3a76068c0bba2ddad3880659b8ea72b590052545a091d81ff5a320482e752b51

See more details on using hashes here.

Supported by

AWS Cloud computing and Security Sponsor Datadog Monitoring Depot Continuous Integration Fastly CDN Google Download Analytics Pingdom Monitoring Sentry Error logging StatusPage Status page