
Config-driven database subsetting tool that creates representative samples while preserving referential integrity. Modernized fork of Tonic's Condenser with performance and feature improvements.


Database Condenser

A config-driven database subsetting tool for PostgreSQL and MySQL, forked from Tonic's Condenser.

Some changes from the original Condenser:

  • Concurrent thread pool usage to speed up subsetting
  • Optional temp-table strategy for memory-efficient ID batching
  • Optional Postgres COPY protocol for faster transfer
  • Incremental subsetting (skip schema setup and keep existing data)
  • Automatic sequence reset after subsetting
  • Built on psycopg3 and managed with Astral's uv
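As a rough illustration of the batching and thread-pool ideas in the list above, the sketch below splits a stream of IDs into fixed-size batches and hands each batch to a worker thread. It uses only the standard library. The names `batched_ids`, `copy_in_parallel`, and the `copy_batch` callback are hypothetical and are not taken from this project's code; the temp-table and COPY strategies are not shown.

```python
from concurrent.futures import ThreadPoolExecutor
from itertools import islice
from typing import Callable, Iterable, Iterator, List


def batched_ids(ids: Iterable[int], batch_size: int) -> Iterator[List[int]]:
    """Yield successive lists of at most batch_size IDs."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    iterator = iter(ids)
    while batch := list(islice(iterator, batch_size)):
        yield batch


def copy_in_parallel(
    ids: Iterable[int],
    copy_batch: Callable[[List[int]], int],
    batch_size: int = 1000,
    max_workers: int = 4,
) -> int:
    """Run copy_batch on every batch using a thread pool; return the summed results."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # Executor.map submits all batches up front, so every batch is
        # materialized before the workers finish. Fine for a sketch.
        return sum(pool.map(copy_batch, batched_ids(ids, batch_size)))


if __name__ == "__main__":
    # Hypothetical stand-in for "read these rows from source, write to destination".
    def fake_copy(batch: List[int]) -> int:
        return len(batch)

    print(copy_in_parallel(range(2500), fake_copy))
```

Threads suit this kind of work because each batch spends most of its time waiting on the database rather than running Python code.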

Subsetting is the process of taking a representative sample of your data in a way that preserves the integrity of your database, e.g., "give me 5% of my users." If you do this naively, e.g., by grabbing 5% of the rows from every table, your database will most likely violate foreign key constraints. At best, you'll end up with a statistically non-representative sample.
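To make the foreign-key problem concrete, here is a small self-contained sketch using Python's built-in sqlite3 module. The `users` and `orders` tables are invented for the example, and a fixed `LIMIT` stands in for random sampling so the result is reproducible. This is not how the tool itself works; it only contrasts sampling each table independently with following the foreign key from the sampled users.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE users (id INTEGER PRIMARY KEY);
    CREATE TABLE orders (
        id INTEGER PRIMARY KEY,
        user_id INTEGER NOT NULL REFERENCES users(id)
    );
    """
)
# 100 users and 200 orders; order n belongs to user ((n - 1) % 100) + 1.
conn.executemany("INSERT INTO users VALUES (?)", [(i,) for i in range(1, 101)])
conn.executemany(
    "INSERT INTO orders VALUES (?, ?)",
    [(n, (n - 1) % 100 + 1) for n in range(1, 201)],
)


def orphan_count(user_ids, order_rows):
    """Count orders whose user_id is missing from the sampled users."""
    kept = set(user_ids)
    return sum(1 for _, user_id in order_rows if user_id not in kept)


# 5% of users: the first 5 rows.
user_ids = [row[0] for row in conn.execute("SELECT id FROM users ORDER BY id LIMIT 5")]

# Naive: take 5% of orders independently of which users were kept.
naive_orders = conn.execute(
    "SELECT id, user_id FROM orders ORDER BY id LIMIT 10"
).fetchall()

# FK-aware: take only the orders that belong to the sampled users.
placeholders = ",".join("?" * len(user_ids))
aware_orders = conn.execute(
    f"SELECT id, user_id FROM orders WHERE user_id IN ({placeholders}) ORDER BY id",
    user_ids,
).fetchall()

naive_orphans = orphan_count(user_ids, naive_orders)
aware_orphans = orphan_count(user_ids, aware_orders)
print("naive orphans:", naive_orphans)
print("fk-aware orphans:", aware_orphans)
```

In the naive sample, half of the selected orders point at users that were not copied, so loading it into a database with the constraint enabled would fail.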

One common use-case is to scale down a production database to a more reasonable size so that it can be used in staging, test, and development environments. This can be done to save costs and, when used in tandem with PII removal, can be quite powerful as a productivity enhancer. Another example is copying specific rows from one database and placing them into another while maintaining referential integrity.

You can find more details about the original Condenser here and here.

Installation

Six steps to set up from source, assuming Python 3.10+:

  1. Install astral-uv

  2. Install the Postgres and/or MySQL client tools. For Postgres, pg_dump and psql are required; they must be on your $PATH, or you can point to them with $POSTGRES_PATH. For MySQL, mysqldump and mysql are required; they must be on your $PATH, or you can point to them with $MYSQL_PATH.

  3. Clone this project locally.

  4. Install the project with uv sync --frozen

  5. Set up your configuration and save it in config.json. The provided config.json.example has the skeleton of what you need to provide: source and destination database connection details, as well as subsetting goals in initial_targets. Here's an example that collects 10% of a table named public.target_table:

    "initial_targets": [
        {
            "table": "public.target_table",
            "percent": 10
        }
    ]
    

    Your database may require more configuration, but simple databases should need little beyond this. See CONFIG.md for details, and config.json.example_all for every option in a single config file.

  6. Run! $ uv run subset
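Before running, it can help to sanity-check steps 2 and 5. The sketch below is a hypothetical pre-flight script, not part of this project. It assumes that $POSTGRES_PATH names a directory containing the tools (the exact meaning is not spelled out above), and it only checks the two keys shown in the example, `table` and `percent`. Other keys may be valid; see CONFIG.md.

```python
import json
import os
import shutil
from pathlib import Path
from typing import List, Optional


def find_tool(name: str, env_var: str) -> Optional[str]:
    """Look for an executable in the directory named by env_var, then on $PATH."""
    directory = os.environ.get(env_var)
    if directory:
        found = shutil.which(name, path=directory)
        if found:
            return found
    return shutil.which(name)


def check_targets(config: dict) -> List[str]:
    """Return a list of problems found in config["initial_targets"]."""
    targets = config.get("initial_targets")
    if not isinstance(targets, list) or not targets:
        return ["initial_targets must be a non-empty list"]
    problems = []
    for index, target in enumerate(targets):
        if not isinstance(target, dict) or "table" not in target:
            problems.append(f"target {index}: missing 'table'")
            continue
        percent = target.get("percent")
        if percent is None:
            continue  # Assumption: a target may be defined some other way.
        if not isinstance(percent, (int, float)) or not 0 < percent <= 100:
            problems.append(f"target {index}: percent must be in (0, 100]")
    return problems


if __name__ == "__main__":
    for tool in ("pg_dump", "psql"):
        print(tool, "->", find_tool(tool, "POSTGRES_PATH") or "not found")
    config_path = Path("config.json")
    if config_path.exists():
        problems = check_targets(json.loads(config_path.read_text()))
        print(problems or "initial_targets looks fine")
```

For MySQL, the same `find_tool` call would be made with `mysqldump`, `mysql`, and `MYSQL_PATH`.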

Or install from PyPI with either pip or pipx:

pip install db-condenser
pipx install db-condenser

Running

Almost all configuration lives in config.json, so running the tool is as simple as:

uv run subset

Four command-line arguments are supported:

-v: Verbose output. Useful for performance debugging. Lists almost every query made, and its speed.

--no-constraints: For Postgres, do not add the constraints found in the source database to the destination database. This option has no effect for MySQL.

-y: Skip the confirmation prompt that appears when the destination host is not localhost or 127.0.0.1.

--config <file>: Use a custom JSON config file instead of the default config.json.
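The four flags above could be modeled with argparse roughly as follows. This is a sketch of the documented behavior, not the project's actual parser. The attribute names (`verbose`, `no_constraints`, `assume_yes`) and the `needs_confirmation` helper are made up for illustration.

```python
import argparse

# Hosts that the -y description treats as safe destinations.
LOCAL_HOSTS = {"localhost", "127.0.0.1"}


def build_parser() -> argparse.ArgumentParser:
    """Build a parser that mirrors the four documented flags."""
    parser = argparse.ArgumentParser(prog="subset")
    parser.add_argument("-v", dest="verbose", action="store_true",
                        help="verbose output, including queries and timings")
    parser.add_argument("--no-constraints", dest="no_constraints", action="store_true",
                        help="Postgres only: do not copy constraints to the destination")
    parser.add_argument("-y", dest="assume_yes", action="store_true",
                        help="skip the confirmation for non-local destinations")
    parser.add_argument("--config", default="config.json", metavar="<file>",
                        help="JSON config file to use")
    return parser


def needs_confirmation(destination_host: str, assume_yes: bool) -> bool:
    """Prompt only when the destination is not local and -y was not given."""
    return destination_host not in LOCAL_HOSTS and not assume_yes


if __name__ == "__main__":
    args = build_parser().parse_args(["-v", "--config", "staging.json"])
    print(args.config, args.verbose, args.assume_yes)
    print(needs_confirmation("db.example.com", args.assume_yes))
```

A non-local destination without -y is the only combination that triggers a prompt, which guards against accidentally subsetting into a shared or production host.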


