
Config-driven database subsetting tool that creates representative samples while preserving referential integrity. Modernized fork of Tonic's Condenser with performance and feature improvements.


Database Condenser

A config-driven database subsetting tool for PostgreSQL and MySQL, forked from Tonic's Condenser.

Some changes from the original Condenser:

  • Concurrent thread pool usage to speed up subsetting
  • Optional temp-table strategy for memory-efficient ID batching
  • Optional Postgres COPY protocol for faster transfer
  • Automatic sequence reset after subsetting
  • Built on psycopg3 and managed with Astral's uv

Subsetting data is the process of taking a representative sample of your data in a manner that preserves the integrity of your database, e.g., "give me 5% of my users". If you do this naively, e.g., by grabbing 5% of every table in your database, you will most likely break foreign key constraints. At best, you'll end up with a statistically non-representative data sample.
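To make the difference concrete, here is a toy, in-memory sketch (not the tool's implementation). The `users` and `orders` tables, the `user_id` foreign key, and all function names are made up for illustration. It samples each table independently, counts the orphaned child rows that result, and then compares that with a sample that starts from the parent table and follows the foreign key.

```python
import random

def naive_sample(rows, fraction, seed=0):
    """Pick a random fraction of rows, ignoring relationships between tables."""
    rng = random.Random(seed)
    k = max(1, round(len(rows) * fraction))
    return rng.sample(rows, k)

def orphaned(children, parents, fk="user_id"):
    """Return child rows whose foreign key points at a parent that was not kept."""
    parent_ids = {p["id"] for p in parents}
    return [c for c in children if c[fk] not in parent_ids]

def subset_with_integrity(users, orders, fraction, seed=0):
    """Sample the parent table first, then keep only the children that reference it."""
    sampled_users = naive_sample(users, fraction, seed)
    kept_ids = {u["id"] for u in sampled_users}
    sampled_orders = [o for o in orders if o["user_id"] in kept_ids]
    return sampled_users, sampled_orders

# Hypothetical data: 100 users, 300 orders, exactly three orders per user.
users = [{"id": i} for i in range(1, 101)]
orders = [{"id": n, "user_id": (n % 100) + 1} for n in range(1, 301)]

# Naive approach: 5% of every table, sampled independently.
naive_users = naive_sample(users, 0.05, seed=1)
naive_orders = naive_sample(orders, 0.05, seed=2)
print(len(orphaned(naive_orders, naive_users)), "orphaned orders in the naive subset")

# Foreign-key-aware approach: 5% of users plus all of their orders.
good_users, good_orders = subset_with_integrity(users, orders, 0.05)
print(len(orphaned(good_orders, good_users)), "orphaned orders in the FK-aware subset")
```

A real schema has many tables and dependencies in both directions, which is why the subsetting is driven by configuration rather than by a two-table filter like this one.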

One common use case is scaling a production database down to a more reasonable size so that it can be used in staging, test, and development environments. This saves costs and, combined with PII removal, can be a significant productivity boost. Another use case is copying specific rows from one database into another while maintaining referential integrity.

You can find more details about the original Condenser here and here.

Installation

Six steps to set up from source, assuming Python 3.10+:

  1. Install astral-uv

  2. Install the Postgres and/or MySQL database tools. For Postgres you need pg_dump and psql; for MySQL you need mysqldump and mysql. The tools must either be on your $PATH or be located via $POSTGRES_PATH (Postgres) or $MYSQL_PATH (MySQL).

  3. Clone this project locally.

  4. Install the project with uv sync --frozen

  5. Set up your configuration and save it in config.json. The provided config.json.example has the skeleton of what you need to provide: source and destination database connection details, as well as subsetting goals in initial_targets. Here's an example that collects 10% of a table named public.target_table:

    "initial_targets": [
        {
            "table": "public.target_table",
            "percent": 10
        }
    ]
    

    Your database may require more configuration, but simple databases should need little beyond this. See CONFIG.md for details, and config.json.example_all for every option in a single config file.

  6. Run! $ uv run subset
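The initial_targets fragment from step 5 can also be generated programmatically, for example when the target percentages come from a script. The following is a minimal sketch: the `table` and `percent` keys are taken from the example above, while the helper name `build_initial_targets` and the range check on `percent` are assumptions made for illustration. It only prints the fragment, because a complete config.json also needs the connection details described in config.json.example.

```python
import json

def build_initial_targets(targets):
    """Turn a {"schema.table": percent} mapping into an initial_targets list.

    The 0 < percent <= 100 check is an assumption of this sketch, not a
    documented rule of the tool.
    """
    result = []
    for table, percent in targets.items():
        if not 0 < percent <= 100:
            raise ValueError(f"percent for {table} must be in (0, 100], got {percent}")
        result.append({"table": table, "percent": percent})
    return result

# Reproduce the fragment from step 5: 10% of public.target_table.
fragment = {"initial_targets": build_initial_targets({"public.target_table": 10})}
print(json.dumps(fragment, indent=4))
```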

Alternatively, install the released package from PyPI with either pip or pipx:

pip install db-condenser
pipx install db-condenser
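Whichever way you install, the database client tools from step 2 must be discoverable. The sketch below is one way to check this before a run. It assumes that $POSTGRES_PATH and $MYSQL_PATH each name a directory containing the binaries; that interpretation, and the names `REQUIRED_TOOLS` and `find_tools`, are assumptions of this example rather than documented behavior.

```python
import os
import shutil

# Tool names come from the installation steps; the mapping itself is illustrative.
REQUIRED_TOOLS = {
    "postgres": ("POSTGRES_PATH", ["pg_dump", "psql"]),
    "mysql": ("MYSQL_PATH", ["mysqldump", "mysql"]),
}

def find_tools(db_type, env=None):
    """Return {tool_name: resolved_path_or_None} for the given database type.

    Assumption: the *_PATH variable, if set, is a directory that is searched
    before the regular PATH entries.
    """
    env = os.environ if env is None else env
    env_var, tools = REQUIRED_TOOLS[db_type]
    # Drop empty entries so an unset variable never adds the current directory.
    parts = [p for p in (env.get(env_var), env.get("PATH", "")) if p]
    search_path = os.pathsep.join(parts)
    return {tool: shutil.which(tool, path=search_path) for tool in tools}

for db_type in REQUIRED_TOOLS:
    for tool, location in find_tools(db_type).items():
        print(f"{db_type}: {tool} -> {location or 'NOT FOUND'}")
```

Only the tools for the database you actually use need to resolve.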

Running

Almost all configuration lives in the config.json file, so running the tool is as simple as:

uv run subset

Three command-line arguments are supported:

-v: Verbose output. Useful for performance debugging. Lists almost every query made and how long it took.

--no-constraints: For Postgres, do not add the constraints found in the source database to the destination database. This option has no effect for MySQL.

-y: Skip the confirmation prompt that appears when the destination host is not localhost or 127.0.0.1.
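When the run is scripted, for instance in a nightly job that refreshes a staging database, the flags above can be assembled in code. This is a small sketch around the documented `uv run subset` command; the function names and keyword arguments are hypothetical, and `run_subset` assumes it is called from the project directory containing config.json.

```python
import subprocess

def build_subset_command(verbose=False, no_constraints=False, assume_yes=False):
    """Build the argument list for `uv run subset` with the documented flags."""
    cmd = ["uv", "run", "subset"]
    if verbose:
        cmd.append("-v")
    if no_constraints:
        cmd.append("--no-constraints")
    if assume_yes:
        cmd.append("-y")
    return cmd

def run_subset(**flags):
    """Run the subsetter and raise CalledProcessError on a non-zero exit code."""
    return subprocess.run(build_subset_command(**flags), check=True)

# Show the command without executing it.
print(" ".join(build_subset_command(verbose=True, assume_yes=True)))
# prints: uv run subset -v -y
```

Passing the arguments as a list avoids shell quoting issues, and `check=True` makes a failed subsetting run fail the calling script as well.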
