
Config-driven database subsetting tool that creates representative samples while preserving referential integrity. Modernized fork of Tonic's Condenser with performance and feature improvements.


Database Condenser

A config-driven database subsetting tool for PostgreSQL and MySQL, forked from Tonic's Condenser.

Some changes from the original Condenser:

  • Concurrent thread pool usage to speed up subsetting
  • Optional temp-table strategy for memory-efficient ID batching
  • Optional Postgres COPY protocol for faster transfer
  • Incremental subsetting (skip schema setup and keep existing data)
  • Automatic sequence reset after subsetting
  • Built on psycopg3 and managed with Astral's uv

Subsetting data is the process of taking a representative sample of your data in a manner that preserves the integrity of your database, e.g., give me 5% of my users. If you do this naively, e.g., by grabbing 5% of the rows from every table in your database, you will most likely break foreign key constraints. At best, you’ll end up with a statistically non-representative data sample.
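To make the foreign key problem concrete, here is a minimal sketch using Python's built-in sqlite3 module. It is not how this tool is implemented; the users and orders tables, the deterministic 5% filters, and all function names are hypothetical and exist only for illustration. It contrasts sampling each table independently with sampling a target table and then following its foreign key.

```python
import sqlite3


def make_source() -> sqlite3.Connection:
    """Build a toy source database: 100 users, 2 orders per user (hypothetical schema)."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY)")
    conn.execute(
        "CREATE TABLE orders ("
        "id INTEGER PRIMARY KEY, "
        "user_id INTEGER NOT NULL REFERENCES users(id))"
    )
    conn.executemany("INSERT INTO users VALUES (?)", [(i,) for i in range(1, 101)])
    conn.executemany(
        "INSERT INTO orders VALUES (?, ?)",
        [(i, (i - 1) % 100 + 1) for i in range(1, 201)],
    )
    return conn


def count_orphans(user_ids, orders):
    """Count sampled orders that reference a user missing from the sample."""
    return sum(1 for _, user_id in orders if user_id not in user_ids)


def naive_subset(conn):
    """Take 5% of each table independently, ignoring the foreign key."""
    users = {row[0] for row in conn.execute("SELECT id FROM users WHERE id <= 5")}
    orders = conn.execute(
        "SELECT id, user_id FROM orders WHERE id % 20 = 0"
    ).fetchall()
    return users, orders


def fk_aware_subset(conn):
    """Take 5% of users, then only the orders that belong to those users."""
    users = {row[0] for row in conn.execute("SELECT id FROM users WHERE id <= 5")}
    placeholders = ",".join("?" * len(users))
    orders = conn.execute(
        f"SELECT id, user_id FROM orders WHERE user_id IN ({placeholders})",
        sorted(users),
    ).fetchall()
    return users, orders


if __name__ == "__main__":
    source = make_source()
    for strategy in (naive_subset, fk_aware_subset):
        sampled_users, sampled_orders = strategy(source)
        print(strategy.__name__, "orphaned orders:",
              count_orphans(sampled_users, sampled_orders))
```

In this toy data the naive strategy selects orders whose users were never sampled, so loading them into a destination with the foreign key enforced would fail, while the foreign-key-aware strategy selects only orders whose users are present.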

One common use-case is to scale down a production database to a more reasonable size so that it can be used in staging, test, and development environments. This can be done to save costs and, when used in tandem with PII removal, can be quite powerful as a productivity enhancer. Another example is copying specific rows from one database and placing them into another while maintaining referential integrity.

You can find more details about the original Condenser here and here.

Installation

Six steps to set up from source, assuming Python 3.10+:

  1. Install astral-uv

  2. Install the Postgres and/or MySQL database tools. For Postgres, the pg_dump and psql tools are required; they must either be on your $PATH or be pointed to with $POSTGRES_PATH. For MySQL, mysqldump and mysql are required; they must either be on your $PATH or be pointed to with $MYSQL_PATH.

  3. Clone this project locally.

  4. Install the project with uv sync --frozen

  5. Set up your configuration and save it in config.json. The provided config.json.example has the skeleton of what you need to provide: source and destination database connection details, as well as subsetting goals in initial_targets. Here's an example that will collect 10% of a table named public.target_table.

    "initial_targets": [
        {
            "table": "public.target_table",
            "percent": 10
        }
    ]
    

    More configuration may be required depending on your database, but simple databases should need little beyond this. See CONFIG.md for more details, and config.json.example_all for all of the options in a single config file.

  6. Run! $ uv run subset

Or install from PyPI, using either pip or pipx:

pip install db-condenser
pipx install db-condenser
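Before the first run, it can help to confirm that the database client tools from step 2 are actually discoverable. The following is a minimal sketch, not part of the tool itself: it assumes that $POSTGRES_PATH and $MYSQL_PATH name a directory containing the binaries (the exact lookup the tool performs is not documented here), and the names REQUIRED_TOOLS, find_tool, and missing_tools are hypothetical.

```python
import os
import shutil

# Tool names and environment variables come from the installation steps;
# treating the variables as directories is an assumption of this sketch.
REQUIRED_TOOLS = {
    "postgres": ("POSTGRES_PATH", ["pg_dump", "psql"]),
    "mysql": ("MYSQL_PATH", ["mysqldump", "mysql"]),
}


def find_tool(name, env_var, environ=None):
    """Return the full path of a tool, or None if it cannot be found.

    The override directory (if set) is searched first, then PATH.
    """
    environ = os.environ if environ is None else environ
    override_dir = environ.get(env_var)
    if override_dir:
        found = shutil.which(name, path=override_dir)
        if found:
            return found
    return shutil.which(name, path=environ.get("PATH"))


def missing_tools(db_type, environ=None):
    """List the required tools for db_type that could not be located."""
    env_var, tools = REQUIRED_TOOLS[db_type]
    return [tool for tool in tools if find_tool(tool, env_var, environ) is None]


if __name__ == "__main__":
    for db_type in REQUIRED_TOOLS:
        missing = missing_tools(db_type)
        status = "ok" if not missing else "missing: " + ", ".join(missing)
        print(f"{db_type}: {status}")
```

Only the tools for the database you actually use need to be present, so a "missing" result for the other database type is harmless.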

Running

Almost all of the configuration lives in the config.json file, so running the tool is as simple as:

uv run subset
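Because a run is driven almost entirely by config.json, a quick sanity check of the file before running can catch typos early. The sketch below is an illustration only and is not the tool's own validation: it checks just the table and percent keys shown in the installation example, the function name check_initial_targets is hypothetical, and any other target forms described in CONFIG.md would need additional handling.

```python
import json


def check_initial_targets(config):
    """Return a list of human-readable problems found in initial_targets.

    Only the {"table": ..., "percent": ...} form is understood here.
    """
    targets = config.get("initial_targets")
    if not isinstance(targets, list) or not targets:
        return ["initial_targets must be a non-empty list"]

    problems = []
    for index, target in enumerate(targets):
        if not isinstance(target, dict):
            problems.append(f"target {index}: expected an object")
            continue
        table = target.get("table")
        if not isinstance(table, str) or not table:
            problems.append(f"target {index}: 'table' must be a non-empty string")
        percent = target.get("percent")
        is_number = isinstance(percent, (int, float)) and not isinstance(percent, bool)
        if not is_number or not 0 < percent <= 100:
            problems.append(f"target {index}: 'percent' must be in the range (0, 100]")
    return problems


if __name__ == "__main__":
    with open("config.json", encoding="utf-8") as handle:
        found = check_initial_targets(json.load(handle))
    for problem in found:
        print(problem)
    raise SystemExit(1 if found else 0)
```

Run from the project directory, the script exits with a non-zero status when it finds a problem, so it can be chained in front of uv run subset in a shell script.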

Three command-line arguments are supported:

-v: Verbose output. Useful for performance debugging. Lists almost every query made, along with how long it took.

--no-constraints: For Postgres, skips adding the constraints found in the source database to the destination database. This option has no effect for MySQL.

-y: Skips the confirmation prompt that is otherwise shown when the destination host is not localhost or 127.0.0.1.
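The interaction between -y and the destination host can be sketched as follows. This is a hedged illustration of the behaviour described above, not the tool's actual source: the parser layout, the attribute names verbose, no_constraints, and assume_yes, and the helper needs_confirmation are all assumptions.

```python
import argparse

# Hosts that the description above treats as safe destinations.
LOCAL_HOSTS = {"localhost", "127.0.0.1"}


def build_parser():
    """Parser for the three documented flags (attribute names are hypothetical)."""
    parser = argparse.ArgumentParser(prog="subset")
    parser.add_argument("-v", dest="verbose", action="store_true",
                        help="verbose output, including query timings")
    parser.add_argument("--no-constraints", dest="no_constraints", action="store_true",
                        help="Postgres only: do not add source constraints to the destination")
    parser.add_argument("-y", dest="assume_yes", action="store_true",
                        help="skip the confirmation prompt for non-local destinations")
    return parser


def needs_confirmation(destination_host, assume_yes):
    """True when the user should be asked before writing to the destination."""
    is_local = destination_host.strip().lower() in LOCAL_HOSTS
    return not assume_yes and not is_local


if __name__ == "__main__":
    args = build_parser().parse_args()
    # "db.example.internal" is a placeholder; the real host comes from config.json.
    print(needs_confirmation("db.example.internal", args.assume_yes))
```

The prompt guards against accidentally overwriting a shared or remote database, so -y is best reserved for automated runs where the destination is known in advance.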

