Skip to main content

MySQL database subsetting CLI tool

Project description

This project is still largely a work in progress.

Subsetter is a Python utility that can be used for subsetting portions of mysql databases. "Subsetting" is the action extracting a smaller set of rows from your database that still maintain expected relationships between your data. This can be useful for testing against a small but realistic dataset or for generating sample data for use in demonstrations.

Similar tools include Tonic.ai's platform and condenser. This is meant to be a simple CLI tool that overcomes many of the difficulties in using condenser.

Be aware that subsetting is a hard problem. The planner tool is meant to do a bit of "magic" to generate a plan. For some organizations this will entirely match their needs, for others this may only be a starting point. The plan produced can be fairly aribtrarily modified and then fed to the sampler which does the technical work of actually extracting data from the source.

Usage

Create a sampling plan

The first step in subsetting a database is to generate a sampling plan. A sampling plan is nothing more than an ordered sequence of SQL describing how to sample each requested database table. You'll want to create a configuration file similar to planner_config.sample.yaml that tells the planner what tables you want to sample. Then you can create a plan with the below command:

python -m subsetter plan -c my-config.yaml > plan.yaml

If you inspect the generated plan file you will see SQL for each database table. Some tables may have a materialize: true flag; these are sampled tables that need to be referenced by other sampled tables. You may see some queries use a placeholder <SAMPLED> identifier; this means the query is referencing the already sampled data for that table rather than the original source table.

Sample a database with a plan

The sample sub-command will sample rows from the source database into your target output (either a database or as json files) using a plan generated using the plan sub-command.

export SUBSET_DESTINATION_PASSWORD=my-db-password
python -m subsetter sample --plan my-plan.yaml mysql --host my-db-host --user my-db-user

The sampler also supports filters which allow you to transform and anonymize your data using simple column filters. See sampler_config.sample.yaml for more details on what filters are available and how to configure them.

Plan and sample in one action

There's also a subset subcommand to perform the plan and sample actions back-to-back. This will automatically feed the generated plan into the sampler, in addition to ensuring the same source database configuration is used for each.

export SUBSET_DESTINATION_PASSWORD=my-db-password
python -m subsetter subset --plan-config my-config.yaml mysql --host my-db-host --user my-db-user

Future Work

This project is still relatively incomplete and lacks some basic things like:

  • More complete documentation
  • More testing

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

subsetter-0.0.4.dev3.tar.gz (22.8 kB view details)

Uploaded Source

Built Distribution

If you're not sure about the file name format, learn more about wheel file names.

subsetter-0.0.4.dev3-py3-none-any.whl (21.1 kB view details)

Uploaded Python 3

File details

Details for the file subsetter-0.0.4.dev3.tar.gz.

File metadata

  • Download URL: subsetter-0.0.4.dev3.tar.gz
  • Upload date:
  • Size: 22.8 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/4.0.2 CPython/3.10.12

File hashes

Hashes for subsetter-0.0.4.dev3.tar.gz
Algorithm Hash digest
SHA256 f1d03aebb6b404d792db4eaa6fcb4d649df0f90cad189019ad894f455b8c2ffe
MD5 ff7c77e0ce8137fe6ffa427d2f31c35e
BLAKE2b-256 b71f199d56a80e42c8ace52d22834fc8e364835623473c5891eabb85173c7bbe

See more details on using hashes here.

File details

Details for the file subsetter-0.0.4.dev3-py3-none-any.whl.

File metadata

  • Download URL: subsetter-0.0.4.dev3-py3-none-any.whl
  • Upload date:
  • Size: 21.1 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/4.0.2 CPython/3.10.12

File hashes

Hashes for subsetter-0.0.4.dev3-py3-none-any.whl
Algorithm Hash digest
SHA256 0728d15549743b1f235e18e83e7b0abe3afcf635748b8bb4b5be654f4b829dee
MD5 074aaf622b6756fa68443cefe8fb7f0b
BLAKE2b-256 0a2ccd217a5a9c237cc980e240de7f82a3685220c3d7b121300a5fa35e6c1013

See more details on using hashes here.

Supported by

AWS Cloud computing and Security Sponsor Datadog Monitoring Depot Continuous Integration Fastly CDN Google Download Analytics Pingdom Monitoring Sentry Error logging StatusPage Status page