MySQL database subsetting CLI tool
Project description
This project is still largely a work in progress.
Subsetter is a Python utility that can be used for subsetting portions of mysql databases. "Subsetting" is the action extracting a smaller set of rows from your database that still maintain expected relationships between your data. This can be useful for testing against a small but realistic dataset or for generating sample data for use in demonstrations.
Similar tools include Tonic.ai's platform and condenser.
This is meant to be a simple CLI tool that overcomes many of the difficulties in
using condenser
.
Be aware that subsetting is a hard problem. The planner tool is meant to do a bit of "magic" to generate a plan. For some organizations this will entirely match their needs, for others this may only be a starting point. The plan produced can be fairly aribtrarily modified and then fed to the sampler which does the technical work of actually extracting data from the source.
Usage
Create a sampling plan
The first step in subsetting a database is to generate a sampling plan. A sampling plan is nothing more than an ordered sequence of SQL describing how to sample each requested database table. You'll want to create a configuration file similar to planner_config.sample.yaml that tells the planner what tables you want to sample. Then you can create a plan with the below command:
python -m subsetter plan -c my-config.yaml > plan.yaml
If you inspect the generated plan file you will see SQL for each database table.
Some tables may have a materialize: true
flag; these are sampled tables that
need to be referenced by other sampled tables. You may see some queries use a
placeholder <SAMPLED>
identifier; this means the query is referencing the
already sampled data for that table rather than the original source table.
Sample a database with a plan
The sample sub-command will sample rows from the source database into your target output (either a database or as json files) using a plan generated using the plan sub-command.
export SUBSET_DESTINATION_PASSWORD=my-db-password
python -m subsetter sample --plan my-plan.yaml mysql --host my-db-host --user my-db-user
The sampler also supports filters which allow you to transform and anonymize your data using simple column filters. See sampler_config.sample.yaml for more details on what filters are available and how to configure them.
Plan and sample in one action
There's also a subset
subcommand to perform the plan
and sample
actions
back-to-back. This will automatically feed the generated plan into the sampler,
in addition to ensuring the same source database configuration is used for
each.
export SUBSET_DESTINATION_PASSWORD=my-db-password
python -m subsetter subset --plan-config my-config.yaml mysql --host my-db-host --user my-db-user
Future Work
This project is still relatively incomplete and lacks some basic things like:
- More complete documentation
- More testing
Project details
Release history Release notifications | RSS feed
Download files
Download the file for your platform. If you're not sure which to choose, learn more about installing packages.
Source Distribution
Built Distribution
Hashes for subsetter-0.0.4.dev3-py3-none-any.whl
Algorithm | Hash digest | |
---|---|---|
SHA256 | 0728d15549743b1f235e18e83e7b0abe3afcf635748b8bb4b5be654f4b829dee |
|
MD5 | 074aaf622b6756fa68443cefe8fb7f0b |
|
BLAKE2b-256 | 0a2ccd217a5a9c237cc980e240de7f82a3685220c3d7b121300a5fa35e6c1013 |