MySQL database subsetting CLI tool

These details have not been verified by PyPI

Project links

Homepage

GitHub Statistics

View statistics for this project via Libraries.io, or by using our public dataset on Google BigQuery

Project description

Subsetter

Subsetter is a Python utility that can be used for subsetting portions of mysql databases. Subsetting is the action extracting a smaller set of rows from your database that still maintain expected foreign-key relationships between your data. This can be useful for testing against a small but realistic dataset or for generating sample data for use in demonstrations. This tool also supports filtering that allows you to remove/randomize rows that may contain sensitive data.

Similar tools include Tonic.ai's platform and condenser. This is meant to be a simple CLI tool that overcomes many of the difficulties in using condenser.

Be aware that subsetting is a hard problem. The planner tool is meant to do a bit of "magic" to generate a plan. For some organizations this will entirely match their needs, for others this may only be a starting point. The plan produced can be fairly aribtrarily modified and then fed to the sampler which does the technical work of actually extracting data from the source.

Usage

Create a sampling plan

The first step in subsetting a database is to generate a sampling plan. A sampling plan is nothing more than an ordered sequence of SQL describing how to sample each requested database table. You'll want to create a configuration file similar to planner_config.sample.yaml that tells the planner what tables you want to sample. Then you can create a plan with the below command:

python -m subsetter plan -c my-config.yaml > plan.yaml

If you inspect the generated plan file you will see SQL for each database table. Some tables may have a materialize: true flag; these are sampled tables that need to be referenced by other sampled tables. You may see some queries use a placeholder <SAMPLED> identifier; this means the query is referencing the already sampled data for that table rather than the original source table.

Generally, you can make arbitrary changes to the SQL listed in the plan with the constraint that each query can only access the sampling results from tables marked as materialized that appear earlier in the plan.

Sample a database with a plan

The sample sub-command will sample rows from the source database into your target output (either a database or as json files) using a plan generated using the plan sub-command. This tool will not copy schema from the source database. Any sampled tables must already exist in the destination database. Additionally you must pass --truncate if you wish to clear any existing data in the sampled tables that may interfere with the sampling process.

export SUBSET_DESTINATION_PASSWORD=my-db-password
python -m subsetter sample --plan my-plan.yaml mysql --host my-db-host --user my-db-user

The sampling process proceeds in three phases:

If --truncate is specified any tables about to be sampled will be first truncated.
Any materialized tables in the plan will be sampled into temporary tables on the source database in the order listed in the plan file.
Data is copied for each table from the source to destination. The ordering may differ from the plan file in order to adhere to foreign key constraints.

The sampler also supports filters which allow you to transform and anonymize your data using simple column filters. See sampler_config.sample.yaml for more details on what filters are available and how to configure them.

Plan and sample in one action

There's also a subset subcommand to perform the plan and sample actions back-to-back. This will automatically feed the generated plan into the sampler, in addition to ensuring the same source database configuration is used for each.

export SUBSET_DESTINATION_PASSWORD=my-db-password
python -m subsetter subset --plan-config my-config.yaml mysql --host my-db-host --user my-db-user

Sampling Multiplicity

Sampling usually means condensing a large dataset into a semantically consistent small dataset. However, there are times that what you really want to do is create a semantically consistent large dataset from your existing data. The sampler has support through this by setting the multiplicity factor.

Multiplicity works by creating multiple copies of your sampled dataset in your output database. To ensure these datasets do not collide it remaps all foreign keys into a new key-space. Note that this process assumes your foreign keys are opaque integers identifiers.

Project details

These details have not been verified by PyPI

Project links

Homepage

GitHub Statistics

View statistics for this project via Libraries.io, or by using our public dataset on Google BigQuery

Release history Release notifications | RSS feed

0.3.0

May 1, 2024

0.2.0

Apr 3, 2024

This version

0.1.0

Jan 29, 2024

0.0.4

Oct 19, 2023

0.0.4.dev3 pre-release

Sep 19, 2023

0.0.4.dev2 pre-release

Aug 17, 2023

0.0.4.dev1 pre-release

Aug 17, 2023

0.0.3.dev2 pre-release

Aug 14, 2023

0.0.3.dev1 pre-release

Aug 14, 2023

Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

subsetter-0.1.0.tar.gz (27.8 kB view hashes)

Uploaded Jan 29, 2024 Source

Built Distribution

subsetter-0.1.0-py3-none-any.whl (27.6 kB view hashes)

Uploaded Jan 29, 2024 Python 3

Hashes for subsetter-0.1.0.tar.gz

Hashes for subsetter-0.1.0.tar.gz
Algorithm	Hash digest
SHA256	`ec874967fd5bb5a3227e3088c10351cbfec34ad8c6d1591075d63b2fdd9c18a6`
MD5	`3534f584d98311be682889be194805a5`
BLAKE2b-256	`e6ce317a40f1ae88bc04c360a9c6f85048c7b2e2ca414c3bed4f4eeeb5f8bed5`

Hashes for subsetter-0.1.0-py3-none-any.whl

Hashes for subsetter-0.1.0-py3-none-any.whl
Algorithm	Hash digest
SHA256	`127b4e960ddbdcac86ed140f70272ad2a07a5da322e74a9135c68428b95e34d9`
MD5	`27bc3c011cb8c63242cc2aed40aa689d`
BLAKE2b-256	`74543c1398fadc63797043189c5d52916dbaeb4729857a53ee07f7b75bae42ad`