Python implementation of the SynDiffix synthetic data generation mechanism.
Project description
Overview
SynDiffix is a new approach to generating statistically-accurate and strongly anonymous synthetic data from structured data. Compared to existing open-source and proprietary commercial approaches, SynDiffix is
- many times more accurate,
- has comparable or better ML efficacy,
- runs at least an order of magnitude faster, and
- has equal or stronger anonymization.
Purpose
This library implements the SynDiffix method for tabular data synthesis in pure Python.
Usage
Usage can be as simple as:
from syndiffix import Synthesizer
raw_data = load_dataframe()
syn_data = Synthesizer(raw_data).sample()
This will create a new Pandas DataFrame containing synthetic data, with the same number of columns and the same number of rows as in the input data. Default settings will be used and each row in the original data has to belong to a different protected entity.
The script example_simple.py shows a simple example on how to process all data in a CSV file that holds a different protected entity in each row.
Processing data with multiple rows per-entity
If the same entity can have multiple rows belonging to it, then a dataframe with the AID values has to be passed separately:
from syndiffix import Synthesizer
raw_data = load_dataframe()
aid_columns = ["aid1", "aid2"]
aids = raw_data[aid_columns]
raw_data = raw_data.drop(columns=aid_columns)
syn_data = Synthesizer(raw_data, aids=aids).sample()
The AID columns should not be a part of the input data to the synthesizer.
Changing anonymization parameters
Anonymization parameters determine how much raw data is suppressed and distorted before synthesis.
To modify default parameters, pass a custom instance of the AnonymizationParams class
to the Synthesizer object. For example:
Synthesizer(raw_data, anonymization_params=AnonymizationParams(layer_noise_sd=1.5))
The following parameters are available:
-
salt: noise salt for the data source; if empty, an automatically generated value is used. -
low_count_params: parameters for the low-count filter. -
outlier_count: outlier count interval used during flattening. -
top_count: top count interval used during flattening. -
layer_noise_sd: stddev of each noise layer added to row counts.
Changing bucketization parameters
Bucketization parameters determine how raw data is aggregated before synthesis.
To modify default parameters, pass a custom instance of the BucketizationParams class
to the Synthesizer object. For example:
Synthesizer(raw_data, bucketization_params=BucketizationParams(precision_limit_depth_threshold=10))
The following parameters are available:
-
singularity_low_threshold: low threshold for a singularity bucket. -
range_low_threshold: low threshold for a range bucket. -
precision_limit_row_fraction: the fraction of rows needed for splitting nodes when the tree depth goes beyond the depth threshold; this condition is applied in addition to the low-count filter. -
precision_limit_depth_threshold: tree depth threshold below which nodes are split only if they pass the low-count filter; when above the threshold, the row fraction condition is also applied.
Changing clustering strategy
Clustering strategy determines how columns are grouped together in the forest of trees used for
anonymization and aggregation. Anonymized buckets are harvested from those trees, synthetic
microdata tables are generated, and are then stitched together into a single output table.
To change the clustering strategy, pass an instance of a sub-class of the ClusteringStrategy class
to the Synthesizer object. For example:
Synthesizer(raw_data, clustering=NoClustering())
The following strategies are available:
-
NoClustering: - strategy that disables clustering; puts all columns in a single cluster; note that this will result in very poor performance for the general case. -
DefaultClustering: - general-purpose clustering strategy; columns are grouped together in order to maximize the chi-square dependence measurement values of the generated clusters; a main column that gets put into every cluster can be specified, in order to improve output quality for that specific column. -
MLClustering: - strategy for ML tasks; main feature columns for a target column are automatically detected and grouped together with the target column in the order of their ML prediction-test scores.
Development
Prerequisites: poetry
- Installation:
poetry install
Activate poetry environment: poetry shell. (can skip, then prepend poetry run to the next commands)
- Format:
black . && isort . - Test:
pytest . - Check:
flake8 . && mypy . && black --check . && isort . --check
Implementation details
A step-by-step description of the algorithm can be found here.
A per-dimension range is internally called an interval (and handled by the Interval class), in order to avoid
potential name clashes with the native Python range API.
Project details
Release history Release notifications | RSS feed
Download files
Download the file for your platform. If you're not sure which to choose, learn more about installing packages.
Source Distribution
Built Distribution
Filter files by name, interpreter, ABI, and platform.
If you're not sure about the file name format, learn more about wheel file names.
Copy a direct link to the current filters
File details
Details for the file syndiffix-1.0.0a1.tar.gz.
File metadata
- Download URL: syndiffix-1.0.0a1.tar.gz
- Upload date:
- Size: 30.8 kB
- Tags: Source
- Uploaded using Trusted Publishing? Yes
- Uploaded via: twine/4.0.2 CPython/3.11.6
File hashes
| Algorithm | Hash digest | |
|---|---|---|
| SHA256 |
1663155b1e07d2a9a6b7021d26570064d2d68d17bc48b33c4752c8050ab149f9
|
|
| MD5 |
a073dc82c2c57f1062c9a5082a3c688e
|
|
| BLAKE2b-256 |
94472784ca21c335770f50239097bde760be9858d10904f2090c05d27e35ebe9
|
File details
Details for the file syndiffix-1.0.0a1-py3-none-any.whl.
File metadata
- Download URL: syndiffix-1.0.0a1-py3-none-any.whl
- Upload date:
- Size: 35.4 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? Yes
- Uploaded via: twine/4.0.2 CPython/3.11.6
File hashes
| Algorithm | Hash digest | |
|---|---|---|
| SHA256 |
d4546c4b79509d422eadac95250d6b891ad015284d46f533e18f3f168bca4cfc
|
|
| MD5 |
47dedd15813b86805b9a71296eb5edda
|
|
| BLAKE2b-256 |
4bdecee15b868a10a1f06a5f036179d050359ea92f7bebee9a25856821d68640
|