Synthesizers
A meta library for synthetic data generation.
The goal of synthesizers is to simplify the use of existing frameworks for synthetic data generation:
- All basic operations are available as functional and pipeline abstractions that transform states.
- States keep track of datasets, models, and evaluation results.
- A meta pipeline allows for very simple but expressive synthetic data generation.
- Datasets are read from CSV, TSV, JSON, JSONL, Python Pickle (.pickle), and Excel (.xlsx) files (see the sketch after this list).
- Datasets can be downloaded from the Huggingface Hub.
- States including datasets and models can be saved and loaded from disk.
- Datasets can be converted between list, Numpy, Pandas, and Huggingface datasets formats.
- Datasets are automatically converted to the input format of synthesis or evaluation backends.
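Several of these features combine in a few lines. The following is a minimal sketch, assuming that Load also accepts a local file path (as suggested by the list of supported file formats above); the file name and options are illustrative:
from synthesizers import Load
# Assumption: Load reads a local CSV file, mirroring its support for Hub datasets.
# Synthesize then splits, trains a model, generates synthetic rows, and saves them.
Load("my_data.csv").Synthesize(split_size=0.8, gen_count=1000, do_eval=False, save_name="my_synth.jsonl", save_key="synth")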
Installation
Simply install synthesizers using pip from PyPI:
pip install synthesizers
If you cloned or downloaded the source code, you can also install it from the root directory of the repository:
pip install .
Or download and install directly from the terminal:
pip install https://github.com/schneiderkamplab/synthesizers/archive/refs/heads/main.zip
To ensure the correct dependencies, it is often preferable to create a virtual environment (here in the directory venv
under the current directory):
python -m virtualenv venv
. venv/bin/activate
pip install synthesizers
Conda is a popular alternative:
conda create -n synthesizers python=3.11
conda activate synthesizers
pip install synthesizers
Usage
Functional abstraction
The functional abstraction manipulates states that can be initialized by the pre-defined Load object and manipulated by functions such as the meta function Synthesize:
from synthesizers import Load
Load("mstz/breast").Synthesize(split_size=0.8, gen_count=10000, eval_target_col="is_cancer", save_name="breast.xlsx", save_key="synth")
In this case, Load loads a dataset on breast cancer from the Huggingface Hub, resulting in a state containing just a train dataset. This state is then expanded by the Synthesize function, which splits the train dataset into train and test datasets, trains a GAN model, generates a synth dataset, computes eval information, and saves the synthetic data to an Excel file.
The meta function Synthesize can be broken up into separate functions for the individual steps:
from synthesizers import Load
Load("mstz/breast").Split(size=0.8).Train().Generate(count=10000).Evaluate(target_col="is_cancer").Save(name="breast.xlsx", key="synth")
This version can be used to reuse intermediate states, e.g., to generate and save synthetic datasets of different sizes while reusing the same trained model:
from synthesizers import Load
state = Load("mstz/breast").Split(size=0.8).Train()
for count in (100, 1000, 10000, 100000):
    state.Generate(count=count).Save(name=f"breast-{count}.csv", key="synth")
It is also useful when the intermediate state needs to be stored on the file system:
from synthesizers import Load
state = Load("mstz/breast").Split(size=0.8).Train().Save("breast_state")
The saved state can be loaded and resumed as one might expect:
from synthesizers import Load
Load("breast_state").Generate(count=10000).Save(name="breast.csv", key="synth")
The count parameter can be a list or another iterable sequence, indicating that multiple synthetic datasets should be created. The following code will save two synthetic datasets to breast_1000.csv and breast_100000.csv:
from synthesizers import Load
Load("breast_state").Generate(count=[1000,100000]).Save(name="breast_1000.csv", index=0, key="synth").Save(name="breast_100000.csv", index=1, key="synth")
Multiple values are also allowed for the plugin parameter of Train and the size parameter of Split, as illustrated in the sketch below.
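A minimal sketch of this, assuming that multiple plugin values accumulate multiple synth datasets that can be addressed with the same index parameter used for multiple count values above:
from synthesizers import Load
# Train two synthcity plugins on the same split.
state = Load("mstz/breast").Split(size=0.8).Train(plugin=["ctgan", "tvae"]).Generate(count=1000)
# Assumption: index selects among the accumulated synthetic datasets, one per plugin.
state.Save(name="breast_ctgan.csv", index=0, key="synth")
state.Save(name="breast_tvae.csv", index=1, key="synth")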
Furthermore, the Load function takes either a single dataset or a tuple of such datasets. With the help of the optional jobs parameter (with variants train_jobs, eval_jobs, etc.), the number of concurrent processes can be set. In the following example, we generate synthetic versions of two different splits of two different datasets:
from synthesizers import Load
Load(("mstz/titanic","mstz/breast")).Synthesize(split_size=[0.5,0.8], train_jobs=4, do_eval=False).Save("mstz")
Pipeline abstraction
Internally, the functional abstraction instantiates pipelines to accomplish its functionality. These pipelines can be used as an expressive alternative. Here is a usage example with the synthesis meta pipeline, which again loads the breast cancer dataset from the Huggingface Hub, trains a GAN model, generates 10,000 synthetic records, evaluates them, and saves them as a JSON file:
from synthesizers import pipeline
pipeline("synthesize", split_size=0.8, gen_count=10000, eval_target_col="is_cancer", save_name="breast.json", save_key="synth")("mstz/breast")
The meta pipeline pools the functionality of multiple base pipelines. The same functionality as in the above example might be accomplished with base pipelines:
from synthesizers import pipeline
state = pipeline("split", size=0.8)("mstz/breast")
state = pipeline("train")(state)
state = pipeline("generate", count=10000)(state)
state = pipeline("evaluate", target_col="is_cancer")
state = pipeline("identity", save_name="breast.json", save_key="synth")
Pipelines are not only exposed as an internal representation but also provide the ability to reuse settings, e.g., by having a pipeline for training CTGANs. The following example also illustrates that the functional and pipeline abstractions can readily be combined as preferred by the user:
from synthesizers import Load, pipeline
s1 = Load("mstz/breast").Split()
s2 = Load("julien-c/titanic-survival").Split()
train = pipeline("train", plugin="ctgan")
train(s1).Generate(count=1000).Save(name="breast.jsonl", key="synth")
train(s2).Generate(count=1000).Save(name="titanic.jsonl", key="synth")
The plugins depend on the backend used. The standard backend for generation is synthcity, which offers a variety of plugins including adsgan, ctgan, tvae, and bayesian_network.
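As a minimal sketch, any of these plugin names can be passed to Train; bayesian_network is picked here purely for illustration:
from synthesizers import Load
# Train a Bayesian network instead of the default GAN and save the synthetic data.
Load("mstz/breast").Split(size=0.8).Train(plugin="bayesian_network").Generate(count=1000).Save(name="breast_bn.csv", key="synth")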
For evaluation, the standard backend is SynthEval.
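A pre-configured evaluation pipeline can likewise be reused across states, here resuming the saved state from above (a minimal sketch; the target column is specific to the breast cancer dataset):
from synthesizers import Load, pipeline
# Reusable evaluation pipeline backed by SynthEval.
evaluate = pipeline("evaluate", target_col="is_cancer")
state = evaluate(Load("breast_state").Generate(count=1000))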
Ideas for future development
- allow methods from multiple backends by supporting multiple adapters (mapping method name to adapter)
- make sure all parameters can be iterables/sequences where it makes sense (e.g. target_col)
- check argument validity before running pipeline
- improved error handling (e.g. evaluating without synth dataset, training without train dataset etc.)
- add source and meta to StateDict with initial data source and parameters to reproduce
- revamp loading saving to a more useful format, e.g., pickle everything to one file instead of directories
- implement overwrite parameter to State with Load(overwrite=...), three values:
- copy: add new state if a value would be overwritten
- overwrite: just overwrite the value
- raise: raise an error if a value would be overwritten
- implement TabularSynthesisDPPipeline
- use benchmark module from syntheval?
- standardized list of supported metrics (supported by any backend)
- standardized list of supported generation methods (supported by any backend)
- accumulation of multiple outputs (model, synth, and eval as lists)
- select and combine evaluation backends automatically for given list of metrics
- select generation backend automatically for given generation method
- make syntheval plots available as PIL images
- push_to_hub method on models a la https://github.com/huggingface/datasets/blob/main/src/datasets/arrow_dataset.py
- push_to_hub method on datasets
- R synthpop as backend
- integration of other backends
- Put string options as literals so they are more visible in tooltips
- Docstrings for all modules used in the examples
Download files
Download the file for your platform.
Source Distribution
File details
Details for the file synthesizers-1.2.3.tar.gz.
File metadata
- Download URL: synthesizers-1.2.3.tar.gz
- Upload date:
- Size: 14.1 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/5.1.1 CPython/3.10.12
File hashes
Algorithm | Hash digest
---|---
SHA256 | 1bb3b3fd7e106a71c92c6003012b9ed00f5b600ca7fee0ab05628b024066438f
MD5 | ea7dde5eb88bf6ce08e9b832fae3160d
BLAKE2b-256 | 5370ceac414da97073d1bc5df78959de004051e892e898bd00822e0c34aaf5c3