
A Python implementation of the synthpop sequential CART model for generating synthetic tabular data


Python sequential CART model for synthetic data generation

Synthpop is a popular R package [1] for generating synthetic data using sequential CART models. Previous Python implementations are not well maintained and do not support the latest versions of scikit-learn. This project implements a simple Python version of the sequential CART model for synthetic data generation, inspired by the approach used in Synthpop. It is designed to be compatible with the latest versions of scikit-learn and pandas, and includes functionality for fitting the model to training data, generating synthetic data, and saving and loading fitted models.

While much current research investigates deep learning frameworks for synthetic data generation, such as GANs and VAEs, simpler methods such as sequential CART remain highly competitive in terms of data quality, especially for tabular data with mixed data types and small to medium-sized datasets.
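To make the idea of sequential CART synthesis concrete, the following is a minimal sketch of the general technique, not this package's actual code. It assumes numeric columns only, and the function names fit_sequential_cart and generate_synthetic are hypothetical. The first column is resampled from its observed values; every later column gets a decision tree fitted on the columns before it, and synthetic values are drawn from the real values that fell into the same leaf.

```python
import numpy as np
import pandas as pd
from sklearn.tree import DecisionTreeRegressor


def fit_sequential_cart(df, max_depth=5, min_samples_leaf=5, seed=0):
    """Fit one tree per column (after the first), conditioned on earlier columns.

    Illustrative sketch only: numeric columns, no missing values.
    """
    columns = list(df.columns)
    models = []
    for i, col in enumerate(columns[1:], start=1):
        predictors = columns[:i]
        X = df[predictors].to_numpy()
        y = df[col].to_numpy()
        tree = DecisionTreeRegressor(
            max_depth=max_depth, min_samples_leaf=min_samples_leaf, random_state=seed
        )
        tree.fit(X, y)
        # Group the real target values by the leaf they ended up in (donor pools).
        leaf_ids = tree.apply(X)
        donors = {leaf: y[leaf_ids == leaf] for leaf in np.unique(leaf_ids)}
        models.append((col, predictors, tree, donors))
    return columns[0], df[columns[0]].to_numpy(), models


def generate_synthetic(fitted, n, seed=0):
    """Generate n synthetic rows, one column at a time, in the fitted order."""
    rng = np.random.default_rng(seed)
    first_col, first_values, models = fitted
    # First column: sample with replacement from its observed values.
    syn = pd.DataFrame({first_col: rng.choice(first_values, size=n, replace=True)})
    for col, predictors, tree, donors in models:
        # Route the synthetic rows through the tree, then draw from each leaf's donors.
        leaf_ids = tree.apply(syn[predictors].to_numpy())
        syn[col] = [rng.choice(donors[leaf]) for leaf in leaf_ids]
    return syn


# Toy data, generated here purely for illustration.
rng = np.random.default_rng(42)
age = rng.integers(18, 80, size=200)
income = 1000 + 50 * age + rng.normal(0, 200, size=200)
df_real = pd.DataFrame({"age": age, "income": income})

df_syn = generate_synthetic(fit_sequential_cart(df_real), n=50)
```

Because each value is copied from a real record in the matching leaf, every synthetic value in this sketch is a value that occurs in the training data. This is one reason tree settings such as depth and leaf size matter for privacy.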

Installation

pip install git+https://github.com/notna07/python-generative-cart#

Usage

A basic example of fitting the PyCART model to a dataset and generating synthetic data. The tree_args parameter is optional; it lets users pass arguments such as maximum depth and minimum samples per leaf to the underlying decision tree models, which can affect the quality of the generated synthetic data.

import pandas as pd
from py_cart import PyCART

# Load your dataset
df_real = pd.read_csv("your_dataset.csv")

# Fit the PyCART model
cart = PyCART(tree_args = {"max_depth": 5})
cart.fit(df_real)

# Generate synthetic data
df_syn = cart.generate(1000)

The original R Synthpop package [1] had limited functionality for saving and loading fitted models and generating additional synthetic data post hoc. This implementation adds this functionality:

# Save the fitted model
cart.save_model("pycart_model.pkl")

# Load the fitted model
cart_loaded = PyCART()
cart_loaded = cart_loaded.load_model("pycart_model.pkl")

# Generate synthetic data using the loaded model
df_syn_loaded = cart_loaded.generate(1000)
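The .pkl extension suggests that models are stored with Python's pickle module, although that is an assumption and not something stated here. As a rough sketch of what such a round trip involves, the example below saves and restores a hypothetical fitted state (a plain dict standing in for a model) using only the standard library. The helper names and the state layout are invented for illustration and are not the PyCART API.

```python
import pickle
import tempfile
from pathlib import Path


def save_state(state, path):
    """Serialize a fitted state to disk (hypothetical helper)."""
    with open(path, "wb") as fh:
        pickle.dump(state, fh)


def load_state(path):
    """Restore a fitted state from disk (hypothetical helper).

    Only unpickle files from a trusted source: loading a pickle
    can execute arbitrary code.
    """
    with open(path, "rb") as fh:
        return pickle.load(fh)


# Hypothetical stand-in for what a fitted model might need to keep.
state = {
    "visit_order": ["age", "income"],
    "tree_args": {"max_depth": 5},
    "donors": {"income": {1: [1200.0, 1350.0], 2: [4100.0]}},
}

with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "state.pkl"
    save_state(state, path)
    restored = load_state(path)
```

A pickled model generally needs compatible library versions when it is loaded again, so recording the scikit-learn version alongside the file is a sensible precaution.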

El Emam et al. [2] showed that the order of the sequential synthesis matters for the quality of the synthetic data. This implementation therefore lets users specify the order in which variables are synthesized:

# Specify the order of variables to be synthesized
variable_order = ['age', 'income', 'education', ...]
cart.fit(df_real, visit_order = variable_order)

# Generate synthetic data with the specified variable order
df_syn = cart.generate(1000)
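Before passing a custom order, it can help to check that it names each column exactly once. The helper below is a hypothetical sketch and not part of the package. It assumes that a valid order must be a permutation of the dataset's columns, which is not confirmed here.

```python
def validate_visit_order(columns, visit_order):
    """Return visit_order as a list, or raise if it is not a permutation of columns.

    Hypothetical helper: assumes every column must appear exactly once.
    """
    columns = list(columns)
    visit_order = list(visit_order)
    duplicates = sorted({c for c in visit_order if visit_order.count(c) > 1})
    missing = [c for c in columns if c not in visit_order]
    unknown = [c for c in visit_order if c not in columns]
    if duplicates or missing or unknown:
        raise ValueError(
            f"invalid visit order: duplicates={duplicates}, "
            f"missing={missing}, unknown={unknown}"
        )
    return visit_order


# With a pandas DataFrame, `columns` would be df_real.columns.
order = validate_visit_order(
    ["age", "income", "education"],
    ["education", "age", "income"],
)
```

Failing early with a message that lists the offending names is usually easier to debug than an error raised deep inside model fitting.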

Comparison with R Synthpop

To check our implementation against the original R Synthpop package, we ran repeated experiments on eight datasets and conducted a multi-faceted evaluation of the synthetic data using SynthEval [3], covering statistical measures, downstream task performance, and privacy metrics. We also performed statistical tests to compare the results of the two implementations.

The experiments are shown in comparison.ipynb, and the results are saved in the experiments/results directory. The main takeaway is that our implementation appears to perform slightly worse on statistical similarity metrics for datasets with many numerical features. On the other hand, empirical privacy measures are improved for most datasets.

Changing the settings of the PyCART model, such as tree depth and minimum samples per leaf, can have a significant impact on the quality of the generated synthetic data. In general, more fine-grained settings lead to better statistical similarity but worse privacy. More coarse-grained settings can improve privacy. Note, however, that even with very restrictive settings, the privacy of the generated data is still far from comparable to that of a privacy-focused framework.
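One way to build intuition for this trade-off is to look at how many real records share a leaf under different tree settings. The sketch below uses scikit-learn directly on toy data and a hypothetical helper, smallest_leaf_size. It assumes, as in CART-based synthesis generally, that synthetic values are drawn from the real records in a leaf. Leaf size is only a crude proxy here and not one of the privacy metrics used in the experiments.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor


def smallest_leaf_size(X, y, **tree_args):
    """Fit a regression tree and return the number of training rows in its smallest leaf."""
    tree = DecisionTreeRegressor(random_state=0, **tree_args).fit(X, y)
    _, counts = np.unique(tree.apply(X), return_counts=True)
    return int(counts.min())


# Toy data, generated here purely for illustration.
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 3))
y = 2.0 * X[:, 0] + rng.normal(size=500)

# Fine-grained: scikit-learn defaults let the tree grow until leaves are very small.
fine = smallest_leaf_size(X, y)

# Coarse-grained: shallow tree with a floor on the number of rows per leaf.
coarse = smallest_leaf_size(X, y, max_depth=3, min_samples_leaf=25)

print(f"smallest leaf (fine settings):   {fine}")
print(f"smallest leaf (coarse settings): {coarse}")
```

With small leaves, a synthetic value is drawn from very few real records, which tracks the data closely but comes nearer to copying individuals. Larger leaves pool more records per draw, at some cost in fidelity.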

References

[1] Nowok, B., Raab, G. M., & Dibben, C. (2016). synthpop: Bespoke Creation of Synthetic Data in R. Journal of Statistical Software, 74(11), 1–26. 10.18637/jss.v074.i11

[2] El Emam, K., Mosquera, L., & Zheng, C. (2021). Optimizing the synthesis of clinical trial data using sequential trees. Journal of the American Medical Informatics Association, 28(1), 3–13. 10.1093/jamia/ocaa249

[3] Lautrup, A. D., Hyrup, T., Zimek, A., & Schneider-Kamp, P. (2025). Syntheval: a framework for detailed utility and privacy evaluation of tabular synthetic data. Data Mining and Knowledge Discovery, 39(1), 6. 10.1007/s10618-024-01081-4
