A Python implementation of the synthpop sequential CART model for generating synthetic tabular data

Python sequential CART model for synthetic data generation

Synthpop is a popular R package [1] for generating synthetic data using sequential CART models. Previous Python implementations are not well maintained and do not support the latest versions of scikit-learn. This project implements a simple Python version of the sequential CART model for synthetic data generation, inspired by the approach used in Synthpop. It is designed to be compatible with the latest versions of scikit-learn and pandas, and supports fitting the model to training data, generating synthetic data, and saving and loading fitted models.

While much current research investigates deep learning frameworks for synthetic data generation, such as GANs and VAEs, simpler methods such as sequential CART remain highly competitive in terms of data quality, especially for tabular data with mixed data types and small to medium-sized datasets.
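For readers unfamiliar with the method, the general idea of sequential CART synthesis can be sketched with scikit-learn as follows. This is a minimal illustration and not the code used by PyCART: the function names `fit_sequential_cart` and `generate_sequential_cart` are hypothetical, the sketch assumes numeric columns only, and the toy data is made up for the example. The first column is resampled from its observed values; every later column is modelled by a tree on the columns before it, and synthetic values are drawn from the real values that fell into the matching leaf.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor


def fit_sequential_cart(data, max_depth=5, min_samples_leaf=5, seed=0):
    """Fit one tree per column (after the first), conditioned on the columns before it."""
    data = np.asarray(data, dtype=float)
    models = []
    for j in range(1, data.shape[1]):
        tree = DecisionTreeRegressor(
            max_depth=max_depth, min_samples_leaf=min_samples_leaf, random_state=seed
        )
        tree.fit(data[:, :j], data[:, j])
        # Remember which real values ended up in each leaf (the "donors").
        leaves = tree.apply(data[:, :j])
        donors = {leaf: data[leaves == leaf, j] for leaf in np.unique(leaves)}
        models.append((tree, donors))
    return data[:, 0].copy(), models


def generate_sequential_cart(first_column, models, n_rows, seed=0):
    """Generate rows column by column, in the same order used for fitting."""
    rng = np.random.default_rng(seed)
    synthetic = np.empty((n_rows, len(models) + 1))
    # The first column has no predecessors: resample it from the observed values.
    synthetic[:, 0] = rng.choice(first_column, size=n_rows, replace=True)
    for j, (tree, donors) in enumerate(models, start=1):
        # Route the synthetic rows built so far through the tree ...
        leaves = tree.apply(synthetic[:, :j])
        # ... and draw each new value from the real values in the matching leaf.
        synthetic[:, j] = [rng.choice(donors[leaf]) for leaf in leaves]
    return synthetic


# Toy data (made up for this sketch): income depends loosely on age.
rng = np.random.default_rng(42)
age = rng.integers(18, 80, size=200)
income = 1000 * age + rng.normal(0, 5000, size=200)
real = np.column_stack([age, income])

first, models = fit_sequential_cart(real)
synthetic = generate_sequential_cart(first, models, n_rows=50)
print(synthetic[:5])
```

Because values are drawn from leaf donors, every synthetic value in this sketch is a value that occurs in the real data, which is one reason the tree settings matter for privacy.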

Installation

pip install git+https://github.com/notna07/python-generative-cart#

Usage

Basic example of fitting the PyCART model to a dataset and generating synthetic data. The tree_args parameter is optional, but allows users to pass arguments to the underlying decision tree models, such as maximum depth and minimum samples per leaf, which can impact the quality of the generated synthetic data.

import pandas as pd
from py_cart import PyCART

# Load your dataset
df_real = pd.read_csv("your_dataset.csv")

# Fit the PyCART model
cart = PyCART(tree_args={"max_depth": 5})
cart.fit(df_real)

# Generate synthetic data
df_syn = cart.generate(1000)

The original R Synthpop package [1] had limited functionality for saving and loading fitted models and generating additional synthetic data post hoc. This implementation adds this functionality:

# Save the fitted model
cart.save_model("pycart_model.pkl")

# Load the fitted model
cart_loaded = PyCART()
cart_loaded = cart_loaded.load_model("pycart_model.pkl")

# Generate synthetic data using the loaded model
df_syn_loaded = cart_loaded.generate(1000)

El Emam et al. [2] showed that the order of the sequential synthesis is important for the quality of the synthetic data. Therefore, this implementation also allows users to specify the order in which the variables are synthesized:

# Specify the order of variables to be synthesized
variable_order = ['age', 'income', 'education', ...]
cart.fit(df_real, visit_order=variable_order)

# Generate synthetic data with the specified variable order
df_syn = cart.generate(1000)
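To see why the order matters, it can help to list which variables each model is allowed to condition on. The helper below is a hypothetical illustration, not part of PyCART, and it assumes the usual sequential scheme in which each variable is predicted only from the variables visited before it. It uses a three-column order built from the column names in the example above.

```python
def predictors_by_variable(visit_order):
    """Map each variable to the earlier variables its model can condition on."""
    return {name: list(visit_order[:i]) for i, name in enumerate(visit_order)}


order = ["age", "income", "education"]
for name, predictors in predictors_by_variable(order).items():
    print(f"{name} <- {predictors}")
# age <- []
# income <- ['age']
# education <- ['age', 'income']
```

Under this assumption, a variable placed early in the order is synthesized with little or no conditioning information, while a variable placed late can depend on everything before it.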

Comparison with R Synthpop

To check our implementation against the original R Synthpop package, we ran repeated experiments on eight datasets and evaluated the synthetic data with SynthEval [3] across statistical measures, downstream task performance, and privacy metrics. We also performed statistical tests to compare the results of the two implementations.

The experiments are shown in comparison.ipynb, and the results are saved in the experiments/results directory. The main takeaway is that our implementation appears to perform slightly worse on statistical similarity metrics for datasets with many numerical features, whereas empirical privacy measures improve for most datasets.

Changing the settings of the PyCART model, such as tree depth and minimum samples per leaf, can have a significant impact on the quality of the generated synthetic data. In general, more fine-grained settings lead to better statistical similarity but worse privacy. More coarse-grained settings can improve privacy. Note, however, that even with very coarse settings, the privacy of the generated data is still far from comparable to that of a privacy-focused framework.
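The difference between fine-grained and coarse-grained settings can be made concrete by counting leaves in a single scikit-learn tree. The sketch below is illustrative only: the two presets and the toy data are assumptions chosen for the example, not recommended values, and it assumes that `tree_args` is forwarded to scikit-learn's decision trees under their usual parameter names (`max_depth`, `min_samples_leaf`).

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Hypothetical presets, chosen only to contrast the two extremes.
FINE_GRAINED = {"max_depth": None, "min_samples_leaf": 1}
COARSE_GRAINED = {"max_depth": 3, "min_samples_leaf": 50}

# Toy data (made up for this sketch).
rng = np.random.default_rng(0)
x = rng.uniform(0, 10, size=(500, 1))
y = np.sin(x[:, 0]) + rng.normal(0, 0.1, size=500)

leaf_counts = {}
for name, tree_args in [("fine", FINE_GRAINED), ("coarse", COARSE_GRAINED)]:
    tree = DecisionTreeRegressor(random_state=0, **tree_args).fit(x, y)
    leaf_counts[name] = tree.get_n_leaves()

print(leaf_counts)
```

Fewer leaves mean that each leaf holds more real records, so a synthetic value is drawn from a larger pool. That is one plausible mechanism behind the trade-off described above. If the forwarding assumption holds, a preset could be passed as `PyCART(tree_args=COARSE_GRAINED)`.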

References

[1] Nowok, B., Raab, G. M., & Dibben, C. (2016). synthpop: Bespoke Creation of Synthetic Data in R. Journal of Statistical Software, 74(11), 1–26. 10.18637/jss.v074.i11

[2] El Emam, K., Mosquera, L., & Zheng, C. (2021). Optimizing the synthesis of clinical trial data using sequential trees. Journal of the American Medical Informatics Association, 28(1), 3–13. 10.1093/jamia/ocaa249

[3] Lautrup, A. D., Hyrup, T., Zimek, A., & Schneider-Kamp, P. (2025). Syntheval: a framework for detailed utility and privacy evaluation of tabular synthetic data. Data Mining and Knowledge Discovery, 39(1), 6. 10.1007/s10618-024-01081-4
