
A Python implementation of the synthpop sequential CART model for generating synthetic tabular data


Python sequential CART model for synthetic data generation

Synthpop is a popular R package [1] for generating synthetic data using sequential CART models. Previous Python implementations are not well maintained and do not support the latest versions of scikit-learn. This project implements a simple Python version of the sequential CART model for synthetic data generation, inspired by the approach used in Synthpop. It is designed to be compatible with the latest versions of scikit-learn and pandas, and includes functionality for fitting the model to training data, generating synthetic data, and saving and loading fitted models.
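To make the idea concrete, here is a minimal sketch of sequential CART synthesis for purely numeric columns, written directly against scikit-learn. It is not the PyCART implementation: the function name `sequential_cart_sketch`, the bootstrap of the first column, and the sampling of real "donor" values from tree leaves are assumptions chosen for illustration, loosely following the recipe described for synthpop [1].

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor


def sequential_cart_sketch(X, n_samples, min_samples_leaf=5, seed=0):
    """Illustrative sequential CART synthesis for a numeric 2-D array.

    Hypothetical helper, not part of PyCART.
    """
    rng = np.random.default_rng(seed)
    X = np.asarray(X, dtype=float)
    n_cols = X.shape[1]
    synth = np.empty((n_samples, n_cols))

    # The first column has no predecessors, so resample it from the real values.
    synth[:, 0] = rng.choice(X[:, 0], size=n_samples, replace=True)

    for j in range(1, n_cols):
        # Fit a tree that predicts column j from all previously visited columns.
        tree = DecisionTreeRegressor(
            min_samples_leaf=min_samples_leaf, random_state=seed
        )
        tree.fit(X[:, :j], X[:, j])

        # Group the real target values by the leaf they fall into.
        real_leaves = tree.apply(X[:, :j])
        donors = {
            leaf: X[real_leaves == leaf, j] for leaf in np.unique(real_leaves)
        }

        # Drop each synthetic row down the tree and draw one real value
        # from the leaf it lands in.
        synth_leaves = tree.apply(synth[:, :j])
        synth[:, j] = [rng.choice(donors[leaf]) for leaf in synth_leaves]

    return synth
```

In this sketch every synthetic value is a copy of an observed value, and only the combination of values within a row is new. Categorical columns would need a classifier and an encoding step, which are omitted here.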

Although much current research investigates deep learning frameworks for synthetic data generation, such as GANs and VAEs, simpler methods such as sequential CART remain highly competitive in terms of data quality, especially for tabular data with mixed data types and small to medium-sized datasets.

Installation

pip install git+https://github.com/notna07/python-generative-cart#

Usage

The basic example below fits the PyCART model to a dataset and generates synthetic data. The tree_args parameter is optional; it lets users pass arguments to the underlying decision tree models, such as maximum depth and minimum samples per leaf, which can affect the quality of the generated synthetic data.

import pandas as pd
from py_cart import PyCART

# Load your dataset
df_real = pd.read_csv("your_dataset.csv")

# Fit the PyCART model
cart = PyCART(tree_args={"max_depth": 5})
cart.fit(df_real)

# Generate synthetic data
df_syn = cart.generate(1000)

The original R Synthpop package [1] had limited functionality for saving and loading fitted models and generating additional synthetic data post hoc. This implementation adds this functionality:

# Save the fitted model
cart.save_model("pycart_model.pkl")

# Load the fitted model
cart_loaded = PyCART()
cart_loaded = cart_loaded.load_model("pycart_model.pkl")

# Generate synthetic data using the loaded model
df_syn_loaded = cart_loaded.generate(1000)
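
The `.pkl` extension suggests that models are serialised with Python's `pickle` module, although this README does not say how `save_model` works internally. Under that assumption, the generic round trip looks like the sketch below, which uses only the standard library and a plain dictionary (`toy_state`, a made-up stand-in for a fitted model's state).

```python
import os
import pickle
import tempfile

# Hypothetical stand-in for whatever a fitted model stores; not PyCART's format.
toy_state = {
    "visit_order": ["age", "income", "education"],
    "tree_args": {"max_depth": 5},
}

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "toy_model.pkl")

    # Serialise the object to disk.
    with open(path, "wb") as fh:
        pickle.dump(toy_state, fh)

    # Restore it. Only unpickle files from sources you trust, because
    # loading a pickle can execute arbitrary code.
    with open(path, "rb") as fh:
        restored = pickle.load(fh)
```

If models are indeed pickled, a saved file is best loaded with the same package and scikit-learn versions that produced it, since pickles of fitted estimators are not guaranteed to be portable across versions.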

El Emam et al. [2] showed that the order of sequential synthesis affects the quality of the synthetic data. This implementation therefore allows users to specify the order in which variables are synthesized:

# Specify the order of variables to be synthesized
variable_order = ['age', 'income', 'education', ...]
cart.fit(df_real, visit_order=variable_order)

# Generate synthetic data with the specified variable order
df_syn = cart.generate(1000)
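
Because each variable is modelled from the variables visited before it, a typo in the order can silently change what is synthesized. This README does not state whether `fit` validates the list or accepts partial orders, so the sketch below assumes the order should name every column exactly once and checks that up front. The helper `check_visit_order` is hypothetical and not part of PyCART.

```python
def check_visit_order(columns, visit_order):
    """Raise ValueError unless visit_order is a permutation of columns.

    Hypothetical helper, not part of PyCART.
    """
    columns = list(columns)
    visit_order = list(visit_order)

    # Collect every kind of mismatch so the error message is actionable.
    duplicates = sorted({c for c in visit_order if visit_order.count(c) > 1})
    missing = [c for c in columns if c not in visit_order]
    unknown = [c for c in visit_order if c not in columns]

    if duplicates or missing or unknown:
        raise ValueError(
            f"invalid visit order: duplicates={duplicates}, "
            f"missing={missing}, unknown={unknown}"
        )
    return visit_order
```

It could be called as `check_visit_order(df_real.columns, variable_order)` before fitting.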

Comparison with R Synthpop

To check our implementation against the original R Synthpop package, we ran repeated experiments on eight datasets and conducted a multi-faceted evaluation of the synthetic data using SynthEval [3], covering statistical measures, downstream task performance, and privacy metrics. We also performed statistical tests to compare the results of the two implementations.
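
The specific statistical tests are not named here. As one possible approach for paired per-dataset scores, an exact sign-flip permutation test needs only the standard library and is cheap for eight datasets (2^8 = 256 sign patterns). The function `sign_flip_p_value` and its input convention are assumptions for illustration, not the procedure used in comparison.ipynb.

```python
from itertools import product


def sign_flip_p_value(diffs):
    """Exact two-sided paired permutation test on per-dataset differences.

    Hypothetical helper: `diffs` holds one (Python score - R score) value
    per dataset for a single metric.
    """
    observed = abs(sum(diffs))
    n_extreme = 0
    n_total = 0
    # Under the null hypothesis each difference is equally likely to be
    # positive or negative, so enumerate every possible sign assignment.
    for signs in product((1, -1), repeat=len(diffs)):
        stat = abs(sum(s * d for s, d in zip(signs, diffs)))
        # Small tolerance guards against floating-point round-off in ties.
        n_extreme += stat >= observed - 1e-12
        n_total += 1
    return n_extreme / n_total
```

With eight datasets the smallest attainable p-value is 2/256, reached only when all differences share the same sign.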

The experiments are shown in comparison.ipynb, and the results are saved in the experiments/results directory. The main takeaway is that, compared with R Synthpop, our implementation appears to perform slightly worse on statistical similarity metrics for datasets with many numerical features. On the other hand, empirical privacy measures are improved for most datasets.

Changing the settings of the PyCART model, such as tree depth and minimum samples per leaf, can have a significant impact on the quality of the generated synthetic data. In general, finer-grained settings lead to better statistical similarity but worse privacy. Coarser-grained settings can improve privacy; however, even with very restrictive settings, the privacy of the generated data is still far from comparable to that of a privacy-focused framework.
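
One way to build intuition for this trade-off is to count how many real records share each leaf of a tree. In synthpop-style CART synthesis, values are drawn from the real records in a leaf, so smaller leaves mean each synthetic value comes from a smaller pool of real records. The sketch below uses scikit-learn directly on randomly generated toy data; the helper name `leaf_sizes` is hypothetical and the code does not call PyCART.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor


def leaf_sizes(X, y, **tree_args):
    """Return the number of training rows in each leaf of a fitted tree."""
    tree = DecisionTreeRegressor(random_state=0, **tree_args).fit(X, y)
    _, counts = np.unique(tree.apply(X), return_counts=True)
    return counts


# Toy data only: three random features and a noisy target.
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 3))
y = X[:, 0] + rng.normal(scale=0.5, size=500)

fine = leaf_sizes(X, y, min_samples_leaf=2)
coarse = leaf_sizes(X, y, max_depth=3, min_samples_leaf=50)

print("fine:  ", len(fine), "leaves, smallest holds", fine.min(), "rows")
print("coarse:", len(coarse), "leaves, smallest holds", coarse.min(), "rows")
```

The coarse tree is limited to at most eight leaves with at least 50 rows each, whereas the fine tree is allowed to isolate pairs of records.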

References

[1] Nowok, B., Raab, G. M., & Dibben, C. (2016). synthpop: Bespoke Creation of Synthetic Data in R. Journal of Statistical Software, 74(11), 1–26. 10.18637/jss.v074.i11

[2] El Emam, K., Mosquera, L., & Zheng, C. (2021). Optimizing the synthesis of clinical trial data using sequential trees. Journal of the American Medical Informatics Association, 28(1), 3–13. 10.1093/jamia/ocaa249

[3] Lautrup, A. D., Hyrup, T., Zimek, A., & Schneider-Kamp, P. (2025). Syntheval: a framework for detailed utility and privacy evaluation of tabular synthetic data. Data Mining and Knowledge Discovery, 39(1), 6. 10.1007/s10618-024-01081-4


