A Python implementation of the synthpop sequential CART model for generating synthetic tabular data

Python sequential CART model for synthetic data generation

Synthpop is a popular R package [1] for generating synthetic data using sequential CART models. Previous Python implementations are not well maintained and do not support the latest versions of scikit-learn. This project implements a simple Python version of the sequential CART model for synthetic data generation, inspired by the approach used in Synthpop. It is designed to be compatible with the latest versions of scikit-learn and pandas, and supports fitting the model to training data, generating synthetic data, and saving and loading fitted models.

Although much current research investigates deep learning frameworks for synthetic data generation, such as GANs and VAEs, simpler methods such as sequential CART remain highly competitive in terms of data quality, especially for tabular data with mixed data types and small to medium-sized datasets.
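For readers unfamiliar with the method, the following is a minimal sketch of the sequential CART idea, not the code used in this package. It assumes numeric columns only and uses scikit-learn's `DecisionTreeRegressor`; the function names `fit_sequential_cart_sketch` and `generate_sketch` are hypothetical. The first column is resampled from its observed values. Each later column gets a tree fitted on the columns before it, and a synthetic value is drawn from the training values that fell into the same leaf.

```python
import numpy as np
import pandas as pd
from sklearn.tree import DecisionTreeRegressor


def fit_sequential_cart_sketch(df, tree_args=None):
    """Fit one tree per column (after the first), using the earlier columns as predictors."""
    order = list(df.columns)
    # Default settings are an assumption made for this sketch.
    tree_args = tree_args or {"min_samples_leaf": 5, "random_state": 0}
    models = {}
    for i, col in enumerate(order[1:], start=1):
        predictors = order[:i]
        X = df[predictors].to_numpy()
        y = df[col].to_numpy()
        tree = DecisionTreeRegressor(**tree_args).fit(X, y)
        # Group the observed target values by the leaf their row ends up in ("donors").
        leaf_ids = tree.apply(X)
        donors = {leaf: y[leaf_ids == leaf] for leaf in np.unique(leaf_ids)}
        models[col] = (predictors, tree, donors)
    return {"order": order, "first_values": df[order[0]].to_numpy(), "models": models}


def generate_sketch(fitted, n, seed=0):
    """Generate n synthetic rows, one column at a time, in the fitted order."""
    rng = np.random.default_rng(seed)
    order = fitted["order"]
    # First column: sample with replacement from the observed values.
    syn = pd.DataFrame({order[0]: rng.choice(fitted["first_values"], size=n)})
    for col in order[1:]:
        predictors, tree, donors = fitted["models"][col]
        # Route the synthetic rows through the tree, then draw a donor value per row.
        leaf_ids = tree.apply(syn[predictors].to_numpy())
        syn[col] = [rng.choice(donors[leaf]) for leaf in leaf_ids]
    return syn


# Toy data for illustration only.
demo_rng = np.random.default_rng(42)
age = demo_rng.integers(18, 80, size=200).astype(float)
income = 1000.0 + 50.0 * age + demo_rng.normal(0.0, 200.0, size=200)
df_demo = pd.DataFrame({"age": age, "income": income})
df_demo_syn = generate_sketch(fit_sequential_cart_sketch(df_demo), 50)
```

Because values are drawn from leaf donors, every synthetic value in this sketch is a value that occurs in the training data. Categorical columns would need a classification tree and an encoding step, which are left out here.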

Installation

pip install git+https://github.com/notna07/python-generative-cart#

Usage

Basic example of fitting the PyCART model to a dataset and generating synthetic data. The tree_args parameter is optional, but allows users to pass arguments to the underlying decision tree models, such as maximum depth and minimum samples per leaf, which can impact the quality of the generated synthetic data.

import pandas as pd
from py_cart import PyCART

# Load your dataset
df_real = pd.read_csv("your_dataset.csv")

# Fit the PyCART model
cart = PyCART(tree_args = {"max_depth": 5})
cart.fit(df_real)

# Generate synthetic data
df_syn = cart.generate(1000)

The original R Synthpop package [1] had limited functionality for saving and loading fitted models and generating additional synthetic data post hoc. This implementation adds this functionality:

# Save the fitted model
cart.save_model("pycart_model.pkl")

# Load the fitted model
cart_loaded = PyCART()
cart_loaded = cart_loaded.load_model("pycart_model.pkl")

# Generate synthetic data using the loaded model
df_syn_loaded = cart_loaded.generate(1000)
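
The `.pkl` extension suggests pickle-based persistence, but that is an assumption about the internals. As a rough sketch of what a save/load round trip looks like with the standard library, the helpers `save_state` and `load_state` below are hypothetical and operate on a plain dictionary rather than a PyCART object:

```python
import pickle
import tempfile
from pathlib import Path


def save_state(state, path):
    """Serialize a picklable object to disk."""
    with open(path, "wb") as fh:
        pickle.dump(state, fh, protocol=pickle.HIGHEST_PROTOCOL)


def load_state(path):
    """Load a pickled object. Only unpickle files from a trusted source."""
    with open(path, "rb") as fh:
        return pickle.load(fh)


# Hypothetical stand-in for the state of a fitted model.
state = {"visit_order": ["age", "income", "education"], "tree_args": {"max_depth": 5}}

with tempfile.TemporaryDirectory() as tmp:
    model_path = Path(tmp) / "sketch_model.pkl"
    save_state(state, model_path)
    restored = load_state(model_path)
```

Pickle files can execute arbitrary code when loaded, so a saved model should only be loaded if you trust where it came from.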

El Emam et al. [2] showed that the order of the sequential synthesis is important for the quality of the synthetic data. Therefore, this implementation also allows users to specify the order in which the variables are synthesized:

# Specify the order of variables to be synthesized
variable_order = ['age', 'income', 'education', ...]
cart.fit(df_real, visit_order = variable_order)

# Generate synthetic data with the specified variable order
df_syn = cart.generate(1000)
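
To see why the order matters, note that in sequential synthesis each variable is conditioned only on the variables that come before it. The helper below is a hypothetical illustration, not part of the package; it lists the predictor set that each variable would get under a given order:

```python
def predictor_sets(visit_order):
    """Map each variable to the earlier variables it would be conditioned on."""
    if len(set(visit_order)) != len(visit_order):
        raise ValueError("visit_order contains duplicate column names")
    return {col: list(visit_order[:i]) for i, col in enumerate(visit_order)}


plan_a = predictor_sets(["age", "income", "education"])
plan_b = predictor_sets(["education", "age", "income"])

for name, plan in (("order A", plan_a), ("order B", plan_b)):
    for col, predictors in plan.items():
        print(f"{name}: {col} <- {predictors or 'marginal sample'}")
```

Under order A, `age` is sampled without any predictors, while under order B it is modeled from `education`, so the two orders fit different conditional models to the same data.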

Comparison with R Synthpop

To check our implementation against the original R Synthpop package, we ran repeated experiments on eight datasets, and conducted a multi-faceted evaluation of the synthetic data using SynthEval [3] across various statistical measures, down-stream task performance, and privacy metrics. We also performed statistical tests to compare the results between the two implementations.

The experiments are in comparison.ipynb, and the results are saved in the experiments/results directory. The main takeaway is that our implementation appears to perform slightly worse on statistical similarity metrics for datasets with many numerical features. On the other hand, empirical privacy measures improve for most datasets.
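
The text above does not state which statistical tests were used. As one possible way to compare two implementations across a small number of datasets, here is a sketch of an exact paired sign-flip permutation test using only the standard library. The function name and the toy numbers are hypothetical and are not taken from the experiments.

```python
from itertools import product


def paired_sign_flip_pvalue(scores_a, scores_b):
    """Two-sided exact sign-flip test on the paired differences."""
    if len(scores_a) != len(scores_b):
        raise ValueError("both score lists must have the same length")
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    observed = abs(sum(diffs))
    hits = 0
    total = 0
    # Enumerate every way of flipping the sign of each paired difference.
    for signs in product((1, -1), repeat=len(diffs)):
        total += 1
        flipped = abs(sum(s * d for s, d in zip(signs, diffs)))
        if flipped >= observed - 1e-12:
            hits += 1
    return hits / total


# Toy placeholder scores for eight datasets. NOT results from the experiments.
toy_scores_a = [0.81, 0.77, 0.90, 0.68, 0.74, 0.85, 0.79, 0.88]
toy_scores_b = [0.80, 0.75, 0.87, 0.64, 0.69, 0.79, 0.72, 0.80]
p_value = paired_sign_flip_pvalue(toy_scores_a, toy_scores_b)
print(f"two-sided p-value: {p_value:.4f}")
```

With eight paired datasets there are only 256 sign assignments, so the test can be computed exactly instead of by random sampling.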

Changing the settings of the PyCART model, such as tree depth and minimum samples per leaf, can have a significant impact on the quality of the generated synthetic data. In general, finer-grained settings (deeper trees, fewer samples per leaf) lead to better statistical similarity but worse privacy. Coarser-grained settings can improve privacy. However, even with very restrictive settings, the privacy of the generated data is still far from comparable to that of a privacy-focused framework.

References

[1] Nowok, B., Raab, G. M., & Dibben, C. (2016). synthpop: Bespoke Creation of Synthetic Data in R. Journal of Statistical Software, 74(11), 1–26. 10.18637/jss.v074.i11

[2] El Emam, K., Mosquera, L., & Zheng, C. (2021). Optimizing the synthesis of clinical trial data using sequential trees. Journal of the American Medical Informatics Association, 28(1), 3–13. 10.1093/jamia/ocaa249

[3] Lautrup, A. D., Hyrup, T., Zimek, A., & Schneider-Kamp, P. (2025). Syntheval: a framework for detailed utility and privacy evaluation of tabular synthetic data. Data Mining and Knowledge Discovery, 39(1), 6. 10.1007/s10618-024-01081-4
