A Python implementation of the synthpop sequential CART model for generating synthetic tabular data
Python sequential CART model for synthetic data generation
Synthpop is a popular R package [1] for generating synthetic data using sequential CART models. Previous Python implementations are not well maintained and do not support the latest versions of scikit-learn. This project implements a simple Python version of the sequential CART model for synthetic data generation, inspired by the approach used in Synthpop. The implementation is designed to be compatible with the latest versions of scikit-learn and pandas, and includes functionality for fitting the model to training data, generating synthetic data, and saving/loading fitted models.
While much current research investigates deep learning frameworks for synthetic data generation, such as GANs and VAEs, simpler methods such as sequential CART remain highly competitive in terms of data quality, especially for tabular data with mixed data types and small to medium-sized datasets.
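To make the idea of sequential CART synthesis concrete, here is a minimal sketch written directly against scikit-learn. It is not the PyCART source code: the function name `sequential_cart_sketch`, its parameters, and the restriction to all-numeric columns are assumptions made for illustration. The first column is resampled from its observed values, and every later column is drawn from the real values found in the tree leaf that each synthetic row falls into.

```python
import numpy as np
import pandas as pd
from sklearn.tree import DecisionTreeRegressor


def sequential_cart_sketch(df, n_samples, max_depth=5, min_samples_leaf=5, seed=0):
    """Illustrative sequential CART synthesis for an all-numeric DataFrame.

    Hypothetical helper for this README, not part of the PyCART API.
    """
    rng = np.random.default_rng(seed)
    columns = list(df.columns)

    # The first variable has no predictors, so resample its observed values.
    first = columns[0]
    syn = pd.DataFrame({first: rng.choice(df[first].to_numpy(), size=n_samples)})

    for i, target in enumerate(columns[1:], start=1):
        predictors = columns[:i]
        tree = DecisionTreeRegressor(
            max_depth=max_depth, min_samples_leaf=min_samples_leaf, random_state=seed
        )
        tree.fit(df[predictors], df[target])

        # Group the real target values by the leaf each real row ends up in.
        real_leaves = tree.apply(df[predictors])
        real_values = df[target].to_numpy()
        donors = {leaf: real_values[real_leaves == leaf] for leaf in np.unique(real_leaves)}

        # Route each synthetic row to a leaf and draw one real value from that leaf.
        syn_leaves = tree.apply(syn[predictors])
        syn[target] = [rng.choice(donors[leaf]) for leaf in syn_leaves]

    return syn


# Toy data with made-up column names, only to show the call.
toy_rng = np.random.default_rng(1)
toy = pd.DataFrame({"age": toy_rng.integers(18, 80, size=200).astype(float)})
toy["income"] = 1000.0 * toy["age"] + toy_rng.normal(scale=5000.0, size=200)
print(sequential_cart_sketch(toy, n_samples=5))
```

Handling categorical columns would additionally need an encoding step and a classification tree, which this sketch leaves out.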
Installation
pip install git+https://github.com/notna07/python-generative-cart
Usage
The following basic example fits the PyCART model to a dataset and generates synthetic data. The optional tree_args parameter passes arguments to the underlying decision tree models, such as maximum depth and minimum samples per leaf, which can affect the quality of the generated synthetic data.
import pandas as pd
from py_cart import PyCART
# Load your dataset
df_real = pd.read_csv("your_dataset.csv")
# Fit the PyCART model
cart = PyCART(tree_args={"max_depth": 5})
cart.fit(df_real)
# Generate synthetic data
df_syn = cart.generate(1000)
The original R Synthpop package [1] had limited functionality for saving and loading fitted models and generating additional synthetic data post hoc. This implementation adds this functionality:
# Save the fitted model
cart.save_model("pycart_model.pkl")
# Load the fitted model
cart_loaded = PyCART()
cart_loaded = cart_loaded.load_model("pycart_model.pkl")
# Generate synthetic data using the loaded model
df_syn_loaded = cart_loaded.generate(1000)
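The .pkl extension suggests pickle-style serialization, although that is an assumption about save_model and load_model rather than something stated here. As a rough sketch of what such a round trip involves, the snippet below pickles a fitted scikit-learn tree to a temporary file, restores it, and checks that the restored tree predicts identically. The toy data and file name are made up for illustration.

```python
import pickle
import tempfile
from pathlib import Path

import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Fit a small tree on toy data.
X = np.arange(20, dtype=float).reshape(-1, 1)
y = X.ravel() ** 2
tree = DecisionTreeRegressor(max_depth=3, random_state=0).fit(X, y)

# Write the fitted tree to disk and read it back.
with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "tree.pkl"
    path.write_bytes(pickle.dumps(tree))
    restored = pickle.loads(path.read_bytes())

# The restored tree should reproduce the original predictions exactly.
same_predictions = bool(np.array_equal(tree.predict(X), restored.predict(X)))
print(same_predictions)
```

If the model files are indeed pickles, the usual cautions apply: only load files from trusted sources, and expect that a file written under one scikit-learn version may not load cleanly under another.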
El Emam et al. [2] showed that the order of the sequential synthesis is important for the quality of the synthetic data. Therefore, this implementation also allows users to specify the order in which the variables are synthesized:
# Specify the order of variables to be synthesized
variable_order = ['age', 'income', 'education', ...]
cart.fit(df_real, visit_order=variable_order)
# Generate synthetic data with the specified variable order
df_syn = cart.generate(1000)
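One way to see why the order matters: in sequential synthesis, each variable is modelled conditional on the variables that come before it in the visit order. The small helper below lists those conditioning sets for a given order. The name `conditioning_sets` is hypothetical and describes the general scheme, not necessarily how PyCART stores it internally.

```python
def conditioning_sets(visit_order):
    """Map each variable to the variables synthesized before it."""
    return {col: list(visit_order[:i]) for i, col in enumerate(visit_order)}


order_a = conditioning_sets(["age", "income", "education"])
order_b = conditioning_sets(["education", "income", "age"])

print(order_a)
# {'age': [], 'income': ['age'], 'education': ['age', 'income']}
print(order_b)
# {'education': [], 'income': ['education'], 'age': ['education', 'income']}
```

In the first order, age is drawn without any predictors; in the second, it is modelled from education and income, so the same data can produce different trees and different synthetic output.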
Comparison with R Synthpop
To check our implementation against the original R Synthpop package, we ran repeated experiments on eight datasets and conducted a multi-faceted evaluation of the synthetic data using SynthEval [3] across various statistical measures, downstream task performance, and privacy metrics. We also performed statistical tests to compare the results between the two implementations.
The experiments are shown in comparison.ipynb, and the results are saved in the experiments/results directory. The main takeaway is that our implementation appears to perform slightly worse on statistical similarity metrics for datasets with many numerical features; on the other hand, empirical privacy measures improve for most datasets.
Changing the settings of the PyCART model, such as tree depth and minimum samples per leaf, can have a significant impact on the quality of the generated synthetic data. In general, more fine-grained settings lead to better statistical similarity but worse privacy. Coarser settings can improve privacy; however, even with very restrictive settings, the privacy of the generated data is still far from comparable to that of a privacy-focused framework.
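A rough intuition for this trade-off can be had by looking at leaf sizes in a plain scikit-learn tree. The sketch below uses simulated data and a hypothetical helper, `leaf_summary`, to compare a coarse and a fine-grained configuration. With coarse settings each leaf pools many real records, while an unconstrained tree can isolate individual records in their own leaves, so values sampled from those leaves track single real rows more closely. It is an illustration under these assumptions, not a reproduction of the experiments described above.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Simulated data: the target depends on the first feature plus noise.
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 2))
y = X[:, 0] + rng.normal(scale=0.1, size=500)


def leaf_summary(**tree_args):
    """Return (number of leaves, smallest leaf size) for the given tree settings."""
    tree = DecisionTreeRegressor(random_state=0, **tree_args).fit(X, y)
    # apply() returns the leaf node id of each row; internal nodes get a count of 0.
    counts = np.bincount(tree.apply(X))
    leaf_sizes = counts[counts > 0]
    return int(tree.get_n_leaves()), int(leaf_sizes.min())


coarse = leaf_summary(max_depth=3, min_samples_leaf=20)
fine = leaf_summary(max_depth=None, min_samples_leaf=1)

print("coarse (leaves, smallest leaf):", coarse)
print("fine   (leaves, smallest leaf):", fine)
```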
References
[1] Nowok, B., Raab, G. M., & Dibben, C. (2016). synthpop: Bespoke Creation of Synthetic Data in R. Journal of Statistical Software, 74(11), 1–26. doi:10.18637/jss.v074.i11
[2] El Emam, K., Mosquera, L., & Zheng, C. (2021). Optimizing the synthesis of clinical trial data using sequential trees. Journal of the American Medical Informatics Association, 28(1), 3–13. doi:10.1093/jamia/ocaa249
[3] Lautrup, A. D., Hyrup, T., Zimek, A., & Schneider-Kamp, P. (2025). Syntheval: a framework for detailed utility and privacy evaluation of tabular synthetic data. Data Mining and Knowledge Discovery, 39(1), 6. doi:10.1007/s10618-024-01081-4