Comprehensive evaluation framework for tabular synthetic data generators
Project description
synth-data-eval repo
A collaborative research project investigating methods for generating and evaluating synthetic tabular data across multiple domains. This repository contains reproducible code, datasets, and experiment configurations used in our paper preparation.
Project Overview
Synthetic data is crucial for privacy-preserving machine learning. This project evaluates different synthetic data generators (CTGAN, TVAE, Gaussian Copula, KAN-CTGAN, KAN-TVAE) across statistical fidelity, ML utility, privacy, and data quality.
Research Objective: To provide a systematic benchmark framework and identify trade-offs between realism, privacy, and downstream task performance.
Installation
From PyPI (Recommended)
pip install synth-data-eval
From Source (Development)
git clone https://github.com/ahmed-fouad-lagha/synth-data-eval.git
cd synth-data-eval
pip install -e ".[all]" # Install with all optional dependencies
Optional Dependencies
pip install -e ".[dev]" # Development tools (pytest, mypy, black, etc.)
pip install -e ".[docs]" # Documentation building
pip install -e ".[notebooks]" # Jupyter notebook support
Download Datasets
python scripts/download_low_resource_datasets.py
Repository Structure
synthetic-tabular-eval/
├── pyproject.toml
├── README.md
├── CONTRIBUTING.md
├── LICENSE
├── .gitignore
├── generators/
│ ├── __init__.py
│ ├── base_generator.py
│ ├── ctgan_model.py
│ ├── tvae_model.py
│ ├── gaussian_copula.py
│ ├── kan_ctgan_model.py
│ ├── kan_tvae_model.py
│ ├── KAN_code.py
│ ├── KAN_CTGAN_code.py
│ └── KAN_TVAE_code.py
├── evaluation/
│ ├── __init__.py
│ ├── sdmetrics_evaluation.py
│ ├── ml_utility.py
│ └── privacy_metrics.py
├── scripts/
│ ├── config.yaml
│ ├── run_benchmark.py
│ ├── visualize_results.py
│ └── download_datasets.py
├── tests/
│ ├── __init__.py
│ ├── test_generators.py
│ └── test_evaluation.py
├── datasets/
├── results/
└── logs/
Experimental Setup
Datasets
We evaluated on five benchmark datasets for comprehensive evaluation:
- Adult Income: 32,561 training samples, 14 features (8 categorical, 6 numerical) - Classification
- Credit Card Default: 30,000 training samples, 23 features (mixed) - Classification
- Diabetes: 442 training samples, 10 numerical features - Regression
- California Housing: 20,640 training samples, 8 numerical features - Regression
- Wine Quality: 1,599 training samples, 11 numerical features - Regression
Generators
- CTGAN: GAN-based with mode-specific normalization for categorical data
- TVAE: Variational autoencoder approach optimized for tabular data
- Gaussian Copula: Parametric baseline using copula-based modeling
- KAN-CTGAN: Kolmogorov-Arnold Networks enhanced CTGAN with spline-based function approximation
- KAN-TVAE: Kolmogorov-Arnold Networks enhanced TVAE with spline-based function approximation
KAN Integration
The project now includes Kolmogorov-Arnold Networks (KAN) enhanced versions of CTGAN and TVAE generators. KANs use learnable spline-based activation functions instead of traditional MLPs, potentially offering better expressiveness and performance on complex tabular data distributions.
KAN Features:
- Spline-based function approximation for enhanced modeling capacity
- Configurable grid size and spline order parameters
- Compatible with existing evaluation pipeline
- Requires even batch sizes for CTGAN variant
- KAN-CTGAN: Kolmogorov-Arnold Networks enhanced CTGAN with spline-based function approximation
- KAN-TVAE: Kolmogorov-Arnold Networks enhanced TVAE with spline-based function approximation
Evaluation Metrics
- Statistical Fidelity: Correlation similarity, Kolmogorov-Smirnov complement
- ML Utility: Train-on-Synthetic-Test-on-Real (TSTR) paradigm with utility ratios
- Privacy: Distance to Closest Record (DCR), Nearest Neighbor Distance Ratio (NNDR)
Implementation Details
- 5 independent runs per configuration for statistical robustness
- 300 epochs for deep learning models (CTGAN, TVAE)
- Python 3.10, SDV 1.28, CTGAN 0.7
- Statistical significance testing with t-tests and confidence intervals
Key Findings
Performance Highlights:
- TVAE excels on classification tasks (Adult Income: 0.908 ± 0.028 utility ratio)
- Gaussian Copula dominates regression tasks (Diabetes: 0.964 ± 0.000 utility ratio)
- Massive training time differences: CTGAN (1022s) vs Gaussian Copula (4.9s) = 200x efficiency gap
- 8 statistically significant differences detected across metrics and datasets
Trade-offs Identified:
- GAN-based generators (CTGAN, TVAE) show negative utility on small regression datasets
- Gaussian Copula provides best privacy-utility balance, especially for smaller datasets
- Dataset size significantly impacts generator performance and optimal choice
Experiment Pipeline
Completed Research Workflow:
- Data Preparation: 5 diverse datasets (Adult Income 32K, Credit 30K, California Housing 20K, Wine Quality 1.6K, Diabetes 442 samples)
- Generation: 5 independent runs each of CTGAN (300 epochs), TVAE (300 epochs), Gaussian Copula
- Evaluation: Statistical fidelity (SDMetrics), ML utility (TSTR paradigm), privacy metrics (DCR, NNDR)
- Analysis: Statistical significance testing, confidence intervals, comprehensive visualizations
Key Scripts:
scripts/run_benchmark.py- Execute complete experimental pipelinescripts/statistical_analysis.py- Generate significance tests and LaTeX tablesscripts/visualize_results.py- Create radar plots, heatmaps, and utility comparisonspaper/main.tex- Complete research paper with results and analysis
Reproducing Results
# 1. Install dependencies
pip install -e ".[all]"
# 2. Download datasets
python scripts/download_datasets.py
# 3. Run complete benchmark (will take several hours)
python scripts/run_benchmark.py
# 4. Generate statistical analysis
python scripts/statistical_analysis.py
# 5. Create visualizations
python scripts/visualize_results.py
# 6. Compile paper
cd paper && pdflatex main.tex
Development
Prerequisites
- Python 3.8+
- pip
Setup
# Clone the repository
git clone https://github.com/ahmed-fouad-lagha/synth-data-eval.git
cd synth-data-eval
# Install in development mode with all dependencies
pip install -e ".[dev,docs,notebooks]"
# Optional: Install pre-commit hooks for code quality
pip install pre-commit
pre-commit install
Testing
# Run all tests
pytest
# Run with coverage
pytest --cov=generators --cov=evaluation
# Run specific test file
pytest tests/test_generators.py
Code Quality
# Format code
black .
isort .
# Lint code
flake8 .
# Type check
mypy generators/ evaluation/ scripts/
Documentation
# Build documentation
cd docs
sphinx-build -b html . _build/html
# View documentation
open _build/html/index.html
CI/CD
This project uses GitHub Actions for continuous integration:
- CI Pipeline: Runs on every push/PR with testing, linting, documentation building, and security scanning
- Multi-Python Support: Tests on Python 3.8, 3.9, 3.10, and 3.11
- Code Quality: Automated checks for formatting, linting, and type safety
- Coverage: Code coverage reporting with Codecov integration
- Security: Automated vulnerability scanning
- Release: Automated PyPI publishing on version tags
Creating Releases
Automated Release Process
Use the provided release script for consistent versioning and publishing:
# Patch release (0.1.0 -> 0.1.1)
python scripts/make_release.py patch
# Minor release (0.1.0 -> 0.2.0)
python scripts/make_release.py minor
# Major release (0.1.0 -> 1.0.0)
python scripts/make_release.py major
# Specific version release
python scripts/make_release.py v1.0.0
The script will:
- ✅ Run all quality checks (tests, linting, type checking)
- ✅ Update version in
pyproject.toml - ✅ Update
CHANGELOG.mdwith release date - ✅ Build and validate the package
- ✅ Create a git tag and push to trigger PyPI publishing
Manual Release Process
If you prefer manual control:
- Update version in
pyproject.toml - Update
CHANGELOG.md - Commit changes:
git commit -m "Release v1.0.0" - Create tag:
git tag -a v1.0.0 -m "Release v1.0.0" - Push:
git push origin v1.0.0 - GitHub Actions will automatically publish to PyPI
Testing Releases
You can test releases on TestPyPI before publishing to production:
- Go to GitHub Actions → Release workflow
- Click "Run workflow"
- Select "testpypi" target
- Install from TestPyPI:
pip install --index-url https://test.pypi.org/simple/ synth-data-eval
Private Repository Setup
If your repository is private, GitHub release creation requires a Personal Access Token (PAT):
-
Create a Personal Access Token (PAT):
- Go to https://github.com/settings/tokens
- Generate a new token with
reposcope - Copy the token
-
Add to Repository Secrets:
- Go to your repo → Settings → Secrets and variables → Actions
- Add a new secret named
RELEASE_TOKEN - Paste your PAT as the value
-
✅ Status: RELEASE_TOKEN is now configured - GitHub releases will work automatically!
Repository Policy
This repository is currently private and contains research code under development. It will be made public upon publication of the associated research paper to ensure proper attribution and compliance with venue policies.
- Do not upload confidential or non-public datasets
- Results and scripts shared here are for pre-publication collaboration only
- Contact authors for pre-publication access requests
License
This repository contains research code that will be made publicly available under the MIT License upon publication of the associated research paper.
For pre-publication access, please contact the authors.
Project details
Download files
Download the file for your platform. If you're not sure which to choose, learn more about installing packages.
Source Distribution
Built Distribution
Filter files by name, interpreter, ABI, and platform.
If you're not sure about the file name format, learn more about wheel file names.
Copy a direct link to the current filters
File details
Details for the file synth_data_eval-0.1.6.tar.gz.
File metadata
- Download URL: synth_data_eval-0.1.6.tar.gz
- Upload date:
- Size: 91.0 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.10.19
File hashes
| Algorithm | Hash digest | |
|---|---|---|
| SHA256 |
99299798ca8aa08a04a635d88e1df308de34f3d367ca09fdb94c39b6c84aa69f
|
|
| MD5 |
020b3e404a69e001588a358a2a68b884
|
|
| BLAKE2b-256 |
c80999c7d2740bb1f40190cf5c2bec6eacee6f54309134e9e527cb5cb7a118ef
|
File details
Details for the file synth_data_eval-0.1.6-py3-none-any.whl.
File metadata
- Download URL: synth_data_eval-0.1.6-py3-none-any.whl
- Upload date:
- Size: 106.2 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.10.19
File hashes
| Algorithm | Hash digest | |
|---|---|---|
| SHA256 |
b915323d84b8737ee668c194224d15e05680b7eb812fd1c7134cac9c75cf6cea
|
|
| MD5 |
76bf011f084bcda1a61abd40fbe66ff1
|
|
| BLAKE2b-256 |
9751c23feac4806964bfe6c62d84b35b0c3195b41a7754b42ced97156a0a1a8a
|