Privacy-Preserving Synthetic Tabular Data Generation
TabularForge
What is TabularForge?
TabularForge is a unified, production-ready Python library for generating high-quality synthetic tabular data with built-in privacy guarantees. It combines multiple state-of-the-art approaches (GANs, VAEs, Copulas) into a simple, one-line API.
Why Synthetic Data?
Organizations hold valuable tabular data (patient records, financial transactions, customer data) but often can't share or fully use it because of:
- Privacy regulations (GDPR, HIPAA, CCPA)
- Competitive sensitivity
- Data scarcity for ML development
Synthetic data addresses this by generating realistic records that are statistically similar to the originals, aiming to protect individual privacy while preserving analytical utility.
Key Features
| Feature | Description |
|---|---|
| Multiple Generators | CTGAN, TVAE, Gaussian Copula, and more |
| Differential Privacy | Formal privacy guarantees with configurable epsilon |
| Quality Metrics | Statistical similarity, ML utility, privacy leakage tests |
| Auto Preprocessing | Handles mixed types, missing values, imbalanced data |
| One-Line API | Generate synthetic data in a single line of code |
| Benchmarking | Compare generators on your specific data |
Quick Start
Installation
# Install from PyPI
pip install tabularforge-sgk
# Or install directly from GitHub
pip install git+https://github.com/ganeshreddy28/tabularforge.git
# Or install from source
git clone https://github.com/ganeshreddy28/tabularforge.git
cd tabularforge
pip install -e .
Basic Usage
from tabularforge import TabularForge
import pandas as pd
# Load your real data
real_data = pd.read_csv("your_data.csv")
# Generate synthetic data in ONE line!
forge = TabularForge(real_data)
synthetic_data = forge.generate(n_samples=1000)
# That's it! synthetic_data is a pandas DataFrame
print(synthetic_data.head())
With Privacy Guarantees
from tabularforge import TabularForge
# Generate with differential privacy (epsilon=1.0)
forge = TabularForge(real_data, privacy_epsilon=1.0)
private_synthetic = forge.generate(n_samples=1000)
# Check privacy metrics
privacy_report = forge.evaluate_privacy()
print(privacy_report)
Compare Different Generators
from tabularforge import TabularForge
# Benchmark all available generators
forge = TabularForge(real_data)
benchmark_results = forge.benchmark(generators=['ctgan', 'tvae', 'copula'])
# See which generator works best for your data
print(benchmark_results)
Detailed Usage
Choosing a Generator
TabularForge supports multiple synthetic data generators:
| Generator | Best For | Speed | Quality |
|---|---|---|---|
| copula | Simple distributions, fast generation | ⚡⚡⚡ | ⭐⭐⭐ |
| ctgan | Complex relationships, mixed types | ⚡⚡ | ⭐⭐⭐⭐ |
| tvae | High-dimensional data | ⚡⚡ | ⭐⭐⭐⭐ |
# Specify a generator
forge = TabularForge(real_data, generator='ctgan')
synthetic = forge.generate(n_samples=500)
Handling Different Data Types
TabularForge automatically detects and handles:
- Numerical columns (continuous and discrete)
- Categorical columns (including high-cardinality)
- DateTime columns
- Missing values
# Explicit column type specification (optional)
forge = TabularForge(
    real_data,
    categorical_columns=['gender', 'country', 'product_type'],
    numerical_columns=['age', 'income', 'score'],
    datetime_columns=['signup_date', 'last_purchase']
)
Evaluating Synthetic Data Quality
from tabularforge import TabularForge
forge = TabularForge(real_data)
synthetic = forge.generate(n_samples=1000)
# Get comprehensive quality report
quality_report = forge.evaluate_quality(synthetic)
print(quality_report)
# Example output:
# {
#     'statistical_similarity': 0.92,
#     'column_correlations': 0.89,
#     'distribution_match': 0.94,
#     'ml_utility': 0.87
# }
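The report includes a `distribution_match` field. As one way such a per-column score could be computed, the sketch below uses 1 minus the two-sample Kolmogorov-Smirnov statistic. This is an assumption for illustration and not necessarily the metric TabularForge uses; `ks_statistic` and `distribution_match_score` are hypothetical helper names.

```python
from bisect import bisect_right


def ks_statistic(real, synthetic):
    """Largest gap between the two empirical CDFs (0 = identical, 1 = disjoint)."""
    real_sorted = sorted(real)
    synth_sorted = sorted(synthetic)
    max_gap = 0.0
    # The gap can only change at observed values, so checking those is enough.
    for value in real_sorted + synth_sorted:
        cdf_real = bisect_right(real_sorted, value) / len(real_sorted)
        cdf_synth = bisect_right(synth_sorted, value) / len(synth_sorted)
        max_gap = max(max_gap, abs(cdf_real - cdf_synth))
    return max_gap


def distribution_match_score(real, synthetic):
    """Hypothetical similarity score in [0, 1]; higher means closer distributions."""
    return 1.0 - ks_statistic(real, synthetic)


print(distribution_match_score([1, 2, 3, 4], [3, 4, 5, 6]))  # 0.5
```

A score like this only compares one numeric column at a time, so it says nothing about relationships between columns; that is why a report would track correlations separately.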
Conditional Generation
Generate data satisfying specific conditions:
# Generate only high-income customers in the UK
synthetic = forge.generate(
    n_samples=500,
    conditions={'income': '>100000', 'country': 'UK'}
)
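One simple way to honor a `conditions` mapping like the one above is rejection sampling: generate rows, then keep only those that satisfy every condition. The sketch below assumes a condition string is either a comparison such as '>100000' or a literal value to match exactly. `parse_condition` and `filter_rows` are hypothetical names, and nothing here describes how TabularForge resolves conditions internally.

```python
import operator

# Two-character operators come first so ">=" is not read as ">".
_OPERATORS = [
    (">=", operator.ge),
    ("<=", operator.le),
    (">", operator.gt),
    ("<", operator.lt),
]


def parse_condition(spec):
    """Turn '>100000' into a numeric predicate, or 'UK' into an equality check."""
    if isinstance(spec, str):
        for symbol, compare in _OPERATORS:
            if spec.startswith(symbol):
                threshold = float(spec[len(symbol):])
                return lambda value: compare(value, threshold)
    return lambda value: value == spec


def filter_rows(rows, conditions):
    """Keep only the rows (dicts) that satisfy every condition."""
    predicates = {col: parse_condition(spec) for col, spec in conditions.items()}
    return [
        row for row in rows
        if all(check(row[col]) for col, check in predicates.items())
    ]
```

With this approach a caller would keep generating and filtering batches until enough rows pass, which gets slow when the condition is rare in the data.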
Privacy Features
Differential Privacy
TabularForge implements differential privacy to provide formal privacy guarantees:
# Lower epsilon = stronger privacy (but lower utility)
# Higher epsilon = weaker privacy (but higher utility)
forge = TabularForge(real_data, privacy_epsilon=0.1) # Strong privacy
forge = TabularForge(real_data, privacy_epsilon=1.0) # Balanced
forge = TabularForge(real_data, privacy_epsilon=10.0) # Weak privacy
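To see why a smaller epsilon means stronger privacy but lower utility, consider the classic Laplace mechanism, which adds noise with scale sensitivity / epsilon to a query result. This is a textbook illustration only: the source does not say which mechanism TabularForge uses, and `laplace_scale`, `sample_laplace`, and `private_count` are hypothetical names.

```python
import random


def laplace_scale(sensitivity, epsilon):
    """Noise scale of the Laplace mechanism: b = sensitivity / epsilon."""
    if epsilon <= 0:
        raise ValueError("epsilon must be positive")
    return sensitivity / epsilon


def sample_laplace(scale, rng):
    """Draw Laplace(0, scale) noise as the difference of two exponential draws."""
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def private_count(true_count, epsilon, rng):
    """Release a noisy count; adding or removing one row changes a count
    by at most 1, so the sensitivity is 1."""
    return true_count + sample_laplace(laplace_scale(1.0, epsilon), rng)


rng = random.Random(0)
for epsilon in (0.1, 1.0, 10.0):
    noisy = private_count(1000, epsilon, rng)
    print(f"epsilon={epsilon}: noise scale={laplace_scale(1.0, epsilon)}, "
          f"noisy count={noisy:.1f}")
```

Going from epsilon=10.0 to epsilon=0.1 multiplies the noise scale by 100, which is the privacy and utility trade-off described in the comments above.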
Privacy Attack Simulation
Test your synthetic data against common privacy attacks:
# Simulate membership inference attack
attack_results = forge.simulate_attack(
    attack_type='membership_inference',
    synthetic_data=synthetic
)
print(f"Attack success rate: {attack_results['success_rate']:.2%}")
# A success rate near 50% means the attacker does no better than random guessing
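The 50% baseline can be made concrete with a simple distance-based membership inference attack: the attacker guesses "member" when a record lies unusually close to some synthetic row. The sketch assumes numeric records given as tuples and a balanced set of known members (training rows) and non-members (holdout rows). The function names are hypothetical, and this is not necessarily the attack that `simulate_attack` runs.

```python
import math
import statistics


def nearest_distance(record, synthetic):
    """Euclidean distance from a record to its closest synthetic row."""
    return min(math.dist(record, row) for row in synthetic)


def membership_attack_success_rate(members, non_members, synthetic):
    """Fraction of correct member/non-member guesses by a distance-threshold attacker."""
    member_d = [nearest_distance(r, synthetic) for r in members]
    non_member_d = [nearest_distance(r, synthetic) for r in non_members]
    # Attacker's rule: anything closer than the median distance is called a member.
    threshold = statistics.median(member_d + non_member_d)
    correct = sum(d < threshold for d in member_d)
    correct += sum(d >= threshold for d in non_member_d)
    return correct / (len(members) + len(non_members))
```

A result near 1.0 suggests the synthetic rows sit almost on top of the training rows, while a result near 0.5 suggests closeness to the synthetic data reveals little about who was in the training set.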
Use Cases
Healthcare
# Generate synthetic patient cohorts for research
patient_data = pd.read_csv("patient_records.csv")
forge = TabularForge(patient_data, privacy_epsilon=1.0)
synthetic_patients = forge.generate(n_samples=10000)
# Share with researchers without exposing real patients
Finance
# Create synthetic transactions for fraud detection R&D
transactions = pd.read_csv("transactions.csv")
forge = TabularForge(transactions)
synthetic_transactions = forge.generate(n_samples=50000)
# Develop ML models without sensitive financial data
ML Development
# Augment small datasets
small_dataset = pd.read_csv("rare_events.csv") # Only 100 samples
forge = TabularForge(small_dataset)
augmented = forge.generate(n_samples=10000)
# Now you have enough data to train robust models
Architecture
tabularforge/
├── __init__.py              # Main API exports
├── forge.py                 # TabularForge main class
├── generators/              # Synthetic data generators
│   ├── base.py              # Abstract base generator
│   ├── copula.py            # Gaussian Copula generator
│   ├── ctgan.py             # CTGAN generator
│   └── tvae.py              # TVAE generator
├── preprocessing/           # Data preprocessing
│   ├── encoder.py           # Column encoding/decoding
│   └── transformer.py       # Data transformations
├── privacy/                 # Privacy mechanisms
│   ├── differential.py      # Differential privacy
│   └── attacks.py           # Privacy attack simulations
├── metrics/                 # Quality & privacy metrics
│   ├── statistical.py       # Statistical similarity
│   ├── utility.py           # ML utility metrics
│   └── privacy.py           # Privacy metrics
└── utils/                   # Utilities
    ├── config.py            # Configuration management
    └── logging.py           # Logging utilities
Development
Setting Up Development Environment
# Clone the repository
git clone https://github.com/ganeshreddy28/tabularforge.git
cd tabularforge
# Create virtual environment
python -m venv venv
source venv/bin/activate # On Windows: venv\Scripts\activate
# Install development dependencies
pip install -e ".[dev]"
# Run tests
pytest tests/ -v
# Run linting
flake8 tabularforge/
black tabularforge/
Running Tests
# Run all tests
pytest
# Run with coverage
pytest --cov=tabularforge --cov-report=html
# Run specific test file
pytest tests/test_generators.py -v
Documentation
Contributing
Contributions are welcome! Please see our Contributing Guide for details.
- Fork the repository
- Create your feature branch (git checkout -b feature/amazing-feature)
- Commit your changes (git commit -m 'Add amazing feature')
- Push to the branch (git push origin feature/amazing-feature)
- Open a Pull Request
License
This project is licensed under the MIT License - see the LICENSE file for details.
Acknowledgments
- SDV for inspiration on synthetic data APIs
- CTGAN Paper for the CTGAN architecture
- The differential privacy research community
Contact
- Author: Sai Ganesh Kolan
- Email: aiganesh1299@gmail.com
- LinkedIn: https://linkedin.com/in/saiganeshkolan/
Made with ❤️ for the data science community
⭐ Star us on GitHub if you find this useful!