Privacy-Preserving Synthetic Tabular Data Generation

🔨 TabularForge

🎯 What is TabularForge?

TabularForge is a unified, production-ready Python library for generating high-quality synthetic tabular data with built-in privacy guarantees. It combines multiple state-of-the-art approaches (GANs, VAEs, Copulas) into a simple, one-line API.

Why Synthetic Data?

Organizations hold valuable tabular data (patient records, financial transactions, customer data) but often can't share it or use it fully because of:

Privacy regulations (GDPR, HIPAA, CCPA)
Competitive sensitivity
Data scarcity for ML development

Synthetic data addresses these constraints by generating realistic, statistically similar records that aim to protect individual privacy while preserving analytical utility.

✨ Key Features

Feature	Description
🤖 Multiple Generators	CTGAN, TVAE, Gaussian Copula, and more
🔒 Differential Privacy	Formal privacy guarantees with configurable epsilon
📊 Quality Metrics	Statistical similarity, ML utility, privacy leakage tests
🔧 Auto Preprocessing	Handles mixed types, missing values, imbalanced data
⚡ One-Line API	Generate synthetic data in a single line of code
📈 Benchmarking	Compare generators on your specific data

🚀 Quick Start

Installation

# Install from PyPI (the distribution is published as tabularforge-sgk)
pip install tabularforge-sgk

# Or install from source
git clone https://github.com/ganeshreddy28/tabularforge.git
cd tabularforge
pip install -e .

Basic Usage

from tabularforge import TabularForge
import pandas as pd

# Load your real data
real_data = pd.read_csv("your_data.csv")

# Create the forge, then generate synthetic rows
forge = TabularForge(real_data)
synthetic_data = forge.generate(n_samples=1000)

# That's it! synthetic_data is a pandas DataFrame
print(synthetic_data.head())

With Privacy Guarantees

from tabularforge import TabularForge

# Generate with differential privacy (epsilon=1.0)
forge = TabularForge(real_data, privacy_epsilon=1.0)
private_synthetic = forge.generate(n_samples=1000)

# Check privacy metrics
privacy_report = forge.evaluate_privacy()
print(privacy_report)

Compare Different Generators

from tabularforge import TabularForge

# Benchmark all available generators
forge = TabularForge(real_data)
benchmark_results = forge.benchmark(generators=['ctgan', 'tvae', 'copula'])

# See which generator works best for your data
print(benchmark_results)

📖 Detailed Usage

Choosing a Generator

TabularForge supports multiple synthetic data generators:

Generator	Best For	Speed	Quality
`copula`	Simple distributions, fast generation	⚡⚡⚡	⭐⭐⭐
`ctgan`	Complex relationships, mixed types	⚡⚡	⭐⭐⭐⭐
`tvae`	High-dimensional data	⚡⚡	⭐⭐⭐⭐

# Specify a generator
forge = TabularForge(real_data, generator='ctgan')
synthetic = forge.generate(n_samples=500)
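
To give a feel for what the `copula` option refers to, here is a minimal standard-library sketch of the Gaussian copula idea for two numeric columns: convert each column to normal scores, measure their correlation, draw correlated normals, and map them back onto the observed values. It is an illustration under simplifying assumptions, not TabularForge's implementation; `pearson`, `fit_sample_copula`, and the sample data are hypothetical.

```python
import random
from statistics import NormalDist


def pearson(a, b):
    """Plain Pearson correlation between two equal-length sequences."""
    mean_a, mean_b = sum(a) / len(a), sum(b) / len(b)
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / (var_a * var_b) ** 0.5


def fit_sample_copula(xs, ys, n_samples, seed=0):
    """Sample (x, y) pairs from a two-column Gaussian copula fitted to xs and ys."""
    rng = random.Random(seed)
    std_normal = NormalDist()
    n = len(xs)

    def normal_scores(values):
        # Rank each value, map the rank into (0, 1), then apply the inverse normal CDF.
        order = sorted(range(n), key=lambda i: values[i])
        scores = [0.0] * n
        for rank, i in enumerate(order):
            scores[i] = std_normal.inv_cdf((rank + 0.5) / n)
        return scores

    # Clamp to [-1, 1] so rounding error cannot make 1 - rho**2 negative.
    rho = max(-1.0, min(1.0, pearson(normal_scores(xs), normal_scores(ys))))
    sorted_xs, sorted_ys = sorted(xs), sorted(ys)

    def empirical_quantile(sorted_values, u):
        # Map a uniform value back onto the observed marginal distribution.
        return sorted_values[min(int(u * n), n - 1)]

    samples = []
    for _ in range(n_samples):
        z1 = rng.gauss(0.0, 1.0)
        # Build a second normal draw with correlation rho to the first.
        z2 = rho * z1 + (1.0 - rho ** 2) ** 0.5 * rng.gauss(0.0, 1.0)
        samples.append((
            empirical_quantile(sorted_xs, std_normal.cdf(z1)),
            empirical_quantile(sorted_ys, std_normal.cdf(z2)),
        ))
    return samples


# Hypothetical toy columns: age and income rise together.
ages = [23, 35, 41, 29, 52, 47, 38, 60]
incomes = [28000, 52000, 61000, 39000, 70000, 85000, 48000, 90000]
synthetic_pairs = fit_sample_copula(ages, incomes, n_samples=200, seed=42)
```

Because this sketch resamples observed values, it preserves the marginals exactly but offers no privacy protection on its own; it only shows how the dependence structure is carried over.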

Handling Different Data Types

TabularForge automatically detects and handles:

Numerical columns (continuous and discrete)
Categorical columns (including high-cardinality)
DateTime columns
Missing values

# Explicit column type specification (optional)
forge = TabularForge(
    real_data,
    categorical_columns=['gender', 'country', 'product_type'],
    numerical_columns=['age', 'income', 'score'],
    datetime_columns=['signup_date', 'last_purchase']
)
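
As a rough idea of what automatic type detection can look like, the sketch below guesses a column's type by trying to parse its non-missing values. This is an assumption about the general approach, not TabularForge's actual detection logic; `infer_column_type` and the example columns are hypothetical.

```python
from datetime import datetime


def infer_column_type(values):
    """Guess 'numerical', 'datetime', or 'categorical' for a list of raw values."""
    # Treat None and empty strings as missing and ignore them for detection.
    present = [v for v in values if v is not None and v != ""]
    if not present:
        return "categorical"

    def all_parse(parser):
        # True only if every non-missing value is accepted by the parser.
        try:
            for value in present:
                parser(value)
            return True
        except (TypeError, ValueError):
            return False

    if all_parse(float):
        return "numerical"
    if all_parse(lambda value: datetime.fromisoformat(str(value))):
        return "datetime"
    return "categorical"


# Hypothetical raw columns, including missing entries.
columns = {
    "age": [34, 51, None, 29],
    "signup_date": ["2024-01-05", "2023-11-30", ""],
    "country": ["UK", "DE", "UK"],
}
detected = {name: infer_column_type(values) for name, values in columns.items()}
```

Heuristics like this can misclassify columns such as numeric ID codes, which is one reason to pass the column lists explicitly when you know them.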

Evaluating Synthetic Data Quality

from tabularforge import TabularForge

forge = TabularForge(real_data)
synthetic = forge.generate(n_samples=1000)

# Get comprehensive quality report
quality_report = forge.evaluate_quality(synthetic)

print(quality_report)
# Example output (illustrative values):
# {
#     'statistical_similarity': 0.92,
#     'column_correlations': 0.89,
#     'distribution_match': 0.94,
#     'ml_utility': 0.87
# }
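
The report keys above are not defined in this README, so the following is only a sketch of two common per-column similarity scores that a metric like `distribution_match` could be built from: one minus the Kolmogorov-Smirnov statistic for numeric columns, and one minus the total variation distance for categorical columns. The function names are hypothetical and the real metrics may differ.

```python
from bisect import bisect_right
from collections import Counter


def ks_similarity(real, synthetic):
    """1 minus the two-sample Kolmogorov-Smirnov statistic (1.0 means identical ECDFs)."""
    real_sorted, synth_sorted = sorted(real), sorted(synthetic)
    max_gap = 0.0
    # The largest gap between two empirical CDFs occurs at an observed value.
    for x in set(real) | set(synthetic):
        ecdf_real = bisect_right(real_sorted, x) / len(real_sorted)
        ecdf_synth = bisect_right(synth_sorted, x) / len(synth_sorted)
        max_gap = max(max_gap, abs(ecdf_real - ecdf_synth))
    return 1.0 - max_gap


def tv_similarity(real, synthetic):
    """1 minus the total variation distance between category frequencies."""
    real_freq, synth_freq = Counter(real), Counter(synthetic)
    categories = set(real_freq) | set(synth_freq)
    distance = 0.5 * sum(
        abs(real_freq[c] / len(real) - synth_freq[c] / len(synthetic))
        for c in categories
    )
    return 1.0 - distance


# Hypothetical columns: the synthetic sample over-represents "b".
category_score = tv_similarity(["a", "a", "b", "b"], ["a", "b", "b", "b"])
numeric_score = ks_similarity([1, 2, 3], [1, 2, 3])
```

Both scores fall in the range 0 to 1, so they can be averaged across columns into a single summary number.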

Conditional Generation

Generate data satisfying specific conditions:

# Generate only high-income customers in the UK
synthetic = forge.generate(
    n_samples=500,
    conditions={'income': '>100000', 'country': 'UK'}
)
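
One simple way to honor such conditions is to generate rows and keep only those that match. The sketch below parses condition strings like `'>100000'` into predicates and filters a list of row dictionaries. It assumes the condition syntax shown above and is not a description of how TabularForge applies conditions internally; `parse_condition` and `filter_rows` are hypothetical helpers.

```python
import operator

# Two-character operators come first so ">=" is not mistaken for ">".
_OPERATORS = {
    ">=": operator.ge,
    "<=": operator.le,
    ">": operator.gt,
    "<": operator.lt,
}


def parse_condition(spec):
    """Turn '>100000' into a predicate; anything else becomes an equality check."""
    if isinstance(spec, str):
        for symbol, compare in _OPERATORS.items():
            if spec.startswith(symbol):
                try:
                    threshold = float(spec[len(symbol):])
                except ValueError:
                    break  # Not a numeric comparison, fall back to equality.
                return lambda value: compare(value, threshold)
    return lambda value: value == spec


def filter_rows(rows, conditions):
    """Keep only the rows that satisfy every condition."""
    predicates = {col: parse_condition(spec) for col, spec in conditions.items()}
    return [
        row for row in rows
        if all(check(row[col]) for col, check in predicates.items())
    ]


# Hypothetical generated rows.
rows = [
    {"income": 120000, "country": "UK"},
    {"income": 95000, "country": "UK"},
    {"income": 150000, "country": "DE"},
]
matching = filter_rows(rows, {"income": ">100000", "country": "UK"})
```

Filtering after generation wastes samples when the condition is rare, which is why conditional generators usually build the condition into sampling instead.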

🔒 Privacy Features

Differential Privacy

TabularForge implements differential privacy to provide formal privacy guarantees:

# Lower epsilon = stronger privacy (but lower utility)
# Higher epsilon = weaker privacy (but higher utility)
forge_strong = TabularForge(real_data, privacy_epsilon=0.1)    # Strong privacy
forge_balanced = TabularForge(real_data, privacy_epsilon=1.0)  # Balanced
forge_weak = TabularForge(real_data, privacy_epsilon=10.0)     # Weak privacy
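
To make the epsilon trade-off concrete, here is a sketch of the classic Laplace mechanism applied to a bounded mean: the noise scale is sensitivity divided by epsilon, so a smaller epsilon means more noise. This README does not say which mechanism TabularForge uses, so treat this purely as background; `laplace_scale`, `laplace_noise`, and `private_mean` are hypothetical names.

```python
import random


def laplace_scale(sensitivity, epsilon):
    """Noise scale b for the Laplace mechanism: b = sensitivity / epsilon."""
    if epsilon <= 0:
        raise ValueError("epsilon must be positive")
    return sensitivity / epsilon


def laplace_noise(scale, rng):
    """Draw Laplace(0, scale) noise as the difference of two exponential draws."""
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def private_mean(values, lower, upper, epsilon, rng):
    """Epsilon-differentially-private mean of values clipped to [lower, upper]."""
    clipped = [min(max(v, lower), upper) for v in values]
    # Replacing one record moves the clipped mean by at most (upper - lower) / n.
    sensitivity = (upper - lower) / len(clipped)
    noise = laplace_noise(laplace_scale(sensitivity, epsilon), rng)
    return sum(clipped) / len(clipped) + noise


rng = random.Random(0)
incomes = [42000, 58000, 61000, 39000, 75000]  # hypothetical data
for epsilon in (0.1, 1.0, 10.0):
    estimate = private_mean(incomes, 0, 100000, epsilon, rng)
    print(f"epsilon={epsilon}: noisy mean = {estimate:.0f}")
```

With only five records the noise at epsilon 0.1 swamps the signal, which illustrates why strong privacy settings need larger datasets to stay useful.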

Privacy Attack Simulation

Test your synthetic data against common privacy attacks:

# Simulate membership inference attack
attack_results = forge.simulate_attack(
    attack_type='membership_inference',
    synthetic_data=synthetic
)

print(f"Attack success rate: {attack_results['success_rate']:.2%}")
# On a balanced set of members and non-members, a rate close to 50%
# means the attacker does no better than random guessing
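
For intuition about what a membership inference attack measures, the sketch below implements a simple distance-threshold attacker: a candidate record is guessed to be a training member if it lies very close to some synthetic record. This is one common baseline, assumed here for illustration, and is not necessarily what `simulate_attack` runs; the function names and toy records are hypothetical.

```python
import math


def nearest_distance(record, dataset):
    """Euclidean distance from a record to its closest neighbour in dataset."""
    return min(math.dist(record, other) for other in dataset)


def membership_attack_success(members, non_members, synthetic, threshold):
    """Accuracy of a distance-threshold attacker on a balanced candidate set."""
    correct = 0
    for record in members:
        # Correct guess: a true member is flagged as a member.
        correct += nearest_distance(record, synthetic) <= threshold
    for record in non_members:
        # Correct guess: a non-member is not flagged.
        correct += nearest_distance(record, synthetic) > threshold
    return correct / (len(members) + len(non_members))


# Hypothetical numeric records (already scaled to comparable units).
members = [(0.0, 0.0), (1.0, 1.0)]
non_members = [(5.0, 5.0), (6.0, 6.0)]

leaky_synthetic = list(members)                # copies the training rows
neutral_synthetic = [(0.0, 0.0), (5.0, 5.0)]   # equally close to both groups

leaky_rate = membership_attack_success(members, non_members, leaky_synthetic, 0.5)
neutral_rate = membership_attack_success(members, non_members, neutral_synthetic, 0.5)
```

A generator that memorizes training rows pushes the success rate toward 100%, while synthetic data that sits no closer to members than to non-members keeps it near 50%.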

📊 Use Cases

Healthcare

import pandas as pd
from tabularforge import TabularForge

# Generate synthetic patient cohorts for research
patient_data = pd.read_csv("patient_records.csv")
forge = TabularForge(patient_data, privacy_epsilon=1.0)
synthetic_patients = forge.generate(n_samples=10000)
# Share with researchers without exposing real patients

Finance

import pandas as pd
from tabularforge import TabularForge

# Create synthetic transactions for fraud detection R&D
transactions = pd.read_csv("transactions.csv")
forge = TabularForge(transactions)
synthetic_transactions = forge.generate(n_samples=50000)
# Develop ML models without handling the raw financial records

ML Development

import pandas as pd
from tabularforge import TabularForge

# Augment small datasets
small_dataset = pd.read_csv("rare_events.csv")  # Only 100 samples
forge = TabularForge(small_dataset)
augmented = forge.generate(n_samples=10000)
# Synthetic rows add volume, not new information: validate models on held-out real data

🏗️ Architecture

tabularforge/
├── __init__.py              # Main API exports
├── forge.py                 # TabularForge main class
├── generators/              # Synthetic data generators
│   ├── base.py              # Abstract base generator
│   ├── copula.py            # Gaussian Copula generator
│   ├── ctgan.py             # CTGAN generator
│   └── tvae.py              # TVAE generator
├── preprocessing/           # Data preprocessing
│   ├── encoder.py           # Column encoding/decoding
│   └── transformer.py       # Data transformations
├── privacy/                 # Privacy mechanisms
│   ├── differential.py      # Differential privacy
│   └── attacks.py           # Privacy attack simulations
├── metrics/                 # Quality & privacy metrics
│   ├── statistical.py       # Statistical similarity
│   ├── utility.py           # ML utility metrics
│   └── privacy.py           # Privacy metrics
└── utils/                   # Utilities
    ├── config.py            # Configuration management
    └── logging.py           # Logging utilities
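
The layout suggests that every generator derives from a shared abstract base in `generators/base.py`. The sketch below shows what such a contract could look like: a fit step, a sample step, and a guard against sampling before fitting. The class names, method names, and the toy subclass are assumptions for illustration and are not taken from the TabularForge source.

```python
import random
from abc import ABC, abstractmethod


class BaseGenerator(ABC):
    """Hypothetical generator contract: fit on real rows, then sample synthetic rows."""

    def __init__(self):
        self._fitted = False

    @abstractmethod
    def _fit(self, rows):
        """Learn whatever the generator needs from a list of row dicts."""

    @abstractmethod
    def _sample(self, n_samples):
        """Return n_samples synthetic row dicts."""

    def fit(self, rows):
        self._fit(rows)
        self._fitted = True
        return self

    def sample(self, n_samples):
        if not self._fitted:
            raise RuntimeError("call fit() before sample()")
        return self._sample(n_samples)


class IndependentColumnGenerator(BaseGenerator):
    """Toy subclass: resamples each column independently, ignoring correlations."""

    def __init__(self, seed=0):
        super().__init__()
        self._rng = random.Random(seed)
        self._columns = {}

    def _fit(self, rows):
        # Store the observed values of every column.
        self._columns = {key: [row[key] for row in rows] for key in rows[0]}

    def _sample(self, n_samples):
        return [
            {key: self._rng.choice(values) for key, values in self._columns.items()}
            for _ in range(n_samples)
        ]


# Hypothetical input rows.
real_rows = [{"age": 34, "country": "UK"}, {"age": 51, "country": "DE"}]
synthetic_rows = IndependentColumnGenerator(seed=1).fit(real_rows).sample(5)
```

Keeping the fitted-state check in the base class means each concrete generator only has to implement the two private hooks.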

🧪 Development

Setting Up Development Environment

# Clone the repository
git clone https://github.com/ganeshreddy28/tabularforge.git
cd tabularforge

# Create virtual environment
python -m venv venv
source venv/bin/activate  # On Windows: venv\Scripts\activate

# Install development dependencies
pip install -e ".[dev]"

# Run tests
pytest tests/ -v

# Run linting (flake8) and formatting (black)
flake8 tabularforge/
black tabularforge/

Running Tests

# Run all tests
pytest

# Run with coverage
pytest --cov=tabularforge --cov-report=html

# Run specific test file
pytest tests/test_generators.py -v

📚 Documentation

🤝 Contributing

Contributions are welcome! Please see our Contributing Guide for details.

Fork the repository
Create your feature branch (git checkout -b feature/amazing-feature)
Commit your changes (git commit -m 'Add amazing feature')
Push to the branch (git push origin feature/amazing-feature)
Open a Pull Request

📄 License

This project is licensed under the MIT License - see the LICENSE file for details.

🙏 Acknowledgments

SDV for inspiration on synthetic data APIs
CTGAN Paper for the CTGAN architecture
The differential privacy research community

📬 Contact

Author: Sai Ganesh Kolan
Email: aiganesh1299@gmail.com
LinkedIn: https://linkedin.com/in/saiganeshkolan/

Made with ❤️ for the data science community

⭐ Star us on GitHub if you find this useful!

Release history

0.1.1 (Jan 1, 2026)
0.1.0 (Dec 27, 2025, this version)

Download files

Source Distribution

tabularforge_sgk-0.1.0.tar.gz (42.2 kB), uploaded Dec 27, 2025

Built Distribution

tabularforge_sgk-0.1.0-py3-none-any.whl (44.1 kB), uploaded Dec 27, 2025, Python 3

File details

Details for the file tabularforge_sgk-0.1.0.tar.gz.

File metadata

Upload date: Dec 27, 2025
Size: 42.2 kB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.2.0 CPython/3.11.14

File hashes

Hashes for tabularforge_sgk-0.1.0.tar.gz
Algorithm	Hash digest
SHA256	`960d2962aed236249ec1490d3c8717dbfb4e5e70365358422c01440a07f95f1b`
MD5	`342f993e2be182e9e103c2c613e75777`
BLAKE2b-256	`ed323159c0c723eb1c3d6ff5c6aca49c235b921ac3a311b4872e96dc7a698c12`

File details

Details for the file tabularforge_sgk-0.1.0-py3-none-any.whl.

File metadata

Upload date: Dec 27, 2025
Size: 44.1 kB
Tags: Python 3
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.2.0 CPython/3.11.14

File hashes

Hashes for tabularforge_sgk-0.1.0-py3-none-any.whl
Algorithm	Hash digest
SHA256	`426dd93f2f4dd26f746d4e94701bcc15d10633478934b1810adcf9702387186c`
MD5	`5efb0aa553243aa8aa270e995f0a0671`
BLAKE2b-256	`cf7cb92b38a5762ffda8baa7185b838e3584818e4fe3dd0dc7ff60aee10f5c11`
