Synthetic tabular data generation using energy-based modeling and TabPFN
TabPFGen: Synthetic Tabular Data Generation with TabPFN
TabPFGen is a Python library for generating high-quality synthetic tabular data using energy-based modeling and stochastic gradient Langevin dynamics (SGLD). It supports both classification and regression tasks with built-in visualization capabilities.
Key Features
- Energy-based synthetic data generation
- Support for both classification and regression tasks
- Class-balanced sampling option
- Comprehensive visualization tools
- Built on TabPFN transformer architecture
- No additional training required
Installation
pip install tabpfgen
Quick Start
Classification Example
from tabpfgen import TabPFGen
from tabpfgen.visuals import visualize_classification_results
from sklearn.datasets import load_breast_cancer
# Load data
X, y = load_breast_cancer(return_X_y=True)
# Initialize generator
generator = TabPFGen(n_sgld_steps=500)
# Generate synthetic data
X_synth, y_synth = generator.generate_classification(
    X, y,
    n_samples=100,
    balance_classes=True
)
# Visualize results
visualize_classification_results(
    X, y, X_synth, y_synth,
    feature_names=load_breast_cancer().feature_names
)
Regression Example
from tabpfgen import TabPFGen
from tabpfgen.visuals import visualize_regression_results
from sklearn.datasets import load_diabetes
# Load regression dataset
X, y = load_diabetes(return_X_y=True)
# Initialize generator
generator = TabPFGen(n_sgld_steps=500)
# Generate synthetic regression data
X_synth, y_synth = generator.generate_regression(
    X, y,
    n_samples=100,
    use_quantiles=True
)
# Visualize results
visualize_regression_results(
    X, y, X_synth, y_synth,
    feature_names=load_diabetes().feature_names
)
Visualization Features
The package includes comprehensive visualization tools:
Classification Visualizations
- Class distribution comparison
- t-SNE visualization of feature space
- Feature importance analysis
- Feature distribution comparisons
- Feature correlation matrices
Regression Visualizations
- Target value distribution comparison
- Q-Q plots for distribution analysis
- Box plot comparisons
- Feature importance analysis
- Scatter plots of important features
- t-SNE visualization with target value mapping
- Residuals analysis
- Feature correlation difference matrices
Parameters
TabPFGen
- n_sgld_steps: Number of SGLD iterations (default: 1000)
- sgld_step_size: Step size for SGLD updates (default: 0.01)
- sgld_noise_scale: Scale of noise in SGLD (default: 0.01)
- device: Computing device ('cpu' or 'cuda', default: 'cpu')
Classification Generation
- n_samples: Number of synthetic samples to generate
- balance_classes: Whether to generate balanced class distributions (default: True)
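One way to confirm that balance_classes had the intended effect is to count the labels in the returned y_synth. The helper below is a minimal sketch: class_proportions is a hypothetical name, not part of tabpfgen, and the label list is a made-up stand-in for y_synth.

```python
from collections import Counter


def class_proportions(labels):
    """Return the share of each class label in a sequence of labels."""
    counts = Counter(labels)
    total = sum(counts.values())
    return {label: count / total for label, count in counts.items()}


# Hypothetical labels standing in for y_synth (assumption, not real output)
y_synth_example = [0, 1, 0, 1, 1, 0]
print(class_proportions(y_synth_example))  # {0: 0.5, 1: 0.5}
```

If y_synth is a NumPy array, passing y_synth.tolist() keeps the dictionary keys as plain Python integers.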
Regression Generation
- n_samples: Number of synthetic samples to generate
- use_quantiles: Whether to use quantile-based sampling (default: True)
Tests
python -m unittest tests/test_tabpfgen.py
How It Works
- Energy-Based Modeling: Uses a distance-based energy function that combines:
  - Feature space distances between synthetic and real samples
  - Class-conditional information for classification tasks
- SGLD Sampling: Generates synthetic samples through iterative updates:
  x_new = x - step_size * gradient + noise_scale * random_noise
- Quality Assurance:
  - Automatic feature scaling
  - Class balance maintenance
  - Distribution matching through energy minimization
  - Quantile-based sampling for regression
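The SGLD update rule above can be illustrated with a small, self-contained sketch. This is not the library's implementation: the energy function here is a toy assumption (mean squared distance to the real points, with no class-conditional term), and the names energy, energy_gradient, and sgld_sample are hypothetical. Only the update formula itself is taken from the description above.

```python
import random


def energy(x, real_points):
    """Toy distance-based energy: mean squared distance from x to the real points."""
    total = 0.0
    for r in real_points:
        total += sum((xi - ri) ** 2 for xi, ri in zip(x, r))
    return total / len(real_points)


def energy_gradient(x, real_points):
    """Analytic gradient of the toy energy: 2 * (x - mean of the real points)."""
    n = len(real_points)
    mean = [sum(r[d] for r in real_points) / n for d in range(len(x))]
    return [2.0 * (xi - mi) for xi, mi in zip(x, mean)]


def sgld_sample(x0, real_points, n_steps=500, step_size=0.01,
                noise_scale=0.01, seed=0):
    """Repeat x_new = x - step_size * gradient + noise_scale * random_noise."""
    rng = random.Random(seed)
    x = list(x0)
    for _ in range(n_steps):
        grad = energy_gradient(x, real_points)
        x = [
            xi - step_size * g + noise_scale * rng.gauss(0.0, 1.0)
            for xi, g in zip(x, grad)
        ]
    return x


# Made-up 2D "real" data and a starting point far from it
real = [[0.0, 0.0], [2.0, 2.0]]
start = [10.0, -10.0]
sample = sgld_sample(start, real)
print(energy(start, real), energy(sample, real))
```

With this toy energy the sample drifts toward the mean of the real points and then jitters around it, so the second printed energy is much lower than the first. A realistic energy would need more structure to keep samples from collapsing onto a single point.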
Limitations
- Memory usage scales with dataset size
- SGLD convergence can be sensitive to step size parameters
- Computation time increases with n_sgld_steps
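The step size sensitivity noted above can be seen in a toy setting that is independent of TabPFGen's internals. The sketch below assumes a one-dimensional energy E(x) = x**2 with the noise term switched off, and applies the same gradient update with two different step sizes. The function name run_updates is hypothetical.

```python
def run_updates(step_size, n_steps=50, x0=1.0):
    """Apply x_new = x - step_size * gradient for E(x) = x**2 (gradient 2 * x)."""
    x = x0
    for _ in range(n_steps):
        x = x - step_size * 2.0 * x
    return x


small = run_updates(0.01)  # each step multiplies x by 0.98, so x shrinks
large = run_updates(1.5)   # each step multiplies x by -2, so x blows up
print(abs(small), abs(large))
```

For this energy the update is stable only when step_size is below 1.0. Real energy landscapes have no such simple threshold, so the sgld_step_size value usually has to be found by experiment.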
References
Ma, Junwei, et al. "TabPFGen--Tabular Data Generation with TabPFN." arXiv preprint arXiv:2406.05216 (2024).
Hollmann, Noah, et al. "Accurate predictions on small data with a tabular foundation model." Nature 637.8045 (2025): 319-326.