Synthetic tabular data generation using energy-based modeling and TabPFN

TabPFGen: Synthetic Tabular Data Generation with TabPFN

TabPFGen is a Python library that generates synthetic tabular data using energy-based modeling and stochastic gradient Langevin dynamics (SGLD). It supports classification and regression tasks and includes visualization tools for comparing synthetic data with the original data.

Key Features

  • Energy-based synthetic data generation
  • Support for both classification and regression tasks
  • Class-balanced sampling option
  • Comprehensive visualization tools
  • Built on the TabPFN transformer architecture
  • No additional training required

Installation

pip install tabpfgen

Quick Start

Classification Example

from tabpfgen import TabPFGen
from tabpfgen.visuals import visualize_classification_results
from sklearn.datasets import load_breast_cancer

# Load data
X, y = load_breast_cancer(return_X_y=True)

# Initialize generator
generator = TabPFGen(n_sgld_steps=500)

# Generate synthetic data
X_synth, y_synth = generator.generate_classification(
    X, y,
    n_samples=100,
    balance_classes=True
)

# Visualize results
visualize_classification_results(
    X, y, X_synth, y_synth,
    feature_names=load_breast_cancer().feature_names
)

Regression Example

from tabpfgen import TabPFGen
from tabpfgen.visuals import visualize_regression_results
from sklearn.datasets import load_diabetes

# Load regression dataset
X, y = load_diabetes(return_X_y=True)

# Initialize generator
generator = TabPFGen(n_sgld_steps=500)

# Generate synthetic regression data
X_synth, y_synth = generator.generate_regression(
    X, y,
    n_samples=100,
    use_quantiles=True
)

# Visualize results
visualize_regression_results(
    X, y, X_synth, y_synth,
    feature_names=load_diabetes().feature_names
)

Visualization Features

The tabpfgen.visuals module compares real and synthetic data through the following plots:

Classification Visualizations

  • Class distribution comparison
  • t-SNE visualization of feature space
  • Feature importance analysis
  • Feature distribution comparisons
  • Feature correlation matrices
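The class distribution comparison can also be checked numerically. The following is a minimal sketch using only the standard library; the helper names class_proportions and compare_class_distributions are hypothetical and are not part of tabpfgen, and the toy label lists stand in for y and y_synth.

```python
from collections import Counter


def class_proportions(labels):
    """Return {class: fraction of samples} for an iterable of labels."""
    counts = Counter(labels)
    total = sum(counts.values())
    return {cls: n / total for cls, n in sorted(counts.items())}


def compare_class_distributions(y_real, y_synth):
    """Return {class: (real_fraction, synth_fraction)} over all classes seen."""
    real = class_proportions(y_real)
    synth = class_proportions(y_synth)
    classes = sorted(set(real) | set(synth))
    return {c: (real.get(c, 0.0), synth.get(c, 0.0)) for c in classes}


# Toy labels standing in for the real and synthetic targets
y_real = [0, 0, 0, 1]
y_synth = [0, 1, 0, 1]

comparison = compare_class_distributions(y_real, y_synth)
for cls, (p_real, p_synth) in comparison.items():
    print(f"class {cls}: real={p_real:.2f} synthetic={p_synth:.2f}")
```

With balance_classes=True the synthetic fractions are expected to be roughly equal across classes, even when the real fractions are not.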

Regression Visualizations

  • Target value distribution comparison
  • Q-Q plots for distribution analysis
  • Box plot comparisons
  • Feature importance analysis
  • Scatter plots of important features
  • t-SNE visualization with target value mapping
  • Residuals analysis
  • Feature correlation difference matrices
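A feature correlation difference matrix can be computed directly with NumPy when a number is more useful than a plot. This is a sketch under assumptions: the function name correlation_difference is hypothetical, the toy arrays stand in for X and X_synth, and the plotting code in tabpfgen.visuals may compute its matrices differently.

```python
import numpy as np


def correlation_difference(X_real, X_synth):
    """Feature-correlation matrix of the synthetic data minus that of the real data."""
    corr_real = np.corrcoef(X_real, rowvar=False)    # columns are features
    corr_synth = np.corrcoef(X_synth, rowvar=False)
    return corr_synth - corr_real


# Toy data: three features, the second correlated with the first
rng = np.random.default_rng(0)
X_real = rng.normal(size=(200, 3))
X_real[:, 1] += 0.8 * X_real[:, 0]
# Stand-in for generated data: the real rows with a small perturbation
X_synth = X_real + rng.normal(scale=0.1, size=X_real.shape)

diff = correlation_difference(X_real, X_synth)
print(np.round(diff, 3))
print("largest absolute gap:", np.abs(diff).max())
```

Entries near zero indicate that the synthetic data preserves the pairwise linear relationships between features.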

Parameters

TabPFGen

  • n_sgld_steps: Number of SGLD iterations (default: 1000)
  • sgld_step_size: Step size for SGLD updates (default: 0.01)
  • sgld_noise_scale: Scale of noise in SGLD (default: 0.01)
  • device: Computing device ('cpu' or 'cuda', default: 'cpu')

Classification Generation

  • n_samples: Number of synthetic samples to generate
  • balance_classes: Whether to generate balanced class distributions (default: True)
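To illustrate what a balanced class distribution means for a given n_samples, the sketch below splits the requested number of samples as evenly as possible across the observed classes. The helper balanced_class_counts is hypothetical; how TabPFGen allocates samples internally is not documented here and may differ.

```python
def balanced_class_counts(n_samples, labels):
    """Split n_samples as evenly as possible across the distinct labels."""
    classes = sorted(set(labels))
    base, remainder = divmod(n_samples, len(classes))
    # The first `remainder` classes receive one extra sample each.
    return {c: base + (1 if i < remainder else 0) for i, c in enumerate(classes)}


# An imbalanced label vector still yields an even allocation
counts = balanced_class_counts(100, [0, 1, 1, 1])
print(counts)

# When n_samples is not divisible by the number of classes,
# the leftover samples go to the first classes in sorted order.
uneven = balanced_class_counts(10, ["a", "b", "c"])
print(uneven)
```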

Regression Generation

  • n_samples: Number of synthetic samples to generate
  • use_quantiles: Whether to use quantile-based sampling (default: True)
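One common way to implement quantile-based sampling of a regression target is inverse-transform sampling through the empirical quantile function. The sketch below shows that idea with NumPy; it is an assumption about what use_quantiles could mean, not a description of the library's code, and sample_targets_by_quantile is a hypothetical name.

```python
import numpy as np


def sample_targets_by_quantile(y, n_samples, rng=None):
    """Draw targets by mapping uniform quantile levels through the empirical quantiles of y."""
    rng = np.random.default_rng() if rng is None else rng
    levels = rng.uniform(0.0, 1.0, size=n_samples)
    # np.quantile interpolates linearly between observed values by default
    return np.quantile(np.asarray(y, dtype=float), levels)


# Toy skewed target standing in for y
y = np.array([1.0, 2.0, 4.0, 8.0, 16.0])
y_new = sample_targets_by_quantile(y, n_samples=1000, rng=np.random.default_rng(0))

print("real range:     ", y.min(), y.max())
print("synthetic range:", y_new.min(), y_new.max())
```

Because the samples are interpolated between observed values, they follow the shape of the real target distribution and never fall outside its observed range.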

Tests

python -m unittest tests/test_tabpfgen.py

How It Works

  1. Energy-Based Modeling: Uses a distance-based energy function that combines:

    • Feature space distances between synthetic and real samples
    • Class-conditional information for classification tasks
  2. SGLD Sampling: Generates synthetic samples through iterative updates:

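    x_new = x - step_size * gradient + noise_scale * random_noise

    Here gradient is the gradient of the energy with respect to x, and random_noise is drawn anew at every step.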
    
  3. Quality Assurance:

    • Automatic feature scaling
    • Class balance maintenance
    • Distribution matching through energy minimization
    • Quantile-based sampling for regression
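The steps above can be sketched in a few lines of NumPy. This is an illustrative toy, not the library's implementation: the energy used here (a soft minimum of squared distances to the real samples), the bandwidth parameter, the initialization from perturbed real rows, and the names energy_gradient and sgld_sample are all assumptions. The class-conditional term and the TabPFN model are omitted, and the features are assumed to be already scaled.

```python
import numpy as np


def energy_gradient(X_synth, X_real, bandwidth=1.0):
    """Gradient of E(x) = -log sum_i exp(-||x - x_i||^2 / (2 * bandwidth^2)) for each row x."""
    diffs = X_synth[:, None, :] - X_real[None, :, :]   # (n_synth, n_real, n_features)
    sq_dists = np.sum(diffs ** 2, axis=-1)             # (n_synth, n_real)
    logits = -sq_dists / (2.0 * bandwidth ** 2)
    logits -= logits.max(axis=1, keepdims=True)        # numerical stability
    weights = np.exp(logits)
    weights /= weights.sum(axis=1, keepdims=True)      # softmax over real samples
    # Weighted average of (x - x_i): descending it moves x toward nearby real samples
    return np.einsum("sr,srd->sd", weights, diffs) / bandwidth ** 2


def sgld_sample(X_real, n_samples, n_steps=500, step_size=0.01, noise_scale=0.01, seed=0):
    """Run the SGLD update from the text on a batch of synthetic points."""
    rng = np.random.default_rng(seed)
    # Assumed initialization: randomly chosen real rows plus a small perturbation
    idx = rng.integers(0, len(X_real), size=n_samples)
    X = X_real[idx] + noise_scale * rng.normal(size=(n_samples, X_real.shape[1]))
    for _ in range(n_steps):
        gradient = energy_gradient(X, X_real)
        random_noise = rng.normal(size=X.shape)
        X = X - step_size * gradient + noise_scale * random_noise
    return X


# Toy "real" data: 50 points in two dimensions
rng = np.random.default_rng(42)
X_real = rng.normal(size=(50, 2))
X_synth = sgld_sample(X_real, n_samples=20, n_steps=100)
print(X_synth.shape)
```

Note that energy_gradient builds an array of shape (n_synth, n_real, n_features), so memory in this sketch grows with the product of the two sample counts.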

Limitations

  • Memory usage scales with dataset size
  • SGLD convergence can be sensitive to step size parameters
  • Computation time increases with n_sgld_steps
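The step-size sensitivity can be seen on a one-dimensional toy problem. The sketch below applies the noise-free part of the update rule to the quadratic energy E(x) = x**2 / 2, whose gradient is x; the function name and the chosen step sizes are illustrative only and say nothing about good values for sgld_step_size on real data.

```python
def descend_quadratic(step_size, n_steps=50, x0=1.0):
    """Iterate x <- x - step_size * dE/dx for E(x) = x**2 / 2, whose gradient is x."""
    x = x0
    for _ in range(n_steps):
        x = x - step_size * x
    return x


# Each step multiplies x by (1 - step_size), so the iteration
# shrinks toward the minimum at 0 only when 0 < step_size < 2.
for step_size in (0.01, 0.5, 1.9, 2.1):
    print(f"step_size={step_size}: |x| after 50 steps = {abs(descend_quadratic(step_size)):.3g}")
```

A very small step barely moves within the iteration budget, while a step that is too large makes the iterates grow instead of settling, which is why the step size and n_sgld_steps usually need to be tuned together.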

References

Ma, Junwei, et al. "TabPFGen--Tabular Data Generation with TabPFN." arXiv preprint arXiv:2406.05216 (2024).

Hollmann, Noah, et al. "Accurate predictions on small data with a tabular foundation model." Nature 637.8045 (2025): 319-326.


