
BSTabDiff: Block-Subunit Diffusion Priors for High-Dimensional Tabular Data Generation

Project description



[Figure: BSTabDiff architecture diagram (BSTabDiffArchi.png)]

BSTabDiff is a block-subunit generative framework for High-Dimensional Low-Sample-Size (HDLSS) tabular data synthesis. Rather than learning dependence directly in the original high-dimensional feature space, it partitions the m features into M latent blocks (M ≪ m), models global structure through a compact diffusion/flow prior over the block latents, and decodes back to the full table using copula-based dependence, flexible feature-wise marginals, and explicit missingness modeling. This design makes BSTabDiff especially well suited to omics-style and other HDLSS settings, where direct high-dimensional density learning is often unstable. Across multiple HDLSS benchmarks, BSTabDiff generates more realistic and stable synthetic data than several widely used tabular generators, and its downstream performance often approaches that obtained from real data.
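The block-subunit idea can be illustrated with a minimal, self-contained sketch. This is not the BSTabDiff implementation; the contiguous block split and the PCA summarizer are illustrative stand-ins. The point is that after partitioning m features into M blocks and summarizing each block with a low-dimensional latent, the prior only has to model an M-dimensional space instead of an m-dimensional one.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
n, m, M = 80, 2000, 20          # HDLSS regime: n << m
X = rng.standard_normal((n, m)).astype(np.float32)

# Partition the m features into M contiguous blocks of equal size.
blocks = np.array_split(np.arange(m), M)

# Summarize each block with a 1-D latent (here: its first principal component).
Z = np.stack(
    [PCA(n_components=1).fit_transform(X[:, idx])[:, 0] for idx in blocks],
    axis=1,
)
print(Z.shape)  # (80, 20): a compact latent table for the prior to model
```

A diffusion or flow prior is then fit over these compact block latents rather than over the raw 2000-dimensional table.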

Citation

Al Zadid Sultan Bin Habib, Md Younus Ahamed, Prashnna Kumar Gyawali, Gianfranco Doretto, and Donald A. Adjeroh.
“BSTabDiff: Block-Subunit Diffusion Priors for High-Dimensional Tabular Data Generation.”
In ICLR 2026 2nd Workshop on Deep Generative Models in Machine Learning: Theory, Principle and Efficacy (DeLTa), 2026.

BibTeX:

@inproceedings{habib2026bstabdiff,
  title     = {BSTabDiff: Block-Subunit Diffusion Priors for High-Dimensional Tabular Data Generation},
  author    = {Habib, Al Zadid Sultan Bin and Ahamed, Md Younus and Gyawali, Prashnna Kumar and Doretto, Gianfranco and Adjeroh, Donald A.},
  booktitle = {ICLR 2026 2nd Workshop on Deep Generative Models in Machine Learning: Theory, Principle and Efficacy (DeLTa)},
  year      = {2026}
}

Files and Repository Structure

Python package: bstabdiff/

This folder contains the core BSTabDiff implementation:

  • __init__.py - Package initializer and high-level API exports.
  • block_subunit_gen.py - Main BSTabDiff implementation, including feature schema, empirical marginals, block-subunit emissions, diffusion/flow priors, training, and synthetic sampling utilities.

Notebooks

  • Dummy Example Usage.ipynb
    Contains simple toy examples showing how to install/import the bstabdiff package, fit BSTabDiff on a dummy HDLSS dataset, and sample synthetic data.

  • BSTabDiff_Colon.ipynb
    Contains the Colon dataset experiments from the paper. Downstream classifiers include Logistic Regression, TabPFN-2.5 (currently applicable only to datasets whose feature count is within its supported range, which the Colon dataset satisfies), TANDEM (NeurIPS 2025), and CatBoost. This notebook also includes the paper’s ablation studies and related fidelity analysis.

  • BSTabDiff_GLI.ipynb
    Contains the GLI-85 experiments using Logistic Regression as the downstream classifier, along with selected fidelity analysis.

  • BSTabDiff_Lung.ipynb
    Contains the Lung dataset experiments using Logistic Regression as the downstream classifier, along with selected fidelity analysis.
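A lightweight version of the marginal fidelity checks performed in these notebooks can be sketched as follows (illustrative only; the placeholder arrays stand in for real and BSTabDiff-generated tables, and the notebooks' actual metrics may differ): compare per-feature distributions of real versus synthetic data with a two-sample Kolmogorov–Smirnov test.

```python
import numpy as np
from scipy.stats import ks_2samp

rng = np.random.default_rng(0)
X_real = rng.standard_normal((80, 50))        # placeholder real table
X_syn = rng.standard_normal((50, 50)) * 1.05  # placeholder synthetic table

# Per-feature KS statistic in [0, 1]: smaller means closer marginals.
ks = np.array([ks_2samp(X_real[:, j], X_syn[:, j]).statistic
               for j in range(X_real.shape[1])])
print(f"mean KS = {ks.mean():.3f}, worst feature KS = {ks.max():.3f}")
```

Aggregating the per-feature statistics (mean and worst case) gives a quick scalar summary of how faithfully the synthetic marginals track the real ones.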

Other top-level files

  • requirements.txt - Python dependencies required to run the BSTabDiff package and notebooks.
  • BSTabDiffArchi.png - High-level architecture diagram of the BSTabDiff framework.
  • LICENSE - MIT license for this repository.
  • README.md - Project overview, installation, usage instructions, and citation information.
  • .gitignore - Standard Git ignore rules for Python and Jupyter projects.
  • pyproject.toml - Build system and packaging metadata for installation.
  • setup.cfg - Package configuration and installation metadata.

Tested Environment

  • Python 3.10.13
  • torch 2.9.1+cu128
  • numpy 2.2.6
  • pandas 2.3.3
  • scikit-learn 1.7.2
  • catboost 1.2.8
  • tabpfn 6.3.1

Installation

You can install BSTabDiff in several ways depending on your workflow.


Option 1: Clone the Repository (Recommended for Development)

git clone https://github.com/zadid6pretam/BSTabDiff.git
cd BSTabDiff
pip install -r requirements.txt
pip install -e .

Option 2: Install Directly from GitHub (No Cloning Needed)

pip install "git+https://github.com/zadid6pretam/BSTabDiff.git"

Option 3: Use a Virtual Environment

python -m venv bstabdiff-env
source bstabdiff-env/bin/activate  # On Windows: bstabdiff-env\Scripts\activate

git clone https://github.com/zadid6pretam/BSTabDiff.git
cd BSTabDiff
pip install -r requirements.txt
pip install -e .

Option 4: Local Install Without Editable Mode

git clone https://github.com/zadid6pretam/BSTabDiff.git
cd BSTabDiff
pip install -r requirements.txt
pip install .

Option 5: Install from PyPI (Planned)

pip install bstabdiff

Example Usage

Below is a minimal example showing how to fit BSTabDiff on a dummy HDLSS dataset and generate synthetic samples.

import numpy as np
from bstabdiff import FeatureSpec, fit_block_subunit_generator

# Dummy HDLSS data
np.random.seed(42)
n, m = 80, 2000
X = np.random.randn(n, m).astype(np.float32)
y = np.random.randint(0, 2, size=n)
X[np.random.rand(n, m) < 0.1] = np.nan

# Feature schema
feature_specs = [FeatureSpec(name=f"f{j}", kind="continuous") for j in range(m)]

# Fit BSTabDiff
gen, train_info = fit_block_subunit_generator(
    X=X,
    feature_specs=feature_specs,
    y=y,
    M=20,
    blocks=None,
    permute_features=False,
    prior_type="diffusion",
    device="cpu",
    seed=42,
    prior_epochs=300,
    prior_batch=64,
    prior_lr=1e-3,
    verbose_every=100,
    save_dir=None,
    save_name="bstabdiff_demo",
    save_best=True,
    use_ema=True,
    ema_decay=0.999,
    return_train_info=True,
)

# Sample synthetic data
X_syn, R_syn, y_syn = gen.sample(n=50)

print("X_syn shape:", X_syn.shape)
print("R_syn shape:", R_syn.shape)
print("y_syn shape:", y_syn.shape if y_syn is not None else None)
print("Best training info:", train_info)
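One common way to quantify downstream utility of synthetic data (here with Logistic Regression, one of the classifiers used in the notebooks) is Train-on-Synthetic, Test-on-Real (TSTR): fit a classifier on synthetic samples only and evaluate it on held-out real data. A minimal sketch, with random placeholder arrays standing in for the `gen.sample(...)` output and a real test split:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

rng = np.random.default_rng(0)
# Placeholders for synthetic samples and a held-out real split.
X_syn, y_syn = rng.standard_normal((50, 100)), rng.integers(0, 2, 50)
X_real_test, y_real_test = rng.standard_normal((30, 100)), rng.integers(0, 2, 30)

# TSTR: train on synthetic data only, evaluate on real data.
clf = LogisticRegression(max_iter=1000).fit(X_syn, y_syn)
acc = accuracy_score(y_real_test, clf.predict(X_real_test))
print(f"TSTR accuracy: {acc:.3f}")
```

On real experiments the TSTR score is compared against the same classifier trained on real data, so that synthetic data approaching real-data performance indicates high downstream utility.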

For fuller experiments, ablations, and fidelity studies, see:

  • Dummy Example Usage.ipynb
  • BSTabDiff_Colon.ipynb
  • BSTabDiff_GLI.ipynb
  • BSTabDiff_Lung.ipynb

Our Previous Related Work on Tabular Deep Learning

BSTabDiff is part of our broader line of work on tabular deep learning and high-dimensional tabular modeling.

TabSeq

Our earlier work on sequential modeling for tabular data:

@inproceedings{habib2024tabseq,
  title={TabSeq: A Framework for Deep Learning on Tabular Data via Sequential Ordering},
  author={Habib, Al Zadid Sultan Bin and Wang, Kesheng and Hartley, Mary-Anne and Doretto, Gianfranco and Adjeroh, Donald A.},
  booktitle={International Conference on Pattern Recognition},
  pages={418--434},
  year={2024},
  organization={Springer}
}
  • If you are interested in sequential ordering for tabular data, deep sequential backbones, and early feature-ordering-based tabular modeling, please also refer to the TabSeq repository and paper.

DynaTab

Our more recent work on learned feature ordering for high-dimensional tabular data:

@inproceedings{habib2026dynatab,
  title     = {{DynaTab: Dynamic Feature Ordering as Neural Rewiring for High-Dimensional Tabular Data}},
  author    = {Habib, Al Zadid Sultan Bin and Doretto, Gianfranco and Adjeroh, Donald A.},
  booktitle = {Proceedings of the AAAI 2026 First International Workshop on Neuro for AI \& AI for Neuro: Towards Multi-Modal Natural Intelligence (NeuroAI)},
  year      = {2026},
  series    = {PMLR}
}
  • If you are interested in learned feature ordering, neural rewiring for high-dimensional tabular data, and sequential backbone design for HDLSS settings, please also refer to the DynaTab repository and paper.
  • The DynaTab camera-ready version has been submitted; the public proceedings version is expected to appear online in due course.

Contact

For any questions, issues, or suggestions related to this repository, please feel free to contact us or open an issue on GitHub.

Download files

Source Distribution

  • bstabdiff-0.1.0.tar.gz (19.9 kB)

Built Distribution

  • bstabdiff-0.1.0-py3-none-any.whl (16.2 kB)

File details

Details for the file bstabdiff-0.1.0.tar.gz.

File metadata

  • Download URL: bstabdiff-0.1.0.tar.gz
  • Size: 19.9 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.2.0 CPython/3.13.2

File hashes

Hashes for bstabdiff-0.1.0.tar.gz:

  • SHA256: 55df026d84668a646499a413cae6fcf8ed1981f078af0b4dec6a6a072f0d7508
  • MD5: 7246c2e80edaf94ee42a1af7708b5f9c
  • BLAKE2b-256: 93376a853c817000f700f42039041bdc1b3d21c8a92bd91a8e01cb7d159e9b21

File details

Details for the file bstabdiff-0.1.0-py3-none-any.whl.

File metadata

  • Download URL: bstabdiff-0.1.0-py3-none-any.whl
  • Size: 16.2 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.2.0 CPython/3.13.2

File hashes

Hashes for bstabdiff-0.1.0-py3-none-any.whl:

  • SHA256: 4279aa458d1aa5dd7a7a8280fed6fe7c7a6594f8908c94916610d9a08cbd2aa0
  • MD5: 18ba70192d503679ef2949a5ad59a29d
  • BLAKE2b-256: 3a47122872ad3537cf84e688d9f89490e73e76773eea9ff71073a30e262b4a54
