BSTabDiff: Block-Subunit Diffusion Priors for High-Dimensional Tabular Data Generation

BSTabDiff is a block-subunit generative framework for synthesizing High-Dimensional Low-Sample-Size (HDLSS) tabular data. Rather than learning dependence directly in the original feature space, it partitions the m features into M latent blocks, where M ≪ m. A compact diffusion/flow prior over the block latents models global structure, and the model decodes back to the full table using copula-based dependence, flexible feature-wise marginals, and explicit missingness modeling. This design targets omics-style and other HDLSS settings, where direct high-dimensional density learning is often unstable. Across multiple HDLSS benchmarks, BSTabDiff generates more realistic and stable synthetic data than several widely used tabular generators, while often approaching the downstream performance obtained from real data.
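
To make the block idea concrete, the sketch below shows only the shapes involved when m features are grouped into M blocks. It is not the package's implementation: the contiguous equal-size split, the helper name `contiguous_blocks`, and the per-block mean used as a stand-in "latent" are all assumptions made for illustration. BSTabDiff learns its block latents and decodes them with copula-based emissions, which this snippet does not attempt.

```python
import numpy as np


def contiguous_blocks(m, M):
    """Hypothetical helper: split feature indices 0..m-1 into M contiguous blocks."""
    return np.array_split(np.arange(m), M)


rng = np.random.default_rng(0)
n, m, M = 80, 2000, 20

# Toy HDLSS table with roughly 10% missing entries.
X = rng.standard_normal((n, m)).astype(np.float32)
X[rng.random((n, m)) < 0.1] = np.nan

blocks = contiguous_blocks(m, M)

# Boolean mask of observed entries (True where a value is present).
observed = ~np.isnan(X)

# Crude stand-in for a block latent: the mean of observed values per block.
# The real model learns these latents; this only shows the n x m -> n x M reduction.
Z = np.stack([np.nanmean(X[:, b], axis=1) for b in blocks], axis=1)

print("features per block:", len(blocks[0]))
print("table shape:", X.shape, "-> block summary shape:", Z.shape)
```

The point of the reduction is that a prior over an n x M array is far easier to fit from 80 rows than a density over all 2000 columns.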

Citation

Al Zadid Sultan Bin Habib, Md Younus Ahamed, Prashnna Kumar Gyawali, Gianfranco Doretto, and Donald A. Adjeroh.
“BSTabDiff: Block-Subunit Diffusion Priors for High-Dimensional Tabular Data Generation.”
In ICLR 2026 2nd Workshop on Deep Generative Models in Machine Learning: Theory, Principle and Efficacy (DeLTa), 2026.

BibTeX:

@inproceedings{habib2026bstabdiff,
  title     = {BSTabDiff: Block-Subunit Diffusion Priors for High-Dimensional Tabular Data Generation},
  author    = {Habib, Al Zadid Sultan Bin and Ahamed, Md Younus and Gyawali, Prashnna Kumar and Doretto, Gianfranco and Adjeroh, Donald A.},
  booktitle = {ICLR 2026 2nd Workshop on Deep Generative Models in Machine Learning: Theory, Principle and Efficacy (DeLTa)},
  year      = {2026}
}

Files and Repository Structure

Python package: `bstabdiff/`

This folder contains the core BSTabDiff implementation:

__init__.py - Package initializer and high-level API exports.
block_subunit_gen.py - Main BSTabDiff implementation, including feature schema, empirical marginals, block-subunit emissions, diffusion/flow priors, training, and synthetic sampling utilities.

Notebooks

Dummy Example Usage.ipynb - Simple toy examples showing how to install and import the bstabdiff package, fit BSTabDiff on a dummy HDLSS dataset, and sample synthetic data.
BSTabDiff_Colon.ipynb - The Colon dataset experiments from the paper. The downstream classifiers include Logistic Regression, TabPFN-2.5 (applicable only when the number of features is within its supported range, which Colon satisfies), TANDEM (NeurIPS 2025), and CatBoost. This notebook also includes the paper’s ablation studies and related fidelity analysis.
BSTabDiff_GLI.ipynb - The GLI-85 experiments using Logistic Regression as the downstream classifier, along with selected fidelity analysis.
BSTabDiff_Lung.ipynb - The Lung dataset experiments using Logistic Regression as the downstream classifier, along with selected fidelity analysis.

Other top-level files

requirements.txt - Python dependencies required to run the BSTabDiff package and notebooks.
BSTabDiffArchi.png - High-level architecture diagram of the BSTabDiff framework.
LICENSE - MIT license for this repository.
README.md - Project overview, installation, usage instructions, and citation information.
.gitignore - Standard Git ignore rules for Python and Jupyter projects.
pyproject.toml - Build system and packaging metadata for installation.
setup.cfg - Package configuration and installation metadata.

Tested Environment

Python 3.10.13
torch 2.9.1+cu128
numpy 2.2.6
pandas 2.3.3
scikit-learn 1.7.2
catboost 1.2.8
tabpfn 6.3.1
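
To compare a local environment with the versions listed above, a small standard-library check along these lines may help. The `TESTED` dictionary simply restates the list; the helper name `installed_versions` is hypothetical, and a version mismatch does not by itself mean the package will fail.

```python
from importlib.metadata import PackageNotFoundError, version

# Versions copied from the tested environment listed above.
TESTED = {
    "torch": "2.9.1+cu128",
    "numpy": "2.2.6",
    "pandas": "2.3.3",
    "scikit-learn": "1.7.2",
    "catboost": "1.2.8",
    "tabpfn": "6.3.1",
}


def installed_versions(names):
    """Return {distribution name: installed version, or None if not installed}."""
    found = {}
    for name in names:
        try:
            found[name] = version(name)
        except PackageNotFoundError:
            found[name] = None
    return found


report = installed_versions(TESTED)
for name, tested in TESTED.items():
    print(f"{name:14s} tested={tested:14s} installed={report[name]}")
```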

Installation

You can install BSTabDiff in several ways depending on your workflow.

Option 1: Clone the Repository (Recommended for Development)

git clone https://github.com/zadid6pretam/BSTabDiff.git
cd BSTabDiff
pip install -r requirements.txt
pip install -e .

Option 2: Install Directly from GitHub (No Cloning Needed)

pip install "git+https://github.com/zadid6pretam/BSTabDiff.git"

Option 3: Use a Virtual Environment

python -m venv bstabdiff-env
source bstabdiff-env/bin/activate  # On Windows: bstabdiff-env\Scripts\activate

git clone https://github.com/zadid6pretam/BSTabDiff.git
cd BSTabDiff
pip install -r requirements.txt
pip install -e .

Option 4: Local Install Without Editable Mode

git clone https://github.com/zadid6pretam/BSTabDiff.git
cd BSTabDiff
pip install -r requirements.txt
pip install .

Option 5: Install from PyPI

pip install bstabdiff

Example Usage

Below is a minimal example showing how to fit BSTabDiff on a dummy HDLSS dataset and generate synthetic samples.

import numpy as np
from bstabdiff import FeatureSpec, fit_block_subunit_generator

# Dummy HDLSS data
np.random.seed(42)
n, m = 80, 2000
X = np.random.randn(n, m).astype(np.float32)
y = np.random.randint(0, 2, size=n)
X[np.random.rand(n, m) < 0.1] = np.nan

# Feature schema
feature_specs = [FeatureSpec(name=f"f{j}", kind="continuous") for j in range(m)]

# Fit BSTabDiff
gen, train_info = fit_block_subunit_generator(
    X=X,
    feature_specs=feature_specs,
    y=y,
    M=20,
    blocks=None,
    permute_features=False,
    prior_type="diffusion",
    device="cpu",
    seed=42,
    prior_epochs=300,
    prior_batch=64,
    prior_lr=1e-3,
    verbose_every=100,
    save_dir=None,
    save_name="bstabdiff_demo",
    save_best=True,
    use_ema=True,
    ema_decay=0.999,
    return_train_info=True,
)

# Sample synthetic data
X_syn, R_syn, y_syn = gen.sample(n=50)

print("X_syn shape:", X_syn.shape)
print("R_syn shape:", R_syn.shape)
print("y_syn shape:", y_syn.shape if y_syn is not None else None)
print("Best training info:", train_info)
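
One common way to judge synthetic tabular data is to train a classifier on synthetic rows and score it on real rows (often called train-on-synthetic, test-on-real). The sketch below illustrates that protocol with Logistic Regression from scikit-learn. It is an assumption-laden illustration rather than the notebooks' exact protocol: the helper names `tstr_accuracy` and `make_split` are hypothetical, and the arrays are random stand-ins so the snippet runs without the package. In practice the synthetic arrays would come from `gen.sample(...)` and the real arrays from a held-out split.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler


def tstr_accuracy(X_syn, y_syn, X_real, y_real):
    """Train on synthetic rows, report accuracy on real rows."""
    clf = make_pipeline(
        SimpleImputer(strategy="median"),  # both tables may contain NaN
        StandardScaler(),
        LogisticRegression(max_iter=1000),
    )
    clf.fit(X_syn, y_syn)
    return clf.score(X_real, y_real)


rng = np.random.default_rng(0)


def make_split(n, m=200):
    """Stand-in data: five informative features plus about 10% missing entries."""
    y = rng.integers(0, 2, size=n)
    X = rng.standard_normal((n, m))
    X[:, :5] += 1.5 * y[:, None]
    X[rng.random((n, m)) < 0.1] = np.nan
    return X, y


X_syn_demo, y_syn_demo = make_split(50)    # placeholder for generator output
X_real_demo, y_real_demo = make_split(80)  # placeholder for held-out real data

acc = tstr_accuracy(X_syn_demo, y_syn_demo, X_real_demo, y_real_demo)
print(f"TSTR accuracy: {acc:.3f}")
```

Comparing this score with the same classifier trained on real data gives a rough sense of how much downstream signal the synthetic table preserves.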

For fuller experiments, ablations, and fidelity studies, see:

Dummy Example Usage.ipynb
BSTabDiff_Colon.ipynb
BSTabDiff_GLI.ipynb
BSTabDiff_Lung.ipynb

Our Previous Related Work on Tabular Deep Learning

BSTabDiff is part of our broader line of work on tabular deep learning and high-dimensional tabular modeling.

TabSeq

Our earlier work on sequential modeling for tabular data:

TabSeq: A Framework for Deep Learning on Tabular Data via Sequential Ordering
GitHub: https://github.com/zadid6pretam/TabSeq
Springer (ICPR 2024 proceedings): https://link.springer.com/chapter/10.1007/978-3-031-78128-5_27

@inproceedings{habib2024tabseq,
  title={TabSeq: A Framework for Deep Learning on Tabular Data via Sequential Ordering},
  author={Habib, Al Zadid Sultan Bin and Wang, Kesheng and Hartley, Mary-Anne and Doretto, Gianfranco and Adjeroh, Donald A.},
  booktitle={International Conference on Pattern Recognition},
  pages={418--434},
  year={2024},
  organization={Springer}
}

If you are interested in sequential ordering for tabular data, deep sequential backbones, and early feature-ordering-based tabular modeling, please also refer to the TabSeq repository and paper.

DynaTab

Our more recent work on learned feature ordering for high-dimensional tabular data:

DynaTab: Dynamic Feature Ordering as Neural Rewiring for High-Dimensional Tabular Data
GitHub: https://github.com/zadid6pretam/DynaTab

@inproceedings{habib2026dynatab,
  title     = {{DynaTab: Dynamic Feature Ordering as Neural Rewiring for High-Dimensional Tabular Data}},
  author    = {Habib, Al Zadid Sultan Bin and Doretto, Gianfranco and Adjeroh, Donald A.},
  booktitle = {Proceedings of the AAAI 2026 First International Workshop on Neuro for AI \& AI for Neuro: Towards Multi-Modal Natural Intelligence (NeuroAI)},
  year      = {2026},
  series    = {PMLR}
}

If you are interested in learned feature ordering, neural rewiring for high-dimensional tabular data, and sequential backbone design for HDLSS settings, please also refer to the DynaTab repository and paper.
The camera-ready version of DynaTab has been submitted, and the public proceedings version is expected to appear online later.

Contact

For any questions, issues, or suggestions related to this repository, please feel free to contact us or open an issue on GitHub.

Project details

This version

0.1.0

Mar 24, 2026

Download files

Source Distribution

bstabdiff-0.1.0.tar.gz (19.9 kB)

Uploaded Mar 24, 2026 Source

Built Distribution

bstabdiff-0.1.0-py3-none-any.whl (16.2 kB)

Uploaded Mar 24, 2026 Python 3

File details

Details for the file bstabdiff-0.1.0.tar.gz.

File metadata

Download URL: bstabdiff-0.1.0.tar.gz
Upload date: Mar 24, 2026
Size: 19.9 kB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.2.0 CPython/3.13.2

File hashes

Hashes for bstabdiff-0.1.0.tar.gz
Algorithm	Hash digest
SHA256	`55df026d84668a646499a413cae6fcf8ed1981f078af0b4dec6a6a072f0d7508`
MD5	`7246c2e80edaf94ee42a1af7708b5f9c`
BLAKE2b-256	`93376a853c817000f700f42039041bdc1b3d21c8a92bd91a8e01cb7d159e9b21`

File details

Details for the file bstabdiff-0.1.0-py3-none-any.whl.

File metadata

Download URL: bstabdiff-0.1.0-py3-none-any.whl
Upload date: Mar 24, 2026
Size: 16.2 kB
Tags: Python 3
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.2.0 CPython/3.13.2

File hashes

Hashes for bstabdiff-0.1.0-py3-none-any.whl
Algorithm	Hash digest
SHA256	`4279aa458d1aa5dd7a7a8280fed6fe7c7a6594f8908c94916610d9a08cbd2aa0`
MD5	`18ba70192d503679ef2949a5ad59a29d`
BLAKE2b-256	`3a47122872ad3537cf84e688d9f89490e73e76773eea9ff71073a30e262b4a54`
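
A downloaded file can be checked against the SHA256 digests listed above with Python's standard `hashlib` module. The sketch below is one way to do it; the helper names `sha256_of` and `matches` are hypothetical, and the demonstration hashes a throwaway temporary file so that it runs without the actual distribution files present.

```python
import hashlib
import tempfile
from pathlib import Path


def sha256_of(path, chunk_size=1 << 20):
    """Hash a file in chunks so large files are not loaded into memory at once."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches(path, expected_hex):
    """True if the file's SHA256 equals the expected hex digest."""
    return sha256_of(path) == expected_hex.strip().lower()


# Demonstration on a temporary file with known contents.
with tempfile.TemporaryDirectory() as tmp:
    demo_path = Path(tmp) / "demo.bin"
    demo_path.write_bytes(b"abc")
    demo_ok = matches(demo_path, hashlib.sha256(b"abc").hexdigest())

print("demo digest matches:", demo_ok)
```

For the real files, pass the local path of the downloaded archive or wheel together with the corresponding SHA256 value from the tables above.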
