TabStruct is a comprehensive benchmark suite for tabular data generation, prediction, and evaluation.
[ICLR 2026 Oral] TabStruct – Tabular Structural Fidelity
Official code for the paper "TabStruct: Measuring Structural Fidelity of Tabular Data", published at The Fourteenth International Conference on Learning Representations (ICLR 2026, Oral).
TabStruct provides the full experimental pipeline used in the paper, including generation, predictive modelling, and evaluation protocols for structural fidelity.
Authored by Xiangjian Jiang, Nikola Simidjievski, and Mateja Jamnik (University of Cambridge, UK).
📌 Overview
TabStruct is an end‑to‑end benchmark for tabular data generation, prediction, and evaluation. It ships with ready‑to‑use pipelines for
- generating high‑quality synthetic tables,
- predicting with machine learning models, and
- analysing results with a rich suite of metrics – especially those that quantify structural fidelity.
The benchmark targets both research and applied workflows: you can run the standard baselines out of the box, or plug in custom generators and predictors and evaluate them under the same protocol. All components are plug‑and‑play, so you can mix, match, and extend them to suit your own workflow.
📚 Key Features
Data generation
- Out‑of‑the‑box support for popular tabular generators: SMOTE, TVAE, CTGAN, NFlow, TabDDPM, ARF, and more.
- Supports customised setups (classical oversampling, deep generative models, and probabilistic approaches) so different modelling assumptions can be compared under one interface.
Evaluation dimensions
- Density estimation – How well does the synthetic data approximate the real distribution?
- Privacy preservation – Does the generator leak sensitive records?
- ML efficacy – How do models trained on synthetic data perform compared to real data?
- Structural fidelity – Does the generator respect the causal structures of real data?
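To make the ML efficacy question concrete, here is a minimal sketch of the usual "train on synthetic, test on real" comparison. It is not TabStruct's implementation: the toy data, the pure-Python 1-NN classifier, and every function name below are assumptions chosen so the example runs with the standard library alone.

```python
import math
import random


def nearest_neighbour_predict(train_X, train_y, x):
    """Return the label of the training row closest to x (Euclidean distance)."""
    best_idx = min(range(len(train_X)), key=lambda i: math.dist(train_X[i], x))
    return train_y[best_idx]


def accuracy(train_X, train_y, test_X, test_y):
    """1-NN accuracy: fraction of test rows that receive the correct label."""
    hits = sum(
        nearest_neighbour_predict(train_X, train_y, x) == y
        for x, y in zip(test_X, test_y)
    )
    return hits / len(test_y)


def make_blobs(n, rng, noise):
    """Toy two-class data: class 0 around (0, 0), class 1 around (3, 3)."""
    X, y = [], []
    for i in range(n):
        label = i % 2
        centre = 3.0 * label
        X.append((rng.gauss(centre, noise), rng.gauss(centre, noise)))
        y.append(label)
    return X, y


rng = random.Random(0)
real_train_X, real_train_y = make_blobs(100, rng, noise=0.5)
real_test_X, real_test_y = make_blobs(50, rng, noise=0.5)
# Stand-in for a generator's output: same classes, slightly noisier.
synth_X, synth_y = make_blobs(100, rng, noise=0.7)

# Both models are scored on the same held-out real rows.
real_score = accuracy(real_train_X, real_train_y, real_test_X, real_test_y)
synth_score = accuracy(synth_X, synth_y, real_test_X, real_test_y)
print(f"train on real:      {real_score:.3f}")
print(f"train on synthetic: {synth_score:.3f}")
print(f"efficacy gap:       {real_score - synth_score:+.3f}")
```

A small gap suggests the synthetic table carries most of the predictive signal of the real one. It says nothing about the other three dimensions, which is why they are measured separately.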
Predictive tasks
- Classification & regression pipelines built on scikit‑learn, with optional neural‑network backbones.
- Unified training/evaluation entry points make it straightforward to benchmark models across datasets with consistent splits, logging, and reproducibility settings.
🚀 Installation
We recommend managing dependencies with conda + mamba.
# 1️⃣ Upgrade conda and activate the base env
conda update -n base -c conda-forge conda
conda activate base
# 2️⃣ Install the high‑performance dependency resolver
conda install conda-libmamba-solver --yes
conda config --set solver libmamba
conda install -c conda-forge mamba --yes
# 3️⃣ Create a new conda env
conda create --name tabstruct python=3.10.18 --no-default-packages
conda activate tabstruct
# 4️⃣ Set up the env
bash scripts/utils/install.sh
📊 Logging with W&B
TabStruct logs every experiment to Weights & Biases (W&B). To log to your own workspace instead of the default project, change the entity and project names in src/tabstruct/common/__init__.py:
WANDB_ENTITY = "tabular-data-generation"
WANDB_PROJECT = "TabStruct"
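If you would rather not edit the module, one possible pattern is to resolve these two values from the environment and fall back to the defaults above. The sketch below is an assumption, not part of TabStruct: `resolve_wandb_target` is a hypothetical helper, and although `WANDB_ENTITY` and `WANDB_PROJECT` are environment variables that the W&B client itself recognises, this README does not say whether TabStruct honours them.

```python
import os

# Defaults copied from src/tabstruct/common/__init__.py as shown above.
DEFAULT_WANDB_ENTITY = "tabular-data-generation"
DEFAULT_WANDB_PROJECT = "TabStruct"


def resolve_wandb_target(env=None):
    """Hypothetical helper: pick the W&B entity/project, preferring env vars."""
    env = os.environ if env is None else env
    # Empty strings count as "not set" and fall back to the defaults.
    entity = env.get("WANDB_ENTITY") or DEFAULT_WANDB_ENTITY
    project = env.get("WANDB_PROJECT") or DEFAULT_WANDB_PROJECT
    return entity, project


if __name__ == "__main__":
    entity, project = resolve_wandb_target()
    print(f"Runs would be logged to {entity}/{project}")
```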
✅ Quick sanity check
Run a toy classification job (K‑NN on the Adult dataset):
python -m src.tabstruct.experiment.run_experiment \
--model knn \
--save_model \
--dataset adult \
--test_size 0.2 \
--valid_size 0.1 \
--tags ENV-TEST
A successful run prints a series of green log lines like:
[YYYY‑MM‑DD] Codebase: >>>>>>>>>> Launching create_data_module() <<<<<<<<<<<
…
If you see those, congratulations – your environment is ready! 🎉
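The sanity check passes `--test_size 0.2` and `--valid_size 0.1`. This README does not state whether the validation fraction is taken from the full dataset or from what remains after the test split, so the sketch below simply computes the row counts under both readings. The function name and its `valid_from` parameter are assumptions made for illustration.

```python
def split_sizes(n_rows, test_size=0.2, valid_size=0.1, valid_from="full"):
    """Return (train, valid, test) row counts for fractional split sizes.

    valid_from="full":      valid_size is a fraction of the whole dataset.
    valid_from="remainder": valid_size is a fraction of the rows left
                            after the test split has been removed.
    Which reading TabStruct uses is NOT confirmed by the README.
    """
    n_test = round(n_rows * test_size)
    if valid_from == "full":
        n_valid = round(n_rows * valid_size)
    elif valid_from == "remainder":
        n_valid = round((n_rows - n_test) * valid_size)
    else:
        raise ValueError(f"unknown valid_from: {valid_from!r}")
    n_train = n_rows - n_test - n_valid
    return n_train, n_valid, n_test


for mode in ("full", "remainder"):
    print(mode, split_sizes(1000, valid_from=mode))
```

For 1,000 rows, the first reading gives 700/100/200 (train/valid/test) and the second gives 720/80/200.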
💥 Example Workflows
1. Generate synthetic data
Template script: docs/tutorial/example_scripts/generation/train.sh
python -m src.tabstruct.experiment.run_experiment \
--pipeline "generation" \
--generation_only \
--model "smote" \
--dataset "mfeat-fourier" \
--test_size 0.2 \
--valid_size 0.1 \
--tags "dev"
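When the same command has to be launched many times, it can help to build the argument list in Python rather than editing shell scripts. The sketch below only uses flags that appear in the command above; `build_command` is a hypothetical helper, not part of TabStruct.

```python
import shlex
import sys


def build_command(options):
    """Turn an options dict into an argv list for run_experiment.

    A value of True emits a bare flag (e.g. --generation_only);
    False or None skips the option; anything else becomes "--key value".
    """
    argv = [sys.executable, "-m", "src.tabstruct.experiment.run_experiment"]
    for key, value in options.items():
        if value is True:
            argv.append(f"--{key}")
        elif value is False or value is None:
            continue
        else:
            argv.extend([f"--{key}", str(value)])
    return argv


generation_options = {
    "pipeline": "generation",
    "generation_only": True,
    "model": "smote",
    "dataset": "mfeat-fourier",
    "test_size": 0.2,
    "valid_size": 0.1,
    "tags": "dev",
}

command = build_command(generation_options)
print(shlex.join(command))
# To launch it from the repository root:
#   import subprocess; subprocess.run(command, check=True)
```

Passing a list to `subprocess.run` avoids shell quoting issues, and `shlex.join` gives a copy-pasteable form for logs.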
2. Evaluate synthetic data
Template script: docs/tutorial/example_scripts/generation/eval.sh
python -m src.tabstruct.experiment.run_experiment \
--pipeline "generation" \
--model "smote" \
--eval_only \
--dataset "mfeat-fourier" \
--test_size 0.2 \
--valid_size 0.1 \
--generator_tags "dev" \
--tags "dev"
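In the two workflows above, the evaluation run passes `--generator_tags "dev"`, matching the `--tags "dev"` of the generation run. Assuming that this tag is how evaluation locates the generated data (the README implies this but does not state it), a small helper can keep the two commands in sync. `paired_commands` is hypothetical and uses only flags shown above.

```python
import shlex

ENTRY_POINT = ["python", "-m", "src.tabstruct.experiment.run_experiment"]


def paired_commands(model, dataset, tag, test_size=0.2, valid_size=0.1):
    """Return (generate, evaluate) argv lists that share one tag and one split."""
    shared = ENTRY_POINT + [
        "--pipeline", "generation",
        "--model", model,
        "--dataset", dataset,
        "--test_size", str(test_size),
        "--valid_size", str(valid_size),
    ]
    generate = shared + ["--generation_only", "--tags", tag]
    # Assumption: --generator_tags must equal the tag used at generation time.
    evaluate = shared + ["--eval_only", "--generator_tags", tag, "--tags", tag]
    return generate, evaluate


generate_cmd, evaluate_cmd = paired_commands("smote", "mfeat-fourier", "dev")
print(shlex.join(generate_cmd))
print(shlex.join(evaluate_cmd))
```

Deriving both commands from one call means the model, dataset, split sizes, and tag cannot drift apart between the generation and evaluation steps.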
3. Predict on tabular data
Template script: docs/tutorial/example_scripts/prediction/train.sh
python -m src.tabstruct.experiment.run_experiment \
--model 'mlp' \
--save_model \
--max_steps_tentative 1500 \
--dataset 'adult' \
--test_size 0.2 \
--valid_size 0.1 \
--tags 'dev'
📖 Citation
For attribution in academic contexts, please cite this work as:
@inproceedings{jiang2026tabstruct,
title={TabStruct: Measuring Structural Fidelity of Tabular Data},
author={Jiang, Xiangjian and Simidjievski, Nikola and Jamnik, Mateja},
booktitle={The Fourteenth International Conference on Learning Representations},
year={2026}
}
@inproceedings{jiang2025well,
title={How Well Does Your Tabular Generator Learn the Structure of Tabular Data?},
author={Jiang, Xiangjian and Simidjievski, Nikola and Jamnik, Mateja},
booktitle={ICLR 2025 Workshop on Deep Generative Model in Machine Learning: Theory, Principle and Efficacy},
year={2025}
}