
Generating and Imputing Tabular Data via Diffusion and Flow XGBoost Models

Project description

Tabular data is hard to acquire and often contains missing values. This paper proposes a novel approach to generate and impute mixed-type (continuous and categorical) tabular data using score-based diffusion and conditional flow matching. Contrary to previous work that relies on neural networks as function approximators, we instead utilize XGBoost, a popular Gradient-Boosted Tree (GBT) method. In addition to our method being elegant, we empirically show on various datasets that it i) generates highly realistic synthetic data when the training dataset is either clean or tainted by missing data and ii) generates diverse plausible data imputations. Our method often outperforms deep-learning generation methods and can be trained in parallel using CPUs without the need for a GPU. To make it easily accessible, we release our code through a Python library and an R package <arXiv:2309.09968>.
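To make the conditional flow matching idea concrete, here is a minimal sketch of how regression pairs could be built for one time level. It is not the package's code: the function name `flow_matching_pairs`, the parameter `duplicate_k`, and the linear noise-to-data interpolation are assumptions chosen for illustration. In the approach described above, a regressor such as XGBoost would then be fit on these pairs at each time level; that step is omitted here to keep the example dependency-free.

```python
import random


def flow_matching_pairs(data, t, duplicate_k=2, seed=0):
    """Build (x_t, target) regression pairs for one time level t in [0, 1].

    Assumed convention (illustrative only): t=0 is pure Gaussian noise,
    t=1 is the data, and x_t interpolates linearly between them.
    """
    rng = random.Random(seed)
    inputs, targets = [], []
    for row in data:
        # Pair each data row with several independent noise draws.
        for _ in range(duplicate_k):
            noise = [rng.gauss(0.0, 1.0) for _ in row]
            # Point on the straight path from noise to data at time t.
            x_t = [(1.0 - t) * n + t * x for n, x in zip(noise, row)]
            # Velocity of that path, which the regressor would learn to predict.
            target = [x - n for n, x in zip(noise, row)]
            inputs.append(x_t)
            targets.append(target)
    return inputs, targets


data = [[0.5, -1.0], [2.0, 3.0]]
inputs, targets = flow_matching_pairs(data, t=0.5)
print(len(inputs), len(inputs[0]))
```

Under this convention, moving from `x_t` along the target velocity for the remaining time `1 - t` lands exactly on the data row, which is the property a learned vector field tries to reproduce at generation time.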


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distributions

No source distribution files available for this release. See the tutorial on generating distribution archives.

Built Distribution

ForestDiffusion-1.0.4-py3-none-any.whl (11.3 kB)

Uploaded Python 3

File details

Details for the file ForestDiffusion-1.0.4-py3-none-any.whl.

File hashes

Hashes for ForestDiffusion-1.0.4-py3-none-any.whl
Algorithm Hash digest
SHA256 d34552f0d1b5a28c19d36f82fb4c84ecbeca162dcb925dd531061160f1fd9422
MD5 6ede010f9612dcccb313441be8f30797
BLAKE2b-256 41e840242d98ac5233aa5f5cc6413dcbce3310916e4e8e7bcb7399f747d521bb

See more details on using hashes here.
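As an illustration of how the digests listed above can be used, the following sketch checks a downloaded wheel against the published SHA256 value using only the Python standard library. The helper names `sha256_of_file` and `matches_expected` are hypothetical, and the file path in the usage comment assumes the wheel was saved to the current directory.

```python
import hashlib

# SHA256 digest published above for ForestDiffusion-1.0.4-py3-none-any.whl.
EXPECTED_SHA256 = "d34552f0d1b5a28c19d36f82fb4c84ecbeca162dcb925dd531061160f1fd9422"


def sha256_of_file(path, chunk_size=65536):
    """Return the hex SHA256 digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_expected(path, expected=EXPECTED_SHA256):
    """True if the file's SHA256 digest equals the expected hex string."""
    return sha256_of_file(path) == expected.lower()


# Usage (assumes the wheel is in the current directory):
# matches_expected("ForestDiffusion-1.0.4-py3-none-any.whl")
```

pip can perform the same check at install time when a requirements file pins the hash and installation is run with `--require-hashes`.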
