Generating and Imputing Tabular Data via Diffusion and Flow XGBoost Models

Project description

Tabular data is hard to acquire and is subject to missing values. This paper proposes a novel approach to generate and impute mixed-type (continuous and categorical) tabular data using score-based diffusion and conditional flow matching. Contrary to previous work that relies on neural networks as function approximators, we instead use XGBoost, a popular Gradient-Boosted Tree (GBT) method. Beyond the elegance of the approach, we show empirically on various datasets that our method i) generates highly realistic synthetic data when the training dataset is either clean or tainted by missing data and ii) generates diverse, plausible data imputations. Our method often outperforms deep-learning generation methods and can be trained in parallel on CPUs without the need for a GPU. To make it easily accessible, we release our code through a Python library and an R package <arXiv:2309.09968>.
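To make the conditional flow matching idea concrete, here is a minimal sketch of how regression pairs could be built before a tree model is fit on them. It is not the library's code: the function name `make_flow_matching_pairs`, the parameter `n_t`, and the convention that noise sits at t = 0 and data at t = 1 are assumptions chosen for illustration, and the regressor itself (XGBoost in the paper) is left out so the example runs with the standard library only.

```python
import random


def make_flow_matching_pairs(data_rows, n_t=5, seed=0):
    """Build (t, x_t, target) regression pairs for conditional flow matching.

    Hypothetical helper, not part of ForestDiffusion. For each data row x1,
    draw Gaussian noise x0, then for every time t on a uniform grid in [0, 1]
    form the interpolant x_t = (1 - t) * x0 + t * x1. The regression target is
    the constant velocity x1 - x0 of that straight-line path.
    """
    if n_t < 2:
        raise ValueError("n_t must be at least 2")
    rng = random.Random(seed)
    times = [i / (n_t - 1) for i in range(n_t)]
    pairs = []
    for x1 in data_rows:
        # One noise vector per data row (assumed convention: noise at t = 0).
        x0 = [rng.gauss(0.0, 1.0) for _ in x1]
        target = [b - a for a, b in zip(x0, x1)]
        for t in times:
            x_t = [(1 - t) * a + t * b for a, b in zip(x0, x1)]
            pairs.append((t, x_t, target))
    return pairs


pairs = make_flow_matching_pairs([[1.0, 2.0], [3.0, 4.0]], n_t=3)
```

A regressor trained to map `x_t` (at a given `t`) to `target` approximates the vector field; new samples would then be produced by starting from noise and integrating that field from t = 0 to t = 1, for example with Euler steps.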


Download files

Source Distributions

No source distribution files available for this release.

Built Distribution

ForestDiffusion-1.0.3-py3-none-any.whl (11.2 kB)

Uploaded Python 3

File details

Details for the file ForestDiffusion-1.0.3-py3-none-any.whl.

File hashes

Hashes for ForestDiffusion-1.0.3-py3-none-any.whl
Algorithm Hash digest
SHA256 5695ca196d12939d81aabc62939b85952e92f1b4b0de43063ac80b31ee3cbe55
MD5 e088b4b8a5983bf5bba2f2b2f862c7fc
BLAKE2b-256 5e30500a6bbed22173b464bd87ecbc3cb4298ca21d9dee3325374f370e904bff
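The SHA256 digest listed above can be checked against a downloaded copy of the wheel. The following is a minimal sketch using only Python's standard `hashlib` module; the helper names `sha256_of_file` and `matches_expected` are illustrative, and the local file path is an assumption about where the wheel was saved.

```python
import hashlib

# SHA256 digest copied from the hash listing above.
EXPECTED_SHA256 = "5695ca196d12939d81aabc62939b85952e92f1b4b0de43063ac80b31ee3cbe55"


def sha256_of_file(path, chunk_size=65536):
    """Return the hex SHA256 digest of the file at `path`, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_expected(path, expected=EXPECTED_SHA256):
    """Return True if the file's SHA256 digest equals `expected` (case-insensitive)."""
    return sha256_of_file(path) == expected.lower()
```

For a wheel saved in the current directory, `matches_expected("ForestDiffusion-1.0.3-py3-none-any.whl")` would return `True` only if the file is byte-identical to the one the digest was computed from.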