Missingness-aware data imputation with Bayesian generative modeling (MissBGM).
Project description
MissBGM
Missingness-aware data imputation with Bayesian generative modeling and uncertainty quantification.
MissBGM is a missingness-aware Bayesian generative model for imputing data with non-ignorable missingness (Missing Not At Random, MNAR). It jointly models the data-generating process and the missingness mechanism, enabling both accurate imputations and principled uncertainty quantification via posterior sampling.
Highlights
- Joint modeling data-generating process and missingness mechanisms
- MAP imputation for fast point estimates
- Posterior sampling (MCMC) for uncertainty quantification and prediction intervals
- Bridging Bayesian deep learning and missing data imputation
Model overview
Figure 1: MissBGM model overview. If the PDF does not render inline on GitHub, click the figure to open fig1.pdf.
Installation
Create a conda environment
conda create -n missbgm python=3.12 -y
conda activate missbgm
Install from source
git clone https://github.com/liuq-lab/MissBGM.git && cd MissBGM
pip install -e .
Install from PyPI
pip install missbgm
Dependencies
This project is tested with:
- Python 3.12
- TensorFlow 2.18.0
- TensorFlow Probability 0.25.0
tf-keras==2.18.0numpy,pandas,pyyaml,scikit-learn
Quickstart (Python API)
Below is a minimal end-to-end example that mirrors the logic in main.py: simulate MNAR data, train MissBGM, read MAP imputations, and then draw posterior samples to produce uncertainty intervals.
import yaml
from missbgm.models import MissBGM
from missbgm.datasets import simulate_mnar_oracle_data
# Load experiment configuration
params = yaml.safe_load(open("configs/Sim_MNAR_oracle.yaml", "r"))
# Simulate MNAR-oracle data (synthetic benchmark)
data = simulate_mnar_oracle_data(
n_samples=500,
x_dim=50,
n_anchor=5,
missing_rate=0.5,
alpha=0.05,
seed=123,
)
# Instantiate the MissBGM model with the configuration
model = MissBGM(params, random_seed=42)
# Train the MissBGM model
model.fit(data=data["x_obs"], mask=data["mask"], x_true=data["x_full"], verbose=1)
# Get the MAP imputation
map_imputed = model.x_map_imputed_
# Make posterior predictions with uncertainty quantification
mcmc_imputed, intervals = model.predict(
data=data["x_obs"],
mask=data["mask"],
x_true=data["x_full"],
alpha=0.05,
n_mcmc=1000,
burn_in=1000,
step_size=0.1,
num_leapfrog_steps=5,
seed=42,
verbose=1,
)
Reproducing experiments with main.py
main.py is the primary experiment entrypoint. It reads a YAML config (-c) and runs:
- Synthetic simulation when
dataset: Sim_MNAR_oracle - Real-data benchmark otherwise (the code normalizes data, trains, and evaluates RMSE on missing entries)
1) Synthetic MNAR simulation
Use configs/Sim_MNAR_oracle.yaml:
python main.py -c configs/Sim_MNAR_oracle.yaml
2) Real-data benchmarks (4 datasets)
The following configs are provided:
configs/Real_Wine.yaml(UCI Wine)configs/Real_Concrete.yaml(UCI Concrete)configs/Real_Breast.yaml(UCI Breast Cancer Wisconsin Original)configs/Real_Gisette.yaml(UCI Gisette (high-dimensional))
Run each dataset:
python main.py -c configs/Real_Wine.yaml
python main.py -c configs/Real_Concrete.yaml
python main.py -c configs/Real_Breast.yaml
python main.py -c configs/Real_Gisette.yaml
If you want to re-run even when cached staged files / outputs exist, add --force:
python main.py -c configs/Real_Wine.yaml --force
What main.py does (pipeline summary)
For each run:
- Loads config from YAML (e.g.,
configs/Real_Wine.yaml) - Prepares data
- Synthetic: calls
simulate_mnar_oracle_data(...) - Real: calls
prepare_real_benchmark_data(dataset_name, missing_rate=..., seed=..., force=...)
- Synthetic: calls
- Trains
MissBGM.fit(...) - Computes MAP imputation via
model.x_map_imputed_ - Draws posterior samples via
model.predict(...)to get:mcmc_imputed: posterior mean imputation (from samples)intervals: prediction intervals (controlled byalpha)
- Reports metrics such as RMSE on missing entries
Project structure
missbgm/
datasets/ # synthetic simulators + real dataset staging / preprocessing
models/ # BGM base model + MissBGM
utils/ # masking, baselines, metrics, prediction intervals
configs/ # YAML configs for experiments
main.py # experiment entrypoint (simulation + real benchmarks)
Citation
If you use MissBGM in your research, please cite our paper (arXiv link coming soon):
@misc{missbgm2026,
title = {MissBGM: Missingness-Aware Bayesian Generative Modeling for MNAR Imputation},
author = {TBD},
year = {2026},
eprint = {TBD},
archivePrefix= {arXiv},
primaryClass = {stat.ML}
}
Project details
Release history Release notifications | RSS feed
Download files
Download the file for your platform. If you're not sure which to choose, learn more about installing packages.
Source Distribution
File details
Details for the file missbgm-0.1.0.tar.gz.
File metadata
- Download URL: missbgm-0.1.0.tar.gz
- Upload date:
- Size: 24.6 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.12.11
File hashes
| Algorithm | Hash digest | |
|---|---|---|
| SHA256 |
6f6c695e6dea439c83cf4f2e9935c3e2cd7721380bea8d932b4b350492d18eed
|
|
| MD5 |
a9f13d6bda52f025e681a8ac09aa34a4
|
|
| BLAKE2b-256 |
cf2de364c54e15dd8cc0016eaa4cd869f1a52cd7c339c6a1d88bd3c493f12123
|