Missingness-aware data imputation with Bayesian generative modeling (MissBGM).

MissBGM

Missingness-aware data imputation with Bayesian generative modeling and uncertainty quantification.

MissBGM is a missingness-aware Bayesian generative model for imputing data with non-ignorable missingness (Missing Not At Random, MNAR). It jointly models the data-generating process and the missingness mechanism, enabling both accurate imputations and principled uncertainty quantification via posterior sampling.
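To see why non-ignorable missingness needs special treatment, the following sketch simulates a simple self-masking mechanism with NumPy. It is not part of MissBGM: the function name `simulate_self_masking`, the logistic missingness rule, and all parameter values are assumptions made for illustration. It shows that when larger values are more likely to be hidden, the mean of the observed values is biased downward.

```python
import numpy as np

def simulate_self_masking(n=10_000, strength=2.0, seed=0):
    """Draw x ~ N(0, 1) and hide each value with probability sigmoid(strength * x)."""
    rng = np.random.default_rng(seed)
    x = rng.normal(size=n)
    # Hypothetical MNAR rule: the chance of being missing depends on the value itself.
    p_missing = 1.0 / (1.0 + np.exp(-strength * x))
    observed = rng.random(n) >= p_missing  # True where the value is observed
    return x, observed

x, observed = simulate_self_masking()
true_mean = x.mean()
observed_mean = x[observed].mean()
print(f"true mean: {true_mean:.3f}, mean of observed values: {observed_mean:.3f}")
```

Any imputer that fills gaps using only the observed distribution inherits this bias, which is the motivation for modeling the missingness mechanism jointly with the data.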

Highlights

Joint modeling of the data-generating process and the missingness mechanism
MAP imputation for fast point estimates
Posterior sampling (MCMC) for uncertainty quantification and prediction intervals
Bridging Bayesian deep learning and missing data imputation

Model overview

Figure 1: MissBGM model overview. If the PDF does not render inline on GitHub, click the figure to open fig1.pdf.

Installation

Create a conda environment

conda create -n missbgm python=3.12 -y
conda activate missbgm

Install from source

git clone https://github.com/liuq-lab/MissBGM.git && cd MissBGM
pip install -e .

Install from PyPI

pip install missbgm

Dependencies

This project is tested with:

Python 3.12
TensorFlow 2.18.0
TensorFlow Probability 0.25.0
tf-keras 2.18.0
numpy, pandas, pyyaml, scikit-learn
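To compare an existing environment against the tested versions, a small check along these lines may help. It is a sketch using only the standard library; the `TESTED` table and the `check_versions` helper are illustrative names, not part of MissBGM, and the keys are assumed to be the PyPI distribution names.

```python
from importlib.metadata import PackageNotFoundError, version

# Versions listed as tested in this README; keys are PyPI distribution names.
TESTED = {
    "tensorflow": "2.18.0",
    "tensorflow-probability": "0.25.0",
    "tf-keras": "2.18.0",
}

def check_versions(tested=TESTED):
    """Return {package: (installed_or_None, tested)} without importing the packages."""
    report = {}
    for name, wanted in tested.items():
        try:
            installed = version(name)
        except PackageNotFoundError:
            installed = None  # package is not installed in this environment
        report[name] = (installed, wanted)
    return report

for name, (installed, wanted) in check_versions().items():
    status = "ok" if installed == wanted else "differs"
    print(f"{name}: installed={installed}, tested={wanted} ({status})")
```

A differing version does not necessarily mean the package will fail; it only flags a combination that the README does not list as tested.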

Quickstart (Python API)

Below is a minimal end-to-end example that mirrors the logic in main.py: simulate MNAR data, train MissBGM, read MAP imputations, and then draw posterior samples to produce uncertainty intervals.

import yaml

from missbgm.models import MissBGM
from missbgm.datasets import simulate_mnar_oracle_data

# Load experiment configuration (the context manager closes the file afterwards)
with open("configs/Sim_MNAR_oracle.yaml", "r") as f:
    params = yaml.safe_load(f)

# Simulate MNAR-oracle data (synthetic benchmark)
data = simulate_mnar_oracle_data(
    n_samples=500,
    x_dim=50,
    n_anchor=5,
    missing_rate=0.5,
    alpha=0.05,
    seed=123,
)

# Instantiate the MissBGM model with the configuration
model = MissBGM(params, random_seed=42)

# Train the MissBGM model
model.fit(data=data["x_obs"], mask=data["mask"], x_true=data["x_full"], verbose=1)

# Get the MAP imputation
map_imputed = model.x_map_imputed_

# Make posterior predictions with uncertainty quantification
mcmc_imputed, intervals = model.predict(
    data=data["x_obs"],
    mask=data["mask"],
    x_true=data["x_full"],
    alpha=0.05,
    n_mcmc=1000,
    burn_in=1000,
    step_size=0.1,
    num_leapfrog_steps=5,
    seed=42,
    verbose=1,
)

Reproducing experiments with `main.py`

main.py is the primary experiment entrypoint. It reads a YAML config (-c) and runs:

A synthetic simulation when the config sets dataset: Sim_MNAR_oracle
A real-data benchmark otherwise (the code normalizes the data, trains the model, and evaluates RMSE on missing entries)
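The metric mentioned above, RMSE restricted to missing entries, can be sketched as follows. This is an illustration rather than the code in main.py: the helper name `rmse_on_missing` is hypothetical, and the mask convention (1 = observed, 0 = missing) is an assumption that should be checked against the package before reuse.

```python
import numpy as np

def rmse_on_missing(x_true, x_imputed, mask):
    """RMSE over entries where mask == 0 (assumed convention: 1 = observed, 0 = missing)."""
    missing = np.asarray(mask) == 0
    if not missing.any():
        raise ValueError("mask marks no entries as missing")
    x_true = np.asarray(x_true, dtype=float)
    x_imputed = np.asarray(x_imputed, dtype=float)
    diff = x_true[missing] - x_imputed[missing]
    return float(np.sqrt(np.mean(diff ** 2)))

x_true = np.array([[1.0, 2.0], [3.0, 4.0]])
mask = np.array([[1, 0], [0, 1]])
x_imputed = np.array([[1.0, 2.5], [2.5, 4.0]])
print(rmse_on_missing(x_true, x_imputed, mask))  # 0.5
```

Restricting the metric to missing entries keeps the observed values, which any imputer can copy through unchanged, from diluting the error.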

1) Synthetic MNAR simulation

Use configs/Sim_MNAR_oracle.yaml:

python main.py -c configs/Sim_MNAR_oracle.yaml

2) Real-data benchmarks (4 datasets)

The following configs are provided:

configs/Real_Wine.yaml (UCI Wine)
configs/Real_Concrete.yaml (UCI Concrete)
configs/Real_Breast.yaml (UCI Breast Cancer Wisconsin Original)
configs/Real_Gisette.yaml (UCI Gisette, high-dimensional)

Run each dataset:

python main.py -c configs/Real_Wine.yaml
python main.py -c configs/Real_Concrete.yaml
python main.py -c configs/Real_Breast.yaml
python main.py -c configs/Real_Gisette.yaml

To re-run even when cached staged files or outputs already exist, add --force:

python main.py -c configs/Real_Wine.yaml --force
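To run all four real-data benchmarks in one go, the commands above can be generated from a small Python driver. This is a convenience sketch, not a script shipped with the repository: `build_commands` and `run_all` are hypothetical helpers, and the only flags used are the `-c` and `--force` options shown above.

```python
import subprocess
import sys

REAL_CONFIGS = [
    "configs/Real_Wine.yaml",
    "configs/Real_Concrete.yaml",
    "configs/Real_Breast.yaml",
    "configs/Real_Gisette.yaml",
]

def build_commands(configs=REAL_CONFIGS, force=False):
    """Return one argv list per config, mirroring `python main.py -c <config>`."""
    extra = ["--force"] if force else []
    return [[sys.executable, "main.py", "-c", cfg, *extra] for cfg in configs]

def run_all(force=False):
    """Run each benchmark in turn; check=True stops at the first failing run."""
    for cmd in build_commands(force=force):
        subprocess.run(cmd, check=True)

# Print the commands without executing them (a dry run).
for cmd in build_commands(force=True):
    print(" ".join(cmd))
```

Calling `run_all()` from the repository root executes the benchmarks sequentially; using `sys.executable` keeps the runs inside the currently active environment.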

What `main.py` does (pipeline summary)

For each run:

1. Loads the config from YAML (e.g., configs/Real_Wine.yaml)
2. Prepares the data
   - Synthetic: calls simulate_mnar_oracle_data(...)
   - Real: calls prepare_real_benchmark_data(dataset_name, missing_rate=..., seed=..., force=...)
3. Trains the model via MissBGM.fit(...)
4. Reads the MAP imputation from model.x_map_imputed_
5. Draws posterior samples via model.predict(...) to get:
   - mcmc_imputed: posterior mean imputation (from samples)
   - intervals: prediction intervals (controlled by alpha)
6. Reports metrics such as RMSE on missing entries
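The relationship between posterior draws, the posterior mean, and alpha-controlled intervals can be sketched with NumPy. This is a generic illustration using equal-tailed quantile intervals; the way model.predict(...) actually builds its intervals may differ, and the `summarize_posterior` helper, the array shapes, and the synthetic draws are all assumptions.

```python
import numpy as np

def summarize_posterior(samples, alpha=0.05):
    """Summarize draws of shape (n_draws, n_rows, n_cols).

    Returns the per-entry posterior mean and the equal-tailed (1 - alpha) interval.
    """
    samples = np.asarray(samples, dtype=float)
    mean = samples.mean(axis=0)
    lower, upper = np.quantile(samples, [alpha / 2, 1 - alpha / 2], axis=0)
    return mean, lower, upper

rng = np.random.default_rng(0)
draws = rng.normal(loc=1.0, scale=2.0, size=(1000, 3, 4))  # stand-in for MCMC draws
mean, lower, upper = summarize_posterior(draws, alpha=0.05)
print(mean.shape, lower.shape, upper.shape)
```

A smaller alpha widens the interval: alpha=0.05 gives a 95% interval, while alpha=0.5 gives the narrower 50% interval.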

Project structure

missbgm/
  datasets/        # synthetic simulators + real dataset staging / preprocessing
  models/          # BGM base model + MissBGM
  utils/           # masking, baselines, metrics, prediction intervals
configs/           # YAML configs for experiments
main.py            # experiment entrypoint (simulation + real benchmarks)

Citation

If you use MissBGM in your research, please cite our paper (arXiv link coming soon):

@misc{missbgm2026,
  title        = {MissBGM: Missingness-Aware Bayesian Generative Modeling for MNAR Imputation},
  author       = {TBD},
  year         = {2026},
  eprint       = {TBD},
  archivePrefix= {arXiv},
  primaryClass = {stat.ML}
}
