A library for imputation of missing data in tabular datasets with comprehensive evaluation metrics
Imputify
A Python library for evaluating and performing missing data imputation. It measures imputation quality across three dimensions: reconstruction (how close are imputed values to the truth?), distribution preservation (are statistical properties maintained?), and predictive utility (can downstream models still perform well?).
The library is fully compatible with scikit-learn's fit/transform API and provides ready-to-use imputers: KNN, statistical baselines, autoencoders (DAE, VAE), GAIN, and a decoder-only LLM fine-tuned for tabular imputation.
This library is part of my master's research proposal, so apart from scikit-learn compatibility, expect breaking changes. The API will stabilize as the research progresses.
Why missingness matters
Not all missing data is created equal. The mechanism behind missingness determines which imputation methods will work. MCAR (Missing Completely at Random) is the easy case: values disappear randomly with no pattern, like a sensor failing at random times. MAR (Missing at Random) is trickier: missingness depends on other observed values, like high earners being more likely to skip income questions. MNAR (Missing Not at Random) is the hardest: missingness depends on the missing value itself, like very sick patients being unable to complete health surveys.
Most imputation methods assume MCAR or MAR. MNAR breaks these assumptions because the data you're trying to recover is exactly what caused it to go missing. This is where I think LLMs might help: they can learn complex conditional distributions from the observed data and extrapolate patterns that simpler methods miss.
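The difference between the mechanisms is easy to see in a simulation. The sketch below (plain NumPy, not imputify's API) drops income values two ways: MCAR with a uniform 30% probability, and MNAR where high incomes are far more likely to disappear. Under MNAR, the mean of the surviving values is biased low, which is exactly what trips up naive imputers.

```python
import numpy as np

rng = np.random.default_rng(42)
income = rng.lognormal(mean=10, sigma=0.5, size=10_000)

# MCAR: every value has the same 30% chance of going missing.
mcar_mask = rng.random(income.size) < 0.3

# MNAR: the probability of missingness depends on the value itself --
# incomes above the 70th percentile are dropped far more often.
threshold = np.quantile(income, 0.7)
p_missing = np.where(income > threshold, 0.8, 0.08)
mnar_mask = rng.random(income.size) < p_missing

# MCAR leaves the observed mean roughly unchanged; MNAR biases it downward,
# so mean imputation would systematically underestimate the missing values.
print(f"true mean:          {income.mean():.0f}")
print(f"observed under MCAR: {income[~mcar_mask].mean():.0f}")
print(f"observed under MNAR: {income[~mnar_mask].mean():.0f}")
```

Here the observed-under-MNAR mean is well below the true mean, while the MCAR one is not: no amount of averaging over the observed data recovers what was dropped.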
Evaluation
A good imputation isn't just "close to the true value". Imputify measures quality from three complementary perspectives:
- **Reconstruction** (point-wise accuracy): MAE, RMSE, and NRMSE for numerical features; accuracy for categorical features.
- **Distribution** (statistical properties): Wasserstein distance, KS statistic, and KL divergence to measure how much distributions shifted, plus correlation shift to check whether relationships between variables were broken.
- **Predictive utility** (downstream impact): train a model on the original vs. the imputed data and compare the performance gap.
Predictive metrics:
- Classification: accuracy, precision, recall, F1
- Regression: R², MAE, RMSE
The overall score combines these into a single number in [0, 1]. Reconstruction and distribution errors are normalized as 1 / (1 + error), and the predictive score is 1 - |Δmetric|. The final score is the mean of the three.
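A minimal sketch of that combination (an illustration of the formula above, not imputify's internal implementation; the function name and argument names are invented for this example):

```python
import numpy as np

def overall_score(reconstruction_error, distribution_error,
                  metric_original, metric_imputed):
    """Combine the three evaluation dimensions into one score in [0, 1].

    Errors are mapped to (0, 1] via 1 / (1 + error); the predictive score
    is 1 - |delta| of the downstream metric on original vs. imputed data.
    """
    reconstruction_score = 1.0 / (1.0 + reconstruction_error)
    distribution_score = 1.0 / (1.0 + distribution_error)
    predictive_score = 1.0 - abs(metric_original - metric_imputed)
    return float(np.mean([reconstruction_score,
                          distribution_score,
                          predictive_score]))

# Perfect imputation: zero errors and no downstream gap -> score of 1.0.
print(overall_score(0.0, 0.0, 0.9, 0.9))  # 1.0
```

The 1 / (1 + error) mapping keeps unbounded errors (RMSE, Wasserstein distance) on the same [0, 1] scale as the bounded predictive gap, so a plain mean is meaningful.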
Imputers
| Imputer | Category | Description |
|---|---|---|
| `StatisticalImputer` | Baseline | Mean/median for numerical, mode for categorical |
| `KNNImputer` | Baseline | k-nearest neighbors |
| `MICEImputer` | Baseline | Multiple Imputation by Chained Equations |
| `MissForestImputer` | Baseline | Random Forest-based iterative imputation |
| `XGBoostImputer` | Baseline | XGBoost-based iterative imputation |
| `DAEImputer` | Deep Learning | Denoising AutoEncoder with swap noise |
| `VAEImputer` | Deep Learning | Variational AutoEncoder (probabilistic latent space) |
| `GAINImputer` | Deep Learning | Generative Adversarial Imputation Nets |
| `DecoderOnlyImputer` | LLM | Fine-tuned decoder-only transformer via structured JSON serialization |
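The swap noise used to train a DAE can be sketched with NumPy (an illustration of the general technique, not the library's implementation): each cell is, with some probability, replaced by the value of the same column in a randomly chosen other row, so the corrupted data keeps realistic per-column marginals while breaking within-row consistency.

```python
import numpy as np

def swap_noise(X, p=0.15, rng=None):
    """Corrupt X by replacing each cell, with probability p, by the value
    of the same column in a random donor row. A DAE is trained to undo this."""
    if rng is None:
        rng = np.random.default_rng()
    n_rows, n_cols = X.shape
    swap = rng.random(X.shape) < p                      # which cells to corrupt
    donor_rows = rng.integers(0, n_rows, size=X.shape)  # rows values come from
    cols = np.broadcast_to(np.arange(n_cols), X.shape)  # column stays the same
    X_noisy = X.copy()
    X_noisy[swap] = X[donor_rows[swap], cols[swap]]
    return X_noisy

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 5))
X_noisy = swap_noise(X, p=0.15, rng=rng)

# Roughly 15% of cells differ, and column-wise distributions are preserved
# because every swapped-in value still comes from the same column.
print(f"fraction of cells changed: {(X_noisy != X).mean():.3f}")
```

Because corrupted cells still hold plausible values for their column, the autoencoder must learn cross-feature structure (not just marginal statistics) to reconstruct the original row.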
Example
```python
from imputify.imputer import DAEImputer
from imputify.missing import introduce_missing, PatternConfig
from imputify.metrics import evaluate

# Create realistic missing data (MNAR pattern)
pattern = PatternConfig(incomplete_vars=['income'], mechanism='MNAR')
X_missing, mask = introduce_missing(X, proportion=0.3, patterns=[pattern])

# Impute
imputer = DAEImputer(hidden_dim=128, epochs=100)
X_imputed = imputer.fit_transform(X_missing)

# Evaluate across all dimensions
results = evaluate(X_original, X_imputed, mask, y=y)
print(f"Overall score: {results.overall_score:.3f}")
```
Installation
If you don't have uv installed, do yourself a favor and:
```bash
# Linux & macOS
curl -LsSf https://astral.sh/uv/install.sh | sh

# macOS (Homebrew)
brew install uv

# Windows
# Well, check their installation page: https://docs.astral.sh/uv/getting-started/installation/
```
Once installed, simply clone the repo and run `uv sync` to install dependencies:

```bash
git clone https://github.com/gabfssilva/imputify
cd imputify
uv sync
```
Requires Python 3.12+.
Open the project using your favorite IDE and that's it.
License
MIT