A library for imputation of missing data in tabular datasets with comprehensive evaluation metrics
Imputify
A Python library for evaluating and performing missing data imputation. It measures imputation quality across three dimensions: reconstruction (how close are imputed values to the truth?), distribution preservation (are statistical properties maintained?), and predictive utility (can downstream models still perform well?).
The library is fully compatible with scikit-learn's fit/transform API and provides ready-to-use imputers: KNN, statistical baselines, autoencoders (DAE, VAE), GAIN, and a decoder-only LLM fine-tuned for tabular imputation.
This library is part of my master's research proposal, so apart from scikit-learn compatibility, expect breaking changes. The API will stabilize as the research progresses.
Why missingness matters
Not all missing data is created equal. The mechanism behind missingness determines which imputation methods will work. MCAR (Missing Completely at Random) is the easy case: values disappear randomly with no pattern, like a sensor failing at random times. MAR (Missing at Random) is trickier: missingness depends on other observed values, like high earners being more likely to skip income questions. MNAR (Missing Not at Random) is the hardest: missingness depends on the missing value itself, like very sick patients being unable to complete health surveys.
Most imputation methods assume MCAR or MAR. MNAR breaks these assumptions because the data you're trying to recover is exactly what caused it to go missing. This is where I think LLMs might help: they can learn complex conditional distributions from the observed data and extrapolate patterns that simpler methods miss.
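The difference between the mechanisms is easy to see in a simulation. The sketch below (plain NumPy, not imputify's API) drops income values two ways: MCAR with a uniform 30% probability, and MNAR where high incomes are far more likely to disappear. Under MNAR, the mean of the surviving values is biased low, which is exactly what trips up naive imputers.

```python
import numpy as np

rng = np.random.default_rng(42)
income = rng.lognormal(mean=10, sigma=0.5, size=10_000)

# MCAR: every value has the same 30% chance of going missing.
mcar_mask = rng.random(income.size) < 0.3

# MNAR: the probability of missingness depends on the value itself --
# incomes above the 70th percentile are dropped far more often.
threshold = np.quantile(income, 0.7)
p_missing = np.where(income > threshold, 0.8, 0.08)
mnar_mask = rng.random(income.size) < p_missing

# MCAR leaves the observed mean roughly unchanged; MNAR biases it downward,
# so mean imputation would systematically underestimate the missing values.
print(f"true mean:          {income.mean():.0f}")
print(f"observed under MCAR: {income[~mcar_mask].mean():.0f}")
print(f"observed under MNAR: {income[~mnar_mask].mean():.0f}")
```

Here the observed-under-MNAR mean is well below the true mean, while the MCAR one is not: no amount of averaging over the observed data recovers what was dropped.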
Evaluation
A good imputation isn't just "close to the true value". Imputify measures quality from three complementary perspectives:
- **Reconstruction** (point-wise accuracy): MAE, RMSE, and NRMSE for numerical features; accuracy for categorical features.
- **Distribution** (statistical properties): Wasserstein distance, KS statistic, and KL divergence to measure how much distributions shifted, plus correlation shift to check whether relationships between variables were broken.
- **Predictive utility** (downstream impact): train a model on the original vs. the imputed data and compare the performance gap.
Predictive metrics:
- Classification: accuracy, precision, recall, F1
- Regression: R², MAE, RMSE
The overall score combines these into a single number in [0, 1]. Reconstruction and distribution errors are normalized as 1 / (1 + error), and the predictive score is 1 - |Δmetric|. The final score is the mean of the three.
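A minimal sketch of that combination (an illustration of the formula above, not imputify's internal implementation; the function name and argument names are invented for this example):

```python
import numpy as np

def overall_score(reconstruction_error, distribution_error,
                  metric_original, metric_imputed):
    """Combine the three evaluation dimensions into one score in [0, 1].

    Errors are mapped to (0, 1] via 1 / (1 + error); the predictive score
    is 1 - |delta| of the downstream metric on original vs. imputed data.
    """
    reconstruction_score = 1.0 / (1.0 + reconstruction_error)
    distribution_score = 1.0 / (1.0 + distribution_error)
    predictive_score = 1.0 - abs(metric_original - metric_imputed)
    return float(np.mean([reconstruction_score,
                          distribution_score,
                          predictive_score]))

# Perfect imputation: zero errors and no downstream gap -> score of 1.0.
print(overall_score(0.0, 0.0, 0.9, 0.9))  # 1.0
```

The 1 / (1 + error) mapping keeps unbounded errors (RMSE, Wasserstein distance) on the same [0, 1] scale as the bounded predictive gap, so a plain mean is meaningful.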
Imputers
| Imputer | Category | Description |
|---|---|---|
| `StatisticalImputer` | Baseline | Mean/median for numerical, mode for categorical |
| `KNNImputer` | Baseline | k-nearest neighbors |
| `MICEImputer` | Baseline | Multiple Imputation by Chained Equations |
| `MissForestImputer` | Baseline | Random Forest-based iterative imputation |
| `XGBoostImputer` | Baseline | XGBoost-based iterative imputation |
| `DAEImputer` | Deep Learning | Denoising AutoEncoder with swap noise |
| `VAEImputer` | Deep Learning | Variational AutoEncoder (probabilistic latent space) |
| `GAINImputer` | Deep Learning | Generative Adversarial Imputation Nets |
| `DecoderOnlyImputer` | LLM | Fine-tuned decoder-only transformer via structured JSON serialization |
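The swap noise used to train a DAE can be sketched with NumPy (an illustration of the general technique, not the library's implementation): each cell is, with some probability, replaced by the value of the same column in a randomly chosen other row, so the corrupted data keeps realistic per-column marginals while breaking within-row consistency.

```python
import numpy as np

def swap_noise(X, p=0.15, rng=None):
    """Corrupt X by replacing each cell, with probability p, by the value
    of the same column in a random donor row. A DAE is trained to undo this."""
    if rng is None:
        rng = np.random.default_rng()
    n_rows, n_cols = X.shape
    swap = rng.random(X.shape) < p                      # which cells to corrupt
    donor_rows = rng.integers(0, n_rows, size=X.shape)  # rows values come from
    cols = np.broadcast_to(np.arange(n_cols), X.shape)  # column stays the same
    X_noisy = X.copy()
    X_noisy[swap] = X[donor_rows[swap], cols[swap]]
    return X_noisy

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 5))
X_noisy = swap_noise(X, p=0.15, rng=rng)

# Roughly 15% of cells differ, and column-wise distributions are preserved
# because every swapped-in value still comes from the same column.
print(f"fraction of cells changed: {(X_noisy != X).mean():.3f}")
```

Because corrupted cells still hold plausible values for their column, the autoencoder must learn cross-feature structure (not just marginal statistics) to reconstruct the original row.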
Example
```python
from imputify.imputer import DAEImputer
from imputify.missing import introduce_missing, PatternConfig
from imputify.metrics import evaluate

# Create realistic missing data (MNAR pattern)
pattern = PatternConfig(incomplete_vars=['income'], mechanism='MNAR')
X_missing, mask = introduce_missing(X, proportion=0.3, patterns=[pattern])

# Impute
imputer = DAEImputer(hidden_dim=128, epochs=100)
X_imputed = imputer.fit_transform(X_missing)

# Evaluate across all dimensions
results = evaluate(X_original, X_imputed, mask, y=y)
print(f"Overall score: {results.overall_score:.3f}")
```
Installation
If you don't have uv installed, do yourself a favor and:
```bash
# Linux & macOS
curl -LsSf https://astral.sh/uv/install.sh | sh

# macOS (Homebrew)
brew install uv

# Windows
# Well, check their installation page: https://docs.astral.sh/uv/getting-started/installation/
```
Once installed, simply clone the repo and run `uv sync` to install dependencies:

```bash
git clone https://github.com/gabfssilva/imputify
cd imputify
uv sync
```
Requires Python 3.12+.
Open the project using your favorite IDE and that's it.
License
MIT