
A library for imputation of missing data in tabular datasets with comprehensive evaluation metrics


Imputify

A Python library for evaluating and performing missing data imputation. It measures imputation quality across three dimensions: reconstruction (how close are imputed values to the truth?), distribution preservation (are statistical properties maintained?), and predictive utility (can downstream models still perform well?).

The library is fully compatible with scikit-learn's fit/transform API and provides ready-to-use imputers: KNN, statistical baselines, autoencoders (DAE, VAE), GAIN, and a decoder-only LLM fine-tuned for tabular imputation.
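That fit/transform contract is what lets an imputer drop into standard scikit-learn tooling such as pipelines. As a minimal sketch of the contract itself, here is a toy mean imputer (a hypothetical stand-in, not an imputify class) wired into a `Pipeline`:

```python
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

class ToyMeanImputer(BaseEstimator, TransformerMixin):
    """Illustrates the fit/transform contract that imputify's imputers follow."""

    def fit(self, X, y=None):
        # Learn one statistic per column, ignoring NaNs.
        self.means_ = np.nanmean(X, axis=0)
        return self

    def transform(self, X):
        # Fill each NaN with its column's learned mean.
        X = np.asarray(X, dtype=float).copy()
        rows, cols = np.where(np.isnan(X))
        X[rows, cols] = np.take(self.means_, cols)
        return X

X = np.array([[1.0, np.nan], [3.0, 4.0], [np.nan, 8.0]])
pipe = Pipeline([("impute", ToyMeanImputer()), ("scale", StandardScaler())])
X_out = pipe.fit_transform(X)
```

Because imputify's imputers honor the same contract, swapping `ToyMeanImputer` for, say, a `DAEImputer` is a one-line change.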

This library is part of my master's research proposal, so apart from scikit-learn compatibility, expect breaking changes. The API will stabilize as the research progresses.


Why missingness matters

Not all missing data is created equal: the mechanism behind the missingness determines which imputation methods will work. MCAR (Missing Completely at Random) is the easy case: values disappear randomly with no pattern, like a sensor failing at random times. MAR (Missing at Random) is trickier: missingness depends on other observed values, like high earners being more likely to skip income questions. MNAR (Missing Not at Random) is the hardest: missingness depends on the missing value itself, like very sick patients being unable to complete health surveys.

Most imputation methods assume MCAR or MAR. MNAR breaks these assumptions because the data you're trying to recover is exactly what's causing it to be missing. This is where I think LLMs might help: they can learn complex conditional distributions from the observed data and extrapolate patterns that simpler methods miss.
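To make the three mechanisms concrete, here is a standalone numpy sketch (not imputify's API) that generates each kind of missingness mask for a toy income column, and shows the bias MNAR introduces:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 10_000
income = rng.lognormal(mean=10, sigma=1, size=n)

# MCAR: every cell has the same 30% chance of going missing.
mcar_mask = rng.random(n) < 0.3

# MAR: missingness depends on an *observed* covariate (age here),
# not on income itself.
age = rng.integers(18, 80, size=n)
mar_mask = rng.random(n) < np.where(age > 60, 0.5, 0.1)

# MNAR: missingness depends on the value that goes missing --
# high incomes are more likely to be unreported.
threshold = np.quantile(income, 0.8)
mnar_mask = rng.random(n) < np.where(income > threshold, 0.6, 0.1)

# Under MNAR the observed mean is biased downward, because the
# largest values are preferentially removed.
observed_mean = income[~mnar_mask].mean()
```

A mean imputer fit on the MNAR-observed data would inherit exactly that downward bias, which is why the mechanism matters when choosing a method.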

Evaluation

A good imputation isn't just "close to the true value". Imputify measures quality from three complementary perspectives:

Reconstruction (point-wise accuracy): MAE, RMSE, and NRMSE for numerical features; accuracy for categorical features.

Distribution (statistical properties): Wasserstein distance, KS statistic, and KL divergence (how much the distributions shifted), plus correlation shift (did we break relationships between variables?).

Predictive utility (downstream impact): train a model on the original vs. the imputed data and compare the performance gap.

Predictive metrics:

Classification: accuracy, precision, recall, F1

Regression: R², MAE, RMSE
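Reconstruction metrics only make sense on the cells that were actually missing. A standalone sketch of that computation (the boolean-mask convention here is an assumption, not necessarily imputify's internal one):

```python
import numpy as np

# Ground truth, imputed result, and a mask marking which cells were missing.
X_true = np.array([[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]])
X_imp  = np.array([[1.0, 2.5], [3.0, 4.0], [4.0, 6.0]])
mask   = np.array([[False, True], [False, False], [True, False]])  # True = was missing

# Restrict the comparison to the imputed cells only.
diff = X_true[mask] - X_imp[mask]
mae   = np.abs(diff).mean()
rmse  = np.sqrt((diff ** 2).mean())
nrmse = rmse / (X_true[mask].max() - X_true[mask].min())  # range-normalized
```

Computing the metrics over all cells instead would dilute the error with the observed (untouched) values and make every imputer look better than it is.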

The overall score combines these into a single number in [0, 1]. Reconstruction and distribution errors are normalized as 1 / (1 + error); the predictive score is 1 - |Δmetric|. The final score is the mean of the three.
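The aggregation described above can be sketched as follows (a hypothetical helper mirroring the formulas in the text, not imputify's actual `evaluate` internals):

```python
import numpy as np

def overall_score(recon_error: float, dist_error: float, pred_gap: float) -> float:
    """Combine the three dimensions into one [0, 1] score.

    recon_error, dist_error: non-negative errors (0 = perfect).
    pred_gap: original-vs-imputed performance gap, assumed in [-1, 1].
    """
    recon = 1.0 / (1.0 + recon_error)   # error 0 -> score 1; grows -> score -> 0
    dist = 1.0 / (1.0 + dist_error)
    pred = 1.0 - abs(pred_gap)          # no performance gap -> score 1
    return float(np.mean([recon, dist, pred]))

perfect = overall_score(0.0, 0.0, 0.0)  # -> 1.0
```

The 1 / (1 + error) squashing keeps unbounded errors (RMSE, Wasserstein distance) on the same [0, 1] scale as the bounded predictive gap, so the unweighted mean is well defined.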

Imputers

| Imputer | Category | Description |
| --- | --- | --- |
| StatisticalImputer | Baseline | Mean/median for numerical, mode for categorical |
| KNNImputer | Baseline | k-nearest neighbors |
| MICEImputer | Baseline | Multiple Imputation by Chained Equations |
| MissForestImputer | Baseline | Random Forest-based iterative imputation |
| XGBoostImputer | Baseline | XGBoost-based iterative imputation |
| DAEImputer | Deep Learning | Denoising AutoEncoder with swap noise |
| VAEImputer | Deep Learning | Variational AutoEncoder (probabilistic latent space) |
| GAINImputer | Deep Learning | Generative Adversarial Imputation Nets |
| DecoderOnlyImputer | LLM | Fine-tuned decoder-only transformer via structured JSON serialization |

Example

import pandas as pd

from imputify.imputer import DAEImputer
from imputify.missing import introduce_missing, PatternConfig
from imputify.metrics import evaluate

# Any tabular dataset with a prediction target works here
X_original = pd.read_csv("data.csv")
y = X_original.pop("target")

# Create realistic missing data (MNAR pattern)
pattern = PatternConfig(incomplete_vars=['income'], mechanism='MNAR')
X_missing, mask = introduce_missing(X_original, proportion=0.3, patterns=[pattern])

# Impute
imputer = DAEImputer(hidden_dim=128, epochs=100)
X_imputed = imputer.fit_transform(X_missing)

# Evaluate across all three dimensions
results = evaluate(X_original, X_imputed, mask, y=y)
print(f"Overall score: {results.overall_score:.3f}")

Installation

If you don't have uv installed, do yourself a favor and install it:

# Linux & macOS
curl -LsSf https://astral.sh/uv/install.sh | sh

# macOS (Homebrew)
brew install uv

# Windows
# Well, check their installation page: https://docs.astral.sh/uv/getting-started/installation/

Once installed, simply clone the repo and run uv sync to install dependencies:

git clone https://github.com/gabfssilva/imputify
cd imputify
uv sync

Requires Python 3.12+.

Open the project using your favorite IDE and that's it.

License

MIT



