
HyperImpute - A library for NaNs and nulls.


HyperImpute simplifies the selection of a data imputation algorithm for your ML pipelines. It includes several novel algorithms for handling missing data and is compatible with sklearn.

HyperImpute features

  • :rocket: Fast and extensible dataset imputation algorithms, compatible with sklearn.
  • :key: New iterative imputation method: HyperImpute.
  • :cyclone: Classic methods: MICE, MissForest, GAIN, MIRACLE, MIWAE, Sinkhorn, SoftImpute, etc.
  • :fire: Pluginable architecture.

:rocket: Installation

The library can be installed from PyPI using

$ pip install hyperimpute

or from source, using

$ pip install .

:boom: Sample Usage

List available imputers

from hyperimpute.plugins.imputers import Imputers

imputers = Imputers()

imputers.list()

Impute a dataset using one of the available methods

import pandas as pd
import numpy as np
from hyperimpute.plugins.imputers import Imputers

X = pd.DataFrame([[1, 1, 1, 1], [4, 5, np.nan, np.nan], [3, 3, 9, 9], [2, 2, 2, 2]])

method = "gain"

plugin = Imputers().get(method)
out = plugin.fit_transform(X.copy())

print(method, out)
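
The same pattern works for any registered method. A minimal sketch (the method names below are assumed from the plugin table later on this page and may differ by version):

import pandas as pd
import numpy as np
from hyperimpute.plugins.imputers import Imputers

X = pd.DataFrame([[1, 1, 1, 1], [4, 5, np.nan, np.nan], [3, 3, 9, 9], [2, 2, 2, 2]])

# Illustrative only: run a few of the registered plugins on the same toy dataset.
for method in ["mean", "ice", "softimpute"]:
    out = Imputers().get(method).fit_transform(X.copy())
    print(method, out)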

Specify the baseline models for HyperImpute

import pandas as pd
import numpy as np
from hyperimpute.plugins.imputers import Imputers

X = pd.DataFrame([[1, 1, 1, 1], [4, 5, np.nan, np.nan], [3, 3, 9, 9], [2, 2, 2, 2]])

plugin = Imputers().get(
    "hyperimpute",
    optimizer="hyperband",
    classifier_seed=["logistic_regression"],
    regression_seed=["linear_regression"],
)

out = plugin.fit_transform(X.copy())
print(out)

Use an imputer with a SKLearn pipeline

import pandas as pd
import numpy as np

from sklearn.pipeline import Pipeline
from sklearn.ensemble import RandomForestRegressor

from hyperimpute.plugins.imputers import Imputers

X = pd.DataFrame([[1, 1, 1, 1], [4, 5, np.nan, np.nan], [3, 3, 9, 9], [2, 2, 2, 2]])
y = pd.Series([1, 2, 1, 2])

imputer = Imputers().get("hyperimpute")

estimator = Pipeline(
    [
        ("imputer", imputer),
        ("forest", RandomForestRegressor(random_state=0, n_estimators=100)),
    ]
)

estimator.fit(X, y)
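
Once fitted, the pipeline behaves like any other sklearn estimator, so predictions run the imputer before the forest:

# Impute X with HyperImpute, then predict with the fitted random forest.
preds = estimator.predict(X)
print(preds)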

Write a new imputation plugin

from sklearn.impute import KNNImputer
from hyperimpute.plugins.imputers import Imputers, ImputerPlugin

imputers = Imputers()

knn_imputer = "custom_knn"

class KNN(ImputerPlugin):
    def __init__(self) -> None:
        super().__init__()
        self._model = KNNImputer(n_neighbors=2, weights="uniform")

    @staticmethod
    def name():
        return knn_imputer

    @staticmethod
    def hyperparameter_space():
        return []

    def _fit(self, *args, **kwargs):
        self._model.fit(*args, **kwargs)
        return self

    def _transform(self, *args, **kwargs):
        return self._model.transform(*args, **kwargs)

imputers.add(knn_imputer, KNN)

assert imputers.get(knn_imputer) is not None
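
The new plugin can then be used exactly like the built-in ones. A minimal sketch, reusing the toy dataset from the earlier examples:

import pandas as pd
import numpy as np

X = pd.DataFrame([[1, 1, 1, 1], [4, 5, np.nan, np.nan], [3, 3, 9, 9], [2, 2, 2, 2]])

# Retrieve the registered custom plugin and apply it like any other imputer.
custom = imputers.get(knn_imputer)
print(custom.fit_transform(X.copy()))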

Benchmark imputation models on a dataset

from sklearn.datasets import load_iris
from hyperimpute.plugins.imputers import Imputers
from hyperimpute.utils.benchmarks import compare_models

X, y = load_iris(as_frame=True, return_X_y=True)

imputer = Imputers().get("hyperimpute")

compare_models(
    name="example",
    evaluated_model=imputer,
    X_raw=X,
    ref_methods=["ice", "missforest"],
    scenarios=["MAR"],
    miss_pct=[0.1, 0.3],
    n_iter=2,
)

📓 Tutorials

:zap: Imputation methods

The following table contains the default imputation plugins:

| Strategy | Description | Code |
| --- | --- | --- |
| HyperImpute | Iterative imputer using both regression and classification methods based on linear models, trees, XGBoost, CatBoost and neural nets | plugin_hyperimpute.py |
| Mean | Replace the missing values using the mean along each column, with SimpleImputer | plugin_mean.py |
| Median | Replace the missing values using the median along each column, with SimpleImputer | plugin_median.py |
| Most-frequent | Replace the missing values using the most frequent value along each column, with SimpleImputer | plugin_most_freq.py |
| MissForest | Iterative imputation method based on Random Forests, using IterativeImputer and ExtraTreesRegressor | plugin_missforest.py |
| ICE | Iterative imputation method based on regularized linear regression, using IterativeImputer and BayesianRidge | plugin_ice.py |
| MICE | Multiple imputations based on ICE, using IterativeImputer and BayesianRidge | plugin_mice.py |
| SoftImpute | Low-rank matrix approximation via nuclear-norm regularization | plugin_softimpute.py |
| EM | Iterative procedure which uses other variables to impute a value (Expectation), then checks whether that value is the most likely one (Maximization); EM imputation algorithm | plugin_em.py |
| Sinkhorn | Missing Data Imputation using Optimal Transport | plugin_sinkhorn.py |
| GAIN | GAIN: Missing Data Imputation using Generative Adversarial Nets | plugin_gain.py |
| MIRACLE | MIRACLE: Causally-Aware Imputation via Learning Missing Data Mechanisms | plugin_miracle.py |
| MIWAE | MIWAE: Deep Generative Modelling and Imputation of Incomplete Data | plugin_miwae.py |
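
All of the strategies above are registered by default, so their identifiers appear in Imputers().list(). For example, a minimal sketch that prints each registered plugin name (the exact identifiers may vary by version):

from hyperimpute.plugins.imputers import Imputers

# Print every registered imputation plugin; the default set corresponds to
# the strategies in the table above.
for name in Imputers().list():
    print(name)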

:hammer: Tests

Install the testing dependencies using

pip install .[testing]

The tests can be executed using

pytest -vsx

Citing

If you use this code, please cite the associated paper:

@inproceedings{Jarrett2022HyperImpute,
  title     = {HyperImpute: Generalized Iterative Imputation with Automatic Model Selection},
  author    = {Jarrett, Daniel and Cebere, Bogdan and Liu, Tennison and Curth, Alicia and van der Schaar, Mihaela},
  booktitle = {39th International Conference on Machine Learning},
  year      = {2022},
  doi       = {10.48550/ARXIV.2206.07769},
  url       = {https://arxiv.org/abs/2206.07769},
  keywords  = {Machine Learning (stat.ML), Machine Learning (cs.LG), FOS: Computer and information sciences},
}

