Skip to main content

A library for NaNs and nulls

Project description

hyperimpute

A library for NaNs and nulls

License: MIT Tests CodeQL Package Release

Dataset imputation is the process of replacing missing data with substituted values.

hyperimpute features:

  • :key: New iterative imputation method: HyperImpute.
  • :cyclone: Classic methods like MICE, MissForest, GAIN etc.
  • :fire: Pluginable architecture.

:rocket: Installation

The library can be installed from PyPI using

$ pip install hyperimpute

or from source, using

$ pip install .

:boom: Sample Usage

List available imputers

from hyperimpute.plugins.imputers import Imputers

imputers = Imputers()

imputers.list()

Impute a dataset using one of the available methods

import pandas as pd
import numpy as np
from hyperimpute.plugins.imputers import Imputers

X = pd.DataFrame([[1, 1, 1, 1], [4, 5, np.nan, np.nan], [3, 3, 9, 9], [2, 2, 2, 2]])

plugin = Imputers().get("miracle")
out = plugin.fit_transform(X.copy())
print(method, out)

Specify the baseline models for HyperImpute

import pandas as pd
import numpy as np
from hyperimpute.plugins.imputers import Imputers

X = pd.DataFrame([[1, 1, 1, 1], [4, 5, np.nan, np.nan], [3, 3, 9, 9], [2, 2, 2, 2]])

plugin = Imputers().get(
    "hyperimpute",
    optimizer="hyperband",
    classifier_seed=["logistic_regression"],
    regression_seed=["linear_regression"],
)

out = plugin.fit_transform(X.copy())
print(out)

Use an imputer with a SKLearn pipeline

import pandas as pd
import numpy as np

from sklearn.pipeline import Pipeline
from sklearn.ensemble import RandomForestRegressor

from hyperimpute.plugins.imputers import Imputers

X = pd.DataFrame([[1, 1, 1, 1], [4, 5, np.nan, np.nan], [3, 3, 9, 9], [2, 2, 2, 2]])
y = pd.Series([1, 2, 1, 2])

imputer = Imputers().get("hyperimpute")

estimator = Pipeline(
    [
        ("imputer", imputer),
        ("forest", RandomForestRegressor(random_state=0, n_estimators=100)),
    ]
)

estimator.fit(X, y)

Write a new imputation plugin

from sklearn.impute import KNNImputer
from hyperimpute.plugins.imputers import Imputers, ImputerPlugin

imputers = Imputers()

knn_imputer = "custom_knn"

class KNN(ImputerPlugin):
    def __init__(self) -> None:
        super().__init__()
        self._model = KNNImputer(n_neighbors=2, weights="uniform")

    @staticmethod
    def name():
        return knn_imputer

    @staticmethod
    def hyperparameter_space():
        return []

    def _fit(self, *args, **kwargs):
        self._model.fit(*args, **kwargs)
        return self

    def _transform(self, *args, **kwargs):
        return self._model.transform(*args, **kwargs)

imputers.add(knn_imputer, KNN)

assert imputers.get(knn_imputer) is not None

Benchmark imputation models on a dataset

from sklearn.datasets import load_iris
from hyperimpute.plugins.imputers import Imputers
from hyperimpute.utils.benchmarks import compare_models

X, y = load_iris(as_frame=True, return_X_y=True)

imputer = Imputers().get("hyperimpute")

compare_models(
    name="example",
    evaluated_model=imputer,
    X_raw=X,
    ref_methods=["ice", "missforest"],
    scenarios=["MAR"],
    miss_pct=[0.1, 0.3],
    n_iter=2,
)

📓 Tutorials

:zap: Imputation methods

The following table contains the default imputation plugins:

Strategy Description Code
HyperImpute Iterative imputer using both regression and classification methods based on linear models, trees, XGBoost, CatBoost and neural nets plugin_hyperimpute.py
Mean Replace the missing values using the mean along each column with SimpleImputer plugin_mean.py
Median Replace the missing values using the median along each column with SimpleImputer plugin_median.py
Most-frequent Replace the missing values using the most frequent value along each column with SimpleImputer plugin_most_freq.py
MissForest Iterative imputation method based on Random Forests using IterativeImputer and ExtraTreesRegressor plugin_missforest.py
ICE Iterative imputation method based on regularized linear regression using IterativeImputer and BayesianRidge plugin_ice.py
MICE Multiple imputations based on ICE using IterativeImputer and BayesianRidge plugin_mice.py
SoftImpute Low-rank matrix approximation via nuclear-norm regularization plugin_softimpute.py
EM Iterative procedure which uses other variables to impute a value (Expectation), then checks whether that is the value most likely (Maximization) - EM imputation algorithm plugin_em.py
Sinkhorn Missing Data Imputation using Optimal Transport plugin_sinkhorn.py
GAIN GAIN: Missing Data Imputation using Generative Adversarial Nets plugin_gain.py
MIRACLE MIRACLE: Causally-Aware Imputation via Learning Missing Data Mechanisms plugin_miracle.py
MIWAE MIWAE: Deep Generative Modelling and Imputation of Incomplete Data plugin_miwae.py

:hammer: Tests

Install the testing dependencies using

pip install .[testing]

The tests can be executed using

pytest -vsx

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distributions

No source distribution files available for this release.See tutorial on generating distribution archives.

Built Distributions

hyperimpute-0.1.2-py3-none-macosx_10_14_x86_64.whl (82.9 kB view details)

Uploaded Python 3 macOS 10.14+ x86-64

hyperimpute-0.1.2-py3-none-any.whl (83.7 kB view details)

Uploaded Python 3

File details

Details for the file hyperimpute-0.1.2-py3-none-macosx_10_14_x86_64.whl.

File metadata

File hashes

Hashes for hyperimpute-0.1.2-py3-none-macosx_10_14_x86_64.whl
Algorithm Hash digest
SHA256 ead730b2b3208ae382b9f7568c2866a29eb97c7d6855ff7d818464adc58e6253
MD5 d7a1686e789a6af234ab2a237e660ede
BLAKE2b-256 551fa1268db589f8e276694561589fcc91d3be50e504d02a6d2b2ec08d05ee4e

See more details on using hashes here.

File details

Details for the file hyperimpute-0.1.2-py3-none-any.whl.

File metadata

  • Download URL: hyperimpute-0.1.2-py3-none-any.whl
  • Upload date:
  • Size: 83.7 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/4.0.0 CPython/3.7.9

File hashes

Hashes for hyperimpute-0.1.2-py3-none-any.whl
Algorithm Hash digest
SHA256 728eb1b229a04f9eec5594415c299093c08065ae7768ca38f90dfc1a84375d42
MD5 8f6e086e216a185576049524a3ca488a
BLAKE2b-256 6f12d27f864ac4ce143ab3b09393b65e10e45b85be603868a0d7e330454e4090

See more details on using hashes here.

Supported by

AWS AWS Cloud computing and Security Sponsor Datadog Datadog Monitoring Fastly Fastly CDN Google Google Download Analytics Microsoft Microsoft PSF Sponsor Pingdom Pingdom Monitoring Sentry Sentry Error logging StatusPage StatusPage Status page