
Project description

hyperimpute

A library for NaNs and nulls

License: MIT · Tests · CodeQL · Package Release

Dataset imputation is the process of replacing missing data with substituted values.
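
For example, a minimal mean-imputation sketch on a toy DataFrame (illustrative only; it does not use hyperimpute):

import numpy as np
import pandas as pd

# Toy data with a missing entry (hypothetical example).
X = pd.DataFrame({"age": [25, np.nan, 40], "income": [30.0, 42.0, 55.0]})

# Mean imputation: replace each NaN with the mean of its column.
print(X.fillna(X.mean()))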

hyperimpute features:

  • :key: New iterative imputation method: HyperImpute.
  • :cyclone: Classic methods like MICE, MissForest, GAIN etc.
  • :fire: Pluggable architecture.

:rocket: Installation

The library can be installed from PyPI using

$ pip install hyperimpute

or from a local source checkout using

$ pip install .

:boom: Sample Usage

List available imputers

from hyperimpute.plugins.imputers import Imputers

imputers = Imputers()

print(imputers.list())

Impute a dataset using one of the available methods

import pandas as pd
import numpy as np
from hyperimpute.plugins.imputers import Imputers

X = pd.DataFrame([[1, 1, 1, 1], [4, 5, np.nan, np.nan], [3, 3, 9, 9], [2, 2, 2, 2]])

plugin = Imputers().get("miracle")
out = plugin.fit_transform(X.copy())
print(out)

Specify the baseline models for HyperImpute

import pandas as pd
import numpy as np
from hyperimpute.plugins.imputers import Imputers

X = pd.DataFrame([[1, 1, 1, 1], [4, 5, np.nan, np.nan], [3, 3, 9, 9], [2, 2, 2, 2]])

plugin = Imputers().get(
    "hyperimpute",
    optimizer="hyperband",
    classifier_seed=["logistic_regression"],
    regression_seed=["linear_regression"],
)

out = plugin.fit_transform(X.copy())
print(out)

Use an imputer with a scikit-learn pipeline

import pandas as pd
import numpy as np

from sklearn.pipeline import Pipeline
from sklearn.ensemble import RandomForestRegressor

from hyperimpute.plugins.imputers import Imputers

X = pd.DataFrame([[1, 1, 1, 1], [4, 5, np.nan, np.nan], [3, 3, 9, 9], [2, 2, 2, 2]])
y = pd.Series([1, 2, 1, 2])

imputer = Imputers().get("hyperimpute")

estimator = Pipeline(
    [
        ("imputer", imputer),
        ("forest", RandomForestRegressor(random_state=0, n_estimators=100)),
    ]
)

estimator.fit(X, y)

Write a new imputation plugin

from sklearn.impute import KNNImputer
from hyperimpute.plugins.imputers import Imputers, ImputerPlugin

imputers = Imputers()

# Name under which the new plugin will be registered.
knn_imputer = "custom_knn"

class KNN(ImputerPlugin):
    def __init__(self) -> None:
        super().__init__()
        # Wrap an existing scikit-learn imputer as the underlying model.
        self._model = KNNImputer(n_neighbors=2, weights="uniform")

    @staticmethod
    def name():
        return knn_imputer

    @staticmethod
    def hyperparameter_space():
        return []

    def _fit(self, *args, **kwargs):
        self._model.fit(*args, **kwargs)
        return self

    def _transform(self, *args, **kwargs):
        return self._model.transform(*args, **kwargs)

# Register the plugin so it can be retrieved by name.
imputers.add(knn_imputer, KNN)

assert imputers.get(knn_imputer) is not None
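
Continuing the snippet above, the newly registered plugin behaves like any built-in imputer (the toy DataFrame is the same illustrative one used earlier):

import numpy as np
import pandas as pd

X = pd.DataFrame([[1, 1, 1, 1], [4, 5, np.nan, np.nan], [3, 3, 9, 9], [2, 2, 2, 2]])

# Retrieve the custom plugin by its registered name and impute the data.
out = imputers.get(knn_imputer).fit_transform(X.copy())
print(out)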

Benchmark imputation models on a dataset

from sklearn.datasets import load_iris
from hyperimpute.plugins.imputers import Imputers
from hyperimpute.utils.benchmarks import compare_models

X, y = load_iris(as_frame=True, return_X_y=True)

imputer = Imputers().get("hyperimpute")

compare_models(
    name="example",
    evaluated_model=imputer,
    X_raw=X,
    ref_methods=["ice", "missforest"],
    scenarios=["MAR"],
    miss_pct=[0.1, 0.3],
    n_iter=2,
)

📓 Tutorials

:zap: Imputation methods

The following list describes the default imputation plugins (plugin source file in parentheses):

  • HyperImpute — Iterative imputer using both regression and classification methods based on linear models, trees, XGBoost, CatBoost and neural nets (plugin_hyperimpute.py)
  • Mean — Replace the missing values using the mean along each column, with SimpleImputer (plugin_mean.py)
  • Median — Replace the missing values using the median along each column, with SimpleImputer (plugin_median.py)
  • Most-frequent — Replace the missing values using the most frequent value along each column, with SimpleImputer (plugin_most_freq.py)
  • MissForest — Iterative imputation method based on Random Forests, using IterativeImputer and ExtraTreesRegressor (plugin_missforest.py)
  • ICE — Iterative imputation method based on regularized linear regression, using IterativeImputer and BayesianRidge (plugin_ice.py)
  • MICE — Multiple imputations based on ICE, using IterativeImputer and BayesianRidge (plugin_mice.py)
  • SoftImpute — Low-rank matrix approximation via nuclear-norm regularization (plugin_softimpute.py)
  • EM — Iterative procedure which uses other variables to impute a value (Expectation), then checks whether that value is the most likely (Maximization); the EM imputation algorithm (plugin_em.py)
  • Sinkhorn — Missing data imputation using optimal transport (plugin_sinkhorn.py)
  • GAIN — Missing data imputation using Generative Adversarial Nets (plugin_gain.py)
  • MIRACLE — Causally-aware imputation via learning missing data mechanisms (plugin_miracle.py)
  • MIWAE — Deep generative modelling and imputation of incomplete data (plugin_miwae.py)
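
As a hedged sketch of trying several of these defaults side by side (assuming the plugins are registered under the short names suggested by their source files, e.g. "mean", "ice", "softimpute"; the toy DataFrame is the same illustrative one used above):

import numpy as np
import pandas as pd
from hyperimpute.plugins.imputers import Imputers

X = pd.DataFrame([[1, 1, 1, 1], [4, 5, np.nan, np.nan], [3, 3, 9, 9], [2, 2, 2, 2]])

# Run a few of the default plugins on the same data and compare the outputs.
for name in ["mean", "ice", "softimpute"]:
    plugin = Imputers().get(name)
    out = plugin.fit_transform(X.copy())
    print(name)
    print(out)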

:hammer: Tests

Install the testing dependencies using

pip install .[testing]

The tests can be executed using

pytest -vsx

Download files

Download the file for your platform.

Source Distributions

No source distribution files are available for this release.

Built Distributions

hyperimpute-0.1.1-py3-none-macosx_10_14_x86_64.whl (82.5 kB)

Uploaded: Python 3, macOS 10.14+, x86-64

hyperimpute-0.1.1-py3-none-any.whl (83.3 kB)

Uploaded: Python 3

File details

Details for the file hyperimpute-0.1.1-py3-none-macosx_10_14_x86_64.whl.

File metadata

File hashes

Hashes for hyperimpute-0.1.1-py3-none-macosx_10_14_x86_64.whl:

  • SHA256: 5044c666b42caa1b789f031b102a59b9609a6522df00dadb90394baefc0e91a5
  • MD5: 43b07d9b51280a7a314a3aad51dd1df8
  • BLAKE2b-256: 618e900be2de8f655229c947be7340718482be9f8f3c519dd23ccf072b90d462


File details

Details for the file hyperimpute-0.1.1-py3-none-any.whl.

File metadata

  • Download URL: hyperimpute-0.1.1-py3-none-any.whl
  • Upload date:
  • Size: 83.3 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/4.0.0 CPython/3.10.4

File hashes

Hashes for hyperimpute-0.1.1-py3-none-any.whl:

  • SHA256: dc5c525b378651b5cbed378588c464a7609498696f583593f9b979219302b33c
  • MD5: c4b3aa0e84d4e82af59ed46de13cc957
  • BLAKE2b-256: 7a48262780d62f2a9dd8b936972475b354e737cb4ea8091aa59359ddaafda733

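To verify a downloaded wheel against the published SHA256 digest above, a minimal sketch (the local file name/path is an assumption; adjust it to wherever the wheel was saved):

import hashlib

# Hypothetical local path to the downloaded wheel.
wheel_path = "hyperimpute-0.1.1-py3-none-any.whl"
expected_sha256 = "dc5c525b378651b5cbed378588c464a7609498696f583593f9b979219302b33c"

with open(wheel_path, "rb") as f:
    digest = hashlib.sha256(f.read()).hexdigest()

# True if the file matches the published hash.
print(digest == expected_sha256)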
