HyperImpute - A library for NaNs and nulls.
HyperImpute simplifies the selection of a data imputation algorithm for your ML pipelines. It includes several novel algorithms for missing data and is compatible with sklearn.
HyperImpute features
- :rocket: Fast and extensible dataset imputation algorithms, compatible with sklearn.
- :key: New iterative imputation method: HyperImpute.
- :cyclone: Classic methods: MICE, MissForest, GAIN, MIRACLE, MIWAE, Sinkhorn, SoftImpute, etc.
- :fire: Pluggable architecture.
:rocket: Installation
The library can be installed from PyPI using
$ pip install hyperimpute
or from source, using
$ pip install .
:boom: Sample Usage
List available imputers
from hyperimpute.plugins.imputers import Imputers
imputers = Imputers()
imputers.list()
Impute a dataset using one of the available methods
import pandas as pd
import numpy as np
from hyperimpute.plugins.imputers import Imputers
X = pd.DataFrame([[1, 1, 1, 1], [4, 5, np.nan, np.nan], [3, 3, 9, 9], [2, 2, 2, 2]])
method = "gain"
plugin = Imputers().get(method)
out = plugin.fit_transform(X.copy())
print(method, out)
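After imputing, a quick sanity check can be useful. The helper below is a small sketch and not part of HyperImpute: `check_imputation` is a hypothetical name, and it only reports whether any NaNs remain and whether the originally observed cells were left unchanged (an imputer is not necessarily required to preserve them). A column-mean fill stands in for the `out` of a real plugin so that the snippet runs without the library.

```python
import numpy as np


def check_imputation(original, imputed):
    """Compare an incomplete array with its imputed version."""
    original = np.asarray(original, dtype=float)
    imputed = np.asarray(imputed, dtype=float)
    observed = ~np.isnan(original)
    return {
        "remaining_nans": int(np.isnan(imputed).sum()),
        "observed_unchanged": bool(
            np.allclose(original[observed], imputed[observed])
        ),
    }


X = np.array(
    [[1, 1, 1, 1], [4, 5, np.nan, np.nan], [3, 3, 9, 9], [2, 2, 2, 2]],
    dtype=float,
)

# Stand-in for `plugin.fit_transform(X.copy())`: fill NaNs with column means.
out = np.where(np.isnan(X), np.nanmean(X, axis=0), X)

print(check_imputation(X, out))
```

With a real plugin, passing `X` and the plugin's output to the same helper should work as long as both convert cleanly to a float array.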
Specify the baseline models for HyperImpute
import pandas as pd
import numpy as np
from hyperimpute.plugins.imputers import Imputers
X = pd.DataFrame([[1, 1, 1, 1], [4, 5, np.nan, np.nan], [3, 3, 9, 9], [2, 2, 2, 2]])
plugin = Imputers().get(
    "hyperimpute",
    optimizer="hyperband",
    classifier_seed=["logistic_regression"],
    regression_seed=["linear_regression"],
)
out = plugin.fit_transform(X.copy())
print(out)
Use an imputer with a SKLearn pipeline
import pandas as pd
import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.ensemble import RandomForestRegressor
from hyperimpute.plugins.imputers import Imputers
X = pd.DataFrame([[1, 1, 1, 1], [4, 5, np.nan, np.nan], [3, 3, 9, 9], [2, 2, 2, 2]])
y = pd.Series([1, 2, 1, 2])
imputer = Imputers().get("hyperimpute")
estimator = Pipeline(
    [
        ("imputer", imputer),
        ("forest", RandomForestRegressor(random_state=0, n_estimators=100)),
    ]
)
estimator.fit(X, y)
Write a new imputation plugin
from sklearn.impute import KNNImputer
from hyperimpute.plugins.imputers import Imputers, ImputerPlugin
imputers = Imputers()
knn_imputer = "custom_knn"
class KNN(ImputerPlugin):
    def __init__(self) -> None:
        super().__init__()
        self._model = KNNImputer(n_neighbors=2, weights="uniform")

    @staticmethod
    def name():
        return knn_imputer

    @staticmethod
    def hyperparameter_space():
        return []

    def _fit(self, *args, **kwargs):
        self._model.fit(*args, **kwargs)
        return self

    def _transform(self, *args, **kwargs):
        return self._model.transform(*args, **kwargs)

imputers.add(knn_imputer, KNN)
assert imputers.get(knn_imputer) is not None
Benchmark imputation models on a dataset
from sklearn.datasets import load_iris
from hyperimpute.plugins.imputers import Imputers
from hyperimpute.utils.benchmarks import compare_models
X, y = load_iris(as_frame=True, return_X_y=True)
imputer = Imputers().get("hyperimpute")
compare_models(
    name="example",
    evaluated_model=imputer,
    X_raw=X,
    ref_methods=["ice", "missforest"],
    scenarios=["MAR"],
    miss_pct=[0.1, 0.3],
    n_iter=2,
)
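To make the idea behind such a benchmark concrete, here is a minimal NumPy-only sketch of what an evaluation loop of this kind does conceptually: hide a fraction of known values, impute them, and score the reconstruction on the hidden cells. It is not the code behind `compare_models`. The helper names (`mask_completely_at_random`, `mean_impute`, `rmse_on_masked`) are hypothetical, the masking is completely at random (MCAR) rather than the MAR scenario used above, and column-mean imputation stands in for a real imputer.

```python
import numpy as np


def mask_completely_at_random(X, miss_pct, rng):
    """Return a copy of X with roughly miss_pct of cells set to NaN, plus the mask."""
    mask = rng.random(X.shape) < miss_pct
    X_missing = X.astype(float).copy()
    X_missing[mask] = np.nan
    return X_missing, mask


def mean_impute(X_missing):
    """Fill each NaN with the mean of the observed values in its column."""
    col_means = np.nanmean(X_missing, axis=0)
    filled = X_missing.copy()
    rows, cols = np.where(np.isnan(filled))
    filled[rows, cols] = col_means[cols]
    return filled


def rmse_on_masked(X_true, X_imputed, mask):
    """Root mean squared error restricted to the cells that were hidden."""
    diff = X_true[mask] - X_imputed[mask]
    return float(np.sqrt(np.mean(diff ** 2)))


rng = np.random.default_rng(0)
X_true = rng.normal(size=(200, 4))  # synthetic stand-in for a complete dataset

for miss_pct in (0.1, 0.3):
    X_missing, mask = mask_completely_at_random(X_true, miss_pct, rng)
    X_imputed = mean_impute(X_missing)
    print(miss_pct, rmse_on_masked(X_true, X_imputed, mask))
```

Scoring only the masked cells matters: the observed cells are copied through unchanged, so including them would make any imputer look better than it is.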
:zap: Imputation methods
The following table contains the default imputation plugins:
Strategy | Description | Code |
---|---|---|
HyperImpute | Iterative imputer using both regression and classification methods based on linear models, trees, XGBoost, CatBoost and neural nets | plugin_hyperimpute.py |
Mean | Replace the missing values using the mean along each column with SimpleImputer | plugin_mean.py |
Median | Replace the missing values using the median along each column with SimpleImputer | plugin_median.py |
Most-frequent | Replace the missing values using the most frequent value along each column with SimpleImputer | plugin_most_freq.py |
MissForest | Iterative imputation method based on Random Forests using IterativeImputer and ExtraTreesRegressor | plugin_missforest.py |
ICE | Iterative imputation method based on regularized linear regression using IterativeImputer and BayesianRidge | plugin_ice.py |
MICE | Multiple imputations based on ICE using IterativeImputer and BayesianRidge | plugin_mice.py |
SoftImpute | Low-rank matrix approximation via nuclear-norm regularization | plugin_softimpute.py |
EM | Iterative procedure which uses other variables to impute a value (Expectation), then checks whether that is the most likely value (Maximization) - EM imputation algorithm | plugin_em.py |
Sinkhorn | Missing Data Imputation using Optimal Transport | plugin_sinkhorn.py |
GAIN | GAIN: Missing Data Imputation using Generative Adversarial Nets | plugin_gain.py |
MIRACLE | MIRACLE: Causally-Aware Imputation via Learning Missing Data Mechanisms | plugin_miracle.py |
MIWAE | MIWAE: Deep Generative Modelling and Imputation of Incomplete Data | plugin_miwae.py |
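As an illustration of one table entry, the sketch below implements the basic SoftImpute iteration in plain NumPy: the missing cells are repeatedly refilled from a soft-thresholded SVD of the current estimate. It is a simplified version under stated assumptions and not the code in `plugin_softimpute.py`; the function name `soft_impute` and the parameters `shrinkage`, `max_iter` and `tol` are chosen here for illustration only.

```python
import numpy as np


def soft_impute(X, shrinkage=1.0, max_iter=200, tol=1e-5):
    """Fill NaNs in X by iterative soft-thresholded SVD (nuclear-norm shrinkage)."""
    X = np.asarray(X, dtype=float)
    missing = np.isnan(X)
    # Start from zeros in the missing cells; observed cells are never changed.
    Z = np.where(missing, 0.0, X)
    for _ in range(max_iter):
        U, s, Vt = np.linalg.svd(Z, full_matrices=False)
        s_shrunk = np.maximum(s - shrinkage, 0.0)  # soft-threshold singular values
        low_rank = (U * s_shrunk) @ Vt
        Z_new = np.where(missing, low_rank, X)
        # Stop once the filled-in matrix barely moves between iterations.
        change = np.linalg.norm(Z_new - Z) / max(np.linalg.norm(Z), 1e-12)
        Z = Z_new
        if change < tol:
            break
    return Z


# Toy example (synthetic data): a rank-1 matrix with two hidden entries.
rng = np.random.default_rng(0)
truth = np.outer(rng.normal(size=6), rng.normal(size=4))
X = truth.copy()
X[1, 2] = np.nan
X[4, 0] = np.nan

completed = soft_impute(X, shrinkage=0.1)
print(np.abs(completed - truth)[np.isnan(X)])  # absolute error on hidden cells
```

The `shrinkage` value trades off rank against fit: larger values zero out more singular values and give a lower-rank, smoother completion.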
:hammer: Tests
Install the testing dependencies using
pip install .[testing]
The tests can be executed using
pytest -vsx
Citing
If you use this code, please cite the associated paper:
@inproceedings{Jarrett2022HyperImpute,
  doi = {10.48550/ARXIV.2206.07769},
  url = {https://arxiv.org/abs/2206.07769},
  author = {Jarrett, Daniel and Cebere, Bogdan and Liu, Tennison and Curth, Alicia and van der Schaar, Mihaela},
  keywords = {Machine Learning (stat.ML), Machine Learning (cs.LG), FOS: Computer and information sciences},
  title = {HyperImpute: Generalized Iterative Imputation with Automatic Model Selection},
  year = {2022},
  booktitle = {39th International Conference on Machine Learning},
}