Best imputation method.
MissForest
This project is a Python implementation of the MissForest algorithm, a tool for handling missing values in tabular datasets. Its primary goal is to give users a more accurate way to impute missing data than simple approaches such as mean or mode imputation.
MissForest usually takes longer to run than simpler imputation methods, but it typically yields more accurate results.
This trade-off of speed for accuracy is deliberate: MissForest is designed for users who prioritize data accuracy over processing speed, which makes it a good choice for projects where data quality is paramount.
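For readers unfamiliar with the algorithm, the sketch below shows the general idea behind MissForest-style imputation on purely numeric data: start from a mean fill, then repeatedly train a random forest on the rows where a column is observed and use it to predict the rows where it is missing. This is a simplified illustration and not this package's implementation; the function name `iterative_forest_impute`, the fixed iteration count (instead of a convergence-based stopping rule), and the use of scikit-learn's `RandomForestRegressor` are assumptions made for the example.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor


def iterative_forest_impute(X, n_iter=3, seed=0):
    """Hypothetical helper: impute NaNs in a numeric 2-D array, column by column."""
    X = np.asarray(X, dtype=float)
    missing = np.isnan(X)

    # Step 1: initial guess, fill each column with the mean of its observed values.
    col_means = np.nanmean(X, axis=0)
    filled = np.where(missing, col_means, X)

    # Step 2: visit columns from fewest to most missing values.
    order = np.argsort(missing.sum(axis=0))

    for _ in range(n_iter):
        for j in order:
            rows = missing[:, j]
            if not rows.any():
                continue  # nothing to impute in this column
            others = np.delete(filled, j, axis=1)
            model = RandomForestRegressor(n_estimators=50, random_state=seed)
            # Train on rows where column j was observed ...
            model.fit(others[~rows], filled[~rows, j])
            # ... and overwrite the current guesses where it was missing.
            filled[rows, j] = model.predict(others[rows])
    return filled


# Small demonstration on synthetic, correlated columns.
rng = np.random.default_rng(0)
a = rng.normal(size=60)
X = np.column_stack([a, 2 * a + 1, -a])
X_missing = X.copy()
X_missing[::7, 1] = np.nan  # blank out every 7th value of the second column
X_imputed = iterative_forest_impute(X_missing)
```

Each pass fits one model per incomplete column, which is why this family of methods is slower than a single mean or mode fill.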
How MissForest Handles Categorical Variables
Columns listed in the 'categorical' argument are label encoded so that the estimators can work with them.
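As an illustration of what label encoding means here, the sketch below maps each category to an integer code while leaving missing entries missing, so that they can still be imputed afterwards. The helpers `label_encode` and `label_decode` are hypothetical names written for this example; they are not part of the MissForest package and may differ from how it encodes columns internally.

```python
import numpy as np
import pandas as pd


def label_encode(series):
    """Hypothetical helper: map categories to integer codes, keeping missing values."""
    categories = sorted(series.dropna().unique())
    mapping = {cat: code for code, cat in enumerate(categories)}
    # Values absent from the mapping (the missing entries) stay missing.
    encoded = series.map(mapping)
    return encoded, mapping


def label_decode(encoded, mapping):
    """Hypothetical helper: turn integer codes back into the original labels."""
    inverse = {code: cat for cat, code in mapping.items()}
    return encoded.map(lambda v: inverse[int(v)] if pd.notna(v) else np.nan)


smoker = pd.Series(["yes", "no", np.nan, "yes"])
encoded, mapping = label_encode(smoker)
decoded = label_decode(encoded, mapping)
```

After imputation, the same mapping can be inverted to recover readable labels for the filled-in cells.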
Example
Install MissForest using pip:
pip install MissForest
Imputing a dataset:
from missforest.missforest import MissForest
import pandas as pd
from sklearn.model_selection import train_test_split

if __name__ == "__main__":
    df = pd.read_csv("insurance.csv")
    # Hold out a test set (split ratio chosen for illustration)
    train, test = train_test_split(df, test_size=0.3, random_state=42)

    # default estimators are lgbm classifier and regressor
    mf = MissForest()
    mf.fit(
        X=train,
        categorical=["sex", "smoker", "region"]
    )
    train_imputed = mf.transform(X=train)
    test_imputed = mf.transform(X=test)
    print(test_imputed)

    # or using the 'fit_transform' method
    mf = MissForest()
    train_imputed = mf.fit_transform(
        X=train,
        categorical=["sex", "smoker", "region"]
    )
    test_imputed = mf.transform(X=test)
    print(test_imputed)
Imputing with other estimators
from missforest.missforest import MissForest
import pandas as pd
import numpy as np
from sklearn.ensemble import RandomForestRegressor, RandomForestClassifier

if __name__ == "__main__":
    df = pd.read_csv("insurance.csv")
    df_or = df.copy()  # keep the complete data for later comparison

    # Blank out up to 100 randomly chosen rows in every column
    for c in df.columns:
        random_index = np.random.choice(df.index, size=100)
        df.loc[random_index, c] = np.nan

    # Use random forests instead of the default estimators
    clf = RandomForestClassifier(n_jobs=-1)
    rgr = RandomForestRegressor(n_jobs=-1)
    mf = MissForest(clf, rgr)
    df_imputed = mf.fit_transform(
        X=df,
        categorical=["sex", "smoker", "region"]
    )
Benchmark
Mean Absolute Percentage Error
Variable | missForest | mean/mode | Difference
---|---|---|---
charges | 2.65% | 9.72% | -7.07%
age | 1.16% | 2.77% | -1.61%
bmi | 1.18% | 1.25% | -0.07%
sex | 21.21 | 31.82 | -10.61
smoker | 4.24 | 9.90 | -5.66
region | 46.67 | 38.96 | +7.71
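To make the metric concrete, here is a minimal sketch of how a mean absolute percentage error can be computed for a numeric column, restricted to the cells that were artificially removed and then imputed. The function name `masked_mape` and the toy numbers are assumptions for illustration; they are not the script or data behind the figures above.

```python
import numpy as np


def masked_mape(original, imputed, mask):
    """MAPE in percent over the entries where mask is True (hypothetical helper)."""
    original = np.asarray(original, dtype=float)
    imputed = np.asarray(imputed, dtype=float)
    mask = np.asarray(mask, dtype=bool)
    # Relative error of each imputed cell against its true value.
    errors = np.abs((imputed[mask] - original[mask]) / original[mask])
    return 100.0 * errors.mean()


true_values = [100.0, 200.0, 400.0, 50.0]
imputed_values = [110.0, 200.0, 300.0, 50.0]
was_missing = [True, False, True, False]

score = masked_mape(true_values, imputed_values, was_missing)
print(f"{score:.2f}%")  # prints 17.50%
```

Scoring only the masked cells keeps the untouched, already-correct values from diluting the error. A negative value in the Difference column means missForest had the lower error for that variable.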