Best imputation method.

Project description

MissForest

This project is a Python implementation of the MissForest algorithm, which imputes missing values in tabular datasets. Its primary goal is to give users a more accurate imputation method than simple approaches such as mean or mode filling.

MissForest usually takes longer to run than simpler imputation methods, but it typically yields more accurate results. It trades processing speed for accuracy, which makes it a good choice for projects where data quality matters more than run time.
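
For reference, the kind of simpler method mentioned above (the mean/mode baseline that also appears in the benchmark) can be sketched in a few lines of plain Python. This is an illustrative sketch only and not part of the MissForest package; the function name `impute_mean_mode` and the toy rows are hypothetical.

```python
from statistics import mean, mode


def impute_mean_mode(rows, categorical):
    """Fill None entries with the column mean (numeric) or mode (categorical)."""
    columns = list(rows[0].keys())
    fill = {}
    for col in columns:
        # Only observed (non-missing) values contribute to the fill value.
        observed = [r[col] for r in rows if r[col] is not None]
        fill[col] = mode(observed) if col in categorical else mean(observed)
    return [
        {col: fill[col] if r[col] is None else r[col] for col in columns}
        for r in rows
    ]


# Hypothetical toy data with one missing value per column.
rows = [
    {"age": 20, "smoker": "no"},
    {"age": None, "smoker": "no"},
    {"age": 40, "smoker": None},
    {"age": 30, "smoker": "yes"},
]
imputed = impute_mean_mode(rows, categorical={"smoker"})
```

Every missing entry in a column receives the same value, which is fast but ignores relationships between columns. MissForest instead predicts each missing entry from the other columns, which is where the extra run time goes.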

How MissForest Handles Categorical Variables

Columns listed in the 'categorical' argument are label encoded so that the estimators can work with them.
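
To illustrate what label encoding means here, the sketch below maps each distinct category to an integer code and back again, leaving missing entries untouched. It is a minimal plain-Python illustration under assumed conventions (alphabetical code order, `None` for missing); the helper names `label_encode` and `label_decode` are hypothetical, and the encoding used inside MissForest may differ in its details.

```python
def label_encode(values):
    """Map each distinct non-missing category to an integer code."""
    categories = sorted({v for v in values if v is not None})
    mapping = {cat: code for code, cat in enumerate(categories)}
    # Missing entries stay missing so they can be imputed later.
    codes = [mapping[v] if v is not None else None for v in values]
    return codes, mapping


def label_decode(codes, mapping):
    """Invert label_encode, turning integer codes back into categories."""
    inverse = {code: cat for cat, code in mapping.items()}
    return [inverse[c] if c is not None else None for c in codes]


regions = ["southwest", None, "northeast", "southwest"]
codes, mapping = label_encode(regions)
restored = label_decode(codes, mapping)
```

Keeping the mapping around matters because imputed integer codes have to be turned back into the original category labels once imputation is done.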

Example

Install MissForest using pip:

pip install MissForest

Imputing a dataset:

import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from missforest import MissForest

# Load toy dataset.
df = pd.read_csv("insurance.csv")

# Label encoding.
df["sex"] = df["sex"].map({"male": 0, "female": 1})
df["region"] = df["region"].map({
    "southwest": 0, "southeast": 1, "northwest": 2, "northeast": 3})

# Create missing values.
for c in df.columns:
    n = int(len(df) * 0.1)
    rand_idx = np.random.choice(df.index, n)
    df.loc[rand_idx, c] = np.nan

# Split dataset into train and test sets.
train, test = train_test_split(df, test_size=.3, shuffle=True,
                               random_state=42)

# Default estimators are the LightGBM classifier and regressor.
mf = MissForest()
mf.fit(
    x=train,
    categorical=["sex", "smoker", "region"]
)
train_imputed = mf.transform(x=train)
test_imputed = mf.transform(x=test)

Alternatively, use the 'fit_transform' method:

mf = MissForest()
train_imputed = mf.fit_transform(
    x=train,
    categorical=["sex", "smoker", "region"]
)
test_imputed = mf.transform(x=test)
print(test_imputed)

Imputing with other estimators:

from missforest import MissForest
import pandas as pd
import numpy as np
from sklearn.ensemble import RandomForestRegressor, RandomForestClassifier

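# Load toy dataset.
df = pd.read_csv("insurance.csv")

# Create missing values in every column.
for c in df.columns:
    random_index = np.random.choice(df.index, size=100)
    df.loc[random_index, c] = np.nan

# Use random forests instead of the default estimators.
clf = RandomForestClassifier(n_jobs=-1)
rgr = RandomForestRegressor(n_jobs=-1)

mf = MissForest(clf, rgr)
# Declare the string columns as categorical so they are label encoded.
df_imputed = mf.fit_transform(df, categorical=["sex", "smoker", "region"])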

Benchmark

Mean Absolute Percentage Error

Variable   missForest   mean/mode   Difference
charges    2.65%        9.72%       -7.07%
age        1.16%        2.77%       -1.61%
bmi        1.18%        1.25%       -0.07%
sex        21.21        31.82       -10.61
smoker     4.24         9.90        -5.66
region     46.67        38.96       +7.71
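
The benchmark does not say exactly how the scores were computed. As a rough guide, a standard mean absolute percentage error over the entries that were masked and then imputed could be computed as in the sketch below. The function name `mape` and the sample numbers are hypothetical and are not taken from the benchmark.

```python
def mape(actual, imputed):
    """Mean absolute percentage error, in percent.

    Undefined when any true value is zero, because each error is
    divided by the true value.
    """
    if len(actual) != len(imputed):
        raise ValueError("actual and imputed must have the same length")
    errors = [abs(a - p) / abs(a) for a, p in zip(actual, imputed)]
    return 100 * sum(errors) / len(errors)


# Hypothetical true values and their imputed counterparts.
actual = [100.0, 200.0, 400.0]
imputed = [110.0, 190.0, 400.0]
score = mape(actual, imputed)
```

Lower values are better, so a negative entry in the Difference column means MissForest had the smaller error for that variable.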
