Best imputation method.

MissForest

This project is a Python implementation of the MissForest algorithm, which imputes missing values in tabular datasets. Its goal is to give users a more accurate way to impute missing data than simpler methods offer.

MissForest usually takes longer to run than simpler imputation methods, such as mean or mode imputation, but it typically yields more accurate results.

This is a deliberate trade-off: MissForest is designed for users who prioritize imputation accuracy over processing speed, which makes it a good fit for projects where data quality is paramount.
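To make the speed/accuracy trade-off concrete, here is a minimal sketch of the iterative idea behind MissForest-style imputation, restricted to numeric columns. It is not this package's implementation: the function name `iterative_rf_impute`, the use of scikit-learn's `RandomForestRegressor`, and the simple tolerance-based stopping rule are assumptions made for illustration only.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor


def iterative_rf_impute(df, max_iter=5, random_state=0):
    """Simplified MissForest-style imputation for an all-numeric DataFrame.

    Hypothetical helper for illustration; not part of the MissForest package.
    """
    mask = df.isna()               # remember which cells were missing
    filled = df.fillna(df.mean())  # initial guess: column means
    # Visit columns from the fewest to the most missing values.
    order = mask.sum().sort_values().index
    for _ in range(max_iter):
        previous = filled.copy()
        for col in order:
            missing = mask[col]
            if not missing.any():
                continue
            predictors = filled.drop(columns=col)
            model = RandomForestRegressor(n_estimators=50,
                                          random_state=random_state)
            # Train on rows where the column was observed ...
            model.fit(predictors[~missing], filled.loc[~missing, col])
            # ... and overwrite the originally missing cells with predictions.
            filled.loc[missing, col] = model.predict(predictors[missing])
        # Simplified stopping rule (an assumption): stop once a full pass
        # no longer changes the imputed values.
        change = ((filled - previous) ** 2).to_numpy().sum()
        if change < 1e-6:
            break
    return filled


# Toy data: column "b" depends strongly on column "a".
rng = np.random.default_rng(0)
a = rng.normal(size=200)
demo = pd.DataFrame({
    "a": a,
    "b": 2 * a + rng.normal(scale=0.1, size=200),
    "c": rng.normal(size=200),
})
demo_missing = demo.copy()
demo_missing.loc[rng.choice(200, size=20, replace=False), "b"] = np.nan

imputed = iterative_rf_impute(demo_missing)
```

Each pass refits one model per incomplete column, which is where the extra run time comes from compared with filling in a single mean or mode.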

How MissForest Handles Categorical Variables

Columns listed in the 'categorical' argument are label encoded so that the estimators can work with them.
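As a rough illustration of what label encoding means here, the sketch below maps each observed category to an integer code and leaves missing values untouched. The helpers `label_encode` and `label_decode` are hypothetical names, and the package's internal encoding may differ in its details.

```python
import numpy as np
import pandas as pd


def label_encode(series):
    """Map each observed category to an integer code; missing values stay NaN."""
    categories = sorted(series.dropna().unique())
    mapping = {category: code for code, category in enumerate(categories)}
    return series.map(mapping), mapping


def label_decode(encoded, mapping):
    """Turn integer codes back into the original category labels."""
    inverse = {code: category for category, code in mapping.items()}
    return encoded.map(lambda code: inverse[int(code)], na_action="ignore")


region = pd.Series(["southwest", "northeast", np.nan, "southwest"])
encoded, mapping = label_encode(region)
decoded = label_decode(encoded, mapping)
```

Keeping the mapping around is what allows imputed codes to be translated back into readable labels afterwards.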

Example

Install MissForest using pip:

pip install MissForest

Imputing a dataset:

import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from missforest import MissForest

# Load toy dataset.
df = pd.read_csv("insurance.csv")

# Label encoding.
df["sex"] = df["sex"].map({"male": 0, "female": 1})
df["region"] = df["region"].map({
    "southwest": 0, "southeast": 1, "northwest": 2, "northeast": 3})

# Create missing values.
for c in df.columns:
    n = int(len(df) * 0.1)
    rand_idx = np.random.choice(df.index, n)
    df.loc[rand_idx, c] = np.nan

# Split dataset into train and test sets.
train, test = train_test_split(df, test_size=.3, shuffle=True,
                               random_state=42)

# The default estimators are LightGBM's classifier and regressor.
mf = MissForest()
mf.fit(
    x=train,
    categorical=["sex", "smoker", "region"]
)
train_imputed = mf.transform(x=train)
test_imputed = mf.transform(x=test)

Alternatively, fit and transform the training set in one step with the 'fit_transform' method:

mf = MissForest()
train_imputed = mf.fit_transform(
    x=train,
    categorical=["sex", "smoker", "region"]
)
test_imputed = mf.transform(x=test)
print(test_imputed)
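One detail of the masking loop in the example above: `np.random.choice` samples with replacement by default, so a column can end up with slightly fewer missing values than the requested 10%. The sketch below shows one way to mask an exact number of rows reproducibly; the helper name `mask_values` and its parameters are hypothetical and not part of MissForest.

```python
import numpy as np
import pandas as pd


def mask_values(df, fraction=0.1, seed=0):
    """Return a copy of df with `fraction` of the rows set to NaN in each column."""
    rng = np.random.default_rng(seed)  # seeded generator for reproducibility
    masked = df.copy()
    n = int(len(masked) * fraction)
    for col in masked.columns:
        # replace=False guarantees n distinct rows are masked in this column.
        idx = rng.choice(masked.index, size=n, replace=False)
        masked.loc[idx, col] = np.nan
    return masked


toy = pd.DataFrame({
    "age": np.arange(20.0),
    "bmi": np.linspace(18.0, 37.0, 20),
})
masked = mask_values(toy)
```

Returning a copy keeps the complete data available, which is needed later to score the imputed values against the true ones.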

Imputing with other estimators

import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier, RandomForestRegressor
from missforest import MissForest

# Load toy dataset.
df = pd.read_csv("insurance.csv")

# Create missing values in every column.
for c in df.columns:
    random_index = np.random.choice(df.index, size=100)
    df.loc[random_index, c] = np.nan

# Use random forests instead of the default estimators.
clf = RandomForestClassifier(n_jobs=-1)
rgr = RandomForestRegressor(n_jobs=-1)

mf = MissForest(clf, rgr)
df_imputed = mf.fit_transform(df, categorical=["sex", "smoker", "region"])

Benchmark

Mean Absolute Percentage Error

Variable	missForest	mean/mode	Difference (missForest - mean/mode)
charges	2.65%	9.72%	-7.07%
age	1.16%	2.77%	-1.61%
bmi	1.18%	1.25%	-0.07%
sex	21.21	31.82	-10.61
smoker	4.24	9.90	-5.66
region	46.67	38.96	+7.71
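For readers who want to run a similar comparison, the sketch below shows how such scores could be computed on the cells that were masked on purpose. The helper names `mape` and `error_rate` are hypothetical. Scoring categorical columns by the share of wrong labels is an assumption, since the table does not state which metric those rows use.

```python
import numpy as np


def mape(true, imputed):
    """Mean absolute percentage error, in percent (true values must be non-zero)."""
    true = np.asarray(true, dtype=float)
    imputed = np.asarray(imputed, dtype=float)
    return float(np.mean(np.abs((true - imputed) / true)) * 100)


def error_rate(true, imputed):
    """Share of imputed labels that differ from the true labels, in percent."""
    true = np.asarray(true)
    imputed = np.asarray(imputed)
    return float(np.mean(true != imputed) * 100)


# Toy values standing in for the masked cells of one numeric column
# and one categorical column.
charges_mape = mape([100.0, 200.0, 400.0], [110.0, 180.0, 400.0])
smoker_error = error_rate(["yes", "no", "no", "yes"],
                          ["yes", "no", "yes", "yes"])
```

Only cells that were observed before masking can be scored this way, because the true value has to be known.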

Release history

Version	Release date
3.1.3 (this version)	Aug 8, 2024
3.1.2	Aug 8, 2024
3.1.1	Jul 16, 2024
3.1.0	Jul 16, 2024
3.0.0	Jul 15, 2024
2.5.5	Mar 17, 2024
2.4.4	Mar 16, 2024
2.4.2	Dec 31, 2023
2.4.1	Dec 31, 2023
2.3.2	Dec 21, 2023
2.3.1	Dec 8, 2023
2.3.0	Dec 7, 2023
2.2.3	Nov 4, 2023
2.2.2	Nov 4, 2023
2.2.1	Nov 4, 2023
2.0.0	Jul 21, 2023
1.1.3	Feb 25, 2022
1.1.1	Dec 10, 2021
