Best imputation method.

Project description

MissForest

This project is a Python implementation of the MissForest algorithm, a powerful tool designed to handle missing values in tabular datasets. The primary goal of this project is to provide users with a more accurate method of imputing missing data.

MissForest trades processing speed for accuracy: it usually takes longer to run than simpler imputation methods such as mean or mode filling, but it typically yields more accurate results. This makes it a good fit for projects where the quality of the imputed data matters more than runtime.
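To make the speed and accuracy trade-off concrete, here is a minimal sketch of the iterative idea behind the MissForest algorithm: start from a mean fill, then repeatedly refit a random forest per column and re-predict the missing cells. It is not this package's implementation. The function name `iterative_forest_impute` and its parameters are hypothetical, it handles numeric columns only, it uses a simplified stopping rule, and it assumes every column has at least one observed value.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor


def iterative_forest_impute(X, max_iter=5, random_state=0):
    """Impute NaNs in a 2-D numeric array (hypothetical, simplified sketch)."""
    X = np.asarray(X, dtype=float)
    missing = np.isnan(X)

    # Step 1: initial guess, fill each column with the mean of its observed values.
    imputed = np.where(missing, np.nanmean(X, axis=0), X)

    # Step 2: visit columns from fewest to most missing values.
    order = np.argsort(missing.sum(axis=0))
    prev_change = np.inf

    for _ in range(max_iter):
        candidate = imputed.copy()
        for j in order:
            rows = missing[:, j]
            if not rows.any():
                continue
            # Train on rows where column j is observed, using all other columns.
            others = np.delete(candidate, j, axis=1)
            model = RandomForestRegressor(n_estimators=20, random_state=random_state)
            model.fit(others[~rows], X[~rows, j])
            candidate[rows, j] = model.predict(others[rows])

        # Step 3: stop as soon as the imputed values change more than in the
        # previous round, and keep the earlier result.
        change = np.sum((candidate[missing] - imputed[missing]) ** 2)
        if change > prev_change:
            break
        imputed, prev_change = candidate, change

    return imputed


# Usage on synthetic data: the third column is the sum of the first two.
rng = np.random.default_rng(0)
data = rng.normal(size=(60, 3))
data[:, 2] = data[:, 0] + data[:, 1]
holed = data.copy()
holed[rng.choice(60, size=10, replace=False), 2] = np.nan

filled = iterative_forest_impute(holed)
print(np.isnan(filled).sum())  # 0
```

Every round fits one forest per incomplete column, which is where the extra runtime compared with a single mean or mode fill comes from.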

How MissForest Handles Categorical Variables

Columns listed in the 'categorical' argument are label encoded so that the estimators can work with them.
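As an illustration of what label encoding means in general, the sketch below maps each distinct category to an integer code and leaves missing entries untouched. The helper names `label_encode` and `label_decode` are hypothetical and are not part of the MissForest API; how the package encodes values internally may differ.

```python
def label_encode(values):
    """Map each distinct non-missing category to an integer code."""
    categories = sorted({v for v in values if v is not None})
    mapping = {category: code for code, category in enumerate(categories)}
    codes = [mapping[v] if v is not None else None for v in values]
    return codes, mapping


def label_decode(codes, mapping):
    """Turn integer codes back into the original category labels."""
    inverse = {code: category for category, code in mapping.items()}
    return [inverse[c] if c is not None else None for c in codes]


smoker = ["yes", "no", None, "no", "yes"]
codes, mapping = label_encode(smoker)
print(codes)    # [1, 0, None, 0, 1]
print(mapping)  # {'no': 0, 'yes': 1}
print(label_decode(codes, mapping) == smoker)  # True
```

Keeping the mapping around is what allows imputed codes to be translated back into readable labels afterwards.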

Example

Install MissForest using pip:

pip install MissForest

Imputing a dataset:

from missforest.missforest import MissForest
import pandas as pd
from sklearn.model_selection import train_test_split


if __name__ == "__main__":
    df = pd.read_csv("insurance.csv")

    # Split into train and test sets (the split ratio is an arbitrary choice)
    train, test = train_test_split(df, test_size=0.3, random_state=42)

    # Default estimators are the LightGBM classifier and regressor
    mf = MissForest()
    mf.fit(
        X=train,
        categorical=["sex", "smoker", "region"]
    )
    train_imputed = mf.transform(X=train)
    test_imputed = mf.transform(X=test)
    print(test_imputed)

    # Or use the 'fit_transform' method to fit and impute the training set in one call
    mf = MissForest()
    train_imputed = mf.fit_transform(
        X=train,
        categorical=["sex", "smoker", "region"]
    )
    test_imputed = mf.transform(X=test)
    print(test_imputed)

Imputing with other estimators:

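from missforest.missforest import MissForest
import pandas as pd
import numpy as np
from sklearn.ensemble import RandomForestRegressor, RandomForestClassifier


if __name__ == "__main__":
    df = pd.read_csv("insurance.csv")
    # Keep an untouched copy so the imputed values can be compared with the originals
    df_or = df.copy()
    # Blank out randomly chosen rows in every column (sampled with replacement)
    for c in df.columns:
        random_index = np.random.choice(df.index, size=100)
        df.loc[random_index, c] = np.nan

    # Any scikit-learn style classifier and regressor can replace the defaults
    clf = RandomForestClassifier(n_jobs=-1)
    rgr = RandomForestRegressor(n_jobs=-1)

    mf = MissForest(clf, rgr)
    # Categorical columns are passed in the same way as in the previous example
    df_imputed = mf.fit_transform(
        X=df,
        categorical=["sex", "smoker", "region"]
    )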

Benchmark

                Mean Absolute Percentage Error
Variable    missForest    mean/mode    Difference
charges          2.65%        9.72%        -7.07%
age              1.16%        2.77%        -1.61%
bmi              1.18%        1.25%        -0.07%
sex              21.21        31.82        -10.61
smoker            4.24         9.90         -5.66
region           46.67        38.96         +7.71
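The table does not say exactly how the errors were computed. One plausible reading for the numeric columns is a mean absolute percentage error (MAPE) over the cells that were masked and then imputed. The sketch below shows that calculation. The function name `mape` and all the numbers are hypothetical and are not taken from the benchmark above.

```python
from statistics import mean


def mape(true_values, imputed_values):
    """Mean absolute percentage error, in percent, over paired values."""
    ratios = [abs(t - p) / abs(t) for t, p in zip(true_values, imputed_values)]
    return 100 * mean(ratios)


# Hypothetical original values of the cells that were masked
truth = [100.0, 200.0, 400.0]

# Hypothetical model-based imputations for those cells
model_imputed = [110.0, 190.0, 400.0]

# Mean baseline: every masked cell gets the mean of the observed values
observed = [150.0, 250.0, 350.0]
baseline_imputed = [mean(observed)] * len(truth)

print(f"model:    {mape(truth, model_imputed):.2f}%")
print(f"baseline: {mape(truth, baseline_imputed):.2f}%")
```

A negative value in the Difference column then means the first method had the lower error for that variable. Note that MAPE is undefined when a true value is zero, so such cells would need separate handling.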

Download files

Source Distribution: MissForest-2.2.2.tar.gz (8.7 kB)

Built Distribution: MissForest-2.2.2-py3-none-any.whl (7.4 kB, Python 3)
