A package for missing data imputation

MissEnsemble

License: Apache 2.0

MissEnsemble is a generalization of the popular MissForest algorithm (Stekhoven & Bühlmann, 2012) for missing value imputation. It extends MissForest by supporting multiple ensemble methods and provides a scikit-learn compatible API. Currently supported ensemble methods:

  • Random Forests
  • XGBoost

MissEnsemble natively handles different types of input values (e.g., strings and numbers). You only need to specify which column names belong to which variable type (numerical, categorical, or ordinal).

In addition, MissEnsemble provides built-in visualization functions for convergence and imputation validation (when true values are available).
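To make the underlying idea concrete, the following is a minimal sketch of a MissForest-style loop for numerical columns only, written directly with scikit-learn. It is not MissEnsemble's implementation: the function name `iterative_forest_impute` and its arguments are hypothetical, and categorical handling and the stopping rule are left out.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor


def iterative_forest_impute(df, n_iter=3, n_estimators=20, random_state=0):
    """Hypothetical helper: MissForest-style imputation for numeric columns."""
    mask = df.isna()                      # remember which cells were missing
    filled = df.fillna(df.mean())         # initial guess: column means
    # Visit columns from the fewest to the most missing values
    order = mask.sum().sort_values().index
    for _ in range(n_iter):
        for col in order:
            missing = mask[col]
            if not missing.any():
                continue
            predictors = filled.drop(columns=col)
            model = RandomForestRegressor(
                n_estimators=n_estimators, random_state=random_state
            )
            # Train on rows where the column was observed
            model.fit(predictors[~missing], filled.loc[~missing, col])
            # Overwrite only the originally missing cells
            filled.loc[missing, col] = model.predict(predictors[missing])
    return filled


rng = np.random.default_rng(0)
x = rng.normal(size=200)
demo = pd.DataFrame({"a": x, "b": 2 * x + rng.normal(scale=0.1, size=200)})
demo.loc[demo.sample(40, random_state=0).index, "b"] = np.nan

result = iterative_forest_impute(demo)
print(result.isna().sum().sum())  # count of remaining missing values
```

Each pass refits one model per incomplete column and uses the current estimates of the other columns as predictors, so later passes can refine earlier guesses.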

Setup

Install from PyPI:

pip install missensemble

Usage Example

Every column in your DataFrame must be assigned to exactly one variable type (categorical, ordinal, or numerical) so that the imputation method treats each variable appropriately. Example with a DataFrame of five variables:

import numpy as np
import pandas as pd
from missensemble import MissEnsemble

# Create example dataframe (100 x 5)
data = pd.DataFrame({
    "col1": np.random.choice(['A', 'B', 'C'], size=100),
    "col2": np.random.choice(['X', 'Y'], size=100),
    "col3": np.random.randint(1, 5, size=100),
    "col4": np.random.randn(100),
    "col5": np.random.randn(100)
})

# Create NAs for col1 and col4
for col in ['col1', 'col4']:
    na_index = data.sample(30).index  # 30 rows chosen at random
    data.loc[na_index, col] = np.nan

# Initialize the MissEnsemble class
estimator = MissEnsemble(
    categorical_vars=['col1', 'col2'],
    ordinal_vars=['col3'],
    numerical_vars=['col4', 'col5'],
)

# Fit and transform the data
imputed_data = estimator.fit_transform(data)

For an extended usage example, see the example.ipynb notebook.
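For wider DataFrames, the three lists can be drafted from the column dtypes instead of being typed by hand. The sketch below uses only pandas; the column names are made up for illustration, and the ordinal list is still written manually because ordinal columns cannot be distinguished from numerical ones by dtype alone.

```python
import pandas as pd

df = pd.DataFrame({
    "color": ["red", "blue", "red"],
    "grade": [1, 3, 2],          # ordered levels stored as integers
    "height": [1.7, 1.8, 1.6],
})

# Ordinal columns are listed by hand (an assumption of this sketch)
ordinal_vars = ["grade"]

numeric_cols = df.select_dtypes(include="number").columns
numerical_vars = [c for c in numeric_cols if c not in ordinal_vars]
categorical_vars = [
    c for c in df.columns if c not in numeric_cols and c not in ordinal_vars
]

print(categorical_vars, ordinal_vars, numerical_vars)
```

The resulting lists can then be passed as categorical_vars, ordinal_vars, and numerical_vars. It is worth reviewing them before fitting, since integer-coded categories would land in the numerical list.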

Parameters

The MissEnsemble class accepts the following parameters:

  • n_iter (int): Number of iterations to perform for imputation.
  • categorical_vars (list of str): List of column names representing categorical variables.
  • ordinal_vars (list of str): List of column names representing ordinal variables.
  • numerical_vars (list of str): List of column names representing numerical variables.
  • ens_method (str, optional): Ensemble method to use for imputation, either 'forest' (default) or 'xgb'.
  • n_estimators (int, optional): Number of estimators to use in the ensemble method. Default is 100.
  • tol (float, optional): Tolerance for convergence. Default is 1e-4.
  • random_state (int, optional): Random state for reproducibility. Default is 42.
  • print_criteria (bool, optional): Whether to print the imputation criteria during fitting. Default is False.

If the change in the convergence criterion stays below tol for three rounds, the algorithm terminates early.
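The early-stopping rule can be sketched in plain Python. This is an illustration rather than the library's code: the function `stopping_round` is hypothetical, and it assumes that "change" means the absolute difference between successive criterion values and that the three rounds must be consecutive.

```python
def stopping_round(criteria, tol=1e-4, patience=3):
    """Return the 1-based round at which fitting would stop early, or None.

    Assumes "change" is the absolute difference between successive
    criterion values and that the low-change rounds must be consecutive.
    """
    streak = 0
    for i in range(1, len(criteria)):
        if abs(criteria[i] - criteria[i - 1]) < tol:
            streak += 1
        else:
            streak = 0  # a large change resets the count
        if streak >= patience:
            return i + 1  # criteria[i] belongs to round i + 1
    return None


history = [0.5, 0.2, 0.1, 0.10001, 0.10002, 0.10001, 0.1]
print(stopping_round(history))
```

With a larger tol the loop stops sooner at the cost of a less settled imputation; with a very small tol it will typically run for all n_iter iterations.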

Requirements

MissEnsemble requires Python 3.11+ and the following packages:

  • numpy
  • pandas
  • scikit-learn
  • xgboost
  • seaborn
  • matplotlib

pip installs these dependencies automatically when you install the package.

Parameter specification of MissEnsemble

Supported Ensemble Methods

You can select the ensemble method using the ens_method parameter:

  • ens_method='forest' for Random Forests (default)
  • ens_method='xgb' for XGBoost

Error Handling

  • Each column must be assigned to exactly one variable type: categorical, ordinal, or numerical.
  • If a column is assigned to multiple types or omitted, MissEnsemble will raise an error.
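The rule above can be checked before constructing the estimator. The helper below is a hypothetical pre-check, not part of MissEnsemble; the exception type the library itself raises is not documented here, so ValueError is an assumption. It also flags names that do not exist in the data, which goes slightly beyond the rule stated above.

```python
from collections import Counter


def check_variable_assignment(columns, categorical_vars, ordinal_vars, numerical_vars):
    """Hypothetical pre-check; raises ValueError on a bad assignment."""
    counts = Counter(categorical_vars + ordinal_vars + numerical_vars)
    duplicated = sorted(c for c, n in counts.items() if n > 1)
    omitted = sorted(c for c in columns if c not in counts)
    unknown = sorted(c for c in counts if c not in columns)
    if duplicated or omitted or unknown:
        raise ValueError(
            f"duplicated={duplicated}, omitted={omitted}, unknown={unknown}"
        )


cols = ["col1", "col2", "col3", "col4", "col5"]

# A valid assignment passes silently
check_variable_assignment(cols, ["col1", "col2"], ["col3"], ["col4", "col5"])

# col3 is listed twice, col2 and col5 are missing
try:
    check_variable_assignment(cols, ["col1", "col3"], ["col3"], ["col4"])
except ValueError as err:
    message = str(err)
    print(message)
```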

API Reference

The MissEnsemble class follows the scikit-learn estimator API. Public methods:

  • fit(X): Fit the imputer to the data.
  • transform(X): Impute missing values in new data.
  • fit_transform(X): Fit and transform in one step.
  • plot_criteria(plot_final=False): Visualize convergence criteria.
  • check_imputation_fit(var_name, true_values, error_type, plot_type): Visualize and assess imputation quality.

Visualization Methods

MissEnsemble can plot convergence and imputation checks (the latter only if true values are available).

Convergence Criteria

After fitting, use the plot_criteria method to show how the stopping criteria decrease over the iterations:

estimator.plot_criteria(plot_final=False)

which results in the following plot:

[Figure: imputation criteria]

Imputation check

The check_imputation_fit method plots how far the imputed values diverge from the true values. The following code checks the imputation of the 'mean texture' variable (see the example.ipynb notebook):

estimator.check_imputation_fit(
    var_name='mean texture',
    true_values=data.loc[:, 'mean texture'],
    error_type='std_diff',
    plot_type='hist'
)

which results in the following plot:

[Figure: imputation check]

Other divergence measures and plot types can be selected through the error_type and plot_type arguments.
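The same kind of check can be done by hand when a numeric summary is preferred over a plot. The sketch below uses only pandas and NumPy on synthetic data; the mean fill stands in for a real imputation, and RMSE is used as an example measure because the exact definition of error_type='std_diff' is not given here.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
true_values = pd.Series(rng.normal(loc=20, scale=4, size=100), name="mean texture")

# Hide 30 values, then fill them with a stand-in imputation (the observed mean)
masked = true_values.copy()
hidden = masked.sample(30, random_state=0).index
masked.loc[hidden] = np.nan
imputed = masked.fillna(masked.mean())

# Compare imputed and true values only where values were hidden
diff = imputed.loc[hidden] - true_values.loc[hidden]
rmse = float(np.sqrt((diff ** 2).mean()))
print(f"RMSE on the {len(hidden)} hidden values: {rmse:.3f}")
```

Restricting the comparison to the hidden positions matters: observed values are copied through unchanged, so including them would make any imputer look better than it is.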

Contact

For questions or support, please open an issue on GitHub.

Literature

Stekhoven, D. J., & Bühlmann, P. (2012). MissForest—non-parametric missing value imputation for mixed-type data. Bioinformatics, 28(1), 112-118.
