A package for missing data imputation
MissEnsemble
MissEnsemble is a generalization of the popular MissForest algorithm (Stekhoven & Bühlmann, 2012) for missing value imputation. It extends MissForest by supporting multiple ensemble methods and provides a scikit-learn compatible API. Currently supported ensemble methods:
- Random Forests
- XGBoost
MissEnsemble natively handles different types of input values (e.g., strings and numbers). You only need to specify which column names belong to which variable type (numerical, categorical, or ordinal).
In addition, MissEnsemble provides built-in visualization functions for convergence and imputation validation (when true values are available).
Setup
Install from PyPI:
pip install missensemble
Usage Example
Assign every column in your DataFrame to exactly one type: categorical, ordinal, or numerical. This ensures the imputation method treats each variable appropriately. The following example uses a DataFrame with five variables:
import numpy as np
import pandas as pd
from missensemble import MissEnsemble
# Create example dataframe (100 x 5)
data = pd.DataFrame({
    "col1": np.random.choice(['A', 'B', 'C'], size=100),
    "col2": np.random.choice(['X', 'Y'], size=100),
    "col3": np.random.randint(1, 5, size=100),
    "col4": np.random.randn(100),
    "col5": np.random.randn(100),
})
# Create NAs for col1 and col4
for col in ['col1', 'col4']:
    # Pick 30 rows at random and set this column to missing in them
    na_index = data.sample(30).index
    data.loc[na_index, col] = np.nan
# Initialize the MissEnsemble class
estimator = MissEnsemble(
    categorical_vars=['col1', 'col2'],
    ordinal_vars=['col3'],
    numerical_vars=['col4', 'col5'],
)
# Fit and transform the data
imputed_data = estimator.fit_transform(data)
For an extended usage example, see the example.ipynb notebook.
Parameters
The MissEnsemble class accepts the following parameters:
- n_iter (int): Number of iterations to perform for imputation.
- categorical_vars (list of str): List of column names representing categorical variables.
- ordinal_vars (list of str): List of column names representing ordinal variables.
- numerical_vars (list of str): List of column names representing numerical variables.
- ens_method (str, optional): Ensemble method to use for imputation. Default is 'forest'. 'xgb' is also supported.
- n_estimators (int, optional): Number of estimators to use in the ensemble method. Default is 100.
- tol (float, optional): Tolerance for convergence. Default is 1e-4.
- random_state (int, optional): Random state for reproducibility. Default is 42.
- print_criteria (bool, optional): Whether to print the imputation criteria during fitting. Default is False.
If the change in the convergence criterion is lower than tol for three rounds, the algorithm terminates early.
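The early-stopping rule can be illustrated with a minimal sketch. This is not MissEnsemble's internal code: the helper name should_stop, its patience argument, and the assumption that the three rounds must be consecutive are all hypothetical and used here only for illustration.

```python
def should_stop(criteria, tol=1e-4, patience=3):
    """Hypothetical helper: True once the criterion changed by less than
    `tol` in each of the last `patience` rounds (assumed consecutive)."""
    if len(criteria) < patience + 1:
        # Not enough rounds yet to observe `patience` changes.
        return False
    recent = criteria[-(patience + 1):]
    changes = [abs(b - a) for a, b in zip(recent, recent[1:])]
    return all(change < tol for change in changes)


# Made-up criterion values: large drops first, then tiny changes.
history = [0.90, 0.40, 0.15, 0.10, 0.10004, 0.10003, 0.10002]

# First round (1-based) at which the rule would trigger.
stop_round = next(
    i for i in range(1, len(history) + 1) if should_stop(history[:i])
)
print(stop_round)
```

With n_iter acting as an upper bound, a rule of this kind lets fitting finish before all iterations are used once the criterion has flattened out.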
Requirements
MissEnsemble requires Python 3.11+ and the following packages:
- numpy
- pandas
- scikit-learn
- xgboost
- seaborn
- matplotlib
pip installs these dependencies automatically together with the package.
Parameter specification of MissEnsemble
Supported Ensemble Methods
You can select the ensemble method using the ens_method parameter:
- ens_method='forest' for Random Forests (default)
- ens_method='xgb' for XGBoost
Error Handling
- Each column must be assigned to exactly one variable type: categorical, ordinal, or numerical.
- If a column is assigned to multiple types or omitted, MissEnsemble will raise an error.
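To catch assignment problems before constructing the estimator, a check along the following lines may help. It is a sketch under assumptions: check_column_assignment is a hypothetical helper, not part of MissEnsemble, and the exception type and message the library itself raises may differ.

```python
from collections import Counter


def check_column_assignment(columns, categorical_vars, ordinal_vars, numerical_vars):
    """Hypothetical pre-check: every column must appear in exactly one list."""
    columns = list(columns)
    assigned = Counter([*categorical_vars, *ordinal_vars, *numerical_vars])

    duplicated = sorted(c for c, n in assigned.items() if n > 1)
    omitted = sorted(c for c in columns if c not in assigned)
    unknown = sorted(c for c in assigned if c not in columns)

    problems = []
    if duplicated:
        problems.append(f"assigned more than once: {duplicated}")
    if omitted:
        problems.append(f"not assigned to any type: {omitted}")
    if unknown:
        problems.append(f"not present in the data: {unknown}")
    if problems:
        raise ValueError("; ".join(problems))


# Passes silently: each of the five columns has exactly one type.
check_column_assignment(
    ["col1", "col2", "col3", "col4", "col5"],
    categorical_vars=["col1", "col2"],
    ordinal_vars=["col3"],
    numerical_vars=["col4", "col5"],
)
```

Passing data.columns as the first argument works as well, since the helper only iterates over the column names.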
API Reference
The MissEnsemble class follows the scikit-learn estimator API. Public methods:
- fit(X): Fit the imputer to the data.
- transform(X): Impute missing values in new data.
- fit_transform(X): Fit and transform in one step.
- plot_criteria(plot_final=False): Visualize convergence criteria.
- check_imputation_fit(var_name, true_values, error_type, plot_type): Visualize and assess imputation quality.
Visualization Methods
MissEnsemble offers visualization functionalities for convergence and imputation checks (the latter only if true values are available).
Convergence Criteria
After fitting, use the plot_criteria method to show the minimization path of the stopping criteria:
estimator.plot_criteria(plot_final=False)
[Figure: minimization path of the stopping criteria]
Imputation check
The check_imputation_fit method plots the divergence between the imputed values and the true values. The following code checks the imputation of the mean texture variable (see the example.ipynb notebook):
estimator.check_imputation_fit(
    var_name='mean texture',
    true_values=data.loc[:, 'mean texture'],
    error_type='std_diff',
    plot_type='hist'
)
[Figure: imputation check plot for mean texture]
The method supports other divergence measures (error_type) and plot types (plot_type).
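Imputation checks need the true values, so a common setup is to mask known values on purpose and keep the complete data for comparison. The sketch below illustrates that idea with plain pandas and NumPy. The helpers mask_at_random and rmse_on_masked are hypothetical, the column-mean fill is only a stand-in for an imputer such as MissEnsemble, and RMSE is used as a generic error measure that is not necessarily what error_type='std_diff' computes.

```python
import numpy as np
import pandas as pd


def mask_at_random(df, column, n_missing, seed=0):
    """Return a copy of `df` with `n_missing` values of `column` set to NaN,
    plus the index labels of the masked rows (hypothetical helper)."""
    rng = np.random.default_rng(seed)
    masked_idx = rng.choice(df.index.to_numpy(), size=n_missing, replace=False)
    masked = df.copy()
    masked.loc[masked_idx, column] = np.nan
    return masked, masked_idx


def rmse_on_masked(imputed, truth, column, masked_idx):
    """Root mean squared error, computed only on the masked rows."""
    diff = imputed.loc[masked_idx, column] - truth.loc[masked_idx, column]
    return float(np.sqrt(np.mean(diff ** 2)))


# Complete data serves as the ground truth.
truth = pd.DataFrame({"col4": np.arange(10, dtype=float)})
masked, masked_idx = mask_at_random(truth, "col4", n_missing=3)

# Stand-in imputer (assumption): fill with the column mean.
imputed = masked.fillna(masked.mean())

print(rmse_on_masked(imputed, truth, "col4", masked_idx))
```

In practice the stand-in line would be replaced by the fitted imputer's output, and the retained true column could be passed as true_values to check_imputation_fit.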
Contact
For questions or support, please open an issue on GitHub.
Literature
Stekhoven, D. J., & Bühlmann, P. (2012). MissForest—non-parametric missing value imputation for mixed-type data. Bioinformatics, 28(1), 112-118.