Mixed-type MICE-style iterative imputer for scikit-learn — handles both numerical and categorical (string) columns in DataFrames.
Project description
MixedImputer
MixedImputer is a wrapper around scikit-learn's
IterativeImputer, which implements an iterative imputation strategy inspired by MICE (Multivariate Imputation by Chained Equations). It extends it to seamlessly handle DataFrames containing both numerical and categorical (string) columns.
It automatically detects column types, encodes categoricals with OrdinalEncoder, runs the MixedImputer (powered by sklearn's IterativeImputer) with a regressor or classifier chosen per column, and decodes categoricals back to their original string values.
Note:
IterativeImputeris an experimental feature in scikit-learn.MixedImputerhandles the requiredenable_iterative_imputerimport internally — just importMixedImputerand use it. If you are also importingIterativeImputerdirectly in your own code, make sure to importMixedImputerbeforeIterativeImputer, or explicitly runfrom sklearn.experimental import enable_iterative_imputerfirst.
Features
- Wrapper for IterativeImputer — builds on sklearn's experimental
IterativeImputer(inspired by MICE), inheriting its battle-tested convergence logic while adding mixed-type support - Mixed-type support — handles numeric and string categorical columns in the same DataFrame
- Binary & multiclass classification — categorical columns with 2, 3, or more classes are supported; the classifier automatically adapts to any number of unique categories
- Auto-detection — automatically identifies categorical columns by dtype (object/category/string)
- Iterative imputation (MICE-style) — models each column as a function of all others
- Posterior sampling — supports stochastic imputation via
sample_posterior=True - scikit-learn compatible —
fit/transform/fit_transformAPI, works inPipeline - DataFrame-native — input a DataFrame, get a DataFrame back
- Data corruption utility — includes a
DataCorrupterclass for benchmarking: introduce MCAR, MAR, or MNAR missing values into datasets (supports .csv, .xlsx, .arff)
Installation
pip install mixedimputer
Or for development:
git clone https://github.com/dnsupp/mixedimputer.git
cd mixedimputer
pip install -e ".[dev]"
Quick Start
import pandas as pd
import numpy as np
from mixedimputer import MixedImputer
# Create sample data with missing values (binary & multiclass categoricals)
data = pd.DataFrame({
'age': [25, 30, np.nan, 40],
'city': ['paris', 'london', np.nan, 'paris'],
'income': [50000, np.nan, 70000, 60000],
'gender': ['M', 'F', 'M', np.nan],
'education': ['bachelor', 'master', 'bachelor', np.nan], # multiclass (3+ categories)
})
# Auto-detect categorical columns (str, object, or category dtype)
# or specify them manually via categorical_features.
imputer = MixedImputer(
max_iter=5,
random_state=42,
)
imputed = imputer.fit_transform(data)
print(imputed)
# age city income gender education
# 0 25.000 paris 50000.0 M bachelor
# 1 30.000 london 57500.0 F master
# 2 32.500 london 70000.0 M bachelor
# 3 40.000 paris 60000.0 F bachelor
Using a custom regressor / classifier
from sklearn.linear_model import BayesianRidge
from sklearn.ensemble import RandomForestClassifier
imputer = MixedImputer(
regressor=BayesianRidge(),
classifier=RandomForestClassifier(random_state=0),
sample_posterior=True,
random_state=42,
)
imputed = imputer.fit_transform(data)
Parameters
| Parameter | Type | Default | Description |
|---|---|---|---|
categorical_features |
list of int/str or None | None |
Column indices or names that are categorical. If None and input is a DataFrame, columns with object, string, or category dtype are auto-detected. |
numeric_features |
list of int/str or None | None |
Numeric column indices or names. If None, auto-detected as all columns whose dtype passes pd.api.types.is_numeric_dtype and are not listed in categorical_features. |
regressor |
estimator or None | HistGradientBoostingRegressor() |
Regressor used for numerical target columns. Any sklearn regressor implementing .fit(X, y) and .predict(X) works (e.g. Ridge, RandomForestRegressor, BayesianRidge). For posterior sampling, a return_std-capable model (e.g. BayesianRidge) gives better uncertainty estimates, but other models fall back to unit standard deviation. |
classifier |
estimator or None | HistGradientBoostingClassifier() |
Classifier used for categorical target columns. Any sklearn classifier implementing .fit(X, y) and .predict(X) works (e.g. LogisticRegression, RandomForestClassifier, GaussianNB). For posterior sampling, the classifier must also implement .predict_proba(X) and expose .classes_ (most do — RidgeClassifier is a notable exception). |
max_iter |
int | 10 |
Maximum number of imputation rounds. |
tol |
float | 1e-3 |
Tolerance for early stopping. |
initial_strategy |
str | "mean" |
Initial imputation strategy ("mean", "median", "most_frequent", "constant"). |
sample_posterior |
bool | False |
If True, sample from predictive posterior for stochastic imputation. |
random_state |
int, RandomState or None | None |
Seed for reproducibility. |
verbose |
int | 0 |
Verbosity level. |
add_indicator |
bool | False |
If True, add missing indicator columns. |
keep_empty_features |
bool | False |
If True, keep features that are all-missing at fit time. |
How It Works
MixedImputer is a thin wrapper that orchestrates sklearn's IterativeImputer for mixed-type DataFrames:
- Column detection — object/string/category dtype columns are identified as categorical; the rest as numeric.
- Encoding — categorical columns are encoded to integers using
OrdinalEncoder, with NaN replaced by a sentinel value so the imputer sees them as truly missing. - Iterative imputation (MICE-style) — a customized
IterativeImputerusesHistGradientBoostingRegressorfor numeric columns andHistGradientBoostingClassifierfor categorical columns (binary or multiclass). The classifier handles any number of unique categories automatically. - Decoding — imputed integer values are inverse-transformed back to their original string categories.
Compatible Estimators
MixedImputer accepts any scikit-learn regressor or classifier — you are not
limited to the defaults. Tested and known to work:
| Category | Estimators |
|---|---|
| Regressors | LinearRegression, BayesianRidge, Ridge, Lasso, ElasticNet, HuberRegressor, SGDRegressor, HistGradientBoostingRegressor, RandomForestRegressor, GradientBoostingRegressor, ExtraTreesRegressor, DecisionTreeRegressor |
| Classifiers | HistGradientBoostingClassifier, RandomForestClassifier, GradientBoostingClassifier, ExtraTreesClassifier, LogisticRegression, RidgeClassifier, KNeighborsClassifier, DecisionTreeClassifier, GaussianNB |
Posterior sampling notes
- Regressors with
return_std(e.g.BayesianRidge) provide per-prediction uncertainty, yielding better posterior draws. Regressors without it fall back to unit standard deviation — stochastic imputation still works but assumes constant variance. - Classifiers with
predict_proba(almost all sklearn classifiers) draw from the predicted class distribution.RidgeClassifierlackspredict_probaand will use its plainpredictoutput instead.
Requirements
- Python ≥ 3.9
- numpy
- pandas
- scipy
- scikit-learn ≥ 1.0
Running Tests
pip install -e ".[dev]"
pytest
The test suite includes:
- Imputer unit tests — verify core functionality, edge cases, reproducibility, and pipeline compatibility.
- Corruption tests — verify the
DataCorrupterwith all three missing-data mechanisms (MCAR, MAR, MNAR) on synthetic and real datasets. - End-to-end benchmarks — corrupt real datasets (
titanic.csv,credit-g.arff) and evaluate imputation accuracy against naive baselines (mean/mode).
All tests pass on the included data/titanic.csv and data/credit-g.arff datasets.
Benchmarking with Corrupted Data
To evaluate imputation quality on real datasets, use the bundled DataCorrupter
to introduce controlled missing values before imputing:
from mixedimputer import DataCorrupter, MixedImputer
# Load and corrupt a dataset — 2 numeric + 2 nominal columns, 10% MCAR
corrupter = DataCorrupter(
mechanism="MCAR",
corruption_fraction=0.10,
num_numeric=2,
num_nominal=2,
random_state=42,
)
corrupted, mask, original = corrupter.corrupt("data/titanic.csv")
# corrupted : DataFrame with NaN at chosen positions
# mask : boolean DataFrame marking corrupted cells
# original : pristine copy for comparison
# Impute with the MixedImputer
imputer = MixedImputer(max_iter=10, random_state=42)
imputed = imputer.fit_transform(corrupted)
# Compare imputed values against ground truth
from sklearn.metrics import mean_squared_error
numeric_col = "Age"
rmse = mean_squared_error(
original.loc[mask[numeric_col], numeric_col],
imputed.loc[mask[numeric_col], numeric_col],
) ** 0.5
print(f"RMSE on corrupted Age values: {rmse:.2f}")
The DataCorrupter supports:
- Three mechanisms: MCAR (uniform random), MAR (dependent on other columns), MNAR (dependent on the column's own values)
- Column selection: by count (
num_numeric=3, num_nominal=2), by name (numeric_columns=["Age", "Fare"]), or random (num_random_columns=5) - Per-column fractions: e.g.
corruption_fraction={"Age": 0.05, "Fare": 0.20} - File formats:
.csv,.xlsx,.arff, or an existing DataFrame
For a complete pipeline example see examples/example.py.
Examples
See examples/examples.py for a runnable script that demonstrates all major
features including auto-detection, posterior sampling, custom estimators, array
input, pipeline integration, and edge cases.
Known Limitations
- Unseen categories at transform time are encoded to the
unknown_valueof theOrdinalEncoder. If the imputer assigns a value that decodes to an unknown category it will appear asNaNin the output. Ensure the training set covers the full vocabulary of each categorical column when possible. keep_empty_features=False(the default) drops columns that are entirelyNaNduringfit. The dropped columns are removed from the output DataFrame.
License
MIT — see LICENSE for details.
Links
Project details
Release history Release notifications | RSS feed
Download files
Download the file for your platform. If you're not sure which to choose, learn more about installing packages.
Source Distribution
Built Distribution
Filter files by name, interpreter, ABI, and platform.
If you're not sure about the file name format, learn more about wheel file names.
Copy a direct link to the current filters
File details
Details for the file mixed_imputer-0.1.0.post1.tar.gz.
File metadata
- Download URL: mixed_imputer-0.1.0.post1.tar.gz
- Upload date:
- Size: 32.2 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.13.7
File hashes
| Algorithm | Hash digest | |
|---|---|---|
| SHA256 |
b2b8340549fff7636dc89e2fe685f320e0c1e86d85726753e5fef4eabf4f33ad
|
|
| MD5 |
2e059d721548e73b62beb99abcd9bfa0
|
|
| BLAKE2b-256 |
89cf65c66d26e75eeef385a4db9434ab2c95f43d3de86e838208c6a217fb9580
|
File details
Details for the file mixed_imputer-0.1.0.post1-py3-none-any.whl.
File metadata
- Download URL: mixed_imputer-0.1.0.post1-py3-none-any.whl
- Upload date:
- Size: 20.6 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.13.7
File hashes
| Algorithm | Hash digest | |
|---|---|---|
| SHA256 |
32b51db6cc366bc1f825ae8a71a48844e935246dc984f3570e6ca79b1876913d
|
|
| MD5 |
cbc7dcfed83204061286eafbe8584f0d
|
|
| BLAKE2b-256 |
073dd331dab70bbbbee46db882cc472ce91dd353b052e1373a9587d97cec1937
|