Regularized mean imputation functions for imputing numerical data by grouped categoricals
Project description
Regularized Mean Imputer
Introduction
Handling missing data is a common challenge in data analysis and machine learning. Regularized mean imputation offers a technique to fill missing values using a regularized mean based on specific grouping columns.
This package provides one main utility for this purpose: a standalone function impute_column
which can be used for imputation on a pandas DataFrame. This function is meant to be used in a machine learning preprocessing pipeline to impute missing values in both the training and testing datasets with averages only from the train set, preventing information leakage.
How it Works
The imputation process works by grouping the data based on the specified columns and computing a regularized mean for each group. This regularized mean is a weighted average of the group mean and the global mean, adjusted by a regularization parameter. The regularization parameter is tuned using cross-validation.
Installation
pip install regmean-imputer
Usage
Imputation using impute_column
with separate train and test data:
from regmean_imputer import impute_column
# Sample train data
train_data = {
'Age': [25, None, 30, 35, 40],
'Title': ['Mr', 'Mrs', 'Mr', 'Miss', 'Miss'],
'Pclass': [1, 2, 1, 3, 3]
}
# Sample test data
test_data = {
'Age': [None, 45],
'Title': ['Mrs', 'Mr'],
'Pclass': [2, 1]
}
train_df = pd.DataFrame(data=train_data)
test_df = pd.DataFrame(data=test_data)
# Impute the 'Age' column using 'Title' and 'Pclass' as group by columns
imputed_train_data, imputed_test_data = impute_column(train_data=train_df, test_data=test_df, impute_col='Age', group_by_cols=['Title', 'Pclass'])
print(imputed_train_data)
print(imputed_test_data)
This approach of separating the train and test data before imputation is crucial in a machine learning preprocessing pipeline to prevent information leakage. Information leakage happens when information that would not be available at prediction time is used when building the model. This can lead to overly optimistic performance estimates. For example, if we impute missing values in the entire dataset using the mean of a column, the mean is influenced by the test set values, which wouldn't be available at prediction time in a real-world scenario.
Parameters
train_data
(pd.DataFrame): The training dataset.test_data
(pd.DataFrame): The testing dataset.impute_col
(str): Column to be imputed.group_by_cols
(list): Columns used for grouping to compute the regularized mean.m_values
(list, optional): List of regularization parameters to be tested for optimal performance. Default is [1,2,3,4,5,6,7,8,9,10].n_splits
(int, optional): Number of splits for cross-validation during regularization evaluation. Default is 5.verbose
(bool, optional): Whether to print the best regularization parameter. Default is False.random_state
(int, optional): Random state for cross-validation. Default is None.
Conclusion
Regularized mean imputation provides an efficient way to handle missing data, especially when certain columns can provide context on how the imputation should be done. The provided utilities in this package make it easier to apply this method.
Project details
Release history Release notifications | RSS feed
Download files
Download the file for your platform. If you're not sure which to choose, learn more about installing packages.
Source Distribution
Built Distribution
File details
Details for the file regmean_imputer-0.1.0.tar.gz
.
File metadata
- Download URL: regmean_imputer-0.1.0.tar.gz
- Upload date:
- Size: 4.0 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: poetry/1.6.0 CPython/3.11.5 Windows/10
File hashes
Algorithm | Hash digest | |
---|---|---|
SHA256 | 3e129e421367329b570769e0fd667c39ba2d0b94967aa61895bdaab63644f84c |
|
MD5 | a75de8057465c2bdb3b37449792e9fcd |
|
BLAKE2b-256 | 405f9d86cf03180b7686eca5cce7a01bdad8e57c722b2ba44f25fa06c550b5a0 |
File details
Details for the file regmean_imputer-0.1.0-py3-none-any.whl
.
File metadata
- Download URL: regmean_imputer-0.1.0-py3-none-any.whl
- Upload date:
- Size: 4.5 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: poetry/1.6.0 CPython/3.11.5 Windows/10
File hashes
Algorithm | Hash digest | |
---|---|---|
SHA256 | b82ed76d0dc7f594b15e2eeb14d94bf83efe5b2d5230d1b780c1125ee1284a91 |
|
MD5 | f9a63355e6aeb663afe0ff4b56917c66 |
|
BLAKE2b-256 | 688cb3ae6899183be41ee7951bede838d5661048de86661834a9446e1ea95906 |