Regularized mean imputation functions for imputing numerical data by grouped categoricals

Regularized Mean Imputer

Introduction

Handling missing data is a common challenge in data analysis and machine learning. Regularized mean imputation offers a technique to fill missing values using a regularized mean based on specific grouping columns.

This package provides one main utility: the standalone function impute_column, which imputes a column of a pandas DataFrame. It is meant for machine learning preprocessing pipelines: missing values in both the training and testing datasets are filled using averages computed from the training set only, which prevents information leakage.

How it Works

The imputation process works by grouping the data based on the specified columns and computing a regularized mean for each group. This regularized mean is a weighted average of the group mean and the global mean, adjusted by a regularization parameter. The regularization parameter is tuned using cross-validation.
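
The weighting described above can be sketched in a few lines of pandas. This is a minimal illustration and not the package's actual code: it assumes the common shrinkage form (n * group_mean + m * global_mean) / (n + m), where n is the number of observed values in the group and m > 0 is the regularization parameter. The helper name regularized_group_means is hypothetical.

```python
import pandas as pd

def regularized_group_means(df, value_col, group_cols, m):
    """Shrink each group's mean toward the global mean (assumed formula, m > 0)."""
    global_mean = df[value_col].mean()  # NaN values are skipped by default
    # "sum" and "count" both ignore NaN; an all-NaN group has sum 0 and count 0
    stats = df.groupby(group_cols)[value_col].agg(["sum", "count"])
    # Equivalent to (n * group_mean + m * global_mean) / (n + m)
    return (stats["sum"] + m * global_mean) / (stats["count"] + m)

df = pd.DataFrame({
    "Title": ["Mr", "Mr", "Mrs", "Miss", "Miss"],
    "Age": [25, 30, None, 35, 40],
})
means = regularized_group_means(df, "Age", ["Title"], m=2)
print(means)
```

Under this assumed formula, a group with many observed values stays close to its own mean, while a small or empty group is pulled toward the global mean. In the toy data, 'Mrs' has no observed ages, so its regularized mean equals the global mean.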

Installation

pip install regmean-imputer

Usage

Imputation using impute_column with separate train and test data:

import pandas as pd
from regmean_imputer import impute_column

# Sample train data
train_data = {
    'Age': [25, None, 30, 35, 40],
    'Title': ['Mr', 'Mrs', 'Mr', 'Miss', 'Miss'],
    'Pclass': [1, 2, 1, 3, 3]
}

# Sample test data
test_data = {
    'Age': [None, 45],
    'Title': ['Mrs', 'Mr'],
    'Pclass': [2, 1]
}

train_df = pd.DataFrame(data=train_data)
test_df = pd.DataFrame(data=test_data)

# Impute the 'Age' column using 'Title' and 'Pclass' as group-by columns
imputed_train_data, imputed_test_data = impute_column(
    train_data=train_df,
    test_data=test_df,
    impute_col='Age',
    group_by_cols=['Title', 'Pclass'],
)
print(imputed_train_data)
print(imputed_test_data)

This approach of separating the train and test data before imputation is crucial in a machine learning preprocessing pipeline to prevent information leakage. Information leakage happens when information that would not be available at prediction time is used when building the model. This can lead to overly optimistic performance estimates. For example, if we impute missing values in the entire dataset using the mean of a column, the mean is influenced by the test set values, which wouldn't be available at prediction time in a real-world scenario.
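
The difference between a leaky and a leak-free fill value can be seen with a plain (unregularized) mean. The sketch below uses only pandas and does not call this package; the variable names are illustrative.

```python
import pandas as pd

train = pd.DataFrame({"Age": [25.0, None, 30.0, 35.0, 40.0]})
test = pd.DataFrame({"Age": [None, 45.0]})

# Leaky: the fill value is computed from train and test rows together
leaky_mean = pd.concat([train, test])["Age"].mean()

# Leak-free: the fill value is computed from the training rows only
train_mean = train["Age"].mean()

# The same train-only statistic fills the gaps in both datasets
train_filled = train.fillna({"Age": train_mean})
test_filled = test.fillna({"Age": train_mean})

print(leaky_mean, train_mean)
```

Here the test row with Age 45 shifts the combined mean away from the train-only mean, which is exactly the information that would not be available at prediction time.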

Parameters

  • train_data (pd.DataFrame): The training dataset.
  • test_data (pd.DataFrame): The testing dataset.
  • impute_col (str): Column to be imputed.
  • group_by_cols (list): Columns used for grouping to compute the regularized mean.
  • m_values (list, optional): List of regularization parameters to be tested for optimal performance. Default is [1,2,3,4,5,6,7,8,9,10].
  • n_splits (int, optional): Number of splits for cross-validation during regularization evaluation. Default is 5.
  • verbose (bool, optional): Whether to print the best regularization parameter. Default is False.
  • random_state (int, optional): Random state for cross-validation. Default is None.
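
One way that m_values, n_splits and random_state could interact is sketched below. This is an assumption-laden illustration, not the package's implementation: the source does not state the scoring metric, so mean squared error on held-out observed values is assumed, the shrinkage formula (n * group_mean + m * global_mean) / (n + m) is assumed, and choose_m is a hypothetical helper.

```python
import numpy as np
import pandas as pd

def choose_m(df, value_col, group_cols, m_values, n_splits=5, random_state=None):
    """Return the m with the lowest cross-validated squared error (hypothetical)."""
    # Only rows with an observed value can be scored
    observed = df.dropna(subset=[value_col]).reset_index(drop=True)
    rng = np.random.default_rng(random_state)
    folds = np.array_split(rng.permutation(len(observed)), n_splits)
    scores = {}
    for m in m_values:
        errors = []
        for held_out in folds:
            fit = observed.drop(index=held_out)
            val = observed.loc[held_out]
            global_mean = fit[value_col].mean()
            stats = fit.groupby(group_cols)[value_col].agg(["sum", "count"])
            reg_mean = (stats["sum"] + m * global_mean) / (stats["count"] + m)
            reg_mean = reg_mean.rename("pred").reset_index()
            # Look up each held-out row's group; unseen groups get the global mean
            pred = val.merge(reg_mean, on=group_cols, how="left")["pred"]
            pred = pred.fillna(global_mean)
            diff = val[value_col].to_numpy() - pred.to_numpy()
            errors.append((diff ** 2).mean())
        scores[m] = float(np.mean(errors))
    return min(scores, key=scores.get), scores

df = pd.DataFrame({
    "Title": ["Mr", "Mrs"] * 6,
    "Age": [30, 28, 35, 27, None, 31, 40, 26, 33, 29, 38, None],
})
best_m, scores = choose_m(df, "Age", ["Title"], m_values=[1, 5, 10],
                          n_splits=3, random_state=0)
print(best_m, scores)
```

The sketch assumes at least n_splits observed rows. Fixing random_state makes the fold assignment, and therefore the selected value, reproducible.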

Conclusion

Regularized mean imputation provides an efficient way to handle missing data, especially when other columns give context for what the missing value is likely to be. The impute_column function in this package makes the method easy to apply.
