A Python package for efficient scaling and outlier handling of pandas DataFrames using some of the most popular outlier elimination approaches.

Project description

IdentifyOutliers

IdentifyOutliers preprocesses pandas DataFrames by performing data normalization and outlier handling in one step.

Features

  • Data Scaling: Uses the standard (z-score), min-max, and robust scaling methods for data normalization.
  • Outlier Detection: Provides an option to set thresholds for outlier detection.
  • Multiple Outputs: Returns four DataFrames: the original data without outliers, the scaled data without outliers, the detected outliers, and the scaled outliers.

Installation

Install the package using pip:

pip install IdentifyOutliers

Usage

import pandas as pd
from IdentifyOutliers.CustomZscoreScaler import CustomZscoreScaler
from IdentifyOutliers.CustomMinMaxScaler import CustomMinMaxScaler
from IdentifyOutliers.CustomRobustScaler import CustomRobustScaler
from IdentifyOutliers.CustomIQRScaler import CustomIQRScaler

# Sample DataFrame
data = {
    'A': [1, 2, 3, 100, 5],
    'B': [5, 6, 7, 8, 500]
}
df = pd.DataFrame(data)

# Initialize the scalers. The arguments shown are the default values.
scaler_czs = CustomZscoreScaler(threshold=3.0)
scaler_cms = CustomMinMaxScaler(lower_bound=0.05, upper_bound=0.95)
scaler_crs = CustomRobustScaler(threshold=3.5, mad_multiplier=0.6745)
scaler_cis = CustomIQRScaler(lower_bound=1.5, upper_bound=1.5)


# Transform the data
df_no_outliers_czs, df_scaled_no_outliers_czs, df_outliers_czs, df_scaled_outliers_czs = scaler_czs.transform(df)
df_no_outliers_cms, df_scaled_no_outliers_cms, df_outliers_cms, df_scaled_outliers_cms = scaler_cms.transform(df)
df_no_outliers_crs, df_scaled_no_outliers_crs, df_outliers_crs, df_scaled_outliers_crs = scaler_crs.transform(df)
df_no_outliers_cis, df_scaled_no_outliers_cis, df_outliers_cis, df_scaled_outliers_cis = scaler_cis.transform(df)


# Print the results for CustomZscoreScaler
print(df_no_outliers_czs)
#    A  B
# 0  1  5
# 1  2  6
# 2  3  7

print(df_scaled_no_outliers_czs)
#           A         B
# 0 -0.544672 -0.507592
# 1 -0.518980 -0.502526
# 2 -0.493288 -0.497461

print(df_outliers_czs)
#      A    B
# 3  100    8
# 4    5  500

print(df_scaled_outliers_czs)
#           A         B
# 3  1.998845 -0.492395
# 4 -0.441904  1.999974

Parameters

CustomZscoreScaler:
  • threshold: The z-score threshold for outlier detection. Data points exceeding threshold standard deviations away from the mean are considered outliers. The default value is 3.0.
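
The z-score rule described above can be sketched in plain pandas. This is an illustration, not the package's implementation: the helper name `split_by_zscore` is hypothetical, and the population standard deviation (`ddof=0`) and the "flag a row if any column exceeds the threshold" rule are assumptions. The sketch passes `threshold=1.5` rather than the default 3.0 because, in a five-row column, a population z-score cannot exceed 2 in magnitude.

```python
import pandas as pd


def split_by_zscore(df: pd.DataFrame, threshold: float = 3.0):
    """Hypothetical helper: split rows into inliers and outliers by z-score."""
    # Column-wise z-scores using the population standard deviation (assumption).
    z = (df - df.mean()) / df.std(ddof=0)
    # A row is flagged when any of its columns exceeds the threshold (assumption).
    is_outlier = (z.abs() > threshold).any(axis=1)
    return df[~is_outlier], z[~is_outlier], df[is_outlier], z[is_outlier]


df = pd.DataFrame({"A": [1, 2, 3, 100, 5], "B": [5, 6, 7, 8, 500]})
kept, kept_scaled, outliers, outliers_scaled = split_by_zscore(df, threshold=1.5)
print(outliers)  # rows 3 and 4
```

The four return values mirror the order used in the Usage section: data without outliers, scaled data without outliers, outliers, and scaled outliers.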
CustomMinMaxScaler:
  • lower_bound: The lower bound for outlier detection. Data points below the lower bound are considered outliers. The default value is 0.05.
  • upper_bound: The upper bound for outlier detection. Data points above the upper bound are considered outliers. The default value is 0.95.
CustomRobustScaler:
  • threshold: The z-score threshold for outlier detection. Data points exceeding threshold standard deviations away from the mean are considered outliers. The default value is 3.5.
  • mad_multiplier: The MAD multiplier for outlier detection. Data points exceeding the MAD multiplied by the threshold are considered outliers. The default value is 0.6745.
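
The defaults 3.5 and 0.6745 match the commonly used modified z-score, which is built on the median and the median absolute deviation (MAD). The sketch below assumes the two parameters combine in that way; the source does not confirm this. The helper name `split_by_modified_zscore` is hypothetical, and this is not the package's implementation.

```python
import pandas as pd


def split_by_modified_zscore(df: pd.DataFrame,
                             threshold: float = 3.5,
                             mad_multiplier: float = 0.6745):
    """Hypothetical helper: split rows using a MAD-based modified z-score."""
    median = df.median()
    # Median absolute deviation per column.
    mad = (df - median).abs().median()
    # Modified z-score: multiplier * (x - median) / MAD (assumed formula).
    score = mad_multiplier * (df - median) / mad
    is_outlier = (score.abs() > threshold).any(axis=1)
    return df[~is_outlier], score[~is_outlier], df[is_outlier], score[is_outlier]


df = pd.DataFrame({"A": [1, 2, 3, 100, 5], "B": [5, 6, 7, 8, 500]})
kept, kept_score, outliers, outliers_score = split_by_modified_zscore(df)
print(outliers)  # rows 3 and 4
```

Because the median and MAD are barely affected by extreme values, this rule flags 100 and 500 even in a five-row sample. A column whose MAD is zero would need separate handling, which the sketch omits.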
CustomIQRScaler:
  • lower_bound: Multiplier applied to the interquartile range (IQR) to set the lower bound. The default value is 1.5, giving a lower bound of Q1 - 1.5*IQR.
  • upper_bound: Multiplier applied to the interquartile range (IQR) to set the upper bound. The default value is 1.5, giving an upper bound of Q3 + 1.5*IQR.
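
The IQR fences described above can be sketched with pandas quantiles. The helper name `iqr_outlier_mask` is hypothetical, and flagging a row when any column falls outside its fences is an assumption. The sketch covers detection only; how the package scales the data for this class is not stated in the source.

```python
import pandas as pd


def iqr_outlier_mask(df: pd.DataFrame,
                     lower_bound: float = 1.5,
                     upper_bound: float = 1.5) -> pd.Series:
    """Hypothetical helper: True for rows with any value outside the IQR fences."""
    q1 = df.quantile(0.25)
    q3 = df.quantile(0.75)
    iqr = q3 - q1
    low = q1 - lower_bound * iqr   # Q1 - 1.5*IQR by default
    high = q3 + upper_bound * iqr  # Q3 + 1.5*IQR by default
    # Compare each column against its own fences.
    outside = df.lt(low, axis="columns") | df.gt(high, axis="columns")
    return outside.any(axis=1)


df = pd.DataFrame({"A": [1, 2, 3, 100, 5], "B": [5, 6, 7, 8, 500]})
mask = iqr_outlier_mask(df)
print(df[mask])  # rows 3 and 4
```

For column A the quartiles are 2 and 5, so the fences are -2.5 and 9.5; for column B they are 6 and 8, giving fences of 3 and 11.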

Contributions

Contributions are welcome! Please create an issue or submit a pull request.

License

This project is licensed under the [MIT License](https://github.com/amithpdn/IdentifyOutliers/blob/master/LICENSE.TXT).
