Automatic outlier detection and handling for Python.

autooutlier

Automatic Outlier Detection and Handling for Python

Python 3.8+ · License: MIT


Overview

autooutlier is a Python package that detects, analyzes, and handles outliers in numerical data. It chooses the detection and handling methods from the data's distribution (for example, its skewness), so no configuration is required.

Simply pass your DataFrame and column name, and autooutlier handles the rest.


Features

  • Automatic Detection — Selects the optimal outlier detection method (Z-Score, Modified Z-Score, IQR, Percentile) based on data skewness.
  • Automatic Handling — Chooses the best outlier replacement strategy (winsorization, mean/median/mode replacement, interpolation, etc.) based on data characteristics.
  • Statistical Analysis — Provides mean, median, mode, standard deviation, variance, skewness, kurtosis, and distribution classification.
  • Pre-Cleaning Summary — Generates a comprehensive report before cleaning, including detection method, handling strategy, outlier count, and percentage.
  • Post-Cleaning Report — Returns both the cleaned dataset and an after-cleaning summary report.
  • Flexible Manual Control — Override automatic selections with manual detection and handling methods when needed.
  • Visualization — Built-in box plot support via Seaborn.
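
To illustrate what skewness-based selection can look like, here is a minimal sketch that uses only the standard library. The function names and the 0.5 threshold are assumptions made for this example; they are not taken from the autooutlier source, which may use different rules.

```python
import statistics

def skewness_sketch(values):
    """Population skewness g1 = m3 / m2**1.5 (hypothetical helper)."""
    n = len(values)
    mean = statistics.fmean(values)
    m2 = sum((x - mean) ** 2 for x in values) / n
    m3 = sum((x - mean) ** 3 for x in values) / n
    if m2 == 0:
        return 0.0  # constant data has no skew
    return m3 / m2 ** 1.5

def suggest_method_sketch(values, threshold=0.5):
    """Assumed rule: near-symmetric data -> z-score, skewed data -> IQR."""
    if abs(skewness_sketch(values)) < threshold:
        return "z_score"
    return "Iqr_method"

symmetric = [1, 2, 3, 4, 5]
skewed = [10, 12, 14, 11, 13, 100, 15, 12, 14, 11]

print(suggest_method_sketch(symmetric))  # z_score
print(suggest_method_sketch(skewed))     # Iqr_method
```

The single large value in the second list pulls the skewness well above the threshold, which is why a rank-based rule such as IQR is preferred there.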

Installation

pip install autooutlier

Or install from source:

git clone https://github.com/suruthika-cd/autooutlier.git
cd autooutlier
pip install -e .

Dependencies

  • Python >= 3.8
  • NumPy >= 1.21.0
  • Pandas >= 1.3.0
  • SciPy >= 1.7.0
  • Seaborn >= 0.11.0
  • Matplotlib >= 3.4.0

Quick Start

import pandas as pd
from autooutlier import handle_outliers, before_cleaning_summary

# Load your data
df = pd.DataFrame({"values": [10, 12, 14, 11, 13, 100, 15, 12, 14, 11]})

# Get a pre-cleaning summary report
summary = before_cleaning_summary(df, "values")
print(summary)

# Automatically detect and handle outliers
cleaned_data, report = handle_outliers(df, "values")
print(report)
print(cleaned_data)
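
To see why 100 stands out in the sample data, the following standard-library sketch applies the common 1.5 × IQR rule by hand. It is an illustration of the technique only; the quartile interpolation and the 1.5 multiplier are assumptions and may differ from what autooutlier uses internally.

```python
import statistics

values = [10, 12, 14, 11, 13, 100, 15, 12, 14, 11]

# Quartiles with linear interpolation between data points.
q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
iqr = q3 - q1

# Anything outside the fences is treated as an outlier.
lower = q1 - 1.5 * iqr
upper = q3 + 1.5 * iqr
outliers = [v for v in values if v < lower or v > upper]

print(q1, q3)        # 11.25 14.0
print(lower, upper)  # 7.125 18.125
print(outliers)      # [100]
```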

Usage Examples

Automatic Outlier Detection

from autooutlier import detect_outliers

outlier_mask = detect_outliers(df, "column_name")
print(f"Outliers found: {outlier_mask.sum()}")

Automatic Outlier Handling

from autooutlier import handle_outliers

# Fully automatic — detection and handling methods are chosen for you
cleaned_df, report = handle_outliers(df, "column_name")

Manual Detection Method

# Use a specific detection method
cleaned_df, report = handle_outliers(df, "column_name", detection_method="z_score")

Available detection methods: 'auto', 'Iqr_method', 'z_score', 'modified_z_score', 'percentile'
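
The choice of detection method matters on small samples. The sketch below compares a plain z-score with a modified z-score (median and MAD based) on ten values. The helper names and the thresholds of 3.0 and 3.5 are conventional choices assumed for this example, not values confirmed from the autooutlier source.

```python
import statistics

values = [10, 12, 14, 11, 13, 100, 15, 12, 14, 11]

def z_score_mask_sketch(values, threshold=3.0):
    """Flag points more than `threshold` sample standard deviations from the mean."""
    mean = statistics.fmean(values)
    std = statistics.stdev(values)
    return [abs(x - mean) / std > threshold for x in values]

def modified_z_score_mask_sketch(values, threshold=3.5):
    """Flag points using the median and the median absolute deviation (MAD)."""
    med = statistics.median(values)
    mad = statistics.median(abs(x - med) for x in values)
    if mad == 0:
        return [False] * len(values)  # scores are undefined when MAD is zero
    return [0.6745 * abs(x - med) / mad > threshold for x in values]

print(sum(z_score_mask_sketch(values)))           # 0
print(sum(modified_z_score_mask_sketch(values)))  # 1
```

The plain z-score misses 100 here because the outlier inflates both the mean and the standard deviation, leaving its own score just under 3. The median and MAD are barely affected, so the modified z-score flags it.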

Manual Handling Method

# Use a specific replacement strategy
cleaned_df, report = handle_outliers(df, "column_name", replacement="median")

Available replacement methods: 'auto', 'interpolate', 'winsorization', 'median', 'mean', 'mode', 'custom', 'remove', 'bfill', 'ffill'
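
As a rough picture of how three of these strategies differ, the sketch below applies median replacement, winsorization, and removal to a list with one flagged value. All helper names are hypothetical, and the details (median taken over non-outliers, clipping to the 1.5 × IQR fences) are assumptions for illustration rather than the package's documented behavior.

```python
import statistics

values = [10, 12, 14, 11, 13, 100, 15, 12, 14, 11]
mask = [v > 18.125 or v < 7.125 for v in values]  # 1.5 * IQR fences for this data

def replace_with_median_sketch(values, mask):
    """Replace flagged values with the median of the unflagged ones."""
    inliers = [v for v, flagged in zip(values, mask) if not flagged]
    med = statistics.median(inliers)
    return [med if flagged else v for v, flagged in zip(values, mask)]

def winsorize_sketch(values, lower, upper):
    """Clip every value into the closed interval [lower, upper]."""
    return [min(max(v, lower), upper) for v in values]

def remove_sketch(values, mask):
    """Drop flagged values, which shortens the data."""
    return [v for v, flagged in zip(values, mask) if not flagged]

print(replace_with_median_sketch(values, mask)[5])  # 12
print(winsorize_sketch(values, 7.125, 18.125)[5])   # 18.125
print(len(remove_sketch(values, mask)))             # 9
```

Replacement and winsorization keep the row count unchanged, which matters when other columns must stay aligned; removal does not.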

Custom Value Replacement

cleaned_df, report = handle_outliers(df, "column_name", replacement="custom", value=0)

Pre-Cleaning Summary

from autooutlier import before_cleaning_summary

summary = before_cleaning_summary(df, "column_name")
print(summary)

Output includes: suggested detection method, handling method, skewness, distribution type, outlier count, and outlier percentage.
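
The count and percentage fields follow directly from a boolean outlier mask. A minimal sketch, assuming a plain list of booleans and a hypothetical helper name (the real summary is returned as a DataFrame and contains more fields):

```python
def summarize_mask_sketch(mask):
    """Return the outlier count and percentage for a boolean mask."""
    count = sum(mask)
    percentage = 100.0 * count / len(mask) if mask else 0.0
    return {"outlier_count": count, "outlier_percentage": percentage}

mask = [False, False, False, False, False, True, False, False, False, False]
print(summarize_mask_sketch(mask))
# {'outlier_count': 1, 'outlier_percentage': 10.0}
```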


API Overview

Public API

Function | Description
handle_outliers(data, column, detection_method='auto', replacement='auto', value=None) | Detect and handle outliers. Returns (cleaned_data, report).
detect_outliers(data, column) | Detect outliers automatically. Returns a boolean mask.
detect_outlier_method(data, column) | Returns the suggested detection method name.
before_cleaning_summary(data, column) | Returns a DataFrame summary report before cleaning.

Module Reference

Module | Contents
autooutlier.statistics | mean, median, mode, std, var, data_range, q1, q3, iqr, skew, skew_measurment, is_normal, kurtosis, kurtosis_measurement
autooutlier.detection | Iqr_method, z_score_method, modified_z_score, percentile_method, detect_outlier_method, detect_outliers
autooutlier.handling | winsorization, interpolate, replace_with_mean, replace_with_median, replace_with_mode, replace_with_custom_value, replace_with_forward_fill, replace_with_backward_fill, remove_outliers, detect_handler, handle_outliers
autooutlier.summary | before_cleaning_summary
autooutlier.visualization | box_plot
autooutlier.utils | is_numeric, is_time_series, is_continous, outlier_count, outlier_percentage

License

This project is licensed under the MIT License — see the LICENSE file for details.


Changelog

See CHANGELOG.md for all notable changes.
