A Python package for detecting and removing outliers in data using various statistical methods and advanced distribution analysis

Maintainers

noir1112

Project description

OutlierCleaner

A Python package for detecting and removing outliers in data using various statistical methods and advanced distribution analysis.

Features

Type Safety: Comprehensive type hints for enhanced IDE support and code reliability
Automatic method selection based on data distribution
Multiple outlier detection methods:
- IQR (Interquartile Range)
- Z-score
- Modified Z-score (robust to non-normal distributions)
Advanced distribution analysis and method recommendations
Comprehensive visualization tools:
- Standalone plotting functions (scatter, distribution, box, Q-Q plots)
- Integrated analysis plots with 2x2 dashboard view
- Distribution visualization with KDE
- Box plots with outlier highlighting
- Q-Q plots for normality assessment
- Combined analysis dashboard
Progress tracking for batch operations
Index preservation options
Outlier tracking and statistics
Method comparison and agreement analysis
Robust handling of edge cases (zero MAD, constant columns)
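The IQR and Z-score rules listed above can be illustrated with plain pandas. This is a minimal sketch, not the package's implementation: the helper names `iqr_mask` and `zscore_mask`, the IQR factor of 1.5, and the Z-score threshold of 3 are assumptions chosen for illustration.

```python
import pandas as pd


def iqr_mask(values: pd.Series, factor: float = 1.5) -> pd.Series:
    """Flag points outside [Q1 - factor*IQR, Q3 + factor*IQR]."""
    q1, q3 = values.quantile(0.25), values.quantile(0.75)
    iqr = q3 - q1
    return (values < q1 - factor * iqr) | (values > q3 + factor * iqr)


def zscore_mask(values: pd.Series, threshold: float = 3.0) -> pd.Series:
    """Flag points whose absolute Z-score exceeds the threshold."""
    std = values.std(ddof=0)
    if std == 0:  # constant column: no point can deviate from the mean
        return pd.Series(False, index=values.index)
    return ((values - values.mean()) / std).abs() > threshold


sample = pd.Series([1, 2, 3, 100, 4, 5, 6])
iqr_flags = iqr_mask(sample)
z_flags = zscore_mask(sample)
print(sample[iqr_flags].tolist(), sample[z_flags].tolist())
```

On this seven-point sample the IQR rule flags 100, while the Z-score rule with a threshold of 3 flags nothing: the outlier inflates the standard deviation enough that its own Z-score stays below 3. This masking effect is one reason a robust rule can be preferable on small or skewed samples.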

Installation

pip install outlier-cleaner

Usage

Basic Usage

import pandas as pd
from outlier_cleaner import OutlierCleaner, plot_outliers, plot_distribution

# Create or load your DataFrame
df = pd.DataFrame({'column1': [1, 2, 3, 100, 4, 5, 6]})

# Using standalone visualization functions
# Boolean mask with one flag per row: True marks an outlier (here, the value 100)
outliers = [False, False, False, True, False, False, False]
plot_outliers(df['column1'], outliers)
plot_distribution(df['column1'], outliers)

# Using OutlierCleaner
cleaner = OutlierCleaner(df)

# Generate comprehensive analysis plots for all numeric columns
figures = cleaner.plot_outlier_analysis()

# Or analyze specific columns
figures = cleaner.plot_outlier_analysis(['column1'])

# Clean the data
cleaned_df, info = cleaner.clean_columns(['column1'], method='auto')

Advanced Example

Here's a comprehensive example using the California Housing dataset:

import pandas as pd
from sklearn.datasets import fetch_california_housing
from outlier_cleaner import OutlierCleaner, plot_outliers, plot_distribution

# Load California Housing dataset
housing = fetch_california_housing()
df = pd.DataFrame(housing.data, columns=housing.feature_names)
df['PRICE'] = housing.target

# Initialize cleaner with index preservation
cleaner = OutlierCleaner(df, preserve_index=True)

# Analyze distributions and get method recommendations
for column in ['MedInc', 'AveRooms', 'PRICE']:
    analysis = cleaner.analyze_distribution(column)
    print(f"\n{column} Analysis:")
    print(f"- Skewness: {analysis['skewness']:.2f}")
    print(f"- Recommended method: {analysis['recommended_method']}")

# Get outlier statistics
stats = cleaner.get_outlier_stats(['MedInc', 'AveRooms', 'PRICE'])
print(f"\nPotential outliers in MedInc: {stats.loc[stats['Column'] == 'MedInc', 'Potential Outliers'].values[0]}")

# Clean data with automatic method selection
cleaned_df, info = cleaner.clean_columns(
    columns=['MedInc', 'AveRooms', 'PRICE'],
    method='auto',
    show_progress=True
)

# Get outlier indices (a dictionary mapping each column name to its outlier indices)
outliers = cleaner.get_outlier_indices('MedInc')
print(f"\nOutlier indices for MedInc: {outliers['MedInc'][:5]}")

# Generate comprehensive analysis plots
figures = cleaner.plot_outlier_analysis(['MedInc', 'AveRooms', 'PRICE'])

# Compare methods
comparison = cleaner.compare_methods(['MedInc', 'PRICE'])
print(comparison['MedInc']['summary'])

Visualization Tools

Standalone Functions

plot_outliers(data, outliers)

Create a scatter plot highlighting outliers in the data.

Blue points: Normal data points
Red points: Outlier points
Customizable figure size and title

from outlier_cleaner import plot_outliers
plot_outliers(data=df['column'], outliers=outlier_mask, title='My Data')

plot_distribution(data, outliers)

Plot the distribution of data with optional outlier highlighting.

Shows kernel density estimation (KDE)
Separate distributions for normal and outlier points
Customizable figure size and title

from outlier_cleaner import plot_distribution
plot_distribution(data=df['column'], outliers=outlier_mask)

plot_boxplot(data, outliers)

Create a box plot with optional outlier highlighting.

Shows quartiles, median, and whiskers
Highlights outliers in red
Customizable figure size and title

from outlier_cleaner import plot_boxplot
plot_boxplot(data=df['column'], outliers=outlier_mask)

plot_qq(data, outliers)

Create a Q-Q plot to assess normality of the data distribution.

Compares data quantiles against theoretical normal distribution
Highlights outliers in red
Helps identify deviations from normality

from outlier_cleaner import plot_qq
plot_qq(data=df['column'], outliers=outlier_mask)
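The comparison behind a Q-Q plot can be reproduced numerically with `scipy.stats.probplot`, which pairs the sorted data with theoretical normal quantiles and reports how well they line up. The sketch below is independent of the package; the helper name `qq_points` is hypothetical.

```python
import numpy as np
from scipy import stats


def qq_points(values):
    """Return theoretical quantiles, ordered data, and the Q-Q correlation r."""
    (theoretical, ordered), (slope, intercept, r) = stats.probplot(
        np.asarray(values, dtype=float), dist="norm"
    )
    return theoretical, ordered, r


# Same sample with and without the extreme value
_, _, r_with_outlier = qq_points([1, 2, 3, 100, 4, 5, 6])
_, _, r_without = qq_points([1, 2, 3, 4, 5, 6])
print(f"r with outlier: {r_with_outlier:.3f}, without: {r_without:.3f}")
```

A correlation close to 1 means the points fall near a straight line, which is what a Q-Q plot of roughly normal data looks like. A single extreme value pulls the correlation down noticeably.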

plot_outlier_analysis(data, outliers)

Generate a comprehensive 2x2 dashboard combining all plots.

Scatter plot with outliers (top-left)
Distribution plot (top-right)
Box plot (bottom-left)
Q-Q plot (bottom-right)
Automatic layout adjustment

from outlier_cleaner import plot_outlier_analysis
plot_outlier_analysis(data=df['column'], outliers=outlier_mask)

Integrated Analysis Plots

plot_outlier_analysis(columns=None)

Generate comprehensive outlier analysis plots for specified columns.

Box Plot: Shows quartiles and outlier points
Distribution Plot: Shows data distribution with KDE
Q-Q Plot: Assesses normality of the data
Automatically analyzes all numeric columns if none specified

cleaner = OutlierCleaner(df)
# Analyze all numeric columns
figures = cleaner.plot_outlier_analysis()
# Or specific columns
figures = cleaner.plot_outlier_analysis(['column1', 'column2'])

Methods

analyze_distribution(column)

Analyze the distribution of a column and recommend the best outlier detection method.

Calculates skewness, kurtosis, and normality tests
Recommends the most appropriate method and thresholds
Returns detailed distribution analysis
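To make the idea of a distribution-driven recommendation concrete, here is a minimal sketch using SciPy. It is not the package's logic: the function name `sketch_analyze_distribution`, the decision rule, the significance level, and the skewness limit are all assumptions. Only the `skewness` and `recommended_method` keys mirror the ones used in the Advanced Example.

```python
import numpy as np
from scipy import stats


def sketch_analyze_distribution(values, alpha=0.05, skew_limit=1.0):
    """Summarize shape statistics and pick a detection rule (hypothetical rule)."""
    values = np.asarray(values, dtype=float)
    skewness = float(stats.skew(values))
    kurtosis = float(stats.kurtosis(values))  # excess kurtosis, 0 for a normal
    _, p_value = stats.shapiro(values)        # Shapiro-Wilk normality test
    is_normal = bool(p_value > alpha)

    if is_normal:
        method = "zscore"            # mean and std are meaningful
    elif abs(skewness) > skew_limit:
        method = "modified_zscore"   # strongly skewed: use median and MAD
    else:
        method = "iqr"               # non-normal but roughly symmetric
    return {
        "skewness": skewness,
        "kurtosis": kurtosis,
        "shapiro_p": float(p_value),
        "is_normal": is_normal,
        "recommended_method": method,
    }


# Deterministic, strongly right-skewed sample
skewed = np.exp(np.linspace(0, 5, 200))
report = sketch_analyze_distribution(skewed)
print(report["recommended_method"], round(report["skewness"], 2))
```

The thresholds here are illustrative defaults; a real analysis would tune them to the data and sample size.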

clean_columns(columns=None, method='auto', show_progress=True)

Clean multiple columns using the most appropriate method for each column.

Automatic method selection based on distribution analysis
Progress bar for tracking cleaning operations
Returns cleaned DataFrame and outlier information

remove_outliers_modified_zscore(column, threshold=3.5)

Remove outliers using the Modified Z-score method (robust to non-normal distributions).

Uses Median Absolute Deviation (MAD) instead of standard deviation
Automatically handles zero MAD cases
Returns cleaned DataFrame and outlier information
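The Modified Z-score is commonly defined as 0.6745 * (x - median) / MAD. A minimal sketch follows, including one widely used fallback for the zero-MAD case (scaling by the mean absolute deviation instead). The helper name `modified_zscore_mask` and the choice of fallback are assumptions; the package may handle zero MAD differently.

```python
import pandas as pd


def modified_zscore_mask(values: pd.Series, threshold: float = 3.5) -> pd.Series:
    """Flag points whose Modified Z-score exceeds the threshold."""
    abs_dev = (values - values.median()).abs()
    mad = abs_dev.median()
    if mad == 0:
        # More than half the values equal the median, so MAD collapses to 0.
        # Assumed fallback: scale by the mean absolute deviation instead.
        mean_ad = abs_dev.mean()
        if mean_ad == 0:  # constant column: nothing to flag
            return pd.Series(False, index=values.index)
        scores = abs_dev / (1.253314 * mean_ad)
    else:
        scores = 0.6745 * abs_dev / mad
    return scores > threshold


typical = modified_zscore_mask(pd.Series([1, 2, 3, 100, 4, 5, 6]))
zero_mad = modified_zscore_mask(pd.Series([5, 5, 5, 5, 5, 5, 50]))
print(typical.tolist(), zero_mad.tolist())
```

Because the median and MAD barely move when a single extreme value is added, this rule flags 100 in the first sample even though the sample is small.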

get_outlier_indices(column=None)

Get the indices of outliers for specified column(s).

Returns dictionary mapping columns to outlier indices
Handles missing columns gracefully by returning empty lists
Useful for tracking and analyzing removed data points
Can retrieve indices for a specific column or all processed columns
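The relationship between a boolean outlier mask, the recorded indices, and index preservation can be shown with pandas alone. This sketch does not call the package; the threshold-based mask is a stand-in for any detection rule, and the variable names are illustrative.

```python
import pandas as pd

df = pd.DataFrame({"column1": [1, 2, 3, 100, 4, 5, 6]}, index=list("abcdefg"))
mask = df["column1"] > 10  # stand-in for any detection rule

# Dictionary mapping each processed column to the index labels flagged in it
outlier_indices = {"column1": df.index[mask].tolist()}

# A column that was never processed yields an empty list instead of a KeyError
missing = outlier_indices.get("column2", [])

# Keeping the original labels lets removed rows be traced back to the source
kept_original_labels = df[~mask]
# Renumbering gives a contiguous 0..n-1 index but loses that link
renumbered = df[~mask].reset_index(drop=True)

print(outlier_indices, missing)
print(kept_original_labels.index.tolist(), renumbered.index.tolist())
```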

get_outlier_stats()

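Get comprehensive outlier statistics without removing data points. In the Advanced Example, a list of column names is passed to limit the statistics to those columns, and the result is queried through its 'Column' and 'Potential Outliers' fields.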

Provides potential outlier counts and percentages
Calculates bounds and thresholds for each method
Returns detailed statistics for analysis and comparison
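A statistics table of this kind can be approximated in a few lines of pandas. The sketch below is an assumption-laden illustration, not the package's output: the function name `sketch_outlier_stats`, the 'Method', 'Lower Bound', 'Upper Bound', and 'Percent' columns, and the default thresholds are hypothetical. Only 'Column' and 'Potential Outliers' appear in the Advanced Example.

```python
import pandas as pd


def sketch_outlier_stats(df, columns=None, iqr_factor=1.5, z_threshold=3.0):
    """Count potential outliers per column and method without dropping rows."""
    if columns is None:
        columns = df.select_dtypes(include="number").columns.tolist()
    rows = []
    for name in columns:
        values = df[name].dropna()
        q1, q3 = values.quantile(0.25), values.quantile(0.75)
        spread = iqr_factor * (q3 - q1)
        mean, std = values.mean(), values.std(ddof=0)
        bounds = {
            "iqr": (q1 - spread, q3 + spread),
            "zscore": (mean - z_threshold * std, mean + z_threshold * std),
        }
        for method, (lower, upper) in bounds.items():
            count = int(((values < lower) | (values > upper)).sum())
            rows.append({
                "Column": name,
                "Method": method,
                "Lower Bound": lower,
                "Upper Bound": upper,
                "Potential Outliers": count,
                "Percent": 100 * count / len(values) if len(values) else 0.0,
            })
    return pd.DataFrame(rows)


df = pd.DataFrame({"column1": [1, 2, 3, 100, 4, 5, 6]})
stats_table = sketch_outlier_stats(df)
print(stats_table)
```

The input DataFrame is only read, never modified, so the counts describe what each rule would remove if it were applied.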

Additional Methods

compare_methods(): Compare different detection methods
add_zscore_columns(): Add Z-score columns for analysis
clean_zscore_columns(): Clean using Z-score thresholds
remove_outliers_iqr(): Clean using IQR method
remove_outliers_zscore(): Clean using Z-score method
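One simple way to think about method comparison is pairwise agreement between the boolean masks each rule produces. The sketch below measures it with the Jaccard index. The function name `sketch_agreement`, the output columns, and the hand-written masks are hypothetical, and the summary returned by `compare_methods()` may be structured differently.

```python
from itertools import combinations

import pandas as pd


def sketch_agreement(masks: dict) -> pd.DataFrame:
    """Pairwise overlap between outlier masks produced by different rules."""
    rows = []
    for (name_a, mask_a), (name_b, mask_b) in combinations(masks.items(), 2):
        both = int((mask_a & mask_b).sum())
        either = int((mask_a | mask_b).sum())
        rows.append({
            "methods": f"{name_a} vs {name_b}",
            "flagged_by_both": both,
            "flagged_by_either": either,
            # Jaccard index: 1.0 means the two rules flag exactly the same rows
            "jaccard": both / either if either else 1.0,
        })
    return pd.DataFrame(rows)


# Hand-written masks standing in for the output of three detection rules
masks = {
    "iqr": pd.Series([False, False, False, True, False, True]),
    "zscore": pd.Series([False, False, False, True, False, False]),
    "modified_zscore": pd.Series([False, False, False, True, False, True]),
}
agreement = sketch_agreement(masks)
print(agreement)
```

Rows flagged by every rule are the strongest outlier candidates; rows flagged by only one rule deserve a closer look before removal.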

Requirements

numpy>=1.20.0
pandas>=1.3.0
matplotlib>=3.4.0
seaborn>=0.11.0
scipy>=1.7.0
scikit-learn>=0.24.0 (for examples)
tqdm>=4.62.0

Changelog

Version 1.1.4 (2025-08-06)

Comprehensive Type Hints: Added complete type annotations to all methods and functions
- Enhanced IDE support with better autocomplete and error detection
- Improved code documentation through type annotations
- MyPy compatibility for static type checking
- Better developer experience and code maintainability
Updated dependency management with complete requirements specification
Enhanced null safety with proper error handling
Improved code quality and professional standards
Full backward compatibility maintained
Updated requirements.txt with scipy and tqdm dependencies
Synchronized dependency versions across setup.py and requirements.txt

Version 1.0.8 (2025-04-26)

Improved outlier_analysis method with enhanced visualization capabilities
Added robust error handling for missing columns and non-numeric data
Optimized plot layout and styling for better readability
Fixed memory management in visualization functions
Added comprehensive test suite for visualization features

Version 1.0.7 (2025-04-24)

Enhanced visualization tools with improved plot customization
Added comprehensive docstrings to all visualization functions
Improved error handling in plotting functions
Updated documentation with detailed usage examples
Code optimization and performance improvements

Version 1.0.6

Added new standalone visualization functions:
- plot_boxplot: Box plot with outlier highlighting
- plot_qq: Q-Q plot for normality assessment
- plot_outlier_analysis: Comprehensive 2x2 dashboard
Enhanced visualization features:
- Improved outlier highlighting in all plots
- Added grid lines for better readability
- Automatic layout adjustment in dashboard view
Updated documentation with new visualization examples
Improved type hints and error handling

Version 1.0.5

Fixed boxplot visualization in plot_outlier_analysis
Enhanced automatic column handling for visualization functions
Improved error messages and user feedback
Updated documentation with clearer examples

Version 1.0.4

Added standalone visualization functions in utils.py
Added comprehensive plot_outlier_analysis method
Enhanced distribution visualization with KDE
Added Q-Q plots for normality assessment
Improved error handling and user feedback
Updated documentation with visualization examples

Version 1.0.2

Enhanced outlier indices tracking in all removal methods
Improved get_outlier_indices() to handle missing columns gracefully
Optimized outlier statistics calculation
Removed redundant outlier indices from get_outlier_stats() output

Version 1.0.1

Fixed author name spelling
Updated documentation and examples
Added comprehensive test coverage

Version 1.0.0

Initial release with core functionality
Added distribution analysis and automatic method selection
Implemented visualization tools and progress tracking

Author

Subashanan Nair

Contributing

Contributions are welcome! Please feel free to submit a Pull Request.

License

MIT License

Release history

1.1.4 (this version): Aug 6, 2025
1.1.2: Aug 6, 2025
1.1.1: Aug 5, 2025
1.1.0: Aug 5, 2025
1.0.8: Apr 26, 2025
1.0.7: Apr 24, 2025
1.0.6: Apr 24, 2025
1.0.5: Apr 24, 2025
1.0.4: Apr 24, 2025
1.0.3: Apr 2, 2025
1.0.1: Apr 2, 2025
1.0.0: Apr 2, 2025
0.1.7: Apr 2, 2025
0.1.6: Apr 2, 2025
0.1.5: Apr 2, 2025
0.1.4: Apr 2, 2025
0.1.3: Apr 2, 2025
0.1.2: Apr 2, 2025
0.1.1: Mar 28, 2025
0.1.0: Mar 28, 2025

Download files

Source Distribution

outlier_cleaner-1.1.4.tar.gz (20.1 kB)

Uploaded Aug 6, 2025 Source

Built Distribution

outlier_cleaner-1.1.4-py3-none-any.whl (17.9 kB)

Uploaded Aug 6, 2025 Python 3

File details

Details for the file outlier_cleaner-1.1.4.tar.gz.

File metadata

Download URL: outlier_cleaner-1.1.4.tar.gz
Upload date: Aug 6, 2025
Size: 20.1 kB
Tags: Source
Uploaded using Trusted Publishing? Yes
Uploaded via: twine/6.1.0 CPython/3.12.9

File hashes

Hashes for outlier_cleaner-1.1.4.tar.gz
Algorithm	Hash digest
SHA256	`40370d2f3304dbd308826d00ca638897f6f17e68ffa68aa5fdbacdce54d54256`
MD5	`8aa9d45cb231ac7b74ff44c49d6a8328`
BLAKE2b-256	`bc2b6af9cd87c545fff6c5c429aa8c7a8c3a2e3f1366e0ee34bfa4d0b9528526`

Provenance

The following attestation bundles were made for outlier_cleaner-1.1.4.tar.gz:

Publisher: publish.yml on SubaashNair/OutlierCleaner

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

Statement:
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: outlier_cleaner-1.1.4.tar.gz
- Subject digest: 40370d2f3304dbd308826d00ca638897f6f17e68ffa68aa5fdbacdce54d54256
- Sigstore transparency entry: 355603565
- Sigstore integration time: Aug 6, 2025
Source repository:
- Permalink: SubaashNair/OutlierCleaner@e47b76fd8a33c5b16fdaae8f8a977cdc401733e6
- Branch / Tag: refs/tags/v1.1.4
- Owner: https://github.com/SubaashNair
- Access: public
Publication detail:
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: publish.yml@e47b76fd8a33c5b16fdaae8f8a977cdc401733e6
- Trigger Event: push

File details

Details for the file outlier_cleaner-1.1.4-py3-none-any.whl.

File metadata

Download URL: outlier_cleaner-1.1.4-py3-none-any.whl
Upload date: Aug 6, 2025
Size: 17.9 kB
Tags: Python 3
Uploaded using Trusted Publishing? Yes
Uploaded via: twine/6.1.0 CPython/3.12.9

File hashes

Hashes for outlier_cleaner-1.1.4-py3-none-any.whl
Algorithm	Hash digest
SHA256	`9f7d435f3fad35d3955c870878899ff609d6963e091c1331ca9998df2da68072`
MD5	`7e1e4153f21ad12c42257f1f9ce8e042`
BLAKE2b-256	`22309ca51ef2b45c0f7ce4163405186a66fade89b34970e87e466ae1d6ec83e0`

Provenance

The following attestation bundles were made for outlier_cleaner-1.1.4-py3-none-any.whl:

Publisher: publish.yml on SubaashNair/OutlierCleaner

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

Statement:
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: outlier_cleaner-1.1.4-py3-none-any.whl
- Subject digest: 9f7d435f3fad35d3955c870878899ff609d6963e091c1331ca9998df2da68072
- Sigstore transparency entry: 355603630
- Sigstore integration time: Aug 6, 2025
Source repository:
- Permalink: SubaashNair/OutlierCleaner@e47b76fd8a33c5b16fdaae8f8a977cdc401733e6
- Branch / Tag: refs/tags/v1.1.4
- Owner: https://github.com/SubaashNair
- Access: public
Publication detail:
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: publish.yml@e47b76fd8a33c5b16fdaae8f8a977cdc401733e6
- Trigger Event: push
