StatClean
A comprehensive statistical data preprocessing and outlier detection library with formal statistical testing and publication-quality reporting.
StatClean provides advanced statistical methods for data cleaning, including formal statistical tests (Grubbs' test, Dixon's Q-test), multivariate outlier detection, data transformations, and publication-quality reporting with p-values and effect sizes. It is designed for academic research, data science, and statistical analysis where rigorous methods and reproducible results are essential.
Features
🔬 Statistical Testing & Analysis
- Formal Statistical Tests: Grubbs' test and Dixon's Q-test with p-values and critical values
- Distribution Analysis: Automatic normality testing, skewness/kurtosis calculation
- Method Comparison: Statistical agreement analysis between different detection methods
- Publication-Quality Reporting: P-values, confidence intervals, and effect sizes
📊 Detection Methods
- Univariate Methods: IQR, Z-score, Modified Z-score (MAD-based)
- Multivariate Methods: Mahalanobis distance with chi-square thresholds
- Batch Processing: Detect outliers across multiple columns with progress tracking
- Automatic Method Selection: Based on statistical distribution analysis
🛠️ Treatment Options
- Outlier Removal: Remove detected outliers with statistical validation
- Winsorizing: Cap outliers at specified bounds instead of removal
- Data Transformations: Box-Cox, logarithmic, and square-root transformations
- Transformation Recommendations: Automatic selection based on distribution characteristics
📈 Advanced Visualization
- Comprehensive Analysis Plots: 3-in-1 analysis (boxplot, distribution, Q-Q plot)
- Standalone Plotting Functions: Individual scatter, distribution, box, and Q-Q plots
- Interactive Dashboards: 2x2 comprehensive analysis grid
- Publication-Ready Figures: Professional styling with customizable parameters
🚀 Developer Experience
- Method Chaining: Fluent API for streamlined workflows
- Type Safety: Comprehensive type hints for enhanced IDE support
- Progress Tracking: Built-in progress bars for batch operations
- Flexible Configuration: Customizable thresholds and statistical parameters
- Memory Efficient: Statistics caching and lazy evaluation
Installation
```bash
pip install statclean
```
Quick Start
```python
import pandas as pd
from statclean import StatClean

# Load your data
df = pd.DataFrame({
    'income': [25000, 30000, 35000, 40000, 500000, 45000, 50000],  # contains an outlier
    'age': [25, 30, 35, 40, 35, 45, 50]
})

# Note: as of v0.1.3, remover methods return the cleaner instance for method
# chaining. Access cleaned data via `cleaner.clean_df` and details via
# `cleaner.outlier_info`.
cleaner = StatClean(df)

# Automatic analysis and cleaning
cleaned_df, info = cleaner.clean_columns(['income'], method='auto', show_progress=True)

print(f"Original shape: {df.shape}")
print(f"Cleaned shape: {cleaned_df.shape}")
print(f"Outliers removed: {info['income']['outliers_removed']}")
```
Advanced Usage
Formal Statistical Testing
```python
# Grubbs' test for outliers with statistical significance
result = cleaner.grubbs_test('income', alpha=0.05)
print(f"Test statistic: {result['statistic']:.3f}")
print(f"P-value: {result['p_value']:.6f}")
print(f"Outlier detected: {result['is_outlier']}")

# Dixon's Q-test for small samples
result = cleaner.dixon_q_test('age', alpha=0.05)
print(f"Q statistic: {result['statistic']:.3f}")
print(f"Critical value: {result['critical_value']:.3f}")
```
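For intuition, the two-sided Grubbs statistic is straightforward to compute by hand: G = max|x_i − x̄| / s, compared against a critical value derived from the t-distribution. The sketch below is illustrative; `grubbs_statistic` is a hypothetical helper, not part of StatClean's API.

```python
import numpy as np
from scipy import stats

def grubbs_statistic(x, alpha=0.05):
    # Two-sided Grubbs' test: G = max |x_i - mean| / s (sample std, ddof=1)
    x = np.asarray(x, dtype=float)
    n = len(x)
    g = np.max(np.abs(x - x.mean())) / x.std(ddof=1)
    # Critical value from the t-distribution (two-sided form)
    t = stats.t.ppf(1 - alpha / (2 * n), n - 2)
    g_crit = (n - 1) / np.sqrt(n) * np.sqrt(t**2 / (n - 2 + t**2))
    return g, g_crit

income = [25000, 30000, 35000, 40000, 500000, 45000, 50000]
g, g_crit = grubbs_statistic(income)
print(f"G = {g:.3f}, critical = {g_crit:.3f}, outlier = {g > g_crit}")
```

For the toy income data above, G exceeds the critical value, so the 500000 entry is flagged at alpha = 0.05.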
Multivariate Outlier Detection
```python
# Mahalanobis distance for multivariate outliers
# chi2_threshold can be a percentile (0 < val <= 1) or an absolute chi-square statistic
# use_shrinkage=True uses Ledoit-Wolf shrinkage covariance if scikit-learn is installed
outliers = cleaner.detect_outliers_mahalanobis(['income', 'age'], chi2_threshold=0.95, use_shrinkage=True)
print(f"Multivariate outliers detected: {outliers.sum()}")

# Remove multivariate outliers
cleaned_df = cleaner.remove_outliers_mahalanobis(['income', 'age'])
```
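The underlying idea: for p-dimensional Gaussian data, squared Mahalanobis distances are approximately chi-square distributed with p degrees of freedom, so a chi-square quantile gives a natural cutoff. A standalone sketch (not StatClean's implementation), using synthetic data:

```python
import numpy as np
from scipy import stats

def mahalanobis_outliers(X, quantile=0.95):
    # Squared Mahalanobis distances are ~chi-square(p) for multivariate-normal
    # data with p features, so threshold at the requested chi-square quantile
    X = np.asarray(X, dtype=float)
    diff = X - X.mean(axis=0)
    cov_inv = np.linalg.inv(np.cov(X, rowvar=False))
    d2 = np.einsum('ij,jk,ik->i', diff, cov_inv, diff)
    return d2 > stats.chi2.ppf(quantile, df=X.shape[1])

rng = np.random.default_rng(42)
X = rng.normal(size=(200, 2))
X[0] = [8.0, 8.0]  # inject an obvious multivariate outlier
flags = mahalanobis_outliers(X)
print(f"Point 0 flagged: {flags[0]}, total flagged: {flags.sum()}")
```

Note that with very small samples the sample covariance is itself distorted by the outlier, which limits the achievable distances; that is one motivation for the shrinkage-covariance option mentioned above.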
Data Transformations
```python
# Automatic transformation recommendation
recommendation = cleaner.recommend_transformation('income')
print(f"Recommended transformation: {recommendation['recommended_method']}")
print(f"Improvement in skewness: {recommendation['expected_improvement']:.3f}")

# Apply Box-Cox transformation
_, info = cleaner.transform_boxcox('income')
print(f"Optimal lambda: {info['lambda']:.3f}")

# Method chaining for complex workflows
result = (cleaner
          .set_thresholds(zscore_threshold=2.5)
          .add_zscore_columns(['income'])
          .winsorize_outliers_iqr('income', lower_factor=1.5, upper_factor=1.5)
          .clean_df)
```
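For context, the Box-Cox step can be reproduced directly with SciPy, which estimates the optimal lambda by maximum likelihood (this standalone sketch calls `scipy.stats.boxcox`, not StatClean's wrapper):

```python
import numpy as np
from scipy import stats

# Box-Cox requires strictly positive data; scipy chooses lambda by maximum likelihood
income = np.array([25000, 30000, 35000, 40000, 500000, 45000, 50000], dtype=float)
transformed, lam = stats.boxcox(income)

print(f"Optimal lambda: {lam:.3f}")
print(f"Skewness before: {stats.skew(income):.3f}, after: {stats.skew(transformed):.3f}")
```

On right-skewed data like this, the transform pulls the long tail in, so the post-transform skewness should be much closer to zero.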
Comprehensive Analysis
```python
# Distribution analysis with recommendations
analysis = cleaner.analyze_distribution('income')
print(f"Skewness: {analysis['skewness']:.3f}")
print(f"Kurtosis: {analysis['kurtosis']:.3f}")
print(f"Normality test p-value: {analysis['normality_test']['p_value']:.6f}")
print(f"Recommended method: {analysis['recommended_method']}")

# Compare different detection methods
comparison = cleaner.compare_methods(['income'],
                                     methods=['iqr', 'zscore', 'modified_zscore'])
print("Method Agreement Analysis:")
for method, stats in comparison['income']['method_stats'].items():
    print(f"  {method}: {stats['outliers_detected']} outliers")
```
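The agreement idea can be illustrated with plain NumPy. The masks below re-implement two simple detectors by hand (not via StatClean) and score their overlap with a Jaccard ratio:

```python
import numpy as np

income = np.array([25000, 30000, 35000, 40000, 500000, 45000, 50000], dtype=float)

# Z-score detector: standardize and threshold
z = (income - income.mean()) / income.std(ddof=1)
zscore_mask = np.abs(z) > 2.0

# IQR detector: flag points outside the Tukey fences
q1, q3 = np.percentile(income, [25, 75])
iqr = q3 - q1
iqr_mask = (income < q1 - 1.5 * iqr) | (income > q3 + 1.5 * iqr)

# Jaccard agreement: |intersection| / |union| of the flagged sets
inter = np.logical_and(zscore_mask, iqr_mask).sum()
union = np.logical_or(zscore_mask, iqr_mask).sum()
print(f"Jaccard agreement: {inter / union:.2f}")
```

Here both methods flag only the 500000 entry, so agreement is perfect; on messier data the ratio quantifies how much the methods disagree.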
Advanced Visualization
```python
# Comprehensive analysis plots
figures = cleaner.plot_outlier_analysis(['income', 'age'])

# Individual visualization components
from statclean.utils import plot_outliers, plot_distribution, plot_qq

# Custom outlier highlighting
outliers = cleaner.detect_outliers_zscore('income')
plot_outliers(df['income'], outliers, title='Income Distribution')
plot_distribution(df['income'], outliers, title='Income KDE')
plot_qq(df['income'], outliers, title='Income Normality')
```
Batch Processing with Progress Tracking
```python
# Process multiple columns with detailed reporting
columns_to_clean = ['income', 'age', 'score', 'rating']
cleaned_df, detailed_info = cleaner.clean_columns(
    columns=columns_to_clean,
    method='auto',
    show_progress=True,
    include_indices=True
)

# Access detailed statistics
for column, info in detailed_info.items():
    print(f"\n{column}:")
    print(f"  Method used: {info['method_used']}")
    print(f"  Outliers removed: {info['outliers_removed']}")
    print(f"  Percentage removed: {info['percentage_removed']:.2f}%")
    if 'p_value' in info:
        print(f"  Statistical significance: p = {info['p_value']:.6f}")
```
Statistical Methods Reference
Detection Methods
- `detect_outliers_iqr()`: Interquartile Range method with configurable factors
- `detect_outliers_zscore()`: Standard Z-score method
- `detect_outliers_modified_zscore()`: Modified Z-score using MAD (robust to skewness)
- `detect_outliers_mahalanobis()`: Multivariate detection using Mahalanobis distance
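The math behind the univariate detectors is compact enough to sketch directly. These helpers are illustrative re-implementations, not StatClean's internals:

```python
import numpy as np

def iqr_outliers(x, factor=1.5):
    # Flag points outside the Tukey fences [Q1 - factor*IQR, Q3 + factor*IQR]
    q1, q3 = np.percentile(x, [25, 75])
    iqr = q3 - q1
    return (x < q1 - factor * iqr) | (x > q3 + factor * iqr)

def modified_zscore_outliers(x, threshold=3.5):
    # MAD-based z-score; 0.6745 rescales MAD to ~sigma under normality,
    # making the statistic robust to the outliers it is hunting
    x = np.asarray(x, dtype=float)
    med = np.median(x)
    mad = np.median(np.abs(x - med))
    return np.abs(0.6745 * (x - med) / mad) > threshold

income = np.array([25000, 30000, 35000, 40000, 500000, 45000, 50000], dtype=float)
print(iqr_outliers(income))              # only the 500000 entry is flagged
print(modified_zscore_outliers(income))  # same point, via robust statistics
```

The MAD-based variant matters on skewed data: a single extreme value inflates the mean and standard deviation (masking itself), but barely moves the median and MAD.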
Formal Statistical Tests
- `grubbs_test()`: Grubbs' test for single outliers with p-values
- `dixon_q_test()`: Dixon's Q-test for small samples (n < 30)
Treatment Methods
- `remove_outliers_*()`: Remove detected outliers
- `winsorize_outliers_*()`: Cap outliers at specified bounds
- `transform_boxcox()`: Box-Cox transformation with optimal lambda
- `transform_log()`: Logarithmic transformation (natural, base 10, base 2)
- `transform_sqrt()`: Square root transformation
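Winsorizing, for comparison, caps extreme values at the fences instead of dropping rows, which preserves sample size. A minimal standalone sketch (the `winsorize_iqr` helper is hypothetical, not a StatClean method):

```python
import numpy as np

def winsorize_iqr(x, factor=1.5):
    # Clip values to the IQR fences rather than removing them
    q1, q3 = np.percentile(x, [25, 75])
    iqr = q3 - q1
    return np.clip(x, q1 - factor * iqr, q3 + factor * iqr)

income = np.array([25000, 30000, 35000, 40000, 500000, 45000, 50000], dtype=float)
print(winsorize_iqr(income))  # 500000 is capped at the upper fence; others unchanged
```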
Analysis and Reporting
- `analyze_distribution()`: Comprehensive distribution analysis
- `compare_methods()`: Statistical agreement between methods
- `get_outlier_stats()`: Detailed outlier statistics without removal
- `get_summary_report()`: Publication-quality summary report
Real-World Example
```python
import pandas as pd
from sklearn.datasets import fetch_california_housing
from statclean import StatClean

# Load California Housing dataset
housing = fetch_california_housing()
df = pd.DataFrame(housing.data, columns=housing.feature_names)
df['PRICE'] = housing.target

print(f"Dataset shape: {df.shape}")
print("Features:", list(df.columns))

# Initialize with index preservation
cleaner = StatClean(df, preserve_index=True)

# Analyze key features
features = ['MedInc', 'AveRooms', 'PRICE']
for feature in features:
    analysis = cleaner.analyze_distribution(feature)
    print(f"\n{feature} Analysis:")
    print(f"  Skewness: {analysis['skewness']:.3f}")
    print(f"  Recommended method: {analysis['recommended_method']}")

    # Statistical significance test
    if analysis['skewness'] > 1:  # highly skewed
        grubbs_result = cleaner.grubbs_test(feature, alpha=0.05)
        print(f"  Grubbs test p-value: {grubbs_result['p_value']:.6f}")

# Comprehensive cleaning with statistical validation
cleaned_df, cleaning_info = cleaner.clean_columns(
    columns=features,
    method='auto',
    show_progress=True,
    include_indices=True
)

print(f"\nCleaning Results:")
print(f"Original shape: {df.shape}")
print(f"Cleaned shape: {cleaned_df.shape}")

for feature, info in cleaning_info.items():
    print(f"\n{feature}:")
    print(f"  Method: {info['method_used']}")
    print(f"  Outliers removed: {info['outliers_removed']}")
    print(f"  Percentage: {info['percentage_removed']:.2f}%")

# Generate comprehensive visualizations
figures = cleaner.plot_outlier_analysis(features)

# Method comparison analysis
comparison = cleaner.compare_methods(features)
for feature in features:
    print(f"\n{feature} Method Comparison:")
    print(comparison[feature]['summary'])
```
Requirements
- Python: ≥3.7
- numpy: ≥1.19.0
- pandas: ≥1.2.0
- matplotlib: ≥3.3.0
- seaborn: ≥0.11.0
- scipy: ≥1.6.0 (for statistical tests)
- tqdm: ≥4.60.0 (for progress bars)
- scikit-learn: ≥0.24.0 (optional, for shrinkage covariance in Mahalanobis)
Changelog
Version 0.1.3 (2025-08-08)
- Align docs/examples with actual API: remover methods return `self`; use `cleaner.clean_df` and `cleaner.outlier_info`.
- Grubbs/Dixon result keys clarified: `statistic`, `is_outlier`.
- Mahalanobis `chi2_threshold` accepts a percentile (0 < val <= 1) or an absolute chi-square statistic; added `use_shrinkage` option.
- Transformations preserve NaNs; Box-Cox is computed on non-NA values only.
- Seaborn plotting calls updated for compatibility; analysis functions made NaN-safe.
- Added GitHub Actions workflow to publish to PyPI on releases.
Version 0.1.0 (2025-08-06)
🎉 Initial Release of StatClean
Complete rebranding from OutlierCleaner to StatClean with expanded statistical capabilities:
New Features
- Formal Statistical Testing: Grubbs' test and Dixon's Q-test with p-values
- Multivariate Analysis: Mahalanobis distance outlier detection
- Data Transformations: Box-Cox, logarithmic, square-root with automatic recommendations
- Method Chaining: Fluent API for streamlined statistical workflows
- Publication-Quality Reporting: Statistical significance testing and effect sizes
Enhanced Functionality
- Advanced Distribution Analysis: Automatic normality testing and method recommendations
- Batch Processing: Multi-column processing with progress tracking and detailed reporting
- Statistical Validation: P-values, confidence intervals, and critical value calculations
- Comprehensive Visualization: 3-in-1 analysis plots and standalone plotting functions
Technical Improvements
- Type Safety: Complete type annotations for enhanced IDE support
- Memory Efficiency: Statistics caching and lazy evaluation
- Robust Error Handling: Edge case handling for statistical computations
- Flexible Configuration: Customizable thresholds and statistical parameters
API Changes
- Package renamed from `outlier-cleaner` to `statclean`
- Main class renamed from `OutlierCleaner` to `StatClean`
- Backward compatibility alias maintained: `OutlierCleaner = StatClean`
- Enhanced method signatures with comprehensive parameter documentation
This release transforms the package from a basic outlier detection tool into a comprehensive statistical preprocessing library suitable for academic research and professional data science applications.
Contributing
Contributions are welcome! Please feel free to submit a Pull Request. Areas of particular interest:
- Additional statistical tests and methods
- Performance optimizations for large datasets
- Enhanced visualization capabilities
- Documentation improvements and examples
License
MIT License
Author
Subashanan Nair
StatClean: Where statistical rigor meets practical data science.
Development: Run Tests in Headless Mode and Capture Logs
```bash
# Ensure a headless matplotlib backend and run tests quietly
export MPLBACKEND=Agg
pytest -q

# Save a timestamped test log (example)
LOG=cursor_logs/test_log.md
mkdir -p cursor_logs
printf '==== %s ====\n\n' "$(date)" >> "$LOG"
MPLBACKEND=Agg pytest -q 2>&1 | tee -a "$LOG"
```
## Continuous Delivery: Publish to PyPI (Trusted Publisher)
This repository includes a GitHub Actions workflow using PyPI Trusted Publisher (OIDC).
Setup (one-time on PyPI):
- Add this GitHub repo as a Trusted Publisher in the PyPI project settings.
Release steps:
1. Bump version in `statclean/__init__.py` and `setup.py` (already `0.1.3`).
2. Push a tag matching the version, e.g., `git tag v0.1.3 && git push origin v0.1.3`.
3. Workflow will run tests, build, and publish to PyPI without storing credentials.