Automated dataset cleaner for machine learning
Project description
Dataset Cleaner for ML: Documentation & Implementation Guide
1. Overview
Dataset Cleaner for ML (datacleaner_ai) is a Python library designed to simplify the preprocessing pipeline by automatically detecting and fixing common data quality issues.
Features:
- Detects:
  - Missing values
  - Duplicate rows
  - Class imbalance
  - Outliers
- Cleans:
  - Handles missing values (drop, mean/median fill, forward/backward fill)
  - Removes duplicates
  - Balances classes (oversampling/undersampling)
  - Handles outliers (clip, remove, replace)
- Easy to use:

      from datacleaner_ai import Cleaner
      cleaner = Cleaner(strategy="auto")
      df_clean = cleaner.fit_transform(df)

- Generates summary reports about detected issues.
2. Problem It Solves
- Data scientists are commonly said to spend 60-80% of their time cleaning data before model training.
- Beginners often forget key steps (like handling imbalance or outliers).
- Current tools (like pandas) are flexible but neither opinionated nor automated.
This library reduces preprocessing to a single line while still allowing custom strategies.
3. Library Architecture
datacleaner_ai/
│
├── datacleaner_ai/
│   ├── __init__.py
│   ├── cleaner.py          # Main Cleaner class
│   ├── detectors.py        # Functions to detect issues
│   ├── transformers.py     # Functions to clean data
│   ├── reports.py          # Summary report generator
│   └── utils.py            # Helper functions
│
├── tests/                  # Unit tests
│   ├── test_cleaner.py
│   ├── test_detectors.py
│   └── test_transformers.py
│
├── examples/               # Example notebooks/scripts
│
├── setup.py or pyproject.toml
├── README.md
├── LICENSE
└── requirements.txt
4. Core Components
Cleaner (Main API class)
- Methods:
  - fit(df): analyzes the dataset
  - transform(df): applies fixes
  - fit_transform(df): runs both
  - report(): generates a summary of issues
- Parameters:
  - strategy: "auto" | "manual"
  - missing_values: "drop" | "mean" | "median" | "ffill" | "bfill"
  - duplicates: True | False
  - imbalance: "smote" | "undersample" | "oversample" | None
  - outliers: "clip" | "remove" | None
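The API above could be structured roughly as follows. This is a minimal sketch assuming only pandas; the method and parameter names come from the list above, while the class body, the `issues_` and `fill_values_` attributes, and the order of operations are illustrative assumptions rather than the library's actual implementation. Imbalance and outlier handling are left out for brevity.

```python
import pandas as pd


class Cleaner:
    """Illustrative sketch of the Cleaner API; the internals are assumptions."""

    def __init__(self, strategy="auto", missing_values="median", duplicates=True):
        self.strategy = strategy
        self.missing_values = missing_values
        self.duplicates = duplicates
        self.issues_ = {}         # hypothetical: findings recorded by fit()
        self.fill_values_ = None  # hypothetical: per-column fill values

    def fit(self, df):
        # Analyze the dataset without modifying it.
        self.issues_ = {
            "missing_cells": int(df.isna().sum().sum()),
            "duplicate_rows": int(df.duplicated().sum()),
        }
        # Store fill values so transform() reuses what fit() learned.
        numeric = df.select_dtypes(include="number")
        if self.missing_values == "mean":
            self.fill_values_ = numeric.mean()
        elif self.missing_values == "median":
            self.fill_values_ = numeric.median()
        else:
            self.fill_values_ = None
        return self

    def transform(self, df):
        out = df.copy()
        if self.duplicates:
            out = out.drop_duplicates()
        if self.missing_values == "drop":
            out = out.dropna()
        elif self.missing_values == "ffill":
            out = out.ffill()
        elif self.missing_values == "bfill":
            out = out.bfill()
        elif self.fill_values_ is not None:
            out = out.fillna(self.fill_values_)
        return out.reset_index(drop=True)

    def fit_transform(self, df):
        return self.fit(df).transform(df)

    def report(self):
        lines = ["=== Dataset Cleaner Report ==="]
        lines += [f"{name}: {count}" for name, count in self.issues_.items()]
        return "\n".join(lines)


# Tiny in-memory dataset: one missing value and one duplicated row.
df = pd.DataFrame({"age": [25.0, None, 40.0, 40.0], "city": ["A", "B", "C", "C"]})
cleaner = Cleaner(missing_values="median")
df_clean = cleaner.fit_transform(df)
print(df_clean)
print(cleaner.report())
```

Learning the fill values in `fit` and applying them in `transform` keeps the two steps separable, so statistics computed on training data can be reused on new data.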
detectors.py
- Functions:
  - detect_missing(df)
  - detect_duplicates(df)
  - detect_imbalance(df, target)
  - detect_outliers(df, method="zscore" | "iqr")
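One way these detector functions might look with plain pandas is sketched below. The function names and the `method` options follow the list above; the return types, the `threshold` parameter, and the cut-offs (|z| > 3, 1.5 × IQR) are assumptions chosen for illustration, not confirmed library behavior.

```python
import pandas as pd


def detect_missing(df):
    """Return the fraction of missing values per column."""
    return df.isna().mean()


def detect_duplicates(df):
    """Return the number of fully duplicated rows."""
    return int(df.duplicated().sum())


def detect_imbalance(df, target, threshold=0.2):
    """Flag imbalance when the rarest class share is below `threshold`.

    `threshold` is a hypothetical parameter introduced for this sketch.
    """
    shares = df[target].value_counts(normalize=True)
    return bool(shares.min() < threshold), shares


def detect_outliers(df, method="iqr"):
    """Return a boolean mask marking outlying cells in numeric columns."""
    numeric = df.select_dtypes(include="number")
    if method == "zscore":
        z = (numeric - numeric.mean()) / numeric.std(ddof=0)
        return z.abs() > 3
    if method == "iqr":
        q1, q3 = numeric.quantile(0.25), numeric.quantile(0.75)
        iqr = q3 - q1
        return (numeric < q1 - 1.5 * iqr) | (numeric > q3 + 1.5 * iqr)
    raise ValueError(f"unknown method: {method!r}")


df = pd.DataFrame(
    {"x": [1.0, 2.0, 3.0, 4.0, 100.0, None], "label": [0, 0, 0, 0, 0, 1]}
)
print(detect_missing(df))
print(detect_duplicates(df))
flagged, shares = detect_imbalance(df, "label")
print(flagged)
mask = detect_outliers(df, method="iqr")
print(mask["x"].tolist())
```

Missing values compare as False in the outlier mask, so they are reported by `detect_missing` only and are not double-counted as outliers.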
transformers.py
- Functions:
  - handle_missing(df, method="mean")
  - remove_duplicates(df)
  - balance_classes(df, target, method="smote")
  - handle_outliers(df, method="clip")
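The transformer functions could be sketched with pandas alone, as below. The function names match the list above, but the bodies are assumptions. In particular, `balance_classes` here uses random oversampling as a dependency-free stand-in and does not implement SMOTE, and the `random_state` parameter and the 1.5 × IQR fences are choices made for this sketch.

```python
import pandas as pd


def handle_missing(df, method="mean"):
    """Drop or fill missing values; mean/median apply to numeric columns only."""
    if method == "drop":
        return df.dropna()
    if method == "ffill":
        return df.ffill()
    if method == "bfill":
        return df.bfill()
    numeric = df.select_dtypes(include="number")
    if method == "mean":
        return df.fillna(numeric.mean())
    if method == "median":
        return df.fillna(numeric.median())
    raise ValueError(f"unknown method: {method!r}")


def remove_duplicates(df):
    """Drop fully duplicated rows, keeping the first occurrence."""
    return df.drop_duplicates().reset_index(drop=True)


def balance_classes(df, target, method="oversample", random_state=0):
    """Random oversampling of smaller classes (a stand-in, not SMOTE)."""
    if method != "oversample":
        raise NotImplementedError("this sketch only covers random oversampling")
    largest = df[target].value_counts().max()
    parts = []
    for _, group in df.groupby(target):
        parts.append(group)
        shortfall = largest - len(group)
        if shortfall > 0:
            # Re-draw existing rows with replacement to match the largest class.
            parts.append(
                group.sample(n=shortfall, replace=True, random_state=random_state)
            )
    return pd.concat(parts, ignore_index=True)


def handle_outliers(df, method="clip"):
    """Clip or remove values outside the 1.5 * IQR fences of numeric columns."""
    numeric = df.select_dtypes(include="number")
    q1, q3 = numeric.quantile(0.25), numeric.quantile(0.75)
    lower, upper = q1 - 1.5 * (q3 - q1), q3 + 1.5 * (q3 - q1)
    if method == "clip":
        out = df.copy()
        out[numeric.columns] = numeric.clip(lower=lower, upper=upper, axis=1)
        return out
    if method == "remove":
        inside = ((numeric >= lower) & (numeric <= upper)) | numeric.isna()
        return df[inside.all(axis=1)]
    raise ValueError(f"unknown method: {method!r}")


df = pd.DataFrame({"x": [1.0, 2.0, 3.0, 4.0, 100.0], "label": ["a", "a", "a", "a", "b"]})
clipped = handle_outliers(df, method="clip")
balanced = balance_classes(df, "label")
filled = handle_missing(pd.DataFrame({"x": [1.0, None, 3.0]}), method="mean")
print(clipped["x"].tolist())
print(balanced["label"].value_counts().to_dict())
```

Random oversampling only repeats existing rows, whereas SMOTE synthesizes new minority samples by interpolation; the latter is available in the imblearn package mentioned later in this guide.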
reports.py
- Generates a readable summary:

      === Dataset Cleaner Report ===
      Missing values: 12% (Handled with median fill)
      Duplicates: 350 rows removed
      Class imbalance: SMOTE applied (positive class upsampled)
      Outliers: 2.3% clipped using IQR
      ==============================
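A report in this layout could be produced by a small formatting function. The sketch below is an assumption about how reports.py might work: `build_report` and the keys of the `stats` dictionary are hypothetical names, and the values passed in are simply the sample figures from the summary above.

```python
def build_report(stats):
    """Format a dict of findings into the report layout.

    The `stats` keys are hypothetical names chosen for this sketch.
    """
    header = "=== Dataset Cleaner Report ==="
    lines = [
        header,
        f"Missing values: {stats['missing_pct']:.0f}% "
        f"(Handled with {stats['missing_method']} fill)",
        f"Duplicates: {stats['duplicates_removed']} rows removed",
        f"Outliers: {stats['outlier_pct']:.1f}% "
        f"{stats['outlier_action']} using {stats['outlier_method']}",
        "=" * len(header),  # footer rule matches the header width
    ]
    return "\n".join(lines)


# Sample figures copied from the example summary, not real measurements.
report = build_report(
    {
        "missing_pct": 12,
        "missing_method": "median",
        "duplicates_removed": 350,
        "outlier_pct": 2.3,
        "outlier_action": "clipped",
        "outlier_method": "IQR",
    }
)
print(report)
```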
5. Example Usage
import pandas as pd
from datacleaner_ai import Cleaner
# Load data
df = pd.read_csv("data.csv")
# Create cleaner with auto strategy
cleaner = Cleaner(strategy="auto", missing_values="median", imbalance="smote")
# Clean dataset
df_clean = cleaner.fit_transform(df)
# Get summary
print(cleaner.report())
6. Implementation Roadmap
MVP (Minimum Viable Product)
- Basic Cleaner class.
- Handle missing values (drop, mean, median).
- Remove duplicates.
- Generate basic report.
Phase 2
- Add class imbalance handling (SMOTE via imblearn).
- Add outlier detection & treatment.
- Expand missing value strategies (ffill, bfill).
- Add plotting (missing value heatmaps, class balance bar chart).
Phase 3 (Advanced Features)
- Integration with Scikit-learn pipelines (TransformerMixin).
- GUI/CLI tool for non-coders.
- Save/load cleaning strategies as JSON.
- Parallel processing for large datasets.
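The Phase 3 item on saving and loading cleaning strategies as JSON could look something like the sketch below, which uses only the standard library. `save_strategy`, `load_strategy`, and the file name are hypothetical; the dictionary keys mirror the Cleaner parameters listed earlier.

```python
import json
import os
import tempfile


def save_strategy(config, path):
    """Write a cleaning configuration to a JSON file."""
    with open(path, "w", encoding="utf-8") as fh:
        json.dump(config, fh, indent=2, sort_keys=True)


def load_strategy(path):
    """Read a cleaning configuration back from a JSON file."""
    with open(path, encoding="utf-8") as fh:
        return json.load(fh)


# Keys mirror the Cleaner parameters; None round-trips as JSON null.
config = {
    "strategy": "manual",
    "missing_values": "median",
    "duplicates": True,
    "imbalance": None,
    "outliers": "clip",
}
path = os.path.join(tempfile.mkdtemp(), "strategy.json")
save_strategy(config, path)
restored = load_strategy(path)
print(restored)
```

If the constructor accepts these keyword names, the restored dictionary could be passed along as `Cleaner(**restored)`.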
7. Tech Stack & Dependencies
- Core: pandas, numpy
- Optional (for imbalance): imblearn, installed as the imbalanced-learn package (SMOTE, RandomOverSampler, RandomUnderSampler)
- Visualization (optional): matplotlib, seaborn
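Because imblearn is optional, the library needs some way to check for it at runtime. The sketch below shows one common pattern using the standard library's `importlib.util.find_spec`; `has_optional` and `resolve_imbalance_method` are hypothetical helpers, and falling back to random oversampling is an assumed policy, not documented behavior.

```python
import importlib.util


def has_optional(module_name):
    """Return True if a top-level module can be imported, without importing it."""
    return importlib.util.find_spec(module_name) is not None


def resolve_imbalance_method(requested):
    """Fall back to random oversampling when SMOTE is requested but imblearn is absent."""
    if requested == "smote" and not has_optional("imblearn"):
        return "oversample"
    return requested


print(has_optional("imblearn"))
print(resolve_imbalance_method("smote"))
```

An alternative policy would be to raise an error that names the missing package, which avoids silently changing the requested method.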
8. Testing Strategy
- Unit tests with pytest.
- Test datasets (with known issues) stored in tests/data/.
- CI/CD integration (GitHub Actions) to auto-test on push.
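A unit test in this style might look like the sketch below. It builds a small in-memory dataset with known issues instead of reading from tests/data/, and it exercises plain pandas calls as stand-ins for the library's own functions, so the function names and the fixture are illustrative assumptions. pytest collects any function whose name starts with `test_`.

```python
import pandas as pd


def make_dirty_frame():
    """Small dataset with one missing value and one duplicated row."""
    return pd.DataFrame(
        {"age": [25.0, None, 40.0, 40.0], "city": ["A", "B", "C", "C"]}
    )


def test_duplicates_are_detected():
    df = make_dirty_frame()
    assert int(df.duplicated().sum()) == 1


def test_median_fill_leaves_no_missing_values():
    df = make_dirty_frame()
    filled = df.fillna(df.median(numeric_only=True))
    assert filled["age"].isna().sum() == 0
    # The median of 25, 40, 40 is 40, so the gap is filled with 40.
    assert filled.loc[1, "age"] == 40.0
```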
9. License & Publishing
- Use the MIT License (widely adopted for open-source projects).
- Publish to PyPI via twine.
- Provide docs in the README plus example Jupyter notebooks.
Download files
File details
Details for the file datacleanerx-0.1.0-py3-none-any.whl.
File metadata
- Download URL: datacleanerx-0.1.0-py3-none-any.whl
- Upload date:
- Size: 8.7 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.12.1
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 773c6c98641c88b4457a64872579b3c1f51d27a10cf1b72d96cf7331153909d7 |
| MD5 | bc51cfee90619e6ee936eadb5b94cc64 |
| BLAKE2b-256 | a52dbb9f6abbd8f6cccf90dd04526b347d0b9b59aa0ec552b27f50bf0897496d |