
Automated dataset cleaner for machine learning

Project description

📘 Dataset Cleaner for ML – Documentation & Implementation Guide

1. 📌 Overview

Dataset Cleaner for ML (datacleaner_ai) is a Python library designed to simplify the preprocessing pipeline by automatically detecting and fixing common data quality issues.

🔹 Features:

  • Detects:

    • Missing values
    • Duplicate rows
    • Class imbalance
    • Outliers
  • Cleans:

    • Handles missing values (drop, mean/median fill, forward/backward fill)
    • Removes duplicates
    • Balances classes (oversampling/undersampling)
    • Handles outliers (clip, remove, replace)
  • Easy to use:

    from datacleaner_ai import Cleaner
    cleaner = Cleaner(strategy="auto")
    df_clean = cleaner.fit_transform(df)
    
  • Generates summary reports about detected issues.


2. 🎯 Problem It Solves

  • Data scientists spend 60–80% of their time cleaning data before model training.
  • Beginners often forget key steps (like handling imbalance or outliers).
  • Current tools (like pandas) are flexible but not opinionated or automated.

This library reduces preprocessing to a single line of code, while still allowing custom strategies.


3. ๐Ÿ—๏ธ Library Architecture

datacleaner_ai/
│
├── datacleaner_ai/
│   ├── __init__.py
│   ├── cleaner.py          # Main Cleaner class
│   ├── detectors.py        # Functions to detect issues
│   ├── transformers.py     # Functions to clean data
│   ├── reports.py          # Summary report generator
│   └── utils.py            # Helper functions
│
├── tests/                  # Unit tests
│   ├── test_cleaner.py
│   ├── test_detectors.py
│   └── test_transformers.py
│
├── examples/               # Example notebooks/scripts
│
├── pyproject.toml (or setup.py)
├── README.md
├── LICENSE
└── requirements.txt

4. ⚙️ Core Components

🔹 Cleaner (Main API class)

  • Methods:

    • fit(df) → analyzes dataset
    • transform(df) → applies fixes
    • fit_transform(df) → runs both
    • report() → generates summary of issues
  • Parameters:

    • strategy : "auto" | "manual"
    • missing_values : "drop" | "mean" | "median" | "ffill" | "bfill"
    • duplicates : True | False
    • imbalance : "smote" | "undersample" | "oversample" | None
    • outliers : "clip" | "remove" | None
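
The API surface above can be sketched as a minimal class, assuming pandas DataFrames throughout. The internals shown here (the issues_ dict, the subset of strategies implemented) are illustrative assumptions, not the library's published implementation:

```python
import pandas as pd

class Cleaner:
    """Minimal sketch of the Cleaner API (missing values + duplicates only)."""

    def __init__(self, strategy="auto", missing_values="mean",
                 duplicates=True, imbalance=None, outliers=None):
        self.strategy = strategy
        self.missing_values = missing_values
        self.duplicates = duplicates
        self.imbalance = imbalance
        self.outliers = outliers
        self.issues_ = {}  # hypothetical attribute holding detected issues

    def fit(self, df):
        # Analyze the dataset and record detected issues.
        self.issues_["missing_pct"] = float(df.isna().mean().mean() * 100)
        self.issues_["duplicate_rows"] = int(df.duplicated().sum())
        return self

    def transform(self, df):
        out = df.copy()
        if self.duplicates:
            out = out.drop_duplicates()
        if self.missing_values == "drop":
            out = out.dropna()
        elif self.missing_values in ("mean", "median"):
            num = out.select_dtypes("number").columns
            # Fill numeric columns with their per-column mean/median.
            out[num] = out[num].fillna(getattr(out[num], self.missing_values)())
        elif self.missing_values in ("ffill", "bfill"):
            out = out.ffill() if self.missing_values == "ffill" else out.bfill()
        return out

    def fit_transform(self, df):
        return self.fit(df).transform(df)

    def report(self):
        return (f"Missing values: {self.issues_['missing_pct']:.1f}% | "
                f"Duplicates: {self.issues_['duplicate_rows']} rows")
```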

🔹 detectors.py

  • Functions:

    • detect_missing(df)
    • detect_duplicates(df)
    • detect_imbalance(df, target)
    • detect_outliers(df, method="zscore" | "iqr")
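
The detector signatures above might look like the following sketch; the return types and the default thresholds (z = 3, IQR factor 1.5) are assumptions for illustration:

```python
import pandas as pd

def detect_missing(df):
    # Fraction of missing values per column.
    return df.isna().mean()

def detect_duplicates(df):
    # Number of fully duplicated rows.
    return int(df.duplicated().sum())

def detect_imbalance(df, target, threshold=0.2):
    # Flag imbalance when the minority class frequency falls below threshold.
    freq = df[target].value_counts(normalize=True)
    return bool(freq.min() < threshold)

def detect_outliers(df, method="zscore", z=3.0, k=1.5):
    num = df.select_dtypes("number")
    if method == "zscore":
        scores = (num - num.mean()) / num.std(ddof=0)
        mask = scores.abs() > z
    else:  # "iqr"
        q1, q3 = num.quantile(0.25), num.quantile(0.75)
        iqr = q3 - q1
        mask = (num < q1 - k * iqr) | (num > q3 + k * iqr)
    # Boolean Series: True for rows where any numeric column is an outlier.
    return mask.any(axis=1)
```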

🔹 transformers.py

  • Functions:

    • handle_missing(df, method="mean")
    • remove_duplicates(df)
    • balance_classes(df, target, method="smote")
    • handle_outliers(df, method="clip")
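
These transformer functions could be sketched as below. Note the hedge: a real "smote" method would delegate to imbalanced-learn, whereas the balance_classes shown here uses naive random oversampling to keep the example dependency-free:

```python
import pandas as pd

def remove_duplicates(df):
    return df.drop_duplicates().reset_index(drop=True)

def handle_missing(df, method="mean"):
    out = df.copy()
    if method == "drop":
        return out.dropna()
    if method in ("mean", "median"):
        num = out.select_dtypes("number").columns
        out[num] = out[num].fillna(getattr(out[num], method)())
        return out
    return out.ffill() if method == "ffill" else out.bfill()

def handle_outliers(df, method="clip", k=1.5):
    out = df.copy()
    num = out.select_dtypes("number").columns
    q1, q3 = out[num].quantile(0.25), out[num].quantile(0.75)
    lo, hi = q1 - k * (q3 - q1), q3 + k * (q3 - q1)
    if method == "clip":
        # Clamp each numeric column to its own IQR-based bounds.
        out[num] = out[num].clip(lo, hi, axis=1)
    elif method == "remove":
        keep = ~((out[num] < lo) | (out[num] > hi)).any(axis=1)
        out = out[keep]
    return out

def balance_classes(df, target, method="oversample", random_state=0):
    # Naive stand-in for SMOTE: upsample every class to the majority size.
    n_max = df[target].value_counts().max()
    parts = [g.sample(n_max, replace=True, random_state=random_state)
             for _, g in df.groupby(target)]
    return pd.concat(parts).reset_index(drop=True)
```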

🔹 reports.py

  • Generates a readable summary:

    === Dataset Cleaner Report ===
    Missing values: 12% (Handled with median fill)
    Duplicates: 350 rows removed
    Class imbalance: SMOTE applied (positive class upsampled)
    Outliers: 2.3% clipped using IQR
    ==============================
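
One way such a report could be assembled from a dict of detected issues; the field names in the dict are hypothetical, chosen only to reproduce the sample output:

```python
def build_report(issues):
    """Render a dict of detected issues as the plain-text report above."""
    lines = ["=== Dataset Cleaner Report ==="]
    lines.append(f"Missing values: {issues['missing_pct']:.0f}% "
                 f"(Handled with {issues['missing_action']})")
    lines.append(f"Duplicates: {issues['duplicates']} rows removed")
    lines.append(f"Class imbalance: {issues['imbalance_action']}")
    lines.append(f"Outliers: {issues['outlier_pct']:.1f}% "
                 f"{issues['outlier_action']}")
    lines.append("=" * 30)
    return "\n".join(lines)
```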
    

5. 🚀 Example Usage

import pandas as pd
from datacleaner_ai import Cleaner

# Load data
df = pd.read_csv("data.csv")

# Create cleaner with auto strategy
cleaner = Cleaner(strategy="auto", missing_values="median", imbalance="smote")

# Clean dataset
df_clean = cleaner.fit_transform(df)

# Get summary
print(cleaner.report())

6. 🔧 Implementation Roadmap

✅ MVP (Minimum Viable Product)

  1. Basic Cleaner class.
  2. Handle missing values (drop, mean, median).
  3. Remove duplicates.
  4. Generate basic report.

🚀 Phase 2

  1. Add class imbalance handling (SMOTE via imblearn).
  2. Add outlier detection & treatment.
  3. Expand missing value strategies (ffill, bfill).
  4. Add plotting (missing value heatmaps, class balance bar chart).

🌟 Phase 3 (Advanced Features)

  1. Integration with Scikit-learn pipelines (TransformerMixin).
  2. GUI/CLI tool for non-coders.
  3. Save/load cleaning strategies as JSON.
  4. Parallel processing for large datasets.
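
The scikit-learn integration planned in item 1 could look roughly like this sketch: wrapping a subset of the cleaning logic in a TransformerMixin so it composes with Pipeline. The class name CleanerTransformer and the fill-only behavior are assumptions for illustration:

```python
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin

class CleanerTransformer(BaseEstimator, TransformerMixin):
    """Pipeline-compatible wrapper around a median/mean fill step."""

    def __init__(self, missing_values="median"):
        self.missing_values = missing_values

    def fit(self, X, y=None):
        # Learn per-column fill values from the training data only,
        # so the same values are reused at transform/predict time.
        num = X.select_dtypes("number")
        self.fill_values_ = getattr(num, self.missing_values)()
        return self

    def transform(self, X):
        return X.copy().fillna(self.fill_values_)
```

Because fit() learns fill values once, the transformer avoids train/test leakage when placed ahead of an estimator in a Pipeline.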

7. 📦 Tech Stack & Dependencies

  • Core: pandas, numpy
  • Optional (for imbalance): imbalanced-learn (imported as imblearn), providing SMOTE, RandomOverSampler, and RandomUnderSampler
  • Visualization (optional): matplotlib, seaborn

8. 🧪 Testing Strategy

  • Unit tests with pytest.
  • Test datasets (with known issues) stored in tests/data/.
  • CI/CD integration (GitHub Actions) to auto-test on push.
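
A pytest-style example of the testing pattern described above. The fixture is built inline rather than loaded from tests/data/, and the assertions exercise the cleaning operations directly, since the library's installed API may differ from this guide:

```python
import pandas as pd

def make_dirty_df():
    # Inline stand-in for a known-issues fixture in tests/data/.
    return pd.DataFrame({
        "age": [25.0, None, 30.0, 30.0],              # one missing value
        "income": [50_000, 60_000, 55_000, 55_000],   # rows 2 and 3 duplicated
    })

def test_missing_values_are_filled():
    df = make_dirty_df()
    filled = df.fillna(df.mean(numeric_only=True))
    assert filled.isna().sum().sum() == 0

def test_duplicates_are_removed():
    df = make_dirty_df()
    deduped = df.drop_duplicates()
    assert deduped.duplicated().sum() == 0
```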

9. 📄 License & Publishing

  • Use MIT License (widely adopted for open-source).
  • Publish to PyPI via twine.
  • Provide docs in README + Example Jupyter notebooks.
