
Automated dataset cleaner for machine learning

Project description

📘 Dataset Cleaner for ML – Documentation & Implementation Guide

1. 📌 Overview

Dataset Cleaner for ML (datacleaner_ai) is a Python library designed to simplify the preprocessing pipeline by automatically detecting and fixing common data quality issues.

🔹 Features:

  • Detects:

    • Missing values
    • Duplicate rows
    • Class imbalance
    • Outliers
  • Cleans:

    • Handles missing values (drop, mean/median fill, forward/backward fill)
    • Removes duplicates
    • Balances classes (oversampling/undersampling)
    • Handles outliers (clip, remove, replace)
  • Easy to use:

    from datacleaner_ai import Cleaner
    cleaner = Cleaner(strategy="auto")
    df_clean = cleaner.fit_transform(df)
    
  • Generates summary reports about detected issues.


2. 🎯 Problem It Solves

  • Data scientists spend 60–80% of their time cleaning data before model training.
  • Beginners often forget key steps (like handling imbalance or outliers).
  • Current tools (like pandas) are flexible but not opinionated or automated.

This library reduces preprocessing to a single line of code, while still allowing custom strategies.


3. ๐Ÿ—๏ธ Library Architecture

datacleaner_ai/
│
├── datacleaner_ai/
│   ├── __init__.py
│   ├── cleaner.py          # Main Cleaner class
│   ├── detectors.py        # Functions to detect issues
│   ├── transformers.py     # Functions to clean data
│   ├── reports.py          # Summary report generator
│   └── utils.py            # Helper functions
│
├── tests/                  # Unit tests
│   ├── test_cleaner.py
│   ├── test_detectors.py
│   └── test_transformers.py
│
├── examples/               # Example notebooks/scripts
│
├── pyproject.toml (or setup.py)
├── README.md
├── LICENSE
└── requirements.txt

4. ⚙️ Core Components

🔹 Cleaner (Main API class)

  • Methods:

    • fit(df) → analyzes dataset
    • transform(df) → applies fixes
    • fit_transform(df) → runs both
    • report() → generates summary of issues
  • Parameters:

    • strategy : "auto" | "manual"
    • missing_values : "drop" | "mean" | "median" | "ffill" | "bfill"
    • duplicates : True | False
    • imbalance : "smote" | "undersample" | "oversample" | None
    • outliers : "clip" | "remove" | None
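
The API surface above can be sketched as a minimal class, assuming pandas DataFrames throughout. The internals shown here (the issues_ dict, the subset of strategies implemented) are illustrative assumptions, not the library's published implementation:

```python
import pandas as pd

class Cleaner:
    """Minimal sketch of the Cleaner API (missing values + duplicates only)."""

    def __init__(self, strategy="auto", missing_values="mean",
                 duplicates=True, imbalance=None, outliers=None):
        self.strategy = strategy
        self.missing_values = missing_values
        self.duplicates = duplicates
        self.imbalance = imbalance
        self.outliers = outliers
        self.issues_ = {}  # hypothetical attribute holding detected issues

    def fit(self, df):
        # Analyze the dataset and record detected issues.
        self.issues_["missing_pct"] = float(df.isna().mean().mean() * 100)
        self.issues_["duplicate_rows"] = int(df.duplicated().sum())
        return self

    def transform(self, df):
        out = df.copy()
        if self.duplicates:
            out = out.drop_duplicates()
        if self.missing_values == "drop":
            out = out.dropna()
        elif self.missing_values in ("mean", "median"):
            num = out.select_dtypes("number").columns
            # Fill numeric columns with their per-column mean/median.
            out[num] = out[num].fillna(getattr(out[num], self.missing_values)())
        elif self.missing_values in ("ffill", "bfill"):
            out = out.ffill() if self.missing_values == "ffill" else out.bfill()
        return out

    def fit_transform(self, df):
        return self.fit(df).transform(df)

    def report(self):
        return (f"Missing values: {self.issues_['missing_pct']:.1f}% | "
                f"Duplicates: {self.issues_['duplicate_rows']} rows")
```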

🔹 detectors.py

  • Functions:

    • detect_missing(df)
    • detect_duplicates(df)
    • detect_imbalance(df, target)
    • detect_outliers(df, method="zscore" | "iqr")
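
The detector signatures above might look like the following sketch; the return types and the default thresholds (z = 3, IQR factor 1.5) are assumptions for illustration:

```python
import pandas as pd

def detect_missing(df):
    # Fraction of missing values per column.
    return df.isna().mean()

def detect_duplicates(df):
    # Number of fully duplicated rows.
    return int(df.duplicated().sum())

def detect_imbalance(df, target, threshold=0.2):
    # Flag imbalance when the minority class frequency falls below threshold.
    freq = df[target].value_counts(normalize=True)
    return bool(freq.min() < threshold)

def detect_outliers(df, method="zscore", z=3.0, k=1.5):
    num = df.select_dtypes("number")
    if method == "zscore":
        scores = (num - num.mean()) / num.std(ddof=0)
        mask = scores.abs() > z
    else:  # "iqr"
        q1, q3 = num.quantile(0.25), num.quantile(0.75)
        iqr = q3 - q1
        mask = (num < q1 - k * iqr) | (num > q3 + k * iqr)
    # Boolean Series: True for rows where any numeric column is an outlier.
    return mask.any(axis=1)
```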

🔹 transformers.py

  • Functions:

    • handle_missing(df, method="mean")
    • remove_duplicates(df)
    • balance_classes(df, target, method="smote")
    • handle_outliers(df, method="clip")
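
These transformer functions could be sketched as below. Note the hedge: a real "smote" method would delegate to imbalanced-learn, whereas the balance_classes shown here uses naive random oversampling to keep the example dependency-free:

```python
import pandas as pd

def remove_duplicates(df):
    return df.drop_duplicates().reset_index(drop=True)

def handle_missing(df, method="mean"):
    out = df.copy()
    if method == "drop":
        return out.dropna()
    if method in ("mean", "median"):
        num = out.select_dtypes("number").columns
        out[num] = out[num].fillna(getattr(out[num], method)())
        return out
    return out.ffill() if method == "ffill" else out.bfill()

def handle_outliers(df, method="clip", k=1.5):
    out = df.copy()
    num = out.select_dtypes("number").columns
    q1, q3 = out[num].quantile(0.25), out[num].quantile(0.75)
    lo, hi = q1 - k * (q3 - q1), q3 + k * (q3 - q1)
    if method == "clip":
        # Clamp each numeric column to its own IQR-based bounds.
        out[num] = out[num].clip(lo, hi, axis=1)
    elif method == "remove":
        keep = ~((out[num] < lo) | (out[num] > hi)).any(axis=1)
        out = out[keep]
    return out

def balance_classes(df, target, method="oversample", random_state=0):
    # Naive stand-in for SMOTE: upsample every class to the majority size.
    n_max = df[target].value_counts().max()
    parts = [g.sample(n_max, replace=True, random_state=random_state)
             for _, g in df.groupby(target)]
    return pd.concat(parts).reset_index(drop=True)
```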

🔹 reports.py

  • Generates a readable summary:

    === Dataset Cleaner Report ===
    Missing values: 12% (Handled with median fill)
    Duplicates: 350 rows removed
    Class imbalance: SMOTE applied (positive class upsampled)
    Outliers: 2.3% clipped using IQR
    ==============================
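
One way such a report could be assembled from a dict of detected issues; the field names in the dict are hypothetical, chosen only to reproduce the sample output:

```python
def build_report(issues):
    """Render a dict of detected issues as the plain-text report above."""
    lines = ["=== Dataset Cleaner Report ==="]
    lines.append(f"Missing values: {issues['missing_pct']:.0f}% "
                 f"(Handled with {issues['missing_action']})")
    lines.append(f"Duplicates: {issues['duplicates']} rows removed")
    lines.append(f"Class imbalance: {issues['imbalance_action']}")
    lines.append(f"Outliers: {issues['outlier_pct']:.1f}% "
                 f"{issues['outlier_action']}")
    lines.append("=" * 30)
    return "\n".join(lines)
```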
    

5. 🚀 Example Usage

import pandas as pd
from datacleaner_ai import Cleaner

# Load data
df = pd.read_csv("data.csv")

# Create cleaner with auto strategy
cleaner = Cleaner(strategy="auto", missing_values="median", imbalance="smote")

# Clean dataset
df_clean = cleaner.fit_transform(df)

# Get summary
print(cleaner.report())

6. 🔧 Implementation Roadmap

✅ MVP (Minimum Viable Product)

  1. Basic Cleaner class.
  2. Handle missing values (drop, mean, median).
  3. Remove duplicates.
  4. Generate basic report.

🚀 Phase 2

  1. Add class imbalance handling (SMOTE via imblearn).
  2. Add outlier detection & treatment.
  3. Expand missing value strategies (ffill, bfill).
  4. Add plotting (missing value heatmaps, class balance bar chart).

🌟 Phase 3 (Advanced Features)

  1. Integration with Scikit-learn pipelines (TransformerMixin).
  2. GUI/CLI tool for non-coders.
  3. Save/load cleaning strategies as JSON.
  4. Parallel processing for large datasets.
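
The scikit-learn integration planned in item 1 could look roughly like this sketch: wrapping a subset of the cleaning logic in a TransformerMixin so it composes with Pipeline. The class name CleanerTransformer and the fill-only behavior are assumptions for illustration:

```python
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin

class CleanerTransformer(BaseEstimator, TransformerMixin):
    """Pipeline-compatible wrapper around a median/mean fill step."""

    def __init__(self, missing_values="median"):
        self.missing_values = missing_values

    def fit(self, X, y=None):
        # Learn per-column fill values from the training data only,
        # so the same values are reused at transform/predict time.
        num = X.select_dtypes("number")
        self.fill_values_ = getattr(num, self.missing_values)()
        return self

    def transform(self, X):
        return X.copy().fillna(self.fill_values_)
```

Because fit() learns fill values once, the transformer avoids train/test leakage when placed ahead of an estimator in a Pipeline.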

7. 📦 Tech Stack & Dependencies

  • Core: pandas, numpy
  • Optional (for imbalance): imbalanced-learn (imported as imblearn), providing SMOTE, RandomOverSampler, and RandomUnderSampler
  • Visualization (optional): matplotlib, seaborn

8. 🧪 Testing Strategy

  • Unit tests with pytest.
  • Test datasets (with known issues) stored in tests/data/.
  • CI/CD integration (GitHub Actions) to auto-test on push.
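
A pytest-style example of the testing pattern described above. The fixture is built inline rather than loaded from tests/data/, and the assertions exercise the cleaning operations directly, since the library's installed API may differ from this guide:

```python
import pandas as pd

def make_dirty_df():
    # Inline stand-in for a known-issues fixture in tests/data/.
    return pd.DataFrame({
        "age": [25.0, None, 30.0, 30.0],              # one missing value
        "income": [50_000, 60_000, 55_000, 55_000],   # rows 2 and 3 duplicated
    })

def test_missing_values_are_filled():
    df = make_dirty_df()
    filled = df.fillna(df.mean(numeric_only=True))
    assert filled.isna().sum().sum() == 0

def test_duplicates_are_removed():
    df = make_dirty_df()
    deduped = df.drop_duplicates()
    assert deduped.duplicated().sum() == 0
```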

9. 📄 License & Publishing

  • Use MIT License (widely adopted for open-source).
  • Publish to PyPI via twine.
  • Provide docs in README + Example Jupyter notebooks.
