
Automated dataset cleaner for machine learning

Project description

DataCleanerX

Python 3.7+ | License: MIT

DataCleanerX is an intelligent Python library that automates the data preprocessing pipeline for machine learning projects. Spend less time cleaning data and more time building models.

🚀 Quick Start

pip install datacleanerx
from datacleaner_ai import Cleaner
import pandas as pd

# Load your data
df = pd.read_csv("data.csv")

# Clean in one line
cleaner = Cleaner(strategy="auto")
df_clean = cleaner.fit_transform(df)

# View cleaning report
print(cleaner.report())

✨ Features

DataCleanerX automatically detects and resolves common data quality issues:

Detection Capabilities

  • Missing Values - Identifies null, NaN, and empty entries
  • Duplicate Rows - Finds exact and near-duplicate records
  • Class Imbalance - Detects imbalanced target distributions
  • Outliers - Identifies statistical anomalies using Z-score or IQR methods
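The IQR rule mentioned above is a standard technique; a minimal sketch in plain pandas (illustrative only, not DataCleanerX's internal code) flags values outside `[Q1 - 1.5*IQR, Q3 + 1.5*IQR]`:

```python
import pandas as pd

def iqr_outlier_mask(series, k=1.5):
    """Boolean mask marking values outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, q3 = series.quantile(0.25), series.quantile(0.75)
    iqr = q3 - q1
    return (series < q1 - k * iqr) | (series > q3 + k * iqr)

s = pd.Series([1, 2, 3, 4, 5, 100])
print(s[iqr_outlier_mask(s)].tolist())  # the extreme value 100 is flagged
```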

Cleaning Operations

  • Missing Data Handling - Drop, mean/median imputation, forward/backward fill
  • Deduplication - Removes duplicate entries while preserving data integrity
  • Class Balancing - SMOTE oversampling, random under/oversampling
  • Outlier Treatment - Clipping, removal, or replacement strategies
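Most of these operations correspond to standard pandas idioms. A rough sketch of what "median imputation plus deduplication" amounts to (illustrative, not the library's internals):

```python
import pandas as pd

df = pd.DataFrame({"a": [1.0, None, 3.0, 3.0], "b": [10, 20, 30, 30]})

# Median imputation for numeric columns
df_filled = df.fillna(df.median(numeric_only=True))

# Drop exact duplicate rows
df_clean = df_filled.drop_duplicates()

print(df_clean)
```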

Reporting

  • Comprehensive summary of all detected issues
  • Actionable insights on cleaning operations performed
  • Statistics on data quality improvements

📖 Why DataCleanerX?

Data scientists are commonly estimated to spend 60-80% of their time on data cleaning before model training. DataCleanerX addresses this by:

  • Automating repetitive tasks - One-line data cleaning with intelligent defaults
  • Preventing common mistakes - Catches issues beginners often miss
  • Maintaining flexibility - Fully customizable strategies for advanced users
  • Providing transparency - Detailed reports on all operations performed

🎯 Use Cases

  • Rapid Prototyping - Quickly clean datasets for exploratory analysis
  • Production Pipelines - Integrate into automated ML workflows
  • Educational Projects - Learn data cleaning best practices
  • Competition Prep - Fast preprocessing for Kaggle competitions

📚 Documentation

Basic Usage

from datacleaner_ai import Cleaner
import pandas as pd

df = pd.read_csv("messy_data.csv")

# Automatic cleaning with defaults
cleaner = Cleaner(strategy="auto")
df_clean = cleaner.fit_transform(df)

Custom Configuration

# Fine-tune cleaning strategies
cleaner = Cleaner(
    strategy="manual",
    missing_values="median",
    duplicates=True,
    imbalance="smote",
    outliers="clip"
)

df_clean = cleaner.fit_transform(df, target_column="label")
print(cleaner.report())

API Reference

Cleaner Class

Parameters:

  • strategy (str): "auto" or "manual" - Cleaning approach
  • missing_values (str): "drop", "mean", "median", "ffill", "bfill"
  • duplicates (bool): Whether to remove duplicate rows
  • imbalance (str): "smote", "oversample", "undersample", or None
  • outliers (str): "clip", "remove", or None

Methods:

  • fit(df) - Analyze dataset and identify issues
  • transform(df) - Apply cleaning operations
  • fit_transform(df) - Analyze and clean in one step
  • report() - Generate detailed cleaning summary
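The fit/transform split above follows the familiar scikit-learn convention: `fit` inspects, `transform` applies. A toy skeleton of how such a class could be structured (hypothetical, not the actual implementation):

```python
import pandas as pd

class ToyCleaner:
    """Minimal fit/transform/report skeleton in the spirit of Cleaner."""

    def fit(self, df):
        # Record issues that are cheap to detect.
        self.n_missing_ = int(df.isna().sum().sum())
        self.n_duplicates_ = int(df.duplicated().sum())
        return self

    def transform(self, df):
        out = df.drop_duplicates()
        return out.fillna(out.median(numeric_only=True))

    def fit_transform(self, df):
        return self.fit(df).transform(df)

    def report(self):
        return (f"missing cells: {self.n_missing_}, "
                f"duplicate rows: {self.n_duplicates_}")

df = pd.DataFrame({"x": [1.0, None, 2.0, 2.0]})
c = ToyCleaner()
clean = c.fit_transform(df)
print(c.report())  # missing cells: 1, duplicate rows: 1
```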

Example Report Output

=== DataCleanerX Cleaning Report ===
Dataset: 10,000 rows × 15 columns

Issues Detected:
✓ Missing values: 1,845 cells (1.2%)
✓ Duplicate rows: 350 (3.5%)
✓ Class imbalance: 1:7 ratio detected
✓ Outliers: 2.3% (345 values)

Actions Taken:
→ Missing values: Median imputation applied
→ Duplicates: 350 rows removed
→ Class balance: SMOTE oversampling (minority class: 1,200 → 6,850)
→ Outliers: Clipped using IQR method

Final Dataset: 15,300 rows × 15 columns
====================================

🏗️ Architecture

datacleanerx/
├── cleaner.py          # Main Cleaner class
├── detectors.py        # Issue detection functions
├── transformers.py     # Data cleaning operations
├── reports.py          # Report generation
└── utils.py            # Helper utilities

🛠️ Advanced Features

Integration with scikit-learn (planned for a future release)

from sklearn.pipeline import Pipeline
from datacleaner_ai import Cleaner
from sklearn.ensemble import RandomForestClassifier

pipeline = Pipeline([
    ('cleaner', Cleaner(strategy="auto")),
    ('classifier', RandomForestClassifier())
])

pipeline.fit(X_train, y_train)
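For the planned pipeline support, a cleaner step would need to follow scikit-learn's estimator interface (`fit(X, y)` / `transform(X)`). A hypothetical wrapper sketch, assuming a simple median-imputation step:

```python
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin

class PipelineCleaner(BaseEstimator, TransformerMixin):
    """Hypothetical sklearn-compatible cleaner: median-imputes numeric columns."""

    def fit(self, X, y=None):
        # Learn imputation values from the training split only,
        # so no information leaks from test data.
        self.medians_ = X.median(numeric_only=True)
        return self

    def transform(self, X):
        return X.fillna(self.medians_)

df = pd.DataFrame({"a": [1.0, None, 3.0]})
print(PipelineCleaner().fit_transform(df)["a"].tolist())  # [1.0, 2.0, 3.0]
```

Learning the medians in `fit` and reusing them in `transform` is what makes the step safe to embed in a cross-validated pipeline.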

Handling Specific Data Types

# For time series data
cleaner = Cleaner(missing_values="ffill")

# For categorical data
cleaner = Cleaner(missing_values="mode")

# For mixed data types
cleaner = Cleaner(strategy="auto")  # Automatically detects types
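For categorical columns, "mode" imputation means filling with the most frequent value. In plain pandas that corresponds roughly to (illustrative sketch):

```python
import pandas as pd

df = pd.DataFrame({"color": ["red", None, "red", "blue"]})

# Fill each column with its most frequent value
df_filled = df.fillna(df.mode().iloc[0])
print(df_filled["color"].tolist())  # ['red', 'red', 'red', 'blue']
```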

📦 Dependencies

Core Requirements:

  • pandas >= 1.0.0
  • numpy >= 1.18.0

Optional (for advanced features):

  • imbalanced-learn >= 0.8.0 (for SMOTE)
  • scikit-learn >= 0.24.0 (for pipeline integration)
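SMOTE itself requires the optional imbalanced-learn dependency, but the simpler random-oversampling strategy can be sketched with scikit-learn alone (a standalone illustration, independent of DataCleanerX):

```python
import pandas as pd
from sklearn.utils import resample

df = pd.DataFrame({"feature": range(10),
                   "label": [0] * 8 + [1] * 2})

minority = df[df["label"] == 1]
majority = df[df["label"] == 0]

# Oversample the minority class (with replacement) to match the majority size
minority_up = resample(minority, replace=True,
                       n_samples=len(majority), random_state=0)
balanced = pd.concat([majority, minority_up])
print(balanced["label"].value_counts().to_dict())  # {0: 8, 1: 8}
```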

🧪 Development

Running Tests

git clone https://github.com/SatyamSingh8306/datacleanerx.git
cd datacleanerx
pip install -e ".[dev]"
pytest tests/

Contributing

We welcome contributions! Please see CONTRIBUTING.md for guidelines.

  1. Fork the repository
  2. Create a feature branch (git checkout -b feature/amazing-feature)
  3. Commit your changes (git commit -m 'Add amazing feature')
  4. Push to the branch (git push origin feature/amazing-feature)
  5. Open a Pull Request

📝 Roadmap

Version 0.2.0 (Upcoming)

  • Visualization tools (missing value heatmaps, distribution plots)
  • Save/load cleaning configurations as JSON
  • Enhanced outlier detection algorithms
  • Support for text data preprocessing

Version 0.3.0 (Future)

  • CLI tool for non-Python users
  • Parallel processing for large datasets
  • Auto-tuning of cleaning strategies
  • Integration with popular ML frameworks

📄 License

This project is licensed under the MIT License - see the LICENSE file for details.

🙏 Acknowledgments

Built with inspiration from the data science community's need for faster, more reliable preprocessing tools.

📧 Contact & Support


Made with ❤️ by Satyam Singh

Star ⭐ this repo if you find it useful!

Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

datacleanerx-0.2.1.tar.gz (11.3 kB)

Uploaded Source

Built Distribution

If you're not sure about the file name format, learn more about wheel file names.

datacleanerx-0.2.1-py3-none-any.whl (9.6 kB)

Uploaded Python 3

File details

Details for the file datacleanerx-0.2.1.tar.gz.

File metadata

  • Download URL: datacleanerx-0.2.1.tar.gz
  • Upload date:
  • Size: 11.3 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.2.0 CPython/3.12.1

File hashes

Hashes for datacleanerx-0.2.1.tar.gz
Algorithm Hash digest
SHA256 790c26cb972300608e85f8339f81f227d83b6c046627782e32f49364fe50caed
MD5 09ec6ee18137883d44c6a96e64987542
BLAKE2b-256 52799fe6ca9e8eec1f28d5b2868c409db12755f2d10ed63b3a7571c8eff2874c

See more details on using hashes here.

File details

Details for the file datacleanerx-0.2.1-py3-none-any.whl.

File metadata

  • Download URL: datacleanerx-0.2.1-py3-none-any.whl
  • Upload date:
  • Size: 9.6 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.2.0 CPython/3.12.1

File hashes

Hashes for datacleanerx-0.2.1-py3-none-any.whl
Algorithm Hash digest
SHA256 869ebe5b6120982a082eaea6dabeeb384035a52559bf045dd6086052746821e0
MD5 eda0e4a651a28f953f17395032f5421e
BLAKE2b-256 2e746283a1c96d634f56212681f2ed1e76bdb3a6ab7d45fa721090509dae6d37

See more details on using hashes here.
