
Automated dataset cleaner for machine learning

Project description

DataCleanerX

Python 3.7+ | License: MIT

DataCleanerX is an intelligent Python library that automates the data preprocessing pipeline for machine learning projects. Spend less time cleaning data and more time building models.

🚀 Quick Start

pip install datacleanerx==0.1.1

from datacleanerx import Cleaner
import pandas as pd

# Load your data
df = pd.read_csv("data.csv")

# Clean in one line
cleaner = Cleaner(strategy="auto")
df_clean = cleaner.fit_transform(df)

# View cleaning report
print(cleaner.report())

✨ Features

DataCleanerX automatically detects and resolves common data quality issues:

Detection Capabilities

  • Missing Values - Identifies null, NaN, and empty entries
  • Duplicate Rows - Finds exact and near-duplicate records
  • Class Imbalance - Detects imbalanced target distributions
  • Outliers - Identifies statistical anomalies using Z-score or IQR methods
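The detection checks listed above can be approximated with plain pandas. The following is a minimal sketch, not DataCleanerX's actual implementation: the `detect_issues` helper, its `z_thresh` parameter, and the 1.5 × IQR fences are assumptions chosen for illustration, and near-duplicate and class-imbalance detection are not covered.

```python
import numpy as np
import pandas as pd


def detect_issues(df: pd.DataFrame, z_thresh: float = 3.0) -> dict:
    """Hypothetical helper: count common data quality issues in a DataFrame."""
    numeric = df.select_dtypes(include="number")

    # IQR rule: values outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR] count as outliers.
    q1, q3 = numeric.quantile(0.25), numeric.quantile(0.75)
    iqr = q3 - q1
    iqr_mask = (numeric < q1 - 1.5 * iqr) | (numeric > q3 + 1.5 * iqr)

    # Z-score rule: values more than z_thresh standard deviations from the mean.
    z = (numeric - numeric.mean()) / numeric.std(ddof=0)
    z_mask = z.abs() > z_thresh

    return {
        "missing_cells": int(df.isna().sum().sum()),    # null/NaN only, not empty strings
        "duplicate_rows": int(df.duplicated().sum()),   # exact duplicates only
        "iqr_outliers": int(iqr_mask.sum().sum()),
        "zscore_outliers": int(z_mask.sum().sum()),
    }


sample = pd.DataFrame({
    "age": [25, 25, 31, 29, 27, 26, 30, 28, 400, np.nan],
    "city": ["A", "A", "B", "B", "C", "C", "A", "B", "C", "A"],
})
issues = detect_issues(sample)
print(issues)
```

On this small sample the IQR rule flags the value 400 while the Z-score rule does not, because a single extreme value inflates the standard deviation of a short column. That difference is one reason a library might offer both methods.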

Cleaning Operations

  • Missing Data Handling - Drop, mean/median imputation, forward/backward fill
  • Deduplication - Removes duplicate entries while preserving data integrity
  • Class Balancing - SMOTE oversampling, random under/oversampling
  • Outlier Treatment - Clipping, removal, or replacement strategies
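To make these operations concrete, here is a rough pandas equivalent of deduplication, median imputation, and IQR clipping. It is a sketch under stated assumptions: `clean_basic` is a hypothetical helper, not part of DataCleanerX, and the order of steps and the 1.5 × IQR fences are illustrative choices.

```python
import numpy as np
import pandas as pd


def clean_basic(df: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical helper: dedupe, median-impute, and IQR-clip numeric columns."""
    # Drop exact duplicate rows and renumber the index.
    out = df.drop_duplicates().reset_index(drop=True)

    for col in out.select_dtypes(include="number").columns:
        # Median imputation for missing numeric values.
        filled = out[col].fillna(out[col].median())

        # Clip values to the IQR fences instead of removing rows.
        q1, q3 = filled.quantile(0.25), filled.quantile(0.75)
        iqr = q3 - q1
        out[col] = filled.clip(lower=q1 - 1.5 * iqr, upper=q3 + 1.5 * iqr)

    return out


sample = pd.DataFrame({
    "age": [25, 25, 31, 29, 27, 26, 30, 28, 400, np.nan],
    "city": ["A", "A", "B", "B", "C", "C", "A", "B", "C", "A"],
})
cleaned = clean_basic(sample)
print(cleaned)
```

Clipping keeps every row, which matters when the dataset is small; removal would be the alternative when extreme values are known to be recording errors.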

Reporting

  • Comprehensive summary of all detected issues
  • Actionable insights on cleaning operations performed
  • Statistics on data quality improvements

📖 Why DataCleanerX?

Data scientists are commonly estimated to spend 60-80% of their time cleaning data before model training. DataCleanerX addresses this by:

  • Automating repetitive tasks - One-line data cleaning with intelligent defaults
  • Preventing common mistakes - Catches issues beginners often miss
  • Maintaining flexibility - Fully customizable strategies for advanced users
  • Providing transparency - Detailed reports on all operations performed

🎯 Use Cases

  • Rapid Prototyping - Quickly clean datasets for exploratory analysis
  • Production Pipelines - Integrate into automated ML workflows
  • Educational Projects - Learn data cleaning best practices
  • Competition Prep - Fast preprocessing for Kaggle competitions

📚 Documentation

Basic Usage

from datacleanerx import Cleaner
import pandas as pd

df = pd.read_csv("messy_data.csv")

# Automatic cleaning with defaults
cleaner = Cleaner(strategy="auto")
df_clean = cleaner.fit_transform(df)

Custom Configuration

# Fine-tune cleaning strategies
cleaner = Cleaner(
    strategy="manual",
    missing_values="median",
    duplicates=True,
    imbalance="smote",
    outliers="clip"
)

df_clean = cleaner.fit_transform(df, target_column="label")
print(cleaner.report())

API Reference

Cleaner Class

Parameters:

  • strategy (str): "auto" or "manual" - Cleaning approach
  • missing_values (str): "drop", "mean", "median", "ffill", "bfill"
  • duplicates (bool): Whether to remove duplicate rows
  • imbalance (str): "smote", "oversample", "undersample", or None
  • outliers (str): "clip", "remove", or None
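As an illustration of what an option such as imbalance="oversample" typically means, the sketch below performs random oversampling with pandas alone. The `random_oversample` function and its `random_state` argument are hypothetical and are not taken from the DataCleanerX source; SMOTE, by contrast, synthesizes new points and is usually provided by imbalanced-learn.

```python
import pandas as pd


def random_oversample(df: pd.DataFrame, target_column: str, random_state: int = 0) -> pd.DataFrame:
    """Hypothetical helper: resample every class up to the size of the largest one."""
    counts = df[target_column].value_counts()
    max_count = counts.max()

    parts = []
    for label, count in counts.items():
        group = df[df[target_column] == label]
        # Draw the missing rows with replacement from the same class.
        extra = group.sample(n=max_count - count, replace=True, random_state=random_state)
        parts.append(pd.concat([group, extra]))

    return pd.concat(parts, ignore_index=True)


data = pd.DataFrame({
    "feature": [1, 2, 3, 4, 5, 6, 7, 8],
    "label": [0, 0, 0, 0, 0, 0, 1, 1],
})
balanced = random_oversample(data, target_column="label")
print(balanced["label"].value_counts())
```

Resampling should normally be applied to the training split only, so that duplicated rows do not leak into validation data.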

Methods:

  • fit(df) - Analyze dataset and identify issues
  • transform(df) - Apply cleaning operations
  • fit_transform(df) - Analyze and clean in one step (the Custom Configuration example also passes target_column)
  • report() - Generate detailed cleaning summary
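The split between fit and transform follows a common pattern: fit learns statistics from one dataset, and transform applies them to the same or another dataset. The sketch below shows that pattern with a deliberately tiny class. `MedianFiller` is a hypothetical stand-in written for this example and says nothing about how Cleaner is implemented internally.

```python
import numpy as np
import pandas as pd


class MedianFiller:
    """Hypothetical stand-in for the fit/transform pattern, not the library's Cleaner."""

    def fit(self, df: pd.DataFrame) -> "MedianFiller":
        # Learn per-column medians from the data passed to fit.
        self.medians_ = df.select_dtypes(include="number").median()
        return self

    def transform(self, df: pd.DataFrame) -> pd.DataFrame:
        # Apply the learned medians; the input frame is not modified.
        return df.fillna(self.medians_)

    def fit_transform(self, df: pd.DataFrame) -> pd.DataFrame:
        return self.fit(df).transform(df)


train = pd.DataFrame({"x": [1.0, 2.0, 3.0, np.nan]})
new = pd.DataFrame({"x": [np.nan, 10.0]})

filler = MedianFiller().fit(train)
filled_new = filler.transform(new)
print(filled_new)
```

Here the missing value in `new` is filled with the median learned from `train`, which is the behavior that keeps test data from influencing the statistics used during cleaning.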

Example Report Output

=== DataCleanerX Cleaning Report ===
Dataset: 10,000 rows × 15 columns

Issues Detected:
✓ Missing values: 1.2% (1,845 cells)
✓ Duplicate rows: 350 (3.5%)
✓ Class imbalance: 1:7 ratio detected
✓ Outliers: 2.3% (345 values)

Actions Taken:
→ Missing values: Median imputation applied
→ Duplicates: 350 rows removed
→ Class balance: SMOTE oversampling (minority class: 1,200 → 6,850)
→ Outliers: Clipped using IQR method

Final Dataset: 15,300 rows × 15 columns
====================================

🏗️ Architecture

datacleanerx/
├── cleaner.py          # Main Cleaner class
├── detectors.py        # Issue detection functions
├── transformers.py     # Data cleaning operations
├── reports.py          # Report generation
└── utils.py            # Helper utilities

🛠️ Advanced Features

Integration with scikit-learn (planned for a future version)

from sklearn.pipeline import Pipeline
from datacleanerx import Cleaner
from sklearn.ensemble import RandomForestClassifier

pipeline = Pipeline([
    ('cleaner', Cleaner(strategy="auto")),
    ('classifier', RandomForestClassifier())
])

# X_train and y_train are your own training features and labels
pipeline.fit(X_train, y_train)

Handling Specific Data Types

# For time series data
cleaner = Cleaner(missing_values="ffill")

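# For categorical data
# Note: "mode" is not among the missing_values options listed in the API Reference; verify support before use
cleaner = Cleaner(missing_values="mode")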

# For mixed data types
cleaner = Cleaner(strategy="auto")  # Automatically detects types
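For readers unfamiliar with these strategies, the sketch below shows what forward fill and mode imputation do in plain pandas. It uses only pandas calls and made-up sample data, and it is an illustration of the techniques rather than of DataCleanerX internals.

```python
import numpy as np
import pandas as pd

# Time series: forward fill carries the last observed value forward in time.
ts = pd.Series(
    [1.0, np.nan, np.nan, 4.0],
    index=pd.date_range("2024-01-01", periods=4, freq="D"),
)
ts_filled = ts.ffill()

# Categorical data: fill gaps with the most frequent category (the mode).
colors = pd.Series(["red", None, "blue", "red"])
colors_filled = colors.fillna(colors.mode().iloc[0])

print(ts_filled.tolist())
print(colors_filled.tolist())
```

Forward fill suits ordered data because it never uses future values, whereas a mean or median would mix information from across the whole series.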

📦 Dependencies

Core Requirements:

  • pandas >= 1.0.0
  • numpy >= 1.18.0

Optional (for advanced features):

  • imbalanced-learn >= 0.8.0 (for SMOTE)
  • scikit-learn >= 0.24.0 (for pipeline integration)

🧪 Development

Running Tests

git clone https://github.com/SatyamSingh8306/datacleanerx.git
cd datacleanerx
pip install -e ".[dev]"
pytest tests/

Contributing

We welcome contributions! Please see CONTRIBUTING.md for guidelines.

  1. Fork the repository
  2. Create a feature branch (git checkout -b feature/amazing-feature)
  3. Commit your changes (git commit -m 'Add amazing feature')
  4. Push to the branch (git push origin feature/amazing-feature)
  5. Open a Pull Request

📝 Roadmap

Version 0.2.0 (Upcoming)

  • Visualization tools (missing value heatmaps, distribution plots)
  • Save/load cleaning configurations as JSON
  • Enhanced outlier detection algorithms
  • Support for text data preprocessing

Version 0.3.0 (Future)

  • CLI tool for non-Python users
  • Parallel processing for large datasets
  • Auto-tuning of cleaning strategies
  • Integration with popular ML frameworks

📄 License

This project is licensed under the MIT License - see the LICENSE file for details.

🙏 Acknowledgments

Built with inspiration from the data science community's need for faster, more reliable preprocessing tools.

📧 Contact & Support


Made by Satyam Singh

Star ⭐ this repo if you find it useful!
