Automated dataset cleaner for machine learning

These details have not been verified by PyPI

Project links

Homepage

License
- OSI Approved :: MIT License
Operating System
- OS Independent
Programming Language
- Python :: 3

Project description

DataCleanerX

DataCleanerX is an intelligent Python library that automates the data preprocessing pipeline for machine learning projects. Spend less time cleaning data and more time building models.

🚀 Quick Start

pip install datacleanerx==0.1.1

from datacleanerx import Cleaner
import pandas as pd

# Load your data
df = pd.read_csv("data.csv")

# Clean in one line
cleaner = Cleaner(strategy="auto")
df_clean = cleaner.fit_transform(df)

# View cleaning report
print(cleaner.report())

✨ Features

DataCleanerX automatically detects and resolves common data quality issues:

Detection Capabilities

Missing Values - Identifies null, NaN, and empty entries
Duplicate Rows - Finds exact and near-duplicate records
Class Imbalance - Detects imbalanced target distributions
Outliers - Identifies statistical anomalies using Z-score or IQR methods
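
As a rough illustration of what these four checks involve, here is a minimal sketch in plain pandas. It is not the DataCleanerX implementation: the helper name `detect_issues`, its return keys, and the 1.5 × IQR threshold are assumptions made for this example.

```python
import numpy as np
import pandas as pd


def detect_issues(df: pd.DataFrame, target: str) -> dict:
    """Hypothetical helper: count the four issue types with plain pandas."""
    features = df.drop(columns=[target])
    numeric = features.select_dtypes(include="number")

    # IQR rule (assumed threshold): values more than 1.5 * IQR outside
    # the quartiles are counted as outliers.
    q1, q3 = numeric.quantile(0.25), numeric.quantile(0.75)
    iqr = q3 - q1
    outlier_mask = (numeric < q1 - 1.5 * iqr) | (numeric > q3 + 1.5 * iqr)

    class_counts = df[target].value_counts()
    return {
        "missing_cells": int(df.isna().sum().sum()),
        "duplicate_rows": int(df.duplicated().sum()),
        "imbalance_ratio": float(class_counts.max() / class_counts.min()),
        "outlier_values": int(outlier_mask.sum().sum()),
    }


df = pd.DataFrame({
    "age": [25, 30, 30, 28, 27, 26, 29, 200],
    "income": [50.0, 60.0, 60.0, np.nan, 55.0, 52.0, 58.0, 61.0],
    "label": [0, 0, 0, 0, 0, 0, 1, 1],
})
issues = detect_issues(df, target="label")
print(issues)
```

Note that `isna()` only sees null and NaN values; empty strings would need to be converted to NaN first.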

Cleaning Operations

Missing Data Handling - Drop, mean/median imputation, forward/backward fill
Deduplication - Removes duplicate entries while preserving data integrity
Class Balancing - SMOTE oversampling, random under/oversampling
Outlier Treatment - Clipping, removal, or replacement strategies
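
The operations listed above can be approximated with a few pandas calls. The sketch below chains deduplication, median imputation, and IQR clipping. The function name `basic_clean` and the order of the steps are assumptions for illustration and may differ from what the library does internally.

```python
import numpy as np
import pandas as pd


def basic_clean(df: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical helper: dedupe, median-impute, then IQR-clip numeric columns."""
    out = df.drop_duplicates().reset_index(drop=True)
    numeric_cols = out.select_dtypes(include="number").columns

    # Median imputation for numeric columns only.
    out[numeric_cols] = out[numeric_cols].fillna(out[numeric_cols].median())

    # Clip each numeric column to [Q1 - 1.5*IQR, Q3 + 1.5*IQR].
    q1 = out[numeric_cols].quantile(0.25)
    q3 = out[numeric_cols].quantile(0.75)
    iqr = q3 - q1
    out[numeric_cols] = out[numeric_cols].clip(
        lower=q1 - 1.5 * iqr, upper=q3 + 1.5 * iqr, axis=1
    )
    return out


raw = pd.DataFrame({
    "x": [1.0, 2.0, 2.0, np.nan, 3.0, 100.0],
    "y": [10.0, 20.0, 20.0, 30.0, 40.0, 50.0],
})
cleaned = basic_clean(raw)
print(cleaned)
```

Clipping keeps every row and only caps extreme values, which is why it is often preferred over removal on small datasets.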

Reporting

Comprehensive summary of all detected issues
Actionable insights on cleaning operations performed
Statistics on data quality improvements

📖 Why DataCleanerX?

Data cleaning is commonly estimated to take 60-80% of a data scientist's time before model training begins. DataCleanerX addresses this by:

Automating repetitive tasks - One-line data cleaning with intelligent defaults
Preventing common mistakes - Catches issues beginners often miss
Maintaining flexibility - Fully customizable strategies for advanced users
Providing transparency - Detailed reports on all operations performed

🎯 Use Cases

Rapid Prototyping - Quickly clean datasets for exploratory analysis
Production Pipelines - Integrate into automated ML workflows
Educational Projects - Learn data cleaning best practices
Competition Prep - Fast preprocessing for Kaggle competitions

📚 Documentation

Basic Usage

from datacleanerx import Cleaner
import pandas as pd

df = pd.read_csv("messy_data.csv")

# Automatic cleaning with defaults
cleaner = Cleaner(strategy="auto")
df_clean = cleaner.fit_transform(df)

Custom Configuration

# Fine-tune cleaning strategies
cleaner = Cleaner(
    strategy="manual",
    missing_values="median",
    duplicates=True,
    imbalance="smote",
    outliers="clip"
)

df_clean = cleaner.fit_transform(df, target_column="label")
print(cleaner.report())

API Reference

`Cleaner` Class

Parameters:

strategy (str): "auto" or "manual" - Cleaning approach
missing_values (str): "drop", "mean", "median", "mode", "ffill", "bfill"
duplicates (bool): Whether to remove duplicate rows
imbalance (str): "smote", "oversample", "undersample", or None
outliers (str): "clip", "remove", or None
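
To make the "oversample" option more concrete, the following sketch shows one common way to do random oversampling with pandas: minority-class rows are resampled with replacement until every class matches the majority count. The function `random_oversample` and its parameters are hypothetical and are not part of the DataCleanerX API.

```python
import pandas as pd


def random_oversample(df: pd.DataFrame, target: str, random_state: int = 0) -> pd.DataFrame:
    """Hypothetical helper: duplicate minority rows until all classes are equal in size."""
    counts = df[target].value_counts()
    majority_size = counts.max()
    parts = []
    for cls, n in counts.items():
        group = df[df[target] == cls]
        if n < majority_size:
            # Sample with replacement to make up the shortfall.
            extra = group.sample(
                n=majority_size - n, replace=True, random_state=random_state
            )
            group = pd.concat([group, extra])
        parts.append(group)
    return pd.concat(parts).reset_index(drop=True)


df = pd.DataFrame({"x": range(8), "label": [0] * 6 + [1] * 2})
balanced = random_oversample(df, target="label")
print(balanced["label"].value_counts())
```

Unlike SMOTE, this approach only repeats existing rows and creates no synthetic samples, so it needs no extra dependency.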

Methods:

fit(df) - Analyze dataset and identify issues
transform(df) - Apply cleaning operations
fit_transform(df) - Analyze and clean in one step (the Custom Configuration example also passes target_column)
report() - Generate detailed cleaning summary
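
The fit/transform split follows a familiar pattern: statistics are learned once in fit() and reused in transform(). The class below is a toy stand-in written for this explanation, not the library's code. `MiniCleaner`, its attributes, and its log format are all assumptions.

```python
import numpy as np
import pandas as pd


class MiniCleaner:
    """Illustrative stand-in, NOT the DataCleanerX implementation."""

    def __init__(self, missing_values: str = "median", duplicates: bool = True):
        self.missing_values = missing_values
        self.duplicates = duplicates
        self.fill_values_ = None
        self.log_ = []

    def fit(self, df: pd.DataFrame) -> "MiniCleaner":
        # Learn per-column fill values from the data passed to fit().
        numeric = df.select_dtypes(include="number")
        if self.missing_values == "median":
            self.fill_values_ = numeric.median()
        elif self.missing_values == "mean":
            self.fill_values_ = numeric.mean()
        else:
            raise ValueError(f"unsupported strategy: {self.missing_values}")
        self.log_ = [f"missing cells found: {int(df.isna().sum().sum())}"]
        return self

    def transform(self, df: pd.DataFrame) -> pd.DataFrame:
        if self.fill_values_ is None:
            raise RuntimeError("call fit() before transform()")
        out = df.copy()
        if self.duplicates:
            before = len(out)
            out = out.drop_duplicates().reset_index(drop=True)
            self.log_.append(f"duplicate rows removed: {before - len(out)}")
        # Reuse the values learned in fit(); non-numeric columns are untouched.
        return out.fillna(self.fill_values_)

    def fit_transform(self, df: pd.DataFrame) -> pd.DataFrame:
        return self.fit(df).transform(df)

    def report(self) -> str:
        return "\n".join(self.log_)


df = pd.DataFrame({"a": [1.0, np.nan, 3.0, 3.0], "b": ["x", "y", "z", "z"]})
mini = MiniCleaner()
clean = mini.fit_transform(df)
print(mini.report())
```

Keeping fit() separate lets you learn fill values on training data and apply the same values to validation or test data, which avoids leaking information between splits.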

Example Report Output

=== DataCleanerX Cleaning Report ===
Dataset: 10,000 rows × 15 columns

Issues Detected:
✓ Missing values: 12.3% (1,845 cells)
✓ Duplicate rows: 350 (3.5%)
✓ Class imbalance: 1:7 ratio detected
✓ Outliers: 2.3% (345 values)

Actions Taken:
→ Missing values: Median imputation applied
→ Duplicates: 350 rows removed
→ Class balance: SMOTE oversampling (minority class: 1,200 → 6,850)
→ Outliers: Clipped using IQR method

Final Dataset: 9,650 rows × 15 columns
====================================

🏗️ Architecture

datacleanerx/
├── cleaner.py          # Main Cleaner class
├── detectors.py        # Issue detection functions
├── transformers.py     # Data cleaning operations
├── reports.py          # Report generation
└── utils.py            # Helper utilities

🛠️ Advanced Features

Integration with Scikit-learn (planned for a future version)

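from sklearn.pipeline import Pipeline
from sklearn.ensemble import RandomForestClassifier
from datacleanerx import Cleaner

# Planned usage: Cleaner as the first step of a scikit-learn pipeline
pipeline = Pipeline([
    ('cleaner', Cleaner(strategy="auto")),
    ('classifier', RandomForestClassifier())
])

# X_train and y_train are your own training features and labels
pipeline.fit(X_train, y_train)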

Handling Specific Data Types

# For time series data
cleaner = Cleaner(missing_values="ffill")

# For categorical data
cleaner = Cleaner(missing_values="mode")

# For mixed data types
cleaner = Cleaner(strategy="auto")  # Automatically detects types
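
For readers unfamiliar with these strategies, the sketch below shows the underlying pandas operations that forward fill and mode imputation typically correspond to. How DataCleanerX applies them internally is not documented here, so treat this as an assumption-based illustration with made-up sample data.

```python
import numpy as np
import pandas as pd

# Time series: forward fill carries the last observed value forward,
# which respects temporal order.
ts = pd.Series(
    [1.0, np.nan, np.nan, 4.0],
    index=pd.date_range("2025-01-01", periods=4, freq="D"),
)
ts_filled = ts.ffill()

# Categorical data: a mean or median is undefined, so the most
# frequent value (the mode) is a common replacement.
colors = pd.Series(["red", None, "blue", "red"])
colors_filled = colors.fillna(colors.mode().iloc[0])

print(ts_filled)
print(colors_filled)
```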

📦 Dependencies

Core Requirements:

pandas >= 1.0.0
numpy >= 1.18.0

Optional (for advanced features):

imbalanced-learn >= 0.8.0 (for SMOTE)
scikit-learn >= 0.24.0 (for pipeline integration)

🧪 Development

Running Tests

git clone https://github.com/SatyamSingh8306/datacleanerx.git
cd datacleanerx
pip install -e ".[dev]"
pytest tests/

Contributing

We welcome contributions! Please see CONTRIBUTING.md for guidelines. The typical workflow is:

Fork the repository
Create a feature branch (git checkout -b feature/amazing-feature)
Commit your changes (git commit -m 'Add amazing feature')
Push to the branch (git push origin feature/amazing-feature)
Open a Pull Request

📝 Roadmap

Version 0.2.0 (Upcoming)

Visualization tools (missing value heatmaps, distribution plots)
Save/load cleaning configurations as JSON
Enhanced outlier detection algorithms
Support for text data preprocessing

Version 0.3.0 (Future)

CLI tool for non-Python users
Parallel processing for large datasets
Auto-tuning of cleaning strategies
Integration with popular ML frameworks

📄 License

This project is licensed under the MIT License - see the LICENSE file for details.

🙏 Acknowledgments

Built with inspiration from the data science community's need for faster, more reliable preprocessing tools.

📧 Contact & Support

Issues: GitHub Issues
Discussions: GitHub Discussions
Email: satyamsingh7734@gmail.com

Made with ❤️ by Satyam Singh

Star ⭐ this repo if you find it useful!


Release history

0.2.1 - Sep 30, 2025
0.2.0 - Sep 30, 2025 (this version)
0.1.1 - Sep 30, 2025
0.1.0 - Sep 30, 2025

Download files

Source Distribution

datacleanerx-0.2.0.tar.gz (11.3 kB), uploaded Sep 30, 2025, Source

Built Distribution

datacleanerx-0.2.0-py3-none-any.whl (9.6 kB), uploaded Sep 30, 2025, Python 3

File details

Details for the file datacleanerx-0.2.0.tar.gz.

File metadata

Download URL: datacleanerx-0.2.0.tar.gz
Upload date: Sep 30, 2025
Size: 11.3 kB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.2.0 CPython/3.12.1

File hashes

Hashes for datacleanerx-0.2.0.tar.gz
Algorithm	Hash digest
SHA256	`13583811d5a9b8bbb639f28324d146c56b00e07c884bac2f5f6ee6829cfc5e7a`
MD5	`4917a7d7c0958eccf0b6ce526bf974d2`
BLAKE2b-256	`4829d7b62ec7b602aae06b5680b83caff31ccddbb468ca19401f047cb7a5ab30`
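
A downloaded archive can be checked against the SHA256 digest listed above using only the Python standard library. This is a minimal sketch: the helper name `sha256_of_file` is made up for the example, and the archive is assumed to sit in the current directory.

```python
import hashlib
from pathlib import Path

# SHA256 digest published for datacleanerx-0.2.0.tar.gz (copied from the table above).
EXPECTED_SHA256 = "13583811d5a9b8bbb639f28324d146c56b00e07c884bac2f5f6ee6829cfc5e7a"


def sha256_of_file(path, chunk_size: int = 65536) -> str:
    """Hypothetical helper: hash a file in chunks so large files fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


# Assumption: the archive was downloaded into the current directory.
archive = Path("datacleanerx-0.2.0.tar.gz")
if archive.exists():
    matches = sha256_of_file(archive) == EXPECTED_SHA256
    print("hash OK" if matches else "hash MISMATCH")
else:
    print(f"{archive} not found; download it first")
```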


File details

Details for the file datacleanerx-0.2.0-py3-none-any.whl.

File metadata

Download URL: datacleanerx-0.2.0-py3-none-any.whl
Upload date: Sep 30, 2025
Size: 9.6 kB
Tags: Python 3
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.2.0 CPython/3.12.1

File hashes

Hashes for datacleanerx-0.2.0-py3-none-any.whl
Algorithm	Hash digest
SHA256	`20be2703153644301674e1650b45b9ef5743ff6a4c25f1f6908d479cfd1b8f4d`
MD5	`9e4f6ad438204c0f5af04bf386e50b23`
BLAKE2b-256	`44cf8be7b694bd4e9c0f2b00374b4c8961b76234986dff69ce9cf969af12c0ad`

