# DataCleanerX

**Automated dataset cleaner for machine learning.**

DataCleanerX is a Python library that automates the data preprocessing pipeline for machine learning projects. Spend less time cleaning data and more time building models.
## 🚀 Quick Start

```bash
pip install datacleanerx
```

```python
from datacleaner_ai import Cleaner
import pandas as pd

# Load your data
df = pd.read_csv("data.csv")

# Clean in one line
cleaner = Cleaner(strategy="auto")
df_clean = cleaner.fit_transform(df)

# View cleaning report
print(cleaner.report())
```
## ✨ Features

DataCleanerX automatically detects and resolves common data quality issues:

### Detection Capabilities
- Missing Values - Identifies null, NaN, and empty entries
- Duplicate Rows - Finds exact and near-duplicate records
- Class Imbalance - Detects imbalanced target distributions
- Outliers - Identifies statistical anomalies using Z-score or IQR methods
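The IQR rule used for outlier detection can be sketched in a few lines of plain pandas, independent of the library (`detect_outliers_iqr` below is an illustrative helper, not part of the DataCleanerX API):

```python
import pandas as pd

def detect_outliers_iqr(series: pd.Series, k: float = 1.5) -> pd.Series:
    """Flag values outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, q3 = series.quantile(0.25), series.quantile(0.75)
    iqr = q3 - q1
    return (series < q1 - k * iqr) | (series > q3 + k * iqr)

s = pd.Series([1, 2, 3, 4, 5, 100])
print(detect_outliers_iqr(s).tolist())  # only 100 is flagged
```

The Z-score variant works the same way, flagging values more than a chosen number of standard deviations from the mean.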
### Cleaning Operations
- Missing Data Handling - Drop, mean/median imputation, forward/backward fill
- Deduplication - Removes duplicate entries while preserving data integrity
- Class Balancing - SMOTE oversampling, random under/oversampling
- Outlier Treatment - Clipping, removal, or replacement strategies
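For reference, the first two operations above correspond to standard pandas idioms that the `Cleaner` API wraps (a minimal sketch with made-up sample data):

```python
import pandas as pd

df = pd.DataFrame({"age": [25.0, None, 30.0, 30.0],
                   "city": ["NY", "LA", "SF", "SF"]})

# Median imputation for a numeric column
df["age"] = df["age"].fillna(df["age"].median())

# Exact deduplication
df = df.drop_duplicates().reset_index(drop=True)
print(df)  # 3 rows, no missing values
```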
### Reporting
- Comprehensive summary of all detected issues
- Actionable insights on cleaning operations performed
- Statistics on data quality improvements
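The headline numbers in such a report are cheap to compute with pandas; a sketch (variable names `missing_pct` and `dup_rows` are illustrative, not library API):

```python
import pandas as pd

df = pd.DataFrame({"a": [1, None, 3, 3], "b": ["x", "y", "z", "z"]})

missing_pct = 100 * df.isna().sum().sum() / df.size  # % of missing cells
dup_rows = int(df.duplicated().sum())                # exact duplicate rows
print(f"Missing: {missing_pct:.1f}%  Duplicates: {dup_rows}")
```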
## 📖 Why DataCleanerX?

Data scientists reportedly spend 60-80% of their time on data cleaning before model training. DataCleanerX addresses this by:
- Automating repetitive tasks - One-line data cleaning with intelligent defaults
- Preventing common mistakes - Catches issues beginners often miss
- Maintaining flexibility - Fully customizable strategies for advanced users
- Providing transparency - Detailed reports on all operations performed
## 🎯 Use Cases
- Rapid Prototyping - Quickly clean datasets for exploratory analysis
- Production Pipelines - Integrate into automated ML workflows
- Educational Projects - Learn data cleaning best practices
- Competition Prep - Fast preprocessing for Kaggle competitions
## 📚 Documentation

### Basic Usage

```python
from datacleaner_ai import Cleaner
import pandas as pd

df = pd.read_csv("messy_data.csv")

# Automatic cleaning with defaults
cleaner = Cleaner(strategy="auto")
df_clean = cleaner.fit_transform(df)
```
### Custom Configuration

```python
# Fine-tune cleaning strategies
cleaner = Cleaner(
    strategy="manual",
    missing_values="median",
    duplicates=True,
    imbalance="smote",
    outliers="clip",
)
df_clean = cleaner.fit_transform(df, target_column="label")
print(cleaner.report())
```
### API Reference

#### `Cleaner` Class

**Parameters:**

- `strategy` (str): `"auto"` or `"manual"` - cleaning approach
- `missing_values` (str): `"drop"`, `"mean"`, `"median"`, `"ffill"`, or `"bfill"`
- `duplicates` (bool): whether to remove duplicate rows
- `imbalance` (str): `"smote"`, `"oversample"`, `"undersample"`, or `None`
- `outliers` (str): `"clip"`, `"remove"`, or `None`
**Methods:**

- `fit(df)` - analyze the dataset and identify issues
- `transform(df)` - apply cleaning operations
- `fit_transform(df)` - analyze and clean in one step
- `report()` - generate a detailed cleaning summary
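The fit/transform split follows the scikit-learn convention: statistics such as an imputation median are learned from training data in `fit` and reused in `transform`, so test data never leaks into them. A minimal illustration of the pattern in plain pandas (the `MedianImputer` class is hypothetical, not part of DataCleanerX):

```python
import pandas as pd

class MedianImputer:
    """Learn column medians on fit; fill NaNs with them on transform."""
    def fit(self, df: pd.DataFrame) -> "MedianImputer":
        self.medians_ = df.median(numeric_only=True)
        return self

    def transform(self, df: pd.DataFrame) -> pd.DataFrame:
        return df.fillna(self.medians_)

    def fit_transform(self, df: pd.DataFrame) -> pd.DataFrame:
        return self.fit(df).transform(df)

train = pd.DataFrame({"x": [1.0, 3.0, None]})
holdout = pd.DataFrame({"x": [None, 10.0]})
imp = MedianImputer().fit(train)
print(imp.transform(holdout))  # NaN filled with the *training* median, 2.0
```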
### Example Report Output

```text
=== DataCleanerX Cleaning Report ===
Dataset: 10,000 rows × 15 columns

Issues Detected:
✓ Missing values: 12.3% (1,845 cells)
✓ Duplicate rows: 350 (3.5%)
✓ Class imbalance: 1:7 ratio detected
✓ Outliers: 2.3% (345 values)

Actions Taken:
→ Missing values: Median imputation applied
→ Duplicates: 350 rows removed
→ Class balance: SMOTE oversampling (minority class: 1,200 → 6,850)
→ Outliers: Clipped using IQR method

Final Dataset: 9,650 rows × 15 columns
====================================
```
## 🏗️ Architecture

```text
datacleanerx/
├── cleaner.py       # Main Cleaner class
├── detectors.py     # Issue detection functions
├── transformers.py  # Data cleaning operations
├── reports.py       # Report generation
└── utils.py         # Helper utilities
```
## 🛠️ Advanced Features

### Scikit-learn Integration (planned for a future release)

```python
from sklearn.pipeline import Pipeline
from sklearn.ensemble import RandomForestClassifier
from datacleaner_ai import Cleaner

pipeline = Pipeline([
    ("cleaner", Cleaner(strategy="auto")),
    ("classifier", RandomForestClassifier()),
])
pipeline.fit(X_train, y_train)
```
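Until that integration lands, any cleaning step can be made pipeline-compatible by following the scikit-learn transformer protocol (`fit`/`transform` plus `BaseEstimator`/`TransformerMixin`). A sketch assuming scikit-learn is installed; the `NaNFiller` class is a hypothetical stand-in, not the DataCleanerX `Cleaner`:

```python
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.pipeline import Pipeline
from sklearn.ensemble import RandomForestClassifier

class NaNFiller(BaseEstimator, TransformerMixin):
    """Fill NaNs with per-column means learned during fit."""
    def fit(self, X, y=None):
        self.means_ = np.nanmean(np.asarray(X, dtype=float), axis=0)
        return self

    def transform(self, X):
        X = np.asarray(X, dtype=float).copy()
        rows, cols = np.where(np.isnan(X))
        X[rows, cols] = np.take(self.means_, cols)
        return X

X = np.array([[1.0, np.nan], [3.0, 4.0], [5.0, 6.0], [7.0, 8.0]])
y = np.array([0, 0, 1, 1])
pipe = Pipeline([("fill", NaNFiller()),
                 ("clf", RandomForestClassifier(n_estimators=10, random_state=0))])
pipe.fit(X, y)
print(pipe.predict(X))
```

Because the filler learns its means in `fit`, cross-validation and `pipe.predict` on new data reuse training statistics rather than recomputing them.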
### Handling Specific Data Types

```python
# For time series data
cleaner = Cleaner(missing_values="ffill")

# For categorical data
cleaner = Cleaner(missing_values="mode")

# For mixed data types
cleaner = Cleaner(strategy="auto")  # Automatically detects types
```
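The forward-fill strategy for time series maps directly onto pandas' `ffill`, which propagates the last observed value forward; a sketch without the library:

```python
import pandas as pd

ts = pd.Series([10.0, None, None, 13.0],
               index=pd.date_range("2024-01-01", periods=4, freq="D"))
filled = ts.ffill()  # gaps take the most recent prior observation
print(filled.tolist())  # [10.0, 10.0, 10.0, 13.0]
```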
## 📦 Dependencies

Core requirements:

- pandas >= 1.0.0
- numpy >= 1.18.0

Optional (for advanced features):

- imbalanced-learn >= 0.8.0 (for SMOTE)
- scikit-learn >= 0.24.0 (for pipeline integration)
## 🧪 Development

### Running Tests

```bash
git clone https://github.com/SatyamSingh8306/datacleanerx.git
cd datacleanerx
pip install -e ".[dev]"
pytest tests/
```
### Contributing

We welcome contributions! Please see CONTRIBUTING.md for guidelines.

1. Fork the repository
2. Create a feature branch (`git checkout -b feature/amazing-feature`)
3. Commit your changes (`git commit -m 'Add amazing feature'`)
4. Push to the branch (`git push origin feature/amazing-feature`)
5. Open a Pull Request
## 📝 Roadmap

### Version 0.2.0 (Upcoming)
- Visualization tools (missing value heatmaps, distribution plots)
- Save/load cleaning configurations as JSON
- Enhanced outlier detection algorithms
- Support for text data preprocessing
### Version 0.3.0 (Future)
- CLI tool for non-Python users
- Parallel processing for large datasets
- Auto-tuning of cleaning strategies
- Integration with popular ML frameworks
## 📄 License
This project is licensed under the MIT License - see the LICENSE file for details.
## 🙏 Acknowledgments
Built with inspiration from the data science community's need for faster, more reliable preprocessing tools.
## 📧 Contact & Support
- Issues: GitHub Issues
- Discussions: GitHub Discussions
- Email: satyamsingh7734@gmail.com
Made with ❤️ by Satyam Singh
Star ⭐ this repo if you find it useful!
## Project details

### Download files
#### Source distribution: datacleanerx-0.2.1.tar.gz

- Size: 11.3 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0, CPython/3.12.1
File hashes:

| Algorithm | Hash digest |
|---|---|
| SHA256 | `790c26cb972300608e85f8339f81f227d83b6c046627782e32f49364fe50caed` |
| MD5 | `09ec6ee18137883d44c6a96e64987542` |
| BLAKE2b-256 | `52799fe6ca9e8eec1f28d5b2868c409db12755f2d10ed63b3a7571c8eff2874c` |
#### Built distribution: datacleanerx-0.2.1-py3-none-any.whl

- Size: 9.6 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0, CPython/3.12.1
File hashes:

| Algorithm | Hash digest |
|---|---|
| SHA256 | `869ebe5b6120982a082eaea6dabeeb384035a52559bf045dd6086052746821e0` |
| MD5 | `eda0e4a651a28f953f17395032f5421e` |
| BLAKE2b-256 | `2e746283a1c96d634f56212681f2ed1e76bdb3a6ab7d45fa721090509dae6d37` |