Automated Data Cleaning, Validation and Analytics Toolkit
MasterClean
Automated Data Cleaning, Validation & Analytics Toolkit for Python.
MasterClean is a professional Python package that automates:
- data cleaning
- preprocessing
- validation
- profiling
- visualization
- reporting
- analytics
using simple CLI commands or the Python API.
Designed for:
- Data Analysts
- Data Scientists
- ML Engineers
- Researchers
- Students
- Automation workflows
Features
Advanced Data Cleaning
- Missing value handling
- Duplicate row removal
- Empty string cleanup
- Whitespace cleanup
- Column standardization
- Datetime conversion
- Smart categorical filling
- Automatic preprocessing pipeline
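To make a few of these steps concrete, here is a minimal sketch of whitespace cleanup, empty string cleanup, column standardization, and duplicate row removal. It uses only the standard library and plain dictionaries, so it is not MasterClean's implementation; `standardize_column` and `clean_rows` are hypothetical names chosen for this illustration.

```python
import re

def standardize_column(name):
    """Lower-case a header and collapse non-alphanumeric runs into '_'."""
    return re.sub(r"[^0-9a-z]+", "_", name.strip().lower()).strip("_")

def clean_rows(rows):
    """Return cleaned copies of `rows` (a list of dicts) without exact duplicates."""
    cleaned, seen = [], set()
    for row in rows:
        new_row = {}
        for key, value in row.items():
            if isinstance(value, str):
                value = value.strip()      # whitespace cleanup
                if value == "":
                    value = None           # empty strings become missing values
            new_row[standardize_column(key)] = value
        # Column names are unique, so sorting by key gives a stable fingerprint.
        fingerprint = tuple(sorted(new_row.items(), key=lambda kv: kv[0]))
        if fingerprint not in seen:        # duplicate row removal
            seen.add(fingerprint)
            cleaned.append(new_row)
    return cleaned

rows = [
    {" First Name ": "  Ada ", "Age": "36"},
    {" First Name ": "Ada", "Age": "36"},
    {" First Name ": "   ", "Age": "41"},
]
result = clean_rows(rows)
```

Note that the second row only counts as a duplicate after whitespace has been stripped, which is why cleanup runs before deduplication in this sketch.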
Datatype Optimization
- Integer optimization
- Float optimization
- Boolean conversion
- Category optimization
- Datetime detection
- Memory usage reduction
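The idea behind integer optimization can be sketched as follows: pick the narrowest signed integer width that still holds every value in a column, then compare the storage cost. This is an illustration under stated assumptions, not MasterClean's `optimize_dtypes` code; `smallest_int_dtype` and `estimated_bytes` are hypothetical helpers, and the dtype names simply follow the usual NumPy naming.

```python
# (dtype name, minimum value, maximum value) for signed integers, narrowest first.
INT_RANGES = [
    ("int8", -2**7, 2**7 - 1),
    ("int16", -2**15, 2**15 - 1),
    ("int32", -2**31, 2**31 - 1),
    ("int64", -2**63, 2**63 - 1),
]

def smallest_int_dtype(values):
    """Return the narrowest signed integer dtype name that fits all values."""
    lo, hi = min(values), max(values)
    for name, low, high in INT_RANGES:
        if low <= lo and hi <= high:
            return name
    raise OverflowError("values do not fit in a 64-bit signed integer")

def estimated_bytes(values, dtype):
    """Estimate raw storage for `values` at the given integer width."""
    bits = int(dtype[3:])                  # "int16" -> 16
    return len(values) * bits // 8

ages = [23, 41, 67, 35]
salaries = [52_000, 61_500, 48_250]

age_dtype = smallest_int_dtype(ages)
salary_dtype = smallest_int_dtype(salaries)
saved = estimated_bytes(ages, "int64") - estimated_bytes(ages, age_dtype)
```

Here the age column fits in 8 bits while the salary column needs 32, so the saving depends entirely on the value range of each column.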
Advanced Validation Engine
- Negative value detection
- Outlier detection
- Invalid boolean detection
- Email validation
- Phone validation
- Duplicate percentage warnings
- Missing value percentage analysis
- Mixed datatype detection
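A rough sketch of four of these checks is shown below, using only the standard library. The function names, the deliberately loose email pattern, and the 1.5 x IQR outlier rule are assumptions made for illustration; the source does not say which rules MasterClean applies.

```python
import re
import statistics

# Loose pattern: something@something.something, with no spaces or extra '@'.
EMAIL_RE = re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$")

def negative_count(values):
    """Count negative numbers, ignoring missing values."""
    return sum(1 for v in values if v is not None and v < 0)

def invalid_email_count(values):
    """Count non-missing strings that do not look like an email address."""
    return sum(1 for v in values if v is not None and not EMAIL_RE.match(v))

def iqr_outliers(values):
    """Return values lying more than 1.5 * IQR outside the quartiles."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    spread = q3 - q1
    low, high = q1 - 1.5 * spread, q3 + 1.5 * spread
    return [v for v in values if v < low or v > high]

def duplicate_percentage(rows):
    """Percentage of rows that repeat an earlier row exactly."""
    unique = {tuple(sorted(r.items())) for r in rows}
    return 100.0 * (len(rows) - len(unique)) / len(rows)

negatives = negative_count([52000, -100, None, 61000])
bad_emails = invalid_email_count(["a@example.com", "not-an-email", None, "b@site"])
outliers = iqr_outliers([10, 12, 11, 13, 12, 11, 95])
dup_pct = duplicate_percentage([{"id": 1}, {"id": 1}, {"id": 2}, {"id": 3}])
```

Each helper returns a count or a list rather than printing, so the caller can decide which results are worth turning into a warning.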
Advanced Profiling
- Dataset health score
- Missing value summaries
- Datatype analytics
- Memory usage analysis
- Numeric statistics
- Categorical summaries
- Dataset overview metrics
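As an illustration of what a missing value summary and a health score could look like, the sketch below computes both from a list of dictionaries. The scoring formula (the percentage of non-missing cells) is purely an assumption for this example; the source does not describe how MasterClean calculates its health score, and `missing_summary` and `health_score` are hypothetical names.

```python
def missing_summary(rows):
    """Percentage of missing (None) values per column."""
    columns = list(rows[0].keys())
    return {
        col: 100.0 * sum(1 for r in rows if r[col] is None) / len(rows)
        for col in columns
    }

def health_score(rows):
    """Hypothetical score: percentage of cells that are not missing."""
    total = sum(len(r) for r in rows)
    missing = sum(1 for r in rows for v in r.values() if v is None)
    return round(100.0 * (total - missing) / total, 1)

rows = [
    {"name": "Ada", "age": 36},
    {"name": None, "age": 41},
    {"name": "Lin", "age": None},
    {"name": "Sam", "age": 29},
]
summary = missing_summary(rows)
score = health_score(rows)
```

A fuller score would probably also weigh duplicates and validation warnings, but that weighting is a design decision this sketch leaves open.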
Interactive Visualization Engine
- Plotly dashboards
- Histograms
- Boxplots
- Pie charts
- Correlation heatmaps
- Missing value charts
- Interactive analytics dashboards
Reporting System
- Unified HTML analytics dashboard
- Validation summaries
- Dataset overview cards
- Interactive visualizations
- Automated report generation
Professional CLI Toolkit
MasterClean provides a separate CLI command for each task.
Full Automated Pipeline
masterclean clean data.csv
Runs:
- cleaning
- optimization
- validation
- profiling
- visualization
- reporting
- exporting
Validation Only
masterclean validate data.csv
Dataset Profiling
masterclean profile data.csv
Dashboard Generation
masterclean dashboard data.csv
Show Version
masterclean version
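For readers curious how a command layout like this is typically wired up, here is a minimal `argparse` sketch with the same five subcommands. It is an assumption about structure only: the source does not show `cli.py`, so the parser below, including `build_parser` and the help strings, is hypothetical.

```python
import argparse

def build_parser():
    """Build a parser with one subcommand per task listed above."""
    parser = argparse.ArgumentParser(prog="masterclean")
    sub = parser.add_subparsers(dest="command", required=True)

    # Commands that operate on an input file.
    file_commands = {
        "clean": "run the full automated pipeline",
        "validate": "run validation only",
        "profile": "profile the dataset",
        "dashboard": "generate the dashboard",
    }
    for name, help_text in file_commands.items():
        cmd = sub.add_parser(name, help=help_text)
        cmd.add_argument("path", help="input file, e.g. data.csv")

    # The version command takes no arguments.
    sub.add_parser("version", help="show the installed version")
    return parser

parser = build_parser()
args = parser.parse_args(["validate", "data.csv"])
version_args = parser.parse_args(["version"])
```

With `dest="command"`, the chosen subcommand is available as `args.command`, which a dispatcher can map to the matching function.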
Installation
Install from PyPI
pip install masterclean
Upgrade to Latest Version
pip install --upgrade masterclean
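Python Usage
from masterclean import (
    read_file, clean_data, optimize_dtypes, validate_data,
    generate_profile, generate_charts, generate_report, export_data,
)

# Load the dataset; the extension is passed to export_data below.
df, file_extension = read_file("data.csv")

# Clean the data, then optimize column datatypes.
df = clean_data(df)
df = optimize_dtypes(df)

# Collect validation warnings, profile metrics, and charts.
warnings = validate_data(df)
profile = generate_profile(df)
charts = generate_charts(df)

# Build the report from the collected results.
generate_report(
    df=df,
    warnings=warnings,
    profile=profile,
    charts=charts,
)

# Write the cleaned data as "cleaned_data" plus the matching extension.
export_data(
    df,
    "cleaned_data",
    file_extension,
)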
Supported File Formats
| Format | Supported |
|---|---|
| CSV | Yes |
| XLSX | Yes |
| XLS | Yes |
Same-Format Export System
MasterClean writes the cleaned data in the same format as the input, except that legacy XLS input is exported as XLSX.
| Input | Output |
|---|---|
| CSV | cleaned_data.csv |
| XLSX | cleaned_data.xlsx |
| XLS | cleaned_data.xlsx |
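The mapping in the table can be expressed as a small lookup. This is a sketch of the rule only, not MasterClean's `export_data` code; `OUTPUT_EXTENSION` and `output_name` are hypothetical names, and the case-insensitive extension match is an assumption.

```python
from pathlib import Path

# Input extension -> output extension, as listed in the table above.
OUTPUT_EXTENSION = {".csv": ".csv", ".xlsx": ".xlsx", ".xls": ".xlsx"}

def output_name(input_path, stem="cleaned_data"):
    """Return the export file name for a given input file."""
    ext = Path(input_path).suffix.lower()
    if ext not in OUTPUT_EXTENSION:
        raise ValueError(f"unsupported file format: {ext or '(none)'}")
    return stem + OUTPUT_EXTENSION[ext]

csv_name = output_name("data.csv")
xls_name = output_name("reports/sales.XLS")
```

Keeping the mapping in one dictionary means a new format only needs one new entry.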
Example Validation Output
VALIDATION WARNINGS
========================================
⚠ Negative values found in 'salary' (3 rows)
⚠ Invalid email values found in 'email' (5 rows)
⚠ High duplicate rows detected (14.2%)
⚠ Mixed datatypes detected in 'age'
Architecture
Read
↓
Clean
↓
Optimize
↓
Validate
↓
Profile
↓
Visualize
↓
Report
↓
Export
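One way to read this flow is as an ordered list of stages that each receive and return a shared context. The sketch below shows that pattern with three stand-in stages. It is an assumption about structure, not MasterClean's code: `run_pipeline` and the stage functions are hypothetical, and the real stages work on a DataFrame rather than the toy rows used here.

```python
def run_pipeline(context, stages):
    """Pass `context` through each (name, function) stage in order."""
    for name, stage in stages:
        context = stage(context)
        context["completed"].append(name)   # record which stages have run
    return context

# Stand-in stages (hypothetical); each takes and returns the shared context dict.
def read(ctx):
    ctx["rows"] = [{"age": 36}, {"age": None}, {"age": -5}]
    return ctx

def clean(ctx):
    # Drop rows with a missing age.
    ctx["rows"] = [r for r in ctx["rows"] if r["age"] is not None]
    return ctx

def validate(ctx):
    # Flag negative ages without modifying the data.
    ctx["warnings"] = [
        f"Negative value in 'age': {r['age']}"
        for r in ctx["rows"] if r["age"] < 0
    ]
    return ctx

stages = [("read", read), ("clean", clean), ("validate", validate)]
result = run_pipeline({"completed": []}, stages)
```

Because the order lives in the `stages` list, a validation-only or profile-only run is just a shorter list.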
Project Structure
masterclean/
│
├── cleaner.py
├── validator.py
├── datatypes.py
├── profiler.py
├── visualizer.py
├── report.py
├── exporter.py
├── reader.py
├── cli.py
├── __init__.py
│
tests/
│
├── test_cleaner.py
├── test_validator.py
├── test_reader.py
├── test_report.py
├── test_visualizer.py
│
.github/workflows/
│
├── tests.yml
Testing
Run tests using:
python -m pytest
CI/CD
MasterClean uses GitHub Actions for:
- automated testing
- dependency validation
- continuous integration
Roadmap
Planned improvements:
- Streamlit web application
- AI-powered cleaning suggestions
- Large dataset optimization
- Schema validation engine
- Cloud deployment support
- Plugin architecture
- Real-time analytics dashboards
Contributing
Contributions are welcome.
You can:
- report bugs
- suggest features
- improve documentation
- submit pull requests
License
MIT License
Author
Mohamed Faisal Maraicar N