Automated Data Cleaning, Validation and Analytics Toolkit
MasterClean
Automated Data Cleaning, Validation & Analytics Toolkit for Python.
MasterClean is a professional Python package that automates dataset cleaning, preprocessing, validation, profiling, visualization, and reporting using a single command.
It is designed for:
- Data Analysts
- Data Scientists
- ML Engineers
- Researchers
- Students
- Automation workflows
Features
Data Cleaning
- Automatic missing value handling
- Duplicate row removal
- Column standardization
- String cleanup
- Encoding-aware file loading
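The cleaning steps listed above can be illustrated with a minimal standard-library sketch. This is not MasterClean's implementation: the helper names `standardize_column` and `clean_rows`, the `"unknown"` fill value, and the list-of-dicts row format are all assumptions chosen for illustration.

```python
import re


def standardize_column(name):
    """Lower-case a header and collapse non-alphanumeric runs into '_'."""
    return re.sub(r"[^0-9a-z]+", "_", name.strip().lower()).strip("_")


def clean_rows(rows, fill="unknown"):
    """Standardize headers, trim strings, fill gaps, drop duplicate rows.

    `fill` is a hypothetical placeholder for missing values.
    """
    cleaned = []
    for row in rows:
        new_row = {}
        for key, value in row.items():
            if isinstance(value, str):
                value = value.strip()          # string cleanup
            if value is None or value == "":
                value = fill                   # missing value handling
            new_row[standardize_column(key)] = value
        cleaned.append(new_row)

    # Remove exact duplicates while keeping the first occurrence.
    seen = set()
    unique = []
    for row in cleaned:
        fingerprint = tuple(sorted(row.items(), key=lambda item: item[0]))
        if fingerprint not in seen:
            seen.add(fingerprint)
            unique.append(row)
    return unique


raw_rows = [
    {" First Name ": " Ada ", "City": ""},
    {" First Name ": "Ada", "City": None},
    {" First Name ": "Grace", "City": "NYC"},
]
cleaned_rows = clean_rows(raw_rows)
print(cleaned_rows)
```

The first two rows become identical once whitespace is trimmed and the empty city is filled, so only one of them survives.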
Datatype Optimization
- Automatic numeric conversion
- Datetime detection
- Integer optimization
- Mixed datatype handling
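As a rough illustration of the datatype steps above, the sketch below infers a type for a column of strings using only the standard library. The function names, the two date formats, and the "keep mixed columns as strings" rule are assumptions, not MasterClean's documented behavior.

```python
from datetime import datetime

# Assumed date formats, for illustration only.
DATE_FORMATS = ("%Y-%m-%d", "%d/%m/%Y")


def convert_value(text):
    """Try int, then float, then the assumed date formats; else keep the string."""
    text = text.strip()
    for caster in (int, float):
        try:
            return caster(text)
        except ValueError:
            pass
    for fmt in DATE_FORMATS:
        try:
            return datetime.strptime(text, fmt)
        except ValueError:
            continue
    return text


def convert_column(values):
    """Convert a whole column only when every value agrees on a type."""
    converted = [convert_value(v) for v in values]
    kinds = {type(v) for v in converted}
    if kinds == {int} or kinds == {datetime}:
        return converted
    if kinds and kinds <= {int, float}:
        return [float(v) for v in converted]
    return list(values)  # mixed or non-numeric column: leave the strings alone


def smallest_int_type(values):
    """Name the narrowest signed integer width that can hold every value."""
    low, high = min(values), max(values)
    for name, bits in (("int8", 8), ("int16", 16), ("int32", 32), ("int64", 64)):
        if -(2 ** (bits - 1)) <= low and high <= 2 ** (bits - 1) - 1:
            return name
    return "object"


print(convert_column(["1", "2", "3"]))
print(convert_column(["1", "2.5"]))
print(convert_column(["1", "abc"]))
print(smallest_int_type([0, 300]))
```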
Validation Engine
- Negative value detection
- Outlier detection
- Invalid boolean detection
- Dataset quality warnings
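The checks above could look something like the following standard-library sketch. The 1.5 × IQR outlier rule and the set of accepted boolean spellings are assumptions; this README does not say which rules MasterClean applies.

```python
from statistics import quantiles

# Assumed set of accepted boolean-like spellings.
BOOLEAN_LIKE = {"true", "false", "yes", "no", "1", "0"}


def negative_count(values):
    """Count values below zero."""
    return sum(1 for value in values if value < 0)


def iqr_outliers(values, k=1.5):
    """Return values outside [Q1 - k*IQR, Q3 + k*IQR] (Tukey's rule)."""
    q1, _, q3 = quantiles(values, n=4)
    spread = q3 - q1
    low, high = q1 - k * spread, q3 + k * spread
    return [value for value in values if value < low or value > high]


def invalid_booleans(values):
    """Return the distinct values that are not boolean-like."""
    return {v for v in values if str(v).strip().lower() not in BOOLEAN_LIKE}


ages = [25, -3, 40]
salaries = [30000, 31000, 32000, 33000, 34000, 35000, 36000, 250000]
active = ["yes", "no", "maybe", "TRUE"]

print(negative_count(ages))
print(iqr_outliers(salaries))
print(invalid_booleans(active))
```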
Analytics & Profiling
- Automated dataset profiling
- Numeric statistics
- Categorical summaries
- Memory usage analysis
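To make the profiling items above concrete, here is a small sketch that summarizes each column of a list-of-dicts dataset. The function names and the shape of the returned dictionary are hypothetical, and the memory figure is only a shallow estimate from `sys.getsizeof`.

```python
import sys
from collections import Counter
from statistics import mean, median


def profile_column(values):
    """Summarize one column: counts, then numeric or categorical details."""
    present = [v for v in values if v is not None]
    summary = {
        "count": len(values),
        "missing": len(values) - len(present),
        # Shallow size of each stored object; an approximation only.
        "approx_bytes": sum(sys.getsizeof(v) for v in values),
    }
    is_numeric = bool(present) and all(
        isinstance(v, (int, float)) and not isinstance(v, bool) for v in present
    )
    if is_numeric:
        summary.update(kind="numeric", min=min(present), max=max(present),
                       mean=mean(present), median=median(present))
    else:
        summary.update(kind="categorical", unique=len(set(present)),
                       top=Counter(present).most_common(3))
    return summary


def profile_rows(rows):
    """Profile every column, using the first row's keys as the column list."""
    if not rows:
        return {}
    return {key: profile_column([row.get(key) for row in rows]) for key in rows[0]}


people = [
    {"age": 30, "city": "Paris"},
    {"age": 40, "city": "Paris"},
    {"age": None, "city": "Rome"},
]
profile = profile_rows(people)
print(profile["age"])
print(profile["city"])
```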
Visualization Engine
- Interactive Plotly dashboards
- Histograms
- Pie charts
- Boxplots
- Distribution analysis
- Category analytics
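MasterClean renders its charts with Plotly. The sketch below does not draw anything; it only shows, in plain Python, the kind of equal-width binning that sits behind a histogram. The function name `histogram_bins` and the default of five bins are assumptions for illustration.

```python
def histogram_bins(values, bins=5):
    """Return (start, end, count) tuples for equal-width bins."""
    low, high = min(values), max(values)
    if low == high:
        # All values identical: a single degenerate bin.
        return [(low, high, len(values))]
    width = (high - low) / bins
    counts = [0] * bins
    for value in values:
        # The maximum value would land one past the end, so clamp it.
        index = min(int((value - low) / width), bins - 1)
        counts[index] += 1
    return [(low + i * width, low + (i + 1) * width, counts[i]) for i in range(bins)]


histogram = histogram_bins([1, 2, 3, 4, 5, 6, 7, 8, 9, 10], bins=3)
for start, end, count in histogram:
    print(f"{start:>5.1f} to {end:>5.1f}: {'#' * count}")
```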
Reporting
- Unified HTML analytics dashboard
- Validation summaries
- Interactive charts
- Automated report generation
Developer Features
- Command Line Interface (CLI)
- Automated testing with pytest
- GitHub Actions CI/CD pipeline
- Modular package architecture
Installation
Install From PyPI
pip install masterclean
Development Installation
Clone Repository
git clone https://github.com/MohamedFaisal-11/masterclean.git
cd masterclean
Create Virtual Environment
python -m venv venv
Activate Environment
macOS / Linux
source venv/bin/activate
Windows
venv\Scripts\activate
Install Package
pip install -e .
CLI Usage
Clean Dataset
masterclean clean sample.csv
MasterClean automatically:
- Reads datasets
- Cleans missing values
- Removes duplicates
- Optimizes datatypes
- Detects validation issues
- Generates dashboards
- Exports cleaned data
- Creates HTML reports
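For automation workflows, the same command can be launched from a Python script. The sketch below uses only the `masterclean clean <file>` form shown above; the wrapper names `build_clean_command` and `run_clean` are hypothetical, and the call is skipped when the CLI is not on the PATH.

```python
import shutil
import subprocess


def build_clean_command(csv_path):
    """Build the argument list for `masterclean clean <file>`."""
    return ["masterclean", "clean", str(csv_path)]


def run_clean(csv_path):
    """Run the CLI if it is installed; return its exit code, or None if absent."""
    if shutil.which("masterclean") is None:
        return None
    completed = subprocess.run(build_clean_command(csv_path), check=False)
    return completed.returncode


print(build_clean_command("sample.csv"))
```

Passing the arguments as a list avoids shell quoting problems with file names that contain spaces.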
Show Version
masterclean version
Supported File Types
Currently supported:
- CSV (.csv)
Upcoming support:
- Excel (.xlsx)
- JSON
- Parquet
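A reader that accepts only CSV and copes with more than one text encoding might be sketched as follows. This is an illustration, not the code behind `read_file`: the function name `read_rows`, the fallback encoding order, and the error message are assumptions.

```python
import csv
import io
import tempfile
from pathlib import Path

SUPPORTED_SUFFIXES = {".csv"}
# Assumed fallback order for illustration; latin-1 accepts any byte sequence.
FALLBACK_ENCODINGS = ("utf-8-sig", "cp1252", "latin-1")


def read_rows(path):
    """Read a CSV file into a list of dicts, trying several encodings."""
    path = Path(path)
    if path.suffix.lower() not in SUPPORTED_SUFFIXES:
        raise ValueError(f"Unsupported file type: {path.suffix!r}")
    raw = path.read_bytes()
    for encoding in FALLBACK_ENCODINGS:
        try:
            text = raw.decode(encoding)
            break
        except UnicodeDecodeError:
            continue
    return list(csv.DictReader(io.StringIO(text, newline="")))


# Demo: a file saved as cp1252 is not valid UTF-8, so the fallback is used.
with tempfile.TemporaryDirectory() as folder:
    sample = Path(folder) / "sample.csv"
    sample.write_bytes("name,city\nJosé,Paris\n".encode("cp1252"))
    rows = read_rows(sample)
print(rows)
```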
Generated Outputs
MasterClean automatically generates:
cleaned_data.csv
report.html
These files contain:
- cleaned datasets
- validation summaries
- interactive analytics dashboards
- profiling insights
Python Usage
from masterclean import (
    read_file,
    clean_data,
    optimize_dtypes,
    validate_data,
    generate_profile,
    generate_charts,
    generate_report,
    export_data,
)
# Read dataset
df = read_file("sample.csv")
# Clean dataset
df = clean_data(df)
# Optimize datatypes
df = optimize_dtypes(df)
# Validate dataset
warnings = validate_data(df)
# Generate profile
profile = generate_profile(df)
# Generate charts
charts = generate_charts(df)
# Generate report
generate_report(
    df=df,
    warnings=warnings,
    profile=profile,
    charts=charts,
)
# Export cleaned dataset
export_data(df)
print("MasterClean pipeline completed successfully")
Examples
Example files are available inside:
examples/
Run CLI example:
masterclean clean examples/sample.csv
Run Python example:
python examples/python_example.py
Example Validation Output
VALIDATION WARNINGS
========================================
Negative values found in 'age' (1 rows)
Possible outliers detected in 'salary' (1 rows)
Invalid boolean-like values found in 'active': {'maybe'}
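A list of warning messages can be turned into a block like the one above with a few lines of Python. The helper name `format_warnings` and the "No issues found" fallback are assumptions; how MasterClean formats its own output is not documented here.

```python
def format_warnings(warnings, width=40):
    """Render warning messages under a title and a separator line."""
    lines = ["VALIDATION WARNINGS", "=" * width]
    # "No issues found" is a hypothetical message for the empty case.
    lines.extend(warnings if warnings else ["No issues found"])
    return "\n".join(lines)


messages = [
    "Negative values found in 'age' (1 rows)",
    "Possible outliers detected in 'salary' (1 rows)",
]
report_text = format_warnings(messages)
print(report_text)
```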
Dashboard Features
MasterClean generates a unified interactive HTML dashboard containing:
- Dataset summaries
- Validation warnings
- Profiling statistics
- Pie charts
- Histograms
- Boxplots
- Category analytics
- Interactive Plotly visualizations
Architecture
Read
↓
Clean
↓
Optimize
↓
Validate
↓
Profile
↓
Visualize
↓
Report
↓
Export
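The stage order above can be modeled as a list of named functions applied in sequence. The sketch below is a generic pattern, not MasterClean's internal design: `run_pipeline` and the three stand-in stages are hypothetical, and each stage passes along a shared context dictionary.

```python
def run_pipeline(context, stages):
    """Apply each (name, function) stage in order and record what ran."""
    history = []
    for name, stage in stages:
        context = stage(context)
        history.append(name)
    return context, history


# Stand-in stages for illustration; real stages would do far more.
def read(context):
    return {**context, "rows": [{"age": 31}, {"age": -4}, {"age": 31}]}


def clean(context):
    unique = []
    for row in context["rows"]:
        if row not in unique:  # drop exact duplicate rows
            unique.append(row)
    return {**context, "rows": unique}


def validate(context):
    negatives = sum(1 for row in context["rows"] if row["age"] < 0)
    warnings = []
    if negatives:
        warnings.append(f"Negative values found in 'age' ({negatives} rows)")
    return {**context, "warnings": warnings}


result, history = run_pipeline(
    {}, [("read", read), ("clean", clean), ("validate", validate)]
)
print(history)
print(result["warnings"])
```

Keeping the stages in a list makes it straightforward to skip, reorder, or add a step without touching the runner.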
Project Structure
masterclean/
├── cleaner.py
├── validator.py
├── datatypes.py
├── profiler.py
├── visualizer.py
├── report.py
├── exporter.py
├── reader.py
└── cli.py
examples/
├── sample.csv
├── python_example.py
└── cli_example.md
tests/
├── test_cleaner.py
├── test_validator.py
├── test_reader.py
├── test_report.py
└── test_visualizer.py
.github/workflows/
└── tests.yml
Testing
Run tests using:
python -m pytest
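For contributors adding tests, a pytest-style test file generally looks like the sketch below. The helper `drop_duplicates` is defined inline as a stand-in so the example runs on its own; it is not a function from the masterclean package, and the real tests in `tests/` may be organized differently.

```python
def drop_duplicates(rows):
    """Stand-in helper: remove duplicate dict rows, keeping the first occurrence."""
    seen = set()
    unique = []
    for row in rows:
        key = tuple(sorted(row.items()))
        if key not in seen:
            seen.add(key)
            unique.append(row)
    return unique


# pytest collects functions whose names start with "test_".
def test_drop_duplicates_keeps_first_occurrence():
    rows = [{"id": 1}, {"id": 1}, {"id": 2}]
    assert drop_duplicates(rows) == [{"id": 1}, {"id": 2}]


def test_drop_duplicates_handles_empty_input():
    assert drop_duplicates([]) == []
```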
Current Status:
- Automated tests passing
- GitHub Actions CI/CD passing
CI/CD
MasterClean uses GitHub Actions for:
- automated testing
- dependency validation
- continuous integration
Current Version
v1.0.0
Roadmap
Future improvements planned:
- Advanced schema validation
- Large dataset optimization
- Plugin architecture
- AI-powered cleaning suggestions
- Cloud deployment support
- Streamlit dashboard integration
Contributing
Contributions are welcome.
You can:
- report bugs
- suggest features
- improve documentation
- submit pull requests
License
MIT License
Author
Mohamed Faisal Maraicar N
GitHub: