Automated Data Cleaning, Validation and Analytics Toolkit

🚀 MasterClean

Automated Data Cleaning, Validation & Analytics Toolkit for Python.

MasterClean is a professional Python package that automates:

  • data cleaning
  • preprocessing
  • validation
  • profiling
  • visualization
  • reporting
  • analytics

using simple CLI commands or the Python API.

Designed for:

  • Data Analysts
  • Data Scientists
  • ML Engineers
  • Researchers
  • Students
  • Automation workflows

✨ Features

🧹 Advanced Data Cleaning

  • Missing value handling
  • Duplicate row removal
  • Empty string cleanup
  • Whitespace cleanup
  • Column standardization
  • Datetime conversion
  • Smart categorical filling
  • Automatic preprocessing pipeline
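A few of these steps can be illustrated with plain Python. This is a minimal sketch on a list of dictionaries, not MasterClean's implementation: the helper names `standardize_column` and `clean_rows` are hypothetical, and the snake_case naming rule is an assumption about what "column standardization" means.

```python
def standardize_column(name):
    """Lowercase a column name and join its words with underscores."""
    return "_".join(name.strip().lower().split())


def clean_rows(rows):
    """Return cleaned copies of `rows` (a list of dicts), without duplicates."""
    cleaned, seen = [], set()
    for row in rows:
        new_row = {}
        for key, value in row.items():
            if isinstance(value, str):
                value = value.strip()      # whitespace cleanup
                if value == "":
                    value = None           # empty strings become missing values
            new_row[standardize_column(key)] = value
        # Rows are compared after cleaning, so "Ada" and "  Ada " count as equal.
        fingerprint = tuple(sorted(new_row.items(), key=lambda kv: kv[0]))
        if fingerprint not in seen:        # duplicate row removal
            seen.add(fingerprint)
            cleaned.append(new_row)
    return cleaned


rows = [
    {" First Name ": "  Ada ", "City": ""},
    {" First Name ": "Ada", "City": "   "},
    {" First Name ": "Grace", "City": "NYC"},
]
result = clean_rows(rows)
```

Here the first two rows become identical once whitespace and empty strings are normalized, so only one of them is kept.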

⚡ Datatype Optimization

  • Integer optimization
  • Float optimization
  • Boolean conversion
  • Category optimization
  • Datetime detection
  • Memory usage reduction
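Integer optimization usually means storing a column in the narrowest integer type that can hold all of its values. The sketch below shows that idea only; the function name `smallest_int_type` is hypothetical, and whether MasterClean selects types this way is an assumption.

```python
# Inclusive value ranges for signed integer widths, smallest first.
INT_RANGES = [
    ("int8", -2**7, 2**7 - 1),
    ("int16", -2**15, 2**15 - 1),
    ("int32", -2**31, 2**31 - 1),
    ("int64", -2**63, 2**63 - 1),
]


def smallest_int_type(values):
    """Return the name of the narrowest signed type that fits every value."""
    lo, hi = min(values), max(values)
    for name, type_min, type_max in INT_RANGES:
        if type_min <= lo and hi <= type_max:
            return name
    raise OverflowError("values do not fit in a 64-bit signed integer")


age_type = smallest_int_type([0, 18, 120])      # fits in int8
count_type = smallest_int_type([0, 200])        # 200 > 127, needs int16
balance_type = smallest_int_type([-40000, 5])   # below -32768, needs int32
```

A column stored as 64-bit integers that fits in `int8` needs one eighth of the memory after such a downcast.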

🛡 Advanced Validation Engine

  • Negative value detection
  • Outlier detection
  • Invalid boolean detection
  • Email validation
  • Phone validation
  • Duplicate percentage warnings
  • Missing value percentage analysis
  • Mixed datatype detection
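The method behind outlier detection is not stated here. One common choice is the interquartile range (IQR) rule, sketched below with the standard library. The function name `iqr_outliers` and the multiplier `k=1.5` are assumptions for illustration.

```python
import statistics


def iqr_outliers(values, k=1.5):
    """Return the values lying more than k * IQR outside the quartiles."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < low or v > high]


salaries = [40, 42, 45, 47, 50, 52, 55, 400]
outliers = iqr_outliers(salaries)
```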

📊 Advanced Profiling

  • Dataset health score
  • Missing value summaries
  • Datatype analytics
  • Memory usage analysis
  • Numeric statistics
  • Categorical summaries
  • Dataset overview metrics
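As an illustration of one profiling metric, a missing value summary can be computed as the percentage of missing entries per column. This is a standard-library sketch with a hypothetical function name, `missing_summary`; it does not reproduce MasterClean's profile format.

```python
def missing_summary(rows):
    """Return {column: percent of rows where the value is None or absent}."""
    columns = {key for row in rows for key in row}
    total = len(rows)
    summary = {}
    for column in sorted(columns):
        missing = sum(1 for row in rows if row.get(column) is None)
        summary[column] = round(100 * missing / total, 1) if total else 0.0
    return summary


rows = [
    {"age": 31, "email": "a@example.com"},
    {"age": None, "email": "b@example.com"},
    {"age": 27, "email": None},
    {"age": None, "email": "d@example.com"},
]
summary = missing_summary(rows)
```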

📈 Interactive Visualization Engine

  • Plotly dashboards
  • Histograms
  • Boxplots
  • Pie charts
  • Correlation heatmaps
  • Missing value charts
  • Interactive analytics dashboards

📄 Reporting System

  • Unified HTML analytics dashboard
  • Validation summaries
  • Dataset overview cards
  • Interactive visualizations
  • Automated report generation

🖥 Professional CLI Toolkit

MasterClean provides the following commands.

Full Automated Pipeline

masterclean clean data.csv

Runs:

  • cleaning
  • optimization
  • validation
  • profiling
  • visualization
  • reporting
  • exporting

Validation Only

masterclean validate data.csv

Dataset Profiling

masterclean profile data.csv

Dashboard Generation

masterclean dashboard data.csv

Show Version

masterclean version
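A command layout like this maps naturally onto `argparse` subcommands. The sketch below is a hypothetical parser that mirrors the commands listed above. How the package's own `cli.py` is written is not shown here, so the `build_parser` function and the `path` argument name are assumptions.

```python
import argparse


def build_parser():
    """Build a parser with one subcommand per CLI command listed above."""
    parser = argparse.ArgumentParser(prog="masterclean")
    subparsers = parser.add_subparsers(dest="command", required=True)
    # These four commands each take the input file as a positional argument.
    for name in ("clean", "validate", "profile", "dashboard"):
        sub = subparsers.add_parser(name)
        sub.add_argument("path", help="input file, e.g. data.csv")
    # `version` takes no arguments.
    subparsers.add_parser("version")
    return parser


args = build_parser().parse_args(["validate", "data.csv"])
version_args = build_parser().parse_args(["version"])
```

After parsing, `args.command` selects which stage to run and `args.path` holds the input file.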

📦 Installation

Install from PyPI

pip install masterclean

Upgrade to Latest Version

pip install --upgrade masterclean

🐍 Python Usage

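from masterclean import (
    read_file, clean_data, optimize_dtypes, validate_data,
    generate_profile, generate_charts, generate_report, export_data,
)

# Load the dataset; the extension is kept so the export can match it.
df, file_extension = read_file("data.csv")

# Clean the data, then optimize column datatypes.
df = clean_data(df)
df = optimize_dtypes(df)

# Collect validation warnings, profile metrics, and charts.
warnings = validate_data(df)
profile = generate_profile(df)
charts = generate_charts(df)

# Build the report from the intermediate results.
generate_report(df=df, warnings=warnings, profile=profile, charts=charts)

# Write the cleaned data under the base name "cleaned_data".
export_data(df, "cleaned_data", file_extension)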

📂 Supported File Formats

Format    Supported
CSV       ✅
XLSX      ✅
XLS       ✅

🔄 Same-Format Export System

MasterClean writes the cleaned output in the same format as the input, except that XLS input is exported as XLSX.

Input     Output
CSV       cleaned_data.csv
XLSX      cleaned_data.xlsx
XLS       cleaned_data.xlsx
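The mapping in the table can be expressed as a small lookup. This is a sketch of the naming rule only, with a hypothetical helper `output_name`; it does not call MasterClean's `export_data`.

```python
from pathlib import Path

# Output extension for each supported input extension (from the table above).
OUTPUT_EXTENSION = {".csv": ".csv", ".xlsx": ".xlsx", ".xls": ".xlsx"}


def output_name(input_path, stem="cleaned_data"):
    """Return the export file name for a given input file."""
    suffix = Path(input_path).suffix.lower()
    if suffix not in OUTPUT_EXTENSION:
        raise ValueError(f"unsupported file format: {suffix!r}")
    return stem + OUTPUT_EXTENSION[suffix]


csv_name = output_name("data.csv")
xls_name = output_name("Sales.XLS")
```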

📊 Example Validation Output

VALIDATION WARNINGS
========================================

⚠ Negative values found in 'salary' (3 rows)

⚠ Invalid email values found in 'email' (5 rows)

⚠ High duplicate rows detected (14.2%)

⚠ Mixed datatypes detected in 'age'
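Warnings in this style could be produced by checks along the following lines. This is a standard-library sketch, not MasterClean's validator: the function `validate_rows`, the loose email pattern, and the 10% duplicate threshold are all assumptions.

```python
import re

# Deliberately loose pattern: something@something.tld
EMAIL_RE = re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$")


def validate_rows(rows, numeric_column, email_column, duplicate_threshold=10.0):
    """Return a list of warning strings for a list of dict rows."""
    warnings = []

    negatives = sum(
        1 for row in rows
        if isinstance(row.get(numeric_column), (int, float))
        and row[numeric_column] < 0
    )
    if negatives:
        warnings.append(
            f"Negative values found in '{numeric_column}' ({negatives} rows)"
        )

    # Missing emails are treated as invalid in this sketch.
    bad_emails = sum(
        1 for row in rows
        if not EMAIL_RE.match(str(row.get(email_column) or ""))
    )
    if bad_emails:
        warnings.append(
            f"Invalid email values found in '{email_column}' ({bad_emails} rows)"
        )

    # A row is a duplicate if an identical row already exists.
    unique = {tuple(sorted(row.items())) for row in rows}
    duplicate_pct = 100 * (len(rows) - len(unique)) / len(rows) if rows else 0.0
    if duplicate_pct > duplicate_threshold:
        warnings.append(f"High duplicate rows detected ({duplicate_pct:.1f}%)")
    return warnings


rows = [
    {"salary": 5000, "email": "ann@example.com"},
    {"salary": -100, "email": "bob@example"},
    {"salary": 5000, "email": "ann@example.com"},
    {"salary": 4200, "email": "cy@example.org"},
]
found = validate_rows(rows, "salary", "email")
```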

🏗 Architecture

Read
   ↓
Clean
   ↓
Optimize
   ↓
Validate
   ↓
Profile
   ↓
Visualize
   ↓
Report
   ↓
Export
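A linear flow like this can be modeled as an ordered list of stages applied one after another. The sketch below is a simplification with toy stages and a hypothetical `run_pipeline` helper. In the real package some stages (validation, profiling) return side results rather than transformed data, so this shows the ordering idea only.

```python
from functools import reduce


def run_pipeline(data, stages):
    """Pass `data` through each (name, function) stage in order."""
    return reduce(lambda current, stage: stage[1](current), stages, data)


# Toy stages standing in for the first steps of the diagram.
stages = [
    ("clean", lambda values: [v for v in values if v is not None]),
    ("optimize", lambda values: [int(v) for v in values]),
    ("validate", lambda values: [v for v in values if v >= 0]),
]
result = run_pipeline(["3", None, "-1", "7"], stages)
```

Keeping the stages in a list makes the order explicit and lets a caller run a subset, which is roughly what separate CLI commands for validation or profiling imply.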

📁 Project Structure

masterclean/
│
├── cleaner.py
├── validator.py
├── datatypes.py
├── profiler.py
├── visualizer.py
├── report.py
├── exporter.py
├── reader.py
├── cli.py
├── __init__.py
│
tests/
│
├── test_cleaner.py
├── test_validator.py
├── test_reader.py
├── test_report.py
├── test_visualizer.py
│
.github/workflows/
│
└── tests.yml

🧪 Testing

Run tests using:

python -m pytest

🔄 CI/CD

MasterClean uses GitHub Actions for:

  • automated testing
  • dependency validation
  • continuous integration

🛣 Roadmap

Future improvements planned:

  • Streamlit web application
  • AI-powered cleaning suggestions
  • Large dataset optimization
  • Schema validation engine
  • Cloud deployment support
  • Plugin architecture
  • Real-time analytics dashboards

🤝 Contributing

Contributions are welcome.

You can:

  • report bugs
  • suggest features
  • improve documentation
  • submit pull requests

📄 License

MIT License


👨‍💻 Author

Mohamed Faisal Maraicar N

GitHub: https://github.com/MohamedFaisal-11/masterclean
