Automated Data Cleaning, Validation and Analytics Toolkit

🚀 MasterClean

Automated Data Cleaning, Validation & Analytics Toolkit for Python.

MasterClean is a Python package that automates dataset cleaning, preprocessing, validation, profiling, visualization, and reporting with a single command.

It is designed for:

  • Data Analysts
  • Data Scientists
  • ML Engineers
  • Researchers
  • Students
  • Automation workflows

✨ Features

Data Cleaning

  • Automatic missing value handling
  • Duplicate row removal
  • Column standardization
  • String cleanup
  • Encoding-aware file loading
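
To make the cleaning steps above concrete, here is a minimal sketch that uses only the Python standard library. It is not MasterClean's implementation, which this README does not show. The helper names `standardize_column` and `clean_rows` and the `"unknown"` fill value are assumptions made for illustration.

```python
import csv
import io


def standardize_column(name: str) -> str:
    """Lower-case a header and join its words with underscores."""
    return "_".join(name.strip().lower().split())


def clean_rows(rows, fill_value="unknown"):
    """Strip strings, fill empty cells, and drop exact duplicate rows."""
    seen = set()
    cleaned = []
    for row in rows:
        normalized = {
            standardize_column(key): (
                value.strip() if value and value.strip() else fill_value
            )
            for key, value in row.items()
        }
        # A sorted tuple of items is hashable, so it can mark rows already seen.
        fingerprint = tuple(sorted(normalized.items()))
        if fingerprint not in seen:
            seen.add(fingerprint)
            cleaned.append(normalized)
    return cleaned


# Hypothetical sample data: messy headers, stray spaces, a duplicate, a gap.
raw = "Full Name, City \n Ada ,London\nAda,London\nGrace,\n"
rows = list(csv.DictReader(io.StringIO(raw)))
cleaned = clean_rows(rows)
print(cleaned)
```

Filling gaps with a constant is only one possible strategy; the README does not say which one MasterClean applies.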

Datatype Optimization

  • Automatic numeric conversion
  • Datetime detection
  • Integer optimization
  • Mixed datatype handling
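
The idea behind automatic conversion can be sketched with the standard library alone. This is an illustration under assumptions, not the package's code: `infer_value` and `convert_column` are hypothetical names, and the rule "convert a column only when every value parses to one type" is one simple way to handle mixed data. Integer downcasting is not covered here.

```python
from datetime import datetime


def infer_value(text: str):
    """Try int, then float, then ISO date; fall back to the stripped string."""
    text = text.strip()
    for convert in (int, float, datetime.fromisoformat):
        try:
            return convert(text)
        except ValueError:
            continue
    return text


def convert_column(values):
    """Convert a column only if its values agree on a single type."""
    converted = [infer_value(v) for v in values]
    kinds = {type(v) for v in converted}
    if kinds == {int, float}:
        # Integers mixed with floats are promoted to float.
        return [float(v) for v in converted]
    if len(kinds) == 1:
        return converted
    # Mixed types: keep the column as cleaned-up strings.
    return [v.strip() for v in values]


print(convert_column(["1", " 2", "3"]))
print(convert_column(["1", "2.5"]))
print(convert_column(["1", "x"]))
print(convert_column(["2024-01-15"]))
```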

Validation Engine

  • Negative value detection
  • Outlier detection
  • Invalid boolean detection
  • Dataset quality warnings

Analytics & Profiling

  • Automated dataset profiling
  • Numeric statistics
  • Categorical summaries
  • Memory usage analysis
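
As a rough illustration of what a column profile might contain, the sketch below computes numeric statistics or a categorical summary with the standard library. The function name `profile_column`, the shape of the returned dictionary, and the choice of the top three categories are assumptions; the actual output of `generate_profile` is not documented here, and memory usage is left out.

```python
import statistics
from collections import Counter


def profile_column(values):
    """Summarize one column: numeric stats or top categories, plus missing count."""
    present = [v for v in values if v is not None]
    summary = {"count": len(values), "missing": len(values) - len(present)}
    if present and all(isinstance(v, (int, float)) for v in present):
        summary.update(
            min=min(present),
            max=max(present),
            mean=statistics.fmean(present),
            median=statistics.median(present),
        )
    else:
        # Non-numeric column: report the most frequent values.
        summary["top"] = Counter(present).most_common(3)
    return summary


print(profile_column([10, 20, None, 30]))
print(profile_column(["a", "b", "a"]))
```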

Visualization Engine

  • Interactive Plotly dashboards
  • Histograms
  • Pie charts
  • Boxplots
  • Distribution analysis
  • Category analytics

Reporting

  • Unified HTML analytics dashboard
  • Validation summaries
  • Interactive charts
  • Automated report generation

Developer Features

  • Command Line Interface (CLI)
  • Automated testing with pytest
  • GitHub Actions CI/CD pipeline
  • Modular package architecture

📦 Installation

Install From PyPI

pip install masterclean

Development Installation

Clone Repository

git clone https://github.com/MohamedFaisal-11/masterclean.git
cd masterclean

Create Virtual Environment

python -m venv venv

Activate Environment

macOS / Linux

source venv/bin/activate

Windows

venv\Scripts\activate

Install Package

pip install -e .

🚀 CLI Usage

Clean Dataset

masterclean clean sample.csv

MasterClean automatically:

  • Reads datasets
  • Cleans missing values
  • Removes duplicates
  • Optimizes datatypes
  • Detects validation issues
  • Generates dashboards
  • Exports cleaned data
  • Creates HTML reports

Show Version

masterclean version
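
For readers curious how a two-command interface like this can be wired up, here is a hedged sketch using `argparse` from the standard library. The README does not say which CLI library `cli.py` uses, so `build_parser` and the argument name `path` are assumptions for illustration only.

```python
import argparse


def build_parser():
    """Build a parser with 'clean <path>' and 'version' subcommands."""
    parser = argparse.ArgumentParser(prog="masterclean")
    commands = parser.add_subparsers(dest="command", required=True)
    clean = commands.add_parser("clean", help="clean a CSV dataset")
    clean.add_argument("path", help="path to the input CSV file")
    commands.add_parser("version", help="print the installed version")
    return parser


# Parse an explicit argument list instead of sys.argv so the sketch runs anywhere.
args = build_parser().parse_args(["clean", "sample.csv"])
print(args.command, args.path)
```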

📁 Supported File Types

Currently supported:

  • CSV (.csv)

Upcoming support:

  • Excel (.xlsx)
  • JSON
  • Parquet

📂 Generated Outputs

MasterClean automatically generates:

cleaned_data.csv
report.html

These files contain:

  • cleaned datasets
  • validation summaries
  • interactive analytics dashboards
  • profiling insights
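
The sketch below shows one way two such files could be written with the standard library. It is a simplified stand-in, not the package's exporter or report generator: `export_outputs` is a hypothetical helper, the HTML is a bare list rather than an interactive dashboard, and the function assumes at least one row.

```python
import csv
import html
import tempfile
from pathlib import Path


def export_outputs(rows, warnings, out_dir):
    """Write cleaned_data.csv and a minimal report.html into out_dir."""
    out_dir = Path(out_dir)
    csv_path = out_dir / "cleaned_data.csv"
    with csv_path.open("w", newline="", encoding="utf-8") as handle:
        writer = csv.DictWriter(handle, fieldnames=list(rows[0]))
        writer.writeheader()
        writer.writerows(rows)

    # Escape warning text so it cannot break the HTML markup.
    items = "".join(f"<li>{html.escape(w)}</li>" for w in warnings)
    report_path = out_dir / "report.html"
    report_path.write_text(
        f"<html><body><h1>Report</h1><ul>{items}</ul></body></html>",
        encoding="utf-8",
    )
    return csv_path, report_path


# Write into a temporary directory so the example leaves nothing behind.
with tempfile.TemporaryDirectory() as tmp:
    csv_path, report_path = export_outputs(
        [{"name": "Ada", "age": 36}],
        ["Negative values found in 'age' (1 rows)"],
        tmp,
    )
    csv_text = csv_path.read_text(encoding="utf-8")
    report_text = report_path.read_text(encoding="utf-8")

print(csv_text)
```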

🐍 Python Usage

from masterclean import (
    read_file,
    clean_data,
    optimize_dtypes,
    validate_data,
    generate_profile,
    generate_charts,
    generate_report,
    export_data
)

# Read dataset
df = read_file("sample.csv")

# Clean dataset
df = clean_data(df)

# Optimize datatypes
df = optimize_dtypes(df)

# Validate dataset
warnings = validate_data(df)

# Generate profile
profile = generate_profile(df)

# Generate charts
charts = generate_charts(df)

# Generate report
generate_report(
    df=df,
    warnings=warnings,
    profile=profile,
    charts=charts
)

# Export cleaned dataset
export_data(df)

print("MasterClean pipeline completed successfully")

📚 Examples

Example files are available inside:

examples/

Run CLI example:

masterclean clean examples/sample.csv

Run Python example:

python examples/python_example.py

📊 Example Validation Output

VALIDATION WARNINGS
========================================

⚠ Negative values found in 'age' (1 rows)

⚠ Possible outliers detected in 'salary' (1 rows)

⚠ Invalid boolean-like values found in 'active': {'maybe'}
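
Warnings of this kind can be produced with a few small checks. The sketch below is an assumption-laden illustration, not MasterClean's validator: the 1.5 × IQR rule for outliers, the `BOOLEAN_LIKE` vocabulary, the helper names, and the sample data are all chosen for the example, and the warning-sign prefix is omitted.

```python
import statistics

# Assumed vocabulary of accepted boolean-like strings.
BOOLEAN_LIKE = {"true", "false", "yes", "no", "1", "0"}


def count_negatives(values):
    """Count values below zero."""
    return sum(1 for v in values if v < 0)


def count_iqr_outliers(values):
    """Count values more than 1.5 * IQR outside the first and third quartiles."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    fence = 1.5 * (q3 - q1)
    return sum(1 for v in values if v < q1 - fence or v > q3 + fence)


def invalid_booleans(values):
    """Return the set of values that are not recognized as boolean-like."""
    return {v for v in values if str(v).strip().lower() not in BOOLEAN_LIKE}


# Hypothetical sample columns.
age = [25, 31, -4, 42]
salary = [48000, 49000, 50000, 51000, 52000, 53000, 54000, 250000]
active = ["yes", "no", "maybe", "yes"]

messages = [
    f"Negative values found in 'age' ({count_negatives(age)} rows)",
    f"Possible outliers detected in 'salary' ({count_iqr_outliers(salary)} rows)",
    f"Invalid boolean-like values found in 'active': {invalid_booleans(active)}",
]
for line in messages:
    print(line)
```

The IQR rule needs a reasonable number of rows to be meaningful: with very few values, a single extreme point inflates the quartiles and can hide itself.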

📈 Dashboard Features

MasterClean generates a unified interactive HTML dashboard containing:

  • Dataset summaries
  • Validation warnings
  • Profiling statistics
  • Pie charts
  • Histograms
  • Boxplots
  • Category analytics
  • Interactive Plotly visualizations

🏗 Architecture

Read
   ↓
Clean
   ↓
Optimize
   ↓
Validate
   ↓
Profile
   ↓
Visualize
   ↓
Report
   ↓
Export

📂 Project Structure

masterclean/
│
├── cleaner.py
├── validator.py
├── datatypes.py
├── profiler.py
├── visualizer.py
├── report.py
├── exporter.py
├── reader.py
├── cli.py
│
examples/
│
├── sample.csv
├── python_example.py
├── cli_example.md
│
tests/
│
├── test_cleaner.py
├── test_validator.py
├── test_reader.py
├── test_report.py
├── test_visualizer.py
│
.github/workflows/
│
└── tests.yml

🧪 Testing

Run tests using:

python -m pytest

Current Status:

  • ✅ Automated tests passing
  • ✅ GitHub Actions CI/CD passing

🔄 CI/CD

MasterClean uses GitHub Actions for:

  • automated testing
  • dependency validation
  • continuous integration

📌 Current Version

v1.2.0

🛣 Roadmap

Future improvements planned:

  • Advanced schema validation
  • Large dataset optimization
  • Plugin architecture
  • AI-powered cleaning suggestions
  • Cloud deployment support
  • Streamlit dashboard integration

🤝 Contributing

Contributions are welcome.

You can:

  • report bugs
  • suggest features
  • improve documentation
  • submit pull requests

📄 License

MIT License


👨‍💻 Author

Mohamed Faisal Maraicar N

GitHub:

https://github.com/MohamedFaisal-11/masterclean

