Automated Data Cleaning, Validation and Analytics Toolkit

🚀 MasterClean

Automated Data Cleaning, Validation & Analytics Toolkit for Python.

MasterClean is a Python package that automates dataset cleaning, preprocessing, validation, profiling, visualization, and reporting with a single command.

It is designed for:

  • Data Analysts
  • Data Scientists
  • ML Engineers
  • Researchers
  • Students
  • Automation workflows

✨ Features

Data Cleaning

  • Automatic missing value handling
  • Duplicate row removal
  • Column standardization
  • String cleanup
  • Encoding-aware file loading
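
To make the cleaning steps above concrete, here is a minimal, dependency-free sketch of header standardization, string cleanup, missing-value filling, and duplicate removal. It is not MasterClean's implementation: `standardize_column`, `clean_rows`, `MISSING_TOKENS`, and the `"unknown"` fill value are all hypothetical names and choices made for illustration.

```python
import re

# Assumed spellings that count as "missing"; the real package may use others.
MISSING_TOKENS = {"", "na", "n/a", "nan", "null", "none"}


def standardize_column(name):
    """Lower-case a header and collapse non-alphanumerics into underscores."""
    return re.sub(r"[^0-9a-z]+", "_", name.strip().lower()).strip("_")


def clean_rows(rows, fill_value="unknown"):
    """Standardize headers, tidy strings, fill missing cells, drop duplicates."""
    cleaned, seen = [], set()
    for row in rows:
        new_row = {}
        for key, value in row.items():
            # Collapse runs of whitespace and trim both ends.
            text = "" if value is None else " ".join(str(value).split())
            if text.lower() in MISSING_TOKENS:
                text = fill_value
            new_row[standardize_column(key)] = text
        fingerprint = tuple(sorted(new_row.items()))
        if fingerprint not in seen:  # keep only the first copy of each row
            seen.add(fingerprint)
            cleaned.append(new_row)
    return cleaned


raw = [
    {" First Name ": "  Ada   Lovelace ", "City": "London"},
    {" First Name ": "Ada Lovelace", "City": "London"},
    {" First Name ": "Alan Turing", "City": None},
]
print(clean_rows(raw))
```

Duplicates are detected after cleanup here, so rows that differ only in stray whitespace collapse into one.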

Datatype Optimization

  • Automatic numeric conversion
  • Datetime detection
  • Integer optimization
  • Mixed datatype handling
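
One plausible way to approach numeric conversion and datetime detection is to try progressively looser parsers on a whole column and fall back to strings when values are mixed. The sketch below assumes every cell is a string and uses only the standard library; `infer_column` is a hypothetical helper, not part of the MasterClean API.

```python
from datetime import datetime


def infer_column(values):
    """Return (kind, converted_values) for a list of string cells."""
    casters = (
        ("int", int),
        ("float", float),
        ("datetime", datetime.fromisoformat),
    )
    for kind, caster in casters:
        try:
            # The whole column must parse; one bad cell rejects this type.
            return kind, [caster(value) for value in values]
        except ValueError:
            continue
    # Mixed or unparseable columns are left as strings.
    return "string", list(values)


print(infer_column(["1", "2", "3"]))
print(infer_column(["1", "2.5"]))
print(infer_column(["2024-01-05", "2024-02-10"]))
print(infer_column(["1", "x"]))
```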

Validation Engine

  • Negative value detection
  • Outlier detection
  • Invalid boolean detection
  • Dataset quality warnings
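
The README does not say which outlier rule the validation engine applies. A common choice is the interquartile-range (IQR) fence, sketched below with the standard library; treat `iqr_outliers` and the multiplier `k=1.5` as assumptions rather than MasterClean's actual behavior.

```python
import statistics


def iqr_outliers(values, k=1.5):
    """Return the values lying outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    spread = q3 - q1
    low, high = q1 - k * spread, q3 + k * spread
    return [value for value in values if value < low or value > high]


salaries = [42_000, 45_000, 47_000, 50_000, 52_000, 55_000, 900_000]
print(iqr_outliers(salaries))
```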

Analytics & Profiling

  • Automated dataset profiling
  • Numeric statistics
  • Categorical summaries
  • Memory usage analysis
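
As a rough illustration of what a per-column profile might contain, the sketch below reports counts, missing cells, and either numeric statistics or the most frequent categories. `profile_column` and the shape of its result are hypothetical; the structure returned by MasterClean's own profiler is not documented here.

```python
import statistics
from collections import Counter


def profile_column(values):
    """Summarize one column: numeric stats for numbers, top categories otherwise."""
    present = [value for value in values if value is not None]
    summary = {"count": len(values), "missing": len(values) - len(present)}
    is_numeric = present and all(
        isinstance(value, (int, float)) and not isinstance(value, bool)
        for value in present
    )
    if is_numeric:
        summary.update(
            min=min(present),
            max=max(present),
            mean=statistics.mean(present),
            median=statistics.median(present),
        )
    else:
        # Keep only the three most frequent categories.
        summary["top"] = Counter(present).most_common(3)
    return summary


print(profile_column([1, 2, 3, None]))
print(profile_column(["a", "b", "a"]))
```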

Visualization Engine

  • Interactive Plotly dashboards
  • Histograms
  • Pie charts
  • Boxplots
  • Distribution analysis
  • Category analytics

Reporting

  • Unified HTML analytics dashboard
  • Validation summaries
  • Interactive charts
  • Automated report generation

Developer Features

  • Command Line Interface (CLI)
  • Automated testing with pytest
  • GitHub Actions CI/CD pipeline
  • Modular package architecture

📦 Installation

Install From PyPI

pip install masterclean

Development Installation

Clone Repository

git clone https://github.com/MohamedFaisal-11/masterclean.git
cd masterclean

Create Virtual Environment

python -m venv venv

Activate Environment

macOS / Linux

source venv/bin/activate

Windows

venv\Scripts\activate

Install Package

pip install -e .

🚀 CLI Usage

Clean Dataset

masterclean clean sample.csv

MasterClean automatically:

  • Reads datasets
  • Cleans missing values
  • Removes duplicates
  • Optimizes datatypes
  • Detects validation issues
  • Generates dashboards
  • Exports cleaned data
  • Creates HTML reports

Show Version

masterclean version
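
For readers curious how a command shape like `masterclean clean <file>` and `masterclean version` can be wired up, here is a minimal `argparse` sketch. It only mirrors the two subcommands shown above; the real entry point lives in the package's `cli.py` and may be built differently, and `build_parser` is a hypothetical name.

```python
import argparse


def build_parser():
    """Build a parser with 'clean <path>' and 'version' subcommands."""
    parser = argparse.ArgumentParser(prog="masterclean")
    subcommands = parser.add_subparsers(dest="command", required=True)
    clean = subcommands.add_parser("clean", help="clean a dataset")
    clean.add_argument("path", help="path to a CSV file")
    subcommands.add_parser("version", help="print the installed version")
    return parser


# Parse an explicit argument list instead of sys.argv for demonstration.
args = build_parser().parse_args(["clean", "sample.csv"])
print(args.command, args.path)
```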

📁 Supported File Types

Currently supported:

  • CSV (.csv)

Upcoming support:

  • Excel (.xlsx)
  • JSON
  • Parquet
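
Encoding-aware CSV loading is often done by trying a short list of candidate encodings in order. The sketch below shows that idea with the standard library only; the candidate list and the `read_csv_bytes` helper are assumptions, and MasterClean's reader may detect encodings differently.

```python
import csv
import io


def read_csv_bytes(data, encodings=("utf-8-sig", "cp1252", "latin-1")):
    """Decode CSV bytes with the first encoding that works, then parse rows."""
    for encoding in encodings:
        try:
            text = data.decode(encoding)
        except UnicodeDecodeError:
            continue  # try the next candidate
        rows = list(csv.DictReader(io.StringIO(text, newline="")))
        return encoding, rows
    raise ValueError("none of the candidate encodings could decode the input")


payload = "name,city\nJosé,Málaga\n".encode("cp1252")
print(read_csv_bytes(payload))
```

Because `latin-1` accepts any byte sequence, it works as a last-resort fallback, though it can silently mislabel text written in another encoding.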

📂 Generated Outputs

MasterClean automatically generates:

cleaned_data.csv
report.html

These files contain:

  • cleaned datasets
  • validation summaries
  • interactive analytics dashboards
  • profiling insights
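
The following sketch shows one simple way to produce a cleaned CSV plus an HTML summary with the same file names. It is a stand-in for illustration only: `write_outputs` is hypothetical, the HTML is deliberately bare, and the real report is an interactive dashboard.

```python
import csv
import html
import tempfile
from pathlib import Path


def write_outputs(rows, warnings, out_dir):
    """Write cleaned_data.csv and a minimal report.html into out_dir."""
    out = Path(out_dir)
    data_path = out / "cleaned_data.csv"
    with data_path.open("w", newline="", encoding="utf-8") as handle:
        writer = csv.DictWriter(handle, fieldnames=list(rows[0]))
        writer.writeheader()
        writer.writerows(rows)

    # Escape warning text so it cannot break the markup.
    items = "".join(f"<li>{html.escape(text)}</li>" for text in warnings)
    report_path = out / "report.html"
    report_path.write_text(
        f"<!doctype html><title>Report</title><h1>Validation</h1><ul>{items}</ul>",
        encoding="utf-8",
    )
    return data_path, report_path


with tempfile.TemporaryDirectory() as tmp:
    data_path, report_path = write_outputs(
        [{"name": "Ada", "age": "36"}], ["age < 0 in 0 rows"], tmp
    )
    print(report_path.read_text(encoding="utf-8"))
```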

🐍 Python Usage

from masterclean import (
    read_file,
    clean_data,
    optimize_dtypes,
    validate_data,
    generate_profile,
    generate_charts,
    generate_report,
    export_data
)

# Read dataset
df = read_file("sample.csv")

# Clean dataset
df = clean_data(df)

# Optimize datatypes
df = optimize_dtypes(df)

# Validate dataset
warnings = validate_data(df)

# Generate profile
profile = generate_profile(df)

# Generate charts
charts = generate_charts(df)

# Generate report
generate_report(
    df=df,
    warnings=warnings,
    profile=profile,
    charts=charts
)

# Export cleaned dataset
export_data(df)

print("MasterClean pipeline completed successfully")

📚 Examples

Example files are available inside:

examples/

Run CLI example:

masterclean clean examples/sample.csv

Run Python example:

python examples/python_example.py

📊 Example Validation Output

VALIDATION WARNINGS
========================================

⚠ Negative values found in 'age' (1 rows)

⚠ Possible outliers detected in 'salary' (1 rows)

⚠ Invalid boolean-like values found in 'active': {'maybe'}
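
Warnings of this kind can be produced by simple per-column checks. The sketch below covers the negative-value and boolean checks and phrases its messages similarly to the output above; the `validate` function, its parameters, and the accepted boolean spellings are assumptions, not MasterClean's `validate_data`.

```python
# Assumed spellings accepted as boolean-like values.
BOOLEAN_TOKENS = {"true", "false", "yes", "no", "1", "0"}


def validate(rows, numeric=(), boolean=()):
    """Return warning messages for negative numbers and odd boolean values."""
    messages = []
    for column in numeric:
        negatives = sum(1 for row in rows if row[column] < 0)
        if negatives:
            messages.append(
                f"Negative values found in '{column}' ({negatives} rows)"
            )
    for column in boolean:
        normalized = (str(row[column]).strip().lower() for row in rows)
        invalid = sorted({value for value in normalized if value not in BOOLEAN_TOKENS})
        if invalid:
            messages.append(
                f"Invalid boolean-like values found in '{column}': {invalid}"
            )
    return messages


rows = [
    {"age": 34, "active": "yes"},
    {"age": -2, "active": "maybe"},
    {"age": 51, "active": "No"},
]
for message in validate(rows, numeric=["age"], boolean=["active"]):
    print(message)
```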

📈 Dashboard Features

MasterClean generates a unified interactive HTML dashboard containing:

  • Dataset summaries
  • Validation warnings
  • Profiling statistics
  • Pie charts
  • Histograms
  • Boxplots
  • Category analytics
  • Interactive Plotly visualizations

🖼 Dashboard Preview

[Image: dashboard preview]

🏗 Architecture

Read
   ↓
Clean
   ↓
Optimize
   ↓
Validate
   ↓
Profile
   ↓
Visualize
   ↓
Report
   ↓
Export
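
A linear flow like this can be modeled as a list of stage functions that each take and return a shared context. The sketch below uses three toy stages to show the pattern; `run_pipeline` and the stage functions are hypothetical and far simpler than the package's modules.

```python
def run_pipeline(context, stages):
    """Pass the context through each stage in order and return the result."""
    for stage in stages:
        context = stage(context)
    return context


def clean(context):
    # Drop rows whose 'age' cell is missing.
    rows = [row for row in context["rows"] if row.get("age") is not None]
    return {**context, "rows": rows}


def validate(context):
    warnings = [
        f"negative age in row {index}"
        for index, row in enumerate(context["rows"])
        if row["age"] < 0
    ]
    return {**context, "warnings": warnings}


def profile(context):
    return {**context, "profile": {"rows": len(context["rows"])}}


result = run_pipeline(
    {"rows": [{"age": 30}, {"age": None}, {"age": -1}]},
    [clean, validate, profile],
)
print(result)
```

Keeping each stage a pure function of the context makes it easy to reorder, skip, or test stages on their own.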

📂 Project Structure

masterclean/
│
├── cleaner.py
├── validator.py
├── datatypes.py
├── profiler.py
├── visualizer.py
├── report.py
├── exporter.py
├── reader.py
├── cli.py
│
examples/
│
├── sample.csv
├── python_example.py
├── cli_example.md
│
tests/
│
├── test_cleaner.py
├── test_validator.py
├── test_reader.py
├── test_report.py
├── test_visualizer.py
│
.github/workflows/
│
└── tests.yml

🧪 Testing

Run tests using:

python -m pytest

Current Status:

  • ✅ Automated tests passing
  • ✅ GitHub Actions CI/CD passing

🔄 CI/CD

MasterClean uses GitHub Actions for:

  • automated testing
  • dependency validation
  • continuous integration

📌 Current Version

v1.1.0

🛣 Roadmap

Future improvements planned:

  • Advanced schema validation
  • Large dataset optimization
  • Plugin architecture
  • AI-powered cleaning suggestions
  • Cloud deployment support
  • Streamlit dashboard integration

🤝 Contributing

Contributions are welcome.

You can:

  • report bugs
  • suggest features
  • improve documentation
  • submit pull requests

📄 License

MIT License


👨‍💻 Author

Mohamed Faisal Maraicar N

GitHub:

https://github.com/MohamedFaisal-11/masterclean
