The Ultimate Data Cleaning & Analysis Toolkit

🧹 CleanEngine

🚀 The Ultimate Data Cleaning & Analysis CLI Tool
Transform messy datasets into clean, insight-rich data with intelligent cleaning and advanced ML analysis.

CleanEngine is a powerful command-line toolkit that handles missing values, removes duplicates, detects outliers, and provides comprehensive statistical analysis using machine learning techniques.

📊 Comparison with Other Tools

| Feature | CleanEngine 🧹 | pandas-profiling | Sweetviz | Great Expectations |
|---|---|---|---|---|
| Data Cleaning | ✅ Complete Pipeline | ❌ No | ❌ No | ⚠️ Limited |
| Profiling & Stats | ✅ Advanced Analytics | ✅ Yes | ✅ Yes | ⚠️ Minimal |
| Correlation Analysis | ✅ Multi-Method | ✅ Yes | ✅ Yes | ❌ No |
| Feature Importance | ✅ ML-Powered | ❌ No | ❌ No | ❌ No |
| Clustering & Patterns | ✅ 3 Algorithms | ❌ No | ❌ No | ❌ No |
| Anomaly Detection | ✅ 2 Methods | ❌ No | ❌ No | ❌ No |
| Rule Engine | ✅ YAML-Driven | ❌ No | ❌ No | ✅ Yes |
| Interfaces | ✅ CLI + GUI + Watcher | CLI/Notebook | Notebook | CLI/Notebook |
| Automation | ✅ Folder Watcher | ❌ No | ❌ No | ✅ Yes |

🚀 Installation

Using pip (Recommended)

pip install cleanengine

From source

git clone https://github.com/I-invincib1e/CleanEngine.git
cd CleanEngine
pip install -e .

Verify Installation

cleanengine --help

🎯 Quick Start

Clean a CSV file

cleanengine clean data.csv

Analyze data without cleaning

cleanengine analyze data.xlsx

Generate sample data to test

cleanengine samples

Launch web interface

cleanengine gui

📋 CLI Commands

Core Commands

| Command | Flags | Description | Example |
|---|---|---|---|
| clean | --output, -o, --verbose, -v, --force | Clean a dataset with full pipeline | cleanengine clean data.csv --output ./cleaned/ --verbose |
| analyze | --output, -o, --verbose, -v | Analyze data without cleaning | cleanengine analyze data.csv --output ./analysis/ --verbose |
| validate-data | --verbose, -v | Validate data with rules | cleanengine validate-data data.csv --verbose |
| profile | --output, -o, --verbose, -v | Generate data profile report | cleanengine profile data.csv --output ./profile/ --verbose |
| clean-only | --output, -o, --verbose, -v | Clean without analysis | cleanengine clean-only data.csv --output ./cleaned/ --verbose |
| samples | --output, -o, --count, -n, --verbose, -v | Create sample datasets | cleanengine samples --output ./samples/ --count 5 --verbose |
| test | --verbose, -v, --coverage | Run test suite | cleanengine test --verbose --coverage |
| gui | --port, -p, --host, -h | Launch Streamlit web interface | cleanengine gui --port 8501 --host localhost |
| info | None | Show CleanEngine information | cleanengine info |

Advanced Analysis Commands

| Command | Flags | Description | Example |
|---|---|---|---|
| correlations | --method, -m, --threshold, -t, --output, -o, --verbose, -v | Analyze variable correlations | cleanengine correlations data.csv --method pearson --threshold 0.7 --verbose |
| features | --output, -o, --verbose, -v | Analyze feature importance | cleanengine features data.csv --output ./features/ --verbose |
| clusters | --method, -m, --output, -o, --verbose, -v | Discover data clusters | cleanengine clusters data.csv --method kmeans --output ./clusters/ --verbose |
| anomalies | --method, -m, --contamination, -c, --output, -o, --verbose, -v | Detect anomalies/outliers | cleanengine anomalies data.csv --method isolation_forest --contamination 0.1 --verbose |
| quality | --output, -o, --verbose, -v | Assess data quality | cleanengine quality data.csv --output ./quality/ --verbose |
| statistics | --output, -o, --verbose, -v | Perform statistical analysis | cleanengine statistics data.csv --output ./stats/ --verbose |
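
To make the --threshold option of the correlations command more concrete, here is a minimal pure-Python sketch of Pearson correlation with a cutoff. It is not CleanEngine's implementation: the function names pearson and strong_pairs, the sample columns, and the assumption that the threshold is compared against the absolute value of r are all illustrative.

```python
import math
from itertools import combinations


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length numeric sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    # Note: a constant column has zero variance and would divide by zero here.
    return cov / math.sqrt(var_x * var_y)


def strong_pairs(columns, threshold=0.7):
    """Return (name_a, name_b, r) for every column pair with |r| >= threshold."""
    pairs = []
    for a, b in combinations(columns, 2):
        r = pearson(columns[a], columns[b])
        if abs(r) >= threshold:  # assumed interpretation of the threshold
            pairs.append((a, b, round(r, 3)))
    return pairs


# Hypothetical sample data: weight is exactly 2 * height, noise is unrelated.
columns = {
    "height": [1, 2, 3, 4, 5],
    "weight": [2, 4, 6, 8, 10],
    "noise": [3, 1, 4, 1, 5],
}
print(strong_pairs(columns, threshold=0.7))  # [('height', 'weight', 1.0)]
```

Raising the threshold keeps only the most strongly related pairs; lowering it surfaces weaker relationships at the cost of more noise.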

📁 Supported File Formats

  • CSV: Comma-separated values
  • Excel: .xlsx and .xls files
  • JSON: JavaScript Object Notation
  • Parquet: Columnar storage format
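
When scripting around the CLI, it can help to filter files by format before passing them on. The sketch below is a hypothetical helper, not part of CleanEngine: it assumes formats are recognized by file extension, and the .json and .parquet extensions are assumptions, since only .xlsx and .xls are named explicitly above.

```python
from pathlib import Path

# Extension-to-format map. Only .xlsx/.xls are stated in the list above;
# .csv, .json and .parquet are the conventional extensions and are assumed.
SUPPORTED_FORMATS = {
    ".csv": "CSV",
    ".xlsx": "Excel",
    ".xls": "Excel",
    ".json": "JSON",
    ".parquet": "Parquet",
}


def detect_format(path):
    """Return the format name for a path, or None if the extension is unknown."""
    return SUPPORTED_FORMATS.get(Path(path).suffix.lower())


for name in ["sales.CSV", "report.xlsx", "notes.txt"]:
    print(name, "->", detect_format(name))
```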

📊 Output Structure

After processing, CleanEngine creates a Cleans-<dataset_name>/ folder with:

Cleans-data/
├── cleaned_data.csv          # Your cleaned dataset
├── cleaning_report.json      # Detailed cleaning summary
├── analysis_report.json      # Comprehensive analysis results
├── visualizations/           # Generated charts and plots
└── logs/                     # Processing logs
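
The reports in this folder are plain JSON, so they can be read back with the standard library. The sketch below assumes the folder is named after the input file's stem and is created in the current working directory; the helper names output_dir_for and load_report are hypothetical, and the report schema is not documented here, so the example only lists top-level keys.

```python
import json
from pathlib import Path


def output_dir_for(dataset_path):
    """Return the expected output folder, e.g. data.csv -> Cleans-data."""
    # Assumption: named after the file stem, relative to the working directory.
    return Path(f"Cleans-{Path(dataset_path).stem}")


def load_report(dataset_path, report_name="cleaning_report.json"):
    """Load one of the JSON reports from the dataset's output folder."""
    report_path = output_dir_for(dataset_path) / report_name
    with report_path.open(encoding="utf-8") as fh:
        return json.load(fh)


if __name__ == "__main__":
    folder = output_dir_for("data.csv")
    if (folder / "cleaning_report.json").exists():
        report = load_report("data.csv")
        # Schema is unknown here, so just show what the report contains.
        print(sorted(report))
    else:
        print(f"No report found in {folder}/; run 'cleanengine clean data.csv' first.")
```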

⚙️ Configuration

Custom Configuration File

Create a config.yaml file in your working directory:

cleaning:
  missing_values:
    strategy: "auto"  # auto, mean, median, mode, drop
  outliers:
    method: "iqr"     # iqr, zscore, custom
  encoding:
    categorical: true
    normalize: true

analysis:
  correlation:
    method: "pearson"  # pearson, spearman, kendall
  clustering:
    method: "kmeans"   # kmeans, dbscan, hierarchical

🎨 CLI Features

  • Rich Terminal Output: Beautiful tables, progress bars, and colors
  • Interactive Help: cleanengine --help and cleanengine <command> --help
  • Auto-completion: Tab completion for commands and file paths
  • Progress Tracking: Real-time progress bars for long operations
  • Error Handling: Clear error messages with suggestions

📈 Performance

  • Small Datasets (< 1MB): < 1 second
  • Medium Datasets (1-100MB): 1-30 seconds
  • Large Datasets (100MB-1GB): 30 seconds - 5 minutes
  • Very Large Datasets (> 1GB): Configurable chunking

🔧 Advanced Usage

Batch Processing Multiple Files

# Process all CSV files in current directory
for file in *.csv; do cleanengine clean "$file"; done
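
The same loop can be written in Python when a shell is not available or when failures need to be collected. This is a sketch under assumptions: it only calls the documented cleanengine clean command, it assumes the CLI exits with a non-zero status on failure, and the helper names build_clean_commands and run_batch are hypothetical.

```python
import subprocess
from pathlib import Path


def build_clean_commands(directory=".", pattern="*.csv"):
    """Build one 'cleanengine clean <file>' command per matching file."""
    files = sorted(Path(directory).glob(pattern))
    return [["cleanengine", "clean", str(path)] for path in files]


def run_batch(directory=".", pattern="*.csv"):
    """Run each command in turn and return the files whose run failed."""
    failed = []
    for cmd in build_clean_commands(directory, pattern):
        # check=False: keep going when one file fails and record it instead.
        result = subprocess.run(cmd, check=False)
        if result.returncode != 0:  # assumed: non-zero exit means failure
            failed.append(cmd[-1])
    return failed


# Usage (requires cleanengine to be installed and on PATH):
#   failures = run_batch(".")
#   print("Failed files:", failures)
```

Passing the command as a list rather than a single string avoids shell quoting problems with file names that contain spaces.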

Custom Output Directory

cleanengine clean data.csv --output ./my-clean-data/

Configuration File

cleanengine clean data.csv --config ./my-config.yaml

Verbose Output

cleanengine clean data.csv --verbose

🐍 Python API

For programmatic use:

from cleanengine import DatasetCleaner

# Initialize cleaner
cleaner = DatasetCleaner()

# Clean dataset
cleaned_df = cleaner.clean_dataset('data.csv')

# Get analysis results
analysis_results = cleaner.analyze_dataset('data.csv')

🤝 Contributing

We welcome contributions! Please see our Contributing Guide for details on:

  • Setting up a development environment
  • Code style and standards
  • Testing and quality assurance
  • Pull request process

📄 License

This project is licensed under the MIT License - see the LICENSE file for details.


🙏 Acknowledgments

  • pandas for data manipulation
  • scikit-learn for machine learning algorithms
  • Typer & Rich for beautiful CLI interfaces
  • Streamlit for web interface

Made with ❤️ for data scientists and analysts

GitHub • PyPI • Documentation
