The Ultimate Data Cleaning & Analysis Toolkit
🧹 CleanEngine
The Ultimate Data Cleaning & Analysis CLI Tool
Transform messy datasets into clean, insight-rich data with intelligent cleaning and advanced ML analysis.
CleanEngine is a powerful command-line toolkit that handles missing values, removes duplicates, detects outliers, and provides comprehensive statistical analysis using machine learning techniques.
Comparison with Other Tools
| Feature | CleanEngine 🧹 | pandas-profiling | Sweetviz | Great Expectations |
|---|---|---|---|---|
| Data Cleaning | ✅ Complete Pipeline | ❌ No | ❌ No | ⚠️ Limited |
| Profiling & Stats | ✅ Advanced Analytics | ✅ Yes | ✅ Yes | ⚠️ Minimal |
| Correlation Analysis | ✅ Multi-Method | ✅ Yes | ✅ Yes | ❌ No |
| Feature Importance | ✅ ML-Powered | ❌ No | ❌ No | ❌ No |
| Clustering & Patterns | ✅ 3 Algorithms | ❌ No | ❌ No | ❌ No |
| Anomaly Detection | ✅ 2 Methods | ❌ No | ❌ No | ❌ No |
| Rule Engine | ✅ YAML-Driven | ❌ No | ❌ No | ✅ Yes |
| Interfaces | ✅ CLI + GUI + Watcher | CLI/Notebook | Notebook | CLI/Notebook |
| Automation | ✅ Folder Watcher | ❌ No | ❌ No | ✅ Yes |
Installation
Using pip (Recommended)
pip install cleanengine
From source
git clone https://github.com/I-invincib1e/CleanEngine.git
cd CleanEngine
pip install -e .
Verify Installation
cleanengine --help
Quick Start
Clean a CSV file
cleanengine clean data.csv
Analyze data without cleaning
cleanengine analyze data.xlsx
Generate sample data to test
cleanengine samples
Launch web interface
cleanengine gui
CLI Commands
Core Commands
| Command | Flags | Description | Example |
|---|---|---|---|
| clean | --output, -o, --verbose, -v, --force | Clean a dataset with full pipeline | cleanengine clean data.csv --output ./cleaned/ --verbose |
| analyze | --output, -o, --verbose, -v | Analyze data without cleaning | cleanengine analyze data.csv --output ./analysis/ --verbose |
| validate-data | --verbose, -v | Validate data with rules | cleanengine validate-data data.csv --verbose |
| profile | --output, -o, --verbose, -v | Generate data profile report | cleanengine profile data.csv --output ./profile/ --verbose |
| clean-only | --output, -o, --verbose, -v | Clean without analysis | cleanengine clean-only data.csv --output ./cleaned/ --verbose |
| samples | --output, -o, --count, -n, --verbose, -v | Create sample datasets | cleanengine samples --output ./samples/ --count 5 --verbose |
| test | --verbose, -v, --coverage | Run test suite | cleanengine test --verbose --coverage |
| gui | --port, -p, --host, -h | Launch Streamlit web interface | cleanengine gui --port 8501 --host localhost |
| info | None | Show CleanEngine information | cleanengine info |
Advanced Analysis Commands
| Command | Flags | Description | Example |
|---|---|---|---|
| correlations | --method, -m, --threshold, -t, --output, -o, --verbose, -v | Analyze variable correlations | cleanengine correlations data.csv --method pearson --threshold 0.7 --verbose |
| features | --output, -o, --verbose, -v | Analyze feature importance | cleanengine features data.csv --output ./features/ --verbose |
| clusters | --method, -m, --output, -o, --verbose, -v | Discover data clusters | cleanengine clusters data.csv --method kmeans --output ./clusters/ --verbose |
| anomalies | --method, -m, --contamination, -c, --output, -o, --verbose, -v | Detect anomalies/outliers | cleanengine anomalies data.csv --method isolation_forest --contamination 0.1 --verbose |
| quality | --output, -o, --verbose, -v | Assess data quality | cleanengine quality data.csv --output ./quality/ --verbose |
| statistics | --output, -o, --verbose, -v | Perform statistical analysis | cleanengine statistics data.csv --output ./stats/ --verbose |
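To make the idea behind a correlation threshold concrete, here is a minimal sketch in plain Python. It is not CleanEngine's implementation, and the functions `pearson` and `strong_pairs` are hypothetical names introduced for illustration. It assumes that a threshold of 0.7 means "report column pairs whose absolute Pearson coefficient is at least 0.7"; the actual behavior of `--threshold` may differ.

```python
from itertools import combinations
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation of two equal-length numeric sequences.

    Returns NaN when either sequence has zero variance (coefficient undefined).
    """
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return float("nan")
    return cov / sqrt(var_x * var_y)


def strong_pairs(columns, threshold=0.7):
    """Return (name_a, name_b, r) for every column pair with |r| >= threshold.

    `columns` maps a column name to a list of numbers. Hypothetical helper.
    """
    pairs = []
    for (name_a, xs), (name_b, ys) in combinations(columns.items(), 2):
        r = pearson(xs, ys)
        if abs(r) >= threshold:  # NaN compares False, so undefined pairs are skipped
            pairs.append((name_a, name_b, r))
    return pairs
```

Calling `strong_pairs({"a": [1, 2, 3, 4], "b": [2, 4, 6, 8], "c": [1, -1, 1, -1]})` keeps only the pair `a`/`b`, since the other two pairs are weakly correlated.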
Supported File Formats
- CSV: Comma-separated values
- Excel: .xlsx and .xls files
- JSON: JavaScript Object Notation
- Parquet: Columnar storage format
Output Structure
After processing, CleanEngine creates a Cleans-<dataset_name>/ folder with:
Cleans-data/
├── cleaned_data.csv       # Your cleaned dataset
├── cleaning_report.json   # Detailed cleaning summary
├── analysis_report.json   # Comprehensive analysis results
├── visualizations/        # Generated charts and plots
└── logs/                  # Processing logs
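As a quick sanity check after a run, the sketch below verifies that the entries listed above are present. The helper `check_output_folder` is hypothetical and not part of CleanEngine; only the file and folder names come from the listing above.

```python
from pathlib import Path

# File and folder names taken from the output listing above.
EXPECTED_FILES = ["cleaned_data.csv", "cleaning_report.json", "analysis_report.json"]
EXPECTED_DIRS = ["visualizations", "logs"]


def check_output_folder(folder):
    """Return the expected entries that are missing from `folder`.

    Hypothetical helper, not part of the CleanEngine API.
    Missing directories are reported with a trailing slash.
    """
    root = Path(folder)
    missing = [name for name in EXPECTED_FILES if not (root / name).is_file()]
    missing += [name + "/" for name in EXPECTED_DIRS if not (root / name).is_dir()]
    return missing


if __name__ == "__main__":
    missing = check_output_folder("Cleans-data")
    print("All outputs present" if not missing else f"Missing: {missing}")
```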
Configuration
Custom Configuration File
Create a config.yaml file in your working directory:
cleaning:
  missing_values:
    strategy: "auto"    # auto, mean, median, mode, drop
  outliers:
    method: "iqr"       # iqr, zscore, custom
  encoding:
    categorical: true
    normalize: true
analysis:
  correlation:
    method: "pearson"   # pearson, spearman, kendall
  clustering:
    method: "kmeans"    # kmeans, dbscan, hierarchical
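A typo in one of these option values is easy to miss, so it can help to check a config before a long run. The sketch below validates an already-parsed config (a plain dict, for example the result of loading config.yaml with a YAML library) against the values listed in the comments above. `validate_config` is a hypothetical helper, not part of CleanEngine, and the key nesting (`cleaning` → `missing_values` → `strategy`, and so on) is an assumption about the file layout.

```python
# Allowed values, copied from the comments in the config.yaml example.
# The nesting of the keys is an assumption about the file layout.
ALLOWED = {
    ("cleaning", "missing_values", "strategy"): {"auto", "mean", "median", "mode", "drop"},
    ("cleaning", "outliers", "method"): {"iqr", "zscore", "custom"},
    ("analysis", "correlation", "method"): {"pearson", "spearman", "kendall"},
    ("analysis", "clustering", "method"): {"kmeans", "dbscan", "hierarchical"},
}


def validate_config(config):
    """Return a list of problems found in a parsed config dict.

    Hypothetical helper. Options that are absent are skipped, not reported.
    """
    problems = []
    for path, allowed in ALLOWED.items():
        node = config
        for key in path:
            if not isinstance(node, dict) or key not in node:
                node = None  # option not set; nothing to validate
                break
            node = node[key]
        if node is None:
            continue
        if not isinstance(node, str) or node not in allowed:
            problems.append(f"{'.'.join(path)}: {node!r} not in {sorted(allowed)}")
    return problems


if __name__ == "__main__":
    example = {
        "cleaning": {"outliers": {"method": "iqr"}},
        "analysis": {"clustering": {"method": "gmm"}},  # not a listed value
    }
    for problem in validate_config(example):
        print(problem)
```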
CLI Features
- Rich Terminal Output: Beautiful tables, progress bars, and colors
- Interactive Help: cleanengine --help and cleanengine <command> --help
- Auto-completion: Tab completion for commands and file paths
- Progress Tracking: Real-time progress bars for long operations
- Error Handling: Clear error messages with suggestions
Performance
- Small Datasets (< 1MB): < 1 second
- Medium Datasets (1-100MB): 1-30 seconds
- Large Datasets (100MB-1GB): 30 seconds - 5 minutes
- Very Large Datasets (> 1GB): Configurable chunking
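For readers unfamiliar with chunking, the sketch below shows the general technique using only the Python standard library: read a CSV in fixed-size batches of rows so memory use stays bounded regardless of file size. It is an illustration of the idea, not CleanEngine's chunking code, and the names `iter_csv_chunks` and `count_missing` are hypothetical.

```python
import csv
from itertools import islice


def iter_csv_chunks(path, chunk_size=10_000):
    """Yield lists of up to `chunk_size` rows (as dicts) from a CSV file."""
    with open(path, newline="", encoding="utf-8") as handle:
        reader = csv.DictReader(handle)
        while True:
            chunk = list(islice(reader, chunk_size))
            if not chunk:
                break
            yield chunk


def count_missing(path, chunk_size=10_000):
    """Count empty cells per column, holding only one chunk in memory at a time."""
    missing = {}
    for chunk in iter_csv_chunks(path, chunk_size):
        for row in chunk:
            for column, value in row.items():
                if column is None:
                    continue  # extra fields beyond the header row
                if value is None or value.strip() == "":
                    missing[column] = missing.get(column, 0) + 1
    return missing
```

The per-column counts are accumulated across chunks, which is the usual pattern: any statistic that can be updated incrementally (counts, sums, minima) works this way, while statistics that need the full column at once (such as an exact median) need a different approach.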
Advanced Usage
Batch Processing Multiple Files
# Process all CSV files in current directory
for file in *.csv; do cleanengine clean "$file"; done
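The same loop can be written in Python, which is convenient on systems without a POSIX shell or when you want to collect failures. This is a sketch: it only uses the `cleanengine clean <file>` invocation shown above, the helper names are hypothetical, and it assumes the CLI exits with a non-zero status when a file cannot be processed.

```python
import subprocess
from pathlib import Path


def build_clean_commands(directory="."):
    """Return one `cleanengine clean <file>` argument list per CSV file, sorted by name."""
    return [
        ["cleanengine", "clean", str(path)]
        for path in sorted(Path(directory).glob("*.csv"))
    ]


def clean_all(directory="."):
    """Run each command in turn and return the files whose run failed.

    Assumes a non-zero exit status signals failure (not confirmed by the docs).
    """
    failed = []
    for command in build_clean_commands(directory):
        result = subprocess.run(command)
        if result.returncode != 0:
            failed.append(command[-1])
    return failed
```

Building the argument lists separately from running them makes it easy to print the commands first as a dry run.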
Custom Output Directory
cleanengine clean data.csv --output ./my-clean-data/
Configuration File
cleanengine clean data.csv --config ./my-config.yaml
Verbose Output
cleanengine clean data.csv --verbose
Python API
For programmatic use:
from cleanengine import DatasetCleaner
# Initialize cleaner
cleaner = DatasetCleaner()
# Clean dataset
cleaned_df = cleaner.clean_dataset('data.csv')
# Get analysis results
analysis_results = cleaner.analyze_dataset('data.csv')
Contributing
We welcome contributions! Please see our Contributing Guide for details on:
- Setting up a development environment
- Code style and standards
- Testing and quality assurance
- Pull request process
License
This project is licensed under the MIT License - see the LICENSE file for details.
Acknowledgments
- pandas for data manipulation
- scikit-learn for machine learning algorithms
- Typer & Rich for beautiful CLI interfaces
- Streamlit for web interface
Made with ❤️ for data scientists and analysts
GitHub • PyPI • Documentation