A powerful Python library for data cleaning and exploratory data analysis
Datacmp
Datacmp is a lightweight Python library for data cleaning and exploratory data analysis (EDA). Built for data scientists and analysts, it provides configurable preprocessing, structured profiling, and visualizations in a few lines of code.
Features
Smart Data Cleaning
- Automatic column name standardization (lowercase, underscores, no special characters)
- Intelligent missing value handling with configurable strategies (mean, median, mode)
- Outlier detection and handling using IQR method (cap or remove)
- Duplicate row removal with detailed logging
- Configurable threshold-based column dropping
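The column name standardization listed above (lowercase, underscores, no special characters) can be sketched with the standard library alone. This is a minimal illustration of the idea, not Datacmp's implementation; the function name `standardize_column_name` and the exact set of separator characters are assumptions made for the example.

```python
import re


def standardize_column_name(name: str) -> str:
    """Lowercase a name, turn separators into underscores, drop special characters."""
    name = name.strip().lower()
    # Treat runs of whitespace, hyphens, dots, and slashes as one underscore.
    name = re.sub(r"[\s\-./]+", "_", name)
    # Remove anything that is not a lowercase ASCII letter, digit, or underscore.
    name = re.sub(r"[^a-z0-9_]", "", name)
    # Collapse repeated underscores and trim them from both ends.
    return re.sub(r"_+", "_", name).strip("_")


columns = ["First Name", " Total-Sales ($)", "E-mail!!", "zip.code"]
cleaned = [standardize_column_name(c) for c in columns]
print(cleaned)  # ['first_name', 'total_sales', 'e_mail', 'zip_code']
```

Note that this sketch drops non-ASCII letters entirely; a real pipeline may prefer to transliterate them first.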
Comprehensive Profiling
- Dataset overview with row/column counts and memory usage
- Detailed column analysis including dtypes, missing values, and unique counts
- Extended statistics (mean, median, std, skewness, kurtosis)
- Correlation analysis (Pearson, Spearman, Kendall)
- Column type detection (numeric, categorical, datetime)
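To make the extended statistics above concrete, here is a small standard-library sketch that computes them for one numeric column. It is an illustration under stated assumptions, not Datacmp's code: the helper name `extended_stats` is hypothetical, and skewness and kurtosis use the population (biased) moment formulas, which can differ slightly from the bias-corrected estimators some libraries report.

```python
import statistics


def extended_stats(values: list[float]) -> dict[str, float]:
    """Return mean, median, sample std, skewness, and excess kurtosis."""
    n = len(values)
    mean = statistics.fmean(values)
    # Central moments of order 2, 3, and 4 (population form, dividing by n).
    m2 = sum((x - mean) ** 2 for x in values) / n
    m3 = sum((x - mean) ** 3 for x in values) / n
    m4 = sum((x - mean) ** 4 for x in values) / n
    return {
        "mean": mean,
        "median": statistics.median(values),
        "std": statistics.stdev(values),  # sample standard deviation (n - 1)
        "skewness": m3 / m2 ** 1.5,       # > 0 means a longer right tail
        "kurtosis": m4 / m2 ** 2 - 3.0,   # excess kurtosis; 0 for a normal distribution
    }


stats = extended_stats([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])
print(stats["skewness"], stats["kurtosis"])  # 0.65625 -0.21875
```

A constant column has zero variance, so this sketch would raise `ZeroDivisionError` for it; such columns need a guard in practice.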
Beautiful Visualizations
- Missing value heatmaps
- Correlation matrices
- Distribution plots for numeric features
- Export-ready plots in high resolution
Flexible Reporting
- HTML reports with interactive styling and embedded visualizations
- Text reports for quick inspection
- CSV export of cleaned datasets
- Complete audit trail of all cleaning operations
YAML-Based Configuration
- Fully decoupled configuration for reproducibility
- Pipeline versioning and easy sharing
- Template generation with datacmp init
Command-Line Interface
- Run pipelines directly from terminal
- Generate config files with defaults
- Progress tracking and verbose logging
Comparison: v2.0 → v3.0
| Feature | v2.0 | v3.0 |
|---|---|---|
| API Style | Functional only | OOP + Functional |
| Method Chaining | ❌ | ✅ |
| Type Hints | Partial | Complete |
| Logging | print() | logging module |
| Visualizations | ❌ | ✅ (3 types) |
| HTML Reports | ❌ | ✅ (styled) |
| Correlations | ❌ | ✅ |
| CLI Subcommands | ❌ | ✅ |
| Test Coverage | None | Comprehensive |
| Packaging | setup.py | pyproject.toml |
Installation
From PyPI (Recommended)
pip install datacmp
From Source
git clone https://github.com/MoustafaMohamed01/datacmp.git
cd datacmp
pip install -e .
With Optional Dependencies
# For full features (quotes keep shells such as zsh from expanding the brackets)
pip install "datacmp[full]"
# For development
pip install "datacmp[dev]"
Quick Start
Python API
from datacmp import DataCmp
# Load and process your data
cmp = DataCmp("data.csv")
# Clean, profile, and export in one chain
cmp.clean().profile().export("report.html")
# Or use individual methods
cmp = DataCmp("data.csv")
cmp.clean(outliers=True, duplicates=True)
cmp.profile(detailed=True)
cmp.visualize(output_dir="./plots")
cmp.export("cleaned_data.csv")
Command-Line Interface
# Run complete pipeline
datacmp run data.csv --config config.yaml --export cleaned.csv --report report.html
# Create default config file
datacmp init my_config.yaml
# Show version
datacmp version
Usage Examples
Example 1: Basic Cleaning
from datacmp import DataCmp
# Initialize with auto-clean
cmp = DataCmp("messy_data.csv", auto_clean=True)
# Export cleaned data
cmp.export("clean_data.csv")
# View cleaning log
print(cmp.get_cleaning_log())
Example 2: Custom Configuration
from datacmp import DataCmp
# Define custom config
config = {
    "cleaning": {
        "threshold_drop": 0.3,  # Drop columns with >30% missing
        "fill_strategy": {
            "numeric": "mean",
            "categorical": "mode"
        },
        "outlier_handling": {
            "enabled": True,
            "action": "remove"  # Remove outliers instead of capping
        }
    }
}
# Use custom config
cmp = DataCmp("data.csv", config=config)
cmp.clean().profile()
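The custom config above sets only some keys (for example, it omits the outlier detection method). One common way to handle a partial config is to merge it recursively over a set of defaults. The sketch below shows that pattern; it is an assumption about how such merging could work, not a description of Datacmp's internals, and `DEFAULTS` and `merge_config` are hypothetical names whose values are copied from the example config.yaml in this README.

```python
from copy import deepcopy

# Hypothetical defaults, taken from the example config.yaml in this README.
DEFAULTS = {
    "cleaning": {
        "threshold_drop": 0.45,
        "fill_strategy": {"numeric": "median", "categorical": "mode"},
        "outlier_handling": {"enabled": True, "method": "iqr", "action": "cap"},
    }
}


def merge_config(defaults: dict, overrides: dict) -> dict:
    """Return a new dict where overrides replace defaults, recursing into nested dicts."""
    merged = deepcopy(defaults)
    for key, value in overrides.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = merge_config(merged[key], value)
        else:
            merged[key] = deepcopy(value)
    return merged


user_config = {"cleaning": {"threshold_drop": 0.3, "outlier_handling": {"action": "remove"}}}
effective = merge_config(DEFAULTS, user_config)
print(effective["cleaning"]["outlier_handling"])
# {'enabled': True, 'method': 'iqr', 'action': 'remove'}
```

Because the merge works on copies, neither the defaults nor the user's dict is modified.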
Example 3: Method Chaining
from datacmp import DataCmp
result = (
    DataCmp("data.csv")
    .clean(columns=True, missing=True, outliers=True)
    .profile(detailed=True)
    .visualize(output_dir="./plots")
    .export("report.html")
)
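Method chaining works when each method returns the object it was called on. The toy class below demonstrates the pattern on a list of dict rows; `ChainablePipeline` and its methods are hypothetical and are not part of Datacmp.

```python
class ChainablePipeline:
    """Hypothetical toy class; it only demonstrates the chaining pattern."""

    def __init__(self, rows: list[dict]) -> None:
        self.rows = list(rows)
        self.log: list[str] = []

    def drop_duplicates(self) -> "ChainablePipeline":
        seen = set()
        unique = []
        for row in self.rows:
            key = tuple(sorted(row.items()))  # hashable fingerprint of the row
            if key not in seen:
                seen.add(key)
                unique.append(row)
        self.log.append(f"removed {len(self.rows) - len(unique)} duplicate rows")
        self.rows = unique
        return self  # returning self lets the next call be chained

    def drop_missing(self) -> "ChainablePipeline":
        kept = [r for r in self.rows if all(v is not None for v in r.values())]
        self.log.append(f"removed {len(self.rows) - len(kept)} rows with missing values")
        self.rows = kept
        return self


rows = [{"a": 1, "b": 2}, {"a": 1, "b": 2}, {"a": 3, "b": None}]
pipeline = ChainablePipeline(rows).drop_duplicates().drop_missing()
print(pipeline.rows)  # [{'a': 1, 'b': 2}]
print(pipeline.log)
```

Keeping a log on the instance, as here, is one simple way to retain an audit trail across chained steps.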
Example 4: Programmatic Pipeline
from datacmp import run_pipeline
# Run entire pipeline with one function
df_cleaned = run_pipeline(
    data="data.csv",
    config_path="config.yaml",
    export_csv_path="cleaned.csv",
    export_report_path="report.html"
)
Configuration
Example config.yaml
library_name: datacmp
version: 3.0.0
author: Moustafa Mohamed
cleaning:
  threshold_drop: 0.45
  fill_strategy:
    numeric: median
    categorical: mode
  outlier_handling:
    enabled: true
    method: iqr
    iqr_multiplier: 1.5
    action: cap
  drop_duplicates: true
profiling:
  include_more_stats: true
  compute_correlations: true
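The `threshold_drop` and `fill_strategy` settings describe two steps: drop columns whose share of missing values exceeds a threshold, then fill the remaining gaps per column. The sketch below illustrates both on a plain dict of lists using only the standard library. It is not Datacmp's implementation; the helper names and the use of `None` as the missing-value marker are assumptions for the example.

```python
import statistics


def drop_sparse_columns(table: dict[str, list], threshold: float) -> dict[str, list]:
    """Keep only columns whose fraction of None values is at most threshold."""
    kept = {}
    for name, values in table.items():
        missing_ratio = sum(v is None for v in values) / len(values)
        if missing_ratio <= threshold:
            kept[name] = values
    return kept


def fill_missing(values: list, strategy: str) -> list:
    """Replace None with the mean, median, or mode of the observed values."""
    observed = [v for v in values if v is not None]
    strategies = {
        "mean": statistics.fmean,
        "median": statistics.median,
        "mode": statistics.mode,
    }
    fill = strategies[strategy](observed)
    return [fill if v is None else v for v in values]


table = {
    "age": [30, None, 40, 50],                  # 25% missing: kept
    "city": ["Cairo", "Giza", None, "Cairo"],  # 25% missing: kept
    "notes": [None, None, None, "ok"],         # 75% missing: dropped
}
table = drop_sparse_columns(table, threshold=0.45)
table["age"] = fill_missing(table["age"], "median")
table["city"] = fill_missing(table["city"], "mode")
print(table)
# {'age': [30, 40, 40, 50], 'city': ['Cairo', 'Giza', 'Cairo', 'Cairo']}
```

Dropping sparse columns before filling avoids imputing columns that carry almost no observed data.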
Configuration Options
| Option | Description | Default |
|---|---|---|
| threshold_drop | Drop columns with missing ratio above this | 0.45 |
| fill_strategy.numeric | Strategy for numeric columns (mean, median, mode) | median |
| fill_strategy.categorical | Strategy for categorical columns (mode) | mode |
| outlier_handling.enabled | Enable outlier detection | true |
| outlier_handling.method | Detection method (iqr) | iqr |
| outlier_handling.action | Action to take (cap, remove) | cap |
| drop_duplicates | Remove duplicate rows | true |
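The outlier options above combine an IQR-based rule with a choice between capping and removing. With quartiles Q1 and Q3 and IQR = Q3 - Q1, values outside [Q1 - k * IQR, Q3 + k * IQR] count as outliers, where k corresponds to the `iqr_multiplier` value of 1.5 in the example config. The sketch below shows that rule with the standard library; the function names and the choice of the "inclusive" quantile method are assumptions, and Datacmp's own quartile computation may differ.

```python
import statistics


def iqr_bounds(values: list[float], multiplier: float = 1.5) -> tuple[float, float]:
    """Return the (low, high) fences of the IQR rule."""
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    return q1 - multiplier * iqr, q3 + multiplier * iqr


def handle_outliers(values: list[float], action: str = "cap", multiplier: float = 1.5) -> list[float]:
    """Cap outliers to the fences, or remove them, depending on action."""
    low, high = iqr_bounds(values, multiplier)
    if action == "cap":
        return [min(max(v, low), high) for v in values]
    if action == "remove":
        return [v for v in values if low <= v <= high]
    raise ValueError(f"unknown action: {action!r}")


data = [11.0, 12.0, 12.0, 13.0, 12.0, 11.0, 14.0, 13.0, 100.0]
print(iqr_bounds(data))                 # (10.5, 14.5)
print(handle_outliers(data, "cap"))     # 100.0 becomes 14.5
print(handle_outliers(data, "remove"))  # 100.0 is dropped
```

Capping keeps the row count unchanged, which matters when other columns in the same row are still useful; removing shrinks the dataset but avoids introducing artificial values at the fences.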
Visualizations
Datacmp automatically generates:
- Missing Value Heatmaps - Visualize patterns in missing data
- Correlation Heatmaps - Identify relationships between features
- Distribution Plots - Understand feature distributions
cmp = DataCmp("data.csv")
cmp.clean().profile()
cmp.visualize(output_dir="./plots")
Reports
HTML Reports
Beautiful, interactive HTML reports with:
- Dataset overview and statistics
- Cleaning operation log
- Embedded visualizations
- Responsive design
cmp.export("report.html")
Text Reports
Lightweight text reports for quick inspection:
cmp.export("report.txt")
API Reference
DataCmp Class
DataCmp(data, config=None, auto_clean=False)
Methods:
- clean(columns=True, missing=True, outliers=True, duplicates=True) - Clean the dataset
- profile(detailed=True) - Generate profiling information
- visualize(output_dir=None) - Create visualizations
- export(output, format=None, include_plots=True) - Export results
- reset() - Reset to original DataFrame
- get_summary() - Get dataset summary string
- get_cleaning_log() - Get list of cleaning operations
Contributing
Contributions are welcome! Please read CONTRIBUTING.md for guidelines.
Development Setup
git clone https://github.com/MoustafaMohamed01/datacmp.git
cd datacmp
pip install -e ".[dev]"
Changelog
See CHANGELOG.md for version history.
License
This project is licensed under the MIT License - see the LICENSE file for details.
Author
Moustafa Mohamed
- Email: moustafa.mh.mohamed@gmail.com
- LinkedIn: Moustafa Mohamed
- GitHub: MoustafaMohamed01
- Kaggle: moustafamohamed01
Show Your Support
Give a ⭐️ if this project helped you!