
A powerful Python library for data cleaning and exploratory data analysis

Project description

Datacmp

Supports Python 3.10+ · MIT License

Datacmp is a powerful, lightweight Python library designed to simplify and accelerate data cleaning and exploratory data analysis (EDA) workflows. Built for data scientists and analysts, it provides intelligent preprocessing, structured insights, and beautiful visualizations—all with just a few lines of code.


Features

Smart Data Cleaning

  • Automatic column name standardization (lowercase, underscores, no special chars)
  • Intelligent missing value handling with configurable strategies (mean, median, mode)
  • Outlier detection and handling using IQR method (cap or remove)
  • Duplicate row removal with detailed logging
  • Configurable threshold-based column dropping
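
The column-name standardization step can be sketched as follows. This is a minimal illustration of the technique, not datacmp's internal implementation; the exact regex rules are assumptions:

```python
import re

import pandas as pd


def standardize_columns(df: pd.DataFrame) -> pd.DataFrame:
    """Lowercase column names, replace whitespace with underscores,
    and strip special characters (illustrative sketch)."""
    def clean(name: str) -> str:
        name = name.strip().lower()
        name = re.sub(r"\s+", "_", name)        # whitespace -> underscores
        name = re.sub(r"[^a-z0-9_]", "", name)  # drop special characters
        return name
    return df.rename(columns=clean)


df = pd.DataFrame({"First Name": ["a"], "Age (years)": [30]})
print(standardize_columns(df).columns.tolist())  # ['first_name', 'age_years']
```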

Comprehensive Profiling

  • Dataset overview with row/column counts and memory usage
  • Detailed column analysis including dtypes, missing values, and unique counts
  • Extended statistics (mean, median, std, skewness, kurtosis)
  • Correlation analysis (Pearson, Spearman, Kendall)
  • Column type detection (numeric, categorical, datetime)
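
The extended statistics and correlation methods above map directly onto pandas primitives. A minimal sketch of what such a profile computes (the column names and data are illustrative, not part of datacmp's API):

```python
import pandas as pd

df = pd.DataFrame({"x": [1, 2, 3, 4, 100], "y": [2, 4, 6, 8, 10]})

# Extended per-column statistics
profile = {
    "mean": df["x"].mean(),
    "median": df["x"].median(),
    "std": df["x"].std(),
    "skewness": df["x"].skew(),      # asymmetry of the distribution
    "kurtosis": df["x"].kurtosis(),  # tail heaviness
}

# Pairwise correlations with each supported method
for method in ("pearson", "spearman", "kendall"):
    corr = df.corr(method=method)
    print(method, round(corr.loc["x", "y"], 3))
```

Note that Spearman and Kendall are rank-based, so they are far less sensitive to the outlier `100` than Pearson is.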

Beautiful Visualizations

  • Missing value heatmaps
  • Correlation matrices
  • Distribution plots for numeric features
  • Export-ready plots in high resolution

Flexible Reporting

  • HTML reports with interactive styling and embedded visualizations
  • Text reports for quick inspection
  • CSV export of cleaned datasets
  • Complete audit trail of all cleaning operations

YAML-Based Configuration

  • Fully decoupled configuration for reproducibility
  • Pipeline versioning and easy sharing
  • Template generation with datacmp init

Command-Line Interface

  • Run pipelines directly from terminal
  • Generate config files with defaults
  • Progress tracking and verbose logging

Comparison: v2.0 → v3.0

| Feature | v2.0 | v3.0 |
|---|---|---|
| API Style | Functional only | OOP + Functional |
| Method Chaining | ❌ | ✅ |
| Type Hints | Partial | Complete |
| Logging | print() | logging module |
| Visualizations | ❌ | ✅ (3 types) |
| HTML Reports | ❌ | ✅ (styled) |
| Correlations | ❌ | ✅ |
| CLI Subcommands | ❌ | ✅ |
| Test Coverage | None | Comprehensive |
| Packaging | setup.py | pyproject.toml |

Installation

From PyPI (Recommended)

pip install datacmp

From Source

git clone https://github.com/MoustafaMohamed01/datacmp.git
cd datacmp
pip install -e .

With Optional Dependencies

# For full features
pip install datacmp[full]

# For development
pip install datacmp[dev]

Quick Start

Python API

from datacmp import DataCmp

# Load and process your data
cmp = DataCmp("data.csv")

# Clean, profile, and export in one chain
cmp.clean().profile().export("report.html")

# Or use individual methods
cmp = DataCmp("data.csv")
cmp.clean(outliers=True, duplicates=True)
cmp.profile(detailed=True)
cmp.visualize(output_dir="./plots")
cmp.export("cleaned_data.csv")

Command-Line Interface

# Run complete pipeline
datacmp run data.csv --config config.yaml --export cleaned.csv --report report.html

# Create default config file
datacmp init my_config.yaml

# Show version
datacmp version

Usage Examples

Example 1: Basic Cleaning

from datacmp import DataCmp

# Initialize with auto-clean
cmp = DataCmp("messy_data.csv", auto_clean=True)

# Export cleaned data
cmp.export("clean_data.csv")

# View cleaning log
print(cmp.get_cleaning_log())

Example 2: Custom Configuration

from datacmp import DataCmp

# Define custom config
config = {
    "cleaning": {
        "threshold_drop": 0.3,  # Drop columns with >30% missing
        "fill_strategy": {
            "numeric": "mean",
            "categorical": "mode"
        },
        "outlier_handling": {
            "enabled": True,
            "action": "remove"  # Remove outliers instead of capping
        }
    }
}

# Use custom config
cmp = DataCmp("data.csv", config=config)
cmp.clean().profile()

Example 3: Method Chaining

from datacmp import DataCmp

result = (
    DataCmp("data.csv")
    .clean(columns=True, missing=True, outliers=True)
    .profile(detailed=True)
    .visualize(output_dir="./plots")
    .export("report.html")
)

Example 4: Programmatic Pipeline

from datacmp import run_pipeline

# Run entire pipeline with one function
df_cleaned = run_pipeline(
    data="data.csv",
    config_path="config.yaml",
    export_csv_path="cleaned.csv",
    export_report_path="report.html"
)

Configuration

Example config.yaml

library_name: datacmp
version: 3.0.0
author: Moustafa Mohamed

cleaning:
  threshold_drop: 0.45
  fill_strategy:
    numeric: median
    categorical: mode
  outlier_handling:
    enabled: true
    method: iqr
    iqr_multiplier: 1.5
    action: cap

drop_duplicates: true

profiling:
  include_more_stats: true
  compute_correlations: true

Configuration Options

| Option | Description | Default |
|---|---|---|
| `threshold_drop` | Drop columns with a missing ratio above this | `0.45` |
| `fill_strategy.numeric` | Strategy for numeric columns (`mean`, `median`, `mode`) | `median` |
| `fill_strategy.categorical` | Strategy for categorical columns (`mode`) | `mode` |
| `outlier_handling.enabled` | Enable outlier detection | `true` |
| `outlier_handling.method` | Detection method (`iqr`) | `iqr` |
| `outlier_handling.action` | Action to take (`cap`, `remove`) | `cap` |
| `drop_duplicates` | Remove duplicate rows | `true` |
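
The outlier options follow the standard IQR rule. A hedged sketch of how `cap` and `remove` typically behave under these settings (this mirrors the documented semantics, not datacmp's exact code):

```python
import pandas as pd


def handle_outliers(s: pd.Series, multiplier: float = 1.5, action: str = "cap") -> pd.Series:
    """Treat values outside [Q1 - k*IQR, Q3 + k*IQR] as outliers, then cap or remove."""
    q1, q3 = s.quantile(0.25), s.quantile(0.75)
    iqr = q3 - q1
    lower, upper = q1 - multiplier * iqr, q3 + multiplier * iqr
    if action == "cap":
        return s.clip(lower=lower, upper=upper)   # clamp into the fences
    if action == "remove":
        return s[(s >= lower) & (s <= upper)]     # drop outlying rows
    raise ValueError(f"unknown action: {action}")


s = pd.Series([1, 2, 3, 4, 100])
print(handle_outliers(s, action="cap").max())     # 100 capped to the upper fence
print(len(handle_outliers(s, action="remove")))   # 100 dropped entirely
```

`cap` preserves row count (useful when rows carry other valid data), while `remove` shrinks the dataset; which to choose is set via `outlier_handling.action`.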

Visualizations

Datacmp automatically generates:

  • Missing Value Heatmaps - Visualize patterns in missing data
  • Correlation Heatmaps - Identify relationships between features
  • Distribution Plots - Understand feature distributions

For example:

cmp = DataCmp("data.csv")
cmp.clean().profile()
cmp.visualize(output_dir="./plots")

Reports

HTML Reports

Beautiful, interactive HTML reports with:

  • Dataset overview and statistics
  • Cleaning operation log
  • Embedded visualizations
  • Responsive design

cmp.export("report.html")

Text Reports

Lightweight text reports for quick inspection:

cmp.export("report.txt")

API Reference

DataCmp Class

DataCmp(data, config=None, auto_clean=False)

Methods:

  • clean(columns=True, missing=True, outliers=True, duplicates=True) - Clean the dataset
  • profile(detailed=True) - Generate profiling information
  • visualize(output_dir=None) - Create visualizations
  • export(output, format=None, include_plots=True) - Export results
  • reset() - Reset to original DataFrame
  • get_summary() - Get dataset summary string
  • get_cleaning_log() - Get list of cleaning operations
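
The audit trail behind `get_cleaning_log()` can be modelled as a list of timestamped operation records. The class below is a generic sketch of that pattern, not datacmp's internals:

```python
from datetime import datetime, timezone


class CleaningLog:
    """Minimal audit trail: each cleaning operation appends a timestamped record."""

    def __init__(self):
        self._entries = []

    def record(self, operation: str, detail: str) -> None:
        self._entries.append({
            "time": datetime.now(timezone.utc).isoformat(),
            "operation": operation,
            "detail": detail,
        })

    def entries(self) -> list:
        return list(self._entries)  # return a copy so callers cannot mutate the log


log = CleaningLog()
log.record("drop_duplicates", "removed 12 duplicate rows")
log.record("fill_missing", "filled 'age' with median")
print([e["operation"] for e in log.entries()])
```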

Contributing

Contributions are welcome! Please read CONTRIBUTING.md for guidelines.

Development Setup

git clone https://github.com/MoustafaMohamed01/datacmp.git
cd datacmp
pip install -e ".[dev]"

Changelog

See CHANGELOG.md for version history.


License

This project is licensed under the MIT License - see the LICENSE file for details.


Author

Moustafa Mohamed


Show Your Support

Give a ⭐️ if this project helped you!


Download files

Download the file for your platform.

Source Distribution

datacmp-3.0.0.tar.gz (22.1 kB)


Built Distribution


datacmp-3.0.0-py3-none-any.whl (23.7 kB)


File details

Details for the file datacmp-3.0.0.tar.gz.

File metadata

  • Download URL: datacmp-3.0.0.tar.gz
  • Size: 22.1 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.2.0 CPython/3.12.7

File hashes

Hashes for datacmp-3.0.0.tar.gz:

| Algorithm | Hash digest |
|---|---|
| SHA256 | 5904f7e7d72100b5376b4d1308873f908f11af8f13eae9d0d9b06b73ac3a3a14 |
| MD5 | 4ac12497cf229f191b1d8979aa161fa7 |
| BLAKE2b-256 | 8b21d30bdb491db5f03e277edf96a7d25518cbde4daa47f36bb75c16612db8e9 |


File details

Details for the file datacmp-3.0.0-py3-none-any.whl.

File metadata

  • Download URL: datacmp-3.0.0-py3-none-any.whl
  • Size: 23.7 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.2.0 CPython/3.12.7

File hashes

Hashes for datacmp-3.0.0-py3-none-any.whl:

| Algorithm | Hash digest |
|---|---|
| SHA256 | dc5ebe3838bcf2768135f955b9f9e74e883e44b64aee679a451bca8205377b56 |
| MD5 | 46135d45831d14260cecfa89084b9423 |
| BLAKE2b-256 | 1d5a53ffd2e6926f5a57914205533835aee7e2a945c347047a2b4b0ff42512ff |

