
A powerful Python library for data cleaning and exploratory data analysis

Project description

Datacmp

Supports Python 3.10+ · MIT License

Datacmp is a powerful, lightweight Python library designed to simplify and accelerate data cleaning and exploratory data analysis (EDA) workflows. Built for data scientists and analysts, it provides intelligent preprocessing, structured insights, and beautiful visualizations—all with just a few lines of code.


Features

Smart Data Cleaning

  • Automatic column name standardization (lowercase, underscores, no special chars)
  • Intelligent missing value handling with configurable strategies (mean, median, mode)
  • Outlier detection and handling using IQR method (cap or remove)
  • Duplicate row removal with detailed logging
  • Configurable threshold-based column dropping
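
The column-name standardization step can be sketched as follows. This is a minimal illustration of the technique, not datacmp's internal implementation; the exact regex rules are assumptions:

```python
import re

import pandas as pd


def standardize_columns(df: pd.DataFrame) -> pd.DataFrame:
    """Lowercase column names, replace whitespace with underscores,
    and strip special characters (illustrative sketch)."""
    def clean(name: str) -> str:
        name = name.strip().lower()
        name = re.sub(r"\s+", "_", name)        # whitespace -> underscores
        name = re.sub(r"[^a-z0-9_]", "", name)  # drop special characters
        return name
    return df.rename(columns=clean)


df = pd.DataFrame({"First Name": ["a"], "Age (years)": [30]})
print(standardize_columns(df).columns.tolist())  # ['first_name', 'age_years']
```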

Comprehensive Profiling

  • Dataset overview with row/column counts and memory usage
  • Detailed column analysis including dtypes, missing values, and unique counts
  • Extended statistics (mean, median, std, skewness, kurtosis)
  • Correlation analysis (Pearson, Spearman, Kendall)
  • Column type detection (numeric, categorical, datetime)
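
The extended statistics and correlation methods above map directly onto pandas primitives. A minimal sketch of what such a profile computes (the column names and data are illustrative, not part of datacmp's API):

```python
import pandas as pd

df = pd.DataFrame({"x": [1, 2, 3, 4, 100], "y": [2, 4, 6, 8, 10]})

# Extended per-column statistics
profile = {
    "mean": df["x"].mean(),
    "median": df["x"].median(),
    "std": df["x"].std(),
    "skewness": df["x"].skew(),      # asymmetry of the distribution
    "kurtosis": df["x"].kurtosis(),  # tail heaviness
}

# Pairwise correlations with each supported method
for method in ("pearson", "spearman", "kendall"):
    corr = df.corr(method=method)
    print(method, round(corr.loc["x", "y"], 3))
```

Note that Spearman and Kendall are rank-based, so they are far less sensitive to the outlier `100` than Pearson is.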

Beautiful Visualizations

  • Missing value heatmaps
  • Correlation matrices
  • Distribution plots for numeric features
  • Export-ready plots in high resolution

Flexible Reporting

  • HTML reports with interactive styling and embedded visualizations
  • Text reports for quick inspection
  • CSV export of cleaned datasets
  • Complete audit trail of all cleaning operations

YAML-Based Configuration

  • Fully decoupled configuration for reproducibility
  • Pipeline versioning and easy sharing
  • Template generation with datacmp init

Command-Line Interface

  • Run pipelines directly from terminal
  • Generate config files with defaults
  • Progress tracking and verbose logging

Comparison: v2.0 → v3.0

| Feature | v2.0 | v3.0 |
|---|---|---|
| API Style | Functional only | OOP + Functional |
| Method Chaining | ❌ | ✅ |
| Type Hints | Partial | Complete |
| Logging | print() | logging module |
| Visualizations | ❌ | ✅ (3 types) |
| HTML Reports | ❌ | ✅ (styled) |
| Correlations | ❌ | ✅ |
| CLI Subcommands | ❌ | ✅ |
| Test Coverage | None | Comprehensive |
| Packaging | setup.py | pyproject.toml |

Installation

From PyPI (Recommended)

pip install datacmp

From Source

git clone https://github.com/MoustafaMohamed01/datacmp.git
cd datacmp
pip install -e .

With Optional Dependencies

# For full features
pip install datacmp[full]

# For development
pip install datacmp[dev]

Quick Start

Python API

from datacmp import DataCmp

# Load and process your data
cmp = DataCmp("data.csv")

# Clean, profile, and export in one chain
cmp.clean().profile().export("report.html")

# Or use individual methods
cmp = DataCmp("data.csv")
cmp.clean(outliers=True, duplicates=True)
cmp.profile(detailed=True)
cmp.visualize(output_dir="./plots")
cmp.export("cleaned_data.csv")

Command-Line Interface

# Run complete pipeline
datacmp run data.csv --config config.yaml --export cleaned.csv --report report.html

# Create default config file
datacmp init my_config.yaml

# Show version
datacmp version

Usage Examples

Example 1: Basic Cleaning

from datacmp import DataCmp

# Initialize with auto-clean
cmp = DataCmp("messy_data.csv", auto_clean=True)

# Export cleaned data
cmp.export("clean_data.csv")

# View cleaning log
print(cmp.get_cleaning_log())

Example 2: Custom Configuration

from datacmp import DataCmp

# Define custom config
config = {
    "cleaning": {
        "threshold_drop": 0.3,  # Drop columns with >30% missing
        "fill_strategy": {
            "numeric": "mean",
            "categorical": "mode"
        },
        "outlier_handling": {
            "enabled": True,
            "action": "remove"  # Remove outliers instead of capping
        }
    }
}

# Use custom config
cmp = DataCmp("data.csv", config=config)
cmp.clean().profile()

Example 3: Method Chaining

from datacmp import DataCmp

result = (
    DataCmp("data.csv")
    .clean(columns=True, missing=True, outliers=True)
    .profile(detailed=True)
    .visualize(output_dir="./plots")
    .export("report.html")
)

Example 4: Programmatic Pipeline

from datacmp import run_pipeline

# Run entire pipeline with one function
df_cleaned = run_pipeline(
    data="data.csv",
    config_path="config.yaml",
    export_csv_path="cleaned.csv",
    export_report_path="report.html"
)

Configuration

Example config.yaml

library_name: datacmp
version: 3.0.0
author: Moustafa Mohamed

cleaning:
  threshold_drop: 0.45
  fill_strategy:
    numeric: median
    categorical: mode
  outlier_handling:
    enabled: true
    method: iqr
    iqr_multiplier: 1.5
    action: cap

drop_duplicates: true

profiling:
  include_more_stats: true
  compute_correlations: true

Configuration Options

| Option | Description | Default |
|---|---|---|
| `threshold_drop` | Drop columns with a missing ratio above this | `0.45` |
| `fill_strategy.numeric` | Strategy for numeric columns (`mean`, `median`, `mode`) | `median` |
| `fill_strategy.categorical` | Strategy for categorical columns (`mode`) | `mode` |
| `outlier_handling.enabled` | Enable outlier detection | `true` |
| `outlier_handling.method` | Detection method (`iqr`) | `iqr` |
| `outlier_handling.action` | Action to take (`cap`, `remove`) | `cap` |
| `drop_duplicates` | Remove duplicate rows | `true` |
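
The outlier options follow the standard IQR rule. A hedged sketch of how `cap` and `remove` typically behave under these settings (this mirrors the documented semantics, not datacmp's exact code):

```python
import pandas as pd


def handle_outliers(s: pd.Series, multiplier: float = 1.5, action: str = "cap") -> pd.Series:
    """Treat values outside [Q1 - k*IQR, Q3 + k*IQR] as outliers, then cap or remove."""
    q1, q3 = s.quantile(0.25), s.quantile(0.75)
    iqr = q3 - q1
    lower, upper = q1 - multiplier * iqr, q3 + multiplier * iqr
    if action == "cap":
        return s.clip(lower=lower, upper=upper)   # clamp into the fences
    if action == "remove":
        return s[(s >= lower) & (s <= upper)]     # drop outlying rows
    raise ValueError(f"unknown action: {action}")


s = pd.Series([1, 2, 3, 4, 100])
print(handle_outliers(s, action="cap").max())     # 100 capped to the upper fence
print(len(handle_outliers(s, action="remove")))   # 100 dropped entirely
```

`cap` preserves row count (useful when rows carry other valid data), while `remove` shrinks the dataset; which to choose is set via `outlier_handling.action`.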

Visualizations

Datacmp automatically generates:

  • Missing Value Heatmaps - Visualize patterns in missing data
  • Correlation Heatmaps - Identify relationships between features
  • Distribution Plots - Understand feature distributions

For example:

cmp = DataCmp("data.csv")
cmp.clean().profile()
cmp.visualize(output_dir="./plots")

Reports

HTML Reports

Beautiful, interactive HTML reports with:

  • Dataset overview and statistics
  • Cleaning operation log
  • Embedded visualizations
  • Responsive design

cmp.export("report.html")

Text Reports

Lightweight text reports for quick inspection:

cmp.export("report.txt")

API Reference

DataCmp Class

DataCmp(data, config=None, auto_clean=False)

Methods:

  • clean(columns=True, missing=True, outliers=True, duplicates=True) - Clean the dataset
  • profile(detailed=True) - Generate profiling information
  • visualize(output_dir=None) - Create visualizations
  • export(output, format=None, include_plots=True) - Export results
  • reset() - Reset to original DataFrame
  • get_summary() - Get dataset summary string
  • get_cleaning_log() - Get list of cleaning operations
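
The audit trail behind `get_cleaning_log()` can be modelled as a list of timestamped operation records. The class below is a generic sketch of that pattern, not datacmp's internals:

```python
from datetime import datetime, timezone


class CleaningLog:
    """Minimal audit trail: each cleaning operation appends a timestamped record."""

    def __init__(self):
        self._entries = []

    def record(self, operation: str, detail: str) -> None:
        self._entries.append({
            "time": datetime.now(timezone.utc).isoformat(),
            "operation": operation,
            "detail": detail,
        })

    def entries(self) -> list:
        return list(self._entries)  # return a copy so callers cannot mutate the log


log = CleaningLog()
log.record("drop_duplicates", "removed 12 duplicate rows")
log.record("fill_missing", "filled 'age' with median")
print([e["operation"] for e in log.entries()])
```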

Contributing

Contributions are welcome! Please read CONTRIBUTING.md for guidelines.

Development Setup

git clone https://github.com/MoustafaMohamed01/datacmp.git
cd datacmp
pip install -e ".[dev]"

Changelog

See CHANGELOG.md for version history.


License

This project is licensed under the MIT License - see the LICENSE file for details.


Author

Moustafa Mohamed


Show Your Support

Give a ⭐️ if this project helped you!


Download files

Download the file for your platform.

Source Distribution

datacmp-3.0.0.tar.gz (22.1 kB)


Built Distribution


datacmp-3.0.0-py3-none-any.whl (23.7 kB)


File details

Details for the file datacmp-3.0.0.tar.gz.

File metadata

  • Download URL: datacmp-3.0.0.tar.gz
  • Size: 22.1 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.2.0 CPython/3.12.7

File hashes

Hashes for datacmp-3.0.0.tar.gz:

| Algorithm | Hash digest |
|---|---|
| SHA256 | 5904f7e7d72100b5376b4d1308873f908f11af8f13eae9d0d9b06b73ac3a3a14 |
| MD5 | 4ac12497cf229f191b1d8979aa161fa7 |
| BLAKE2b-256 | 8b21d30bdb491db5f03e277edf96a7d25518cbde4daa47f36bb75c16612db8e9 |


File details

Details for the file datacmp-3.0.0-py3-none-any.whl.

File metadata

  • Download URL: datacmp-3.0.0-py3-none-any.whl
  • Size: 23.7 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.2.0 CPython/3.12.7

File hashes

Hashes for datacmp-3.0.0-py3-none-any.whl:

| Algorithm | Hash digest |
|---|---|
| SHA256 | dc5ebe3838bcf2768135f955b9f9e74e883e44b64aee679a451bca8205377b56 |
| MD5 | 46135d45831d14260cecfa89084b9423 |
| BLAKE2b-256 | 1d5a53ffd2e6926f5a57914205533835aee7e2a945c347047a2b4b0ff42512ff |

