A powerful and configurable library for exploratory data analysis (EDA) and data cleaning for machine learning workflows.

datacmp – Exploratory Data Analysis & Data Cleaning Toolkit

datacmp is a lightweight, modular Python library for exploratory data analysis (EDA) and data cleaning in data science workflows. It profiles a dataset, standardizes column names, handles missing values and outliers, and reads its settings from a YAML configuration file.

Available on PyPI: https://pypi.org/project/datacmp/


Key Features

Data Overview & Profiling

  • Generates concise, tabulated summaries of your dataset
  • Reports missing values, data types, and unique counts
  • Optional extended statistics: mean, median, std, skewness, kurtosis
  • Column type breakdown: numeric, categorical, datetime
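
To make the profiling output concrete, the summary described above can be approximated with plain pandas. This is a minimal sketch, not datacmp's implementation: the function names `profile` and `type_breakdown` and the output column names are hypothetical, and treating every non-numeric, non-datetime column as categorical is an assumption.

```python
import pandas as pd


def profile(df: pd.DataFrame, include_more_stats: bool = False) -> pd.DataFrame:
    """Return one row per column: dtype, missing count/percent, unique count."""
    summary = pd.DataFrame({
        "dtype": df.dtypes.astype(str),
        "missing": df.isna().sum(),
        "missing_pct": (df.isna().mean() * 100).round(2),
        "unique": df.nunique(),  # NaN is not counted as a value
    })
    if include_more_stats:
        numeric = df.select_dtypes(include="number")
        # Assignment aligns on column name, so non-numeric rows stay NaN.
        summary["mean"] = numeric.mean()
        summary["median"] = numeric.median()
        summary["std"] = numeric.std()
        summary["skewness"] = numeric.skew()
        summary["kurtosis"] = numeric.kurt()
    return summary


def type_breakdown(df: pd.DataFrame) -> dict:
    """Count numeric and datetime columns; the rest are treated as categorical."""
    n_numeric = df.select_dtypes(include="number").shape[1]
    n_datetime = df.select_dtypes(include="datetime").shape[1]
    return {
        "numeric": n_numeric,
        "datetime": n_datetime,
        "categorical": df.shape[1] - n_numeric - n_datetime,
    }
```

Calling `profile(df, include_more_stats=True)` mirrors the `include_more_stats` option shown in the configuration section; the resulting frame can be printed as a table with `tabulate` if desired.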

Column Name Standardization

  • Automatically cleans and renames columns (lowercase, no spaces)
  • Logs name transformations for traceability
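
As an illustration of the renaming step, the sketch below lowercases names, replaces whitespace with underscores, and records every change. The function name `standardize_columns` is hypothetical, and both the underscore replacement and the removal of other punctuation are assumptions; the source only states "lowercase, no spaces". Name collisions are not handled here.

```python
import re


def standardize_columns(names):
    """Return (new_names, renamed) where renamed lists (old, new) pairs."""
    new_names, renamed = [], []
    for name in names:
        # Trim, lowercase, and collapse each whitespace run to one underscore.
        new = re.sub(r"\s+", "_", str(name).strip().lower())
        # Assumption: drop anything that is not a letter, digit, or underscore.
        new = re.sub(r"[^0-9a-z_]", "", new)
        new_names.append(new)
        if new != name:
            renamed.append((name, new))  # kept for traceability
    return new_names, renamed
```

With a DataFrame, the result can be applied as `df.columns, log = standardize_columns(df.columns)`, after which `log` holds the transformations for reporting.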

Missing Value & Outlier Handling

  • Drops columns exceeding a missing value threshold
  • Fills missing values using configurable strategies (mean, median, mode)
  • Detects and handles outliers using IQR (remove or cap)
  • Optionally removes duplicate rows
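
The cleaning steps listed above can be sketched with pandas as follows. This is an illustrative approximation under stated assumptions, not datacmp's code: the function name `clean` and its parameters are hypothetical (chosen to echo the configuration keys), the step order is assumed, and only the "cap" outlier action is shown.

```python
import pandas as pd


def clean(df, threshold_drop=0.45, numeric_fill="median",
          iqr_multiplier=1.5, drop_duplicates=True):
    out = df.copy()

    # 1. Drop columns whose fraction of missing values exceeds the threshold.
    out = out.loc[:, out.isna().mean() <= threshold_drop]

    # 2. Fill numeric columns with the median or mean, other columns with the mode.
    for col in out.columns:
        if not out[col].isna().any():
            continue
        if pd.api.types.is_numeric_dtype(out[col]):
            fill = out[col].median() if numeric_fill == "median" else out[col].mean()
        else:
            modes = out[col].mode()
            if modes.empty:
                continue  # column is entirely missing; nothing to fill with
            fill = modes.iloc[0]
        out[col] = out[col].fillna(fill)

    # 3. Cap numeric values at the IQR fences: [Q1 - k*IQR, Q3 + k*IQR].
    for col in out.select_dtypes(include="number").columns:
        q1, q3 = out[col].quantile(0.25), out[col].quantile(0.75)
        iqr = q3 - q1
        out[col] = out[col].clip(lower=q1 - iqr_multiplier * iqr,
                                 upper=q3 + iqr_multiplier * iqr)

    # 4. Optionally remove exact duplicate rows.
    if drop_duplicates:
        out = out.drop_duplicates()
    return out
```

Capping keeps every row and only pulls extreme values back to the fence, whereas a "remove" action would drop the affected rows; which is preferable depends on how much data can be spared.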

YAML-Based Configuration

  • Easy customization of fill strategies, thresholds, and outlier handling
  • Fully decoupled from code logic for reproducibility

Export Capabilities (v2.0+)

  • Save cleaned datasets as CSV
  • Generate human-readable reports in TXT format

Command-Line Interface (v2.0+)

  • Run the full pipeline directly from terminal using a CLI wrapper

Installation

Install from PyPI:

pip install datacmp

Or install from source:

git clone https://github.com/MoustafaMohamed01/datacmp.git
cd datacmp
pip install -r requirements.txt

Requirements:

  • pandas
  • tabulate
  • PyYAML

Configuration (config.yaml)

Example configuration file:

cleaning:
  fill_strategy:
    categorical: mode
    numeric: median
  outlier_handling:
    enabled: true
    method: iqr
    action: cap
    iqr_multiplier: 1.5
  threshold_drop: 0.45
drop_duplicates: true
profiling:
  include_more_stats: true
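
One way to consume such a file is to parse it into a dictionary (for example with PyYAML's `yaml.safe_load`) and overlay it on a set of defaults, so that a partial config still yields every key. The sketch below shows only the overlay step. `DEFAULTS` and `merge_config` are hypothetical names, and the default values are simply copied from the example file above; they are not documented datacmp defaults.

```python
from copy import deepcopy

# Hypothetical defaults, copied from the example configuration above.
DEFAULTS = {
    "cleaning": {
        "fill_strategy": {"categorical": "mode", "numeric": "median"},
        "outlier_handling": {
            "enabled": True,
            "method": "iqr",
            "action": "cap",
            "iqr_multiplier": 1.5,
        },
        "threshold_drop": 0.45,
    },
    "drop_duplicates": True,
    "profiling": {"include_more_stats": True},
}


def merge_config(user, defaults=DEFAULTS):
    """Recursively overlay a user config dict on the defaults (non-mutating)."""
    merged = deepcopy(defaults)
    for key, value in (user or {}).items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = merge_config(value, merged[key])
        else:
            merged[key] = value
    return merged
```

For example, a config file containing only `cleaning: {outlier_handling: {action: remove}}` would change that one setting and leave `iqr_multiplier` at 1.5.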

Usage (Python)

Basic usage with config:

import pandas as pd
from datacmp.run_pipeline import run_pipeline

df = pd.read_csv("data.csv")
cleaned_df = run_pipeline(
    df,
    config_path="config.yaml",
    export_csv_path="cleaned.csv",
    export_report_path="summary.txt"
)

Usage (CLI)

Run from the command line:

python cli.py --file data.csv --config config.yaml --export_csv cleaned.csv --export_report summary.txt

Available arguments:

  • --file: input CSV file (required)
  • --config: YAML config file (default = config.yaml)
  • --export_csv: optional output path for cleaned CSV
  • --export_report: optional output path for summary TXT
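
For readers who want to wrap the pipeline in their own script, the documented arguments map directly onto `argparse`. The following is a sketch of an equivalent parser, not the contents of the project's `cli.py`; the function name `build_parser` and the help strings are assumptions.

```python
import argparse


def build_parser():
    """Build a parser exposing the four arguments documented above."""
    parser = argparse.ArgumentParser(
        description="Run an EDA and cleaning pipeline on a CSV file."
    )
    parser.add_argument("--file", required=True, help="input CSV file")
    parser.add_argument("--config", default="config.yaml",
                        help="YAML config file")
    parser.add_argument("--export_csv", default=None,
                        help="optional output path for the cleaned CSV")
    parser.add_argument("--export_report", default=None,
                        help="optional output path for the summary TXT")
    return parser
```

The parsed values (`args.file`, `args.config`, `args.export_csv`, `args.export_report`) line up with the `config_path`, `export_csv_path`, and `export_report_path` parameters shown in the Python usage section.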

Project Structure

datacmp/
├── datacmp/
│   ├── __init__.py
│   ├── column_cleaning.py       # Column renaming logic
│   ├── data_cleaning.py         # Missing value & outlier processing
│   ├── detailed_info.py         # Dataset summaries & profiling
│   └── run_pipeline.py          # Main pipeline logic
├── cli.py                       # CLI entry point
├── config.yaml                  # Example configuration
├── setup.py                     # Packaging & dependencies
├── README.md
└── LICENSE

Release History

🔹 v1.0.0 – Initial release

  • Data profiling, missing value handling, column name cleaning, YAML config support

🔹 v2.0.0 – Major update

  • Added CLI support
  • Added CSV & TXT export options
  • Enhanced profiling (column type summary)

View changelog & releases → https://github.com/MoustafaMohamed01/datacmp/releases


License

Released under the MIT License. See LICENSE for details.


Author

Developed by Moustafa Mohamed (LinkedIn, Kaggle)