A powerful and configurable library for exploratory data analysis (EDA) and data cleaning for machine learning workflows.

License
- OSI Approved :: MIT License
Operating System
- OS Independent
Programming Language
- Python :: 3

Project description

datacmp – Exploratory Data Analysis & Data Cleaning Toolkit

datacmp is a lightweight, modular Python library that simplifies and speeds up exploratory data analysis (EDA) and data cleaning in data science workflows. It produces structured dataset summaries, applies configurable preprocessing (missing value filling, outlier handling, duplicate removal), and is driven by a YAML-based pipeline configuration.

Available on PyPI: https://pypi.org/project/datacmp/

Key Features

Data Overview & Profiling

Generates concise, tabulated summaries of your dataset
Reports missing values, data types, and unique counts
Optional extended statistics: mean, median, std, skewness, kurtosis
Column type breakdown: numeric, categorical, datetime
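
As a rough illustration of the kind of summary described above, the following sketch builds a per-column profile with plain pandas. It is not datacmp's implementation: the function name `profile` and its `include_more_stats` parameter are hypothetical, chosen here to echo the config key shown later in this document.

```python
import pandas as pd


def profile(df: pd.DataFrame, include_more_stats: bool = False) -> pd.DataFrame:
    """Return one row per column: dtype, missing count, unique count."""
    summary = pd.DataFrame({
        "dtype": df.dtypes.astype(str),
        "missing": df.isna().sum(),
        "unique": df.nunique(),
    })
    if include_more_stats:
        # Extended statistics apply to numeric columns only;
        # rows for other columns are left as NaN by index alignment.
        numeric = df.select_dtypes(include="number")
        summary["mean"] = numeric.mean()
        summary["median"] = numeric.median()
        summary["std"] = numeric.std()
        summary["skewness"] = numeric.skew()
        summary["kurtosis"] = numeric.kurt()
    return summary


df = pd.DataFrame({
    "age": [25, 32, None, 41],
    "city": ["Cairo", "Giza", "Cairo", None],
})
report = profile(df, include_more_stats=True)
print(report.to_string())
```

The library itself lists tabulate as a dependency for its tabulated output; this sketch prints with `DataFrame.to_string` only to stay dependency-light.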

Column Name Standardization

Automatically cleans and renames columns (lowercase, no spaces)
Logs name transformations for traceability
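
A minimal sketch of this kind of renaming, under stated assumptions: the helper `standardize_columns` is hypothetical, and replacing whitespace with underscores is a choice made for this example (the description above only says "lowercase, no spaces"). The returned mapping doubles as the rename log.

```python
import re


def standardize_columns(names):
    """Return (new_names, mapping) with lowercase, space-free column names."""
    # Trim the ends, collapse inner whitespace to "_", then lowercase.
    new_names = [re.sub(r"\s+", "_", str(name).strip()).lower() for name in names]
    mapping = dict(zip(names, new_names))
    return new_names, mapping


new_names, mapping = standardize_columns(["First Name", " Total Sales ", "ID"])
for old, new in mapping.items():
    if old != new:
        # Record each transformation so the change can be traced later.
        print(f"renamed {old!r} -> {new!r}")
```

With a pandas DataFrame, the same mapping can be applied through `df.rename(columns=mapping)`.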

Missing Value & Outlier Handling

Drops columns exceeding a missing value threshold
Fills missing values using configurable strategies (mean, median, mode)
Detects and handles outliers using IQR (remove or cap)
Optionally removes duplicate rows
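
The steps above can be sketched in plain pandas as follows. This is an illustrative approximation, not datacmp's code: the function `clean` and its parameters are hypothetical, it fixes the strategies to median (numeric) and mode (other columns), and it only caps outliers rather than offering the remove option.

```python
import pandas as pd


def clean(df, threshold_drop=0.45, iqr_multiplier=1.5):
    """Drop sparse columns, fill gaps, cap IQR outliers, drop duplicate rows."""
    # 1. Drop columns whose fraction of missing values exceeds the threshold.
    missing_fraction = df.isna().mean()
    out = df.loc[:, missing_fraction <= threshold_drop].copy()

    for col in out.columns:
        if pd.api.types.is_numeric_dtype(out[col]):
            # 2a. Numeric: fill with the median, then cap to the IQR fences.
            out[col] = out[col].fillna(out[col].median())
            q1, q3 = out[col].quantile(0.25), out[col].quantile(0.75)
            spread = q3 - q1
            out[col] = out[col].clip(q1 - iqr_multiplier * spread,
                                     q3 + iqr_multiplier * spread)
        else:
            # 2b. Non-numeric: fill with the most frequent value, if one exists.
            modes = out[col].mode()
            if not modes.empty:
                out[col] = out[col].fillna(modes.iloc[0])

    # 3. Remove exact duplicate rows.
    return out.drop_duplicates().reset_index(drop=True)


df = pd.DataFrame({
    "income": [40.0, 42.0, 41.0, None, 43.0, 400.0],
    "city": ["Cairo", "Giza", "Cairo", None, "Cairo", "Giza"],
    "notes": [None, None, None, "x", None, None],
})
cleaned = clean(df)
print(cleaned)
```

In this sample, "notes" is dropped because five of its six values are missing, and the extreme income value is pulled down to the upper IQR fence instead of being removed.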

YAML-Based Configuration

Easy customization of fill strategies, thresholds, and outlier handling
Settings live outside the code, so a run can be reproduced from its config file

Export Capabilities (v2.0+)

Save cleaned datasets as CSV
Generate human-readable reports in TXT format

Command-Line Interface (v2.0+)

Run the full pipeline directly from the terminal using a CLI wrapper

Installation

Install from PyPI:

pip install datacmp

Or install from source:

git clone https://github.com/MoustafaMohamed01/datacmp.git
cd datacmp
pip install -r requirements.txt

Requirements:

pandas
tabulate
PyYAML

Configuration (config.yaml)

Example configuration file:

cleaning:
  fill_strategy:
    categorical: mode
    numeric: median
  outlier_handling:
    enabled: true
    method: iqr
    action: cap
    iqr_multiplier: 1.5
  threshold_drop: 0.45
drop_duplicates: true
profiling:
  include_more_stats: true
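
To show how a file like this maps to Python values, the sketch below parses the same YAML with PyYAML's `yaml.safe_load` and reads a few keys. How datacmp itself consumes the config is not shown in this document, so the lookups here are only an illustration of the resulting nested dictionaries.

```python
import yaml  # provided by the PyYAML package

CONFIG_TEXT = """
cleaning:
  fill_strategy:
    categorical: mode
    numeric: median
  outlier_handling:
    enabled: true
    method: iqr
    action: cap
    iqr_multiplier: 1.5
  threshold_drop: 0.45
drop_duplicates: true
profiling:
  include_more_stats: true
"""

# safe_load returns plain dicts, strings, floats, and booleans.
config = yaml.safe_load(CONFIG_TEXT)

outliers = config["cleaning"]["outlier_handling"]
print(outliers["action"], outliers["iqr_multiplier"])
print(config["cleaning"]["threshold_drop"])
print(config["drop_duplicates"])
```

Note that in the example file `drop_duplicates` sits at the top level, not under `cleaning`, so it is read as `config["drop_duplicates"]`.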

Usage (Python)

Basic usage with config:

import pandas as pd
from datacmp.run_pipeline import run_pipeline

df = pd.read_csv("data.csv")
cleaned_df = run_pipeline(
    df,
    config_path="config.yaml",
    export_csv_path="cleaned.csv",
    export_report_path="summary.txt"
)

Usage (CLI)

Run from the command line:

python cli.py --file data.csv --config config.yaml --export_csv cleaned.csv --export_report summary.txt

Available arguments:

--file: input CSV file (required)
--config: YAML config file (default = config.yaml)
--export_csv: optional output path for cleaned CSV
--export_report: optional output path for summary TXT
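
For readers who want to see how such an interface is typically declared, here is a hypothetical argparse parser that mirrors the four documented arguments. It is a sketch based only on the list above, not the contents of the project's cli.py.

```python
import argparse


def build_parser():
    """Build a parser mirroring the documented flags (illustrative only)."""
    parser = argparse.ArgumentParser(
        description="Hypothetical mirror of the documented datacmp CLI flags"
    )
    parser.add_argument("--file", required=True, help="input CSV file")
    parser.add_argument("--config", default="config.yaml", help="YAML config file")
    parser.add_argument("--export_csv", default=None,
                        help="optional output path for cleaned CSV")
    parser.add_argument("--export_report", default=None,
                        help="optional output path for summary TXT")
    return parser


# Parse an explicit argument list so the example runs without a real shell call.
args = build_parser().parse_args(["--file", "data.csv", "--export_csv", "cleaned.csv"])
print(args.file, args.config, args.export_csv, args.export_report)
```

Omitted optional flags fall back to their defaults: `--config` resolves to config.yaml and the export paths stay unset.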

Project Structure

datacmp/
├── datacmp/
│   ├── __init__.py
│   ├── column_cleaning.py       # Column renaming logic
│   ├── data_cleaning.py         # Missing value & outlier processing
│   ├── detailed_info.py         # Dataset summaries & profiling
│   ├── run_pipeline.py          # Main pipeline logic
├── cli.py                       # CLI entry point
├── config.yaml                  # Example configuration
├── setup.py                     # Packaging & dependencies
├── README.md
├── LICENSE

Release History

🔹 v1.0.0 – Initial release

Data profiling, missing value handling, column name cleaning, YAML config support

🔹 v2.0.0 – Major update

Added CLI support
Added CSV & TXT export options
Enhanced profiling (column type summary)

View changelog & releases → https://github.com/MoustafaMohamed01/datacmp/releases

License

Released under the MIT License. See LICENSE for details.

Author

Developed by Moustafa Mohamed (LinkedIn, Kaggle)

Project details

Release history

3.0.0 (Oct 30, 2025)
2.0.0 (Jul 12, 2025), this version
0.1.0 (May 1, 2025)

Download files

Source Distribution

datacmp-2.0.0.tar.gz (7.6 kB)

Uploaded Jul 12, 2025 Source

Built Distribution

datacmp-2.0.0-py3-none-any.whl (8.6 kB)

Uploaded Jul 12, 2025 Python 3

File details

Details for the file datacmp-2.0.0.tar.gz.

File metadata

Download URL: datacmp-2.0.0.tar.gz
Upload date: Jul 12, 2025
Size: 7.6 kB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.1.0 CPython/3.13.2

File hashes

Hashes for datacmp-2.0.0.tar.gz
Algorithm	Hash digest
SHA256	`a24f2c08ca4665d75187b8882dd061ac1d19b6290ba15b6ea344d143aa400be5`
MD5	`d7a0fa9e1255b70fcc03ec16a414bb68`
BLAKE2b-256	`117dc8876de1d69237cf27b8861a324c66547961fe0f814fba6c896325ab62b9`

File details

Details for the file datacmp-2.0.0-py3-none-any.whl.

File metadata

Download URL: datacmp-2.0.0-py3-none-any.whl
Upload date: Jul 12, 2025
Size: 8.6 kB
Tags: Python 3
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.1.0 CPython/3.13.2

File hashes

Hashes for datacmp-2.0.0-py3-none-any.whl
Algorithm	Hash digest
SHA256	`cf7a9c44b43362a2eec2fd7dbde0fc5f6a8987767cf5bb4e638eedd9506e7dc1`
MD5	`5b01f5e8ca4cde47b4a52b18efc47678`
BLAKE2b-256	`c1e2bca61ebd51bb7b9099b2b533e27ffef46a11289dfb2ba92fc61ee231e4fe`
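
A downloaded file can be checked against the digests listed above with the standard library's hashlib. The sketch below is one way to do it; the helper names `sha256_of` and `verify` are hypothetical, and the expected value is the SHA256 digest listed for the wheel.

```python
import hashlib

# SHA256 digest listed above for datacmp-2.0.0-py3-none-any.whl.
EXPECTED_SHA256 = "cf7a9c44b43362a2eec2fd7dbde0fc5f6a8987767cf5bb4e638eedd9506e7dc1"


def sha256_of(path, chunk_size=65536):
    """Compute the SHA256 hex digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify(path, expected=EXPECTED_SHA256):
    """Return True when the file's digest matches the expected value."""
    return sha256_of(path) == expected.lower()
```

After downloading the wheel, `verify("datacmp-2.0.0-py3-none-any.whl")` should return True if the file is intact.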
