A powerful and configurable library for exploratory data analysis (EDA) and data cleaning for machine learning workflows.
datacmp – Exploratory Data Analysis & Data Cleaning Toolkit
datacmp is a lightweight, modular Python library that simplifies exploratory data analysis (EDA) and data cleaning in data science workflows. It produces tabulated dataset summaries, applies configurable cleaning steps (missing values, outliers, duplicates, column names), and reads its settings from a YAML configuration file.
Available on PyPI: https://pypi.org/project/datacmp/
Key Features
Data Overview & Profiling
- Generates concise, tabulated summaries of your dataset
- Reports missing values, data types, and unique counts
- Optional extended statistics: mean, median, std, skewness, kurtosis
- Column type breakdown: numeric, categorical, datetime
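As a rough illustration of what such a profile contains, the following sketch builds a per-column summary with plain pandas. It is not datacmp's implementation: the function name `profile` and the output column names are hypothetical, and only the `include_more_stats` flag is borrowed from the example configuration.

```python
import pandas as pd


def profile(df: pd.DataFrame, include_more_stats: bool = False) -> pd.DataFrame:
    """Return one row per column with dtype, missing count, and unique count."""
    summary = pd.DataFrame({
        "dtype": df.dtypes.astype(str),
        "missing": df.isna().sum(),
        "unique": df.nunique(),  # NaN is not counted as a value
    })
    if include_more_stats:
        numeric = df.select_dtypes(include="number")
        stats = pd.DataFrame({
            "mean": numeric.mean(),
            "median": numeric.median(),
            "std": numeric.std(),
            "skewness": numeric.skew(),
            "kurtosis": numeric.kurt(),
        })
        # Non-numeric columns get NaN for the extended statistics.
        summary = summary.join(stats)
    return summary


df = pd.DataFrame({
    "age": [21, 35, None, 35],
    "city": ["Cairo", "Giza", "Giza", None],
})
summary = profile(df, include_more_stats=True)
print(summary)
```

Keeping the extended statistics behind a flag mirrors the "optional" wording above: the basic counts are cheap and apply to every column, while skewness and kurtosis only make sense for numeric data.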
Column Name Standardization
- Automatically cleans and renames columns (lowercase, no spaces)
- Logs name transformations for traceability
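A minimal sketch of this kind of standardization is shown below. The helper `clean_column_names` is hypothetical, and replacing whitespace with underscores is an assumption; the source only states that names become lowercase and lose their spaces.

```python
import re


def clean_column_names(columns):
    """Map each original name to a lowercase, space-free version."""
    mapping = {}
    for name in columns:
        # Assumption: trim the name, lowercase it, and join words with "_".
        cleaned = re.sub(r"\s+", "_", str(name).strip().lower())
        mapping[name] = cleaned
    return mapping


mapping = clean_column_names(["First Name", " Age ", "City"])
for old, new in mapping.items():
    if old != new:
        # Log each change so the transformation stays traceable.
        print(f"renamed {old!r} -> {new!r}")
```

With a pandas DataFrame, a mapping like this can be applied through `df.rename(columns=mapping)`.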
Missing Value & Outlier Handling
- Drops columns exceeding a missing value threshold
- Fills missing values using configurable strategies (mean, median, mode)
- Detects and handles outliers using IQR (remove or cap)
- Optionally removes duplicate rows
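The steps listed above can be sketched in plain pandas as follows. This is an illustration under stated assumptions, not datacmp's code: the function `clean` is hypothetical, it hard-codes the median/mode fill strategies and the "cap" action, and its parameter names are taken from the example configuration.

```python
import pandas as pd


def clean(df, threshold_drop=0.45, iqr_multiplier=1.5):
    # Drop columns whose fraction of missing values exceeds the threshold.
    df = df.loc[:, df.isna().mean() <= threshold_drop].copy()

    for col in df.columns:
        if pd.api.types.is_numeric_dtype(df[col]):
            # Numeric columns: fill with the median, then cap outliers (IQR rule).
            df[col] = df[col].fillna(df[col].median())
            q1, q3 = df[col].quantile([0.25, 0.75])
            spread = iqr_multiplier * (q3 - q1)
            df[col] = df[col].clip(lower=q1 - spread, upper=q3 + spread)
        else:
            # Other columns: fill with the most frequent value, if there is one.
            modes = df[col].mode()
            if not modes.empty:
                df[col] = df[col].fillna(modes.iloc[0])

    return df.drop_duplicates()


df = pd.DataFrame({
    "income": [10.0, 12.0, 11.0, None, 500.0],
    "city": ["Giza", "Giza", None, "Cairo", "Giza"],
    "notes": [None, None, None, "x", None],  # 80% missing, so it is dropped
})
cleaned = clean(df)
print(cleaned)
```

Capping keeps every row and pulls extreme values back to the IQR fences, whereas the "remove" action would discard those rows instead; which one is appropriate depends on whether the outliers are errors or genuine observations.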
YAML-Based Configuration
- Easy customization of fill strategies, thresholds, and outlier handling
- Keeps settings separate from code, so runs are reproducible
Export Capabilities (v2.0+)
- Save cleaned datasets as CSV
- Generate human-readable reports in TXT format
Command-Line Interface (v2.0+)
- Run the full pipeline directly from the terminal using a CLI wrapper
Installation
Install from PyPI:
pip install datacmp
Or install from source:
git clone https://github.com/MoustafaMohamed01/datacmp.git
cd datacmp
pip install -r requirements.txt
Requirements:
- pandas
- tabulate
- PyYAML
Configuration (config.yaml)
Example configuration file:
cleaning:
  fill_strategy:
    categorical: mode
    numeric: median
  outlier_handling:
    enabled: true
    method: iqr
    action: cap
    iqr_multiplier: 1.5
  threshold_drop: 0.45
  drop_duplicates: true

profiling:
  include_more_stats: true
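For readers who want to inspect such a file from their own code, here is a small sketch using PyYAML's `yaml.safe_load`. The nesting of the keys (for example `threshold_drop` sitting directly under `cleaning`) and the fallback values are assumptions made for illustration; datacmp's own loader may behave differently.

```python
import yaml

# A partial configuration; keys that are left out fall back to defaults below.
CONFIG_TEXT = """
cleaning:
  outlier_handling:
    enabled: true
    action: cap
  threshold_drop: 0.45
"""

config = yaml.safe_load(CONFIG_TEXT)
cleaning = config.get("cleaning", {})
outliers = cleaning.get("outlier_handling", {})

# The fallback values here are hypothetical, not datacmp's documented defaults.
threshold = cleaning.get("threshold_drop", 0.5)
multiplier = outliers.get("iqr_multiplier", 1.5)
include_more = config.get("profiling", {}).get("include_more_stats", False)

print(threshold, multiplier, include_more)
```

Reading each setting with `.get` and a fallback lets a short configuration file override only the values that differ from the defaults.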
Usage (Python)
Basic usage with config:
import pandas as pd
from datacmp.run_pipeline import run_pipeline

df = pd.read_csv("data.csv")
cleaned_df = run_pipeline(
    df,
    config_path="config.yaml",
    export_csv_path="cleaned.csv",
    export_report_path="summary.txt"
)
Usage (CLI)
Run from the command line:
python cli.py --file data.csv --config config.yaml --export_csv cleaned.csv --export_report summary.txt
Available arguments:
- --file: input CSV file (required)
- --config: YAML config file (default = config.yaml)
- --export_csv: optional output path for cleaned CSV
- --export_report: optional output path for summary TXT
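The argument list above maps naturally onto `argparse`. The sketch below is not the contents of the project's cli.py; it only shows one way a parser with these four flags and the documented default could be declared, and `build_parser` is a hypothetical name.

```python
import argparse


def build_parser():
    parser = argparse.ArgumentParser(description="Run a cleaning pipeline on a CSV file.")
    parser.add_argument("--file", required=True, help="input CSV file")
    parser.add_argument("--config", default="config.yaml", help="YAML config file")
    parser.add_argument("--export_csv", default=None, help="output path for the cleaned CSV")
    parser.add_argument("--export_report", default=None, help="output path for the summary TXT")
    return parser


# Parse an explicit argument list instead of sys.argv so the example is repeatable.
args = build_parser().parse_args(["--file", "data.csv", "--export_csv", "cleaned.csv"])
print(args.file, args.config, args.export_csv, args.export_report)
```

Because the two export flags default to `None`, a caller can skip writing the CSV or the report simply by omitting the corresponding flag.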
Project Structure
datacmp/
├── datacmp/
│ ├── __init__.py
│ ├── column_cleaning.py # Column renaming logic
│ ├── data_cleaning.py # Missing value & outlier processing
│ ├── detailed_info.py # Dataset summaries & profiling
│ ├── run_pipeline.py # Main pipeline logic
├── cli.py # CLI entry point
├── config.yaml # Example configuration
├── setup.py # Packaging & dependencies
├── README.md
├── LICENSE
Release History
🔹 v1.0.0 – Initial release
- Data profiling, missing value handling, column name cleaning, YAML config support
🔹 v2.0.0 – Major update
- Added CLI support
- Added CSV & TXT export options
- Enhanced profiling (column type summary)
View changelog & releases → https://github.com/MoustafaMohamed01/datacmp/releases
License
Released under the MIT License. See LICENSE for details.
Author
Developed by Moustafa Mohamed (LinkedIn • Kaggle)
File details
Details for the file datacmp-2.0.0.tar.gz.
File metadata
- Download URL: datacmp-2.0.0.tar.gz
- Upload date:
- Size: 7.6 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.1.0 CPython/3.13.2
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | a24f2c08ca4665d75187b8882dd061ac1d19b6290ba15b6ea344d143aa400be5 |
| MD5 | d7a0fa9e1255b70fcc03ec16a414bb68 |
| BLAKE2b-256 | 117dc8876de1d69237cf27b8861a324c66547961fe0f814fba6c896325ab62b9 |
File details
Details for the file datacmp-2.0.0-py3-none-any.whl.
File metadata
- Download URL: datacmp-2.0.0-py3-none-any.whl
- Upload date:
- Size: 8.6 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.1.0 CPython/3.13.2
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | cf7a9c44b43362a2eec2fd7dbde0fc5f6a8987767cf5bb4e638eedd9506e7dc1 |
| MD5 | 5b01f5e8ca4cde47b4a52b18efc47678 |
| BLAKE2b-256 | c1e2bca61ebd51bb7b9099b2b533e27ffef46a11289dfb2ba92fc61ee231e4fe |