A lightweight library for exploring and cleaning datasets for ML workflows.
Datacmp
Datacmp is a lightweight and modular Python library for exploratory data analysis (EDA) and data cleaning using pandas.
It helps you quickly generate summaries, standardize column names, and handle missing values, with tabulated output and optional YAML configuration.
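To make "standardize column names" concrete, here is a minimal sketch of one common convention (trimmed, lower-case, snake_case). The helper `standardize_column_name` is hypothetical and written for illustration; it is not taken from Datacmp's `column_cleaning.py`, whose exact rules may differ.

```python
import re


def standardize_column_name(name: str) -> str:
    """Hypothetical helper: trim, lower-case, and snake_case a column label."""
    name = str(name).strip().lower()
    # Replace any run of non-alphanumeric characters with a single underscore.
    name = re.sub(r"[^0-9a-z]+", "_", name)
    # Drop underscores left over at either end.
    return name.strip("_")


columns = [" First Name ", "Total-Sales ($)", "signup.date"]
cleaned = [standardize_column_name(c) for c in columns]
print(cleaned)  # ['first_name', 'total_sales', 'signup_date']
```

With a pandas DataFrame `df`, the same helper could be applied as `df.columns = [standardize_column_name(c) for c in df.columns]`.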
Features
- Dataset Summary
  - Report total rows, columns, data types, missing values, and basic statistics
- Column Name Cleaning
  - Standardize column names for readability and consistency
- Missing Value Handling (clean_missing_data)
  - Convert data types (numeric and datetime)
  - Drop columns with excessive missing values
  - Fill missing values using intelligent strategies (mean, median, mode)
  - Optionally remove duplicate rows
- YAML Configuration Support
  - Customize behavior using config.yaml without touching your code
- Formatted Output
  - Display insights with beautiful, readable tables powered by tabulate
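The missing-value steps listed above can be sketched in plain pandas. This is an illustration only, not the implementation of `clean_missing_data`: the function name `sketch_clean_missing`, its parameters, and the 0.5 threshold are assumptions, and the type-conversion step is left out for brevity.

```python
import pandas as pd


def sketch_clean_missing(df: pd.DataFrame, drop_threshold: float = 0.5,
                         drop_duplicates: bool = True) -> pd.DataFrame:
    """Illustrative only; not Datacmp's clean_missing_data."""
    # Drop columns whose share of missing values exceeds the threshold.
    missing_share = df.isna().mean()
    df = df.loc[:, missing_share <= drop_threshold].copy()

    for col in df.columns:
        if not df[col].isna().any():
            continue
        if pd.api.types.is_numeric_dtype(df[col]):
            # Numeric columns: fill with the median of the observed values.
            df[col] = df[col].fillna(df[col].median())
        else:
            # Other columns: fill with the most frequent observed value.
            mode = df[col].mode(dropna=True)
            if not mode.empty:
                df[col] = df[col].fillna(mode.iloc[0])

    if drop_duplicates:
        df = df.drop_duplicates().reset_index(drop=True)
    return df


raw = pd.DataFrame({
    "age": [25.0, None, 40.0, 40.0],
    "city": ["Cairo", "Cairo", None, "Giza"],
    "notes": [None, None, None, "x"],  # 75% missing, so it is dropped
})
cleaned = sketch_clean_missing(raw)
print(cleaned)
```

Filling happens before duplicate removal here, so rows that only become identical after imputation are also collapsed; whether Datacmp orders the steps this way is not stated in this README.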
Installation
Clone the repository:
git clone https://github.com/MoustafaMohamed01/datacmp.git
cd datacmp
Install dependencies:
pip install -r requirements.txt
Or install them manually:
pip install pandas tabulate
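After installing, a quick way to confirm that both dependencies are importable is a short standard-library check. The helper name `missing_packages` is made up for this sketch and is not part of Datacmp.

```python
from importlib.util import find_spec


def missing_packages(names):
    """Return the top-level module names that cannot be found for import."""
    return [name for name in names if find_spec(name) is None]


# pandas and tabulate are the two dependencies named in the install step.
missing = missing_packages(["pandas", "tabulate"])
if missing:
    print("Missing packages:", ", ".join(missing))
else:
    print("All dependencies are importable.")
```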
Project Structure
datacmp/
│
├── datacmp/
│ ├── __init__.py # Main package initializer
│ ├── column_cleaning.py # Functions to clean column names
│ ├── detailed_info.py # EDA functions for summarizing datasets
│ ├── data_cleaning.py # Functions to handle missing values intelligently
│
├── config.yaml # Optional configuration file
├── LICENSE # MIT license
├── requirements.txt # Project dependencies
├── setup.py # Setup script for packaging
├── README.md # Project documentation
Requirements
- pandas
- tabulate
All required packages are listed in requirements.txt.
License
This project is licensed under the MIT License.
Author
Developed by Moustafa Mohamed
File details
Details for the file datacmp-0.1.0.tar.gz.
File metadata
- Download URL: datacmp-0.1.0.tar.gz
- Size: 4.6 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.1.0 CPython/3.13.2
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | b25186b35f5247b530a2ff83b2e361132701af5e31005dbca2205c9640b9ef56 |
| MD5 | 8938c376d0957143d57e030c0c59ea8a |
| BLAKE2b-256 | 2b7704766535e0b0ca9a7f1bcc6d4f0f4b06955dc56dbb22d5c806f7885593d4 |
File details
Details for the file datacmp-0.1.0-py3-none-any.whl.
File metadata
- Download URL: datacmp-0.1.0-py3-none-any.whl
- Size: 5.4 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.1.0 CPython/3.13.2
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 9dc3718a829e18beab7809258e51cec861c3a6b15680ebe013cb3747d0844745 |
| MD5 | c376d03c48e22cf83ef840f16fb69ee9 |
| BLAKE2b-256 | d9b78f4bfe02c4f08ec016e9abc3ab6ac1fe38ea0898e156723e09ddcd4f7087 |