A lightweight library for exploring and cleaning datasets for ML workflows.

Datacmp

License: MIT

Datacmp is a lightweight, modular Python library for exploratory data analysis (EDA) and data cleaning with pandas.
It generates dataset summaries, standardizes column names, and handles missing values, with tabulated output and optional YAML configuration.


Features

  • Dataset Summary
    • Report total rows, columns, data types, missing values, and basic statistics
  • Column Name Cleaning
    • Standardize column names for readability and consistency
  • Missing Value Handling (clean_missing_data)
    • Convert data types (numeric and datetime)
    • Drop columns with excessive missing values
    • Fill missing values with the mean, median, or mode
    • Optionally remove duplicate rows
  • YAML Configuration Support
    • Customize behavior using config.yaml without touching your code
  • Formatted Output
    • Display results in readable tables powered by tabulate
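
The column-cleaning and missing-value steps listed above can be sketched in plain pandas as follows. This is an illustration under stated assumptions, not the library's implementation: the function names `clean_column_names` and `fill_and_prune`, the `drop_threshold` and `numeric_strategy` parameters, and the 50% default are all hypothetical, and type conversion is left out for brevity.

```python
import pandas as pd


def clean_column_names(df: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical helper: trim, lowercase, and snake_case the column names."""
    cleaned = (
        df.columns.astype(str)
        .str.strip()
        .str.lower()
        .str.replace(r"[^0-9a-z]+", "_", regex=True)  # runs of other chars -> "_"
        .str.strip("_")
    )
    return df.rename(columns=dict(zip(df.columns, cleaned)))


def fill_and_prune(
    df: pd.DataFrame,
    drop_threshold: float = 0.5,        # assumed default, not from the project
    numeric_strategy: str = "median",   # "mean" or "median"
    drop_duplicates: bool = True,
) -> pd.DataFrame:
    """Hypothetical helper: drop sparse columns, fill gaps, optionally dedupe."""
    # Keep only columns whose missing fraction is at or below the threshold.
    out = df.loc[:, df.isna().mean() <= drop_threshold].copy()
    for col in out.columns:
        if not out[col].isna().any():
            continue
        if pd.api.types.is_numeric_dtype(out[col]):
            fill = out[col].mean() if numeric_strategy == "mean" else out[col].median()
        else:
            # Non-numeric columns fall back to the most frequent value.
            mode = out[col].mode(dropna=True)
            fill = mode.iloc[0] if not mode.empty else None
        if fill is not None:
            out[col] = out[col].fillna(fill)
    if drop_duplicates:
        out = out.drop_duplicates()
    return out.reset_index(drop=True)


# Illustrative data only.
raw = pd.DataFrame({
    " First Name ": ["Ann", "Bob", None, "Ann"],
    "Age (years)": [30.0, None, 50.0, 30.0],
    "Mostly Empty": [None, None, None, 1.0],
})

result = fill_and_prune(clean_column_names(raw))
print(result)
```

Duplicates are removed after filling so that rows which only become identical once gaps are filled are also collapsed; whether Datacmp orders these steps the same way is not stated in this README.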

Installation

Clone the repository:

git clone https://github.com/MoustafaMohamed01/datacmp.git
cd datacmp

Install dependencies:

pip install -r requirements.txt

Or install them manually:

pip install pandas tabulate

Project Structure

datacmp/
│
├── datacmp/
│   ├── __init__.py            # Main package initializer
│   ├── column_cleaning.py     # Functions to clean column names
│   ├── detailed_info.py       # EDA functions for summarizing datasets
│   └── data_cleaning.py       # Functions to handle missing values
│
├── config.yaml                # Optional configuration file
├── LICENSE                    # MIT license
├── requirements.txt           # Project dependencies
├── setup.py                   # Setup script for packaging
└── README.md                  # Project documentation

Requirements

  • pandas
  • tabulate

All required packages are listed in requirements.txt.


License

This project is licensed under the MIT License.


Author

Developed by Moustafa Mohamed
