A streamlined data preprocessing toolkit for machine learning

DataPrepKit_AO_v1.0

DataPrepKit_AO_v1.0 is a Python library for data preprocessing. It provides tools to summarize, clean, and transform a pandas DataFrame so that it is ready for machine learning or analysis.

Features

  • Data Summary (data_summary): Get a detailed overview of your dataset:
    • Shape (rows, columns)
    • Missing value counts per column
    • Unique value counts per column
    • Descriptive statistics (mean, median, mode, standard deviation, minimum, maximum) for numerical columns
    • Value counts for non-numerical columns
  • Encoding Categorical Features (encode_categorical):
    • Label Encoding: Convert categorical values into numerical labels.
    • Ordinal Encoding: Map categories to integers that preserve a meaningful order (use when the categories are inherently ordered).
    • One-Hot Encoding: Create dummy variables for each category.
  • Scaling Numerical Features (scale_data):
    • Standard Scaling: Rescale each feature to mean 0 and standard deviation 1.
    • Min-Max Scaling: Rescale each feature to the range 0 to 1.
    • Robust Scaling: Center on the median and scale by the interquartile range (IQR), which makes the result less sensitive to outliers.
    • Normalization: Scale each sample (row) to unit norm (length 1).
  • Imputation of Missing Values (impute_data):
    • Dropping Missing Values: Remove rows with missing values.
    • Mean/Median/Most Frequent Imputation: Fill missing values using central tendency measures.
    • Constant Value Imputation: Fill with a user-specified constant.
    • K-Nearest Neighbors (KNN) Imputation: Use the values of nearest neighbors to fill in missing data.
    • Iterative Imputation: A more advanced method that models each feature with missing values as a function of other features.
  • Dropping Rows and Columns (drop): Remove specified rows or columns from your DataFrame.
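
The four scaling options listed above differ only in which statistics they subtract and divide by. The following is a minimal sketch of that arithmetic using only the Python standard library. It is not DataPrepKit's implementation, which is not shown on this page; the function names are hypothetical, the standard-scaling variant assumes the population standard deviation, and the quartiles assume linear interpolation. The sketch does not guard against constant columns, which would cause a division by zero.

```python
import math
import statistics


def standard_scale(values):
    """Subtract the mean, divide by the population standard deviation."""
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)
    return [(v - mean) / std for v in values]


def min_max_scale(values):
    """Map the minimum to 0 and the maximum to 1."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]


def robust_scale(values):
    """Subtract the median, divide by the interquartile range (Q3 - Q1)."""
    q1, median, q3 = statistics.quantiles(values, n=4, method="inclusive")
    return [(v - median) / (q3 - q1) for v in values]


def normalize_row(row):
    """Divide one sample (row) by its Euclidean length."""
    norm = math.sqrt(sum(v * v for v in row))
    return [v / norm for v in row]


# One numerical column containing an outlier (100.0).
column = [1.0, 2.0, 3.0, 4.0, 100.0]

print(standard_scale(column))
print(min_max_scale(column))
print(robust_scale(column))        # [-1.0, -0.5, 0.0, 0.5, 48.5]
print(normalize_row([3.0, 4.0]))   # [0.6, 0.8]
```

With robust scaling, the four typical values stay within -1 to 0.5 and only the outlier is pushed far away. With min-max scaling, the same outlier compresses the other four values into a narrow band near 0.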

Installation

Install the required dependencies:

pip install pandas numpy scikit-learn

Usage

import pandas as pd
from dataprepkit.data_prep_kit import DataPrepKit

# Load data
data = pd.read_csv('your_data.csv') 
data_prep = DataPrepKit(data=data)

# Data understanding
data_prep.data_summary()

# Encoding 
data_prep.encode_categorical(columns=['your_categorical_column'], method='one-hot')

# Scaling
data_prep.scale_data(columns=['your_numerical_column'], method='standard')

# Imputation
data_prep.impute_data(columns=['column_with_missing_values'], method='mean')

# Dropping data
data_prep.drop(columns=['unwanted_column'])
data_prep.drop(rows=[0, 5])  # Drop specific rows

# Access the processed data
processed_df = data_prep.data
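
To make the 'one-hot' and 'mean' options used above concrete, here is a small library-independent sketch of what those transformations compute, with label encoding added for contrast. It uses only the standard library and plain lists. The helper names are hypothetical and are not part of DataPrepKit, missing values are represented as None for illustration, and the category ordering (sorted alphabetically) is an assumption.

```python
import statistics


def impute_mean(values):
    """Replace None entries with the mean of the observed values."""
    observed = [v for v in values if v is not None]
    fill = statistics.fmean(observed)
    return [fill if v is None else v for v in values]


def label_encode(values):
    """Map each category to an integer, numbering categories in sorted order."""
    codes = {cat: i for i, cat in enumerate(sorted(set(values)))}
    return [codes[v] for v in values]


def one_hot(values):
    """Return one 0/1 indicator column per distinct category."""
    categories = sorted(set(values))
    return {cat: [1 if v == cat else 0 for v in values] for cat in categories}


ages = [20.0, None, 40.0]
colors = ["red", "blue", "red"]

print(impute_mean(ages))     # [20.0, 30.0, 40.0]
print(label_encode(colors))  # [1, 0, 1]
print(one_hot(colors))       # {'blue': [0, 1, 0], 'red': [1, 0, 1]}
```

Label encoding produces a single integer column, which implies an ordering between categories. One-hot encoding avoids that implied ordering at the cost of one extra column per category.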

Error Handling

The library includes checks for:

  • Missing columns: KeyError is raised if you try to operate on columns that are not in your DataFrame.
  • Invalid method names: ValueError is raised if you specify an unsupported encoding, scaling, or imputation method.
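
The two checks above can be handled with separate except clauses. The sketch below shows the pattern with a stand-in validator, because the library's internals are not shown on this page: check_request, outcome, and the set of allowed method names are all hypothetical and exist only to demonstrate which exception corresponds to which mistake.

```python
def check_request(available_columns, columns, method, allowed_methods):
    """Hypothetical stand-in for the validation described above."""
    missing = [c for c in columns if c not in available_columns]
    if missing:
        raise KeyError(f"columns not found: {missing}")
    if method not in allowed_methods:
        raise ValueError(
            f"unknown method {method!r}; expected one of {sorted(allowed_methods)}"
        )


# Illustrative values only; the library's actual method names may differ.
AVAILABLE = ["age", "city"]
ALLOWED = {"mean", "median"}


def outcome(columns, method):
    """Run the check and report which kind of error, if any, occurred."""
    try:
        check_request(AVAILABLE, columns, method, ALLOWED)
    except KeyError as exc:
        return f"KeyError: {exc.args[0]}"
    except ValueError as exc:
        return f"ValueError: {exc.args[0]}"
    return "ok"


print(outcome(["age"], "mean"))      # valid column and method
print(outcome(["salary"], "mean"))   # column not present
print(outcome(["age"], "average"))   # unsupported method name
```

In real use, the same try/except structure would wrap calls such as impute_data or scale_data, so that a mistyped column name and a mistyped method name can be reported differently.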

License

This project is licensed under the MIT License. See the LICENSE file for details.

Contact

For any questions, please contact abdalrahman.osama01@gmail.com.
