A streamlined data preprocessing toolkit for machine learning
DataPrepKit_AO_v1.0
DataPrepKit_AO_v1.0 is a Python library for data preprocessing. It provides tools to summarize, clean, and transform a pandas DataFrame so that it is ready for machine learning or analysis.
Features
- Data Summary (data_summary): Get a detailed overview of your dataset:
  - Shape (rows, columns)
  - Missing value counts per column
  - Unique value counts per column
  - Descriptive statistics (mean, median, mode, standard deviation, minimum, maximum) for numerical columns
  - Value counts for non-numerical columns
- Encoding Categorical Features (encode_categorical):
  - Label Encoding: Convert categorical values into numerical labels.
  - Ordinal Encoding: Encode categories in a specific order (if an order is inherent in your data).
  - One-Hot Encoding: Create dummy variables for each category.
- Scaling Numerical Features (scale_data):
  - Standard Scaling: Center data around mean 0 with standard deviation 1.
  - Min-Max Scaling: Rescale features to a range of 0 to 1.
  - Robust Scaling: Scale using the median and IQR, which makes the result less sensitive to outliers.
  - Normalization: Scale each sample (row) to have unit norm (length of 1).
- Imputation of Missing Values (impute_data):
  - Dropping Missing Values: Remove rows with missing values.
  - Mean/Median/Most Frequent Imputation: Fill missing values using central tendency measures.
  - Constant Value Imputation: Fill with a user-specified constant.
  - K-Nearest Neighbors (KNN) Imputation: Use the values of nearest neighbors to fill in missing data.
  - Iterative Imputation: A more advanced method that models each feature with missing values as a function of other features.
- Dropping Rows and Columns (drop): Remove specified rows or columns from your DataFrame.
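The four scaling options listed above can be written out as plain formulas. The sketch below uses only the Python standard library and is not the library's own code: the function names are hypothetical, and it assumes every column has at least two distinct values so that no division by zero occurs.

```python
import math
import statistics


def standard_scale(values):
    """Subtract the mean and divide by the population standard deviation
    (the same convention scikit-learn's StandardScaler uses)."""
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)
    return [(v - mean) / std for v in values]


def min_max_scale(values):
    """Map the smallest value to 0 and the largest to 1."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]


def robust_scale(values):
    """Subtract the median and divide by the interquartile range (Q3 - Q1)."""
    q1, median, q3 = statistics.quantiles(values, n=4, method="inclusive")
    return [(v - median) / (q3 - q1) for v in values]


def unit_norm(row):
    """Divide one sample (row) by its Euclidean length."""
    norm = math.sqrt(sum(v * v for v in row))
    return [v / norm for v in row]


# Hypothetical column with one large outlier.
column = [1.0, 2.0, 3.0, 4.0, 100.0]

print(min_max_scale(column))
print(robust_scale(column))
print(unit_norm([3.0, 4.0]))
```

On this column the outlier 100.0 squeezes the min-max results for the other four values below 0.04, while robust scaling keeps them between -1 and 0.5. That difference is the usual reason to prefer robust scaling when outliers are present.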
Installation
Install the dependencies:
pip install pandas numpy scikit-learn
Usage
import pandas as pd
from dataprepkit.data_prep_kit import DataPrepKit
# Load data
data = pd.read_csv('your_data.csv')
data_prep = DataPrepKit(data=data)
# Data understanding
data_prep.data_summary()
# Encoding
data_prep.encode_categorical(columns=['your_categorical_column'], method='one-hot')
# Scaling
data_prep.scale_data(columns=['your_numerical_column'], method='standard')
# Imputation
data_prep.impute_data(columns=['column_with_missing_values'], method='mean')
# Dropping data
data_prep.drop(columns=['unwanted_column'])
data_prep.drop(rows=[0, 5]) # Drop specific rows
# Access the processed data
processed_df = data_prep.data
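To get a feel for the figures a summary step reports and for what mean imputation does, without needing a CSV file, the following sketch computes them with plain pandas on a small in-memory frame. It does not call DataPrepKit: the column names and values are invented for illustration, and the library's actual output format may differ.

```python
import pandas as pd

# Toy frame, invented for illustration: one categorical column and
# one numerical column, each with a single missing entry.
df = pd.DataFrame({
    "city": ["Cairo", "Giza", None, "Cairo"],
    "age": [25.0, None, 31.0, 40.0],
})

shape = df.shape                          # (rows, columns)
missing = df.isna().sum()                 # missing values per column
unique = df.nunique()                     # distinct non-missing values per column
numeric_stats = df.describe()             # count, mean, std, min, quartiles, max
city_counts = df["city"].value_counts()   # frequency of each category

# Mean imputation: replace the missing age with the mean of the observed ages.
age_filled = df["age"].fillna(df["age"].mean())

print(shape)
print(missing)
print(age_filled.tolist())
```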
Error Handling
The library includes checks for:
- Missing columns: KeyError is raised if you try to operate on columns not in your DataFrame.
- Invalid method names: ValueError is raised if you specify an incorrect encoding, scaling, or imputation method.
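The two checks described above follow a common validation pattern. The sketch below shows one way such checks might look and how a caller could react to each exception. The helper names check_request and outcome are hypothetical and are not part of DataPrepKit, and the allowed method names are passed in by the caller rather than taken from the library.

```python
def check_request(available_columns, columns, method, allowed_methods):
    """Raise KeyError for unknown columns and ValueError for an unknown method."""
    unknown = [c for c in columns if c not in available_columns]
    if unknown:
        raise KeyError(f"Columns not found: {unknown}")
    if method not in allowed_methods:
        raise ValueError(
            f"Unknown method {method!r}; expected one of {sorted(allowed_methods)}"
        )


def outcome(available_columns, columns, method, allowed_methods):
    """Run the check and report which exception, if any, was raised."""
    try:
        check_request(available_columns, columns, method, allowed_methods)
    except KeyError:
        return "KeyError"
    except ValueError:
        return "ValueError"
    return "ok"


# Hypothetical inputs for illustration.
available = ["age", "city"]
methods = {"mean", "median"}

print(outcome(available, ["age"], "mean", methods))
print(outcome(available, ["height"], "mean", methods))
print(outcome(available, ["age"], "average", methods))
```

In your own code, the same try/except structure can be wrapped around calls such as scale_data or impute_data, so that a misspelled column or method name is reported clearly instead of stopping a pipeline partway through.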
License
This project is licensed under the MIT License. See the LICENSE file for details.
Contact
For any questions, please contact abdalrahman.osama01@gmail.com.