🪿GeeseTools🛠
Modular and Extensible Data Preprocessing Library for Machine Learning
GeeseTools is a plug-and-play, mixin-based Python library that streamlines the preprocessing of tabular datasets for machine learning tasks. It covers the common steps in one place: cleaning messy data, encoding categories, transforming skewed distributions, and scaling features.
Features
- Handle missing data
- Convert object columns to numeric
- Identify feature types (categorical, ordinal, nominal, etc.)
- Encode nominal and ordinal features
- Transform skewed and heavy-tailed features
- Scale features with standard or power transformations
- Train-test split with optional oversampling
- Transformation logs for transparency and reproducibility
- Built using Mixins for modular extension
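To make the mixin idea concrete, here is a minimal pure-Python sketch of how small, single-purpose mixins can be composed into one preprocessor that also keeps a transformation log. Every class and method name in it (`LogMixin`, `ImputeMixin`, `ScaleMixin`, `Preprocessor`, `impute`, `scale`) is hypothetical and chosen for illustration; none of it is taken from the GeeseTools source.

```python
# Illustrative sketch only: these class and method names are hypothetical
# and are not taken from the GeeseTools source.


class LogMixin:
    """Records each transformation so the pipeline stays auditable."""

    def log(self, step, detail):
        # Create the log lazily so the mixin needs no __init__ of its own
        if not hasattr(self, "transformation_log"):
            self.transformation_log = []
        self.transformation_log.append({"step": step, "detail": detail})


class ImputeMixin:
    """Fills missing values (None) in each column with that column's mean."""

    def impute(self):
        for name, values in self.columns.items():
            present = [v for v in values if v is not None]
            if not present:
                continue  # nothing to average; leave the column untouched
            mean = sum(present) / len(present)
            self.columns[name] = [mean if v is None else v for v in values]
            missing = len(values) - len(present)
            self.log("impute", f"{name}: filled {missing} value(s)")
        return self


class ScaleMixin:
    """Rescales each column to the [0, 1] range (min-max scaling)."""

    def scale(self):
        for name, values in self.columns.items():
            low, high = min(values), max(values)
            span = (high - low) or 1.0  # avoid dividing by zero on constant columns
            self.columns[name] = [(v - low) / span for v in values]
            self.log("scale", f"{name}: min-max scaled")
        return self


class Preprocessor(ImputeMixin, ScaleMixin, LogMixin):
    """Composes the mixins; a new step is one more base class."""

    def __init__(self, columns):
        self.columns = {name: list(values) for name, values in columns.items()}


pre = Preprocessor({"age": [20, None, 40]}).impute().scale()
print(pre.columns)             # {'age': [0.0, 0.5, 1.0]}
print(pre.transformation_log)  # one entry per applied step
```

Each mixin touches only the shared `columns` attribute and the `log` method, which is what lets steps be added or removed without editing the others.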
⚙️ Installation
You can install the package directly from PyPI:
pip install GeeseTools
Usage
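# Assumes the class is exported at package level, as the
# GeeseTools() call in the sample-dataset example suggests
from GeeseTools import GeeseTools

# df is a pandas DataFrame that you have already loaded
# Instantiate with a dataset
obj = GeeseTools(
    dataframe=df,
    target_variable='target',
    ordinal_features=['education_level'],
    ordinal_categories=[['Low', 'Medium', 'High']],
    use_one_hot_encoding=True
)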
# Apply full preprocessing pipeline
X_train, X_test, y_train, y_test = obj.pre_process()
# Access logs
print(obj.transformation_log_df)
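The `ordinal_categories` argument lists each ordinal feature's categories from lowest to highest. As a rough illustration of what an explicit category order means, the sketch below encodes a toy `education_level` column with plain pandas. It is an assumption-laden stand-in for the concept, not a description of how GeeseTools encodes ordinal features internally, and the output column name `education_level_encoded` is made up for the example.

```python
import pandas as pd

# Toy data; the column name and category order mirror the call above
df = pd.DataFrame({"education_level": ["Medium", "Low", "High", "Low"]})

# Declare the order explicitly so that Low < Medium < High
ordered = pd.Categorical(
    df["education_level"],
    categories=["Low", "Medium", "High"],
    ordered=True,
)

# .codes gives each value's position in the declared order
df["education_level_encoded"] = ordered.codes
print(df["education_level_encoded"].tolist())  # [1, 0, 2, 0]
```

Without an explicit order, categories would typically be sorted alphabetically (High, Low, Medium), which is why the order has to be supplied by hand.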
Default Sample Dataset
If no DataFrame is provided, the processor loads a built-in heart.csv dataset:
obj = GeeseTools() # Uses sample heart dataset
# Apply full preprocessing pipeline
X_train, X_test, y_train, y_test = obj.pre_process()
Project Structure
📦 GeeseTools/
├── 📂 data/ # Contains bundled datasets
│ ├── 📄 heart.csv # Sample dataset (CSV format)
│ └── 📜 __init__.py # Makes 'data' a subpackage
│
├── 📜 GeeseTools.py # Core toolkit initializer or controller
├── 📜 datasets.py # Dataset loading utilities
├── 🧩 display_mixin.py # Display-related mixin
├── 🧩 drop_features_mixin.py # Drop unwanted features
├── 🧩 drop_records_mixin.py # Drop records based on rules
├── 🧩 encode_mixin.py # Encoding (label, one-hot)
├── 🧩 feature_target_split_mixin.py # Split into features & target
├── 🧩 feature_type_mixin.py # Feature type detection
├── 🧩 impute_features_mixin.py # Fill missing values
├── 🧩 missing_data_summary_mixin.py # Summary of missing data
├── 🧩 oversample_mixin.py # Oversampling (e.g., SMOTE)
├── 🧩 pre_process_mixin.py # Complete preprocessing pipeline
├── 🧩 sample_data_mixin.py # Random sampling utilities
├── 🧩 scale_mixin.py # Scaling methods
├── 🧩 split_dataframe_mixin.py # Split dataframe columns
├── 🧩 to_numeric_mixin.py # Convert to numeric
├── 🧩 transform_mixin.py # Feature transformations
├── 🧩 unique_value_summary_mixin.py # Unique value summary
└── 📜 __init__.py # Initializes GeeseTools package
Requirements
- Python 3.9–3.11
- pandas
- scikit-learn
- imbalanced-learn
- scipy
- ipython
- openpyxl
License
MIT © Abhijeet
You're free to use, modify, and distribute this project with proper attribution.
Contributions Welcome
Fork it!
File details
Details for the file geesetools-0.1.21.tar.gz.
File metadata
- Download URL: geesetools-0.1.21.tar.gz
- Upload date:
- Size: 28.4 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.1.0 CPython/3.10.11
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 36d49f394a29d53bdeb9f9eb57795d2baa229ababc45d0e4dff48cdfdcef1f1f |
| MD5 | 9001d88a70fcc9fcf897dadfc37655a4 |
| BLAKE2b-256 | 1798e1a9959f264697683934dcacc82877c295fd3de7e8f0312e46316127211c |
File details
Details for the file geesetools-0.1.21-py3-none-any.whl.
File metadata
- Download URL: geesetools-0.1.21-py3-none-any.whl
- Upload date:
- Size: 33.8 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.1.0 CPython/3.10.11
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 053c48cf1acd581afd9619bea698e803e506baa89d7e6e1534002503b0125c74 |
| MD5 | d1e9a2ce405014192ec153644ba44c9b |
| BLAKE2b-256 | abe7e2da36bb1b78c708f2db90963e38a0e7018e499c1fb04bbf868e612b4a0b |