🪿GeeseTools🛠

Modular and Extensible Data Preprocessing Library for Machine Learning

GeeseTools is a plug-and-play, mixin-based Python library that streamlines the preprocessing of tabular datasets for machine learning tasks. It covers cleaning messy data, encoding categories, transforming skewed distributions, and scaling features.


Features

  • Handle missing data
  • Convert object columns to numeric
  • Identify feature types (categorical, ordinal, nominal, etc.)
  • Encode nominal and ordinal features
  • Transform skewed and heavy-tailed features
  • Scale features with standard or power transformations
  • Train-test split with optional oversampling
  • Transformation logs for transparency and reproducibility
  • Built using Mixins for modular extension
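
To make the feature list concrete, the sketch below strings the same kinds of steps together with plain pandas and scikit-learn. It is not GeeseTools' implementation: the function name preprocess_sketch, the skew threshold of 1.0, and the (column, step) log format are assumptions made for illustration only.

```python
import pandas as pd
from pandas.api.types import is_numeric_dtype
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import PowerTransformer


def preprocess_sketch(df, target, ordinal_features, ordinal_categories,
                      test_size=0.25, seed=0, skew_threshold=1.0):
    """Hypothetical helper: plain pandas/scikit-learn, not GeeseTools code."""
    df = df.copy()
    log = []  # (column, step) pairs, a stand-in for a transformation log

    # 1. Convert text columns to numeric when every non-missing value parses.
    for col in df.columns:
        if not is_numeric_dtype(df[col]):
            converted = pd.to_numeric(df[col], errors="coerce")
            if converted.notna().sum() == df[col].notna().sum():
                df[col] = converted
                log.append((col, "to_numeric"))

    # 2. Impute: median for numeric columns, most frequent value otherwise.
    #    Computed on the full frame for brevity; fitting on the training
    #    split only would avoid leaking test information.
    for col in df.columns:
        if df[col].isna().any():
            if is_numeric_dtype(df[col]):
                fill = df[col].median()
            else:
                fill = df[col].mode().iloc[0]
            df[col] = df[col].fillna(fill)
            log.append((col, "impute"))

    # 3. Ordinal encoding: map each category to its rank in the given order.
    for col, cats in zip(ordinal_features, ordinal_categories):
        df[col] = pd.Categorical(df[col], categories=cats, ordered=True).codes
        log.append((col, "ordinal_encode"))

    # 4. One-hot encode the remaining non-numeric (nominal) features.
    nominal = [c for c in df.columns
               if c != target and not is_numeric_dtype(df[c])]
    continuous = [c for c in df.columns
                  if c != target and c not in nominal
                  and c not in ordinal_features]
    df = pd.get_dummies(df, columns=nominal, dtype=int)
    log.extend((c, "one_hot_encode") for c in nominal)

    # 5. Stratified train-test split.
    X, y = df.drop(columns=[target]), df[target]
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=test_size, random_state=seed, stratify=y)
    X_train, X_test = X_train.copy(), X_test.copy()

    # 6. Yeo-Johnson power transform (standardizes by default) for skewed
    #    columns, fitted on the training split only.
    skewed = [c for c in continuous
              if abs(X_train[c].skew()) > skew_threshold]
    if skewed:
        pt = PowerTransformer(method="yeo-johnson")
        X_train[skewed] = pt.fit_transform(X_train[skewed])
        X_test[skewed] = pt.transform(X_test[skewed])
        log.extend((c, "power_transform") for c in skewed)

    return X_train, X_test, y_train, y_test, log
```

Oversampling is left out of this sketch; with imbalanced-learn it would typically be applied to the training split only, after the split in step 5.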

⚙️ Installation

You can install the package directly from PyPI:

pip install GeeseTools

Usage

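# Assumes the package exposes the GeeseTools class at the top level,
# as the GeeseTools() call in the sample-dataset example suggests
from GeeseTools import GeeseTools

# df is a pandas DataFrame that you have already loaded

# Instantiate with a dataset
obj = GeeseTools(
    dataframe=df,
    target_variable='target',
    ordinal_features=['education_level'],
    # One ordered category list per ordinal feature, lowest to highest
    ordinal_categories=[['Low', 'Medium', 'High']],
    use_one_hot_encoding=True
)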

# Apply full preprocessing pipeline
X_train, X_test, y_train, y_test = obj.pre_process()

# Access logs
print(obj.transformation_log_df)
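
As a hedged follow-up, the four objects returned by pre_process() can be passed to a scikit-learn estimator. The sketch below assumes a classification target and that the splits are array-like and fully numeric; fit_and_score is a hypothetical helper, not part of GeeseTools.

```python
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score


def fit_and_score(X_train, X_test, y_train, y_test):
    """Hypothetical helper: fit a baseline classifier and return test accuracy."""
    model = LogisticRegression(max_iter=1000)
    model.fit(X_train, y_train)
    return accuracy_score(y_test, model.predict(X_test))
```

For example, fit_and_score(X_train, X_test, y_train, y_test) gives a quick baseline before trying other models.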

Default Sample Dataset

If no DataFrame is provided, the processor loads a built-in heart.csv dataset:

obj = GeeseTools()  # Uses sample heart dataset

# Apply full preprocessing pipeline
X_train, X_test, y_train, y_test = obj.pre_process()

Project Structure

📦 GeeseTools/
├── 📂 data/                            #  Contains bundled datasets
│   ├── 📄 heart.csv                    #  Sample dataset (CSV format)
│   └── 📜 __init__.py                  #  Makes 'data' a subpackage
│
├── 📜 GeeseTools.py                    #  Core toolkit initializer or controller
├── 📜 datasets.py                      #  Dataset loading utilities
├── 🧩 display_mixin.py                 #  Display-related mixin
├── 🧩 drop_features_mixin.py           #  Drop unwanted features
├── 🧩 drop_records_mixin.py            #  Drop records based on rules
├── 🧩 encode_mixin.py                  #  Encoding (label, one-hot)
├── 🧩 feature_target_split_mixin.py    #  Split into features & target
├── 🧩 feature_type_mixin.py            #  Feature type detection
├── 🧩 impute_features_mixin.py         #  Fill missing values
├── 🧩 missing_data_summary_mixin.py    #  Summary of missing data
├── 🧩 oversample_mixin.py              #  Oversampling (e.g., SMOTE)
├── 🧩 pre_process_mixin.py             #  Complete preprocessing pipeline
├── 🧩 sample_data_mixin.py             #  Random sampling utilities
├── 🧩 scale_mixin.py                   #  Scaling methods
├── 🧩 split_dataframe_mixin.py         #  Split dataframe columns
├── 🧩 to_numeric_mixin.py              #  Convert to numeric
├── 🧩 transform_mixin.py               #  Feature transformations
├── 🧩 unique_value_summary_mixin.py    #  Unique value summary
└── 📜 __init__.py                      #  Initializes GeeseTools package
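
The one-mixin-per-file layout above suggests a class assembled from small, single-purpose mixins. The sketch below shows that general pattern in plain Python. All class and method names in it (LogMixin, DropRecordsMixin, ScaleMixin, TinyPreprocessor) are hypothetical and are not taken from the GeeseTools source.

```python
class LogMixin:
    """Records every transformation so the pipeline stays auditable."""

    def _log(self, step, detail):
        self.transformation_log.append({"step": step, "detail": detail})


class DropRecordsMixin:
    """Drops rows that contain a missing (None) value."""

    def drop_incomplete_records(self):
        before = len(self.rows)
        self.rows = [r for r in self.rows if None not in r.values()]
        self._log("drop_records", f"removed {before - len(self.rows)} rows")
        return self  # returning self allows method chaining


class ScaleMixin:
    """Min-max scales one column into the range [0, 1]."""

    def min_max_scale(self, column):
        values = [r[column] for r in self.rows]
        lo, hi = min(values), max(values)
        span = (hi - lo) or 1  # avoid division by zero for constant columns
        for r in self.rows:
            r[column] = (r[column] - lo) / span
        self._log("scale", f"min-max scaled '{column}'")
        return self


class TinyPreprocessor(LogMixin, DropRecordsMixin, ScaleMixin):
    """Composes the mixins; each one contributes a single capability."""

    def __init__(self, rows):
        self.rows = [dict(r) for r in rows]  # copy so input is not mutated
        self.transformation_log = []

    def pre_process(self, column):
        return self.drop_incomplete_records().min_max_scale(column).rows
```

Adding a new step in this style means writing one more mixin and listing it in the base classes, without touching the existing ones.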

Requirements

  • Python 3.9–3.11
  • pandas
  • scikit-learn
  • imbalanced-learn
  • scipy
  • ipython
  • openpyxl

License

MIT © Abhijeet
You're free to use, modify, and distribute this project with proper attribution.


Contributions Welcome

Fork it!
