🪿🪿 GeeseTools 🛠️🛠️

Modular and Extensible Data Preprocessing Library for Machine Learning

GeeseTools is a plug-and-play, mixin-based Python library that streamlines the preprocessing of tabular datasets for machine learning tasks. Whether you’re cleaning messy data, encoding categories, transforming skewed distributions, or scaling features, this package has you covered.


🚀 Features

  • 🧼 Handle missing data
  • 🔢 Convert object columns to numeric
  • 🔍 Identify feature types (categorical, ordinal, nominal, etc.)
  • ⚙️ Encode nominal and ordinal features
  • 🔄 Transform skewed and heavy-tailed features
  • 📏 Scale features with standard or power transformations
  • 🧪 Train-test split with optional oversampling
  • 📊 Transformation logs for transparency and reproducibility
  • 🔌 Built using Mixins for modular extension
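
To illustrate the mixin idea behind the last bullet, here is a minimal, hypothetical sketch in plain Python. The class and method names (`LogMixin`, `FillMissingMixin`, `MinMaxScaleMixin`, `TinyPreprocessor`) are invented for illustration and are not the GeeseTools API. The sketch only shows how small single-purpose mixins can be composed into one preprocessor that also keeps a transformation log.

```python
# Hypothetical sketch: none of these names come from GeeseTools itself.

class LogMixin:
    """Records each transformation so the pipeline stays transparent."""

    def log(self, step, column):
        self.transformation_log.append({"step": step, "column": column})


class FillMissingMixin:
    """Replaces None values in a column with the column mean."""

    def fill_missing(self, column):
        present = [row[column] for row in self.rows if row[column] is not None]
        mean = sum(present) / len(present)
        for row in self.rows:
            if row[column] is None:
                row[column] = mean
        self.log("fill_missing_with_mean", column)


class MinMaxScaleMixin:
    """Rescales a numeric column to the [0, 1] range."""

    def min_max_scale(self, column):
        values = [row[column] for row in self.rows]
        low, high = min(values), max(values)
        span = (high - low) or 1.0  # avoid division by zero for constant columns
        for row in self.rows:
            row[column] = (row[column] - low) / span
        self.log("min_max_scale", column)


class TinyPreprocessor(FillMissingMixin, MinMaxScaleMixin, LogMixin):
    """Composes the mixins; a new step is added by writing one more mixin."""

    def __init__(self, rows):
        self.rows = [dict(row) for row in rows]  # copy so the input is untouched
        self.transformation_log = []

    def pre_process(self, column):
        self.fill_missing(column)
        self.min_max_scale(column)
        return self.rows


prep = TinyPreprocessor([{"age": 20}, {"age": None}, {"age": 40}])
result = prep.pre_process("age")
print(result)
print(prep.transformation_log)
```

Because each step lives in its own class, adding a new step means writing one more mixin and adding it to the list of base classes.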

⚙️ Installation

You can install the package directly from PyPI:

pip install GeeseTools

Or, after building your wheel file (.whl) from the source:

pip install dist/geesetools-0.1.8-py3-none-any.whl

Or install directly in editable mode (for development):

pip install -e .
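
After installing, one way to confirm that the distribution is visible to your interpreter is to query its metadata with the standard library. This is a small sketch: the helper name `installed_version` is made up for this example, and the distribution name `GeeseTools` is taken from the pip command above.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_version(distribution_name):
    """Return the installed version string, or None if it is not installed."""
    try:
        return version(distribution_name)
    except PackageNotFoundError:
        return None


found = installed_version("GeeseTools")
print(found if found is not None else "GeeseTools is not installed")
```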

🧪 Usage

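from GeeseTools import Goose  # assumed import path; check it against your installed version

# Instantiate with a dataset (df is a pandas DataFrame you have already loaded)
obj = Goose(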
    dataframe=df,
    target_variable='target',
    ordinal_features=['education_level'],
    ordinal_categories=[['Low', 'Medium', 'High']],
    use_one_hot_encoding=True
)

# Apply full preprocessing pipeline
X_train, X_test, y_train, y_test = obj.pre_process()

# Access logs
print(obj.transformation_log_df)
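
The `ordinal_features` and `ordinal_categories` arguments appear to pair each ordinal column with its categories listed from lowest to highest. As a rough illustration of what such an ordering typically means, the plain-Python sketch below maps each category to its position in the list and contrasts that with one-hot encoding, which the `use_one_hot_encoding` flag presumably enables for nominal columns. The helper names are hypothetical and this is not GeeseTools' internal code.

```python
def ordinal_encode(values, categories):
    """Map each value to its index in the ordered category list."""
    position = {category: index for index, category in enumerate(categories)}
    return [position[value] for value in values]


def one_hot_encode(values, categories):
    """Create one 0/1 indicator per category for each value."""
    return [[int(value == category) for category in categories] for value in values]


education = ["Medium", "Low", "High"]
order = ["Low", "Medium", "High"]

print(ordinal_encode(education, order))   # [1, 0, 2]
print(one_hot_encode(education, order))   # [[0, 1, 0], [1, 0, 0], [0, 0, 1]]
```

Ordinal encoding keeps the ranking in a single column, while one-hot encoding avoids implying any order between categories.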

🗂 Default Sample Dataset

If no DataFrame is provided, the processor loads a built-in heart.csv dataset:

from GeeseTools import Goose  # assumed import path; check it against your installed version

obj = Goose()  # Uses sample heart dataset

# Apply full preprocessing pipeline
X_train, X_test, y_train, y_test = obj.pre_process()

📁 Project Structure

src/
│
├── Goose/
│   ├── Goose.py                # Main class
│   ├── mixins/                 # Modular preprocessing logic
│   ├── data/heart.csv          # Default dataset
│   ├── datasets.py             # Heart dataset loader
│   └── __init__.py

⚙️ Requirements

  • Python 3.9–3.11
  • pandas
  • scikit-learn
  • imbalanced-learn
  • scipy
  • ipython
  • openpyxl
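
If you want to check the interpreter against the supported range before installing, a quick sketch like the following works. It assumes both ends of the 3.9–3.11 range are inclusive, and the helper name `is_supported` is made up for this example.

```python
import sys


def is_supported(version_info=sys.version_info):
    """Return True for Python 3.9 through 3.11, treating both ends as inclusive."""
    return (3, 9) <= tuple(version_info[:2]) <= (3, 11)


print(is_supported())
```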

📜 License

MIT © Abhijeet
You're free to use, modify, and distribute this project with proper attribution.


✨ Contributions Welcome

Want to add new mixins or support more file types? Fork it, branch it, push it, and let’s build together!
