DucksTools

Modular and Extensible Data Preprocessing Library for Machine Learning

DucksTools is a plug-and-play, mixin-based Python library that streamlines the preprocessing of tabular datasets for machine learning tasks. Whether you're cleaning messy data, encoding categories, transforming skewed distributions, or scaling features, this package has you covered.


Features

  • Handle missing data
  • Convert object columns to numeric
  • Identify feature types (categorical, ordinal, nominal, etc.)
  • Encode nominal and ordinal features
  • Transform skewed and heavy-tailed features
  • Scale features with standard or power transformations
  • Train-test split with optional oversampling
  • Transformation logs for transparency and reproducibility
  • Built using Mixins for modular extension
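To illustrate the mixin-based design mentioned in the last bullet, here is a minimal, standard-library-only sketch. It is not DucksTools source code: `LogMixin`, `ImputeMixin`, `ScaleMixin`, and `Preprocessor` are hypothetical names invented for this example, and the data is a toy list of dicts rather than a DataFrame. It only shows how small single-purpose mixins can be composed into one class that also keeps a transformation log.

```python
class LogMixin:
    """Hypothetical mixin: records each step applied to the data."""

    def log_step(self, name, detail):
        # Create the log lazily so the mixin needs no __init__ of its own.
        if not hasattr(self, "steps"):
            self.steps = []
        self.steps.append({"step": name, "detail": detail})


class ImputeMixin:
    """Hypothetical mixin: fills missing values (None) with the column mean."""

    def impute_mean(self, column):
        values = [row[column] for row in self.rows if row[column] is not None]
        mean = sum(values) / len(values)
        for row in self.rows:
            if row[column] is None:
                row[column] = mean
        self.log_step("impute_mean", f"{column} -> {mean}")
        return self


class ScaleMixin:
    """Hypothetical mixin: min-max scales a column to the range [0, 1]."""

    def scale_min_max(self, column):
        values = [row[column] for row in self.rows]
        low, high = min(values), max(values)
        span = (high - low) or 1  # avoid division by zero for constant columns
        for row in self.rows:
            row[column] = (row[column] - low) / span
        self.log_step("scale_min_max", f"{column} in [{low}, {high}]")
        return self


class Preprocessor(ImputeMixin, ScaleMixin, LogMixin):
    """Composes the mixins; each one contributes a single capability."""

    def __init__(self, rows):
        self.rows = [dict(row) for row in rows]  # copy so the input is untouched


pre = Preprocessor([{"age": 20}, {"age": None}, {"age": 40}])
pre.impute_mean("age").scale_min_max("age")

ages = [row["age"] for row in pre.rows]
step_names = [entry["step"] for entry in pre.steps]
print(ages)
print(step_names)
```

Adding a new capability in this style means writing one more mixin class and listing it in the bases of the composed class; the existing mixins stay unchanged.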

Installation

You can install the package directly from PyPI:

pip install DucksTools

Or, after building your wheel file (.whl) from the source:

pip install dist/DucksTools-0.1.8-py3-none-any.whl

Or install directly in editable mode (for development):

pip install -e .

Usage

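import pandas as pd

# Assumes the package exports the DucksTools class at the top level;
# a module object itself cannot be called.
from DucksTools import DucksTools

# Load your own dataset (placeholder path)
df = pd.read_csv('your_data.csv')

# Instantiate with a dataset
obj = DucksTools(
    dataframe=df,
    target_variable='target',
    ordinal_features=['education_level'],
    ordinal_categories=[['Low', 'Medium', 'High']],
    use_one_hot_encoding=True
)

# Apply full preprocessing pipeline
X_train, X_test, y_train, y_test = obj.pre_process()

# Access logs
print(obj.transformation_log_df)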
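To make the `ordinal_categories` and `use_one_hot_encoding` arguments more concrete, the sketch below shows what ordinal and one-hot encoding do in general, using plain Python. It is an illustration only and assumes nothing about DucksTools internals: `ordinal_encode` and `one_hot_encode` are hypothetical helpers, and the `colors` feature is invented for the example. The ordinal mapping follows the common convention of using each category's position in the ordered list.

```python
def ordinal_encode(values, categories):
    """Map each value to its index in the ordered category list."""
    position = {category: index for index, category in enumerate(categories)}
    return [position[value] for value in values]


def one_hot_encode(values):
    """Return one 0/1 indicator column per distinct value, sorted by name."""
    categories = sorted(set(values))
    return {
        category: [int(value == category) for value in values]
        for category in categories
    }


# Ordinal feature: the order Low < Medium < High carries meaning.
education = ["Medium", "Low", "High", "Low"]
encoded_education = ordinal_encode(education, ["Low", "Medium", "High"])

# Nominal feature (hypothetical): no order, so each value gets its own column.
colors = ["red", "blue", "red"]
encoded_colors = one_hot_encode(colors)

print(encoded_education)
print(encoded_colors)
```

Ordinal encoding keeps a single column and preserves the ranking, while one-hot encoding avoids implying an order at the cost of one column per category.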

Default Sample Dataset

If no DataFrame is provided, the processor loads a built-in heart.csv dataset:

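# Assumes the package exports the DucksTools class at the top level
from DucksTools import DucksTools

obj = DucksTools()  # Uses sample heart dataset

# Apply full preprocessing pipeline
X_train, X_test, y_train, y_test = obj.pre_process()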
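The call to `pre_process()` returns train and test partitions, and the feature list mentions optional oversampling. As a rough sketch of those two ideas (not the library's implementation), the snippet below splits a toy dataset and then balances the training part by randomly duplicating minority-class rows. This is simple random oversampling, which is a different technique from SMOTE; `train_test_split` and `oversample` here are hypothetical standard-library helpers, not the scikit-learn or imbalanced-learn functions.

```python
import random
from collections import Counter


def train_test_split(rows, test_fraction=0.25, seed=0):
    """Shuffle a copy of the rows and cut off the first part as the test set."""
    rng = random.Random(seed)
    shuffled = rows[:]
    rng.shuffle(shuffled)
    cut = int(len(shuffled) * test_fraction)
    return shuffled[cut:], shuffled[:cut]


def oversample(rows, target_key, seed=0):
    """Duplicate random minority-class rows until every class has equal size."""
    rng = random.Random(seed)
    by_class = {}
    for row in rows:
        by_class.setdefault(row[target_key], []).append(row)
    largest = max(len(group) for group in by_class.values())
    balanced = []
    for group in by_class.values():
        balanced.extend(group)
        balanced.extend(rng.choices(group, k=largest - len(group)))
    return balanced


# Toy imbalanced dataset: 5 positive rows out of 20.
rows = [{"x": i, "target": int(i % 4 == 0)} for i in range(20)]

train, test = train_test_split(rows, test_fraction=0.25, seed=0)
balanced_train = oversample(train, "target", seed=0)

class_counts = Counter(row["target"] for row in balanced_train)
print(len(train), len(test), dict(class_counts))
```

Only the training partition is oversampled in this sketch, so the test set keeps the original class balance and duplicated rows cannot leak into evaluation.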

Project Structure

DucksTools/
├── data/                            # Contains bundled datasets
│   ├── heart.csv                    # Sample dataset (CSV format)
│   └── __init__.py                  # Makes 'data' a subpackage
│
├── DucksTools.py                    # Core toolkit initializer or controller
├── datasets.py                      # Dataset loading utilities
├── display_mixin.py                 # Display-related mixin
├── drop_features_mixin.py           # Drop unwanted features
├── drop_records_mixin.py            # Drop records based on rules
├── encode_mixin.py                  # Encoding (label, one-hot)
├── feature_target_split_mixin.py    # Split into features & target
├── feature_type_mixin.py            # Feature type detection
├── impute_features_mixin.py         # Fill missing values
├── missing_data_summary_mixin.py    # Summary of missing data
├── oversample_mixin.py              # Oversampling (e.g., SMOTE)
├── pre_process_mixin.py             # Complete preprocessing pipeline
├── sample_data_mixin.py             # Random sampling utilities
├── scale_mixin.py                   # Scaling methods
├── split_dataframe_mixin.py         # Split dataframe columns
├── to_numeric_mixin.py              # Convert to numeric
├── transform_mixin.py               # Feature transformations
├── unique_value_summary_mixin.py    # Unique value summary
└── __init__.py                      # Initializes DucksTools package

Requirements

  • Python 3.9–3.11
  • pandas
  • scikit-learn
  • imbalanced-learn
  • scipy
  • ipython
  • openpyxl
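If you want to confirm that an interpreter falls inside the Python range listed above before installing, a small check like the following can help. It is a convenience sketch, not part of DucksTools; `is_supported` is a hypothetical helper.

```python
import sys

SUPPORTED_MIN = (3, 9)
SUPPORTED_MAX = (3, 11)


def is_supported(version_info=None):
    """Return True when the major.minor version is within the listed range."""
    major_minor = tuple((version_info or sys.version_info)[:2])
    return SUPPORTED_MIN <= major_minor <= SUPPORTED_MAX


if not is_supported():
    current = f"{sys.version_info[0]}.{sys.version_info[1]}"
    print(f"Warning: Python {current} is outside the listed range 3.9 to 3.11")
```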

License

MIT © Abhijeet
You're free to use, modify, and distribute this project with proper attribution.


Contributions Welcome

Want to add new mixins or support more file types? Fork it, branch it, push it, and let's build together!
