Machine learning library for streamlined model building and fast real-time preprocessing
🐊 Gators
Lightning-fast data preprocessing and feature engineering for machine learning
What is Gators?
Gators is a lightning-fast data preprocessing and feature engineering library built on top of Polars, designed to streamline your entire ML workflow from raw data to production-ready models. It leverages Polars’ multi-core processing to keep preprocessing fast.
Built by the PSP Data Team at PayPal, Gators makes data preprocessing and feature engineering both faster and simpler.
⚡ Key Features
- 🚀 Lightning Fast: Built on Polars for multi-core parallel processing
- 🔄 Unified API: Consistent sklearn-style `.fit()` and `.transform()` interface
- 📦 Production Ready: Deploy the same Python code from notebook to production
- 🎯 Comprehensive: 75+ preprocessing transformers spanning cleaning, encoding, feature generation, imputation, discretization, and scaling
- 🔗 Pipeline Support: Chain transformers seamlessly with the Pipeline class
- 🎓 Easy to Learn: If you know sklearn, you already know Gators
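The sklearn-style convention mentioned above can be illustrated with a minimal sketch. `MeanCenterer` is a hypothetical class written for this example and is not part of Gators; it uses plain Python lists rather than Polars so that it runs without any dependencies. The point is only that `.fit()` learns state from training data and `.transform()` reuses that state on any data.

```python
from statistics import mean


class MeanCenterer:
    """Hypothetical transformer (not a Gators class) following fit/transform."""

    def fit(self, values):
        # Learn state from the training data only.
        self.mean_ = mean(values)
        return self  # returning self allows obj.fit(x).transform(x)

    def transform(self, values):
        # Apply the learned state to any data, including unseen rows.
        return [v - self.mean_ for v in values]


train = [1.0, 2.0, 3.0]
centerer = MeanCenterer().fit(train)
centered_train = centerer.transform(train)
centered_new = centerer.transform([10.0])  # uses the mean learned from train
```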
🛠️ What Can Gators Do?
🧹 Data Cleaning
Clean and prepare your data with powerful transformers:
- `CastColumns` - Convert column data types
- `CorrelationFilter` - Remove highly correlated features
- `DropColumns` - Remove specified columns
- `DropConstantColumns` - Remove columns with constant values
- `DropDuplicateColumns` - Remove duplicate columns
- `DropDuplicateRows` - Remove duplicate rows
- `DropHighNaNRatio` - Remove columns with high missing value ratio
- `DropLowCardinality` - Remove low cardinality columns
- `HighCardinalityFilter` - Filter high cardinality features
- `OutlierFilter` - Detect and filter outliers
- `RenameColumns` - Rename columns
- `Replace` - Replace values in data
- `VarianceFilter` - Remove low variance features
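To make the idea of variance-based filtering concrete, here is a small stdlib-only sketch. It is not Gators' implementation: the helper name `low_variance_columns`, the use of population variance, and the "less than or equal to the threshold" rule are all assumptions made for illustration.

```python
from statistics import pvariance


def low_variance_columns(columns, threshold):
    """Return names of columns whose population variance is <= threshold."""
    return [
        name for name, values in columns.items()
        if pvariance(values) <= threshold
    ]


data = {
    "amount": [10.0, 12.0, 50.0, 8.0],
    "flag": [1.0, 1.0, 1.0, 1.0],        # constant column, variance 0
    "noise": [0.50, 0.51, 0.50, 0.51],   # nearly constant
}

to_drop = low_variance_columns(data, threshold=0.01)
kept = {name: values for name, values in data.items() if name not in to_drop}
```

A constant column is the limiting case of a low-variance column, which is why it is caught by the same rule here.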
🔢 Categorical Encoding
Transform categorical variables with advanced encoding techniques:
- `BinaryEncoder` - Binary representation encoding
- `CatBoostEncoder` - CatBoost-style encoding
- `CountEncoder` - Frequency-based encoding
- `LeaveOneOutEncoder` - Leave-one-out encoding
- `OneHotEncoder` - Classic one-hot encoding
- `OrdinalEncoder` - Order-based encoding
- `RareCategoryEncoder` - Handle rare categories intelligently
- `TargetEncoder` - Target-based encoding for supervised learning
- `WOEEncoder` - Weight of Evidence encoding
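As an illustration of one technique from this list, the sketch below shows the general idea behind leave-one-out target encoding: each row is encoded with the mean target of the other rows in its category, which reduces target leakage compared with a plain per-category mean. The function `leave_one_out_encode` and its global-mean fallback for singleton categories are assumptions for this example, not Gators' actual behavior.

```python
from collections import defaultdict


def leave_one_out_encode(categories, targets):
    """Encode each row with the mean target of the other rows in its category."""
    sums = defaultdict(float)
    counts = defaultdict(int)
    for cat, y in zip(categories, targets):
        sums[cat] += y
        counts[cat] += 1

    global_mean = sum(targets) / len(targets)
    encoded = []
    for cat, y in zip(categories, targets):
        if counts[cat] > 1:
            # Exclude the current row's own target from the category mean.
            encoded.append((sums[cat] - y) / (counts[cat] - 1))
        else:
            # A singleton category has no other rows: fall back to the global mean.
            encoded.append(global_mean)
    return encoded


cats = ["a", "a", "a", "b", "b", "c"]
ys = [1, 0, 1, 0, 0, 1]
loo = leave_one_out_encode(cats, ys)
```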
🎯 Feature Generation - Numeric
Create powerful numeric features:
- `ComparisonFeatures` - Generate comparison features
- `ConditionFeatures` - Create conditional features
- `DistanceFeatures` - Calculate distance features
- `GroupLagFeatures` - Generate lag features by group
- `GroupScalingFeatures` - Scale features within groups
- `GroupStatisticsFeatures` - Calculate group statistics
- `IsNull` - Generate null indicator features
- `MathFeatures` - Apply mathematical operations (add, subtract, multiply, divide)
- `PlanRotationFeatures` - Rotate features in feature space
- `PolynomialFeatures` - Generate polynomial combinations
- `RatioFeatures` - Create ratio features between columns
- `RowStatisticsFeatures` - Calculate row-wise statistics
- `RuleFeatures` - Apply custom business rules
- `ScalarMathFeatures` - Apply scalar operations
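A lag feature by group can be sketched as follows. This is a conceptual illustration only: `group_lag` is a hypothetical helper, not the Gators API, and it assumes the rows are already sorted in time order.

```python
def group_lag(groups, values, lag=1):
    """For each row, return the value `lag` rows earlier in the same group.

    Rows are assumed to be in time order; None marks rows with no history.
    """
    history = {}
    lagged = []
    for g, v in zip(groups, values):
        seen = history.setdefault(g, [])
        lagged.append(seen[-lag] if len(seen) >= lag else None)
        seen.append(v)
    return lagged


customer = ["u1", "u2", "u1", "u1", "u2"]
amount = [10, 7, 25, 5, 9]

prev_amount = group_lag(customer, amount, lag=1)
prev2_amount = group_lag(customer, amount, lag=2)
```

The lag never crosses group boundaries, so the first transaction of each customer has no previous amount.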
📝 Feature Generation - String
Extract insights from text data:
- `CharacterStatistics` - Extract character-level statistics
- `CombineFeatures` - Combine string features
- `Contains` - Check if string contains pattern
- `Endswith` - Check if string ends with pattern
- `ExtractSubstring` - Extract substring from text
- `InteractionFeatures` - Generate string interaction features
- `Length` - Calculate string length
- `Lower` - Convert text to lowercase
- `NGram` - Generate n-gram features
- `Occurrences` - Count pattern occurrences
- `PatternDetector` - Detect patterns in text
- `Split` - Split strings
- `SplitExtract` - Split and extract from strings
- `Startswith` - Check if string starts with pattern
- `Upper` - Convert text to uppercase
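For readers unfamiliar with n-grams, the following sketch shows character-level n-grams in plain Python. Whether the `NGram` transformer works at character or word level is not stated here, so treat `char_ngrams` as a hypothetical illustration of the concept rather than a description of Gators.

```python
def char_ngrams(text, n):
    """Return all contiguous character substrings of length n."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]


bigrams = char_ngrams("gator", 2)
trigrams = char_ngrams("gator", 3)
too_short = char_ngrams("ab", 3)  # text shorter than n yields no n-grams
```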
📅 Feature Generation - DateTime
Unlock temporal patterns:
- `BusinessTimeFeatures` - Business hours/days calculations
- `CyclicFeatures` - Circular encoding for cyclical time features
- `DiffFeatures` - Calculate time differences
- `DurationToDatetime` - Convert duration to datetime
- `HolidayFeatures` - Detect and encode holidays
- `OrdinalFeatures` - Extract year, month, day, hour, etc.
- `TimeBinFeatures` - Bin times into categories
- `TimeWindowFeatures` - Generate time window features
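Circular encoding is easiest to see with a small example. The sketch below maps an hour of the day onto the unit circle with sine and cosine, so that 23:00 ends up close to 00:00 instead of 23 units away. The helper `cyclic_encode` is hypothetical and independent of how `CyclicFeatures` is implemented.

```python
import math


def cyclic_encode(value, period):
    """Map a cyclical value (e.g. hour 0-23) to a point on the unit circle."""
    angle = 2 * math.pi * value / period
    return math.sin(angle), math.cos(angle)


h0 = cyclic_encode(0, 24)
h12 = cyclic_encode(12, 24)
h23 = cyclic_encode(23, 24)

# Distance in the encoded space reflects distance on the clock face.
dist_23_0 = math.dist(h23, h0)   # one hour apart: small
dist_12_0 = math.dist(h12, h0)   # twelve hours apart: largest possible
```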
🔄 Missing Value Imputation
Handle missing data intelligently:
- `BooleanImputer` - Impute boolean columns
- `GroupByImputer` - Group-based imputation strategies
- `NumericImputer` - Impute numeric columns (mean, median, mode, constant)
- `StringImputer` - Impute string columns (mode, constant)
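The median strategy can be sketched in a few lines. This is a conceptual example with hypothetical helpers (`fit_median`, `impute`), using `None` for missing values. It shows the usual practice of learning the fill value on training data and reusing it on new data.

```python
from statistics import median


def fit_median(values):
    """Learn the fill value from the non-missing training values."""
    return median(v for v in values if v is not None)


def impute(values, fill):
    """Replace missing entries with the previously learned fill value."""
    return [fill if v is None else v for v in values]


train = [3.0, None, 1.0, 10.0]
fill = fit_median(train)             # median of 1.0, 3.0, 10.0
train_imputed = impute(train, fill)
new_imputed = impute([None, 4.0], fill)  # new data reuses the training median
```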
📊 Discretization
Convert continuous variables into bins:
- `CustomDiscretizer` - Custom bin edges
- `EqualLengthDiscretizer` - Equal-width binning
- `EqualSizeDiscretizer` - Equal-frequency binning
- `GeometricDiscretizer` - Geometric progression binning
- `KMeansDiscretizer` - K-means clustering-based binning
- `QuantileDiscretizer` - Quantile-based binning
- `TreeBasedDiscretizer` - Decision tree-based binning
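The difference between equal-width and equal-frequency binning matters most on skewed data. The sketch below compares the two on a small sample with one outlier. The helper names and the edge conventions (interior edges only, values equal to an edge go to the upper bin) are assumptions for illustration and may differ from Gators' discretizers.

```python
import bisect
from collections import Counter


def equal_width_edges(values, n_bins):
    """Interior edges that split [min, max] into n_bins equal-width intervals."""
    lo, hi = min(values), max(values)
    step = (hi - lo) / n_bins
    return [lo + step * i for i in range(1, n_bins)]


def equal_frequency_edges(values, n_bins):
    """Interior edges chosen so each bin holds roughly the same number of rows."""
    ordered = sorted(values)
    return [ordered[len(ordered) * i // n_bins] for i in range(1, n_bins)]


def assign_bins(values, edges):
    """Bin index per value; a value equal to an edge goes to the upper bin."""
    return [bisect.bisect_right(edges, v) for v in values]


values = [1, 2, 3, 4, 5, 6, 7, 100]  # skewed: one large outlier

width_edges = equal_width_edges(values, 4)
freq_edges = equal_frequency_edges(values, 4)

width_counts = Counter(assign_bins(values, width_edges))
freq_counts = Counter(assign_bins(values, freq_edges))
```

With equal-width bins the outlier stretches the range, so seven of the eight values fall into the first bin; equal-frequency bins place two values in each bin.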
⚖️ Feature Scaling
Normalize your features:
- `ArcsinSquarerootScaler` - Arcsine square root transformation
- `ArcsinhScaler` - Inverse hyperbolic sine transformation
- `BoxCox` - Box-Cox power transformation
- `LogScaler` - Logarithmic scaling
- `MinmaxScaler` - Min-max normalization
- `PowerScaler` - Power transformation
- `StandardScaler` - Standardization (z-score normalization)
- `YeoJonhson` - Yeo-Johnson power transformation
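Three of the transformations above can be written out directly from their textbook definitions. The sketch below is stdlib-only and is not Gators code; in particular, the choice of population standard deviation in `standardize` is an assumption.

```python
import math
from statistics import mean, pstdev


def standardize(values):
    """z-score: subtract the mean, divide by the (population) standard deviation."""
    mu, sigma = mean(values), pstdev(values)
    return [(v - mu) / sigma for v in values]


def minmax(values):
    """Rescale linearly so the minimum maps to 0 and the maximum to 1."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]


values = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]  # mean 5, population std 2

z = standardize(values)
scaled = minmax(values)

# asinh behaves like a log for large magnitudes but is defined at 0 and for negatives.
compressed = [math.asinh(v) for v in [-1000.0, 0.0, 1000.0]]
```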
🔗 Pipeline
Chain all transformers together:
- `Pipeline` - sklearn-compatible pipeline for chaining transformers
🚀 Quick Start
import polars as pl
from gators.data_cleaning import DropHighNaNRatio, VarianceFilter
from gators.encoders import OneHotEncoder
from gators.imputers import NumericImputer
from gators.scalers import StandardScaler
from gators.pipeline import Pipeline
# Load your data
X = pl.read_csv("data.csv")
# Build a preprocessing pipeline
pipeline = Pipeline([
    ('drop_nan', DropHighNaNRatio(threshold=0.5)),
    ('impute', NumericImputer(strategy='median')),
    ('variance', VarianceFilter(threshold=0.01)),
    ('encode', OneHotEncoder()),
    ('scale', StandardScaler())
])
# Fit and transform
X_processed = pipeline.fit_transform(X)
# Deploy the same pipeline in production!
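What "deploy the same pipeline" means in practice can be sketched with a toy pipeline. The classes below (`AddConstant`, `MaxScaler`, `MiniPipeline`) are hypothetical and written in plain Python; they only mimic the general sklearn-style chaining pattern and say nothing about Gators' internals. Each step is fitted on the output of the previous step during training, and at serving time only `transform` is called, so new rows are processed with the parameters learned earlier.

```python
class AddConstant:
    """Stateless step: nothing to learn in fit."""

    def __init__(self, constant):
        self.constant = constant

    def fit(self, X):
        return self

    def transform(self, X):
        return [x + self.constant for x in X]


class MaxScaler:
    """Stateful step: remembers the training maximum."""

    def fit(self, X):
        self.max_ = max(X)
        return self

    def transform(self, X):
        return [x / self.max_ for x in X]


class MiniPipeline:
    """Hypothetical stand-in for a pipeline: a list of (name, step) pairs."""

    def __init__(self, steps):
        self.steps = steps

    def fit_transform(self, X):
        for _, step in self.steps:
            X = step.fit(X).transform(X)  # fit each step on the previous output
        return X

    def transform(self, X):
        for _, step in self.steps:
            X = step.transform(X)  # reuse learned state, no refitting
        return X


pipe = MiniPipeline([("shift", AddConstant(1.0)), ("scale", MaxScaler())])
train_out = pipe.fit_transform([1.0, 3.0])   # shifted to [2.0, 4.0], then divided by 4.0
serve_out = pipe.transform([7.0])            # (7.0 + 1.0) / 4.0, using the training max
```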
📦 Installation
pip install gators
Or install from source:
git clone https://github.com/paypal/gators.git
cd gators
pip install -e .
📚 Documentation
For detailed documentation, tutorials, and API reference, visit:
https://paypal.github.io/gators/
🎯 Use Cases
Gators is perfect for:
- Fraud Detection - Extensive feature engineering for anomaly detection
- Risk Modeling - Create powerful predictive features
- Customer Analytics - Transform complex customer data
- Time Series - Rich datetime feature engineering
- NLP Tasks - String feature extraction and encoding
🤝 Contributing
We welcome contributions! Please check out our contributing guidelines.
📄 License
Gators is licensed under the Apache License 2.0. See the LICENSE file for details.
🙏 Credits
Developed by the PSP Data Team at PayPal.
Built by data scientists, for data scientists
File details
Details for the file gators-1.0.4-py3-none-any.whl.
File metadata
- Download URL: gators-1.0.4-py3-none-any.whl
- Upload date:
- Size: 202.2 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.14.2
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | c8889dfb6eecd722c7ff134a4e1fddf29ea504bf17b8b7ab2ceb8bc9caaae3cb |
| MD5 | 33fa83588bbec1f817aae61499e1ba9d |
| BLAKE2b-256 | 53c4c5a29dd1fb86d8a50100188d273d0e69b6a05696f7d42a37d7dc2e2aedea |