Machine learning library for streamlined model building and fast real-time preprocessing

🐊 Gators

Lightning-fast data preprocessing and feature engineering for machine learning

What is Gators?

Gators is a fast data preprocessing and feature engineering library built on top of Polars, designed to streamline the ML workflow from raw data to production-ready models. It relies on Polars’ multi-core processing for speed.

Built by the PSP Data Team at PayPal, Gators makes data preprocessing and feature engineering both faster and simpler.

⚡ Key Features

  • 🚀 Lightning Fast: Built on Polars for multi-core parallel processing
  • 🔄 Unified API: Consistent sklearn-style .fit() and .transform() interface
  • 📦 Production Ready: Deploy the same Python code from notebook to production
  • 🎯 Comprehensive: 75+ preprocessing transformers covering cleaning, encoding, feature generation, imputation, discretization, and scaling
  • 🔗 Pipeline Support: Chain transformers seamlessly with the Pipeline class
  • 🎓 Easy to Learn: If you know sklearn, you already know Gators
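
The fit/transform pattern mentioned above can be sketched in plain Python. The classes below (MeanCenterer, Doubler, ToyPipeline) are hypothetical and are not part of the Gators API; they only illustrate how a step learns state in fit(), applies it in transform(), and how a pipeline chains steps.

```python
# Toy illustration of the fit/transform pattern.
# These classes are hypothetical and NOT part of the Gators API.


class MeanCenterer:
    """Learns a mean in fit() and subtracts it in transform()."""

    def fit(self, values):
        self.mean_ = sum(values) / len(values)
        return self  # returning self allows fit(...).transform(...)

    def transform(self, values):
        return [v - self.mean_ for v in values]


class Doubler:
    """Stateless step: fit() has nothing to learn."""

    def fit(self, values):
        return self

    def transform(self, values):
        return [2 * v for v in values]


class ToyPipeline:
    """Chains (name, transformer) pairs, feeding each step the previous output."""

    def __init__(self, steps):
        self.steps = steps

    def fit_transform(self, values):
        for _, step in self.steps:
            values = step.fit(values).transform(values)
        return values

    def transform(self, values):
        for _, step in self.steps:
            values = step.transform(values)
        return values


pipe = ToyPipeline([("center", MeanCenterer()), ("double", Doubler())])
train_out = pipe.fit_transform([1.0, 2.0, 3.0])
# New data reuses the mean learned on the training data.
new_out = pipe.transform([4.0])
```

The point of the split is that transform() on new data reuses whatever fit() learned on the training data, which is what allows the same object to be used in a notebook and in production.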

🛠️ What Can Gators Do?

🧹 Data Cleaning

Clean and prepare your data with powerful transformers:

  • CastColumns - Convert column data types
  • CorrelationFilter - Remove highly correlated features
  • DropColumns - Remove specified columns
  • DropConstantColumns - Remove columns with constant values
  • DropDuplicateColumns - Remove duplicate columns
  • DropDuplicateRows - Remove duplicate rows
  • DropHighNaNRatio - Remove columns with high missing value ratio
  • DropLowCardinality - Remove low cardinality columns
  • HighCardinalityFilter - Filter high cardinality features
  • OutlierFilter - Detect and filter outliers
  • RenameColumns - Rename columns
  • Replace - Replace values in data
  • VarianceFilter - Remove low variance features
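
As a rough illustration of what filters such as DropHighNaNRatio and DropConstantColumns compute, here is a plain-Python sketch. The function names, the dict-of-lists data layout, and the inclusive threshold are assumptions made for the example; they do not describe the Gators implementation.

```python
# Plain-Python sketch, not the Gators implementation.
# `columns` maps column names to lists; None marks a missing value.


def drop_high_nan_ratio(columns, threshold):
    """Keep columns whose share of missing values is at most `threshold`.

    Whether the real transformer treats the threshold as inclusive
    is an assumption here.
    """
    kept = {}
    for name, values in columns.items():
        ratio = sum(v is None for v in values) / len(values)
        if ratio <= threshold:
            kept[name] = values
    return kept


def drop_constant_columns(columns):
    """Keep columns that have more than one distinct non-missing value."""
    kept = {}
    for name, values in columns.items():
        distinct = {v for v in values if v is not None}
        if len(distinct) > 1:
            kept[name] = values
    return kept


data = {
    "amount": [10.0, 12.5, None, 8.0],   # 25% missing, varies
    "country": ["US", "US", "US", "US"],  # constant
    "notes": [None, None, None, "x"],     # 75% missing
}
cleaned = drop_constant_columns(drop_high_nan_ratio(data, threshold=0.5))
```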

🔢 Categorical Encoding

Transform categorical variables with advanced encoding techniques:

  • BinaryEncoder - Binary representation encoding
  • CatBoostEncoder - CatBoost-style encoding
  • CountEncoder - Frequency-based encoding
  • LeaveOneOutEncoder - Leave-one-out encoding
  • OneHotEncoder - Classic one-hot encoding
  • OrdinalEncoder - Order-based encoding
  • RareCategoryEncoder - Handle rare categories intelligently
  • TargetEncoder - Target-based encoding for supervised learning
  • WOEEncoder - Weight of Evidence encoding
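
To make target-based encoding concrete, the sketch below replaces each category with the mean of the target observed for it, falling back to the global mean for unseen categories. It is a minimal plain-Python version without smoothing; the function names are hypothetical and the fallback rule is an assumption, not a statement of how TargetEncoder behaves.

```python
from collections import defaultdict

# Minimal target (mean) encoding sketch, not the Gators implementation.


def fit_target_encoding(categories, targets):
    """Return {category: mean target} learned from training data."""
    sums = defaultdict(float)
    counts = defaultdict(int)
    for category, target in zip(categories, targets):
        sums[category] += target
        counts[category] += 1
    return {category: sums[category] / counts[category] for category in sums}


def apply_target_encoding(categories, mapping, default):
    """Encode categories; unseen ones get `default` (an assumption)."""
    return [mapping.get(category, default) for category in categories]


cats = ["a", "b", "a", "c", "b", "a"]
y = [1, 0, 0, 1, 1, 1]

mapping = fit_target_encoding(cats, y)
global_mean = sum(y) / len(y)
encoded = apply_target_encoding(["a", "b", "z"], mapping, default=global_mean)
```

Because the mapping is learned from the target, it should be fitted on training data only and then applied to validation or production data.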

🎯 Feature Generation - Numeric

Create powerful numeric features:

  • ComparisonFeatures - Generate comparison features
  • ConditionFeatures - Create conditional features
  • DistanceFeatures - Calculate distance features
  • GroupLagFeatures - Generate lag features by group
  • GroupScalingFeatures - Scale features within groups
  • GroupStatisticsFeatures - Calculate group statistics
  • IsNull - Generate null indicator features
  • MathFeatures - Apply mathematical operations (add, subtract, multiply, divide)
  • PlanRotationFeatures - Rotate features in feature space
  • PolynomialFeatures - Generate polynomial combinations
  • RatioFeatures - Create ratio features between columns
  • RowStatisticsFeatures - Calculate row-wise statistics
  • RuleFeatures - Apply custom business rules
  • ScalarMathFeatures - Apply scalar operations
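
Two of the ideas in this list, group statistics and ratios, combine naturally: compute a per-group mean, then express each row relative to it. The plain-Python sketch below shows the arithmetic only; the function names and the choice to return None on division by zero are assumptions, not Gators behavior.

```python
from collections import defaultdict

# Arithmetic sketch only, not the Gators implementation.


def group_mean_feature(groups, values):
    """For each row, return the mean of `values` within that row's group."""
    totals = defaultdict(float)
    counts = defaultdict(int)
    for group, value in zip(groups, values):
        totals[group] += value
        counts[group] += 1
    return [totals[group] / counts[group] for group in groups]


def ratio_feature(numerators, denominators):
    """Row-wise ratio; None where the denominator is zero (an assumption)."""
    return [
        n / d if d != 0 else None
        for n, d in zip(numerators, denominators)
    ]


groups = ["x", "x", "y"]
amounts = [10.0, 30.0, 5.0]

group_means = group_mean_feature(groups, amounts)
relative_amounts = ratio_feature(amounts, group_means)
```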

📝 Feature Generation - String

Extract insights from text data:

  • CharacterStatistics - Extract character-level statistics
  • CombineFeatures - Combine string features
  • Contains - Check if string contains pattern
  • Endswith - Check if string ends with pattern
  • ExtractSubstring - Extract substring from text
  • InteractionFeatures - Generate string interaction features
  • Length - Calculate string length
  • Lower - Convert text to lowercase
  • NGram - Generate n-gram features
  • Occurrences - Count pattern occurrences
  • PatternDetector - Detect patterns in text
  • Split - Split strings
  • SplitExtract - Split and extract from strings
  • Startswith - Check if string starts with pattern
  • Upper - Convert text to uppercase
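
For readers unfamiliar with n-gram features, here is a minimal character-level sketch in plain Python. Whether the NGram transformer works on characters or on words is not stated above, so the character-level choice and the function name are assumptions.

```python
# Character n-gram sketch, not the Gators implementation.


def char_ngrams(text, n):
    """Return all contiguous substrings of length `n`, in order."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]


bigrams = char_ngrams("gator", 2)
too_short = char_ngrams("a", 2)  # strings shorter than n yield no n-grams
```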

📅 Feature Generation - DateTime

Unlock temporal patterns:

  • BusinessTimeFeatures - Business hours/days calculations
  • CyclicFeatures - Circular encoding for cyclical time features
  • DiffFeatures - Calculate time differences
  • DurationToDatetime - Convert duration to datetime
  • HolidayFeatures - Detect and encode holidays
  • OrdinalFeatures - Extract year, month, day, hour, etc.
  • TimeBinFeatures - Bin times into categories
  • TimeWindowFeatures - Generate time window features
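
Circular encoding maps a periodic value such as the hour of day onto a sine/cosine pair, so that 23:00 and 00:00 end up close together instead of 23 units apart. The sketch below shows the standard formula in plain Python; the function name is hypothetical and this is not the CyclicFeatures implementation.

```python
import math

# Standard sine/cosine encoding of a periodic value.
# Not the Gators implementation.


def cyclic_encode(value, period):
    """Map `value` in [0, period) to a point on the unit circle."""
    angle = 2 * math.pi * value / period
    return math.sin(angle), math.cos(angle)


midnight = cyclic_encode(0, 24)
late_evening = cyclic_encode(23, 24)
noon = cyclic_encode(12, 24)

# On the circle, 23:00 is much closer to 00:00 than 12:00 is.
gap_late = math.dist(midnight, late_evening)
gap_noon = math.dist(midnight, noon)
```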

🔄 Missing Value Imputation

Handle missing data intelligently:

  • BooleanImputer - Impute boolean columns
  • GroupByImputer - Group-based imputation strategies
  • NumericImputer - Impute numeric columns (mean, median, mode, constant)
  • StringImputer - Impute string columns (mode, constant)
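
A median imputer is a small, clear instance of the fit/transform split: the fill value is learned once from training data and then reused. The class below is a hypothetical plain-Python sketch, with None standing in for a missing value; it is not the NumericImputer implementation.

```python
import statistics

# Median imputation sketch, not the Gators implementation.


class MedianImputer:
    """Learns the median of observed values and fills None with it."""

    def fit(self, values):
        observed = [v for v in values if v is not None]
        self.fill_value_ = statistics.median(observed)
        return self

    def transform(self, values):
        return [self.fill_value_ if v is None else v for v in values]


train = [1.0, None, 3.0, 10.0]
imputer = MedianImputer().fit(train)

# New data is filled with the median learned from `train`, not its own.
filled = imputer.transform([None, 4.0])
```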

📊 Discretization

Convert continuous variables into bins:

  • CustomDiscretizer - Custom bin edges
  • EqualLengthDiscretizer - Equal-width binning
  • EqualSizeDiscretizer - Equal-frequency binning
  • GeometricDiscretizer - Geometric progression binning
  • KMeansDiscretizer - K-means clustering-based binning
  • QuantileDiscretizer - Quantile-based binning
  • TreeBasedDiscretizer - Decision tree-based binning
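
The difference between equal-width and equal-frequency binning is easiest to see on skewed data. The plain-Python sketch below computes interior bin edges both ways; the function names and the exact edge conventions (which side of an edge a value falls on, how ties are handled) are assumptions and may differ from the Gators discretizers.

```python
import bisect

# Binning sketch, not the Gators implementation.


def equal_width_edges(values, n_bins):
    """Interior edges that split [min, max] into bins of equal width."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_bins
    return [lo + i * width for i in range(1, n_bins)]


def equal_frequency_edges(values, n_bins):
    """Interior edges chosen so bins hold roughly equal numbers of rows."""
    ordered = sorted(values)
    return [ordered[len(ordered) * i // n_bins] for i in range(1, n_bins)]


def assign_bins(values, edges):
    """Bin index = number of edges less than or equal to the value."""
    return [bisect.bisect_right(edges, v) for v in values]


values = [1, 2, 3, 4, 100]  # one large outlier

width_bins = assign_bins(values, equal_width_edges(values, 2))
frequency_bins = assign_bins(values, equal_frequency_edges(values, 2))
```

With the outlier present, equal-width binning puts every other value in the first bin, while equal-frequency binning splits the rows more evenly.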

⚖️ Feature Scaling

Normalize your features:

  • ArcsinSquarerootScaler - Arcsine square root transformation
  • ArcsinhScaler - Inverse hyperbolic sine transformation
  • BoxCox - Box-Cox power transformation
  • LogScaler - Logarithmic scaling
  • MinmaxScaler - Min-max normalization
  • PowerScaler - Power transformation
  • StandardScaler - Standardization (z-score normalization)
  • YeoJonhson - Yeo-Johnson power transformation
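
For reference, the two most common scalers in this list reduce to short formulas: z = (x - mean) / std for standardization and (x - min) / (max - min) for min-max normalization. The plain-Python sketch below uses the population standard deviation, which is an assumption; the Gators StandardScaler may use a different convention.

```python
import statistics

# Scaling formulas in plain Python, not the Gators implementation.


def standardize(values, mean, std):
    """z-score: subtract the mean, divide by the standard deviation."""
    return [(v - mean) / std for v in values]


def min_max(values, lo, hi):
    """Rescale so that `lo` maps to 0 and `hi` maps to 1."""
    return [(v - lo) / (hi - lo) for v in values]


values = [2.0, 4.0, 6.0]

# Statistics are learned from the training values.
mean = statistics.fmean(values)
std = statistics.pstdev(values)  # population std (an assumption)

z_scores = standardize(values, mean, std)
normalized = min_max(values, min(values), max(values))
```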

🔗 Pipeline

Chain all transformers together:

  • Pipeline - sklearn-compatible pipeline for chaining transformers

🚀 Quick Start

import polars as pl
from gators.data_cleaning import DropHighNaNRatio, VarianceFilter
from gators.encoders import OneHotEncoder
from gators.imputers import NumericImputer
from gators.scalers import StandardScaler
from gators.pipeline import Pipeline

# Load your data
X = pl.read_csv("data.csv")

# Build a preprocessing pipeline
pipeline = Pipeline([
    ('drop_nan', DropHighNaNRatio(threshold=0.5)),
    ('impute', NumericImputer(strategy='median')),
    ('variance', VarianceFilter(threshold=0.01)),
    ('encode', OneHotEncoder()),
    ('scale', StandardScaler())
])

# Fit and transform
X_processed = pipeline.fit_transform(X)

# Deploy the same pipeline in production!

📦 Installation

pip install gators

Or install from source:

git clone https://github.com/paypal/gators.git
cd gators
pip install -e .

📚 Documentation

For detailed documentation, tutorials, and API reference, visit:

https://paypal.github.io/gators/

🎯 Use Cases

Gators is perfect for:

  • Fraud Detection - Extensive feature engineering for anomaly detection
  • Risk Modeling - Create powerful predictive features
  • Customer Analytics - Transform complex customer data
  • Time Series - Rich datetime feature engineering
  • NLP Tasks - String feature extraction and encoding

🤝 Contributing

We welcome contributions! Please check out our contributing guidelines.

📄 License

Gators is licensed under the Apache License 2.0. See the LICENSE file for details.

🙏 Credits

Developed by the PSP Data Team at PayPal.


Built by data scientists, for data scientists

Download files

Source Distributions

No source distribution files available for this release.

Built Distribution

gators-1.0.4-py3-none-any.whl (202.2 kB view details)

Uploaded Python 3

File details

Details for the file gators-1.0.4-py3-none-any.whl.

File metadata

  • Download URL: gators-1.0.4-py3-none-any.whl
  • Upload date:
  • Size: 202.2 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.2.0 CPython/3.14.2

File hashes

Hashes for gators-1.0.4-py3-none-any.whl
  • SHA256: c8889dfb6eecd722c7ff134a4e1fddf29ea504bf17b8b7ab2ceb8bc9caaae3cb
  • MD5: 33fa83588bbec1f817aae61499e1ba9d
  • BLAKE2b-256: 53c4c5a29dd1fb86d8a50100188d273d0e69b6a05696f7d42a37d7dc2e2aedea
