Machine learning library for streamlined model building and fast real-time preprocessing

Gators: A Lightning-Fast Data Preprocessing And Feature Engineering Python Library

📚 Full Documentation: https://paypal.github.io/gators/

What is Gators?

Gators is a library built on top of Polars, designed to streamline your entire ML workflow from raw data to production-ready models, leveraging Polars' blazing-fast multi-core processing.

Built by the PSP Data Team at PayPal, Gators makes data preprocessing and feature engineering both faster and simpler.

⚡ Key Features

  • 🚀 Lightning Fast: Built on Polars for multi-core parallel processing
  • 🔄 Unified API: Consistent sklearn-style .fit() and .transform() interface
  • 📦 Production Ready: Deploy the same Python code from notebook to production
  • 🎯 Comprehensive: 75+ preprocessing transformers covering cleaning, encoding, feature generation, imputation, discretization, and scaling
  • 🔗 Pipeline Support: Chain transformers seamlessly with the Pipeline class
  • 🎓 Easy to Learn: If you know sklearn, you already know Gators

🛠️ What Can Gators Do?

🧹 Data Cleaning

Clean and prepare your data with powerful transformers:

  • CastColumns - Convert column data types
  • CorrelationFilter - Remove highly correlated features
  • DropColumns - Remove specified columns
  • DropConstantColumns - Remove columns with constant values
  • DropDuplicateColumns - Remove duplicate columns
  • DropDuplicateRows - Remove duplicate rows
  • DropHighNaNRatio - Remove columns with high missing value ratio
  • DropLowCardinality - Remove low cardinality columns
  • HighCardinalityFilter - Filter high cardinality features
  • OutlierFilter - Detect and filter outliers
  • RenameColumns - Rename columns
  • Replace - Replace values in data
  • VarianceFilter - Remove low variance features
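
To make the filtering idea concrete, here is a minimal plain-Python sketch of dropping columns by missing-value ratio and by variance. It is not Gators' implementation: the `table` layout (a dict of column name to list), the function names, and the strict/inclusive threshold behavior are all assumptions made for illustration.

```python
import statistics


def null_ratio(values):
    """Fraction of entries in a column that are None."""
    return sum(v is None for v in values) / len(values)


def drop_uninformative_columns(table, max_null_ratio=0.5, min_variance=0.01):
    """Keep columns that pass the null-ratio check and, if numeric, the variance check.

    `table` is a hypothetical dict mapping column name -> list of values.
    """
    kept = {}
    for name, values in table.items():
        if null_ratio(values) > max_null_ratio:
            continue  # too many missing values
        present = [v for v in values if v is not None]
        is_numeric = all(isinstance(v, (int, float)) for v in present)
        if is_numeric and len(present) > 1 and statistics.pvariance(present) < min_variance:
            continue  # constant or near-constant numeric column
        kept[name] = values
    return kept


table = {
    "amount": [10.0, 25.0, None, 40.0],
    "mostly_missing": [None, None, None, 1.0],
    "constant": [1.0, 1.0, 1.0, 1.0],
    "country": ["US", "FR", "US", "DE"],
}
cleaned = drop_uninformative_columns(table)
print(list(cleaned))  # ['amount', 'country']
```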

🔢 Categorical Encoding

Transform categorical variables with advanced encoding techniques:

  • BinaryEncoder - Binary representation encoding
  • CatBoostEncoder - CatBoost-style encoding
  • CountEncoder - Frequency-based encoding
  • LeaveOneOutEncoder - Leave-one-out encoding
  • OneHotEncoder - Classic one-hot encoding
  • OrdinalEncoder - Order-based encoding
  • RareCategoryEncoder - Replace rare/infrequent categories with a single category
  • TargetEncoder - Target-based encoding for supervised learning
  • WOEEncoder - Weight of Evidence encoding
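
As a rough illustration of what target-based and leave-one-out encoding compute, the plain-Python sketch below replaces each category with a mean of the target. The function names and the `default` value for singleton categories are assumptions; Gators' encoders may add smoothing or handle unseen categories differently.

```python
from collections import defaultdict


def _category_totals(categories, targets):
    """Sum of targets and row count per category."""
    sums, counts = defaultdict(float), defaultdict(int)
    for c, y in zip(categories, targets):
        sums[c] += y
        counts[c] += 1
    return sums, counts


def target_encode(categories, targets):
    """Replace each category with the mean target over all rows of that category."""
    sums, counts = _category_totals(categories, targets)
    return [sums[c] / counts[c] for c in categories]


def leave_one_out_encode(categories, targets, default=0.0):
    """Same idea, but each row's own target is excluded from its category mean."""
    sums, counts = _category_totals(categories, targets)
    encoded = []
    for c, y in zip(categories, targets):
        if counts[c] > 1:
            encoded.append((sums[c] - y) / (counts[c] - 1))
        else:
            encoded.append(default)  # singleton category: no other rows to average
    return encoded


cats = ["a", "a", "b", "a", "b", "c"]
ys = [1, 0, 1, 1, 0, 1]
print(leave_one_out_encode(cats, ys))  # [0.5, 1.0, 0.0, 0.5, 1.0, 0.0]
```

Excluding the row's own target reduces the leakage that plain target encoding introduces on the training set.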

🎯 Feature Generation - Numeric

Create powerful numeric features:

Mathematical Operations:

  • DistanceFeatures - Calculate distance features
  • IsNull - Generate null indicator features
  • MathFeatures - Apply mathematical operations (add, subtract, multiply, divide)
  • RatioFeatures - Create ratio features between columns
  • PlaneRotationFeatures - Rotate features in feature space
  • PolynomialFeatures - Generate polynomial combinations
  • ScalarMathFeatures - Apply scalar operations

Aggregation & Statistics:

  • GroupLagFeatures - Generate lag features by group
  • GroupScalingFeatures - Scale features within groups
  • GroupStatisticsFeatures - Calculate group statistics
  • RowStatisticsFeatures - Calculate row-wise statistics
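
A group lag feature gives each row the value seen a fixed number of rows earlier within the same group. The sketch below shows the idea in plain Python; `group_lag` is a hypothetical helper, and it assumes the rows are already sorted in time order.

```python
def group_lag(groups, values, lag=1):
    """For each row, return the value `lag` rows earlier in the same group.

    Rows with fewer than `lag` earlier rows in their group get None.
    """
    history = {}
    lagged = []
    for g, v in zip(groups, values):
        seen = history.setdefault(g, [])
        lagged.append(seen[-lag] if len(seen) >= lag else None)
        seen.append(v)
    return lagged


accounts = ["A", "B", "A", "A", "B"]
amounts = [10, 7, 12, 15, 9]
print(group_lag(accounts, amounts))         # [None, None, 10, 12, 7]
print(group_lag(accounts, amounts, lag=2))  # [None, None, None, 10, None]
```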

Rule-Based:

  • ComparisonFeatures - Generate comparison features
  • ConditionFeatures - Create conditional features
  • RuleFeatures - Apply custom business rules

📝 Feature Generation - String

Extract insights from text data:

  • CharacterStatistics - Extract character-level statistics
  • CombineFeatures - Combine string features
  • Contains - Check if string contains pattern
  • Endswith - Check if string ends with pattern
  • ExtractSubstring - Extract substring from text
  • InteractionFeatures - Generate string interaction features
  • Length - Calculate string length
  • Lower - Convert text to lowercase
  • NGram - Generate n-gram features
  • Occurrences - Count pattern occurrences
  • PatternDetector - Detect patterns in text
  • Split - Split strings
  • SplitExtract - Split and extract from strings
  • Startswith - Check if string starts with pattern
  • Upper - Convert text to uppercase
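
For intuition, here is a small plain-Python sketch of character n-grams and their counts. Whether Gators' NGram transformer works on characters or on words is not stated here, so treat the character-level choice and the helper names as assumptions.

```python
from collections import Counter


def char_ngrams(text, n=2):
    """All contiguous substrings of length n, in order of appearance."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]


def ngram_counts(text, n=2):
    """How often each character n-gram occurs in the text."""
    return Counter(char_ngrams(text, n))


print(char_ngrams("gator"))       # ['ga', 'at', 'to', 'or']
print(char_ngrams("gator", n=3))  # ['gat', 'ato', 'tor']
print(ngram_counts("banana")["an"])  # 2
```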

📅 Feature Generation - DateTime

Unlock temporal patterns:

  • BusinessTimeFeatures - Business hours/days calculations
  • CyclicFeatures - Circular encoding for cyclical time features
  • DiffFeatures - Calculate time differences
  • DurationToDatetime - Convert duration to datetime
  • HolidayFeatures - Detect and encode holidays
  • OrdinalFeatures - Extract year, month, day, hour, etc.
  • TimeBinFeatures - Bin times into categories
  • TimeWindowFeatures - Generate time window features
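
Circular encoding maps a periodic value such as hour-of-day onto a circle, so that 23:00 and 00:00 end up close together. The standard sine/cosine form can be sketched in plain Python as follows; `cyclic_encode` is a hypothetical helper, not the Gators API.

```python
import math


def cyclic_encode(value, period):
    """Map a periodic value to (sin, cos) coordinates on the unit circle."""
    angle = 2 * math.pi * value / period
    return math.sin(angle), math.cos(angle)


def circle_distance(a, b, period):
    """Euclidean distance between two encoded values."""
    return math.dist(cyclic_encode(a, period), cyclic_encode(b, period))


# Hours 23 and 0 are neighbours on the clock, while 0 and 12 are opposite.
print(circle_distance(23, 0, period=24) < circle_distance(0, 12, period=24))  # True
```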

🔄 Missing Value Imputation

Handle missing data intelligently:

  • BooleanImputer - Impute boolean columns
  • GroupByImputer - Group-based imputation strategies
  • NumericImputer - Impute numeric columns (mean, median, mode, constant)
  • StringImputer - Impute string columns (mode, constant)
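
The fit/transform split matters for imputation: the fill value is learned once from training data and then reused on new data. The sketch below shows this with a hypothetical `MedianImputer` stand-in written in plain Python; it is not the Gators NumericImputer and its constructor and attributes are assumptions.

```python
import statistics


class MedianImputer:
    """Hypothetical stand-in that fills None with the median learned in fit()."""

    def fit(self, values):
        # Learn the fill value from the non-missing training values only.
        self.fill_value_ = statistics.median(v for v in values if v is not None)
        return self

    def transform(self, values):
        # Reuse the learned value; nothing is recomputed on new data.
        return [self.fill_value_ if v is None else v for v in values]


imputer = MedianImputer().fit([1.0, None, 3.0, 10.0])
print(imputer.transform([None, 4.0]))  # [3.0, 4.0]
```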

📊 Discretization

Convert continuous variables into bins:

  • CustomDiscretizer - Custom bin edges
  • EqualLengthDiscretizer - Equal-width binning
  • EqualSizeDiscretizer - Equal-frequency binning
  • GeometricDiscretizer - Geometric progression binning
  • KMeansDiscretizer - K-means clustering-based binning
  • QuantileDiscretizer - Quantile-based binning
  • TreeBasedDiscretizer - Decision tree-based binning
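
The difference between equal-width and equal-frequency binning shows up clearly on skewed data. This plain-Python sketch computes the inner bin edges both ways; the helper names and the exact edge conventions are assumptions and may differ from how Gators places its boundaries.

```python
import bisect


def equal_width_edges(values, n_bins):
    """Inner edges that split [min, max] into n_bins intervals of equal width."""
    lo, hi = min(values), max(values)
    step = (hi - lo) / n_bins
    return [lo + i * step for i in range(1, n_bins)]


def equal_frequency_edges(values, n_bins):
    """Inner edges chosen so each bin receives roughly the same number of rows."""
    ordered = sorted(values)
    return [ordered[len(ordered) * i // n_bins] for i in range(1, n_bins)]


def assign_bin(value, edges):
    """Index of the bin that `value` falls into."""
    return bisect.bisect_right(edges, value)


values = [1, 2, 3, 4, 5, 6, 7, 100]  # one large outlier

width_edges = equal_width_edges(values, 4)
print([assign_bin(v, width_edges) for v in values])  # [0, 0, 0, 0, 0, 0, 0, 3]

freq_edges = equal_frequency_edges(values, 4)
print([assign_bin(v, freq_edges) for v in values])   # [0, 0, 1, 1, 2, 2, 3, 3]
```

With equal-width bins the outlier pushes almost every row into the first bin, while equal-frequency bins stay balanced.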

⚖️ Feature Scaling

Normalize your features:

  • ArcsinSquarerootScaler - Arcsine square root transformation
  • ArcsinhScaler - Inverse hyperbolic sine transformation
  • BoxCox - Box-Cox power transformation
  • LogScaler - Logarithmic scaling
  • MinmaxScaler - Min-max normalization
  • PowerScaler - Power transformation
  • StandardScaler - Standardization (z-score normalization)
  • YeoJohnson - Yeo-Johnson power transformation
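
For reference, three of these transformations can be written in a few lines of plain Python. This is a sketch of the textbook formulas only: the function names are hypothetical, and the use of the population standard deviation is an assumption that may not match Gators' StandardScaler.

```python
import math
import statistics


def standard_scale(values):
    """z-score: subtract the mean, divide by the (population) standard deviation."""
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)
    return [(v - mean) / std for v in values]


def minmax_scale(values):
    """Rescale linearly so the minimum maps to 0 and the maximum to 1."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]


def arcsinh_scale(values):
    """Inverse hyperbolic sine: log-like compression that also accepts 0 and negatives."""
    return [math.asinh(v) for v in values]


values = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]
print(standard_scale(values))  # [-1.5, -0.5, -0.5, -0.5, 0.0, 0.0, 1.0, 2.0]
```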

🔗 Pipeline

Chain all transformers together:

  • Pipeline - sklearn-compatible pipeline for chaining transformers
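
Conceptually, chaining means each step is fitted on the output of the previous one, and the fitted steps are later replayed in the same order on new data. The plain-Python sketch below illustrates that flow with hypothetical classes (`AddConstant`, `CenterOnMean`, `MiniPipeline`); it is not the Gators Pipeline implementation.

```python
class AddConstant:
    """Stateless step: adds a fixed constant to every value."""

    def __init__(self, constant):
        self.constant = constant

    def fit(self, values):
        return self

    def transform(self, values):
        return [v + self.constant for v in values]


class CenterOnMean:
    """Stateful step: learns the mean in fit() and subtracts it in transform()."""

    def fit(self, values):
        self.mean_ = sum(values) / len(values)
        return self

    def transform(self, values):
        return [v - self.mean_ for v in values]


class MiniPipeline:
    """Hypothetical pipeline holding (name, step) pairs."""

    def __init__(self, steps):
        self.steps = steps

    def fit_transform(self, values):
        for _, step in self.steps:
            values = step.fit(values).transform(values)  # fit on the previous step's output
        return values

    def transform(self, values):
        for _, step in self.steps:
            values = step.transform(values)  # replay fitted steps, no refitting
        return values


pipe = MiniPipeline(steps=[("shift", AddConstant(10)), ("center", CenterOnMean())])
print(pipe.fit_transform([1.0, 2.0, 3.0]))  # [-1.0, 0.0, 1.0]
print(pipe.transform([5.0]))                # [3.0]
```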

🚀 Quick Start

import polars as pl
from gators.data_cleaning import DropHighNaNRatio, VarianceFilter
from gators.encoders import OneHotEncoder
from gators.imputers import NumericImputer
from gators.scalers import StandardScaler
from gators.pipeline import Pipeline

# Load your data
X = pl.read_csv("data.csv")

# Build a preprocessing pipeline
pipeline = Pipeline(steps=[
    ('drop_nan', DropHighNaNRatio(max_ratio=0.5)),
    ('impute', NumericImputer(strategy='median')),
    ('variance', VarianceFilter(min_var=0.01)),
    ('encode', OneHotEncoder()),  # One-hot encode all String and Categorical columns
    ('scale', StandardScaler())
])

# Fit and transform
X_processed = pipeline.fit_transform(X)

# Serialize the pipeline with pickle/joblib for production deployment
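
The serialization step mentioned in the last comment can be sketched with the standard `pickle` module. The `fitted_state` dict below is a stand-in with made-up contents so the snippet runs without Gators installed; in practice you would pass the fitted `pipeline` object instead, assuming it pickles cleanly as the comment above suggests.

```python
import pickle

# Stand-in for a fitted pipeline (hypothetical contents, for illustration only).
fitted_state = {
    "impute": {"amount": 3.0},
    "scale": {"amount": (5.0, 2.0)},
}

# Serialize to bytes; pickle.dump(obj, file) writes to an open binary file instead.
payload = pickle.dumps(fitted_state)

# Later, in the serving process, restore the object and use it as before.
restored = pickle.loads(payload)
print(restored == fitted_state)  # True
```

Only unpickle data from sources you trust, since loading a pickle can execute arbitrary code.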

📦 Installation

Requires Python 3.10 or higher.

pip3 install gators

Or install from source:

git clone https://github.com/paypal/gators.git
cd gators
pip3 install -e .    # Install in editable/development mode
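
To check that the install worked, one option is to query the installed distribution with the standard library. The helper name `installed_version` is hypothetical; the distribution name `gators` is taken from the pip command above.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_version(distribution="gators"):
    """Return the installed version string, or None if the distribution is missing."""
    try:
        return version(distribution)
    except PackageNotFoundError:
        return None


print(installed_version())  # a version string if Gators is installed, otherwise None
```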

📚 Documentation

For detailed documentation, tutorials, and API reference, visit:

https://paypal.github.io/gators/

🎯 Use Cases

Gators is perfect for:

  • Fraud Detection - Extensive feature engineering for anomaly detection
  • Risk Modeling - Create powerful predictive features
  • Customer Analytics - Transform complex customer data
  • Time Series - Rich datetime feature engineering
  • NLP Tasks - String feature extraction and encoding

🏢 Used By

Gators powers ML pipelines at:

  • PayPal (internal use)

🤝 Contributing

We welcome contributions! Please check out our contributing guidelines.

📄 License

Gators is licensed under the Apache License 2.0. See LICENSE file for details.

🙏 Credits

Developed by the PSP Data Team at PayPal.


Built by data scientists, for data scientists
