spforge

A flexible framework for generating features, ratings, and building machine learning or other models for training and inference on sports data.

spforge is a sports prediction framework for building feature-rich, stateful, and sklearn-compatible modeling pipelines.

It is designed for:

  • player- and team-level ratings
  • rolling and lagged feature generation
  • match-aware cross-validation
  • probabilistic and point-estimate models
  • pandas and polars DataFrames (via narwhals)

Typical use cases include:

  • predicting game winners
  • predicting player or team points
  • generating probabilities using either machine learning models or distributions
  • feature engineering and cross-validation

Installation

pip install spforge

Core assumptions

spforge assumes your data is structured as:

  • One row per entity per match
    • e.g. (game_id, player_id) or (game_id, team_id)
  • Higher-level predictions (team/game) are handled via aggregation or grouping.
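As a sketch of this layout (column names illustrative), one row per entity per match means player-level rows can always be rolled up to higher levels by grouping:

```python
import pandas as pd

# Hypothetical long-format data: one row per (game_id, player_id)
df = pd.DataFrame({
    "game_id":   [1, 1, 1, 1, 2, 2],
    "team_id":   ["A", "A", "B", "B", "A", "C"],
    "player_id": ["p1", "p2", "p3", "p4", "p1", "p5"],
    "points":    [12, 8, 15, 9, 20, 11],
})

# Team-level rows are derived by aggregating within (game_id, team_id)
team_df = df.groupby(["game_id", "team_id"], as_index=False)["points"].sum()
```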

Key concepts

Before diving into examples, here are fundamental concepts that guide how spforge works:

  • Temporal ordering prevents future leakage: Data must be sorted chronologically (by date, then match, then team/player). This ensures models never "see the future" when making predictions.

  • Elo-style ratings: Player and team ratings evolve over time based on match performance. Think of it like a chess rating - win against strong opponents and your rating increases more. Ratings are calculated BEFORE each match to avoid leakage.

  • State management lifecycle:

    • fit_transform(df): Learn patterns from historical data (ratings update, windows build up)
    • transform(df): Apply to more historical data (continues updating state)
    • future_transform(df): Generate features for prediction WITHOUT updating internal state (read-only)

  • Granularity-based aggregation: Player-level data (e.g., individual stats) can be automatically aggregated to team-level for game winner predictions.

  • pandas and polars support: All components work identically with both DataFrame types via the narwhals library.
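The Elo-style idea above can be sketched in a few lines (an illustrative toy update, not spforge's internal implementation). Note the feature value is the rating BEFORE the match; the state update happens only after the outcome is known:

```python
def expected_score(rating_a: float, rating_b: float) -> float:
    # Standard Elo expected score for player A against player B
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400))

def update(rating: float, opponent: float, won: float, k: float = 30) -> float:
    # Beating a stronger opponent moves the rating more than beating a weaker one
    return rating + k * (won - expected_score(rating, opponent))

r = 1500.0
pre_match_feature = r                       # feature: rating BEFORE the match
r = update(r, opponent=1600.0, won=1.0)     # state update AFTER the match
```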

Example

This example demonstrates predicting NBA game winners using player-level ratings.

import pandas as pd
from sklearn.linear_model import LogisticRegression

from examples import get_sub_sample_nba_data
from spforge.autopipeline import AutoPipeline
from spforge.data_structures import ColumnNames
from spforge.ratings import PlayerRatingGenerator, RatingKnownFeatures

df = get_sub_sample_nba_data(as_pandas=True, as_polars=False)

# Step 1: Define column mappings for your dataset
column_names = ColumnNames(
    team_id="team_id",
    match_id="game_id",
    start_date="start_date",
    player_id="player_name",
)

# Step 2: CRITICAL - Sort data chronologically to prevent future leakage
# This ensures ratings and features only use past information
df = df.sort_values(
    by=[
        column_names.start_date,  # First by date
        column_names.match_id,    # Then by match
        column_names.team_id,     # Then by team
        column_names.player_id,   # Finally by player
    ]
)

# Step 3: Filter to valid games (exactly 2 teams)
df = (
    df.assign(
        team_count=df.groupby(column_names.match_id)[column_names.team_id].transform("nunique")
    )
    .loc[lambda x: x.team_count == 2]
    .drop(columns=["team_count"])
)

# Step 4: Split into historical (training) and future (prediction) data
# In production, "future" would be upcoming games without outcomes
most_recent_10_games = df[column_names.match_id].unique()[-10:]
historical_df = df[~df[column_names.match_id].isin(most_recent_10_games)]
future_df = df[df[column_names.match_id].isin(most_recent_10_games)].drop(columns=["won"])

# Step 5: Generate player ratings based on win/loss history
# Each player gets a rating that updates after each game
# Unlike traditional team Elo, ratings follow individual players
rating_generator = PlayerRatingGenerator(
    performance_column="won",  # Update ratings based on wins/losses
    rating_change_multiplier=30,  # How quickly ratings adjust (higher = more volatile)
    column_names=column_names,
    non_predictor_features_out=[RatingKnownFeatures.PLAYER_RATING],
)
# fit_transform learns ratings from historical games
historical_df = rating_generator.fit_transform(historical_df)

# Step 6: Create prediction pipeline
# AutoPipeline automatically handles preprocessing (encoding, scaling)
# granularity aggregates player-level data to team-level before fitting
pipeline = AutoPipeline(
    estimator=LogisticRegression(),
    granularity=["game_id", "team_id"],  # Aggregate players → teams
    estimator_features=rating_generator.features_out + ["location"],  # Rating + home/away
)

# Train on historical data
pipeline.fit(X=historical_df, y=historical_df["won"])

# Step 7: Make predictions on future games
# future_transform generates features WITHOUT updating rating state
# This is crucial: we don't want to update ratings until games actually happen
future_df = rating_generator.future_transform(future_df)
future_predictions = pipeline.predict_proba(future_df)[:, 1]  # Probability of winning
future_df["game_winner_probability"] = future_predictions

# Collapse player-level rows to one row per game for the final output
team_grouped_predictions = future_df.groupby(column_names.match_id).first()[
    [
        column_names.start_date,
        column_names.team_id,
        "team_id_opponent",
        "game_winner_probability",
    ]
]

print(team_grouped_predictions)

Output:

            start_date     team_id  team_id_opponent  game_winner_probability
game_id                                                                      
0022200767  2023-01-31  1610612749        1610612766                 0.731718
0022200768  2023-01-31  1610612740        1610612743                 0.242622
0022200770  2023-02-01  1610612753        1610612755                 0.278237
0022200771  2023-02-01  1610612757        1610612763                 0.340883
0022200772  2023-02-01  1610612738        1610612751                 0.629010
0022200773  2023-02-01  1610612745        1610612760                 0.401803
0022200774  2023-02-01  1610612744        1610612750                 0.430164
0022200775  2023-02-01  1610612758        1610612759                 0.587513
0022200776  2023-02-01  1610612761        1610612762                 0.376864
0022200777  2023-02-01  1610612737        1610612756                 0.371888

AutoPipeline

AutoPipeline is a sklearn-compatible wrapper that handles the full modeling pipeline, from preprocessing to final estimation.

  • Builds all required preprocessing steps automatically based on the estimator:
    • One-hot encoding and imputation for linear models (e.g. LogisticRegression)
    • Native categorical handling for LightGBM
    • Ordinal encoding where appropriate
  • Supports predictor transformers, allowing upstream models to generate features that are consumed by the final estimator.
  • Supports optional granularity-based aggregation, enabling row-level data (e.g. player-game) to be grouped before fitting (e.g. game-team level).
  • Provides additional functionality such as:
    • training-time row filtering
    • target clipping and validation handling
    • consistent feature tracking for sklearn integration
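The granularity-based aggregation can be pictured in plain pandas (a conceptual sketch of the grouping step, not AutoPipeline's internals):

```python
import pandas as pd

# Player-game rows are grouped to game-team rows by averaging
# numeric features before the estimator is fit.
player_rows = pd.DataFrame({
    "game_id": [1, 1, 1, 1],
    "team_id": ["A", "A", "B", "B"],
    "player_rating": [1520.0, 1480.0, 1600.0, 1440.0],
})

team_rows = player_rows.groupby(["game_id", "team_id"], as_index=False).mean()
# one row per (game_id, team_id) with the mean player_rating
```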

Feature Engineering

spforge provides stateful feature generators that create rich features from historical match data while maintaining temporal ordering to prevent data leakage.

Feature types available

  • Ratings: Elo-style player/team ratings that evolve based on performance (separate offense/defense ratings)
    • Can combine multiple stats into a composite performance metric using performance_weights (e.g., 60% kills + 40% assists)
    • Auto-normalizes raw stats to 0-1 range with auto_scale_performance=True
  • Lags: Previous match statistics, automatically shifted to prevent leakage
  • Rolling windows: Averages/sums over the last N matches
  • FeatureGeneratorPipeline: Chain multiple generators together sequentially

Example: Building a feature pipeline

from spforge import FeatureGeneratorPipeline
from spforge.feature_generator import LagTransformer, RollingWindowTransformer
from spforge.ratings import PlayerRatingGenerator, RatingKnownFeatures
from spforge.performance_transformers import ColumnWeight

# Create individual feature generators
player_rating_generator = PlayerRatingGenerator(
    performance_column="points",
    auto_scale_performance=True,  # Normalizes points to 0-1 range
    column_names=column_names,
    features_out=[RatingKnownFeatures.PLAYER_RATING_DIFFERENCE_PROJECTED],
)

# Alternative: Combine multiple stats into a composite performance metric
# player_rating_generator = PlayerRatingGenerator(
#     performance_column="weighted_performance",  # Name for the composite metric
#     performance_weights=[
#         ColumnWeight(name="kills", weight=0.6),
#         ColumnWeight(name="assists", weight=0.4),
#     ],
#     column_names=column_names,
#     features_out=[RatingKnownFeatures.PLAYER_RATING_DIFFERENCE_PROJECTED],
# )

lag_transformer = LagTransformer(
    features=["points"],
    lag_length=3,  # Last 3 games
    granularity=["player_id"],
)

rolling_transformer = RollingWindowTransformer(
    features=["points"],
    window=10,  # Last 10 games average
    granularity=["player_id"],
)

# Chain them together
features_pipeline = FeatureGeneratorPipeline(
    column_names=column_names,
    feature_generators=[
        player_rating_generator,
        lag_transformer,
        rolling_transformer,
    ],
)

# Learn from historical data
historical_df = features_pipeline.fit_transform(historical_df)

# For production predictions (doesn't update internal state)
future_df = features_pipeline.future_transform(future_df)

Key points:

  • fit_transform: Learn ratings/patterns from historical data (updates internal state)
  • transform: Apply to more historical data (continues updating state)
  • future_transform: Generate features for prediction (read-only, no state updates)
  • Features are automatically shifted by 1 match to prevent data leakage
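The shift-by-one behavior in the key points above amounts to a grouped shift, sketched here in plain pandas (a sketch of the idea, not LagTransformer's actual implementation):

```python
import pandas as pd

# Leakage-safe lag: each row sees only PRIOR games for the same player.
df = pd.DataFrame({
    "player_id": ["p1", "p1", "p1", "p2", "p2"],
    "points":    [10, 20, 30, 5, 15],
})
# shift(1) within each player: the current game's row gets the previous
# game's value, and each player's first game gets NaN.
df["points_lag1"] = df.groupby("player_id")["points"].shift(1)
```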

See examples/nba/feature_engineering_example.py for a complete example with detailed explanations.

Cross Validation and Scorer metrics

Regular k-fold cross-validation doesn't work for time-series sports data because it creates "future leakage": using future games to predict past games. MatchKFoldCrossValidator ensures training data always comes BEFORE validation data, respecting temporal ordering.

Why this matters

Sports data has strong time dependencies: teams improve, players get injured, strategies evolve. Standard CV would overestimate model performance by allowing the model to "see the future."
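The temporal split can be pictured as an expanding window over date-sorted matches (an illustrative sketch, not MatchKFoldCrossValidator's exact fold logic):

```python
# Matches are ordered by date; each fold trains on everything
# BEFORE its validation block.
matches = [f"g{i}" for i in range(1, 9)]  # 8 matches, already date-sorted
n_splits = 3
fold_size = len(matches) // (n_splits + 1)

folds = []
for i in range(1, n_splits + 1):
    train = matches[: i * fold_size]                       # all earlier matches
    val = matches[i * fold_size : (i + 1) * fold_size]     # next block in time
    folds.append((train, val))
# every train set ends strictly before its validation block begins
```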

Example: Time-series cross-validation

from spforge.cross_validator import MatchKFoldCrossValidator
from spforge.scorer import SklearnScorer, Filter, Operator
from sklearn.metrics import mean_absolute_error

# Set up temporal cross-validation
cross_validator = MatchKFoldCrossValidator(
    date_column_name=column_names.start_date,
    match_id_column_name=column_names.match_id,
    estimator=pipeline,  # Your AutoPipeline
    prediction_column_name="points_pred",
    target_column="points",
    n_splits=3,  # Number of temporal folds
    # Must include both estimator features and context features
    features=pipeline.required_features,
)

# Generate validation predictions
# add_training_predictions=True also returns predictions on training data
validation_df = cross_validator.generate_validation_df(df=df, add_training_predictions=True)

# Score only validation rows, filtering to players who actually played
scorer = SklearnScorer(
    pred_column="points_pred",
    target="points",
    scorer_function=mean_absolute_error,
    validation_column="is_validation",  # Only score where is_validation == True
    filters=[
        Filter(column_name="minutes", value=0, operator=Operator.GREATER_THAN)
    ],
)

mae = scorer.score(validation_df)
print(f"Validation MAE: {mae:.2f}")

Key points:

  • add_training_predictions=True returns both training and validation predictions
    • is_validation=True marks validation rows, is_validation=False marks training rows
    • Use validation_column in scorer to score only validation rows
  • Training data always comes BEFORE validation data chronologically
  • Must pass all required features (use pipeline.required_features)
  • Scorers can filter rows (e.g., only score players who played minutes > 0)
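The filtered scoring described in the key points above amounts to the following (a plain-pandas sketch, not SklearnScorer's implementation):

```python
import pandas as pd

# Score only validation rows where the player actually played.
df = pd.DataFrame({
    "points":        [10, 20, 0, 30],
    "points_pred":   [12.0, 18.0, 3.0, 25.0],
    "minutes":       [30, 25, 0, 35],
    "is_validation": [True, True, True, False],
})
mask = df["is_validation"] & (df["minutes"] > 0)
mae = (df.loc[mask, "points"] - df.loc[mask, "points_pred"]).abs().mean()
```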

See examples/nba/cross_validation_example.py for a complete example.

Distributions (Advanced)

Instead of predicting a single point estimate, you can predict full probability distributions. For example, instead of "player will score 15 points", predict P(0 points), P(1 point), ..., P(40 points).

When to use distributions

  • Modeling count data (points, goals, kills, assists)
  • When you need uncertainty estimates or confidence intervals
  • For expected value calculations in betting or DFS
  • When the outcome has inherent randomness

What NegativeBinomialEstimator does during fit

During training, NegativeBinomialEstimator:

  1. Takes the point estimates (from point_estimate_pred_column) and actual target values
  2. Optimizes a dispersion parameter r using maximum likelihood estimation on the negative binomial distribution
  3. If r_specific_granularity is set (e.g., per player), calculates entity-specific r values by:
    • Computing rolling means and variances of point estimates over recent matches
    • Binning entities by quantiles of mean and variance
    • Fitting separate r values for each bin to capture different uncertainty patterns

During prediction, it uses the learned r parameter(s) and the point estimates to generate a full probability distribution over all possible values (0 to max_value).
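Conceptually, this final step maps a point estimate and a dispersion r to per-count probabilities. A sketch using scipy's negative binomial (illustrative parameter values, and not the estimator's exact code):

```python
import numpy as np
from scipy.stats import nbinom

mu = 15.0        # point estimate from the regressor
r = 4.0          # learned dispersion parameter
max_value = 40

# In scipy's parameterization, n = r and p = r / (r + mu) yields mean mu.
p = r / (r + mu)
counts = np.arange(max_value + 1)
probs = nbinom.pmf(counts, r, p)
# probs[k] approximates P(player scores exactly k points), k = 0..40
```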

Example: Comparing classifiers vs distribution estimators

A key advantage is comparing different approaches for generating probability distributions. Both LGBMClassifier and LGBMRegressor+NegativeBinomial output probabilities in the same format, making them directly comparable.

from spforge.distributions import NegativeBinomialEstimator
from spforge.transformers import EstimatorTransformer
from lightgbm import LGBMClassifier, LGBMRegressor

# Approach 1: LGBMClassifier (direct probability prediction)
pipeline_classifier = AutoPipeline(
    estimator=LGBMClassifier(verbose=-100, random_state=42),
    estimator_features=features_pipeline.features_out,
)

# Approach 2: LGBMRegressor + NegativeBinomialEstimator
distribution_estimator = NegativeBinomialEstimator(
    max_value=40,  # Predict 0-40 points
    point_estimate_pred_column="points_estimate",  # Uses regressor output
    r_specific_granularity=["player_id"],  # Player-specific dispersion
    predicted_r_weight=1,
    column_names=column_names,
)

pipeline_negbin = AutoPipeline(
    estimator=distribution_estimator,
    estimator_features=features_pipeline.features_out,
    predictor_transformers=[
        EstimatorTransformer(
            prediction_column_name="points_estimate",
            estimator=LGBMRegressor(verbose=-100, random_state=42),
            features=features_pipeline.features_out,
        )
    ],
)

# Compare using cross-validation (see examples for full setup)
# Results on NBA player points prediction:
# LGBMClassifier Ordinal Loss:              1.0372
# LGBMRegressor + NegativeBinomial Ordinal Loss: 0.3786
# LGBMRegressor + NegativeBinomial Point Est MAE: 4.5305

Key points:

  • Both approaches output probability distributions over the same range
  • NegativeBinomialEstimator performs significantly better (lower ordinal loss)
  • Distribution approach provides both probability distributions and point estimates
  • Can model player-specific variance with r_specific_granularity

See examples/nba/cross_validation_example.py for a complete runnable example with both approaches.

Predictions as features for downstream models (Advanced)

A common pattern in sports analytics is using output from one model as input to another. For example, team strength (game winner probability) often influences individual player performance.

Why this matters

Hierarchical modeling captures dependencies: team context → player performance, game flow → outcome probabilities. By chaining models, each stage can specialize and the final model combines their insights.

Example: Two-stage modeling with predictor_transformers

from spforge.transformers import EstimatorTransformer
from lightgbm import LGBMRegressor

# Stage 1: Create a raw point estimate
points_estimate_transformer = EstimatorTransformer(
    prediction_column_name="points_estimate_raw",
    estimator=LGBMRegressor(verbose=-100, n_estimators=30),
)

# Stage 2: Refine estimate using Stage 1 output
player_points_pipeline = AutoPipeline(
    estimator=LGBMRegressor(verbose=-100, n_estimators=50),
    estimator_features=features_pipeline.features_out,  # Original features
    # predictor_transformers execute first, adding their predictions
    predictor_transformers=[points_estimate_transformer],
)

# During fit:
#   1. Stage 1 fits and generates "points_estimate_raw" column
#   2. Stage 2 fits using original features + points_estimate_raw
player_points_pipeline.fit(X=train_df, y=train_df["points"])

# During predict:
#   1. Stage 1 generates "points_estimate_raw"
#   2. Stage 2 uses it to make final prediction
predictions = player_points_pipeline.predict(test_df)

Key points:

  • predictor_transformers chains estimators: output of one becomes input to next
  • All transformers share the same target (y) during fit
  • Transformers execute during both fit() and predict()
  • Common use cases:
    • Generate point estimates for distribution models
    • Multi-stage refinement of predictions
    • Combining different model types (linear → tree-based)
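The chaining pattern can be sketched with plain sklearn estimators (illustrative only; AutoPipeline wires this up internally via predictor_transformers):

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
y = X[:, 0] * 2 + X[:, 1] + rng.normal(scale=0.1, size=100)

# Stage 1: fit, then append its prediction as an extra feature column
stage1 = LinearRegression().fit(X, y)
X_aug = np.column_stack([X, stage1.predict(X)])

# Stage 2: fit on original features + the stage-1 prediction
stage2 = LinearRegression().fit(X_aug, y)
preds = stage2.predict(X_aug)
```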

See examples/nba/predictor_transformers_example.py for a complete example. Also demonstrated in examples/nba/cross_validation_example.py.

More Examples

For complete, runnable examples with detailed explanations, see the examples/nba/ directory (e.g. feature_engineering_example.py, cross_validation_example.py, predictor_transformers_example.py).
