MLB prediction models and data tools

mlb-ml-lab

MLB prediction models — fetch player and team data, build feature matrices, train models, and evaluate hit over/under forecasts.

Features

  • Zero MLB wrapper dependencies (no pybaseball, pybaseballstats, python-mlb-statsapi). A custom httpx-based client calls statsapi.mlb.com and baseballsavant.mlb.com directly.
  • Typed schemas throughout — PlayerGameLog, TeamInfo, RosterPlayer, etc. are typed dataclasses.
  • Disk caching with per-key TTL — avoids hammering the MLB API during development.
  • Rate limiting — built-in token bucket (10 req/s).
  • Park factors scraped live from Baseball Savant with static fallbacks.
  • NWS weather forecasts — free, no API key, covers every MLB venue.
  • Feature engineering pipeline — plugin-based extractors with a registry pattern, designed to be extractable as its own package.
  • Walk-forward validation — no random train/test splits. Sports data is temporally dependent.
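
To make the walk-forward idea concrete, here is a minimal sketch of an expanding-window splitter. It is not the package's implementation: the function name walk_forward_splits and its parameters are assumptions chosen for illustration, and it expects rows already sorted by game date.

```python
from collections.abc import Iterator


def walk_forward_splits(
    n_samples: int,
    n_folds: int,
    min_train: int,
) -> Iterator[tuple[range, range]]:
    """Yield (train, test) index ranges over chronologically sorted rows.

    Every test window starts where its training window ends, and the
    training window grows to absorb the previous test window.
    """
    if n_folds < 1 or min_train < 1:
        raise ValueError("n_folds and min_train must be positive")
    test_size = (n_samples - min_train) // n_folds
    if test_size < 1:
        raise ValueError("not enough samples for the requested folds")
    for fold in range(n_folds):
        train_end = min_train + fold * test_size
        yield range(0, train_end), range(train_end, train_end + test_size)


# Hypothetical example: 10 games, 3 folds, at least 4 games of history.
splits = list(walk_forward_splits(n_samples=10, n_folds=3, min_train=4))
for train, test in splits:
    print(f"train 0..{train.stop - 1}  test {test.start}..{test.stop - 1}")
```

Because each training range stops before its test range begins, no fold is fitted on games played after the ones it is scored on.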

Installation

# Clone the repo
git clone https://github.com/timhollingsworth/mlb-ml-lab
cd mlb-ml-lab

# Install with Poetry
poetry install

Requires Python 3.12+.

Quick Start

Fetch player game logs

from mlb_ml_lab import MlbClient

client = MlbClient()

# Get all teams
teams = client.get_teams()

# Get roster for a team (Angels = 108)
roster = client.get_roster(108)

# Get game logs for a player (Shohei Ohtani = 660271)
logs = client.get_player_game_log(660271, season=2024)

# Each log has typed fields
for log in logs:
    print(log.date, log.hits, log.at_bats)

Fetch game context (venue, weather, datetime)

# Game feed gives you venue, weather, and game datetime
feed = client.get_game_context(778554)
# → {"venue_id": 4, "venue_name": "Rate Field",
#    "game_datetime": "2025-03-27T20:10:00Z",
#    "weather_condition": "Cloudy", "weather_temp": "68", ...}

Build a feature matrix

from mlb_ml_lab import MlbClient, build_feature_matrix, describe_features, make_targets

client = MlbClient()

# 1. Fetch data
teams = client.get_teams()
logs = client.get_player_game_log(660271, season=2024)
contexts = {778554: client.get_game_context(778554)}

# 2. Assemble features (runs all registered extractors)
matrix = build_feature_matrix(
    logs,
    season=2024,
    teams=teams,
    extra_kwargs={"game_contexts": contexts},
)

# 3. See what features are available
metas = describe_features()
for m in metas:
    print(f"{m.name:40s} {m.source:10s} {m.description}")

# 4. Create target labels
targets = make_targets(logs)
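
# The signature and return type of make_targets() are not documented here.
# As a rough illustration of what a hit over/under label is, the
# self-contained sketch below uses a hypothetical GameLine record and a
# hypothetical over_under_targets() helper; neither is part of mlb_ml_lab.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class GameLine:
    """Hypothetical stand-in for a game log; only `hits` is used here."""

    date: str
    hits: int


def over_under_targets(logs: list[GameLine], line: float = 0.5) -> list[int]:
    """Return 1 when the player finished over the hits line, else 0."""
    return [1 if log.hits > line else 0 for log in logs]


# Synthetic rows for illustration only.
sample = [
    GameLine("2024-04-01", 0),
    GameLine("2024-04-02", 2),
    GameLine("2024-04-03", 1),
]
labels_half = over_under_targets(sample)           # over 0.5 hits
labels_one_half = over_under_targets(sample, 1.5)  # over 1.5 hits
```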

Weather forecast for an upcoming game

from datetime import datetime
from mlb_ml_lab import NwsWeather

nws = NwsWeather()

# Angel Stadium (venue_id=1) at game time
forecast = nws.forecast(1, target_time=datetime(2025, 7, 4, 19, 7))
# → {"temp": 75, "wind_speed": "8 mph", "wind_direction": "SW",
#    "precip_pct": 10, "conditions": "Partly Cloudy", "source": "forecast"}

Park factors

from mlb_ml_lab import ParkFactors

pf = ParkFactors()
# Coors Field (venue_id=19) 2024 wOBA factor
factor = pf.factor(19, "wOBA", season=2024)
print(factor)  # e.g. 1.11 (11% boost)

Project Structure

mlb-ml-lab/
├── src/
│   └── mlb_ml_lab/
│       ├── data/               # Data layer (installable)
│       │   ├── client.py       # MlbClient — MLB Stats API + Baseball Savant
│       │   ├── schemas.py      # Typed dataclasses
│       │   ├── cache.py        # DiskCache (JSON, per-key TTL)
│       │   ├── rate_limiter.py # TokenBucket rate limiter
│       │   ├── parks.py        # ParkFactors (Savant scrape + fallback)
│       │   └── weather.py      # NwsWeather (NWS API, free, no key)
│       └── features/           # Feature engineering (installable)
│           ├── base.py         # FeatureExtractor ABC, registry
│           ├── rolling.py      # Rolling window stats (hits, PA, BABIP)
│           ├── context.py      # Home/away, rest days, park factors, weather
│           ├── matchup.py      # Opponent pitching stats
│           ├── statcast.py     # Statcast advanced metrics
│           ├── forecast.py     # NWS weather forecast features
│           ├── assemble.py     # build_feature_matrix(), describe_features()
│           └── targets.py      # make_targets() for hit thresholds
├── pipeline/                   # Modeling (training, prediction, evaluation)
├── tests/
│   ├── data/                   # Tests for data layer
│   └── features/               # Tests for feature engineering
├── data/                       # Raw/processed datasets (gitignored)
├── experiments/                # Notebooks (gitignored)
├── pyproject.toml
├── README.md
├── LICENSE
├── AGENTS.md                   # Dev instructions (AI assistant)
└── ROADMAP.md                  # Build-out plan
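
For readers unfamiliar with the technique behind rate_limiter.py, the following is a minimal token bucket sketch. It only illustrates the general algorithm: the class shape, the try_acquire method, and the injectable clock are assumptions, not the package's actual API.

```python
import time
from collections.abc import Callable


class TokenBucket:
    """Allow bursts up to `capacity`, refilling at `rate` tokens per second."""

    def __init__(
        self,
        rate: float,
        capacity: float,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self.rate = rate
        self.capacity = capacity
        self._clock = clock
        self._tokens = capacity
        self._last = clock()

    def _refill(self) -> None:
        # Add tokens for the time elapsed since the last check, capped at capacity.
        now = self._clock()
        self._tokens = min(self.capacity, self._tokens + (now - self._last) * self.rate)
        self._last = now

    def try_acquire(self) -> bool:
        """Take one token if available; return False instead of blocking."""
        self._refill()
        if self._tokens >= 1.0:
            self._tokens -= 1.0
            return True
        return False


# A fake clock keeps the demonstration deterministic.
fake_now = [0.0]
bucket = TokenBucket(rate=10.0, capacity=10.0, clock=lambda: fake_now[0])
burst = [bucket.try_acquire() for _ in range(11)]  # the 11th call finds the bucket empty
fake_now[0] += 0.25                                # 0.25 s later: 2.5 tokens refilled
after_wait = bucket.try_acquire()
```

A blocking variant would sleep until the next token is due rather than return False.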

Development

# Run fast tests (no live API calls)
poetry run pytest

# Run all tests including live API calls
poetry run pytest --runslow

# Run a single test
poetry run pytest tests/features/test_forecast.py::TestWeatherForecastFeatures::test_indoor_venue_returns_indoor -v

# Lint
poetry run ruff check .

# Format
poetry run ruff format .

Adding a new feature extractor

  1. Create a new module in src/mlb_ml_lab/features/ (e.g. src/mlb_ml_lab/features/schedule.py).
  2. Subclass FeatureExtractor, implement features and extract.
  3. Decorate with @register.
  4. Import it in src/mlb_ml_lab/features/__init__.py.
  5. It will automatically be discovered by build_feature_matrix().

Data Sources

Source          | Endpoint                            | Key Required             | Notes
MLB Stats API   | statsapi.mlb.com/api/v1/            | No                       | Rate limit ~10 req/s
Baseball Savant | baseballsavant.mlb.com/leaderboard/ | No                       | CSV download, BOM stripping required
NWS API         | api.weather.gov                     | No (User-Agent required) | Free, no key, hourly forecasts
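
The BOM note matters because a CSV that begins with a UTF-8 byte order mark will otherwise carry it into the first column name. Below is a minimal sketch of one way to handle it with the standard library. The function name parse_bom_csv and the sample payload are made up for illustration, and this is not necessarily how the package's client does it.

```python
import csv
import io


def parse_bom_csv(raw: bytes) -> list[dict[str, str]]:
    """Decode a CSV payload, dropping a leading UTF-8 byte order mark if present."""
    # "utf-8-sig" strips the BOM when it is there and behaves like UTF-8 otherwise.
    text = raw.decode("utf-8-sig")
    return list(csv.DictReader(io.StringIO(text)))


# Synthetic payload: the three BOM bytes followed by a tiny CSV.
payload = b"\xef\xbb\xbfplayer_id,value\n1,0.5\n"
rows = parse_bom_csv(payload)

# For comparison, plain UTF-8 decoding keeps the BOM glued to the first header.
naive_header = payload.decode("utf-8").splitlines()[0].split(",")[0]
```

With plain "utf-8" decoding, a lookup such as row["player_id"] would raise KeyError because the key is actually "\ufeffplayer_id".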

License

MIT
