MLB prediction models and data tools
mlb-ml-lab
MLB prediction models — fetch player and team data, build feature matrices, train models, and evaluate hit over/under forecasts.
Features
- No third-party MLB data wrappers (no `pybaseball`, `pybaseballstats`, or `python-mlb-statsapi`). A custom `httpx`-based client wraps `statsapi.mlb.com` and `baseballsavant.mlb.com` directly.
- Typed schemas throughout: `PlayerGameLog`, `TeamInfo`, `RosterPlayer`, etc. are typed dataclasses.
- Disk caching with per-key TTL: avoids hammering the MLB API during development.
- Rate limiting: built-in token bucket (10 req/s).
- Park factors scraped live from Baseball Savant, with static fallbacks.
- NWS weather forecasts: free, no API key, covers every MLB venue.
- Feature engineering pipeline: plugin-based extractors with a registry pattern, designed to be extractable as its own package.
- Walk-forward validation: no random train/test splits, because sports data is temporally dependent.
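To illustrate the walk-forward idea from the last bullet, here is a minimal sketch of an expanding-window splitter. It is not the package's own implementation; the function name `walk_forward_splits` and its parameters are assumptions made for this example. The point it shows is that every training row is strictly earlier than every test row in the same fold.

```python
from datetime import date


def walk_forward_splits(dates, n_folds=3, min_train_days=2):
    """Yield (train_idx, test_idx) pairs for an expanding-window walk-forward split.

    Hypothetical helper, not part of mlb_ml_lab. `dates` holds one date per row.
    """
    days = sorted(set(dates))
    # Days after the initial training window are candidates for testing.
    test_days = days[min_train_days:]
    if len(test_days) < n_folds:
        raise ValueError("not enough distinct dates for the requested folds")
    fold_size = len(test_days) // n_folds
    for k in range(n_folds):
        start = k * fold_size
        # The last fold absorbs any remainder days.
        stop = (k + 1) * fold_size if k < n_folds - 1 else len(test_days)
        fold_days = set(test_days[start:stop])
        first_test_day = test_days[start]
        # Train only on rows strictly before the first test day of this fold.
        train_idx = [i for i, d in enumerate(dates) if d < first_test_day]
        test_idx = [i for i, d in enumerate(dates) if d in fold_days]
        yield train_idx, test_idx


game_dates = [date(2024, 4, d) for d in (1, 2, 3, 4, 5, 6)]
for train_idx, test_idx in walk_forward_splits(game_dates, n_folds=2):
    print(train_idx, test_idx)
```

A random split would let games from later in the season leak into training, which is what this ordering constraint prevents.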
Installation
# Clone the repo
git clone https://github.com/timhollingsworth/mlb-ml-lab
cd mlb-ml-lab
# Install with Poetry
poetry install
Requires Python 3.12+.
Quick Start
Fetch player game logs
from mlb_ml_lab import MlbClient
client = MlbClient()
# Get all teams
teams = client.get_teams()
# Get roster for a team (Angels = 108)
roster = client.get_roster(108)
# Get game logs for a player (Shohei Ohtani = 660271)
logs = client.get_player_game_log(660271, season=2024)
# Each log has typed fields
for log in logs:
    print(log.date, log.hits, log.at_bats)
Fetch game context (venue, weather, datetime)
# Game feed gives you venue, weather, and game datetime
feed = client.get_game_context(778554)
# → {"venue_id": 4, "venue_name": "Rate Field",
# "game_datetime": "2025-03-27T20:10:00Z",
# "weather_condition": "Cloudy", "weather_temp": "68", ...}
Build a feature matrix
from mlb_ml_lab import MlbClient, build_feature_matrix, describe_features, make_targets
client = MlbClient()
# 1. Fetch data
teams = client.get_teams()
logs = client.get_player_game_log(660271, season=2024)
contexts = {778554: client.get_game_context(778554)}
# 2. Assemble features (runs all registered extractors)
matrix = build_feature_matrix(
    logs,
    season=2024,
    teams=teams,
    extra_kwargs={"game_contexts": contexts},
)
# 3. See what features are available
metas = describe_features()
for m in metas:
    print(f"{m.name:40s} {m.source:10s} {m.description}")
# 4. Create target labels
targets = make_targets(logs)
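The exact output of `make_targets()` is not documented here, so the following is only a sketch of how hit over/under labels could be derived from game logs. `GameLine` and `over_under_labels` are hypothetical names invented for this example; the only thing taken from the code above is that a log exposes a `hits` field.

```python
from dataclasses import dataclass


@dataclass
class GameLine:
    """Hypothetical stand-in for a game log; only `hits` is used here."""
    hits: int


def over_under_labels(logs, threshold=0.5):
    """Return 1 where hits exceeded the line and 0 otherwise.

    Illustrative only; the real make_targets() may behave differently.
    """
    return [1 if log.hits > threshold else 0 for log in logs]


sample = [GameLine(hits=0), GameLine(hits=1), GameLine(hits=3)]
print(over_under_labels(sample, threshold=1.5))
```

Half-point thresholds such as 0.5 or 1.5 avoid ties, since hit counts are integers.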
Weather forecast for an upcoming game
from datetime import datetime
from mlb_ml_lab import NwsWeather
nws = NwsWeather()
# Angel Stadium (venue_id=1) at game time
forecast = nws.forecast(1, target_time=datetime(2025, 7, 4, 19, 7))
# → {"temp": 75, "wind_speed": "8 mph", "wind_direction": "SW",
# "precip_pct": 10, "conditions": "Partly Cloudy", "source": "forecast"}
Park factors
from mlb_ml_lab import ParkFactors
pf = ParkFactors()
# Coors Field (venue_id=19) 2024 wOBA factor
factor = pf.factor(19, "wOBA", season=2024)
print(factor) # e.g. 1.11 (11% boost)
Project Structure
mlb-ml-lab/
├── src/
│   └── mlb_ml_lab/
│       ├── data/                 # Data layer (installable)
│       │   ├── client.py         # MlbClient: MLB Stats API + Baseball Savant
│       │   ├── schemas.py        # Typed dataclasses
│       │   ├── cache.py          # DiskCache (JSON, per-key TTL)
│       │   ├── rate_limiter.py   # TokenBucket rate limiter
│       │   ├── parks.py          # ParkFactors (Savant scrape + fallback)
│       │   └── weather.py        # NwsWeather (NWS API, free, no key)
│       └── features/             # Feature engineering (installable)
│           ├── base.py           # FeatureExtractor ABC, registry
│           ├── rolling.py        # Rolling window stats (hits, PA, BABIP)
│           ├── context.py        # Home/away, rest days, park factors, weather
│           ├── matchup.py        # Opponent pitching stats
│           ├── statcast.py       # Statcast advanced metrics
│           ├── forecast.py       # NWS weather forecast features
│           ├── assemble.py       # build_feature_matrix(), describe_features()
│           └── targets.py        # make_targets() for hit thresholds
├── pipeline/                     # Modeling (training, prediction, evaluation)
├── tests/
│   ├── data/                     # Tests for data layer
│   └── features/                 # Tests for feature engineering
├── data/                         # Raw/processed datasets (gitignored)
├── experiments/                  # Notebooks (gitignored)
├── pyproject.toml
├── README.md
├── LICENSE
├── AGENTS.md                     # Dev instructions (AI assistant)
└── ROADMAP.md                    # Build-out plan
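The `rate_limiter.py` entry above refers to a token bucket. For readers unfamiliar with the technique, here is a minimal single-threaded sketch. It is not the package's `TokenBucket` class: the name `SimpleTokenBucket`, its constructor arguments, and the injectable `clock` are assumptions chosen for this example.

```python
import time


class SimpleTokenBucket:
    """Hypothetical token bucket: `rate` tokens per second, up to `capacity`."""

    def __init__(self, rate, capacity, clock=time.monotonic):
        self.rate = rate
        self.capacity = capacity
        self._clock = clock
        self._tokens = float(capacity)  # start with a full bucket
        self._last = clock()

    def _refill(self):
        # Add tokens for the time elapsed since the last check, capped at capacity.
        now = self._clock()
        self._tokens = min(self.capacity, self._tokens + (now - self._last) * self.rate)
        self._last = now

    def try_acquire(self):
        """Take one token if available; return whether the call may proceed."""
        self._refill()
        if self._tokens >= 1.0:
            self._tokens -= 1.0
            return True
        return False

    def acquire(self):
        """Block until a token is available."""
        while not self.try_acquire():
            # Sleep roughly long enough for the missing fraction of a token to refill.
            time.sleep((1.0 - self._tokens) / self.rate)


bucket = SimpleTokenBucket(rate=10, capacity=10)  # about 10 requests per second
bucket.acquire()  # call before each outgoing request
```

The capacity sets how large a burst is allowed, while the rate sets the sustained request ceiling.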
Development
# Run fast tests (no live API calls)
poetry run pytest
# Run all tests including live API calls
poetry run pytest --runslow
# Run a single test
poetry run pytest tests/features/test_forecast.py::TestWeatherForecastFeatures::test_indoor_venue_returns_indoor -v
# Lint
poetry run ruff check .
# Format
poetry run ruff format .
Adding a new feature extractor
- Create a new module in `src/mlb_ml_lab/features/` (e.g. `src/mlb_ml_lab/features/schedule.py`).
- Subclass `FeatureExtractor`, implement `features` and `extract`.
- Decorate with `@register`.
- Import it in `src/mlb_ml_lab/features/__init__.py`.
- It will automatically be discovered by `build_feature_matrix()`.
Data Sources
| Source | Endpoint | Key Required | Notes |
|---|---|---|---|
| MLB Stats API | `statsapi.mlb.com/api/v1/` | No | Rate limit ~10 req/s |
| Baseball Savant | `baseballsavant.mlb.com/leaderboard/` | No | CSV download, BOM stripping required |
| NWS API | `api.weather.gov` | No (User-Agent required) | Free, no key, hourly forecasts |
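The BOM note in the table matters because a leading byte-order mark otherwise ends up glued to the first column name. A minimal sketch of one way to handle it with the standard library follows, assuming the download is UTF-8 encoded. The function name `parse_bom_csv` and the sample column names are made up for illustration and are not Savant's actual schema.

```python
import csv
import io


def parse_bom_csv(raw: bytes):
    """Decode CSV bytes, dropping a leading UTF-8 BOM if one is present.

    Hypothetical helper; assumes the payload is UTF-8.
    """
    # The "utf-8-sig" codec strips the BOM when present and is a no-op otherwise.
    text = raw.decode("utf-8-sig")
    return list(csv.DictReader(io.StringIO(text)))


# Made-up sample payload with a BOM in front of the header row.
sample = b"\xef\xbb\xbfvenue_id,woba\r\n19,1.11\r\n"
rows = parse_bom_csv(sample)
print(list(rows[0].keys()))
```

Decoding with plain `utf-8` instead would leave the first header as `"\ufeffvenue_id"`, so lookups by column name would fail.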
License
MIT