A comprehensive Python package for scraping and analyzing NHL data with built-in Expected Goals (xG) modeling

ScraperNHL

Scrape and analyze hockey data from 6 leagues with one unified API.

ScraperNHL provides play-by-play events, player stats, schedules, rosters, and standings for the NHL, AHL, PWHL, OHL, WHL, and QMJHL — all returned as pandas DataFrames, all from the same interface.

NHL support goes further with an advanced analytics pipeline: time-on-ice matrices, shift-level analysis, on-ice shot/Corsi/Fenwick stats, and per-60 rates.
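
For readers new to these terms, here is a minimal sketch of how Corsi, Fenwick, and per-60 rates are conventionally defined. The event names and the plain-list input are assumptions made for illustration; they are not ScraperNHL's actual schema or implementation.

```python
# Illustrative only: event names are assumptions, not ScraperNHL's schema.
# Corsi counts every shot attempt; Fenwick counts unblocked attempts only.
CORSI_EVENTS = {"goal", "shot-on-goal", "missed-shot", "blocked-shot"}
FENWICK_EVENTS = CORSI_EVENTS - {"blocked-shot"}


def per_60(count: float, toi_seconds: float) -> float:
    """Scale a raw count to a rate per 60 minutes of ice time."""
    if toi_seconds <= 0:
        return 0.0
    return count * 3600.0 / toi_seconds


def shot_attempt_summary(events: list[str], toi_seconds: float) -> dict[str, float]:
    """Count Corsi and Fenwick events and convert both to per-60 rates."""
    corsi = sum(1 for e in events if e in CORSI_EVENTS)
    fenwick = sum(1 for e in events if e in FENWICK_EVENTS)
    return {
        "corsi": corsi,
        "fenwick": fenwick,
        "corsi_per_60": per_60(corsi, toi_seconds),
        "fenwick_per_60": per_60(fenwick, toi_seconds),
    }


events = ["shot-on-goal", "blocked-shot", "faceoff", "missed-shot", "goal", "hit"]
summary = shot_attempt_summary(events, toi_seconds=900)  # 15 minutes on ice
print(summary)
```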


Supported Leagues

League                              Key    Season format  Current season
National Hockey League              nhl    YYYYYYYY       20252026
American Hockey League              ahl    integer        90
Professional Women's Hockey League  pwhl   integer        8
Ontario Hockey League               ohl    integer        83
Western Hockey League               whl    integer        289
Quebec Major Junior Hockey League   qmjhl  integer        211
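
The NHL season value joins two consecutive years, while the other leagues use opaque integer IDs that have to be looked up. A small sketch of building and checking the NHL form follows; both helper names are hypothetical and not part of ScraperNHL, and the "consecutive years" rule is inferred from the examples in this README.

```python
def nhl_season_id(start_year: int) -> int:
    """Build the 8-digit NHL season value, e.g. 2025 -> 20252026."""
    return start_year * 10000 + (start_year + 1)


def looks_like_nhl_season(value: int) -> bool:
    """True when value is YYYYYYYY and the second year follows the first."""
    start, end = divmod(value, 10000)
    return 1000 <= start <= 9999 and end == start + 1


print(nhl_season_id(2025))
print(looks_like_nhl_season(20232024))  # NHL-style value
print(looks_like_nhl_season(90))        # AHL-style integer ID
```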

Installation

pip install scrapernhl

From source (latest dev):

git clone https://github.com/maxtixador/scrapernhl.git
cd scrapernhl
pip install -e .

Requirements: Python 3.10+, pandas, numpy, requests, beautifulsoup4, selectolax


Two Ways to Use It

1. Functional API — one-liners for everything

from scrapernhl import scrape

# Play-by-play — works for all 6 leagues
pbp = scrape('nhl',   'pbp', game_id=2023020001)
pbp = scrape('ahl',   'pbp', game_id=1027781)
pbp = scrape('qmjhl', 'pbp', game_id=31909)
pbp = scrape('ohl',   'pbp', game_id=28150)
pbp = scrape('whl',   'pbp', game_id=1022126)
pbp = scrape('pwhl',  'pbp', game_id=210)

# Player stats
skaters = scrape('ahl',   'stats', season=90, position='skaters')
goalies = scrape('ohl',   'stats', season=83, position='goalies')
skaters = scrape('nhl',   'stats', team='MTL', season=20232024, position='skaters')  # NHL needs a team

# Schedule, roster, standings
schedule  = scrape('whl',  'schedule',  season=289)
schedule  = scrape('nhl',  'schedule',  team='MTL', season=20232024)  # NHL needs a team
roster    = scrape('nhl',  'roster',    team='MTL', season=20232024)
standings = scrape('qmjhl','standings', season=211)
standings = scrape('nhl',  'standings', season=20232024)

# Teams and seasons
teams   = scrape('nhl', 'teams')              # active NHL teams
teams   = scrape('ahl', 'teams', season=90)   # AHL teams for a season
seasons = scrape('ahl', 'seasons')

2. Object-Oriented API — more control

from scrapernhl import HockeyScraper

s = HockeyScraper('ahl')

pbp      = s.play_by_play(game_id=1027781)
skaters  = s.player_stats(season=90, position='skaters')
goalies  = s.player_stats(season=90, position='goalies')
schedule = s.schedule(season=90)               # team='all' by default for non-NHL
roster   = s.roster(team='390', season=90)     # team ID from bootstrap data
standing = s.standings(season=90)
teams    = s.teams_by_season(season=90)
seasons  = s.seasons('all')                    # 'all', 'regular', or 'playoff'

# Convenience aliases — same result, different names
s.scrape_pbp(game_id=1027781)
s.scrape_skaters()
s.scrape_goalies()
s.scrape_schedule()
s.scrape_roster(team='390')
s.scrape_standings()

# Scrape multiple games and get one concatenated DataFrame
df = s.scrape_multiple_games([1027781, 1027779])

League Metadata (non-NHL)

Bootstrap data is fetched automatically when you create a non-NHL scraper. Use it to look up valid team IDs and season IDs before making other calls.

s = HockeyScraper('ahl')

s.teams                          # list of team dicts
s.current_season_id              # '90'
s.get_teams(include_all=False)   # excludes the "All Teams" placeholder
s.get_team_by_id('390')          # dict with id, name, team_code, logo, ...
s.get_team_by_code('ABB')
s.get_seasons('regular')         # list of season dicts; also 'playoff', 'all'
s.get_current_season()           # dict for the current season
s.get_conferences()
s.get_divisions()
s.get_positions()
s.get_league_metadata()          # league name, short_name, code, logo
s.is_playoffs_active()           # True during playoff season
s.is_bilingual()                 # True for QMJHL (has French translations)

# Raw bootstrap dict
data = s.bootstrap(season='90', page_name='scorebar')

NHL-Specific Methods

The following are only available on HockeyScraper('nhl') and raise NotImplementedError for other leagues.

Play-by-Play Sources

nhl = HockeyScraper('nhl')

# Three different PBP sources for the same game
json_pbp = nhl.scrape_plays(2023020001)    # JSON API — fastest
html_pbp = nhl.html_pbp(2023020001)        # HTML report — includes faceoff zone, shot type
full_pbp = nhl.scrape_game(2023020001)     # Merged pipeline (HTML + JSON) — most complete

# Raw dict from the JSON API
data = nhl.get_game_data(2023020001)

# With include_tuple=True, scrape_game returns a GameResult namedtuple
# (pbp_df, shifts_df, html_pbp_df, home_team, away_team)
result = nhl.scrape_game(2023020001, include_tuple=True)
pbp, shifts, html, home, away = result
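
Because the result is a namedtuple, it can be unpacked positionally or read by field name. The sketch below uses a stand-in class with the field names listed above and placeholder strings instead of DataFrames, so it runs offline; the real GameResult may differ in detail.

```python
from collections import namedtuple

# Stand-in for illustration; the real class is defined inside ScraperNHL.
GameResult = namedtuple(
    "GameResult",
    ["pbp_df", "shifts_df", "html_pbp_df", "home_team", "away_team"],
)

# Placeholder values stand in for the DataFrames and team codes.
result = GameResult("pbp", "shifts", "html", "HOME", "AWAY")

# Positional unpacking: order matters.
pbp, shifts, html, home, away = result

# Attribute access: order does not matter, and unused fields can be ignored.
print(result.home_team, result.away_team)
```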

Shifts, Stats, Standings

shifts = nhl.shifts(2023020001)

nhl.team_stats(team='MTL', season=20232024, session=2, goalies=False)
# session: 1=preseason, 2=regular season, 3=playoffs

nhl.standings_by_date('2024-01-15')
nhl.standings_by_date()           # defaults to Jan 1 of the previous year
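
The same session codes appear inside NHL game IDs: by the league's usual convention, 2023020001 reads as season start 2023, session 02, game 0001. The helper below is a hypothetical sketch of that layout, which is an assumption about the NHL's ID scheme rather than something this package documents.

```python
SESSION_NAMES = {1: "preseason", 2: "regular season", 3: "playoffs"}


def split_nhl_game_id(game_id: int) -> dict:
    """Split a 10-digit NHL game ID into season start, session, and game number."""
    text = str(game_id)
    if len(text) != 10 or not text.isdigit():
        raise ValueError(f"expected a 10-digit game ID, got {game_id!r}")
    session = int(text[4:6])
    return {
        "season_start": int(text[:4]),
        "session": session,
        "session_name": SESSION_NAMES.get(session, "other"),
        "game_number": int(text[6:]),
    }


parts = split_nhl_game_id(2023020001)
print(parts)
```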

Teams and Draft

# Three team data sources
nhl.scrape_teams(source='calendar')    # active teams from the schedule calendar
nhl.scrape_teams(source='franchise')   # franchise list with first/last season
nhl.scrape_teams(source='records')     # records API — includes logos, conference, division

# Draft
nhl.draft(year=2024, round='all')      # all rounds
nhl.draft(year=2023, round=1)          # single round
nhl.draft_records(year=2024)           # records API — more player detail
nhl.team_draft_history(franchise=1)    # all picks for one franchise (1 = NJD)

NHL Analytics Pipeline

scrape_game is the starting point. It merges HTML and JSON PBP into one enriched DataFrame with on-ice player lists, strength state, zone starts, and shot coordinates.

nhl = HockeyScraper('nhl')

# Step 1: Get game data
pbp    = nhl.scrape_game(2023020001)
shifts = nhl.shifts(2023020001)

# Step 2: Player-by-second matrix and strength states
matrix    = nhl.seconds_matrix(pbp, shifts)
strengths = nhl.strengths_by_second(matrix)

# Step 3: Time-on-ice by strength
toi = nhl.toi_by_strength_all(matrix, strengths)
toi = nhl.toi_by_strength_all(matrix, strengths, in_seconds=True)

# Step 4: Pairwise shared TOI
teammates = nhl.shared_toi_teammates(matrix, strengths)
opponents = nhl.shared_toi_opponents(matrix, strengths)

# Step 5: On-ice shot/goal stats
player_stats = nhl.on_ice_stats(pbp)
player_stats = nhl.on_ice_stats(pbp, include_goalies=True, rates=True)  # per-60 rates

# Combination stats (e.g. all 2-player pairs for MTL)
combos = nhl.combo_on_ice_stats(pbp, focus_team='MTL', n_team=2, m_opp=0)

# Team-level aggregates by strength state
team_agg = nhl.team_strength_aggregates(pbp, rates=True)

# On-ice player columns: choose long (tidy) or wide (numbered) format
long_df = nhl.build_on_ice_long(pbp)
wide_df = nhl.build_on_ice_wide(pbp, max_skaters=6, include_goalie=True)

# Shift events table (ON/OFF events from the shifts DataFrame)
shift_events = nhl.build_shifts_events(shifts)
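
To make the player-by-second idea concrete, here is a minimal pure-Python sketch of turning shifts into per-second presence and then into individual and shared time on ice. The shift tuples and function names are hypothetical, and strength states are ignored; this is not how ScraperNHL implements seconds_matrix or the shared_toi_* methods.

```python
from itertools import combinations

# Hypothetical shift records: (player, team, start_second, end_second)
shifts = [
    ("A", "MTL", 0, 40),
    ("B", "MTL", 20, 60),
    ("C", "TOR", 10, 50),
]


def seconds_on_ice(shifts):
    """Map each player to the set of game seconds they were on the ice."""
    on_ice = {}
    for player, _team, start, end in shifts:
        on_ice.setdefault(player, set()).update(range(start, end))
    return on_ice


def shared_toi(shifts):
    """Return (teammates, opponents): overlapping seconds for each player pair."""
    on_ice = seconds_on_ice(shifts)
    team_of = {player: team for player, team, _start, _end in shifts}
    teammates, opponents = {}, {}
    for p, q in combinations(sorted(on_ice), 2):
        overlap = len(on_ice[p] & on_ice[q])
        target = teammates if team_of[p] == team_of[q] else opponents
        target[(p, q)] = overlap
    return teammates, opponents


toi = {player: len(secs) for player, secs in seconds_on_ice(shifts).items()}
teammates, opponents = shared_toi(shifts)
print(toi, teammates, opponents)
```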

Command-Line Interface

# Play-by-play
scrapernhl ahl   game 1027781              --output game.csv
scrapernhl game  2023020001               --output nhl_game.json

# Player stats (non-NHL)
scrapernhl ahl   stats --season 90 --player-type skater  --output stats.csv
scrapernhl ohl   stats --season 83 --player-type goalie  --output goalies.json

# NHL player stats (top-level command, requires team + season)
scrapernhl stats MTL 20252026            --output mtl_skaters.csv
scrapernhl stats MTL 20252026 --goalies  --output mtl_goalies.csv

# Schedule
scrapernhl whl   schedule --season 289   --output schedule.csv
scrapernhl schedule MTL 20252026         --output nhl_schedule.csv

# Standings
scrapernhl standings                     --output standings.csv
scrapernhl qmjhl standings --season 211  --output standings.json

scrapernhl --help
scrapernhl ahl --help

Important Behavior Notes

NHL player_stats and schedule require a team tricode. The NHL API serves data per-team, not league-wide. Pass team='MTL', team='TOR', etc. Non-NHL leagues default to team='all' for league-wide data.
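
Since NHL data comes back one team at a time, a league-wide table has to be assembled by looping over tricodes. The sketch below shows that pattern with an offline stand-in fetcher; collect_league and fake_fetch are hypothetical names, and rows are plain dicts rather than DataFrames.

```python
from typing import Callable, Iterable


def collect_league(
    fetch_team: Callable[[str], list[dict]],
    tricodes: Iterable[str],
) -> list[dict]:
    """Call fetch_team once per tricode and tag each row with its team."""
    rows = []
    for code in tricodes:
        for row in fetch_team(code):
            rows.append({**row, "team": code})
    return rows


def fake_fetch(code: str) -> list[dict]:
    """Offline stand-in for a per-team request; returns one dummy row."""
    return [{"player": f"{code}-skater-1", "goals": 1}]


league = collect_league(fake_fetch, ["MTL", "TOR"])
print(league)
```

In real use, fetch_team would presumably wrap a call such as scrape('nhl', 'stats', team=code, season=20232024, position='skaters'), with pandas.concat taking the place of the list building.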

Bootstrap data is fetched on init for non-NHL leagues. The first call to HockeyScraper('ahl') makes one network request to get teams, seasons, and configuration. Subsequent calls use the cached data.

Caching is automatic and disk-based.

Data type      Cache TTL
Play-by-play   None (always fresh)
Schedule       1 hour
Player stats   1 hour
Standings      30 minutes
Roster         24 hours

Running Tests

# Integration tests — require a network connection
pytest tests/test_client.py -v

# Run only a specific class
pytest tests/test_client.py::TestNHLAnalytics -v
pytest tests/test_client.py::TestPlayByPlay -v

717 tests cover all 6 leagues across: instantiation, bootstrap accessors, play-by-play, player stats (skaters + goalies), schedules, rosters, standings, teams, seasons, batch scraping, all NHL-specific methods, the full analytics pipeline, and the scrape() functional API.


Project Structure

scrapernhl/
├── __init__.py         # Public API: HockeyScraper, scrape()
├── client.py           # Unified HockeyScraper class (~900 lines)
├── config.py           # League configs, API keys, cache TTLs
├── urls.py             # URL builders for every league/endpoint
├── parsers.py          # Extract records from raw API responses
├── transform.py        # Normalize coordinates, events, times
├── enrichment.py       # Add team names, season metadata (non-NHL)
├── utils.py            # Rate limiter, disk cache, HTTP session
├── cli.py              # Click-based CLI
└── nhl/
    ├── scraper_legacy.py   # Full NHL pipeline: HTML PBP, shifts, TOI
    ├── analytics.py        # Advanced analytics (Corsi, scoring chances, zone starts)
    └── scrapers/           # Modular per-endpoint scrapers

Contributing

Bug reports and pull requests are welcome at https://github.com/maxtixador/scrapernhl.

License

MIT

Author

Max Tixador @woumaxx · @HabsBrain.com · maxtixador@gmail.com

