Python Package for scraping NHL Play-by-Play and Shift data.

Please upgrade to version 1.40 or higher, as earlier versions won't work.

Hockey-Scraper

Purpose

Scrape NHL data from the NHL API and website. This includes the Play by Play and Shift data for each game, as well as the schedule information. It currently supports all preseason, regular season, and playoff games from the 2007-2008 season onwards.

Prerequisites

You will need Python installed for this. It should work for both Python 2.7 and 3. I recommend at least version 3.6.0, but earlier versions should be fine.

Installation

To install, open up your terminal and run:

pip install hockey_scraper

NHL Usage

The full documentation can be found here.

Standard Scrape Functions

Scrape data on a season-by-season level:

import hockey_scraper

# Scrapes the 2015 & 2016 seasons with shifts and stores the data in a CSV file
hockey_scraper.scrape_seasons([2015, 2016], True)

# Scrapes the 2008 season without shifts and returns a dictionary containing the pbp Pandas DataFrame
scraped_data = hockey_scraper.scrape_seasons([2008], False, data_format='Pandas')

Scrape a list of games:

import hockey_scraper

# Scrapes the first game of the 2014, 2015, and 2016 seasons with shifts and stores the data in a CSV file
hockey_scraper.scrape_games([2014020001, 2015020001, 2016020001], True)

# Scrapes the first game of the 2007, 2008, and 2009 seasons with shifts and returns a dictionary with the Pandas DataFrames
scraped_data = hockey_scraper.scrape_games([2007020001, 2008020001, 2009020001], True, data_format='Pandas')

Scrape all games in a given date range:

import hockey_scraper

# Scrapes all games between 2016-10-10 and 2016-10-20 without shifts and stores the data in a CSV file
hockey_scraper.scrape_date_range('2016-10-10', '2016-10-20', False)

# Scrapes all games between 2015-1-1 and 2015-1-15 without shifts and returns a dictionary with the pbp Pandas DataFrame
scraped_data = hockey_scraper.scrape_date_range('2015-1-1', '2015-1-15', False, data_format='Pandas')

The dictionary returned by setting the keyword argument data_format to 'Pandas' is structured like:

{
  # Both of these are always included
  'pbp': pbp_df,

  # This is only included when the argument 'if_scrape_shifts' is set to True
  'shifts': shifts_df
}
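
Once scraped, these are ordinary Pandas DataFrames and can be worked with directly. For example (a minimal sketch, assuming a game scraped with shifts as above):

import hockey_scraper

# Scrape one game with shifts and get the DataFrames back
scraped_data = hockey_scraper.scrape_games([2016020001], True, data_format='Pandas')

pbp_df = scraped_data['pbp']
shifts_df = scraped_data['shifts']  # Only present because shifts were scraped

# Inspect what came back
print(pbp_df.shape)
print(shifts_df.head())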

Schedule

The schedule for any past or future games can be scraped as follows:

import hockey_scraper

# As opposed to the other calls, the default data_format is 'Pandas', which returns a DataFrame
sched_df = hockey_scraper.scrape_schedule("2019-10-01", "2020-07-01")

The columns returned are: ['game_id', 'date', 'venue', 'home_team', 'away_team', 'start_time', 'home_score', 'away_score', 'status']
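
Since the schedule comes back as a regular DataFrame, it can be filtered like any other. For example (a minimal sketch; the 'TOR' abbreviation below is an assumption about how teams are encoded):

import hockey_scraper

sched_df = hockey_scraper.scrape_schedule("2019-10-01", "2020-07-01")

# Games where Toronto is the home team ('TOR' is an assumed abbreviation)
tor_home = sched_df[sched_df['home_team'] == 'TOR']
print(tor_home[['game_id', 'date', 'away_team']])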

Persistent Data

All the raw game data files retrieved can also be saved to your disk. This allows for faster rescraping (we don’t need to re-retrieve them) and the ability to parse the data yourself.

This is achieved by setting the keyword argument docs_dir=True, which stores the data in a directory called ~/hockey_scraper_data. You can also provide your own directory where you want everything stored (it must exist beforehand). By default, docs_dir=False.

For example, let's say we are scraping the JSON PBP data for game 2019020001. If docs_dir isn't False, the scraper first checks whether the data is already in the directory. If so, it loads the data from that file and doesn't make a GET request to the NHL API. If it doesn't exist, it makes the GET request and then saves the output to the directory. This ensures that the next time you request that data it can be loaded from a file.

Here are some examples.

The default saving location is ~/hockey_scraper_data.

# Will create a directory called 'hockey_scraper_data' in the home directory (if it doesn't exist)
hockey_scraper.scrape_seasons([2015, 2016], True, docs_dir=True)

User-defined directory:

USER_PATH = "/...."
hockey_scraper.scrape_seasons([2015, 2016], True, docs_dir=USER_PATH)

You can overwrite the existing files by specifying rescrape=True. This retrieves all the files from the source again and saves the newer versions to docs_dir.

hockey_scraper.scrape_seasons([2015, 2016], True, docs_dir=USER_PATH, rescrape=True)
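
To verify what was cached, you can simply list the directory's contents (a minimal sketch; the exact file names and layout inside docs_dir are an assumption and may vary by version):

import os

# The default cache location mentioned above
docs_dir = os.path.expanduser("~/hockey_scraper_data")

# Print whatever raw files were saved during scraping
for name in sorted(os.listdir(docs_dir)):
    print(name)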

Live Scraping

Here is a simple example of how to set up live scraping. I strongly suggest checking out this section of the docs if you plan on using it.

import hockey_scraper as hs


def to_csv(game):
    """
    Store each game DataFrame in a file

    :param game: LiveGame object

    :return: None
    """

    # If the game:
    # 1. Has started - we recorded at least one event
    # 2. Is not in intermission
    # 3. Is not over
    if game.is_ongoing():
        # Print the description of the last event
        print(game.game_id, "->", game.pbp_df.iloc[-1]['Description'])

        # Store in CSV files
        game.pbp_df.to_csv(f"../hockey_scraper_data/{game.game_id}_pbp.csv", sep=',')
        game.shifts_df.to_csv(f"../hockey_scraper_data/{game.game_id}_shifts.csv", sep=',')

if __name__ == "__main__":
    # Before we start, set the directory to store the files
    # You don't have to do this, but I recommend it
    hs.live_scrape.set_docs_dir("../hockey_scraper_data")

    # Scrape the info for all the games on 2018-11-15
    games = hs.ScrapeLiveGames("2018-11-15", if_scrape_shifts=True, pause=20)

    # While all the games aren't finished
    while not games.finished():
        # Update for all the games currently being played
        games.update_live_games(sleep_next=True)

        # Go through every LiveGame object and apply some function
        # You can of course do whatever you want here.
        for game in games.live_games:
            to_csv(game)

Contact

Please contact me with any issues or suggestions. For any bugs or anything related to the code, please open an issue. Otherwise, you can email me at Harryshomer@gmail.com.
