CBBpy: A Python-based web scraper for NCAA basketball
Purpose
This package is designed to bridge the gap between data and analysis for NCAA D1 basketball. CBBpy can grab play-by-play, boxscore, and other game metadata for any NCAA D1 men's or women's basketball game. Inspired by the ncaahoopR package by Luke Benz - check that out if you are an R user!
Installation and import
CBBpy requires Python >= 3.9 as well as the following packages:
- pandas>=2.0.0
- numpy>=2.0.0
- python-dateutil>=2.4.0
- pytz>=2022.1
- tqdm>=4.63.0
- lxml>=4.9.0
- joblib>=1.0.0
- beautifulsoup4>=4.11.0
- requests>=2.27.0
- rapidfuzz>=3.9.0
- platformdirs>=4.0.0
Install using pip:
pip install cbbpy
Or upgrade an existing installation:
pip install --upgrade cbbpy
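The men's and women's scrapers live in separate modules. Both lines below bind the alias s, so import whichever one you need (or give each its own alias to use both at once):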
import cbbpy.mens_scraper as s
import cbbpy.womens_scraper as s
Functions available in CBBpy
NOTE: wherever a function takes a game ID, CBBpy expects a valid ESPN game ID.
s.get_game_info(game_id: Union[str, int]) grabs all the metadata (game date, time, score, teams, referees, etc.) for a particular game.
s.get_game_boxscore(game_id: Union[str, int]) returns a pandas DataFrame with each player's stats for a particular game.
s.get_game_pbp(game_id: Union[str, int]) scrapes the play-by-play tables for a game and returns a pandas DataFrame, with each entry representing a play made during the game.
s.get_game(game_id: Union[str, int], info: bool = True, box: bool = True, pbp: bool = True) gets all information about a game (game info, boxscore, PBP) and returns a tuple of results (game_info, boxscore, pbp). The booleans info, box, and pbp can be set to False to skip any part you do not want to scrape. For example, box = False returns an empty DataFrame for the boxscore while still scraping the PBP and metadata normally.
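As a rough illustration of these flags, the sketch below wraps get_game so that the boxscore is skipped. fetch_without_boxscore is a hypothetical helper, not part of CBBpy; it takes the scraper module as an argument and relies only on the documented (game_info, boxscore, pbp) return order.

```python
def fetch_without_boxscore(scraper, game_id):
    """Scrape metadata and play-by-play for one game, skipping the boxscore.

    `scraper` is expected to be cbbpy.mens_scraper or cbbpy.womens_scraper
    (any object exposing a compatible get_game also works).
    """
    # get_game returns (game_info, boxscore, pbp). With box=False the
    # boxscore slot is documented to be an empty DataFrame, so it is dropped.
    game_info, _empty_box, pbp = scraper.get_game(game_id, box=False)
    return game_info, pbp


# Typical use (requires cbbpy and network access):
#   import cbbpy.mens_scraper as s
#   info, pbp = fetch_without_boxscore(s, "401522202")
```

Passing the module in, rather than importing it inside the helper, lets the same function serve both the men's and women's scrapers.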
s.get_games_season(season: Union[str, int] = None, info: bool = True, box: bool = True, pbp: bool = True) scrapes all game information for all finished or in progress games in a particular season (defaults to the current season). As an example, to scrape games for the 2020-21 season, call get_games_season(2021). Returns a tuple of 3 DataFrames, similar to get_game. See get_game for an explanation of booleans info, box, pbp.
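The season argument appears to be the calendar year in which the season ends (2021 for 2020-21). The sketch below spells out that convention; season_label is a hypothetical helper written for this README, not a CBBpy function.

```python
def season_label(season):
    """Turn a season argument such as 2021 into a label such as '2020-21'."""
    end_year = int(season)  # the argument may be given as a str or an int
    return f"{end_year - 1}-{end_year % 100:02d}"


print(season_label(2021))    # 2020-21
print(season_label("2018"))  # 2017-18
```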
s.get_games_range(start_date: Union[str, datetime], end_date: Union[str, datetime], info: bool = True, box: bool = True, pbp: bool = True) scrapes all game information for all finished or in progress games between start_date and end_date (inclusive). As an example, to scrape games from November 30, 2022 to December 10, 2022, call get_games_range('11-30-2022', '12-10-2022'). Returns a tuple of 3 DataFrames, similar to get_game. See get_game for an explanation of booleans info, box, pbp.
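To make "inclusive" concrete, the sketch below lists every date covered by the example range. dates_inclusive is a hypothetical helper, and the MM-DD-YYYY format is assumed from the example call; CBBpy's own date parsing may accept other formats.

```python
from datetime import datetime, timedelta


def dates_inclusive(start_date, end_date, fmt="%m-%d-%Y"):
    """Return every calendar date from start_date to end_date, both included."""
    start = datetime.strptime(start_date, fmt).date()
    end = datetime.strptime(end_date, fmt).date()
    if end < start:
        raise ValueError("end_date must not be before start_date")
    n_days = (end - start).days + 1  # +1 because both endpoints count
    return [start + timedelta(days=i) for i in range(n_days)]


days = dates_inclusive("11-30-2022", "12-10-2022")
print(len(days))          # 11
print(days[0], days[-1])  # 2022-11-30 2022-12-10
```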
s.get_games_team(team: str, season: Union[str, int] = None, info: bool = True, box: bool = True, pbp: bool = True) scrapes all game information for all finished or in progress games in a particular season (defaults to the current season) for a given team. As an example, to scrape games for Duke's 2020-21 season, call get_games_team('duke', 2021); for their current season, you can just call get_games_team('duke'). If a given team does not have an exact match in the static list of teams scraped from ESPN's site, this function will scrape the games for the closest fuzzy-matched team (e.g. if "valpo" is provided as the team, the function will scrape the games for "Valparaiso"). Returns a tuple of 3 DataFrames, similar to get_game. See get_game for an explanation of booleans info, box, pbp.
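The exact-match-then-fuzzy-match behaviour can be sketched as follows. This is only an illustration: it uses the standard library's difflib as a stand-in, the TEAMS list is a made-up sample, and resolve_team is a hypothetical name. CBBpy lists rapidfuzz as a dependency, so its real scoring may well differ.

```python
from difflib import get_close_matches

# Hypothetical stand-in for the static team list that CBBpy ships with.
TEAMS = ["Duke", "Davidson", "Valparaiso", "Villanova"]


def resolve_team(query, teams=TEAMS):
    """Return the exact match for `query` if there is one, else the closest name."""
    by_key = {name.lower(): name for name in teams}
    key = query.strip().lower()
    if key in by_key:
        return by_key[key]  # exact (case-insensitive) match
    # cutoff=0.0 means the closest candidate is always returned,
    # however weak the similarity is.
    best = get_close_matches(key, list(by_key), n=1, cutoff=0.0)
    return by_key[best[0]]


print(resolve_team("duke"))   # Duke
print(resolve_team("valpo"))  # Valparaiso
```

Because a fuzzy match always returns something, it is worth checking which team was actually resolved before trusting the scraped data.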
s.get_games_conference(conference: str, season: Union[str, int] = None, info: bool = True, box: bool = True, pbp: bool = True) scrapes all game information for all finished or in progress games in a particular season (defaults to the current season) for all teams in a given conference. As an example, to scrape games for the A10's 2017-18 season, call get_games_conference('a10', 2018); for their current season, you can just call get_games_conference('a10'). If a given conference does not have an exact match in the static list of conferences scraped from ESPN's site, this function will scrape the games for the closest fuzzy-matched conference (e.g. if "am east" is provided as the conference, the function will scrape the games for "America East Conference"). Returns a tuple of 3 DataFrames, similar to get_game. See get_game for an explanation of booleans info, box, pbp.
s.get_game_ids(date: Union[str, datetime]) returns a list of all game IDs for a particular date.
s.get_player_info(player_id: Union[str, int]) returns a DataFrame describing the player's info from ESPN's bio page.
s.get_teams_from_conference(conference: str, season: Union[str, int] = None) returns a list of the teams in the given conference for a season (defaults to the current season).
s.get_team_schedule(team: str, season: Union[str, int] = None) returns a DataFrame of a team's schedule for a given season (defaults to the current season). If a given team does not have an exact match in the static list of teams scraped from ESPN's site, this function will scrape the schedule for the closest fuzzy-matched team (e.g. if "valpo" is provided as the team, the function will scrape the schedule for "Valparaiso").
s.get_conference_schedule(conference: str, season: Union[str, int] = None) returns a DataFrame of the schedules for all teams in a given conference for a given season (defaults to the current season). If a given conference does not have an exact match in the static list of conferences scraped from ESPN's site, this function will scrape the schedules for the closest fuzzy-matched conference (e.g. if "am east" is provided as the conference, the function will scrape the schedules for "America East Conference").
Examples
Function call:
import cbbpy.mens_scraper as s
s.get_game_info('401522202')
Returns:
| | game_id | home_team | home_id | home_rank | home_record | home_score | away_team | away_id | away_rank | away_record | away_score | home_win | num_ots | is_conference | is_neutral | is_postseason | tournament | game_day | game_time | game_loc | arena | arena_capacity | attendance | tv_network | referee_1 | referee_2 | referee_3 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 401522202 | UConn Huskies | 41 | 4 | 31-8 | 76 | San Diego State Aztecs | 21 | 5 | 32-7 | 59 | True | 0 | False | True | True | Men's Basketball Championship - National Championship | April 03, 2023 | 06:20 PM PDT | Houston, TX | NRG Stadium | 0 | 72423 | CBS | Ron Groover | Terry Oglesby | Keith Kimble |
Function call:
import cbbpy.womens_scraper as s
s.get_game_boxscore('401528028')
Returns (partially):
| | game_id | team | player | player_id | position | starter | min | fgm | fga | 2pm | 2pa | 3pm | 3pa | ftm | fta | oreb | dreb | reb | ast | stl | blk | to | pf | pts |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 401528028 | LSU Tigers | A. Reese | 4433402 | F | True | 29 | 5 | 12 | 5 | 12 | 0 | 0 | 5 | 8 | 6 | 4 | 10 | 5 | 3 | 1 | 0 | 3 | 15 |
| 1 | 401528028 | LSU Tigers | L. Williams | 4280886 | F | True | 37 | 9 | 16 | 9 | 16 | 0 | 0 | 2 | 2 | 1 | 4 | 5 | 0 | 3 | 0 | 3 | 4 | 20 |
| 2 | 401528028 | LSU Tigers | F. Johnson | 4698736 | G | True | 37 | 4 | 11 | 3 | 7 | 1 | 4 | 1 | 1 | 2 | 5 | 7 | 4 | 1 | 0 | 4 | 1 | 10 |
| 3 | 401528028 | LSU Tigers | K. Poole | 4433418 | G | True | 24 | 2 | 3 | 0 | 1 | 2 | 2 | 0 | 2 | 0 | 3 | 3 | 1 | 0 | 1 | 1 | 2 | 6 |
| 4 | 401528028 | LSU Tigers | A. Morris | 4281251 | G | True | 33 | 8 | 14 | 7 | 11 | 1 | 3 | 4 | 4 | 1 | 1 | 2 | 9 | 1 | 0 | 2 | 3 | 21 |
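The shooting columns in the sample rows above are internally consistent, which gives a cheap sanity check on scraped boxscores. The sketch below copies a few columns from those rows into plain tuples (no pandas needed) and checks that fgm = 2pm + 3pm and pts = 2·2pm + 3·3pm + ftm; points_from_makes is a hypothetical helper.

```python
# Columns copied from the sample boxscore: (player, fgm, fga, 2pm, 3pm, ftm, pts)
rows = [
    ("A. Reese", 5, 12, 5, 0, 5, 15),
    ("L. Williams", 9, 16, 9, 0, 2, 20),
    ("F. Johnson", 4, 11, 3, 1, 1, 10),
    ("K. Poole", 2, 3, 0, 2, 0, 6),
    ("A. Morris", 8, 14, 7, 1, 4, 21),
]


def points_from_makes(two_pm, three_pm, ftm):
    """Points implied by made two-pointers, three-pointers, and free throws."""
    return 2 * two_pm + 3 * three_pm + ftm


for player, fgm, fga, two_pm, three_pm, ftm, pts in rows:
    # Made field goals split into twos and threes.
    assert fgm == two_pm + three_pm
    # The points column should match the makes.
    assert pts == points_from_makes(two_pm, three_pm, ftm)
    print(f"{player}: FG% = {fgm / fga:.3f}")
```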
Function call:
import cbbpy.mens_scraper as s
s.get_game_pbp('401522202')
Returns (partially):
| | game_id | home_team | away_team | play_desc | home_score | away_score | half | secs_left_half | secs_left_reg | play_team | play_type | shooting_play | scoring_play | is_three | shooter | is_assisted | assist_player | shot_x | shot_y |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 401522202 | UConn Huskies | San Diego State Aztecs | Jump Ball won by UConn | 0 | 0 | 1 | 1200 | 2400 | UConn Huskies | jump ball | False | False | False | | False | | nan | nan |
| 1 | 401522202 | UConn Huskies | San Diego State Aztecs | Jordan Hawkins made Jumper. Assisted by Adama Sanogo. | 2 | 0 | 1 | 1174 | 2374 | UConn Huskies | jumper | True | True | False | Jordan Hawkins | True | Adama Sanogo | 18 | 15 |
| 2 | 401522202 | UConn Huskies | San Diego State Aztecs | Lamont Butler made Three Point Jumper. Assisted by Matt Bradley. | 2 | 3 | 1 | 1152 | 2352 | San Diego State Aztecs | three point jumper | True | True | True | Lamont Butler | True | Matt Bradley | 39 | 22 |
| 3 | 401522202 | UConn Huskies | San Diego State Aztecs | Tristen Newton Turnover. | 2 | 3 | 1 | 1130 | 2330 | UConn Huskies | turnover | False | False | False | | False | | nan | nan |
| 4 | 401522202 | UConn Huskies | San Diego State Aztecs | Darrion Trammell made Three Point Jumper. Assisted by Keshad Johnson. | 2 | 6 | 1 | 1108 | 2308 | San Diego State Aztecs | three point jumper | True | True | True | Darrion Trammell | True | Keshad Johnson | 1 | 0 |
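In the sample rows above, secs_left_reg equals secs_left_half plus 1200 seconds for each half still to be played. The sketch below reproduces that relationship under the assumption of two 20-minute halves (the men's format shown here). It is not necessarily how CBBpy computes the column, and it does not cover overtime periods.

```python
HALF_SECONDS = 1200  # assumed: two 20-minute halves, as in the men's sample


def secs_left_reg(half, secs_left_half, n_halves=2):
    """Seconds left in regulation, given the half and seconds left in that half."""
    halves_remaining = n_halves - half  # full halves still to be played
    return secs_left_half + HALF_SECONDS * halves_remaining


print(secs_left_reg(1, 1174))  # 2374
print(secs_left_reg(2, 600))   # 600
```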
Function call:
import cbbpy.mens_scraper as s
s.get_player_info('5105865')
Returns:
| | player_id | first_name | last_name | jersey_number | pos | status | team | experience | height | weight | birthplace | date_of_birth |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 5105865 | Reed | Bailey | 1 | Forward | active | Davidson Wildcats | Junior | 6' 10" | 230 lbs | Harvard, MA | |
Function call:
import cbbpy.womens_scraper as s
s.get_team_schedule('davidson', 2022)
Returns (partially):
| | team | team_id | season | game_id | game_day | game_time | opponent | opponent_id | season_type | game_status | tv_network | game_result |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Davidson | 2166 | 2022 | 401370995 | November 09, 2021 | 04:00 PM PST | Delaware Blue Hens | 48 | Regular Season | Final | ESPN+ | W 93-71 |
| 1 | Davidson | 2166 | 2022 | 401370996 | November 13, 2021 | 05:30 PM PST | San Francisco Dons | 2539 | Regular Season | Final | | L 60-65 |
| 2 | Davidson | 2166 | 2022 | 401365883 | November 18, 2021 | 09:00 AM PST | New Mexico State Aggies | 166 | Regular Season | Final | ESPNU | L 64-75 |
| 3 | Davidson | 2166 | 2022 | 401377036 | November 19, 2021 | 11:30 AM PST | Pennsylvania Quakers | 219 | Regular Season | Final | ESPNU | W 72-60 |
| 4 | Davidson | 2166 | 2022 | 401377040 | November 21, 2021 | 03:00 PM PST | East Carolina Pirates | 151 | Regular Season | Final | ESPNU | W 76-67 |
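The game_result strings in the sample rows above follow a "W 93-71" pattern, so they are easy to split into an outcome and two scores. The sketch below assumes exactly that pattern and returns None for anything else (results with other suffixes or statuses are not handled); parse_result is a hypothetical helper, not part of CBBpy. In the sample rows the first score appears to belong to the team whose schedule was requested.

```python
import re

# Matches strings such as "W 93-71" or "L 60-65".
_RESULT = re.compile(r"^([WL]) (\d+)-(\d+)$")


def parse_result(game_result):
    """Split a game_result string into (outcome, first_score, second_score)."""
    match = _RESULT.match(game_result.strip())
    if match is None:
        return None  # unexpected format, e.g. a game that has not finished
    outcome, first, second = match.groups()
    return outcome, int(first), int(second)


print(parse_result("W 93-71"))  # ('W', 93, 71)
print(parse_result("L 60-65"))  # ('L', 60, 65)
```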
Function call:
import cbbpy.mens_scraper as s
s.get_conference_schedule('ovc', 2015)
Returns (showing the middle of the output):
| | team | team_id | season | game_id | game_day | game_time | opponent | opponent_id | season_type | game_status | tv_network | game_result |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 30 | Belmont | 2057 | 2015 | 400766521 | March 06, 2015 | 07:15 PM PST | Eastern Kentucky Colonels | 2198 | Regular Season | Final | ESPNU | W 53-52 |
| 31 | Belmont | 2057 | 2015 | 400766705 | March 07, 2015 | 04:00 PM PST | Murray State Racers | 93 | Regular Season | Final | ESPN2 | W 88-87 |
| 32 | Belmont | 2057 | 2015 | 400785349 | March 20, 2015 | 12:30 PM PDT | Virginia Cavaliers | 258 | Postseason | Final | truTV | L 67-79 |
| 33 | Eastern Kentucky | 2198 | 2015 | 400596308 | November 14, 2014 | 04:00 PM PST | Savannah State Tigers | 2542 | Regular Season | Final | | W 76-53 |
| 34 | Eastern Kentucky | 2198 | 2015 | 400596315 | November 18, 2014 | 04:00 PM PST | Kentucky Christian Knights | 3077 | Regular Season | Final | | W 115-35 |
Contact
Feel free to reach out to me directly with any questions, requests, or suggestions at dnlcowan37@gmail.com.