Python package for Basketball Reference that gathers data by scraping the website

basketball-reference-webscrapper

basketball-reference-webscrapper is a Python package for fetching NBA game data from two sources:

  1. Basketball Reference website (web scraping)
  2. NBA Stats API (official API via nba_api package)

Features

  • ✅ Web scrapes NBA gamelogs, schedules, and player attributes from Basketball Reference
  • ✅ Fetches data directly from official NBA Stats API (faster, but local-only)
  • ✅ Validates user inputs to ensure data accuracy
  • ✅ Handles team-specific data filtering (single team, multiple teams, or all teams)
  • ✅ Returns data as pandas DataFrames
  • ✅ Consistent interface across both data sources

Installation

pip install basketball-reference-webscrapper

Dependencies:

  • pandas, beautifulsoup4, requests - for web scraping
  • nba-api - for NBA API access (included automatically)

Usage

Option 1: Basketball Reference (Web Scraping)

Best for: Production environments, cloud deployments, historical data (1947-present)

from basketball_reference_webscrapper.data_models.feature_model import FeatureIn
from basketball_reference_webscrapper.webscrapping_basketball_reference import WebScrapBasketballReference

# Create feature object
feature = FeatureIn(
    data_type='gamelog',  # 'gamelog', 'schedule', or 'player_attributes'
    season=2023,
    team='BOS'  # 'all', 'BOS', or ['BOS', 'LAL']
)

# Fetch data
scraper = WebScrapBasketballReference(feature_object=feature)
data = scraper.webscrappe_nba_games_data()
print(data.head())

Option 2: NBA Stats API

Best for: Local development, faster data retrieval, recent seasons (2000-present)

from basketball_reference_webscrapper.data_models.feature_model import FeatureIn
from basketball_reference_webscrapper.web_scrap_nba_api import WebScrapNBAApi

# Create feature object
feature = FeatureIn(
    data_type='gamelog',  # 'gamelog' or 'schedule'
    season=2023,
    team='BOS'  # 'all', 'BOS', or ['BOS', 'LAL']
)

# Fetch data
scraper = WebScrapNBAApi(feature_object=feature)
data = scraper.fetch_nba_api_data()
print(data.head())

⚠️ Important: The NBA API blocks requests from cloud provider IPs (AWS, Heroku, GCP, etc.). Use this scraper locally only.

Comparison: Which Data Source to Use?

Feature            NBA API                   Basketball Reference
Speed              ⚡ Fast (~1-2s/team)       🐌 Slow (~20-30s/team)
Cloud-friendly     ❌ No (blocks cloud IPs)   ✅ Yes
Historical data    2000-present              1947-present
Opponent stats     ❌ Not included            ✅ Complete
Player attributes  ❌ Not supported           ✅ Supported
Reliability        High (official API)       Medium (web scraping)

Recommendation:

  • Development/Analysis (local): Use NBA API for speed
  • Production/Cloud: Use Basketball Reference, since the NBA API blocks cloud IPs
  • Historical research: Use Basketball Reference
  • Need opponent stats: Use Basketball Reference
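
The recommendations above can be condensed into a small decision helper. This is only a sketch: choose_source and its parameters are hypothetical names that are not part of the package, and the rules simply restate the comparison table.

```python
def choose_source(in_cloud, season, data_type="gamelog", need_opponent_stats=False):
    """Return 'nba_api' or 'basketball_reference' following the README's comparison.

    Hypothetical helper for illustration; the package does not ship it.
    """
    if in_cloud:
        # The NBA API blocks datacenter IPs
        return "basketball_reference"
    if season < 2000:
        # The NBA API scraper only accepts seasons from 2000 onward
        return "basketball_reference"
    if data_type == "player_attributes" or need_opponent_stats:
        # Only the Basketball Reference scraper provides these
        return "basketball_reference"
    # Local run, recent season, no opponent stats needed: pick the faster source
    return "nba_api"
```

For example, choose_source(in_cloud=False, season=2023) picks the NBA API, while any cloud deployment falls back to Basketball Reference.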

Supported Data Types

Basketball Reference

  • gamelog - Game-by-game team statistics
  • schedule - Team schedule and results
  • player_attributes - Player roster information

NBA API

  • gamelog - Game-by-game team statistics (no opponent stats)
  • schedule - Team schedule and results (no pts_opp)

Input Validation

Both scrapers validate inputs:

  • Data Type: Must be valid for the chosen scraper
  • Season: Must be an integer ≥ 2000 for NBA API, ≥ 1947 for Basketball Reference
  • Team: 'all', valid team abbreviation (e.g., 'BOS'), or list of abbreviations (e.g., ['BOS', 'LAL'])
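
To make these rules concrete, here is a minimal sketch of equivalent checks. It is not the package's own validation code: validate_inputs, the source labels, and the constants are hypothetical names chosen for illustration, and the team set is copied from the list of valid abbreviations in this README.

```python
# Hypothetical constants mirroring the rules stated in this README
VALID_TEAMS = {
    "ATL", "BOS", "BRK", "CHA", "CHI", "CLE", "DAL", "DEN", "DET", "GSW",
    "HOU", "IND", "LAC", "LAL", "MEM", "MIA", "MIL", "MIN", "NOP", "NYK",
    "OKC", "ORL", "PHI", "PHO", "POR", "SAC", "SAS", "TOR", "UTA", "WAS",
}
DATA_TYPES = {
    "basketball_reference": {"gamelog", "schedule", "player_attributes"},
    "nba_api": {"gamelog", "schedule"},
}
MIN_SEASON = {"basketball_reference": 1947, "nba_api": 2000}


def validate_inputs(source, data_type, season, team):
    """Raise ValueError if the inputs break any of the documented rules."""
    if source not in DATA_TYPES:
        raise ValueError(f"unknown source: {source!r}")
    if data_type not in DATA_TYPES[source]:
        raise ValueError(f"{data_type!r} is not supported by {source}")
    # bool is a subclass of int, so reject it explicitly
    if isinstance(season, bool) or not isinstance(season, int):
        raise ValueError("season must be an integer")
    if season < MIN_SEASON[source]:
        raise ValueError(f"season must be >= {MIN_SEASON[source]} for {source}")
    # Accept 'all', a single abbreviation, or a list of abbreviations
    teams = [team] if isinstance(team, str) else list(team)
    if teams == ["all"]:
        return
    invalid = [t for t in teams if t not in VALID_TEAMS]
    if not teams or invalid:
        raise ValueError(f"invalid team value(s): {invalid or team!r}")
```

Running such a check before any request is made means a typo in a team abbreviation fails immediately rather than after a slow scrape.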

Valid Team Abbreviations

ATL, BOS, BRK, CHA, CHI, CLE, DAL, DEN, DET, GSW, HOU, IND,
LAC, LAL, MEM, MIA, MIL, MIN, NOP, NYK, OKC, ORL, PHI, PHO,
POR, SAC, SAS, TOR, UTA, WAS

Examples

Example 1: Fetch Single Team Gamelog (Basketball Reference)

from basketball_reference_webscrapper.data_models.feature_model import FeatureIn
from basketball_reference_webscrapper.webscrapping_basketball_reference import WebScrapBasketballReference

feature = FeatureIn(data_type='gamelog', season=2023, team='BOS')
scraper = WebScrapBasketballReference(feature_object=feature)
data = scraper.webscrappe_nba_games_data()

print(f"Fetched {len(data)} games for Boston Celtics")
print(data[['game_date', 'opp', 'results', 'pts_tm', 'pts_opp']].head())

Example 2: Fetch Multiple Teams Schedule (NBA API)

from basketball_reference_webscrapper.data_models.feature_model import FeatureIn
from basketball_reference_webscrapper.web_scrap_nba_api import WebScrapNBAApi

feature = FeatureIn(data_type='schedule', season=2023, team=['LAL', 'GSW'])
scraper = WebScrapNBAApi(feature_object=feature)
data = scraper.fetch_nba_api_data()

print(f"Teams: {data['tm'].unique()}")
print(data[['game_date', 'opponent', 'w_l', 'pts_tm']].head())

Example 3: Fetch All Teams (use with caution)

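# This will take several minutes and make 30+ requests
from basketball_reference_webscrapper.data_models.feature_model import FeatureIn
from basketball_reference_webscrapper.webscrapping_basketball_reference import WebScrapBasketballReference

feature = FeatureIn(data_type='gamelog', season=2023, team='all')

# Basketball Reference works in any environment
scraper = WebScrapBasketballReference(feature_object=feature)
data = scraper.webscrappe_nba_games_data()

# Local-only alternative (import WebScrapNBAApi as in Example 2):
# data = WebScrapNBAApi(feature_object=feature).fetch_nba_api_data()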
print(f"Fetched data for {data['tm'].nunique()} teams")

Data Engineering Use

This package is designed for data engineering pipelines:

  • Clean data structure: Only returns actual data from source (no empty placeholders)
  • Consistent schema: Same column names across data sources where applicable
  • Flexible filtering: Easy to fetch specific teams or all teams
  • Error handling: Comprehensive logging and error messages

Note: NBA API scraper excludes opponent statistics (would require 82+ additional API calls per team). Handle opponent data joins in your ETL pipeline if needed.
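
One way to handle that join is a pandas self-merge, sketched below. It assumes the fetched frame holds one row per team per game with the columns game_date, tm, opponent, and pts_tm (the names used in Example 2), and that both teams of each game are present, as when fetching with team='all'. Check the column names against your own output; add_opponent_points is a hypothetical helper, not part of the package.

```python
import pandas as pd


def add_opponent_points(games: pd.DataFrame) -> pd.DataFrame:
    """Attach pts_opp by matching each row with the opponent's row for the same game.

    Assumed columns: game_date, tm, opponent, pts_tm.
    """
    # The opponent's view of the game: its 'tm' is our 'opponent', its points are our 'pts_opp'
    opp_view = games[["game_date", "tm", "pts_tm"]].rename(
        columns={"tm": "opponent", "pts_tm": "pts_opp"}
    )
    # Left join keeps every original row; pts_opp is NaN when the opponent was not fetched
    return games.merge(opp_view, on=["game_date", "opponent"], how="left")
```

Rows whose opponent is missing from the frame keep a null pts_opp, which makes incomplete fetches easy to spot downstream.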

Configuration

The package uses params.yaml for configuration. Both scrapers share the same team reference data in constants/team_city_refdata.csv.

Troubleshooting

NBA API: JSONDecodeError

Error: JSONDecodeError: Expecting value: line 1 column 1 (char 0)

Cause: You're running in a cloud environment (AWS, Heroku, GCP, etc.). NBA API blocks datacenter IPs.

Solution:

  • Run locally for development
  • Use Basketball Reference scraper for production/cloud deployments
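
If the same code has to run both locally and in the cloud, one option is to try the NBA API first and fall back to Basketball Reference when the error above appears. The sketch below assumes the failure surfaces as json.JSONDecodeError (or a subclass), as the error message suggests; fetch_with_fallback is a hypothetical helper, not part of the package.

```python
import json


def fetch_with_fallback(primary, fallback):
    """Call primary(); if it raises JSONDecodeError, call fallback() instead."""
    try:
        return primary()
    except json.JSONDecodeError:
        # Typical symptom of the NBA API rejecting a datacenter IP
        return fallback()


# Usage with the package's scrapers (requires network access):
# from basketball_reference_webscrapper.data_models.feature_model import FeatureIn
# from basketball_reference_webscrapper.web_scrap_nba_api import WebScrapNBAApi
# from basketball_reference_webscrapper.webscrapping_basketball_reference import WebScrapBasketballReference
#
# feature = FeatureIn(data_type='gamelog', season=2023, team='BOS')
# data = fetch_with_fallback(
#     lambda: WebScrapNBAApi(feature_object=feature).fetch_nba_api_data(),
#     lambda: WebScrapBasketballReference(feature_object=feature).webscrappe_nba_games_data(),
# )
```

Keep in mind that the two sources do not return identical columns (the NBA API output has no opponent statistics), so downstream code should tolerate both shapes.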

Rate Limiting

  • NBA API: ~100 requests/minute (built-in 0.4s delay between requests)
  • Basketball Reference: Respectful delays built-in (~20s per team)
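
If you add your own requests around the package (for example, looping over several seasons), a fixed minimum interval between calls is a simple way to stay within these limits. The following is a generic sketch, not the package's internal implementation; Throttle is a hypothetical class, and the 0.4s value is taken from the NBA API delay mentioned above.

```python
import time


class Throttle:
    """Enforce a minimum interval (in seconds) between consecutive calls to wait()."""

    def __init__(self, min_interval, clock=time.monotonic, sleep=time.sleep):
        self.min_interval = min_interval
        self._clock = clock  # injectable so the class can be tested without real waiting
        self._sleep = sleep
        self._last = None    # time of the previous call, None before the first call

    def wait(self):
        now = self._clock()
        if self._last is not None:
            remaining = self.min_interval - (now - self._last)
            if remaining > 0:
                self._sleep(remaining)
                now = self._clock()
        self._last = now


# Usage: call throttle.wait() before each request
# throttle = Throttle(0.4)
# for season in (2021, 2022, 2023):
#     throttle.wait()
#     ...  # fetch data for this season
```

The first call returns immediately; later calls sleep only for whatever part of the interval has not already elapsed.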

Testing

# Run all tests
poetry run pytest

# Run specific test file
poetry run pytest tests/test_web_scrap_nba_api.py -v
poetry run pytest tests/test_webscrapping_basketball_reference.py -v

Contributing

Contributions are welcome! Please:

  1. Fork the repository
  2. Create a feature branch
  3. Add tests for new functionality
  4. Submit a pull request

License

See LICENSE file for details.

Contact

For questions or feedback: yannick.flores1992@gmail.com

Changelog

v0.5.4 (Latest)

  • ✨ Added NBA Stats API support via WebScrapNBAApi class
  • ✨ Added nba-api package integration
  • 📝 Comprehensive test coverage for both scrapers
  • 🔧 Removed opponent statistics from NBA API output (data integrity)
  • ⚡ Optimized rate limiting (0.4s between NBA API requests)
  • 📚 Updated documentation with comparison guide

v0.5.3

  • 🐛 Fixed Basketball Reference scraper headers for better reliability
  • 🔧 Improved error handling and logging

Acknowledgments

  • Basketball Reference for providing comprehensive NBA statistics
  • NBA.com for the official stats API
  • nba_api package maintainers for the excellent Python wrapper
