Python package for Basketball Reference that gathers data by scraping the website
basketball-reference-webscrapper
basketball-reference-webscrapper is a Python package for fetching NBA game data from two sources:
- Basketball Reference website (web scraping)
- NBA Stats API (official API via the nba_api package)
Features
- ✅ Web scrapes NBA gamelogs, schedules, and player attributes from Basketball Reference
- ✅ Fetches data directly from official NBA Stats API (faster, but local-only)
- ✅ Validates user inputs to ensure data accuracy
- ✅ Handles team-specific data filtering (single team, multiple teams, or all teams)
- ✅ Returns data as pandas DataFrames
- ✅ Consistent interface across both data sources
Installation
pip install basketball-reference-webscrapper
Dependencies:
- pandas, beautifulsoup4, requests - for web scraping
- nba-api - for NBA API access (included automatically)
Usage
Option 1: Basketball Reference (Web Scraping)
Best for: Production environments, cloud deployments, historical data (1947-present)
from basketball_reference_webscrapper.data_models.feature_model import FeatureIn
from basketball_reference_webscrapper.webscrapping_basketball_reference import WebScrapBasketballReference
# Create feature object
feature = FeatureIn(
    data_type='gamelog',  # 'gamelog', 'schedule', or 'player_attributes'
    season=2023,
    team='BOS'  # 'all', 'BOS', or ['BOS', 'LAL']
)
# Fetch data
scraper = WebScrapBasketballReference(feature_object=feature)
data = scraper.webscrappe_nba_games_data()
print(data.head())
Option 2: NBA Stats API
Best for: Local development, faster data retrieval, recent seasons (2000-present)
from basketball_reference_webscrapper.data_models.feature_model import FeatureIn
from basketball_reference_webscrapper.web_scrap_nba_api import WebScrapNBAApi
# Create feature object
feature = FeatureIn(
    data_type='gamelog',  # 'gamelog' or 'schedule'
    season=2023,
    team='BOS'  # 'all', 'BOS', or ['BOS', 'LAL']
)
# Fetch data
scraper = WebScrapNBAApi(feature_object=feature)
data = scraper.fetch_nba_api_data()
print(data.head())
⚠️ Important: The NBA Stats API blocks requests from cloud provider IP ranges (AWS, Heroku, GCP, etc.). Use this scraper only from a local machine.
Comparison: Which Data Source to Use?
| Feature | NBA API | Basketball Reference |
|---|---|---|
| Speed | ⚡ Fast (~1-2s/team) | 🐌 Slow (~20-30s/team) |
| Cloud-friendly | ❌ No (blocks cloud IPs) | ✅ Yes |
| Historical data | 2000-present | 1947-present |
| Opponent stats | ❌ Not included | ✅ Complete |
| Player attributes | ❌ Not supported | ✅ Supported |
| Reliability | High (official API) | Medium (web scraping) |
Recommendation:
- Development/Analysis (local): Use NBA API for speed
- Production/Cloud: Use Basketball Reference for reliability
- Historical research: Use Basketball Reference
- Need opponent stats: Use Basketball Reference
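The comparison table and recommendations above can be condensed into a small decision helper. This is a minimal sketch, not part of the package: the function name `choose_source` and its parameters are hypothetical, and the rules are taken only from the table.

```python
def choose_source(data_type: str, season: int, in_cloud: bool,
                  need_opponent_stats: bool = False) -> str:
    """Return 'nba_api' or 'basketball_reference' (hypothetical helper)."""
    nba_api_ok = (
        not in_cloud                                # NBA API blocks cloud IPs
        and season >= 2000                          # NBA API covers 2000-present
        and data_type in ("gamelog", "schedule")    # no player_attributes
        and not need_opponent_stats                 # opponent stats not included
    )
    # Basketball Reference is the slower option, but it covers every case.
    return "nba_api" if nba_api_ok else "basketball_reference"


print(choose_source("gamelog", 2023, in_cloud=False))
print(choose_source("player_attributes", 2023, in_cloud=False))
```

The helper prefers the NBA API whenever every constraint allows it, because speed is the only column in the table where it wins.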
Supported Data Types
Basketball Reference
- gamelog - Game-by-game team statistics
- schedule - Team schedule and results
- player_attributes - Player roster information
NBA API
- gamelog - Game-by-game team statistics (no opponent stats)
- schedule - Team schedule and results (no pts_opp)
Input Validation
Both scrapers validate inputs:
- Data Type: Must be valid for the chosen scraper
- Season: Must be an integer ≥ 2000 for NBA API, ≥ 1947 for Basketball Reference
- Team: 'all', a valid team abbreviation (e.g., 'BOS'), or a list of abbreviations (e.g., ['BOS', 'LAL'])
Valid Team Abbreviations
ATL, BOS, BRK, CHA, CHI, CLE, DAL, DEN, DET, GSW, HOU, IND,
LAC, LAL, MEM, MIA, MIL, MIN, NOP, NYK, OKC, ORL, PHI, PHO,
POR, SAC, SAS, TOR, UTA, WAS
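To make the validation rules concrete, here is a standalone sketch that applies them before any request is made. It is not the package's own validator: `check_inputs`, `RULES`, and the `source` labels are hypothetical names, and the error messages are illustrative.

```python
VALID_TEAMS = {
    "ATL", "BOS", "BRK", "CHA", "CHI", "CLE", "DAL", "DEN", "DET", "GSW",
    "HOU", "IND", "LAC", "LAL", "MEM", "MIA", "MIL", "MIN", "NOP", "NYK",
    "OKC", "ORL", "PHI", "PHO", "POR", "SAC", "SAS", "TOR", "UTA", "WAS",
}

# Per-source rules copied from the lists above (labels are hypothetical).
RULES = {
    "basketball_reference": {
        "min_season": 1947,
        "data_types": {"gamelog", "schedule", "player_attributes"},
    },
    "nba_api": {"min_season": 2000, "data_types": {"gamelog", "schedule"}},
}


def check_inputs(source, data_type, season, team):
    """Raise ValueError on invalid input; return the selected teams as a list."""
    rules = RULES[source]
    if data_type not in rules["data_types"]:
        raise ValueError(f"data_type {data_type!r} is not valid for {source}")
    # bool is a subclass of int, so reject it explicitly.
    if isinstance(season, bool) or not isinstance(season, int):
        raise ValueError("season must be an integer")
    if season < rules["min_season"]:
        raise ValueError(f"season must be >= {rules['min_season']} for {source}")
    if team == "all":
        return sorted(VALID_TEAMS)
    teams = [team] if isinstance(team, str) else list(team)
    unknown = [t for t in teams if t not in VALID_TEAMS]
    if not teams or unknown:
        raise ValueError(f"unknown or empty team selection: {unknown}")
    return teams


print(check_inputs("nba_api", "schedule", 2023, ["BOS", "LAL"]))
```

Normalizing the three accepted team forms into one list keeps the downstream fetch loop identical for a single team, several teams, or all teams.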
Examples
Example 1: Fetch Single Team Gamelog (Basketball Reference)
from basketball_reference_webscrapper.data_models.feature_model import FeatureIn
from basketball_reference_webscrapper.webscrapping_basketball_reference import WebScrapBasketballReference
feature = FeatureIn(data_type='gamelog', season=2023, team='BOS')
scraper = WebScrapBasketballReference(feature_object=feature)
data = scraper.webscrappe_nba_games_data()
print(f"Fetched {len(data)} games for Boston Celtics")
print(data[['game_date', 'opp', 'results', 'pts_tm', 'pts_opp']].head())
Example 2: Fetch Multiple Teams Schedule (NBA API)
from basketball_reference_webscrapper.data_models.feature_model import FeatureIn
from basketball_reference_webscrapper.web_scrap_nba_api import WebScrapNBAApi
feature = FeatureIn(data_type='schedule', season=2023, team=['LAL', 'GSW'])
scraper = WebScrapNBAApi(feature_object=feature)
data = scraper.fetch_nba_api_data()
print(f"Teams: {data['tm'].unique()}")
print(data[['game_date', 'opponent', 'w_l', 'pts_tm']].head())
Example 3: Fetch All Teams (use with caution)
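from basketball_reference_webscrapper.data_models.feature_model import FeatureIn
from basketball_reference_webscrapper.webscrapping_basketball_reference import WebScrapBasketballReference
# This will take several minutes and make 30+ requests
feature = FeatureIn(data_type='gamelog', season=2023, team='all')
# Choose your scraper based on environment
# Local only: import WebScrapNBAApi and call fetch_nba_api_data() instead
# scraper = WebScrapNBAApi(feature_object=feature)
scraper = WebScrapBasketballReference(feature_object=feature) # Works anywhere
data = scraper.webscrappe_nba_games_data()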
print(f"Fetched data for {data['tm'].nunique()} teams")
Data Engineering Use
This package is designed for data engineering pipelines:
- Clean data structure: Only returns actual data from source (no empty placeholders)
- Consistent schema: Same column names across data sources where applicable
- Flexible filtering: Easy to fetch specific teams or all teams
- Error handling: Comprehensive logging and error messages
Note: NBA API scraper excludes opponent statistics (would require 82+ additional API calls per team). Handle opponent data joins in your ETL pipeline if needed.
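One way to handle the opponent join in an ETL step is a self-merge: every game appears once per team, so the opponent's row already carries the opponent's score. The sketch below assumes the gamelog was fetched for all teams and has columns named game_date, tm, opp, and pts_tm; those column names and the helper `add_opponent_points` are assumptions for illustration, so adjust them to the actual output schema.

```python
import pandas as pd


def add_opponent_points(gamelog: pd.DataFrame) -> pd.DataFrame:
    """Attach pts_opp by matching each row to the opponent's row for the same date."""
    # Re-label the team's own columns so they line up with the "opp" key.
    opp_side = gamelog[["game_date", "tm", "pts_tm"]].rename(
        columns={"tm": "opp", "pts_tm": "pts_opp"}
    )
    # Left join keeps rows whose opponent was not fetched (pts_opp is NaN there).
    # validate= raises if a team has more than one row for the same date.
    return gamelog.merge(
        opp_side, on=["game_date", "opp"], how="left", validate="many_to_one"
    )


games = pd.DataFrame({
    "game_date": ["2023-01-01", "2023-01-01"],
    "tm": ["BOS", "LAL"],
    "opp": ["LAL", "BOS"],
    "pts_tm": [110, 102],
})
result = add_opponent_points(games)
print(result)
```

This avoids any extra API calls, at the cost of requiring both sides of each game to be present in the fetched data.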
Configuration
The package uses params.yaml for configuration. Both scrapers share the same team reference data in constants/team_city_refdata.csv.
Troubleshooting
NBA API: JSONDecodeError
Error: JSONDecodeError: Expecting value: line 1 column 1 (char 0)
Cause: You're running in a cloud environment (AWS, Heroku, GCP, etc.). NBA API blocks datacenter IPs.
Solution:
- Run locally for development
- Use Basketball Reference scraper for production/cloud deployments
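If the same code has to run both locally and in the cloud, the two solutions above can be combined into a fallback wrapper. This is a sketch under the assumption that the blocked response surfaces to the caller as `json.JSONDecodeError`, as the error message suggests; `fetch_with_fallback` is a hypothetical helper, not a package function.

```python
import json
from typing import Callable, TypeVar

T = TypeVar("T")


def fetch_with_fallback(primary: Callable[[], T], fallback: Callable[[], T]) -> T:
    """Call primary(); if it fails with a JSON decode error, call fallback()."""
    try:
        return primary()
    except json.JSONDecodeError:
        # A blocked request does not return valid JSON, so parsing fails here.
        return fallback()


# Intended usage with the two scrapers (not executed in this sketch):
# data = fetch_with_fallback(
#     lambda: WebScrapNBAApi(feature_object=feature).fetch_nba_api_data(),
#     lambda: WebScrapBasketballReference(
#         feature_object=feature).webscrappe_nba_games_data(),
# )
```

Note that the two sources do not return identical columns (the NBA API output has no opponent statistics), so downstream code should tolerate either schema.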
Rate Limiting
- NBA API: ~100 requests/minute (built-in 0.4s delay between requests)
- Basketball Reference: Respectful delays built-in (~20s per team)
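For pipelines that make additional requests of their own, the same fixed-delay idea can be reproduced with a small throttle. This is a minimal sketch, not the package's internal rate limiter; the `Throttle` class is hypothetical, and 0.4 s is simply the NBA API delay quoted above.

```python
import time


class Throttle:
    """Enforce a minimum interval between successive calls to wait()."""

    def __init__(self, min_interval, clock=time.monotonic, sleep=time.sleep):
        self.min_interval = min_interval
        self._clock = clock      # injectable for testing
        self._sleep = sleep      # injectable for testing
        self._last = None        # time of the previous call, None before the first

    def wait(self):
        """Sleep if the previous call was too recent; return seconds slept."""
        now = self._clock()
        delay = 0.0
        if self._last is not None:
            delay = max(0.0, self.min_interval - (now - self._last))
        if delay > 0:
            self._sleep(delay)
        self._last = now + delay
        return delay


throttle = Throttle(min_interval=0.4)
for team in ["BOS", "LAL", "GSW"]:
    throttle.wait()
    # a request for `team` would go here
```

Using a monotonic clock keeps the interval correct even if the system time changes during a long run.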
Testing
# Run all tests
poetry run pytest
# Run specific test file
poetry run pytest tests/test_web_scrap_nba_api.py -v
poetry run pytest tests/test_webscrapping_basketball_reference.py -v
Contributing
Contributions are welcome! Please:
- Fork the repository
- Create a feature branch
- Add tests for new functionality
- Submit a pull request
License
See LICENSE file for details.
Contact
For questions or feedback: yannick.flores1992@gmail.com
Changelog
v0.5.4 (Latest)
- ✨ Added NBA Stats API support via WebScrapNBAApi class
- ✨ Added nba-api package integration
- 📝 Comprehensive test coverage for both scrapers
- 🔧 Removed opponent statistics from NBA API output (data integrity)
- ⚡ Optimized rate limiting (0.4s between NBA API requests)
- 📚 Updated documentation with comparison guide
v0.5.3
- 🐛 Fixed Basketball Reference scraper headers for better reliability
- 🔧 Improved error handling and logging
Acknowledgments
- Basketball Reference for providing comprehensive NBA statistics
- NBA.com for the official stats API
- nba_api package maintainers for the excellent Python wrapper