NFL data pipeline combining PFF grades and PFR game data for over/under analysis
nfl-data-pipeline
A pip-installable data pipeline that scrapes NFL team grades from PFF (Pro Football Focus) and game/betting data from Pro Football Reference, merges the datasets, and runs postprocessing (rolling averages, rankings) to produce a dataset for over/under analysis.
Quick Start
pip install nfl-data-pipeline
Or install from source with Poetry:
git clone https://github.com/thadhutch/nfl-data-pipeline.git
cd nfl-data-pipeline
poetry install
Features
- PFF Scraping -- Selenium-based scraper for PFF team grades (requires PFF Premium)
- PFR Scraping -- Proxy-rotated scraper for Pro Football Reference boxscores
- Date & Team Normalization -- Standardizes dates and team names across sources
- Dataset Merging -- Inner join on date + team columns
- Rolling Averages -- Pre-game cumulative stat averages per team per season
- Games Played Tracking -- Cumulative games played before each matchup
- Feature Rankings -- Per-date rankings across all teams
- CLI Interface -- `nfl-pipeline` command with `scrape`, `process`, and `pipeline` subcommands
- Python API -- Import and call any step programmatically
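The rolling-average and games-played features both describe values known before kickoff. The following is a minimal sketch of that idea in plain Python, not the package's implementation. The row keys (`season`, `team`, `date`, `points`) and the function name `add_pregame_averages` are assumptions made for illustration.

```python
from collections import defaultdict


def add_pregame_averages(games, stat="points"):
    """Attach pre-game cumulative averages and games played to each row.

    `games` is a list of dicts with hypothetical keys: season, team, date,
    and the stat column. ISO date strings sort chronologically.
    """
    totals = defaultdict(float)  # (season, team) -> running stat total
    counts = defaultdict(int)    # (season, team) -> games already played
    out = []
    for row in sorted(games, key=lambda r: (r["season"], r["date"])):
        key = (row["season"], row["team"])
        played = counts[key]
        # Only earlier games contribute, so the current result never leaks in.
        average = totals[key] / played if played else None
        out.append({**row, "games_played": played, f"avg_{stat}": average})
        totals[key] += row[stat]
        counts[key] += 1
    return out


games = [
    {"season": 2024, "team": "KC", "date": "2024-09-05", "points": 27},
    {"season": 2024, "team": "KC", "date": "2024-09-15", "points": 26},
    {"season": 2024, "team": "KC", "date": "2024-09-22", "points": 22},
]
result = add_pregame_averages(games)
for row in result:
    print(row["date"], row["games_played"], row["avg_points"])
```

Because the totals are keyed by season as well as team, averages restart at the beginning of each season, and a team's first game of a season has no average.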
Prerequisites
- Python 3.12+
- Google Chrome + ChromeDriver
- A PFF Premium subscription (for PFF scraping)
- Rotating proxies in CSV format (for PFR scraping)
Setup
# Install dependencies
poetry install
# Copy and fill in credentials
cp .env.example .env
# Add your proxies
mkdir -p proxies
# Place your proxies.csv in proxies/ (format: address:port:user:password per line)
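For reference, one way to read the `address:port:user:password` format is sketched below. The function names and the `http://` scheme are assumptions for illustration. The package's own loader in `scrapers/proxies.py` may work differently.

```python
from pathlib import Path
from urllib.parse import quote


def parse_proxy_line(line):
    """Turn 'address:port:user:password' into a proxy URL string."""
    # maxsplit=3 keeps any extra colons inside the password field.
    address, port, user, password = line.strip().split(":", 3)
    # Percent-encode credentials so special characters survive in a URL.
    return f"http://{quote(user, safe='')}:{quote(password, safe='')}@{address}:{port}"


def load_proxies(path):
    """Read one proxy per line, skipping blank lines."""
    lines = Path(path).read_text().splitlines()
    return [parse_proxy_line(line) for line in lines if line.strip()]


print(parse_proxy_line("203.0.113.10:8080:alice:s3cret"))
```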
Configuration
Override defaults with environment variables:
| Variable | Default | Description |
|---|---|---|
| NFL_SEASONS | 2024 | Comma-separated list of seasons for PFF scraping |
| NFL_START_YEAR | 2024 | Start year for PFR URL scraping |
| NFL_END_YEAR | 2024 | End year for PFR URL scraping |
| NFL_MAX_WEEK | 18 | Last week to scrape in the final year |
| NFL_DATA_DIR | data | Base directory for all data output |
| NFL_PROXY_FILE | proxies/proxies.csv | Path to proxy CSV file |
| PFF_EMAIL | - | PFF account email |
| PFF_PASSWORD | - | PFF account password |
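As an illustration of how these variables and defaults could be resolved, here is a small sketch. The `Settings` class and `load_settings` function are hypothetical and are not taken from `_config.py`. Only the variable names and default values come from the table above.

```python
import os
from dataclasses import dataclass
from pathlib import Path


@dataclass(frozen=True)
class Settings:
    seasons: list[int]
    start_year: int
    end_year: int
    max_week: int
    data_dir: Path
    proxy_file: Path


def load_settings(env=None):
    """Build settings from a mapping, falling back to the documented defaults."""
    env = os.environ if env is None else env
    raw_seasons = env.get("NFL_SEASONS", "2024")
    return Settings(
        seasons=[int(s) for s in raw_seasons.split(",") if s.strip()],
        start_year=int(env.get("NFL_START_YEAR", "2024")),
        end_year=int(env.get("NFL_END_YEAR", "2024")),
        max_week=int(env.get("NFL_MAX_WEEK", "18")),
        data_dir=Path(env.get("NFL_DATA_DIR", "data")),
        proxy_file=Path(env.get("NFL_PROXY_FILE", "proxies/proxies.csv")),
    )


# Passing a plain dict makes the behaviour easy to check without touching os.environ.
settings = load_settings({"NFL_SEASONS": "2022, 2023", "NFL_MAX_WEEK": "10"})
print(settings.seasons, settings.max_week, settings.data_dir)
```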
CLI Usage
# Run the full pipeline end-to-end
nfl-pipeline pipeline
# Scrape only PFF data (scrape + parse dates + normalize names)
nfl-pipeline scrape pff
# Scrape only PFR data (URLs + game data + parse dates + normalize names)
nfl-pipeline scrape pfr
# Run all post-processing steps
nfl-pipeline process all
# Run individual processing steps
nfl-pipeline process merge
nfl-pipeline process over-under
nfl-pipeline process averages
nfl-pipeline process games-played
nfl-pipeline process rankings
# Show version
nfl-pipeline --version
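The `process rankings` step is described under Features as per-date rankings across all teams. A rough sketch of that kind of ranking follows. The row keys, the function name `rank_by_date`, the choice that rank 1 means the highest value, and the tie handling are all assumptions, not the package's documented behaviour.

```python
from collections import defaultdict


def rank_by_date(rows, feature):
    """Add `<feature>_rank` within each date; 1 is the highest value."""
    by_date = defaultdict(list)
    for row in rows:
        by_date[row["date"]].append(row)

    ranked = []
    for date in sorted(by_date):
        group = sorted(by_date[date], key=lambda r: r[feature], reverse=True)
        rank, previous = 0, None
        for position, row in enumerate(group, start=1):
            if row[feature] != previous:
                # Competition ranking: tied values share a rank (1, 2, 2, 4).
                rank, previous = position, row[feature]
            ranked.append({**row, f"{feature}_rank": rank})
    return ranked


rows = [
    {"date": "2024-09-08", "team": "A", "avg_points": 30.0},
    {"date": "2024-09-08", "team": "B", "avg_points": 24.0},
    {"date": "2024-09-08", "team": "C", "avg_points": 24.0},
    {"date": "2024-09-08", "team": "D", "avg_points": 17.0},
]
ranks = {r["team"]: r["avg_points_rank"] for r in rank_by_date(rows, "avg_points")}
print(ranks)
```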
Python API
import nfl_data_pipeline
# Run individual steps
nfl_data_pipeline.scrape_pff_data()
nfl_data_pipeline.collect_boxscore_urls()
nfl_data_pipeline.scrape_all_game_info()
nfl_data_pipeline.merge_datasets()
nfl_data_pipeline.process_over_under()
nfl_data_pipeline.compute_rolling_averages()
nfl_data_pipeline.add_games_played()
nfl_data_pipeline.compute_rankings()
# Or run full pipelines
from nfl_data_pipeline.pipeline import run_full_pipeline, run_pff_pipeline, run_pfr_pipeline
run_full_pipeline()
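When calling individual steps, it can help to run them through a small wrapper that logs each step and stops at the first failure. The `run_steps` helper below is a hypothetical convenience, not part of `nfl_data_pipeline`. It accepts any zero-argument callables, and the demo uses stand-ins so it runs without the package installed.

```python
import logging
import time

logger = logging.getLogger("pipeline_runner")


def run_steps(steps):
    """Call each (name, function) pair in order and return per-step durations.

    An exception in any step propagates immediately, so later steps
    never run against missing or partial inputs.
    """
    durations = {}
    for name, func in steps:
        started = time.perf_counter()
        logger.info("starting %s", name)
        func()
        durations[name] = time.perf_counter() - started
        logger.info("finished %s in %.2fs", name, durations[name])
    return durations


# Stand-in steps. With the package installed, the list might instead be:
#   [("merge", nfl_data_pipeline.merge_datasets),
#    ("over_under", nfl_data_pipeline.process_over_under)]
calls = []
timings = run_steps([
    ("merge", lambda: calls.append("merge")),
    ("over_under", lambda: calls.append("over_under")),
])
print(calls, sorted(timings))
```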
Pipeline
 PFF Scrape           PFR Scrape
     |                    |
     v                    v
Extract Dates       Normalize Dates
     |                    |
     v                    v
Normalize Names     Normalize Names
     |                    |
     +--------+  +--------+
              |  |
              v  v
             Merge
               |
               v
           Over/Under
               |
               v
        Rolling Averages
               |
               v
          Games Played
               |
               v
            Rankings
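The Merge step joins the two branches, which is why dates and team names are normalized first: an inner join only keeps rows whose keys match exactly. The sketch below shows that join in plain Python. The column names (`date`, `team`, `offense_grade`, `total_line`) and the function `inner_join` are assumptions for illustration, and the package itself may use a different approach.

```python
def inner_join(left, right, keys=("date", "team")):
    """Inner-join two lists of dicts on the given key columns."""
    index = {}
    for row in right:
        index.setdefault(tuple(row[k] for k in keys), []).append(row)

    merged = []
    for row in left:
        # Rows without a partner on the other side are dropped.
        for match in index.get(tuple(row[k] for k in keys), []):
            merged.append({**match, **row})
    return merged


pff = [
    {"date": "2024-09-05", "team": "KC", "offense_grade": 85.1},
    {"date": "2024-09-05", "team": "BAL", "offense_grade": 78.4},
    {"date": "2024-09-08", "team": "DAL", "offense_grade": 70.2},
]
pfr = [
    {"date": "2024-09-05", "team": "KC", "total_line": 46.5},
    {"date": "2024-09-05", "team": "BAL", "total_line": 46.5},
]
merged = inner_join(pff, pfr)
print(len(merged), [row["team"] for row in merged])
```

The DAL row has no PFR counterpart in this sample, so it does not appear in the result. The same thing would happen to every row of a team whose name was spelled differently in the two sources.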
Project Structure
nfl-data-pipeline/
├── src/
│   └── nfl_data_pipeline/
│       ├── __init__.py              # __version__, top-level re-exports
│       ├── _config.py               # Paths, env vars, logging setup
│       ├── teams.py                 # Team name/abbreviation mappings
│       ├── cli.py                   # Click CLI entry point
│       ├── pipeline.py              # Full pipeline orchestrator
│       ├── scrapers/
│       │   ├── pff.py               # PFF grades scraper
│       │   ├── pfr.py               # PFR game data scraper
│       │   ├── pfr_urls.py          # PFR boxscore URL collector
│       │   ├── auth.py              # PFF authentication
│       │   └── proxies.py           # Shared proxy loading
│       ├── parsers/
│       │   ├── pff_dates.py         # PFF date extraction
│       │   ├── pff_teams.py         # PFF team name normalization
│       │   ├── pfr_dates.py         # PFR date normalization
│       │   └── pfr_teams.py         # PFR team name extraction
│       └── processing/
│           ├── merge.py             # Merge PFF + PFR datasets
│           ├── over_under.py        # O/U betting line extraction
│           ├── rolling_averages.py  # Rolling stat averages
│           ├── games_played.py      # Cumulative games played
│           └── rankings.py          # Feature rankings
├── tests/
├── pyproject.toml
├── Makefile
├── LICENSE
└── README.md
Make Commands
make all # Run the full pipeline end-to-end
make pff # Run only the PFF scraping + processing chain
make pfr # Run only the PFR scraping + processing chain
make merge # Merge PFF and PFR data (runs both chains first)
make rankings # Run full postprocessing through rankings
make test # Run the test suite
make clean # Remove all generated data files
make dirs # Create data directory structure
Notes
- PFF scraping is fragile. It relies on XPath selectors tied to PFF's DOM structure. If PFF changes their frontend, the selectors in `scrapers/pff.py` will need updating.
- PFR scraping requires proxies. Pro Football Reference rate-limits aggressively. Without rotating proxies, requests will be blocked.
- The PFF scraper uses a real browser. It opens Chrome via Selenium, logs in with your credentials, and navigates page by page. This is slow but necessary since PFF renders data client-side.
- Data files are not tracked in git. Run the pipeline to generate them, or bring your own data in the expected format.
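To make the proxy note concrete, the sketch below shows one generic way to rotate through proxies when a request is blocked. It is not the package's scraper code: `fetch_with_rotation` and `fake_fetch` are hypothetical names, and the actual HTTP call is injected as a callable so the example runs without network access.

```python
from itertools import cycle


def fetch_with_rotation(url, proxies, fetch, max_attempts=3):
    """Try `fetch(url, proxy)` with successive proxies until one succeeds.

    `fetch` is any caller-supplied callable that raises when a request
    is blocked or fails.
    """
    if not proxies:
        raise ValueError("at least one proxy is required")
    rotation = cycle(proxies)
    last_error = None
    for _ in range(max_attempts):
        proxy = next(rotation)
        try:
            return fetch(url, proxy)
        except Exception as exc:  # a real scraper would catch narrower errors
            last_error = exc
    raise RuntimeError(f"all {max_attempts} attempts failed for {url}") from last_error


def fake_fetch(url, proxy):
    """Stand-in for an HTTP request: the first proxy is always blocked."""
    if proxy == "proxy-a":
        raise ConnectionError("blocked")
    return f"{url} via {proxy}"


print(fetch_with_rotation("https://example.com", ["proxy-a", "proxy-b"], fake_fetch))
```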