NFL data pipeline combining PFF grades and PFR game data for over/under analysis

nfl-data-pipeline

Python 3.12+ | License: MIT

A pip-installable data pipeline that scrapes NFL team grades from PFF (Pro Football Focus) and game/betting data from Pro Football Reference, merges the datasets, and runs postprocessing (rolling averages, rankings) to produce a dataset for over/under analysis.

Quick Start

pip install nfl-data-pipeline

Or install from source with Poetry:

git clone https://github.com/thadhutcheson/nfl-data-pipeline.git
cd nfl-data-pipeline
poetry install

Features

  • PFF Scraping -- Selenium-based scraper for PFF team grades (requires PFF Premium)
  • PFR Scraping -- Proxy-rotated scraper for Pro Football Reference boxscores
  • Date & Team Normalization -- Standardizes dates and team names across sources
  • Dataset Merging -- Inner join on date + team columns
  • Rolling Averages -- Pre-game cumulative stat averages per team per season
  • Games Played Tracking -- Cumulative games played before each matchup
  • Feature Rankings -- Per-date rankings across all teams
  • CLI Interface -- nfl-pipeline command with scrape, process, and pipeline subcommands
  • Python API -- Import and call any step programmatically
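As a rough illustration of the "inner join on date + team columns" step, the sketch below joins two lists of rows using only the standard library. The column names (`date`, `team`, `offense_grade`, `over_under`), the `inner_join` helper, and the sample values are made up for this example; the package's own merge step may be implemented differently.

```python
def inner_join(pff_rows, pfr_rows, keys=("date", "team")):
    """Keep only rows whose key columns appear in both inputs."""
    # Index one side by its key tuple so each lookup is O(1).
    index = {tuple(row[k] for k in keys): row for row in pfr_rows}
    merged = []
    for row in pff_rows:
        match = index.get(tuple(row[k] for k in keys))
        if match is not None:
            # Combine the columns from both sources into one row.
            merged.append({**match, **row})
    return merged


# Made-up sample rows, already normalized to the same date and team format.
pff = [
    {"date": "2024-09-08", "team": "KC", "offense_grade": 80.0},
    {"date": "2024-09-08", "team": "BUF", "offense_grade": 75.0},
]
pfr = [
    {"date": "2024-09-08", "team": "KC", "over_under": 46.5},
    {"date": "2024-09-15", "team": "KC", "over_under": 48.0},
]

merged = inner_join(pff, pfr)
print(merged)
```

Only the row present in both inputs survives, which is why date and team normalization has to run before the merge.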

Prerequisites

  • Python 3.12+
  • Google Chrome + ChromeDriver
  • A PFF Premium subscription (for PFF scraping)
  • Rotating proxies in CSV format (for PFR scraping)

Setup

# Install dependencies
poetry install

# Copy and fill in credentials
cp .env.example .env

# Add your proxies
mkdir -p proxies
# Place your proxies.csv in proxies/ (format: address:port:user:password per line)
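
For reference, a line in the `address:port:user:password` format can be turned into a proxy URL roughly as follows. This is a sketch, not the code in scrapers/proxies.py: the `parse_proxy_line` and `load_proxies` helpers and the `http://` scheme are assumptions, and the address and credentials are placeholders.

```python
from urllib.parse import quote


def parse_proxy_line(line):
    """Convert 'address:port:user:password' into a proxy URL string."""
    # maxsplit=3 keeps any extra colons inside the password field.
    address, port, user, password = line.strip().split(":", 3)
    # Percent-encode credentials so characters like '@' or ':' stay unambiguous.
    return f"http://{quote(user, safe='')}:{quote(password, safe='')}@{address}:{port}"


def load_proxies(lines):
    """Parse every non-blank line; 'lines' can be an open file or a list."""
    return [parse_proxy_line(line) for line in lines if line.strip()]


proxies = load_proxies([
    "203.0.113.10:8080:alice:s3cret\n",
    "\n",
    "203.0.113.11:8080:bob:p@ss:word\n",
])
print(proxies)
```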

Configuration

Override defaults with environment variables:

Variable         Default              Description
NFL_SEASONS      2024                 Comma-separated list of seasons for PFF scraping
NFL_START_YEAR   2024                 Start year for PFR URL scraping
NFL_END_YEAR     2024                 End year for PFR URL scraping
NFL_MAX_WEEK     18                   Last week to scrape in the final year
NFL_DATA_DIR     data                 Base directory for all data output
NFL_PROXY_FILE   proxies/proxies.csv  Path to proxy CSV file
PFF_EMAIL        -                    PFF account email
PFF_PASSWORD     -                    PFF account password
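
The variables above could be read along these lines. The `load_config` helper and the dictionary keys are hypothetical and only mirror the documented names and defaults; the package's _config.py may parse them differently.

```python
import os


def load_config(env=None):
    """Read the documented variables, falling back to the documented defaults."""
    env = os.environ if env is None else env
    return {
        # NFL_SEASONS is comma-separated, e.g. "2022,2023".
        "seasons": [int(s) for s in env.get("NFL_SEASONS", "2024").split(",") if s.strip()],
        "start_year": int(env.get("NFL_START_YEAR", "2024")),
        "end_year": int(env.get("NFL_END_YEAR", "2024")),
        "max_week": int(env.get("NFL_MAX_WEEK", "18")),
        "data_dir": env.get("NFL_DATA_DIR", "data"),
        "proxy_file": env.get("NFL_PROXY_FILE", "proxies/proxies.csv"),
        # Credentials have no default; they stay None when unset.
        "pff_email": env.get("PFF_EMAIL"),
        "pff_password": env.get("PFF_PASSWORD"),
    }


# Passing a plain dict instead of os.environ makes the behavior easy to check.
config = load_config({"NFL_SEASONS": "2022, 2023", "NFL_MAX_WEEK": "10"})
print(config["seasons"], config["max_week"], config["data_dir"])
```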

CLI Usage

# Run the full pipeline end-to-end
nfl-pipeline pipeline

# Scrape only PFF data (scrape + parse dates + normalize names)
nfl-pipeline scrape pff

# Scrape only PFR data (URLs + game data + parse dates + normalize names)
nfl-pipeline scrape pfr

# Run all post-processing steps
nfl-pipeline process all

# Run individual processing steps
nfl-pipeline process merge
nfl-pipeline process over-under
nfl-pipeline process averages
nfl-pipeline process games-played
nfl-pipeline process rankings

# Show version
nfl-pipeline --version

Python API

import nfl_data_pipeline

# Run individual steps
nfl_data_pipeline.scrape_pff_data()
nfl_data_pipeline.collect_boxscore_urls()
nfl_data_pipeline.scrape_all_game_info()
nfl_data_pipeline.merge_datasets()
nfl_data_pipeline.process_over_under()
nfl_data_pipeline.compute_rolling_averages()
nfl_data_pipeline.add_games_played()
nfl_data_pipeline.compute_rankings()

# Or run full pipelines
from nfl_data_pipeline.pipeline import run_full_pipeline, run_pff_pipeline, run_pfr_pipeline
run_full_pipeline()

Pipeline

PFF Scrape          PFR Scrape
    |                   |
    v                   v
Extract Dates      Normalize Dates
    |                   |
    v                   v
Normalize Names    Normalize Names
    |                   |
    +-------+   +-------+
            |   |
            v   v
           Merge
             |
             v
        Over/Under
             |
             v
      Rolling Averages
             |
             v
       Games Played
             |
             v
         Rankings
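
To make the "pre-game" part of the Rolling Averages and Games Played steps concrete, here is a small standard-library sketch. It assumes rows carry `season`, `team`, `date` (ISO format), and one numeric stat. The `add_pregame_averages` helper, the column names, and the sample numbers are invented for illustration and are not taken from the package.

```python
from collections import defaultdict


def add_pregame_averages(games, stat="points"):
    """Attach games_played and the average of `stat` over earlier games only."""
    totals = defaultdict(float)
    counts = defaultdict(int)
    out = []
    # ISO dates sort chronologically as plain strings.
    for game in sorted(games, key=lambda g: g["date"]):
        key = (game["season"], game["team"])  # Reset per team per season.
        played = counts[key]
        average = totals[key] / played if played else None
        out.append({**game, "games_played": played, f"avg_{stat}": average})
        # Update the running totals only after the row is emitted, so the
        # current game never leaks into its own features.
        totals[key] += game[stat]
        counts[key] += 1
    return out


# Made-up scores for one team in one season.
games = [
    {"season": 2024, "team": "KC", "date": "2024-09-05", "points": 20},
    {"season": 2024, "team": "KC", "date": "2024-09-15", "points": 30},
    {"season": 2024, "team": "KC", "date": "2024-09-22", "points": 10},
]

rows = add_pregame_averages(games)
for row in rows:
    print(row["date"], row["games_played"], row["avg_points"])
```

The first game of a season has no history, so its average is left as `None` here; how the package fills that case is not documented above.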

Project Structure

nfl-data-pipeline/
├── src/
│   └── nfl_data_pipeline/
│       ├── __init__.py              # __version__, top-level re-exports
│       ├── _config.py               # Paths, env vars, logging setup
│       ├── teams.py                 # Team name/abbreviation mappings
│       ├── cli.py                   # Click CLI entry point
│       ├── pipeline.py              # Full pipeline orchestrator
│       ├── scrapers/
│       │   ├── pff.py               # PFF grades scraper
│       │   ├── pfr.py               # PFR game data scraper
│       │   ├── pfr_urls.py          # PFR boxscore URL collector
│       │   ├── auth.py              # PFF authentication
│       │   └── proxies.py           # Shared proxy loading
│       ├── parsers/
│       │   ├── pff_dates.py         # PFF date extraction
│       │   ├── pff_teams.py         # PFF team name normalization
│       │   ├── pfr_dates.py         # PFR date normalization
│       │   └── pfr_teams.py         # PFR team name extraction
│       └── processing/
│           ├── merge.py             # Merge PFF + PFR datasets
│           ├── over_under.py        # O/U betting line extraction
│           ├── rolling_averages.py  # Rolling stat averages
│           ├── games_played.py      # Cumulative games played
│           └── rankings.py          # Feature rankings
├── tests/
├── pyproject.toml
├── Makefile
├── LICENSE
└── README.md
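
As an illustration of what a per-date ranking step such as processing/rankings.py might compute, the sketch below ranks teams within each date on a single feature. The `rank_by_date` helper, the `avg_points` column, and the tie handling (ordinal ranks in input order) are assumptions made for this example, and it expects every row to have a non-missing value for the feature.

```python
from collections import defaultdict


def rank_by_date(rows, feature, higher_is_better=True):
    """Add '<feature>_rank' to each row, ranking teams within the same date."""
    by_date = defaultdict(list)
    for row in rows:
        by_date[row["date"]].append(row)

    ranked = []
    for date in sorted(by_date):
        # sorted() is stable, so tied teams keep their input order.
        group = sorted(by_date[date], key=lambda r: r[feature], reverse=higher_is_better)
        for position, row in enumerate(group, start=1):
            ranked.append({**row, f"{feature}_rank": position})
    return ranked


# Made-up feature values.
rows = [
    {"date": "2024-09-15", "team": "KC", "avg_points": 25.0},
    {"date": "2024-09-15", "team": "BUF", "avg_points": 30.0},
    {"date": "2024-09-22", "team": "KC", "avg_points": 28.0},
]

ranked = rank_by_date(rows, "avg_points")
for row in ranked:
    print(row["date"], row["team"], row["avg_points_rank"])
```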

Make Commands

make all            # Run the full pipeline end-to-end
make pff            # Run only the PFF scraping + processing chain
make pfr            # Run only the PFR scraping + processing chain
make merge          # Merge PFF and PFR data (runs both chains first)
make rankings       # Run full postprocessing through rankings
make test           # Run the test suite
make clean          # Remove all generated data files
make dirs           # Create data directory structure

Notes

  • PFF scraping is fragile. It relies on XPath selectors tied to PFF's DOM structure, so if PFF changes its frontend, the selectors in scrapers/pff.py will need updating.
  • PFR scraping requires proxies. Pro Football Reference rate-limits aggressively, and requests made without rotating proxies are likely to be blocked.
  • The PFF scraper uses a real browser. It opens Chrome via Selenium, logs in with your credentials, and navigates page by page. This is slow but necessary because PFF renders its data client-side.
  • Data files are not tracked in git. Run the pipeline to generate them, or bring your own data in the expected format.
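
The proxy rotation mentioned in the notes can be pictured as a simple retry loop. The sketch below is not the package's scraper: `fetch_with_rotation` is a hypothetical helper, the caller supplies the actual HTTP call as `fetch`, and the proxy and page URLs are placeholders.

```python
import itertools


def fetch_with_rotation(url, proxies, fetch, max_attempts=3):
    """Call fetch(url, proxy) with successive proxies until one succeeds."""
    if not proxies:
        raise ValueError("at least one proxy is required")
    pool = itertools.cycle(proxies)  # Wraps around when the list runs out.
    last_error = None
    for _ in range(max_attempts):
        proxy = next(pool)
        try:
            return fetch(url, proxy)
        except OSError as exc:  # Covers ConnectionError and timeouts.
            last_error = exc
    raise RuntimeError(f"all {max_attempts} attempts failed for {url}") from last_error


# A stand-in for a real HTTP call: the first proxy is "blocked".
def fake_fetch(url, proxy):
    if proxy == "http://proxy-a.example:8080":
        raise ConnectionError("blocked")
    return f"fetched {url} via {proxy}"


result = fetch_with_rotation(
    "https://example.com/boxscore",
    ["http://proxy-a.example:8080", "http://proxy-b.example:8080"],
    fake_fetch,
)
print(result)
```

Injecting `fetch` keeps the rotation logic independent of any particular HTTP library.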
