NFL data pipeline combining PFF grades and PFR game data for over/under analysis

nfl-data-pipeline

An end-to-end data pipeline that combines PFF team grades with Pro Football Reference game and betting data, then produces analysis-ready datasets with rolling averages, rankings, and over/under features.

Features
Installation
Prerequisites
Configuration
Usage
Pipeline Architecture
Project Structure
Development
Contributing
Known Limitations
License

Features

- PFF Scraping: Selenium-based scraper for PFF team grades (requires PFF Premium; manual login on first run, cookies cached for subsequent runs)
- PFR Scraping: Proxy-rotated scraper for Pro Football Reference boxscores
- Data Normalization: Standardizes dates and team names across sources
- Dataset Merging: Inner join on date + team columns
- Rolling Averages: Pre-game cumulative stat averages per team per season
- Games Played Tracking: Cumulative games played before each matchup
- Feature Rankings: Per-date rankings across all teams
- CLI + Python API: Run the full pipeline or any individual step
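
As a rough illustration of the Rolling Averages and Games Played features, the sketch below computes, for each game, a team's average of one stat over its earlier games in the same season, along with the count of those earlier games. It uses only the standard library. The function name, the row keys (`season`, `date`, `team`, `points`), and the sample values are assumptions made for illustration; the package's own implementation may differ.

```python
from collections import defaultdict


def add_pregame_averages(games, stat):
    """Attach each team's pre-game average of `stat` and games played so far.

    `games` is a list of dicts with hypothetical keys "season", "date"
    (ISO string), "team", and the stat column. Only games strictly before
    the current one (same team, same season) count toward the average.
    """
    totals = defaultdict(float)  # (season, team) -> running sum of stat
    counts = defaultdict(int)    # (season, team) -> games already played
    out = []
    # ISO dates sort chronologically as plain strings.
    for game in sorted(games, key=lambda g: g["date"]):
        key = (game["season"], game["team"])
        played = counts[key]
        avg = totals[key] / played if played else None  # no history yet
        out.append({**game, "games_played": played, f"avg_{stat}": avg})
        # Update the running totals only after recording pre-game values,
        # so the current game never leaks into its own features.
        totals[key] += game[stat]
        counts[key] += 1
    return out


# Made-up sample rows, for illustration only.
games = [
    {"season": 2024, "date": "2024-09-08", "team": "KC", "points": 27},
    {"season": 2024, "date": "2024-09-15", "team": "KC", "points": 17},
    {"season": 2024, "date": "2024-09-22", "team": "KC", "points": 22},
]
rows = add_pregame_averages(games, "points")
```

A team's first game of a season has no history, so the sketch leaves its average as `None` rather than guessing a value.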

Installation

Install from PyPI:

pip install nfl-data-pipeline

Or install from source with Poetry:

git clone https://github.com/thadhutch/nfl-data-pipeline.git
cd nfl-data-pipeline
poetry install

Prerequisites

Requirement	Why
Python 3.12+	Runtime
Google Chrome	PFF scraper uses Selenium to render client-side data
PFF Premium subscription	Authenticates access to PFF team grades
Rotating proxies (CSV)	PFR rate-limits aggressively; proxies prevent blocks

Configuration

Create a .env file (see .env.example) or export environment variables directly:

cp .env.example .env

Variable	Default	Description
`NFL_SEASONS`	`2024`	Comma-separated seasons to scrape from PFF
`NFL_START_YEAR`	`2024`	First year for PFR boxscore URL collection
`NFL_END_YEAR`	`2024`	Last year for PFR boxscore URL collection
`NFL_MAX_WEEK`	`18`	Final week to scrape in the last season
`NFL_DATA_DIR`	`data`	Base directory for all output files
`NFL_PROXY_FILE`	`proxies/proxies.csv`	Path to proxy list (`address:port:user:password` per line)
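
To make the table concrete, here is a minimal sketch of how these variables and the proxy line format could be read in plain Python. The `Settings` class, `load_settings`, and `parse_proxy_line` are hypothetical names, not the package's API, and the `http://` scheme in the proxy URL is an assumption. The sketch reads the process environment only and does not parse the .env file itself.

```python
import os
from dataclasses import dataclass


@dataclass(frozen=True)
class Settings:
    seasons: tuple   # NFL_SEASONS, parsed to ints
    start_year: int  # NFL_START_YEAR
    end_year: int    # NFL_END_YEAR
    max_week: int    # NFL_MAX_WEEK
    data_dir: str    # NFL_DATA_DIR
    proxy_file: str  # NFL_PROXY_FILE


def load_settings(env=None):
    """Read the documented variables, falling back to the documented defaults."""
    env = os.environ if env is None else env
    raw_seasons = env.get("NFL_SEASONS", "2024")
    return Settings(
        seasons=tuple(int(s) for s in raw_seasons.split(",") if s.strip()),
        start_year=int(env.get("NFL_START_YEAR", "2024")),
        end_year=int(env.get("NFL_END_YEAR", "2024")),
        max_week=int(env.get("NFL_MAX_WEEK", "18")),
        data_dir=env.get("NFL_DATA_DIR", "data"),
        proxy_file=env.get("NFL_PROXY_FILE", "proxies/proxies.csv"),
    )


def parse_proxy_line(line):
    """Turn one `address:port:user:password` line into a proxy URL."""
    # maxsplit=3 keeps any extra colons inside the password.
    address, port, user, password = line.strip().split(":", 3)
    # The http:// scheme is an assumption; the source only gives the field order.
    return f"http://{user}:{password}@{address}:{port}"


defaults = load_settings({})
custom = load_settings({"NFL_SEASONS": "2022, 2023", "NFL_MAX_WEEK": "10"})
proxy_url = parse_proxy_line("10.0.0.1:8080:alice:s3cret\n")
```

Passing a plain dict instead of `os.environ` keeps the loader easy to test without touching the real environment.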

Usage

CLI

The nfl-pipeline command is available after installation.

Note: The first time you run nfl-pipeline scrape pff, a Chrome window will open for you to log in to PFF manually. After login, cookies are saved locally and reused for future runs.

# Full end-to-end pipeline
nfl-pipeline pipeline

# Scrape from a single source
nfl-pipeline scrape pff          # PFF grades (scrape + date parsing + name normalization)
nfl-pipeline scrape pfr          # PFR game data (URLs + scrape + date/name normalization)

# Run all post-processing steps
nfl-pipeline process all

# Run individual processing steps
nfl-pipeline process merge
nfl-pipeline process over-under
nfl-pipeline process averages
nfl-pipeline process games-played
nfl-pipeline process rankings

# Check installed version
nfl-pipeline --version

Python API

Every pipeline step is importable:

import nfl_data_pipeline

# Scraping
nfl_data_pipeline.scrape_pff_data()
nfl_data_pipeline.collect_boxscore_urls()
nfl_data_pipeline.scrape_all_game_info()

# Processing
nfl_data_pipeline.merge_datasets()
nfl_data_pipeline.process_over_under()
nfl_data_pipeline.compute_rolling_averages()
nfl_data_pipeline.add_games_played()
nfl_data_pipeline.compute_rankings()

Or run an entire pipeline at once:

from nfl_data_pipeline.pipeline import (
    run_full_pipeline,
    run_pff_pipeline,
    run_pfr_pipeline,
    run_processing_pipeline,
)

run_full_pipeline()        # end-to-end
run_pff_pipeline()         # PFF scraping chain only
run_pfr_pipeline()         # PFR scraping chain only
run_processing_pipeline()  # post-processing only

Make Targets

A Makefile is included for common development workflows:

make all            # Full pipeline end-to-end
make pff            # PFF scraping + processing chain
make pfr            # PFR scraping + processing chain
make merge          # Merge PFF and PFR data (runs both chains first)
make rankings       # Full postprocessing through rankings
make test           # Run the test suite
make clean          # Remove all generated data files
make dirs           # Create data directory structure

Pipeline Architecture

PFF Scrape              PFR Scrape
    |                       |
    v                       v
Extract Dates          Normalize Dates
    |                       |
    v                       v
Normalize Names        Normalize Names
    |                       |
    +----------+   +--------+
               |   |
               v   v
              Merge
                |
                v
           Over/Under
                |
                v
         Rolling Averages
                |
                v
          Games Played
                |
                v
            Rankings
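
The Merge step in the diagram is described under Features as an inner join on date and team columns. The sketch below shows what such a join looks like in plain Python, assuming each source has already been normalized to share `date` and `team` keys. Those key names, the `inner_join` helper, and the sample values are illustrative assumptions; the real column names and merge code may differ.

```python
def inner_join(pff_rows, pfr_rows, keys=("date", "team")):
    """Keep only rows whose key columns appear in both inputs.

    Assumes each key combination is unique within `pfr_rows`.
    """
    # Index the PFR side by its key tuple for constant-time lookups.
    pfr_index = {tuple(row[k] for k in keys): row for row in pfr_rows}
    merged = []
    for row in pff_rows:
        match = pfr_index.get(tuple(row[k] for k in keys))
        if match is not None:
            # Combine the columns of both sources into one record.
            merged.append({**match, **row})
    return merged


# Made-up sample rows, for illustration only.
pff = [
    {"date": "2024-09-08", "team": "KC", "offense_grade": 81.2},
    {"date": "2024-09-08", "team": "BUF", "offense_grade": 77.0},
]
pfr = [
    {"date": "2024-09-08", "team": "KC", "over_under": 46.5},
]
merged = inner_join(pff, pfr)
```

Because the join is inner, a game present in only one source is dropped silently, which is why the date and team normalization steps run before the merge.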

Output files are written under NFL_DATA_DIR (default: data/). The paths below assume the default:

Stage	Output
PFF scrape	`data/pff/raw_team_data.csv`
PFF normalized	`data/pff/normalized_team_data.csv`
PFR URLs	`data/pfr/boxscores_urls.txt`
PFR normalized	`data/pfr/final_pfr_odds.csv`
Merged	`data/pff_and_pfr_data.csv`
Final dataset	`data/over-under/v1-dataset-gp-ranked.csv`
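
The final dataset adds per-date rankings across all teams. A minimal sketch of that idea follows, assuming higher values rank first and tied values share the best rank. The `rank_by_date` helper, the column names, and the sample values are hypothetical, and the sketch expects every row to carry a numeric value for the stat.

```python
from collections import defaultdict


def rank_by_date(rows, stat):
    """Add a 1-based rank of `stat` among all rows sharing the same date."""
    by_date = defaultdict(list)
    for row in rows:
        by_date[row["date"]].append(row)
    ranked = []
    for date in sorted(by_date):
        # Highest value gets rank 1; tied values share the lowest rank number.
        values = sorted((r[stat] for r in by_date[date]), reverse=True)
        for row in by_date[date]:
            ranked.append({**row, f"{stat}_rank": values.index(row[stat]) + 1})
    return ranked


# Made-up sample rows, for illustration only.
rows_in = [
    {"date": "2024-09-15", "team": "KC", "avg_points": 27.0},
    {"date": "2024-09-15", "team": "BUF", "avg_points": 31.0},
    {"date": "2024-09-15", "team": "NYJ", "avg_points": 27.0},
]
ranked = rank_by_date(rows_in, "avg_points")
ranks = {r["team"]: r["avg_points_rank"] for r in ranked}
```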

Project Structure

nfl-data-pipeline/
├── src/nfl_data_pipeline/
│   ├── __init__.py           # Public API re-exports
│   ├── _config.py            # Paths, env vars, logging
│   ├── cli.py                # Click CLI entry point
│   ├── pipeline.py           # Pipeline orchestrators
│   ├── teams.py              # Team name/abbreviation mappings
│   ├── scrapers/
│   │   ├── pff.py            # PFF grades scraper (Selenium)
│   │   ├── pfr.py            # PFR game data scraper
│   │   ├── pfr_urls.py       # PFR boxscore URL collector
│   │   ├── auth.py           # PFF authentication
│   │   └── proxies.py        # Proxy loading utilities
│   ├── parsers/
│   │   ├── pff_dates.py      # PFF date extraction
│   │   ├── pff_teams.py      # PFF team name normalization
│   │   ├── pfr_dates.py      # PFR date normalization
│   │   └── pfr_teams.py      # PFR team name extraction
│   └── processing/
│       ├── merge.py          # Merge PFF + PFR datasets
│       ├── over_under.py     # O/U betting line extraction
│       ├── rolling_averages.py
│       ├── games_played.py
│       └── rankings.py
├── tests/
├── pyproject.toml
├── Makefile
├── LICENSE
└── README.md

Development

# Clone and install with dev dependencies
git clone https://github.com/thadhutch/nfl-data-pipeline.git
cd nfl-data-pipeline
poetry install

# Run the test suite
poetry run pytest -v

# Run a specific test file
poetry run pytest tests/test_rolling_averages.py -v

CI runs automatically on every push to master and on pull requests via GitHub Actions. Releases are published to PyPI through Trusted Publishers.

Contributing

Contributions are welcome! To get started:

1. Fork the repository
2. Create a feature branch (git checkout -b feature/my-feature)
3. Make your changes and add tests where appropriate
4. Run the test suite (poetry run pytest -v)
5. Commit your changes (git commit -m "Add my feature")
6. Push to your fork (git push origin feature/my-feature)
7. Open a Pull Request

Please make sure all existing tests pass before submitting a PR.

Known Limitations

PFF scraping is DOM-dependent. The scraper relies on XPath selectors tied to PFF's frontend. If PFF changes its page structure, the selectors in scrapers/pff.py will need updating.
PFR scraping requires rotating proxies. Without them, requests will be rate-limited and blocked.
The PFF scraper is slow by design. It drives a real Chrome browser via Selenium because PFF renders data client-side.
PFF login requires manual interaction on first run. A Chrome window will open for you to log in. Cookies are cached afterward, so subsequent runs are fully automated.
Data files are not tracked in git. Run the pipeline to generate them, or bring your own data in the expected CSV format.
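
The cookie caching mentioned above can be pictured with the following sketch. `get_cookies()` and `add_cookie()` are standard Selenium WebDriver methods. The helper names, the cache file name, and the `FakeDriver` stand-in (used here only so the example runs without a browser) are assumptions and are not taken from scrapers/auth.py.

```python
import json
import tempfile
from pathlib import Path


def save_cookies(driver, path):
    """Write the browser session's cookies to a JSON file."""
    Path(path).write_text(json.dumps(driver.get_cookies()))


def load_cookies(driver, path):
    """Replay cached cookies into the driver; return False if none exist."""
    cookie_file = Path(path)
    if not cookie_file.exists():
        return False  # caller should fall back to manual login
    # With a real WebDriver, open a page on the cookie's domain first,
    # otherwise add_cookie() rejects the cookie.
    for cookie in json.loads(cookie_file.read_text()):
        driver.add_cookie(cookie)
    return True


class FakeDriver:
    """Stand-in exposing the two WebDriver methods used above (demo only)."""

    def __init__(self, cookies=None):
        self.cookies = list(cookies or [])

    def get_cookies(self):
        return list(self.cookies)

    def add_cookie(self, cookie):
        self.cookies.append(cookie)


with tempfile.TemporaryDirectory() as tmp:
    cache = Path(tmp) / "pff_cookies.json"       # hypothetical cache location
    first_run = FakeDriver([{"name": "session", "value": "abc"}])
    missing = load_cookies(FakeDriver(), cache)  # no cache yet: manual login
    save_cookies(first_run, cache)               # after the manual login
    later_run = FakeDriver()
    restored = load_cookies(later_run, cache)    # later runs reuse the cache
```

Cached cookies eventually expire, so a real scraper would also need to detect a rejected session and fall back to manual login again.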

License

This project is licensed under the MIT License.


Project details

Maintainers

thadhutch

Release history

1.1.1, Feb 8, 2026
1.1.0, Feb 8, 2026 (this version)
1.0.1, Feb 8, 2026

Download files

Source Distribution

nfl_data_pipeline-1.1.0.tar.gz (22.4 kB), uploaded Feb 8, 2026, Source

Built Distribution

nfl_data_pipeline-1.1.0-py3-none-any.whl (28.9 kB), uploaded Feb 8, 2026, Python 3

File details

Details for the file nfl_data_pipeline-1.1.0.tar.gz.

File metadata

Download URL: nfl_data_pipeline-1.1.0.tar.gz
Upload date: Feb 8, 2026
Size: 22.4 kB
Tags: Source
Uploaded using Trusted Publishing? Yes
Uploaded via: twine/6.1.0 CPython/3.13.7

File hashes

Hashes for nfl_data_pipeline-1.1.0.tar.gz
Algorithm	Hash digest
SHA256	`aa706d6d15d2d4d8adc762f8c54270bf1068350d4f606fb44849cd49bd140aca`
MD5	`73340491ca174bef837869cd50617608`
BLAKE2b-256	`c44ab683db4fcf135156c5779c1da4e5c6d1aa8caf708106ae4c3d7db93306b6`

Provenance

The following attestation bundles were made for nfl_data_pipeline-1.1.0.tar.gz:

Publisher: publish.yml on thadhutch/nfl-data-pipeline

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

Statement:
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: nfl_data_pipeline-1.1.0.tar.gz
- Subject digest: aa706d6d15d2d4d8adc762f8c54270bf1068350d4f606fb44849cd49bd140aca
- Sigstore transparency entry: 927323448
- Sigstore integration time: Feb 8, 2026
Source repository:
- Permalink: thadhutch/nfl-data-pipeline@2c8e95d1467d79991259e7e367bdaf664b65df54
- Branch / Tag: refs/tags/v1.1.0
- Owner: https://github.com/thadhutch
- Access: public
Publication detail:
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: publish.yml@2c8e95d1467d79991259e7e367bdaf664b65df54
- Trigger Event: release

File details

Details for the file nfl_data_pipeline-1.1.0-py3-none-any.whl.

File metadata

Download URL: nfl_data_pipeline-1.1.0-py3-none-any.whl
Upload date: Feb 8, 2026
Size: 28.9 kB
Tags: Python 3
Uploaded using Trusted Publishing? Yes
Uploaded via: twine/6.1.0 CPython/3.13.7

File hashes

Hashes for nfl_data_pipeline-1.1.0-py3-none-any.whl
Algorithm	Hash digest
SHA256	`0695f3c98462a4eb632d28c3c53c776500f950384d7039d39587dd448ee9815c`
MD5	`97c335326b049d292519c3d74f187398`
BLAKE2b-256	`ef7f8c82c9e649ae4ca54554ddcd2c955ec18bbf3bc5a8121c61f6f0b826e2f2`

Provenance

The following attestation bundles were made for nfl_data_pipeline-1.1.0-py3-none-any.whl:

Publisher: publish.yml on thadhutch/nfl-data-pipeline

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

Statement:
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: nfl_data_pipeline-1.1.0-py3-none-any.whl
- Subject digest: 0695f3c98462a4eb632d28c3c53c776500f950384d7039d39587dd448ee9815c
- Sigstore transparency entry: 927323451
- Sigstore integration time: Feb 8, 2026
Source repository:
- Permalink: thadhutch/nfl-data-pipeline@2c8e95d1467d79991259e7e367bdaf664b65df54
- Branch / Tag: refs/tags/v1.1.0
- Owner: https://github.com/thadhutch
- Access: public
Publication detail:
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: publish.yml@2c8e95d1467d79991259e7e367bdaf664b65df54
- Trigger Event: release

nfl-data-pipeline 1.1.0
