Python package for loading and caching CSVs hosted on GitHub into pandas dataframes
nfelo DCM
nfelo DCM is an abstraction layer for loading and caching NFL-related CSVs stored on the web. DCM stands for Dataframe-CSV Mapping. The goal of the DCM is to load pandas dataframes of fresh data in a way that balances simplicity, efficiency, and performance.
import nfelodcm
## Load 2 dataframes
db = nfelodcm.load(['pbp', 'games'])
## access the PBP dataframe
db['pbp']
Maps
Maps are config files that tell the DCM where data CSVs are located, how they should be retrieved, and what fields to pull. Each CSV has its own config in Maps/{table}.json, where parameters can be set for things like freshness SLAs, CSV parsing engines, iteration strategy, and assignments (mutations).
An important characteristic of these maps is that all fields must be 1) specified in the map and 2) typed. Fields not listed in the map will not be loaded. Untyped fields will throw an error.
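The effect of a typed, explicit map can be illustrated with plain pandas. This is a minimal sketch, not the DCM's internal loader: `field_map` and `csv_text` are hypothetical stand-ins, and the only assumption is that a map translates naturally into the `usecols` and `dtype` arguments of `pandas.read_csv`.

```python
import io

import pandas as pd

# Hypothetical map: column name -> dtype, shaped like the "map" block of a config
field_map = {"game_id": "object", "season": "int32", "week": "int32"}

# Stand-in for a downloaded CSV; "extra_col" is not listed in the map
csv_text = "game_id,season,week,extra_col\n2023_01_DET_KC,2023,1,foo\n"

df = pd.read_csv(
    io.StringIO(csv_text),
    usecols=list(field_map),  # fields not in the map are never loaded
    dtype=field_map,          # every loaded field gets its declared type
    engine="c",
)

print(df.dtypes)
```

Because the column list and the dtypes come from the same dictionary, a field cannot be loaded without also being typed.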
Here is a sample config:
{
"name": "games",
"description": "nflgamedata games",
"download_url": "https://raw.githubusercontent.com/nflverse/nfldata/master/data/games.csv",
"compression": null,
"engine": "c",
"freshness": {
"type": "gh_commit",
"gh_api_endpoint": "https://api.github.com/repos/nflverse/nfldata/commits",
"gh_release_tag": null,
"sla_seconds": 500
},
"iter": {
"type": null,
"start": null
},
"assignments": [
"fastr_team_id_repl",
"score_clean"
],
"map": {
"game_id": "object",
"season": "int32",
"week": "int32",
...
}
}
Config Fields
| Field | Description |
|---|---|
| name | Table identifier |
| description | Human-readable description |
| download_url | URL to fetch CSV (use {0} placeholder for season in iter tables) |
| compression | Compression type ("gzip", null) |
| engine | Pandas CSV engine ("c", "python") |
| freshness.type | "gh_release" or "gh_commit" |
| freshness.gh_api_endpoint | GitHub API endpoint for freshness checks |
| freshness.gh_release_tag | Release tag for gh_release type |
| freshness.sla_seconds | Seconds before re-checking freshness |
| iter.type | "season" for multi-file tables, null for single file |
| iter.start | Starting year for season iteration |
| iter.accept_partial | Allow success if some season files fail |
| assignments | List of assignment function names to apply |
| map | Column name → dtype mapping |
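To make the `{0}` placeholder concrete, here is a hedged sketch of how a season-iterated `download_url` could expand into one URL per season. The function `season_urls`, the example URL, and the assumption that iteration runs from `iter.start` through the current season inclusive are all illustrative, not taken from the DCM source.

```python
def season_urls(download_url, start, current_season):
    """Map each season to its file URL by filling the {0} placeholder."""
    return {
        season: download_url.format(season)
        for season in range(start, current_season + 1)  # inclusive range (assumed)
    }


# Hypothetical URL template; real templates live in Maps/{table}.json
template = "https://example.com/data/pbp_{0}.csv.gz"
urls = season_urls(template, start=1999, current_season=2001)

for season, url in urls.items():
    print(season, url)
```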
Freshness
The DCM uses a two-tier freshness strategy:
- SLA Check: If the last freshness check was within sla_seconds, skip the remote check entirely
- Remote Check: Query GitHub API to compare remote timestamps against local state
For gh_release tables, freshness is determined by the updated_at timestamp of release assets. For gh_commit tables, freshness is based on the latest commit date.
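The two tiers can be sketched as a single decision function. This is an illustration under stated assumptions rather than the DCM's implementation: `needs_download` and `fetch_remote_timestamp` are hypothetical names, and the remote lookup is injected as a callable so the sketch runs without touching the GitHub API.

```python
from datetime import datetime, timedelta, timezone


def needs_download(now, last_freshness_check, last_local_update,
                   sla_seconds, fetch_remote_timestamp):
    """Return True when the local copy should be re-downloaded."""
    # Tier 1 (SLA check): a recent freshness check means no remote call at all
    if last_freshness_check is not None:
        age = (now - last_freshness_check).total_seconds()
        if age < sla_seconds:
            return False
    # Tier 2 (remote check): compare the remote timestamp with local state
    remote_timestamp = fetch_remote_timestamp()
    return last_local_update is None or remote_timestamp > last_local_update


now = datetime(2024, 9, 10, 12, 0, tzinfo=timezone.utc)
remote_calls = []


def fake_remote():
    # Stand-in for a GitHub API call; records that it was invoked
    remote_calls.append(1)
    return now - timedelta(hours=1)


last_update = now - timedelta(days=1)
checked_recently = needs_download(
    now, now - timedelta(seconds=100), last_update, 500, fake_remote)
checked_long_ago = needs_download(
    now, now - timedelta(seconds=900), last_update, 500, fake_remote)
print(checked_recently, checked_long_ago, len(remote_calls))
```

Only the second call reaches the remote lookup, which is the point of the SLA tier: repeated loads inside the window cost no API requests.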
Per-File Freshness (v0.2.1+)
For season-iterated tables (pbp, rosters, player_stats, etc.), the DCM tracks freshness per season. When an update is needed, only stale seasons are re-downloaded; cached seasons are read from Data/Parts/{table}/. This significantly reduces bandwidth for incremental updates.
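Selecting the stale seasons amounts to comparing two timestamp maps. The sketch below assumes per-season timestamps are available as dictionaries keyed by season; `stale_seasons` and the sample data are hypothetical and do not reflect the actual layout of the state files.

```python
from datetime import datetime, timezone


def stale_seasons(remote_updated, local_updated):
    """Return seasons that are missing locally or older than the remote copy."""
    return sorted(
        season
        for season, remote_ts in remote_updated.items()
        if season not in local_updated or remote_ts > local_updated[season]
    )


def ts(day):
    # Helper for building sample timestamps in September 2024
    return datetime(2024, 9, day, tzinfo=timezone.utc)


remote = {2022: ts(1), 2023: ts(1), 2024: ts(10)}
local = {2022: ts(1), 2023: ts(1), 2024: ts(3)}

to_download = stale_seasons(remote, local)
print(to_download)
```

In this sample only the 2024 file changed remotely, so the two older seasons would be served from the per-season cache.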
Data Storage
Data/
games.csv
pbp.csv # Combined table CSV
Parts/
pbp/
1999.csv # Per-season cache (iter tables only)
2000.csv
...
State/
Tables/
games.json # Per-table state (last_local_update, last_freshness_check)
pbp.json
Parts/
pbp.json # Per-season timestamps (iter tables only)
Global/
season_state.json # Current NFL season state
Assignments
Assignments are DataFrame transformations applied after data is pulled. They take a DataFrame as input and return a mutated DataFrame. Assignments are defined in Engine/Assignments/ and referenced by name in config files.
Common assignments include:
- fastr_team_id_repl - Standardize team abbreviations
- score_clean - Fix known data errors in game scores
- penalty_formatting - Parse penalty descriptions
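The "DataFrame in, DataFrame out" contract and the lookup by name can be sketched with a small registry. Everything here is hypothetical: `upper_team`, `add_margin`, `ASSIGNMENTS`, and `apply_assignments` are illustrative names, not the functions shipped in Engine/Assignments/.

```python
import pandas as pd


def upper_team(df):
    """Hypothetical assignment: normalize team codes to upper case."""
    df = df.copy()
    df["team"] = df["team"].str.upper()
    return df


def add_margin(df):
    """Hypothetical assignment: derive the scoring margin."""
    df = df.copy()
    df["margin"] = df["home_score"] - df["away_score"]
    return df


# Registry so a config can reference assignments by name
ASSIGNMENTS = {"upper_team": upper_team, "add_margin": add_margin}


def apply_assignments(df, names):
    """Apply the named assignments in the order they are listed."""
    for name in names:
        df = ASSIGNMENTS[name](df)
    return df


raw = pd.DataFrame({"team": ["kc"], "home_score": [27], "away_score": [20]})
result = apply_assignments(raw, ["upper_team", "add_margin"])
print(result)
```

Each function copies its input before mutating it, so the raw frame stays untouched and assignments can be reordered or tested in isolation.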
GitHub Token (Optional)
To increase GitHub API rate limits from 60/hr to 5,000/hr, create a .env file in your working directory:
GITHUB_TOKEN=ghp_your_token_here
The token is used for freshness checks only, not for downloading CSVs. A token is only needed when pulling many tables in quick succession, where freshness checks can exhaust the unauthenticated limit. In most use cases, the default rate limit is sufficient.
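For readers curious how such a token typically reaches the GitHub API, here is a minimal sketch that builds request headers from the environment. It assumes GITHUB_TOKEN has already been loaded into the process environment (for example from the .env file); `github_headers` is a hypothetical helper, not part of the nfelodcm API.

```python
import os


def github_headers(env=None):
    """Build GitHub API headers, adding auth only when a token is present."""
    env = os.environ if env is None else env
    headers = {"Accept": "application/vnd.github+json"}
    token = env.get("GITHUB_TOKEN")
    if token:
        # Authenticated requests get the higher rate limit
        headers["Authorization"] = f"Bearer {token}"
    return headers


anonymous = github_headers(env={})
authenticated = github_headers(env={"GITHUB_TOKEN": "ghp_your_token_here"})
print(sorted(anonymous), sorted(authenticated))
```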
API
import nfelodcm
# Load tables
db = nfelodcm.load(['pbp', 'games'])
# Get a single DataFrame
df = nfelodcm.get_df('pbp')
# Get table config
config = nfelodcm.get_map('games')
# List available tables
tables = nfelodcm.list_tables()
# Get current season state
season, week = nfelodcm.get_season_state('last_full_week')
Further Detailed Documentation
| File | Description |
|---|---|
| nfelodcm/Engine/Primatives/README.md | Core architecture of the DCM data pipeline |
| tests/README.md | Test suite for the DCM |