Statcast → BigQuery: idempotent ingestion + LLM-friendly docs + Baseball Savant verification

statcast-bigquery

Idempotent Statcast → BigQuery ingestion, with first-class documentation for SQL/LLM agents and round-trip validation against Baseball Savant.

Install

pip install statcast-bigquery

Quickstart

gcloud auth application-default login
statcast-bigquery sync \
    --start 2024-04-01 --end 2024-10-31 \
    --table myproject.mydataset.statcast_pitches

Backfill

Backfill historical seasons in resumable chunks:

statcast-bigquery sync \
    --start 2015-04-01 --end 2026-05-11 \
    --chunk-by year --resume \
    --table myproject.mydataset.statcast_pitches

--resume skips chunks already recorded as success in <dataset>._statcast_ingest_runs. Pass --runs-table to keep the run log in a sidecar dataset instead. Re-running with the same --chunk-by value is safe. Switching to a different value between runs (for example, from year to yearmonth) re-processes everything, because a chunk is skipped only when its boundaries exactly match a previously successful chunk.
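The exact-match rule can be illustrated with a small sketch. This is not the package's implementation: it assumes the run log identifies each chunk by its (start, end) date pair, and `chunk_ranges` and `pending_chunks` are hypothetical helper names used only for illustration.

```python
from datetime import date, timedelta


def chunk_ranges(start, end, by):
    """Split [start, end] into inclusive (chunk_start, chunk_end) date pairs."""
    chunks = []
    cur = start
    while cur <= end:
        if by == "year":
            nxt = date(cur.year + 1, 1, 1)
        elif by == "yearmonth":
            # First day of the following month, rolling over in December.
            nxt = date(cur.year + (cur.month == 12), cur.month % 12 + 1, 1)
        else:
            raise ValueError(f"unknown chunking: {by}")
        chunks.append((cur, min(nxt - timedelta(days=1), end)))
        cur = nxt
    return chunks


def pending_chunks(start, end, by, succeeded):
    """Return chunks whose exact boundaries are not in the success log."""
    return [c for c in chunk_ranges(start, end, by) if c not in succeeded]


start, end = date(2024, 4, 1), date(2024, 10, 31)

# Assumption: a previous run with yearly chunks succeeded.
succeeded = set(chunk_ranges(start, end, "year"))

same_chunking = pending_chunks(start, end, "year", succeeded)
other_chunking = pending_chunks(start, end, "yearmonth", succeeded)
print(len(same_chunking), len(other_chunking))
```

With the same chunking nothing is pending. With monthly chunks every chunk is pending, even though the dates are already covered, because no monthly boundary pair equals the yearly one.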

Documentation

statcast-bigquery docs --format llm > STATCAST_FOR_LLMS.md

Seed your data dictionary

If you maintain a data_dictionary table (one row per column with business definitions, tags, lineage), you can seed it directly:

statcast-bigquery docs --format dictionary --apply \
    --dataset mydataset --table myproject.mydataset.statcast_pitches \
    --dictionary-table myproject.shared_ops.data_dictionary

This atomically replaces only the rows for the given (dataset, table) pair; other entries in the dictionary table are left untouched. The target table must have this schema:

dataset, table, column, dtype, description, business_definition,
owner, tags ARRAY<STRING>, source_system, upstream_lineage_json,
created_at TIMESTAMP, updated_at TIMESTAMP
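The scoped, atomic replace described above can be sketched as a delete and insert inside a single transaction. The example below uses an in-memory SQLite database as a stand-in for BigQuery and a reduced four-column schema; `replace_entries`, the sample rows, and the descriptions are illustrative assumptions, not the package's code.

```python
import sqlite3


def replace_entries(conn, dataset, table, new_rows):
    """Replace all rows for (dataset, table) in one transaction."""
    with conn:  # commits on success, rolls back if an exception is raised
        conn.execute(
            'DELETE FROM data_dictionary WHERE "dataset" = ? AND "table" = ?',
            (dataset, table),
        )
        conn.executemany(
            'INSERT INTO data_dictionary ("dataset", "table", "column", description) '
            "VALUES (?, ?, ?, ?)",
            [(dataset, table, col, desc) for col, desc in new_rows],
        )


conn = sqlite3.connect(":memory:")
conn.execute(
    'CREATE TABLE data_dictionary '
    '("dataset" TEXT, "table" TEXT, "column" TEXT, description TEXT)'
)
conn.executemany(
    "INSERT INTO data_dictionary VALUES (?, ?, ?, ?)",
    [
        ("mydataset", "statcast_pitches", "old_col", "stale entry"),
        ("mydataset", "other_table", "id", "unrelated entry"),
    ],
)
conn.commit()

# Seed fresh entries for one table; the row for other_table must survive.
replace_entries(
    conn, "mydataset", "statcast_pitches",
    [("release_speed", "placeholder description")],
)
rows = sorted(conn.execute('SELECT "table", "column" FROM data_dictionary').fetchall())
print(rows)
```

Because the delete is filtered on both keys and shares a transaction with the insert, a failure partway through leaves the previous entries in place rather than an empty slice.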

Verification

statcast-bigquery verify \
    --source baseball-savant \
    --aggregation player-season \
    --metric all --season 2024 \
    --table myproject.mydataset.statcast_pitches

Standings verify

End-to-end integrity check: reconstruct each team's season standings from the pitch data and compare against MLB statsapi:

statcast-bigquery verify \
    --source baseball-savant \
    --aggregation team-season \
    --metric all --season 2024 \
    --table myproject.mydataset.statcast_pitches

Three metrics are checked per team: wins, losses, run_diff. Default tolerances are ±1 game and ±5 runs. A passing run gives high confidence that no games are missing from the pitch ingest.
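As a rough illustration of the tolerance check, the sketch below compares one team's reconstructed totals against reference totals. It assumes the tolerances are inclusive (a difference of exactly 1 game or 5 runs passes); `check_team` is a hypothetical helper and the numbers are made up for the example.

```python
# Default tolerances from the text: 1 game for wins and losses, 5 runs for run_diff.
TOLERANCES = {"wins": 1, "losses": 1, "run_diff": 5}


def check_team(reconstructed, reference, tolerances=TOLERANCES):
    """Return {metric: absolute difference} for metrics outside tolerance."""
    failures = {}
    for metric, tol in tolerances.items():
        diff = abs(reconstructed[metric] - reference[metric])
        if diff > tol:  # assumption: the boundary value itself passes
            failures[metric] = diff
    return failures


# Made-up totals for a single team.
reference = {"wins": 90, "losses": 72, "run_diff": 60}
within = {"wins": 89, "losses": 72, "run_diff": 55}
missing_games = {"wins": 87, "losses": 71, "run_diff": 48}

print(check_team(within, reference))
print(check_team(missing_games, reference))
```

A team with games missing from the ingest tends to fail on several metrics at once, which is what makes this a useful completeness signal.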

MIT licensed. This software does not include or distribute MLB data.
