
Statcast → BigQuery: idempotent ingestion + LLM-friendly docs + Baseball Savant verification


statcast-bigquery

Idempotent Statcast → BigQuery ingestion, with first-class documentation for SQL/LLM agents and round-trip validation against Baseball Savant.

Install

pip install statcast-bigquery

Quickstart

gcloud auth application-default login
statcast-bigquery sync \
    --start 2024-04-01 --end 2024-10-31 \
    --table myproject.mydataset.statcast_pitches

Backfill

Backfill historical seasons in resumable chunks:

statcast-bigquery sync \
    --start 2015-04-01 --end 2026-05-11 \
    --chunk-by year --resume \
    --table myproject.mydataset.statcast_pitches

--resume skips chunks already recorded as success in <dataset>._statcast_ingest_runs. Pass --runs-table to keep the run log in a separate (sidecar) dataset instead. Re-running with the same --chunk-by is safe. Switching --chunk-by between runs (for example, between year and month) re-processes the affected ranges, because a chunk is skipped only when its boundaries match a logged chunk exactly.
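The exact-match rule is easier to see in a small sketch. The Python below is not the package's implementation; `chunk_range` and `pending_chunks` are hypothetical helpers, and the assumption that `--chunk-by` accepts `year` and `month` with calendar-aligned boundaries is made only for illustration. It shows why a run log written with yearly chunks does not cause any monthly chunk to be skipped.

```python
from datetime import date, timedelta


def chunk_range(start: date, end: date, chunk_by: str) -> list[tuple[date, date]]:
    """Split [start, end] into inclusive calendar-year or calendar-month chunks."""
    if chunk_by not in ("year", "month"):
        raise ValueError(f"unsupported chunk_by: {chunk_by!r}")
    chunks = []
    cursor = start
    while cursor <= end:
        # First day of the next calendar year or month (assumed boundary rule).
        if chunk_by == "year" or cursor.month == 12:
            next_start = date(cursor.year + 1, 1, 1)
        else:
            next_start = date(cursor.year, cursor.month + 1, 1)
        chunk_end = min(next_start - timedelta(days=1), end)
        chunks.append((cursor, chunk_end))
        cursor = next_start
    return chunks


def pending_chunks(chunks, succeeded):
    """Keep only chunks whose exact (start, end) pair is not logged as success."""
    return [c for c in chunks if c not in succeeded]


# Pretend an earlier yearly run logged 2015 and 2016 as success.
done = set(chunk_range(date(2015, 4, 1), date(2016, 12, 31), "year"))

yearly = chunk_range(date(2015, 4, 1), date(2017, 10, 31), "year")
monthly = chunk_range(date(2015, 4, 1), date(2017, 10, 31), "month")

print(len(pending_chunks(yearly, done)))   # 1: only the 2017 chunk is left
print(len(pending_chunks(monthly, done)))  # 31: no monthly chunk matches a yearly one
```

Under these assumptions, changing the end date of a run also changes the boundaries of the final chunk, so that chunk would be processed again as well.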

Documentation

statcast-bigquery docs --format llm > STATCAST_FOR_LLMS.md

Seed your data dictionary

If you maintain a data_dictionary table (one row per column with business definitions, tags, lineage), you can seed it directly:

statcast-bigquery docs --format dictionary --apply \
    --dataset mydataset --table myproject.mydataset.statcast_pitches \
    --dictionary-table myproject.shared_ops.data_dictionary

This atomically replaces only the rows for the given (dataset, table) pair; all other entries in the dictionary table are left untouched. The target table must have the following schema:

dataset, table, column, dtype, description, business_definition,
owner, tags ARRAY<STRING>, source_system, upstream_lineage_json,
created_at TIMESTAMP, updated_at TIMESTAMP
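The replace semantics can be illustrated in memory. The sketch below is not how the tool talks to BigQuery; `replace_entries` is a hypothetical helper, the rows carry only a few of the columns listed above, and the values stored in the `dataset` and `table` columns are assumed. It shows the intended outcome: rows for the target (dataset, table) pair are swapped out as a unit, and every other row survives.

```python
def replace_entries(dictionary_rows, new_rows, dataset, table):
    """Return a new row list where entries for (dataset, table) are swapped out."""
    # Validate first, so a bad batch leaves the existing rows unchanged.
    for row in new_rows:
        if (row["dataset"], row["table"]) != (dataset, table):
            raise ValueError("new rows must all belong to the target (dataset, table)")
    kept = [
        r for r in dictionary_rows
        if (r["dataset"], r["table"]) != (dataset, table)
    ]
    return kept + list(new_rows)


# Illustrative rows only; column values are assumptions, not tool output.
existing = [
    {"dataset": "shared_ops", "table": "orders", "column": "order_id"},
    {"dataset": "mydataset", "table": "statcast_pitches", "column": "old_col_a"},
    {"dataset": "mydataset", "table": "statcast_pitches", "column": "old_col_b"},
]
fresh = [
    {"dataset": "mydataset", "table": "statcast_pitches", "column": "release_speed"},
]

updated = replace_entries(existing, fresh, "mydataset", "statcast_pitches")
print([r["column"] for r in updated])  # ['order_id', 'release_speed']
```

In a real warehouse the same effect would typically need the delete and the insert to commit together, which is what "atomically" refers to here.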

Verification

statcast-bigquery verify \
    --source baseball-savant \
    --aggregation player-season \
    --metric all --season 2024 \
    --table myproject.mydataset.statcast_pitches

Standings verify

End-to-end integrity check: reconstruct each team's season standings from the pitch data and compare against MLB statsapi:

statcast-bigquery verify \
    --source baseball-savant \
    --aggregation team-season \
    --metric all --season 2024 \
    --table myproject.mydataset.statcast_pitches

Three metrics are checked per team: wins, losses, run_diff. Default tolerances are ±1 game and ±5 runs. A passing run gives high confidence that no games are missing from the pitch ingest.
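As a rough illustration of the tolerance check, the Python sketch below compares one team's reconstructed totals with reference totals. It is not the package's code: `compare_team` is a hypothetical helper, the tolerances are treated as inclusive bounds (an assumption), and the numbers are invented sample values rather than real standings.

```python
# Default tolerances from the text: ±1 game for wins and losses, ±5 runs.
TOLERANCES = {"wins": 1, "losses": 1, "run_diff": 5}


def compare_team(reconstructed, reference, tolerances=TOLERANCES):
    """Return {metric: (reconstructed, reference)} for metrics outside tolerance."""
    failures = {}
    for metric, tol in tolerances.items():
        # Assumption: a difference exactly equal to the tolerance still passes.
        if abs(reconstructed[metric] - reference[metric]) > tol:
            failures[metric] = (reconstructed[metric], reference[metric])
    return failures


# Invented sample values for a single team.
from_pitches = {"wins": 90, "losses": 72, "run_diff": 48}
from_reference = {"wins": 91, "losses": 71, "run_diff": 60}

print(compare_team(from_pitches, from_reference))  # {'run_diff': (48, 60)}
```

A run differential that is off by more than the tolerance while wins and losses pass would suggest missing or duplicated scoring plays rather than whole missing games.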

MIT licensed. This software does not include or distribute MLB data.
