A package for quickly loading MLB Statcast pitch data from a HuggingFace dataset
statcast-pitches
pybaseball is a great tool for downloading baseball data. Even though the library is optimized and scrapes this data in parallel, scraping can still be time-consuming.
This repository uses GitHub Actions to scrape new baseball data weekly during the MLB season and update a Parquet file hosted as a HuggingFace dataset. Reading that dataset is much faster than re-scraping the data every time you rerun your code or want up-to-date Statcast pitch data.
The update.py script runs each week during the MLB season and updates the statcast-era-pitches HuggingFace dataset, so you don't have to re-scrape this data yourself.
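The weekly update described above amounts to an incremental scrape: only the dates missing from the stored dataset need to be fetched. The following is a minimal sketch of that idea, not the repository's actual update.py. The helper name `next_scrape_window` and its arguments are hypothetical.

```python
from datetime import date, timedelta
from typing import Optional, Tuple


def next_scrape_window(
    last_date_in_dataset: date,
    today: date,
) -> Optional[Tuple[date, date]]:
    """Return the (start, end) dates still missing from the dataset, or None.

    Hypothetical helper: the real update.py may work differently.
    """
    start = last_date_in_dataset + timedelta(days=1)
    if start > today:
        return None  # the dataset is already current
    return start, today


# Example: the dataset ends on June 2 and the job runs on June 9.
window = next_scrape_window(date(2024, 6, 2), date(2024, 6, 9))
print(window)
```

A scheduled job could pass such a window to a scraper and append the result to the Parquet file, so each run fetches roughly one week of pitches rather than the full history.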
You can explore the entire dataset in your browser at this link
Installation
pip install statcast-pitches
Usage
With statcast_pitches package
Example 1 w/ polars (suggested)
import statcast_pitches
import polars as pl
# load all pitches from 2015-present
pitches_lf = statcast_pitches.load()
# filter to get 2024 bat speed data
bat_speed_24_df = (
    pitches_lf
    .filter(pl.col("game_date").dt.year() == 2024)
    .select("bat_speed", "swing_length")
    .collect()
)
print(bat_speed_24_df.head(3))
output:
| | bat_speed | swing_length |
|---|---|---|
| 0 | 73.61710 | 6.92448 |
| 1 | 58.63812 | 7.56904 |
| 2 | 71.71226 | 6.46088 |
Notes
- Because statcast_pitches.load() returns a LazyFrame, the data loads much faster, and operations can be applied to it before 'collecting' it into memory. If it were loaded as a DataFrame, this code would take ~30-60 seconds; instead it runs in 2-8 seconds.
Example 2 w/ DuckDB
import statcast_pitches
# get bat tracking data from 2024
params = ("2024",)
query_2024_bat_speed = """
    SELECT bat_speed, swing_length
    FROM pitches
    WHERE
        YEAR(game_date) = ?
        AND bat_speed IS NOT NULL;
"""
bat_speed_24_df = statcast_pitches.load(
query=query_2024_bat_speed,
params=params,
).collect()
print(bat_speed_24_df.head(3))
output:
| | bat_speed | swing_length |
|---|---|---|
| 0 | 73.61710 | 6.92448 |
| 1 | 58.63812 | 7.56904 |
| 2 | 71.71226 | 6.46088 |
Notes:
- If no query is specified, all data from 2015-present is loaded as a LazyFrame.
- The table in your query MUST be called 'pitches', or the query will fail.
- Since load() returns a LazyFrame, I had to call pl.LazyFrame.collect() before calling head().
- This is slower than the polars approach above, but sometimes using SQL is fun.
With HuggingFace API (not recommended)
Pandas
import pandas as pd
df = pd.read_parquet("hf://datasets/Jensen-holm/statcast-era-pitches/data/statcast_era_pitches.parquet")
Polars
import polars as pl
df = pl.read_parquet('hf://datasets/Jensen-holm/statcast-era-pitches/data/statcast_era_pitches.parquet')
DuckDB
SELECT *
FROM 'hf://datasets/Jensen-holm/statcast-era-pitches/data/statcast_era_pitches.parquet';
HuggingFace Dataset
from datasets import load_dataset
ds = load_dataset("Jensen-holm/statcast-era-pitches")
Tidyverse
library(tidyverse)
library(arrow)  # read_parquet() is provided by arrow, not tidyverse

statcast_pitches <- read_parquet(
  "https://huggingface.co/datasets/Jensen-holm/statcast-era-pitches/resolve/main/data/statcast_era_pitches.parquet"
)
See the dataset on HuggingFace itself for more details.
Eager Benchmarking
| Eager Load Time (s) | API |
|---|---|
| 1421.103 | pybaseball |
| 26.899 | polars |
| 33.093 | pandas |
| 68.692 | duckdb |
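The source does not say how these timings were collected. As a rough sketch of one way to measure eager load times, the hypothetical helper `time_loaders` below runs each loader once and records wall-clock seconds. The two loaders shown are placeholders, not real dataset reads.

```python
import time
from typing import Callable, Dict


def time_loaders(loaders: Dict[str, Callable[[], object]]) -> Dict[str, float]:
    """Run each loader once and return elapsed wall-clock seconds per name."""
    results = {}
    for name, load in loaders.items():
        start = time.perf_counter()
        load()
        results[name] = time.perf_counter() - start
    return results


# Placeholder loaders; swap in real calls such as
# lambda: pl.read_parquet(URL) or lambda: pd.read_parquet(URL).
timings = time_loaders(
    {
        "fast": lambda: sum(range(1_000)),
        "slow": lambda: time.sleep(0.05),
    }
)

# Print the results from fastest to slowest.
for name, seconds in sorted(timings.items(), key=lambda kv: kv[1]):
    print(f"{name}: {seconds:.3f}s")
```

A single run is sensitive to network conditions and caching, so repeating each loader several times and reporting the median would give steadier numbers.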
⚠️ Data-Quality Warning ⚠️
MLB states that real-time pitch_type classification is automated and subject to change as the data is reviewed. Updates to the HuggingFace dataset do not currently account for this, so stored pitch_type values may differ from MLB's later revisions. pitch_type is the only column affected.
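If revised classifications matter for your analysis, one option is to re-scrape the affected games yourself and overwrite the stored values. The sketch below is a hypothetical illustration in plain Python: it assumes a pitch can be identified by the Statcast columns game_pk, at_bat_number, and pitch_number, and the sample keys and values are invented. It is not part of the statcast_pitches package.

```python
from typing import Dict, Iterable, Tuple

# Assumed unique key for a pitch: (game_pk, at_bat_number, pitch_number).
PitchKey = Tuple[int, int, int]


def apply_reclassifications(
    stored: Dict[PitchKey, str],
    refreshed: Iterable[Tuple[PitchKey, str]],
) -> int:
    """Overwrite stored pitch_type values with refreshed ones.

    Returns the number of pitches whose classification changed.
    """
    changed = 0
    for key, pitch_type in refreshed:
        if key in stored and stored[key] != pitch_type:
            stored[key] = pitch_type
            changed += 1
    return changed


# Invented example: the second pitch was re-labelled from slider to sweeper.
stored = {(745000, 1, 1): "FF", (745000, 1, 2): "SL"}
refreshed = [((745000, 1, 1), "FF"), ((745000, 1, 2), "ST")]
changed = apply_reclassifications(stored, refreshed)
print(changed)
```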
Contributing
Feel free to submit issues and PRs if you would like to contribute.
File details
Details for the file statcast_pitches-1.0.0.tar.gz.
File metadata
- Download URL: statcast_pitches-1.0.0.tar.gz
- Upload date:
- Size: 45.5 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.1.0 CPython/3.13.1
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 8f43cd267db9cfb7ea725f0f5815d4f6f30f06ded21bf5f56a88782d580217be |
| MD5 | 3f16c8501a35814d514889667b3d64d3 |
| BLAKE2b-256 | c55848aa01df910682e74dad9ed56668ab20e8e1ab04c019312422de7ef041b1 |
File details
Details for the file statcast_pitches-1.0.0-py3-none-any.whl.
File metadata
- Download URL: statcast_pitches-1.0.0-py3-none-any.whl
- Upload date:
- Size: 4.4 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.1.0 CPython/3.13.1
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | fa20e00a920805b7557b81003947c238c3471cac8fa2cc290895cff2c3767285 |
| MD5 | 3e77aec9eb9cd5f9b96370c1f2fbf2e6 |
| BLAKE2b-256 | fd0a0436cd99728daf928a8fe1c0d17b9b316026c4eae3e19373dac3f1240287 |