
A Python package for scraping baseball data.


pybaseballstats

A Python package for scraping baseball statistics from the web. Inspired by the pybaseball package by James LeDoux.




Available Sources

  1. Baseball Savant
    • This source provides high quality pitch-by-pitch data for all MLB games since 2015 as well as interesting leaderboards for various categories.
  2. Umpire Scorecards
    • This source provides umpire game logs and statistics for all MLB games since 2008.
  3. Baseball Reference
    • This source provides comprehensive, high detail stats for all MLB players and teams since 1871.
  4. Retrosheet
    • This source provides play-by-play data for all MLB games since 1871. In this package, Retrosheet data is currently used mainly for the player_lookup function and for ejection data. I am considering adding support for the play-by-play data as well.

Note: Earlier versions supported Fangraphs. I removed that support because Fangraphs recently introduced very aggressive anti-scraping measures that make its data very difficult to scrape. I may restore support if those measures change, but for now I am focusing on the other sources, which are more reliable and easier to scrape.

Installation

pybaseballstats can be installed using pip or any other package manager (I use uv).

Examples:

uv add pybaseballstats

or:

pip install pybaseballstats
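
To confirm that the install worked, one option is to ask Python for the installed version. The sketch below uses only the standard library; the helper name installed_version is hypothetical and is not part of pybaseballstats.

```python
from importlib.metadata import PackageNotFoundError, version
from typing import Optional


def installed_version(package_name: str) -> Optional[str]:
    """Return the installed version of a distribution, or None if it is missing.

    Hypothetical helper for illustration; not part of pybaseballstats.
    """
    try:
        return version(package_name)
    except PackageNotFoundError:
        return None


found = installed_version("pybaseballstats")
if found is None:
    print("pybaseballstats is not installed in this environment")
else:
    print(f"pybaseballstats {found} is installed")
```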

Documentation

Usage documentation can be found in this folder. This documentation is a work in progress and will be updated as I add more functionality to the package.

General Documentation (Things of Note)

  1. This project uses Polars internally, so every function in this package returns a Polars DataFrame. To get a Pandas DataFrame instead, call the .to_pandas() method on the result. For example:

import pybaseballstats.umpire_scorecards as us
# Fetch umpire game data as a Polars DataFrame
df_polars = us.game_data(start_date="2023-04-01", end_date="2023-04-30")
# Convert to Pandas DataFrame
df_pandas = df_polars.to_pandas()

  2. The BREF (Baseball Reference) functions use a singleton pattern: a single instance of the BREF scraper is created and reused for the lifetime of your program. This keeps you from exceeding rate limits and being blocked or given a longer timeout by BREF. As a result, don't be surprised if multiple calls to BREF functions run a little slower than expected.
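
To give a feel for why a shared singleton makes repeated BREF calls slower, here is a minimal sketch of the general idea. It is not the package's actual implementation: the class name RateLimitedSession, the wait_turn method, and the 0.05 second interval are all assumptions made up for illustration.

```python
import threading
import time


class RateLimitedSession:
    """Hypothetical shared scraper session; not the class pybaseballstats uses."""

    _instance = None
    _lock = threading.Lock()
    min_interval = 0.05  # assumed minimum gap between requests, in seconds

    def __new__(cls):
        # Build the single shared instance on first use, then keep returning it.
        with cls._lock:
            if cls._instance is None:
                instance = super().__new__(cls)
                instance._last_call = float("-inf")  # no request made yet
                cls._instance = instance
        return cls._instance

    def wait_turn(self) -> float:
        """Block until min_interval has passed since the previous call.

        Returns the number of seconds spent waiting.
        """
        with self._lock:
            now = time.monotonic()
            delay = max(0.0, self._last_call + self.min_interval - now)
            if delay > 0:
                time.sleep(delay)
            self._last_call = time.monotonic()
            return delay


a = RateLimitedSession()
b = RateLimitedSession()
print(a is b)  # True: every caller shares the same instance
```

Because every caller ends up with the same object, the timestamp of the last request is shared across the whole program, so back-to-back calls are spaced out no matter where in your code they come from.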

Contributing

Improvements and bug fixes are welcome! Please open an issue or submit a pull request.

If you open an issue, please keep in mind that I am enrolled in university full-time and may not be able to respond immediately. I work on this in my free time, but I will do my best to fix any issues that are opened.

To submit a pull request:

  1. Fork the repository and make your changes on a new branch.
  2. Add new tests if you are adding new functionality (updates to my existing tests are more than welcome as well).
  3. Check that your code is properly typed and passes all tests. If you have just installed, use the just mypy tests command. Otherwise, run the mypy and testing commands manually (check the justfile for details on these commands).
  4. Submit the pull request with a detailed description of the changes you made and why you made them, and I will review your changes.

Credit and Acknowledgement

This project was directly inspired by the pybaseball package by James LeDoux. The goal of this project is to provide a similar set of functionality with continual updates and improvements, as the original pybaseball package has lagged behind with updates and some key functionality has been broken.

All of the data scraped by this package is publicly available and free to use. All credit for the data goes to the organizations from which it was scraped.
