Python tool to download, ingest, and sample trip-data from NYC's Citi Bike network

Citibike-Sampler

A Python tool to facilitate work with data from NYC's Citi Bike network.


Why use this?

Data from the 'Citi Bike' system in NYC captures real-world patterns of urban mobility at very high resolution. As such, the data is widely used in research and practical applications.

However, working with the raw source data can be tedious. In a single year, the Citi Bike system records tens of millions of bike rides, amounting to several gigabytes of data. Furthermore, historical trip records are spread over hundreds of CSV files, and the archive layout is inconsistent over time (annual bundles before 2024, monthly archives after).
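To make the layout issue concrete, here is a minimal sketch of how a requested month range could be mapped to the archives that cover it. The helper names (parse_month, month_range, archive_kind, archives_needed) are hypothetical and not part of Citibike-Sampler's API. The sketch also assumes that 2024 itself already uses monthly archives, which the description above leaves open.

```python
from typing import Iterator, List, Optional, Tuple


def parse_month(text: str) -> Tuple[int, int]:
    """Parse a 'YYYY-M' string such as '2025-1' into (year, month)."""
    year_str, month_str = text.split("-")
    year, month = int(year_str), int(month_str)
    if not 1 <= month <= 12:
        raise ValueError(f"invalid month in {text!r}")
    return year, month


def month_range(start: str, end: str) -> Iterator[Tuple[int, int]]:
    """Yield every (year, month) from start to end, both inclusive."""
    year, month = parse_month(start)
    last = parse_month(end)
    while (year, month) <= last:
        yield year, month
        year, month = (year + 1, 1) if month == 12 else (year, month + 1)


def archive_kind(year: int) -> str:
    """Return 'annual' or 'monthly' for the given year.

    Assumption: the switch happens at the start of 2024.
    """
    return "annual" if year < 2024 else "monthly"


def archives_needed(start: str, end: str) -> List[Tuple[int, Optional[int]]]:
    """List the distinct archives covering the range, in chronological order.

    Annual bundles are keyed as (year, None), monthly archives as (year, month).
    """
    needed: List[Tuple[int, Optional[int]]] = []
    for year, month in month_range(start, end):
        key = (year, None) if archive_kind(year) == "annual" else (year, month)
        if key not in needed:
            needed.append(key)
    return needed
```

For a range that straddles the layout change, such as '2023-11' to '2024-2', this yields one annual bundle followed by two monthly archives.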

Citibike-Sampler streamlines your workflow by providing:

  • a convenient data downloader with consistent local caching;
  • a data loader for accessing the full trip records; and
  • a random sampler to draw representative subsets of the full Citi Bike data spanning multiple months or years.

Random sampling allows you to quickly explore multi-year trends in the Citi Bike data, without having to load hundreds of millions of records into memory.
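One way to get this memory benefit is to sample while streaming, so that only the kept records are ever held at once. The sketch below uses plain Bernoulli sampling (each record is kept independently with probability equal to the sampling fraction). This illustrates the general idea and is not necessarily how Citibike-Sampler implements it. The function name sample_stream is hypothetical.

```python
import random
from typing import Iterable, Iterator, Optional, TypeVar

T = TypeVar("T")


def sample_stream(
    chunks: Iterable[Iterable[T]],
    fraction: float,
    seed: Optional[int] = None,
) -> Iterator[T]:
    """Yield roughly `fraction` of all records, one chunk at a time.

    `chunks` can be any iterable of record batches (for example, one
    batch per monthly file). Only the current batch needs to be in memory.
    """
    if not 0.0 <= fraction <= 1.0:
        raise ValueError("fraction must lie in [0, 1]")
    rng = random.Random(seed)  # private generator, so results are reproducible
    for chunk in chunks:
        for record in chunk:
            # random() returns a float in [0, 1), so fraction=1.0 keeps everything
            if rng.random() < fraction:
                yield record


# Usage: two fake "monthly" batches of ride IDs, sampled at 1%.
batches = [range(0, 50_000), range(50_000, 100_000)]
kept = list(sample_stream(batches, fraction=0.01, seed=42))
```

Because every record is drawn independently, the sample size is only approximately fraction × total, which matches the "roughly 1%" behaviour described for the CLI.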


Installation

pip

Citibike-Sampler is available on PyPI and can be installed using pip:

pip install citibike-sampler

pipx (for CLI use)

If you only need data sampling from the command line, installation is best done using pipx:

pipx install git+https://github.com/lungoruscello/Citibike-Sampler.git

Usage

Python API

from citibike_sampler import sample, load_all, get_cache_dir

# Randomly sample 1% of all trip records from the first half of 2025.
# (Will automatically download data from AWS if not already cached.)
sample_df = sample(start='2025-1', end='2025-6', fraction=0.01, seed=42)

# Plot daily aggregates of sampled trips (assumes matplotlib is available)
sample_df.set_index('ended_at').resample('1D').ride_id.count().plot()

# Load the full dataset (be careful: millions of rides per month!)
full_df = load_all(start='2025-1', end='2025-6') 

print(len(sample_df) / len(full_df))  # check the sampling fraction

print(get_cache_dir())  # inspect the local cache location  

CLI

Generate a random sample of Citi Bike data directly from the terminal:

cbike_sampler --start 2025-1 --end 2025-6 --fraction 0.01 --seed 42 --output sampled.csv

This will create a sampled.csv file containing roughly 1% of all trip records from the first half of 2025. To store the sampling result as a Feather or Parquet file, simply change the suffix of the output filename accordingly (e.g., sampled.parquet).
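The same suffix-based choice of output format can be reproduced in your own scripts. The sketch below is an illustration, not the CLI's actual implementation. The helper names (writer_for, save_sample) are hypothetical, and the exact set of suffixes the CLI recognises is an assumption. The pandas methods it dispatches to (to_csv, to_feather, to_parquet) are standard DataFrame writers, and the latter two require pyarrow.

```python
from pathlib import Path

# Assumed mapping from filename suffix to the pandas DataFrame writer method.
_WRITERS = {
    ".csv": "to_csv",
    ".feather": "to_feather",
    ".parquet": "to_parquet",
}


def writer_for(path) -> str:
    """Return the name of the DataFrame method matching the file suffix."""
    suffix = Path(path).suffix.lower()
    if suffix not in _WRITERS:
        raise ValueError(f"unsupported output suffix: {suffix!r}")
    return _WRITERS[suffix]


def save_sample(df, path) -> None:
    """Write `df` (a pandas DataFrame) in the format implied by `path`."""
    method = writer_for(path)
    if method == "to_csv":
        df.to_csv(path, index=False)  # leave the row index out of the CSV
    else:
        getattr(df, method)(path)  # Feather/Parquet writers need pyarrow
```

With this in place, save_sample(sample_df, 'sampled.parquet') and save_sample(sample_df, 'sampled.csv') differ only in the filename.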

Requirements

  • Python 3.9 or higher
  • requests
  • pandas
  • tqdm
  • pyarrow (optional, for Parquet/Feather export)

Licence

MIT Licence. See LICENSE.txt for details.
