Research-grade ingestion and semantic quality audit (A1–A7) for GBFS bike-share feeds

gbfs-toolkit

License: MIT · Python 3.10+

Research-grade ingestion and semantic quality audit for GBFS bike-share feeds.

MobilityData's gbfs-validator checks that a feed is syntactically valid. gbfs-toolkit checks whether it is semantically trustworthy and analysis-ready — the A1–A7 quality taxonomy of Fossé & Pallares (gbfs-audit-catalogue) — and normalises feeds into a stable, version-independent data model you can reuse across studies.

Why

Every bike-share study re-implements the same plumbing — discover feeds, normalise GBFS 1.x/2.x/3.x, and (the hard part) cope with the semantic defects the syntactic validator cannot see: placeholder capacities, phantom docks, transposed coordinates, out-of-perimeter stations. This package consolidates that into one tested interface so the audit is a verdict per station, not a re-run of someone's notebook.

Install

pip install gbfs-toolkit            # from PyPI
pip install -e ".[dev]"            # from a local clone

Core depends only on numpy / scipy / pandas. Network discovery/fetch uses the optional [fetch] extra (requests).

Quick start

import gbfs_toolkit as gb

info, status = gb.load_example()          # bundled sample — no network needed
av = info.gbfs.join_status(status)        # fluent .gbfs accessor (or gb.join_availability)
clean = info.gbfs.drop_flagged()          # audit A1–A7 and keep the trustworthy stations
av.gbfs.occupancy()                       # bikes / (bikes + docks), NaN-safe

From your own feed:

import json

import gbfs_toolkit as gb

with open("station_information.json", encoding="utf-8") as f:     # file is closed once parsed
    raw = json.load(f)

stations = gb.to_canonical_station_info(raw, system_id="velib")   # version-independent frame
verdict  = gb.audit_static(stations)                              # A1–A7 per station
clean    = stations[~verdict["flagged"].to_numpy()]              # quality filter in one line

Every function is also a .gbfs accessor method, and pure (so df.pipe(gb.occupancy) works). gb.show_versions() prints an environment report for bug reports.
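To make the "NaN-safe" ratio concrete, here is a minimal plain-Python sketch of the idea. The function below is a hypothetical stand-in written for illustration, not the library's gb.occupancy, and it assumes that a missing count or an empty station (0 bikes, 0 docks) should yield NaN rather than raise.

```python
import math


def occupancy(num_bikes_available, num_docks_available):
    """bikes / (bikes + docks); NaN if an input is missing or the total is zero."""
    values = (num_bikes_available, num_docks_available)
    if any(v is None or math.isnan(v) for v in values):
        return math.nan
    total = num_bikes_available + num_docks_available
    if total == 0:
        return math.nan  # avoid ZeroDivisionError on stations reporting 0 / 0
    return num_bikes_available / total


# (bikes, docks) pairs: a normal station, an empty record, a missing dock count
rows = [(3, 9), (0, 0), (5, None)]
ratios = [occupancy(bikes, docks) for bikes, docks in rows]
```

Because the function only reads its arguments and returns a new value, it composes cleanly in a pipeline, which is the property the accessor and df.pipe usage rely on.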

Command line (the semantic counterpart to gbfs-validator):

gbfs audit station_information.json --system-id velib --out verdict.csv

The A1–A7 semantic taxonomy

| Flag | Rule | Signature | Level |
|------|------|-----------|-------|
| A1 | Out-of-domain inclusion | car-sharing advertised as bike-sharing | station |
| A2 | Placeholder capacity | constant non-zero capacity across a whole system | system |
| A3 | Structural over-capacity | free-floating fleet anchors | station |
| A4 | Geospatial error | transposed coords / stations far from neighbours (3σ) | station |
| A5 | Out-of-perimeter | system bounding box > 50,000 km² | system |
| A6 | Zero-capacity dock | ≥1% of docked stations declare capacity = 0 | system |
| A7 | Null capacity field | ≥50% of stations declare capacity = NaN | system |

Thresholds match the published catalogue, so verdicts reproduce.
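As a rough illustration of how the three capacity-based system-level rules (A2, A6, A7) could be evaluated, here is a plain-Python sketch. It is not the library's audit_static: the function names are hypothetical, the thresholds are the ones in the table, and two details are assumptions made for the example (A2 requires at least two declared capacities; the A6 denominator is the set of stations with a declared capacity).

```python
import math


def is_null(capacity):
    """True for None or float NaN."""
    return capacity is None or (isinstance(capacity, float) and math.isnan(capacity))


def system_level_flags(capacities):
    """Return A2 / A6 / A7 booleans for one system's list of station capacities."""
    n = len(capacities)
    known = [c for c in capacities if not is_null(c)]
    # A2: every declared capacity is the same non-zero value (assumed: >= 2 stations).
    a2 = len(known) >= 2 and len(set(known)) == 1 and known[0] != 0
    # A6: at least 1% of stations with a declared capacity declare 0.
    a6 = bool(known) and sum(c == 0 for c in known) / len(known) >= 0.01
    # A7: at least 50% of all stations have a null capacity.
    a7 = n > 0 and (n - len(known)) / n >= 0.50
    return {"A2": a2, "A6": a6, "A7": a7}


placeholder = system_level_flags([20] * 10)
zero_docks = system_level_flags([10, 0, 15, 12])
mostly_null = system_level_flags([None, None, 12, float("nan")])
```

Each flag is computed once per system, so in a per-station verdict table a system-level flag would apply to every station of the affected system.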

Canonical data model (the stable contract)

Ingestion is normalised once into version-independent frames; audit and analysis then operate purely on these. Downstream code depends on these column names, never on raw GBFS JSON.

  • StationInfo: system_id, station_id, name, lat, lon, capacity, station_type, is_virtual_station
  • StationStatus: system_id, station_id, num_bikes_available, num_docks_available, is_renting, is_returning, last_reported, fetched_at, gbfs_version
  • VehicleStatus: system_id, vehicle_id, vehicle_type_id, lat, lon, is_reserved, is_disabled, fetched_at, gbfs_version
  • AuditVerdict: system_id, station_id, A1…A7, flagged, reason

last_reported and fetched_at are tz-aware UTC timestamps (datetime64[ns, UTC]) so feeds from different cities merge unambiguously.
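The reason normalisation matters here is that GBFS 1.x/2.x publish timestamps as POSIX seconds, while GBFS 3.x publishes RFC 3339 strings. A minimal standard-library sketch of converting both to tz-aware UTC follows; the helper name to_utc is hypothetical and this is not the toolkit's own parser.

```python
from datetime import datetime, timezone


def to_utc(value):
    """Parse a GBFS timestamp into a tz-aware UTC datetime.

    Accepts POSIX seconds (GBFS 1.x/2.x) or an RFC 3339 string (GBFS 3.x).
    """
    if isinstance(value, (int, float)):
        return datetime.fromtimestamp(value, tz=timezone.utc)
    # datetime.fromisoformat rejects a trailing "Z" before Python 3.11
    parsed = datetime.fromisoformat(value.replace("Z", "+00:00"))
    if parsed.tzinfo is None:
        raise ValueError(f"timestamp has no UTC offset: {value!r}")
    return parsed.astimezone(timezone.utc)


from_v2 = to_utc(1700000000)
from_v3 = to_utc("2023-11-14T23:13:20+01:00")
```

Both values above denote the same instant, so once converted they compare equal regardless of the offset the operator chose to publish.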

Daily ergonomics

import gbfs_toolkit as gb

# discover by city (you rarely know the system_id)
cat   = gb.systems_catalog()
paris = gb.filter_catalog(cat, country_code="FR", city="Paris")

feed  = gb.GBFSFeed.from_url(url)
feed.summary()                       # one-glance card: stations, bikes, staleness, version
avail = feed.availability()          # bikes/docks + name/coords/capacity, one frame
avail["state"] = gb.station_state(avail)          # empty / full / disabled / normal
problems = gb.audit_dynamic(avail)                # negative counts, over-capacity, stale
near  = gb.find_nearest_stations(48.85, 2.35, feed.station_information(), k=3)

# many systems at once (threaded), broken feeds isolated as Exceptions
feeds = gb.fetch_multiple(["velib", "bixi", "lyon"], max_workers=5)

Longitudinal data lake

Turn a stream of snapshots into an analysis-ready panel. The library owns the formatting / dedup / I/O; your orchestrator (cron, Airflow…) owns the polling loop. Requires the optional [parquet] extra (pyarrow).

import gbfs_toolkit as gb

# in your poller (every N minutes):
gb.append_to_parquet(feed.station_status(), "lake/")   # Hive-partitioned by system_id/date

# in your analysis:
panel = gb.build_availability_panel("lake/", system_id="velib",
                                    start_time="2026-06-01", resample_freq="5min")
flow  = gb.calculate_net_flow(panel)   # Δ bikes/station per poll (observed flow only)

build_availability_panel filters partitions before loading (memory-bounded), de-duplicates redundant polls (same station_id + last_reported), and optionally resamples each station to a fixed cadence.
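The de-duplication key and the per-poll delta can be sketched in plain Python as follows. This is an illustration of the two ideas under simple assumptions (rows as dicts, integer timestamps), with hypothetical function names; it is not how build_availability_panel or calculate_net_flow are implemented.

```python
def dedupe_polls(rows):
    """Keep the first row seen for each (station_id, last_reported) pair."""
    seen, kept = set(), []
    for row in rows:
        key = (row["station_id"], row["last_reported"])
        if key not in seen:
            seen.add(key)
            kept.append(row)
    return kept


def net_flow(rows):
    """Change in num_bikes_available between consecutive polls of each station."""
    previous, deltas = {}, []
    for row in sorted(rows, key=lambda r: (r["station_id"], r["last_reported"])):
        sid = row["station_id"]
        if sid in previous:
            delta = row["num_bikes_available"] - previous[sid]
            deltas.append((sid, row["last_reported"], delta))
        previous[sid] = row["num_bikes_available"]
    return deltas


polls = [
    {"station_id": "a", "last_reported": 100, "num_bikes_available": 5},
    {"station_id": "a", "last_reported": 100, "num_bikes_available": 5},  # redundant poll
    {"station_id": "a", "last_reported": 400, "num_bikes_available": 3},
    {"station_id": "b", "last_reported": 100, "num_bikes_available": 0},
    {"station_id": "b", "last_reported": 400, "num_bikes_available": 2},
]
unique = dedupe_polls(polls)
flows = net_flow(unique)
```

The delta is an observed net change only: a rental and a return at the same station between two polls cancel out and leave no trace.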

Station clustering ([cluster])

Three lenses on "which stations belong together" — spatial, topological, behavioural:

gb.cluster_spatial(info, method="hdbscan")          # density zones (projected metres)
gb.cluster_spectral(info, k=6)                       # network/topology groups
gb.cluster_diurnal_profiles(panel, n_clusters=4)    # daily-rhythm typologies ⭐

cluster_diurnal_profiles turns the longitudinal panel into station typologies, e.g. "morning commuter origin" (full at night, empty by day) vs "recreational", from each station's 24-hour occupancy profile (robust to irregular sampling). Options include auto-k by silhouette, shape clustering (normalize="zscore"), soft GMM, DTW (method="dtw", extra [dtw]) and a weekday/weekend split. label_diurnal_typology then turns clusters into named types. This is the payoff of the data lake.
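One simple way to obtain a 24-hour occupancy profile that tolerates irregular sampling is to average samples within each hour of day before comparing stations. The sketch below shows that step only; the hour-binning approach and the name hourly_profile are assumptions for illustration, and the clustering itself is not shown.

```python
from collections import defaultdict
from datetime import datetime, timezone


def hourly_profile(samples):
    """Mean occupancy per hour of day (0-23) from (timestamp, occupancy) samples.

    Hours with no sample are returned as None.
    """
    buckets = defaultdict(list)
    for timestamp, occ in samples:
        buckets[timestamp.hour].append(occ)
    return [
        sum(buckets[hour]) / len(buckets[hour]) if buckets[hour] else None
        for hour in range(24)
    ]


utc = timezone.utc
samples = [
    (datetime(2026, 6, 1, 8, 5, tzinfo=utc), 0.25),
    (datetime(2026, 6, 1, 8, 50, tzinfo=utc), 0.75),  # second sample in the same hour
    (datetime(2026, 6, 1, 23, 0, tzinfo=utc), 1.0),
]
profile = hourly_profile(samples)
```

Averaging inside each bin means an hour polled ten times weighs the same as an hour polled once, which is what keeps uneven polling from distorting the profile.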

Multimodal — bikeshare ↔ transit

import pandas as pd
import gbfs_toolkit as gb

stops = pd.read_csv("gtfs/stops.txt")               # bring your own GTFS stops
linked = gb.link_transit_stops(info, stops, radius_m=200)
feeders = linked[linked["is_transit_feeder"]]       # first/last-mile docks near rail/bus

Linking is pure spatial proximity on GeoKDTree (no transit API, no schedules). The result carries is_transit_feeder, nearest_stop_dist_m and n_transit_within.

Station surroundings — what's around each dock ([osm])

# generic "what's nearby" — works for any point dataset (POIs, shops, …)
gb.features_within(info, pois, radius_m=300, category_col="amenity")  # n_within, n_cafe, …

# bring your own OSM frame (fetch it yourself, e.g. osmnx.features_from_point)
# one-shot context: transit feeders + OSM features, in one frame
ctx = gb.station_surroundings(info, transit=stops, osm=osm_gdf, radius_m=300)

The radius summarisation (counts + per-category breakdown + nearest distance) is the durable, tested core; data acquisition is Bring Your Own GeoDataFrame so the library never depends on a live Overpass endpoint. Routing / isochrones stay out of scope (use OSMnx / pandana).
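The three outputs of a radius summarisation (count, per-category breakdown, nearest distance) can be illustrated with a brute-force haversine loop. This is a sketch for understanding only: the function names and dict layout are hypothetical, and a real implementation would use a spatial index rather than comparing every pair.

```python
import math
from collections import Counter

EARTH_RADIUS_M = 6_371_000  # mean Earth radius in metres


def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle distance in metres between two WGS84 points."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_M * math.asin(math.sqrt(a))


def summarise_within(station, points, radius_m, category_key="amenity"):
    """Count points within radius_m of a station, per category, plus nearest distance."""
    distances = [
        (haversine_m(station["lat"], station["lon"], p["lat"], p["lon"]), p[category_key])
        for p in points
    ]
    inside = [(d, cat) for d, cat in distances if d <= radius_m]
    return {
        "n_within": len(inside),
        "per_category": dict(Counter(cat for _, cat in inside)),
        "nearest_m": min((d for d, _ in distances), default=math.nan),
    }


station = {"lat": 48.85, "lon": 2.35}
pois = [
    {"lat": 48.851, "lon": 2.35, "amenity": "cafe"},   # roughly 111 m north
    {"lat": 48.85, "lon": 2.352, "amenity": "cafe"},   # roughly 146 m east
    {"lat": 48.86, "lon": 2.35, "amenity": "bank"},    # roughly 1.1 km north
]
summary = summarise_within(station, pois, radius_m=300)
```

Note that the nearest distance is computed over all points, not only those inside the radius, so it stays informative for isolated stations.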

Descriptive stats — the bikeshare describe()

gb.system_profile(av)                       # stations, capacity, occupancy, % empty/full/…
gb.compare_systems({"velib": av1, "bixi": av2})   # one comparison row per city
gb.concentration_metrics(info)              # capacity Gini + top-decile hub share (equity)
gb.coverage_stats(info, zones=zones)        # density, nearest-neighbour, Clark–Evans dispersion
gb.availability_stats(panel)                # per-station: occupancy, peak hour, volatility

Standard spatial / inequality algorithms (numpy/scipy only, deterministic):

gb.morans_i(info, "occupancy")              # spatial autocorrelation (+ z-score / p-value)
gb.ripley_k(info, radii=[100, 250, 500])    # multi-scale clustering: L>0 clustered, <0 dispersed
gb.lorenz_curve(info)                       # inequality curve to plot (Gini/Theil in concentration_metrics)

Readable, comparable summaries — strictly descriptive (no OD/trip inference). system_profile is a one-glance numeric card of a snapshot; concentration_metrics is an equity lens (kept outside the published A1–A7 audit, since it's a metric not a quality verdict); availability_stats turns a longitudinal panel into per-station scalars (pass a target_tz panel for local-time peaks).
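For readers unfamiliar with the equity lens, the two headline numbers can be computed from a list of station capacities in a few lines. The sketch below uses the standard rank-weighted Gini formula; the top-decile rule (largest tenth of stations, at least one) and both function names are assumptions for illustration, not the definitions used by concentration_metrics.

```python
def gini(values):
    """Gini coefficient of non-negative values (0 = equal, near 1 = concentrated)."""
    xs = sorted(values)
    n, total = len(xs), sum(xs)
    if n == 0 or total == 0:
        return float("nan")
    weighted = sum(rank * x for rank, x in enumerate(xs, start=1))
    return 2 * weighted / (n * total) - (n + 1) / n


def top_decile_share(values):
    """Share of the total held by the largest 10% of entries (at least one entry)."""
    xs = sorted(values, reverse=True)
    total = sum(xs)
    if not xs or total == 0:
        return float("nan")
    k = max(1, len(xs) // 10)
    return sum(xs[:k]) / total


equal_network = gini([10, 10, 10, 10])
hub_network = gini([0, 0, 0, 40])
hub_share = top_decile_share([1] * 9 + [11])
```

A network of identical stations scores 0, while one where a single hub holds all capacity approaches 1 as the number of stations grows.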

Fleet reconciliation — where are the bikes, really?

tally = gb.reconcile_fleet_state(status, vehicles)   # or feed.reconcile_fleet()
tally["total_deployed"]        # on the street: stations + free-floating, overlap excluded
tally["total_rentable"]        # available in stations + available free-floating
tally["double_count_avoided"]  # vehicles a naive sum would have counted twice

GBFS reports the same fleet twice — aggregate docked counts in station_status and individual units (some parked at stations) in vehicle_status. Naively adding them double-counts every vehicle sitting at a dock. The reconciler excludes station-parked vehicles from the deployed total and surfaces the overlap instead of hiding it.
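The double-counting problem is easy to see on a toy example. The sketch below assumes each vehicle record may carry an optional station_id when it is parked at a dock (a field the GBFS spec allows on vehicle records), and it sums only num_bikes_available for brevity. The function name and returned keys other than those quoted above are hypothetical; this is not reconcile_fleet_state itself.

```python
def reconcile(station_status, vehicles):
    """Separate docked counts from free-floating vehicles to avoid double counting."""
    docked = sum(s["num_bikes_available"] for s in station_status)
    # Vehicles with a station_id are already included in the docked counts.
    at_station = [v for v in vehicles if v.get("station_id")]
    free_floating = [v for v in vehicles if not v.get("station_id")]
    return {
        "naive_total": docked + len(vehicles),
        "total_deployed": docked + len(free_floating),
        "double_count_avoided": len(at_station),
    }


station_status = [
    {"station_id": "a", "num_bikes_available": 3},
    {"station_id": "b", "num_bikes_available": 2},
]
vehicles = [
    {"vehicle_id": "v1", "station_id": "a"},   # parked at a dock, already counted
    {"vehicle_id": "v2", "station_id": None},  # free-floating
    {"vehicle_id": "v3"},                      # free-floating, field absent
]
tally = reconcile(station_status, vehicles)
```

The naive total and the reconciled total differ by exactly the number of station-parked vehicles, which is why reporting that overlap makes the correction auditable.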

Geofencing / service areas ([geo])

zones = gb.to_canonical_geofencing(raw, system_id="lime")  # GeoDataFrame of operator polygons
tagged = gb.zones_for_points(info, zones)                   # which zone each station sits in
density = len(info) / gb.zone_area_km2(zones).sum()         # stations per km² of *real* service area
no_park = tagged[tagged["station_parking"] == False]        # stations in park-restricted zones

For free-floating / hybrid systems the real footprint is the operator's polygons, not a convex hull of stations. to_canonical_geofencing parses geofencing_zones.json (v2.x ride_allowed and v3.x ride_start/ride_end_allowed reconciled), zones_for_points is the point-in-zone spatial join, and zone_area_km2 reprojects to an equal-area CRS so density is metric and latitude-comparable. The full per-vehicle-type rules list is preserved.
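The point-in-zone join boils down to a point-in-polygon test. A minimal ray-casting sketch in plain Python is shown below for intuition; it handles a single outer ring in GeoJSON (lon, lat) order and ignores holes and multipolygons. The function names are hypothetical, and a production join would normally rely on a geometry library instead.

```python
def point_in_ring(lon, lat, ring):
    """Ray casting: True if (lon, lat) lies inside the ring of (lon, lat) vertices."""
    inside = False
    for (x1, y1), (x2, y2) in zip(ring, ring[1:] + ring[:1]):
        # Does the edge straddle the horizontal line through the point?
        if (y1 > lat) != (y2 > lat):
            x_cross = x1 + (lat - y1) * (x2 - x1) / (y2 - y1)
            if lon < x_cross:
                inside = not inside  # each crossing to the east flips the state
    return inside


def zone_for_point(lon, lat, zones):
    """Return the name of the first zone containing the point, else None."""
    for name, ring in zones.items():
        if point_in_ring(lon, lat, ring):
            return name
    return None


zones = {
    "centre": [(2.30, 48.80), (2.40, 48.80), (2.40, 48.90), (2.30, 48.90), (2.30, 48.80)],
}
inside_zone = zone_for_point(2.35, 48.85, zones)
outside_zone = zone_for_point(2.50, 48.85, zones)
```

The test works on raw degrees, which is fine for containment; areas and densities are a different matter and need the equal-area reprojection described above.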

Polite scraping & provenance (research-grade)

session = gb.build_session()                 # pooled, retry/backoff on 429/5xx (default in fetch_multiple)
resp = gb.fetch_feed_json(url, etag=prev_etag)   # conditional GET; raises GBFSNotModified on HTTP 304
...
gb.coverage_report(panel, expected_freq="5min")  # per-station uptime / longest gap (no imputation)
gb.generate_manifest("lake/")                # SHA-256 per partition + summary → cite on Zenodo

Built for scrapers that run for months: retries/backoff, conditional GETs (skip unchanged snapshots), an offline catalogue cache, a GBFSError exception hierarchy, and provenance tools so a dataset is citable and verifiable. Missing data stays missing — coverage_report quantifies it rather than imputing.
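The idea behind a per-partition checksum manifest can be sketched with the standard library alone. The layout of the returned mapping and both function names are assumptions for illustration; the format actually written by generate_manifest may differ.

```python
import hashlib
from pathlib import Path


def sha256_of(path, chunk_size=1 << 20):
    """Stream a file through SHA-256 so large partitions never load fully in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_manifest(root):
    """Map each file under root (relative POSIX path) to its SHA-256 hex digest."""
    root = Path(root)
    files = sorted(p for p in root.rglob("*") if p.is_file())
    return {p.relative_to(root).as_posix(): sha256_of(p) for p in files}
```

Calling build_manifest("lake/") on a collected data lake would return one digest per file, which a reader of the published dataset can recompute to check that nothing changed after deposit.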

Examples

Runnable, end-to-end scripts live in examples/ — auditing an unknown feed, cron-driven collection into a Parquet lake, longitudinal analysis (coverage, typologies, turnover), and a network equity/coverage report.

Roadmap

  • v0.1: canonical model, catalogue discovery, cross-version normalisation, static audit (A1–A7), CLI.
  • v0.2: fetch/scrape (GBFSFeed, one-liners, fetch_multiple), dynamic audit (D1–D3), station_state, geo (GeoKDTree, find_nearest_stations), schema hardening.
  • v0.3 (this): longitudinal data lake: append_to_parquet, build_availability_panel, calculate_net_flow.
  • v0.4: cluster (spatial / spectral / diurnal profiles + named typologies).
  • v0.5: multimodal (bikeshare ↔ transit feeders, BYOG GTFS).
  • v0.6: osm / surroundings: features_within, station_surroundings, enrich_with_osm (BYOG infrastructure enrichment within a radius).
  • v0.7: hardening (nullable dtypes, dockless-aware A7, antimeridian A5, mass-conservation net flow) + geofencing (service-area polygons, point-in-zone joins, equal-area density), fleet reconciliation (docked ↔ free-floating dedup), and parquet column/predicate pushdown for large panels.

Methodology & limitations

METHODOLOGY.md documents the A1–A7 thresholds, the dynamic checks, the polling/aliasing limit on flows, and what the spatial statistics can and cannot claim — read it before building a study on the toolkit.

How to cite

See CITATION.cff. The semantic taxonomy is from the gbfs-audit-catalogue dataset paper (Fossé & Pallares, 2026).

License

MIT. Affiliated with CESI LINEACT (EA 7527), Montpellier, France.
