Research-grade ingestion and semantic quality audit (A1–A7) for GBFS bike-share feeds
Project description
gbfs-toolkit
Research-grade ingestion and semantic quality audit for GBFS bike-share feeds.
MobilityData's gbfs-validator checks
that a feed is syntactically valid. gbfs-toolkit checks whether it is semantically
trustworthy and analysis-ready — the A1–A7 quality taxonomy of Fossé & Pallares
(gbfs-audit-catalogue) — and
normalises feeds into a stable, version-independent data model you can reuse across
studies.
Why
Every bike-share study re-implements the same plumbing — discover feeds, normalise GBFS 1.x/2.x/3.x, and (the hard part) cope with the semantic defects the syntactic validator cannot see: placeholder capacities, phantom docks, transposed coordinates, out-of-perimeter stations. This package consolidates that into one tested interface so the audit is a verdict per station, not a re-run of someone's notebook.
Install
pip install gbfs-toolkit # from PyPI
pip install -e ".[dev]" # from a local clone
Core depends only on numpy / scipy / pandas. Network discovery/fetch uses the optional
[fetch] extra (requests).
Quick start
import gbfs_toolkit as gb
info, status = gb.load_example() # bundled sample — no network needed
av = info.gbfs.join_status(status) # fluent .gbfs accessor (or gb.join_availability)
clean = info.gbfs.drop_flagged() # audit A1–A7 and keep the trustworthy stations
av.gbfs.occupancy() # bikes / (bikes + docks), NaN-safe
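To make "NaN-safe" concrete, here is a minimal pandas sketch of the occupancy ratio. It is an illustration under assumptions, not the toolkit's implementation: the function name occupancy_sketch is hypothetical, and only the canonical column names come from this README.

```python
import pandas as pd


def occupancy_sketch(av: pd.DataFrame) -> pd.Series:
    """bikes / (bikes + docks), returning NaN when the ratio is undefined."""
    bikes = pd.to_numeric(av["num_bikes_available"], errors="coerce").astype("float64")
    docks = pd.to_numeric(av["num_docks_available"], errors="coerce").astype("float64")
    total = bikes + docks
    # Mask zero totals so 0/0 becomes NaN instead of an error or inf.
    return bikes / total.where(total > 0)


av_demo = pd.DataFrame({
    "station_id": ["a", "b", "c"],
    "num_bikes_available": [3, 0, None],   # station c did not report bikes
    "num_docks_available": [9, 0, 5],      # station b reports 0 bikes and 0 docks
})
occ = occupancy_sketch(av_demo)
```

Station a gets a ratio of 0.25, while b (zero denominator) and c (missing count) stay NaN rather than being silently coerced to 0.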
From your own feed:
import json
with open("station_information.json", encoding="utf-8") as f:
    raw = json.load(f)
stations = gb.to_canonical_station_info(raw, system_id="velib") # version-independent frame
verdict = gb.audit_static(stations) # A1–A7 per station
clean = stations[~verdict["flagged"].to_numpy()] # quality filter in one line
Every function is also a .gbfs accessor method, and pure (so df.pipe(gb.occupancy) works).
gb.show_versions() prints an environment report for bug reports.
Command line (the semantic counterpart to gbfs-validator):
gbfs audit station_information.json --system-id velib --out verdict.csv
The A1–A7 semantic taxonomy
| Flag | Rule | Signature | Level |
|---|---|---|---|
| A1 | Out-of-domain inclusion | car-sharing advertised as bike-sharing | station |
| A2 | Placeholder capacity | constant non-zero capacity across a whole system | system |
| A3 | Structural over-capacity | free-floating fleet anchors | station |
| A4 | Geospatial error | transposed coords / stations far from neighbours (3σ) | station |
| A5 | Out-of-perimeter | system bounding box > 50,000 km² | system |
| A6 | Zero-capacity dock | ≥1% of docked stations declare capacity = 0 | system |
| A7 | Null capacity field | ≥50% of stations declare capacity = NaN | system |
Thresholds match the published catalogue, so verdicts reproduce.
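As a rough illustration of how the system-level rules in the table could be evaluated, the sketch below applies the A2, A6 and A7 signatures to a canonical StationInfo frame. It is not the toolkit's code: the function name system_level_flags is hypothetical, the frame is assumed to contain only docked stations, and treating NaN capacities as "not declared" for A2 is an assumption.

```python
import pandas as pd


def system_level_flags(stations: pd.DataFrame) -> dict:
    """Evaluate A2 / A6 / A7 for one system from its `capacity` column."""
    cap = pd.to_numeric(stations["capacity"], errors="coerce")
    declared = cap.dropna()
    # A2: every declared capacity is the same non-zero value (placeholder).
    a2 = len(declared) > 1 and declared.nunique() == 1 and declared.iloc[0] != 0
    # A6: at least 1% of stations declare capacity == 0.
    a6 = len(cap) > 0 and (cap == 0).mean() >= 0.01
    # A7: at least 50% of stations have a null capacity.
    a7 = len(cap) > 0 and cap.isna().mean() >= 0.50
    return {"A2": bool(a2), "A6": bool(a6), "A7": bool(a7)}


placeholder_system = pd.DataFrame({"capacity": [20] * 10})
patchy_system = pd.DataFrame({"capacity": [10, 0, None, None]})

flags_placeholder = system_level_flags(placeholder_system)
flags_patchy = system_level_flags(patchy_system)
```

In this toy data the first system trips only A2, and the second trips A6 and A7 but not A2.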
Canonical data model (the stable contract)
Ingestion is normalised once into version-independent frames; audit and analysis then operate purely on these. Downstream code depends on these column names, never on raw GBFS JSON.
- StationInfo: system_id, station_id, name, lat, lon, capacity, station_type, is_virtual_station
- StationStatus: system_id, station_id, num_bikes_available, num_docks_available, is_renting, is_returning, last_reported, fetched_at, gbfs_version
- VehicleStatus: system_id, vehicle_id, vehicle_type_id, lat, lon, is_reserved, is_disabled, fetched_at, gbfs_version
- AuditVerdict: system_id, station_id, A1…A7, flagged, reason
last_reported and fetched_at are tz-aware UTC timestamps (datetime64[ns, UTC]) so
feeds from different cities merge unambiguously.
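For readers normalising timestamps by hand, the following sketch shows one way to arrive at tz-aware UTC values with pandas. It assumes GBFS 1.x/2.x-style POSIX seconds and GBFS 3.x-style RFC 3339 strings as inputs; the variable names are illustrative and the toolkit's own parsing may differ.

```python
import pandas as pd

# GBFS 1.x / 2.x style: POSIX seconds since the epoch.
raw_status = pd.DataFrame({
    "station_id": ["a", "b"],
    "last_reported": [1780272000, 1780272300],
})
raw_status["last_reported"] = pd.to_datetime(
    raw_status["last_reported"], unit="s", utc=True
)

# GBFS 3.x style: RFC 3339 strings with an explicit offset.
v3 = pd.to_datetime(pd.Series(["2026-06-01T02:00:00+02:00"]), utc=True)

# Both now share one timezone, so frames from different cities can be
# concatenated or merged without ambiguity.
same_instant = raw_status["last_reported"].iloc[0] == v3.iloc[0]
```

Both inputs describe the same instant (midnight UTC on 2026-06-01), which is exactly the comparison that breaks when one side is left as naive local time.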
Daily ergonomics
import gbfs_toolkit as gb
# discover by city (you rarely know the system_id)
cat = gb.systems_catalog()
paris = gb.filter_catalog(cat, country_code="FR", city="Paris")
feed = gb.GBFSFeed.from_url(url)
feed.summary() # one-glance card: stations, bikes, staleness, version
avail = feed.availability() # bikes/docks + name/coords/capacity, one frame
avail["state"] = gb.station_state(avail) # empty / full / disabled / normal
problems = gb.audit_dynamic(avail) # negative counts, over-capacity, stale
near = gb.find_nearest_stations(48.85, 2.35, feed.station_information(), k=3)
# many systems at once (threaded), broken feeds isolated as Exceptions
feeds = gb.fetch_multiple(["velib", "bixi", "lyon"], max_workers=5)
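The four station states named above can be derived from the canonical status columns. The sketch below is one plausible rule set, not the library's: the function name station_state_sketch is hypothetical, and both the definition of "disabled" (not renting or not returning) and its precedence over empty/full are assumptions.

```python
import numpy as np
import pandas as pd


def station_state_sketch(avail: pd.DataFrame) -> pd.Series:
    """Label each station as disabled / empty / full / normal."""
    disabled = ~(avail["is_renting"].astype(bool) & avail["is_returning"].astype(bool))
    empty = avail["num_bikes_available"] == 0
    full = avail["num_docks_available"] == 0
    # np.select takes the first matching condition, so "disabled" wins.
    labels = np.select(
        [disabled, empty, full],
        ["disabled", "empty", "full"],
        default="normal",
    )
    return pd.Series(labels, index=avail.index, name="state")


avail_demo = pd.DataFrame({
    "is_renting": [True, True, False, True],
    "is_returning": [True, True, True, True],
    "num_bikes_available": [0, 5, 0, 2],
    "num_docks_available": [5, 0, 3, 3],
})
states = station_state_sketch(avail_demo)
```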
Longitudinal data lake
Turn a stream of snapshots into an analysis-ready panel. The library owns the
formatting / dedup / I/O; your orchestrator (cron, Airflow…) owns the polling loop.
Requires the optional [parquet] extra (pyarrow).
import gbfs_toolkit as gb
# in your poller (every N minutes):
gb.append_to_parquet(feed.station_status(), "lake/") # Hive-partitioned by system_id/date
# in your analysis:
panel = gb.build_availability_panel("lake/", system_id="velib",
start_time="2026-06-01", resample_freq="5min")
flow = gb.calculate_net_flow(panel) # Δ bikes/station per poll (observed flow only)
build_availability_panel filters partitions before loading (memory-bounded),
de-duplicates redundant polls (same station_id + last_reported), and optionally
resamples each station to a fixed cadence.
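The de-duplication, resampling and net-flow steps described above can be sketched in plain pandas as follows. This is an illustration on a toy frame, assuming "last observation per bin" as the resampling rule; it is not how build_availability_panel or calculate_net_flow are necessarily implemented.

```python
import pandas as pd

polls = pd.DataFrame({
    "station_id": ["a", "a", "a", "a"],
    "last_reported": pd.to_datetime(
        ["2026-06-01 00:00", "2026-06-01 00:00",   # redundant poll
         "2026-06-01 00:05", "2026-06-01 00:15"],  # 00:10 was missed
        utc=True,
    ),
    "num_bikes_available": [4, 4, 6, 3],
})

# 1. Drop redundant polls: same station, same last_reported.
deduped = polls.drop_duplicates(subset=["station_id", "last_reported"])

# 2. Resample each station to a fixed 5-minute cadence (gaps stay NaN).
panel_demo = (
    deduped.set_index("last_reported")
    .groupby("station_id")["num_bikes_available"]
    .resample("5min")
    .last()
)

# 3. Observed net flow: change in bikes between consecutive bins.
flow_demo = panel_demo.groupby(level="station_id").diff()
```

The missed 00:10 poll remains NaN in the panel, so the flow into 00:10 and 00:15 is NaN as well instead of being interpolated.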
Station clustering ([cluster])
Three lenses on "which stations belong together" — spatial, topological, behavioural:
gb.cluster_spatial(info, method="hdbscan") # density zones (projected metres)
gb.cluster_spectral(info, k=6) # network/topology groups
gb.cluster_diurnal_profiles(panel, n_clusters=4) # daily-rhythm typologies ⭐
cluster_diurnal_profiles turns the longitudinal panel into station typologies,
e.g. "morning commuter origin" (full at night, empty by day) vs "recreational", from each
station's 24-hour occupancy profile (robust to irregular sampling). Options include
auto-k by silhouette, shape clustering (normalize="zscore"), soft GMM, DTW
(method="dtw", extra [dtw]), and a weekday/weekend split. label_diurnal_typology
then turns clusters into named types. This is the payoff of the data lake.
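To show what a "24-hour occupancy profile" looks like before any clustering, the sketch below averages occupancy per station and hour of day, then z-scores each row so that only the shape of the daily curve remains. The function name diurnal_profiles and the input column names are assumptions for illustration; hour-bin averaging is one simple way to tolerate irregular sampling, not necessarily the library's.

```python
import numpy as np
import pandas as pd


def diurnal_profiles(panel: pd.DataFrame, zscore: bool = True) -> pd.DataFrame:
    """One row per station, one column per hour of day (0..23)."""
    profile = (
        panel.assign(hour=panel["last_reported"].dt.hour)
        .groupby(["station_id", "hour"])["occupancy"]
        .mean()                      # irregular polls collapse into hour bins
        .unstack("hour")
        .reindex(columns=range(24))  # hours never observed stay NaN
    )
    if zscore:
        mu = profile.mean(axis=1)
        sd = profile.std(axis=1, ddof=0).replace(0, float("nan"))
        profile = profile.sub(mu, axis=0).div(sd, axis=0)
    return profile


start = pd.to_datetime("2026-06-01", utc=True)
stamps = start + pd.to_timedelta(np.arange(24), unit="h")
panel_demo24 = pd.concat([
    pd.DataFrame({"station_id": "a", "last_reported": stamps,
                  "occupancy": np.arange(24) / 23}),
    pd.DataFrame({"station_id": "b", "last_reported": stamps,
                  "occupancy": 1 - np.arange(24) / 23}),
], ignore_index=True)

profiles = diurnal_profiles(panel_demo24)
```

The resulting matrix (stations by hours) is the kind of input a k-means, GMM or DTW step would consume.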
Multimodal — bikeshare ↔ transit
import pandas as pd
stops = pd.read_csv("gtfs/stops.txt") # bring your own GTFS stops
linked = gb.link_transit_stops(info, stops, radius_m=200)
feeders = linked[linked["is_transit_feeder"]] # first/last-mile docks near rail/bus
Linking is pure spatial proximity on GeoKDTree (no transit API, no schedules) and adds
three columns: is_transit_feeder, nearest_stop_dist_m, n_transit_within.
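For intuition, a radius query of this kind can be reproduced with SciPy's cKDTree. The sketch below is a stand-in, not the library's GeoKDTree: it embeds coordinates on a unit sphere and converts chord lengths to great-circle metres, the function name link_stops_sketch is hypothetical, and the stop_lat / stop_lon column names follow the GTFS stops.txt convention.

```python
import numpy as np
import pandas as pd
from scipy.spatial import cKDTree

EARTH_RADIUS_M = 6_371_000.0


def _unit_xyz(lat_deg, lon_deg):
    """Project lat/lon (degrees) onto the unit sphere as 3-D points."""
    lat, lon = np.radians(lat_deg), np.radians(lon_deg)
    return np.column_stack(
        [np.cos(lat) * np.cos(lon), np.cos(lat) * np.sin(lon), np.sin(lat)]
    )


def link_stops_sketch(stations, stops, radius_m=200.0):
    tree = cKDTree(_unit_xyz(stops["stop_lat"].to_numpy(), stops["stop_lon"].to_numpy()))
    pts = _unit_xyz(stations["lat"].to_numpy(), stations["lon"].to_numpy())
    chord, _ = tree.query(pts, k=1)
    # Chord length on the unit sphere -> great-circle distance in metres.
    nearest_m = 2 * EARTH_RADIUS_M * np.arcsin(np.clip(chord / 2, 0.0, 1.0))
    # The search radius must be expressed as a chord length too.
    chord_radius = 2 * np.sin(radius_m / (2 * EARTH_RADIUS_M))
    hits = tree.query_ball_point(pts, r=chord_radius)
    out = stations.copy()
    out["nearest_stop_dist_m"] = nearest_m
    out["n_transit_within"] = [len(h) for h in hits]
    out["is_transit_feeder"] = out["n_transit_within"] > 0
    return out


stations_demo = pd.DataFrame({"station_id": ["s1", "s2"],
                              "lat": [48.85, 48.90], "lon": [2.35, 2.40]})
stops_demo = pd.DataFrame({"stop_id": ["t1", "t2"],
                           "stop_lat": [48.8505, 48.86], "stop_lon": [2.35, 2.35]})
linked_demo = link_stops_sketch(stations_demo, stops_demo, radius_m=200)
```

Here s1 sits roughly 56 m from stop t1 and is tagged as a feeder, while s2 is several kilometres from any stop.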
Station surroundings — what's around each dock ([osm])
# generic "what's nearby" — works for any point dataset (POIs, shops, …)
gb.features_within(info, pois, radius_m=300, category_col="amenity") # n_within, n_cafe, …
# bring your own OSM frame (fetch it yourself, e.g. osmnx.features_from_point)
# one-shot context: transit feeders + OSM features, in one frame
ctx = gb.station_surroundings(info, transit=stops, osm=osm_gdf, radius_m=300)
The radius summarisation (counts + per-category breakdown + nearest distance) is the durable, tested core; data acquisition is Bring Your Own GeoDataFrame so the library never depends on a live Overpass endpoint. Routing / isochrones stay out of scope (use OSMnx / pandana).
Descriptive stats — the bikeshare describe()
gb.system_profile(av) # stations, capacity, occupancy, % empty/full/…
gb.compare_systems({"velib": av1, "bixi": av2}) # one comparison row per city
gb.concentration_metrics(info) # capacity Gini + top-decile hub share (equity)
gb.coverage_stats(info, zones=zones) # density, nearest-neighbour, Clark–Evans dispersion
gb.availability_stats(panel) # per-station: occupancy, peak hour, volatility
Standard spatial / inequality algorithms (numpy/scipy only, deterministic):
gb.morans_i(info, "occupancy") # spatial autocorrelation (+ z-score / p-value)
gb.ripley_k(info, radii=[100, 250, 500]) # multi-scale clustering: L>0 clustered, <0 dispersed
gb.lorenz_curve(info) # inequality curve to plot (Gini/Theil in concentration_metrics)
Readable, comparable summaries — strictly descriptive (no OD/trip inference). system_profile
is a one-glance numeric card of a snapshot; concentration_metrics is an equity lens (kept
outside the published A1–A7 audit, since it's a metric not a quality verdict);
availability_stats turns a longitudinal panel into per-station scalars (pass a target_tz
panel for local-time peaks).
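As a reference point for the equity lens, the capacity Gini can be computed with a few lines of NumPy. This is the standard rank-based formula written as a sketch; the function name gini is illustrative and the toolkit's handling of missing or zero capacities may differ.

```python
import numpy as np


def gini(values) -> float:
    """Gini coefficient of non-negative values (0 = perfectly equal)."""
    x = np.asarray(values, dtype=float)
    x = np.sort(x[~np.isnan(x)])          # ignore missing capacities
    n = x.size
    if n == 0 or x.sum() == 0:
        return float("nan")
    ranks = np.arange(1, n + 1)
    # G = 2 * sum(i * x_i) / (n * sum(x)) - (n + 1) / n, with x sorted ascending
    return float(2 * np.sum(ranks * x) / (n * x.sum()) - (n + 1) / n)


equal_system = gini([5, 5, 5, 5])
hub_system = gini([0, 0, 0, 10])
```

Four identical stations give 0, while a system whose whole capacity sits in one of four stations gives 0.75, the maximum for n = 4 under this formula.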
Fleet reconciliation — where are the bikes, really?
tally = gb.reconcile_fleet_state(status, vehicles) # or feed.reconcile_fleet()
tally["total_deployed"] # on the street: stations + free-floating, overlap excluded
tally["total_rentable"] # available in stations + available free-floating
tally["double_count_avoided"] # vehicles a naive sum would have counted twice
GBFS reports the same fleet twice — aggregate docked counts in station_status and
individual units (some parked at stations) in vehicle_status. Naively adding them
double-counts every vehicle sitting at a dock. The reconciler excludes station-parked
vehicles from the deployed total and surfaces the overlap instead of hiding it.
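The double-counting argument above can be made concrete with a toy tally. The sketch assumes the vehicle frame carries a station_id column that is null for free-floating units (a hypothetical column here, since it is not in the canonical VehicleStatus list), and it counts only available docked bikes, which is a simplification of "deployed".

```python
import pandas as pd


def reconcile_sketch(status: pd.DataFrame, vehicles: pd.DataFrame) -> dict:
    docked = int(status["num_bikes_available"].sum())
    at_station = vehicles["station_id"].notna()   # already inside docked counts
    free_floating = int((~at_station).sum())
    return {
        "total_deployed": docked + free_floating,  # overlap excluded
        "double_count_avoided": int(at_station.sum()),
        "naive_total": docked + len(vehicles),     # what a plain sum would give
    }


status_demo = pd.DataFrame({"station_id": ["s1", "s2"],
                            "num_bikes_available": [3, 2]})
vehicles_demo = pd.DataFrame({
    "vehicle_id": ["v1", "v2", "v3", "v4"],
    "station_id": ["s1", None, None, "s2"],   # v1 and v4 are parked at docks
})
tally_demo = reconcile_sketch(status_demo, vehicles_demo)
```

The naive sum reports 9 vehicles; excluding the two station-parked units gives 7, and the difference is surfaced as double_count_avoided.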
Geofencing / service areas ([geo])
zones = gb.to_canonical_geofencing(raw, system_id="lime") # GeoDataFrame of operator polygons
tagged = gb.zones_for_points(info, zones) # which zone each station sits in
density = len(info) / gb.zone_area_km2(zones).sum() # stations per km² of *real* service area
no_park = tagged[tagged["station_parking"] == False] # stations in park-restricted zones
For free-floating / hybrid systems the real footprint is the operator's polygons, not a
convex hull of stations. to_canonical_geofencing parses geofencing_zones.json (v2.x
ride_allowed and v3.x ride_start/ride_end_allowed reconciled), zones_for_points is the
point-in-zone spatial join, and zone_area_km2 reprojects to an equal-area CRS so density is
metric and latitude-comparable. The full per-vehicle-type rules list is preserved.
Polite scraping & provenance (research-grade)
session = gb.build_session() # pooled, retry/backoff on 429/5xx (default in fetch_multiple)
resp = gb.fetch_feed_json(url, etag=prev_etag) # conditional GET; raises GBFSNotModified on HTTP 304
...
gb.coverage_report(panel, expected_freq="5min") # per-station uptime / longest gap (no imputation)
gb.generate_manifest("lake/") # SHA-256 per partition + summary → cite on Zenodo
Built for scrapers that run for months: retries/backoff, conditional GETs (skip unchanged
snapshots), an offline catalogue cache, a GBFSError exception hierarchy, and provenance tools
so a dataset is citable and verifiable. Missing data stays missing — coverage_report
quantifies it rather than imputing.
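A per-partition checksum manifest of the kind mentioned above needs only the standard library. The sketch below hashes every .parquet file under a directory; the function names and the returned dictionary layout are assumptions for illustration, not the output format of generate_manifest.

```python
import hashlib
import tempfile
from pathlib import Path


def sha256_of_file(path: Path, chunk_size: int = 1 << 20) -> str:
    """Stream a file through SHA-256 so large partitions fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def manifest_sketch(root) -> dict:
    """Map each partition file (relative POSIX path) to its SHA-256 digest."""
    root = Path(root)
    return {
        p.relative_to(root).as_posix(): sha256_of_file(p)
        for p in sorted(root.rglob("*.parquet"))
    }


# Tiny stand-in lake: the file content is arbitrary bytes, not real Parquet.
with tempfile.TemporaryDirectory() as lake:
    part = Path(lake) / "system_id=velib" / "date=2026-06-01" / "part-0.parquet"
    part.parent.mkdir(parents=True)
    part.write_bytes(b"abc")
    manifest = manifest_sketch(lake)
```

Publishing such a mapping next to the dataset lets a reader re-hash their download and confirm it matches what was analysed.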
Examples
Runnable, end-to-end scripts live in examples/ — auditing an unknown feed,
cron-driven collection into a Parquet lake, longitudinal analysis (coverage, typologies,
turnover), and a network equity/coverage report.
Roadmap
- v0.1: canonical model, catalogue discovery, cross-version normalisation, static audit (A1–A7), CLI.
- v0.2: fetch/scrape (GBFSFeed, one-liners, fetch_multiple), dynamic audit (D1–D3), station_state, geo (GeoKDTree, find_nearest_stations), schema hardening.
- v0.3 (this): longitudinal data lake (append_to_parquet, build_availability_panel, calculate_net_flow).
- v0.4: cluster (spatial / spectral / diurnal profiles + named typologies).
- v0.5: multimodal (bikeshare ↔ transit feeders, BYOG GTFS).
- v0.6: osm / surroundings (features_within, station_surroundings, enrich_with_osm; BYOG infrastructure enrichment within a radius).
- v0.7: hardening (nullable dtypes, dockless-aware A7, antimeridian A5, mass-conservation net flow) + geofencing (service-area polygons, point-in-zone joins, equal-area density), fleet reconciliation (docked ↔ free-floating dedup), and parquet column/predicate pushdown for large panels.
Methodology & limitations
METHODOLOGY.md documents the A1–A7 thresholds, the dynamic checks, the
polling/aliasing limit on flows, and what the spatial statistics can and cannot claim — read it
before building a study on the toolkit.
How to cite
See CITATION.cff. The semantic taxonomy is from the
gbfs-audit-catalogue dataset paper (Fossé & Pallares, 2026).
License
MIT. Affiliated with CESI LINEACT (EA 7527), Montpellier, France.
File details
Details for the file gbfs_toolkit-1.0.1.tar.gz.
File metadata
- Download URL: gbfs_toolkit-1.0.1.tar.gz
- Upload date:
- Size: 91.4 kB
- Tags: Source
- Uploaded using Trusted Publishing? Yes
- Uploaded via: twine/6.1.0 CPython/3.13.12
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | ac62c648e050d159d9adb94507d6a775cc714441261a07c1c16b4a8e225f0183 |
| MD5 | 6aa27e67d770fcfee946f27dee72f443 |
| BLAKE2b-256 | 7598e3e81958ae4932e214f4104fa83700e561a97daeef256b04bc65778242d4 |
Provenance
The following attestation bundles were made for gbfs_toolkit-1.0.1.tar.gz:
Publisher: release.yml on cycling-data-lab/gbfs-toolkit
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: gbfs_toolkit-1.0.1.tar.gz
- Subject digest: ac62c648e050d159d9adb94507d6a775cc714441261a07c1c16b4a8e225f0183
- Sigstore transparency entry: 1983078674
- Sigstore integration time:
- Permalink: cycling-data-lab/gbfs-toolkit@d226dab1dc9d948f9583b458045bbfd03b4d2805
- Branch / Tag: refs/tags/v1.0.1
- Owner: https://github.com/cycling-data-lab
- Access: public
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: release.yml@d226dab1dc9d948f9583b458045bbfd03b4d2805
- Trigger Event: push
File details
Details for the file gbfs_toolkit-1.0.1-py3-none-any.whl.
File metadata
- Download URL: gbfs_toolkit-1.0.1-py3-none-any.whl
- Upload date:
- Size: 74.7 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? Yes
- Uploaded via: twine/6.1.0 CPython/3.13.12
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | f8418cb0c72ccb83cedbc7e7b5b08972627cc5bc9f57e35aad8bf1a7da083aee |
| MD5 | 11af37fb5a130965b4b0ef53bf60a6ee |
| BLAKE2b-256 | a9bbafc81b55cb8f150f4e1f9d2e2c9e7ba16e54afaedea3913c9d3953b7c553 |
Provenance
The following attestation bundles were made for gbfs_toolkit-1.0.1-py3-none-any.whl:
Publisher: release.yml on cycling-data-lab/gbfs-toolkit
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: gbfs_toolkit-1.0.1-py3-none-any.whl
- Subject digest: f8418cb0c72ccb83cedbc7e7b5b08972627cc5bc9f57e35aad8bf1a7da083aee
- Sigstore transparency entry: 1983078797
- Sigstore integration time:
- Permalink: cycling-data-lab/gbfs-toolkit@d226dab1dc9d948f9583b458045bbfd03b4d2805
- Branch / Tag: refs/tags/v1.0.1
- Owner: https://github.com/cycling-data-lab
- Access: public
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: release.yml@d226dab1dc9d948f9583b458045bbfd03b4d2805
- Trigger Event: push