Research-grade ingestion and semantic quality audit (A1–A7) for GBFS bike-share feeds

Maintainers

rohanfosse

Project description

gbfs-toolkit

Research-grade ingestion and semantic quality audit for GBFS bike-share feeds.

MobilityData's gbfs-validator checks that a feed is syntactically valid. gbfs-toolkit checks whether it is semantically trustworthy and analysis-ready, using the A1–A7 quality taxonomy of Fossé & Pallares (gbfs-audit-catalogue), and normalises feeds into a stable, version-independent data model you can reuse across studies.

Why

Every bike-share study re-implements the same plumbing — discover feeds, normalise GBFS 1.x/2.x/3.x, and (the hard part) cope with the semantic defects the syntactic validator cannot see: placeholder capacities, phantom docks, transposed coordinates, out-of-perimeter stations. This package consolidates that into one tested interface so the audit is a verdict per station, not a re-run of someone's notebook.

Install

pip install gbfs-toolkit            # from PyPI
pip install -e ".[dev]"            # from a local clone

Core depends only on numpy / scipy / pandas. Network discovery/fetch uses the optional [fetch] extra (requests).

Quick start

import gbfs_toolkit as gb

info, status = gb.load_example()          # bundled sample — no network needed
av = info.gbfs.join_status(status)        # fluent .gbfs accessor (or gb.join_availability)
clean = info.gbfs.drop_flagged()          # audit A1–A7 and keep the trustworthy stations
av.gbfs.occupancy()                       # bikes / (bikes + docks), NaN-safe
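
As an illustration of what "bikes / (bikes + docks), NaN-safe" can mean, here is a minimal sketch in plain Python. It is one reading of the comment above, not the library's implementation; the function below is a hypothetical stand-in that returns NaN when a count is missing or when a station reports neither bikes nor docks.

```python
import math

def occupancy(num_bikes, num_docks):
    """Share of occupied slots: bikes / (bikes + docks).

    Returns NaN instead of raising when either count is missing or when
    the station reports no bikes and no docks (0 / 0).
    """
    if num_bikes is None or num_docks is None:
        return math.nan
    if math.isnan(num_bikes) or math.isnan(num_docks):
        return math.nan
    total = num_bikes + num_docks
    if total == 0:
        return math.nan
    return num_bikes / total

print(occupancy(3, 9))      # 0.25
print(occupancy(0, 0))      # nan
print(occupancy(5, None))   # nan
```

Returning NaN rather than 0 keeps "no information" distinct from "empty station" in later averages.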

From your own feed:

import json

with open("station_information.json", encoding="utf-8") as f:   # close the file handle deterministically
    raw = json.load(f)
stations = gb.to_canonical_station_info(raw, system_id="velib")   # version-independent frame
verdict  = gb.audit_static(stations)                              # A1–A7 per station
clean    = stations[~verdict["flagged"].to_numpy()]              # quality filter in one line
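
To make "version-independent" concrete, the sketch below flattens a station_information payload into rows using a subset of the canonical column names. It is a hypothetical helper (normalise_station_info is not part of the package) and assumes only the GBFS envelope data.stations plus the fact that GBFS 3.x publishes name as a list of localised strings while 1.x/2.x publish a plain string.

```python
def normalise_station_info(raw, system_id):
    """Flatten a station_information payload into version-independent rows."""
    rows = []
    for st in raw["data"]["stations"]:
        name = st.get("name")
        # GBFS 3.x: name is a list of {"text", "language"} entries.
        # GBFS 1.x/2.x: name is a plain string. Keep the first translation.
        if isinstance(name, list):
            name = name[0]["text"] if name else None
        rows.append({
            "system_id": system_id,
            "station_id": str(st["station_id"]),  # guard against numeric IDs
            "name": name,
            "lat": st.get("lat"),
            "lon": st.get("lon"),
            "capacity": st.get("capacity"),       # None when the feed omits it
        })
    return rows

v2 = {"data": {"stations": [
    {"station_id": 12, "name": "Gare", "lat": 48.84, "lon": 2.37, "capacity": 20}]}}
v3 = {"data": {"stations": [
    {"station_id": "12", "name": [{"text": "Gare", "language": "fr"}],
     "lat": 48.84, "lon": 2.37}]}}

print(normalise_station_info(v2, "demo")[0]["name"])       # Gare
print(normalise_station_info(v3, "demo")[0]["capacity"])   # None
```

Both payloads end up with the same keys, which is the property downstream code relies on.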

Every function is also a .gbfs accessor method, and pure (so df.pipe(gb.occupancy) works). gb.show_versions() prints an environment report for bug reports.

Command line (the semantic counterpart to gbfs-validator):

gbfs audit station_information.json --system-id velib --out verdict.csv

The A1–A7 semantic taxonomy

Flag	Rule	Signature	Level
A1	Out-of-domain inclusion	car-sharing advertised as bike-sharing	station
A2	Placeholder capacity	constant non-zero capacity across a whole system	system
A3	Structural over-capacity	free-floating fleet anchors	station
A4	Geospatial error	transposed coords / stations far from neighbours (3σ)	station
A5	Out-of-perimeter	system bounding box > 50,000 km²	system
A6	Zero-capacity dock	≥1% of docked stations declare capacity = 0	system
A7	Null capacity field	≥50% of stations declare capacity = NaN	system

Thresholds match the published catalogue, so verdicts reproduce.
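
The system-level rows of the table can be read as simple share tests. The sketch below applies the A2, A6 and A7 signatures to one system's capacities in plain Python. It is an illustration under stated assumptions, not the package's audit: every entry is assumed to be a docked station, A2 is assumed to need at least two declared capacities, and system_level_flags is a hypothetical name.

```python
import math

def system_level_flags(capacities):
    """Evaluate the A2, A6 and A7 signatures on one system's capacities.

    `capacities` holds one value per station; None or NaN means the feed
    did not declare a capacity.
    """
    n = len(capacities)
    declared = [c for c in capacities
                if c is not None and not math.isnan(c)]
    zeros = sum(1 for c in declared if c == 0)
    return {
        # A2: every declared capacity is the same non-zero value
        "A2": len(declared) > 1 and len(set(declared)) == 1 and declared[0] != 0,
        # A6: at least 1% of stations declare capacity = 0
        "A6": n > 0 and zeros / n >= 0.01,
        # A7: at least 50% of stations declare no capacity
        "A7": n > 0 and (n - len(declared)) / n >= 0.50,
    }

print(system_level_flags([20] * 10))
# {'A2': True, 'A6': False, 'A7': False}
```

A placeholder value such as 20 everywhere trips A2 without touching A6 or A7, which is why the flags are reported separately.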

Canonical data model (the stable contract)

Ingestion is normalised once into version-independent frames; audit and analysis then operate purely on these. Downstream code depends on these column names, never on raw GBFS JSON.

StationInfo: system_id, station_id, name, lat, lon, capacity, station_type, is_virtual_station
StationStatus: system_id, station_id, num_bikes_available, num_docks_available, is_renting, is_returning, last_reported, fetched_at, gbfs_version
VehicleStatus: system_id, vehicle_id, vehicle_type_id, lat, lon, is_reserved, is_disabled, fetched_at, gbfs_version
AuditVerdict: system_id, station_id, A1…A7, flagged, reason

last_reported and fetched_at are tz-aware UTC timestamps (datetime64[ns, UTC]) so feeds from different cities merge unambiguously.
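
A short sketch of why UTC-aware timestamps merge unambiguously, using only the standard library. The helper to_utc is hypothetical; it assumes POSIX seconds for GBFS 1.x/2.x and RFC 3339 strings for GBFS 3.x, and rejects strings without an offset rather than guessing one.

```python
from datetime import datetime, timezone

def to_utc(value):
    """Convert a GBFS timestamp to a timezone-aware UTC datetime.

    Accepts POSIX seconds (GBFS 1.x/2.x) or an RFC 3339 string (GBFS 3.x).
    """
    if isinstance(value, (int, float)):
        return datetime.fromtimestamp(value, tz=timezone.utc)
    # datetime.fromisoformat only accepts a trailing "Z" from Python 3.11 on
    parsed = datetime.fromisoformat(value.replace("Z", "+00:00"))
    if parsed.tzinfo is None:
        raise ValueError(f"timestamp without UTC offset: {value!r}")
    return parsed.astimezone(timezone.utc)

paris = to_utc("2026-06-01T08:00:00+02:00")
montreal = to_utc("2026-06-01T02:00:00-04:00")
print(paris == montreal)   # True: both are 06:00 UTC
```

Once every row carries the same zone, snapshots from different cities can be sorted and joined on time without per-city offsets.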

Daily ergonomics

import gbfs_toolkit as gb

# discover by city (you rarely know the system_id)
cat   = gb.systems_catalog()
paris = gb.filter_catalog(cat, country_code="FR", city="Paris")

feed  = gb.GBFSFeed.from_url(url)
feed.summary()                       # one-glance card: stations, bikes, staleness, version
avail = feed.availability()          # bikes/docks + name/coords/capacity, one frame
avail["state"] = gb.station_state(avail)          # empty / full / disabled / normal
problems = gb.audit_dynamic(avail)                # negative counts, over-capacity, stale
near  = gb.find_nearest_stations(48.85, 2.35, feed.station_information(), k=3)

# many systems at once (threaded), broken feeds isolated as Exceptions
feeds = gb.fetch_multiple(["velib", "bixi", "lyon"], max_workers=5)
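
The idea of "broken feeds isolated as Exceptions" can be sketched with the standard library alone. This is a generic pattern, not the package's code: fetch_all and fake_fetch are hypothetical names, and fake_fetch stands in for a real network call so the example runs offline.

```python
from concurrent.futures import ThreadPoolExecutor

def fetch_all(system_ids, fetch_one, max_workers=5):
    """Run fetch_one(system_id) in a thread pool, one result per system.

    A failure is stored as the exception object instead of being raised,
    so one broken feed cannot abort the whole batch.
    """
    def guarded(system_id):
        try:
            return fetch_one(system_id)
        except Exception as exc:  # isolate the failure to this system
            return exc

    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        results = list(pool.map(guarded, system_ids))  # keeps input order
    return dict(zip(system_ids, results))

def fake_fetch(system_id):
    # Stand-in for a network call; "lyon" simulates a broken feed.
    if system_id == "lyon":
        raise ConnectionError("feed unreachable")
    return {"system_id": system_id, "stations": 3}

feeds = fetch_all(["velib", "bixi", "lyon"], fake_fetch)
ok = {k: v for k, v in feeds.items() if not isinstance(v, Exception)}
print(sorted(ok))   # ['bixi', 'velib']
```

Callers then decide per system whether to retry, log or skip, instead of losing the successful results.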

Longitudinal data lake

Turn a stream of snapshots into an analysis-ready panel. The library owns the formatting / dedup / I/O; your orchestrator (cron, Airflow…) owns the polling loop. Requires the optional [parquet] extra (pyarrow).

import gbfs_toolkit as gb

# in your poller (every N minutes):
gb.append_to_parquet(feed.station_status(), "lake/")   # Hive-partitioned by system_id/date

# in your analysis:
panel = gb.build_availability_panel("lake/", system_id="velib",
                                    start_time="2026-06-01", resample_freq="5min")
flow  = gb.calculate_net_flow(panel)   # Δ bikes/station per poll (observed flow only)

build_availability_panel filters partitions before loading (memory-bounded), de-duplicates redundant polls (same station_id + last_reported), and optionally resamples each station to a fixed cadence.
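
The de-duplication key and the per-poll difference described above can be sketched in a few lines of plain Python. This is an illustration only (dedupe_polls and net_flow are hypothetical helpers operating on lists of dicts, not on the Parquet lake), and the flow it yields is observed flow: a rental and a return between two polls cancel out.

```python
def dedupe_polls(rows):
    """Keep the first row for each (station_id, last_reported) pair."""
    seen = set()
    unique = []
    for row in rows:
        key = (row["station_id"], row["last_reported"])
        if key not in seen:
            seen.add(key)
            unique.append(row)
    return unique

def net_flow(rows):
    """Change in num_bikes_available between consecutive polls of a station."""
    previous = {}
    flows = []
    for row in sorted(rows, key=lambda r: (r["station_id"], r["last_reported"])):
        sid = row["station_id"]
        if sid in previous:
            flows.append((sid, row["last_reported"],
                          row["num_bikes_available"] - previous[sid]))
        previous[sid] = row["num_bikes_available"]
    return flows

polls = [
    {"station_id": "a", "last_reported": 100, "num_bikes_available": 5},
    {"station_id": "a", "last_reported": 100, "num_bikes_available": 5},  # redundant poll
    {"station_id": "a", "last_reported": 400, "num_bikes_available": 3},
    {"station_id": "a", "last_reported": 700, "num_bikes_available": 6},
]
print(net_flow(dedupe_polls(polls)))   # [('a', 400, -2), ('a', 700, 3)]
```

Without the de-duplication step the redundant poll would add a spurious zero-flow interval and distort any per-poll average.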

Station clustering (`[cluster]`)

Three lenses on "which stations belong together" — spatial, topological, behavioural:

gb.cluster_spatial(info, method="hdbscan")          # density zones (projected metres)
gb.cluster_spectral(info, k=6)                       # network/topology groups
gb.cluster_diurnal_profiles(panel, n_clusters=4)    # daily-rhythm typologies ⭐

cluster_diurnal_profiles turns the longitudinal panel into station typologies, e.g. "morning commuter origin" (full at night, empty by day) vs "recreational", from each station's 24-hour occupancy profile (robust to irregular sampling). Options include automatic choice of k by silhouette score, shape clustering (normalize="zscore"), soft GMM assignments, DTW (method="dtw", extra [dtw]), and a weekday/weekend split. label_diurnal_typology then turns clusters into named types. This is the payoff of the data lake.
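
A 24-hour occupancy profile is the input the clustering works on. The sketch below shows one simple way to build it from irregular samples, by averaging within each hour of day. It is an assumption about the general technique, not the package's method; diurnal_profile is a hypothetical name and hours are taken in whatever zone the timestamps carry.

```python
from collections import defaultdict
from datetime import datetime, timezone

def diurnal_profile(samples):
    """Mean occupancy per hour of day (0-23) from (timestamp, occupancy) pairs.

    Hours with no sample stay None rather than being interpolated.
    """
    buckets = defaultdict(list)
    for ts, occ in samples:
        buckets[ts.hour].append(occ)
    return [sum(buckets[h]) / len(buckets[h]) if buckets[h] else None
            for h in range(24)]

samples = [
    (datetime(2026, 6, 1, 3, 10, tzinfo=timezone.utc), 1.0),
    (datetime(2026, 6, 2, 3, 55, tzinfo=timezone.utc), 0.5),
    (datetime(2026, 6, 1, 14, 0, tzinfo=timezone.utc), 0.25),
]
profile = diurnal_profile(samples)
print(profile[3], profile[14], profile[0])   # 0.75 0.25 None
```

Because samples are grouped by hour rather than by position in the series, uneven polling intervals change the weight of an hour but not the shape of the vector.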

Multimodal — bikeshare ↔ transit

import pandas as pd

stops = pd.read_csv("gtfs/stops.txt")               # bring your own GTFS stops
linked = gb.link_transit_stops(info, stops, radius_m=200)
feeders = linked[linked["is_transit_feeder"]]       # first/last-mile docks near rail/bus

Linking is pure spatial proximity on GeoKDTree (no transit API, no schedules) and yields is_transit_feeder, nearest_stop_dist_m and n_transit_within.
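
What the three output fields mean can be shown with a brute-force haversine loop. This sketch deliberately swaps the KD-tree for an O(n × m) scan to stay dependency-free; haversine_m and link_stops are hypothetical helpers, and only the output field names are borrowed from the text above.

```python
import math

EARTH_RADIUS_M = 6_371_000

def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle distance in metres between two WGS84 points."""
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dphi = p2 - p1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_M * math.asin(math.sqrt(a))

def link_stops(stations, stops, radius_m=200):
    """Attach nearest-stop distance and in-radius count to each station."""
    linked = []
    for st in stations:
        dists = [haversine_m(st["lat"], st["lon"], s["lat"], s["lon"]) for s in stops]
        within = sum(d <= radius_m for d in dists)
        linked.append({**st,
                       "nearest_stop_dist_m": min(dists) if dists else math.nan,
                       "n_transit_within": within,
                       "is_transit_feeder": within > 0})
    return linked

stations = [{"station_id": "near", "lat": 48.850, "lon": 2.350},
            {"station_id": "far", "lat": 48.860, "lon": 2.350}]
stops = [{"stop_id": "s1", "lat": 48.851, "lon": 2.350}]
linked = link_stops(stations, stops)
print([row["is_transit_feeder"] for row in linked])   # [True, False]
```

The first station sits roughly 111 m from the stop and the second roughly 1 km away, so only the first falls inside the 200 m radius.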

Station surroundings — what's around each dock (`[osm]`)

# generic "what's nearby" — works for any point dataset (POIs, shops, …)
gb.features_within(info, pois, radius_m=300, category_col="amenity")  # n_within, n_cafe, …

# bring your own OSM frame (fetch it yourself, e.g. osmnx.features_from_point)
# one-shot context: transit feeders + OSM features, in one frame
ctx = gb.station_surroundings(info, transit=stops, osm=osm_gdf, radius_m=300)

The radius summarisation (counts + per-category breakdown + nearest distance) is the durable, tested core; data acquisition is Bring Your Own GeoDataFrame so the library never depends on a live Overpass endpoint. Routing / isochrones stay out of scope (use OSMnx / pandana).

Descriptive stats — the bikeshare `describe()`

gb.system_profile(av)                       # stations, capacity, occupancy, % empty/full/…
gb.compare_systems({"velib": av1, "bixi": av2})   # one comparison row per city
gb.concentration_metrics(info)              # capacity Gini + top-decile hub share (equity)
gb.coverage_stats(info, zones=zones)        # density, nearest-neighbour, Clark–Evans dispersion
gb.availability_stats(panel)                # per-station: occupancy, peak hour, volatility

Standard spatial / inequality algorithms (numpy/scipy only, deterministic):

gb.morans_i(info, "occupancy")              # spatial autocorrelation (+ z-score / p-value)
gb.ripley_k(info, radii=[100, 250, 500])    # multi-scale clustering: L>0 clustered, <0 dispersed
gb.lorenz_curve(info)                       # inequality curve to plot (Gini/Theil in concentration_metrics)

Readable, comparable summaries — strictly descriptive (no OD/trip inference). system_profile is a one-glance numeric card of a snapshot; concentration_metrics is an equity lens (kept outside the published A1–A7 audit, since it's a metric not a quality verdict); availability_stats turns a longitudinal panel into per-station scalars (pass a target_tz panel for local-time peaks).
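
For readers unfamiliar with the capacity Gini mentioned above, here is the standard rank-weighted formula in plain Python. It is a textbook computation given as a sketch, not the package's code, and gini is a hypothetical name.

```python
def gini(values):
    """Gini coefficient of non-negative values (0 = equal, near 1 = concentrated)."""
    xs = sorted(values)
    n = len(xs)
    total = sum(xs)
    if n == 0 or total == 0:
        return float("nan")
    # Rank-weighted form: G = 2 * sum(i * x_i) / (n * sum(x)) - (n + 1) / n
    weighted = sum(i * x for i, x in enumerate(xs, start=1))
    return 2 * weighted / (n * total) - (n + 1) / n

print(gini([20, 20, 20, 20]))   # 0.0
print(gini([0, 0, 0, 80]))      # 0.75
```

With n stations the maximum is (n - 1) / n, so four stations with all capacity at one hub give 0.75 rather than 1.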

Fleet reconciliation — where are the bikes, really?

tally = gb.reconcile_fleet_state(status, vehicles)   # or feed.reconcile_fleet()
tally["total_deployed"]        # on the street: stations + free-floating, overlap excluded
tally["total_rentable"]        # available in stations + available free-floating
tally["double_count_avoided"]  # vehicles a naive sum would have counted twice

GBFS reports the same fleet twice — aggregate docked counts in station_status and individual units (some parked at stations) in vehicle_status. Naively adding them double-counts every vehicle sitting at a dock. The reconciler excludes station-parked vehicles from the deployed total and surfaces the overlap instead of hiding it.
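
The double-counting problem can be illustrated with a toy tally. The sketch below is not the reconciler itself: it assumes docked vehicles in vehicle_status carry a station_id, it counts only num_bikes_available on the station side, and reconcile is a hypothetical name that reuses two of the key names shown above for readability.

```python
def reconcile(station_status, vehicle_status):
    """Count the fleet once, even when docked vehicles appear in both feeds."""
    docked = sum(s["num_bikes_available"] for s in station_status)
    # Vehicles that declare a station_id are already in the docked counts.
    at_station = [v for v in vehicle_status if v.get("station_id")]
    free_floating = [v for v in vehicle_status if not v.get("station_id")]
    return {
        "docked": docked,
        "free_floating": len(free_floating),
        "total_deployed": docked + len(free_floating),
        "double_count_avoided": len(at_station),
        "naive_total": docked + len(vehicle_status),
    }

station_status = [{"station_id": "s1", "num_bikes_available": 4},
                  {"station_id": "s2", "num_bikes_available": 6}]
vehicle_status = [{"vehicle_id": "v1", "station_id": "s1"},
                  {"vehicle_id": "v2", "station_id": "s1"},
                  {"vehicle_id": "v3", "station_id": "s2"},
                  {"vehicle_id": "v4"},
                  {"vehicle_id": "v5", "station_id": None}]
tally = reconcile(station_status, vehicle_status)
print(tally["total_deployed"], tally["naive_total"])   # 12 15
```

The gap between the two totals equals double_count_avoided, which is the number worth reporting alongside any fleet size.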

Geofencing / service areas (`[geo]`)

zones = gb.to_canonical_geofencing(raw, system_id="lime")  # GeoDataFrame of operator polygons
tagged = gb.zones_for_points(info, zones)                   # which zone each station sits in
density = len(info) / gb.zone_area_km2(zones).sum()         # stations per km² of *real* service area
no_park = tagged[tagged["station_parking"] == False]        # stations in park-restricted zones

For free-floating / hybrid systems the real footprint is the operator's polygons, not a convex hull of stations. to_canonical_geofencing parses geofencing_zones.json (v2.x ride_allowed and v3.x ride_start/ride_end_allowed reconciled), zones_for_points is the point-in-zone spatial join, and zone_area_km2 reprojects to an equal-area CRS so density is metric and latitude-comparable. The full per-vehicle-type rules list is preserved.
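
The point-in-zone join reduces to a point-in-polygon test. Below is a minimal ray-casting sketch in plain Python, treating longitude and latitude as planar coordinates, which is a reasonable approximation for city-sized zones but not what a geospatial library would do. point_in_ring and zones_for_point are hypothetical helpers, and each zone is simplified to a single closed outer ring.

```python
def point_in_ring(lon, lat, ring):
    """Ray-casting test: is (lon, lat) inside a closed ring of (lon, lat) pairs?"""
    inside = False
    for (x1, y1), (x2, y2) in zip(ring, ring[1:]):
        crosses = (y1 > lat) != (y2 > lat)   # edge straddles the point's latitude
        if crosses and lon < x1 + (lat - y1) * (x2 - x1) / (y2 - y1):
            inside = not inside              # one more crossing to the east
    return inside

def zones_for_point(lon, lat, zones):
    """Names of all zones whose outer ring contains the point."""
    return [name for name, ring in zones.items() if point_in_ring(lon, lat, ring)]

zones = {
    "centre": [(2.30, 48.80), (2.40, 48.80), (2.40, 48.90),
               (2.30, 48.90), (2.30, 48.80)],   # closed: first == last
}
print(zones_for_point(2.35, 48.85, zones))   # ['centre']
print(zones_for_point(2.45, 48.85, zones))   # []
```

An odd number of edge crossings to the east of the point means it lies inside; holes and multipolygons would need the same test applied per ring.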

Polite scraping & provenance (research-grade)

session = gb.build_session()                 # pooled, retry/backoff on 429/5xx (default in fetch_multiple)
resp = gb.fetch_feed_json(url, etag=prev_etag)   # conditional GET; raises GBFSNotModified on HTTP 304
...
gb.coverage_report(panel, expected_freq="5min")  # per-station uptime / longest gap (no imputation)
gb.generate_manifest("lake/")                # SHA-256 per partition + summary → cite on Zenodo

Built for scrapers that run for months: retries/backoff, conditional GETs (skip unchanged snapshots), an offline catalogue cache, a GBFSError exception hierarchy, and provenance tools so a dataset is citable and verifiable. Missing data stays missing — coverage_report quantifies it rather than imputing.
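
The provenance idea of one SHA-256 digest per file can be sketched with hashlib. This is a generic illustration, not the output format of generate_manifest; sha256_manifest is a hypothetical helper that hashes every file under a directory in chunks so large partitions do not need to fit in memory.

```python
import hashlib
from pathlib import Path

def sha256_manifest(root):
    """Map each file under `root` (relative POSIX path) to its SHA-256 digest."""
    root = Path(root)
    manifest = {}
    for path in sorted(p for p in root.rglob("*") if p.is_file()):
        digest = hashlib.sha256()
        with path.open("rb") as fh:
            # Read in 1 MiB chunks to keep memory use flat.
            for chunk in iter(lambda: fh.read(1 << 20), b""):
                digest.update(chunk)
        manifest[path.relative_to(root).as_posix()] = digest.hexdigest()
    return manifest
```

Calling sha256_manifest("lake/") would return a dict keyed by partition path; publishing that dict next to the dataset lets a reader confirm they analysed the same bytes.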

Examples

Runnable, end-to-end scripts live in examples/ — auditing an unknown feed, cron-driven collection into a Parquet lake, longitudinal analysis (coverage, typologies, turnover), and a network equity/coverage report.

Roadmap

v0.1 — canonical model, catalogue discovery, cross-version normalisation, static audit (A1–A7), CLI.
v0.2 — fetch/scrape (GBFSFeed, one-liners, fetch_multiple), dynamic audit (D1–D3), station_state, geo (GeoKDTree, find_nearest_stations), schema hardening.
v0.3 (this) — longitudinal data lake: append_to_parquet, build_availability_panel, calculate_net_flow.
v0.4 — cluster (spatial / spectral / diurnal profiles + named typologies).
v0.5 — multimodal (bikeshare ↔ transit feeders, BYOG GTFS).
v0.6 — osm / surroundings: features_within, station_surroundings, enrich_with_osm (BYOG infrastructure enrichment within a radius).
v0.7 — hardening (nullable dtypes, dockless-aware A7, antimeridian A5, mass-conservation net flow) + geofencing (service-area polygons, point-in-zone joins, equal-area density), fleet reconciliation (docked ↔ free-floating dedup), and parquet column/predicate pushdown for large panels.

Methodology & limitations

METHODOLOGY.md documents the A1–A7 thresholds, the dynamic checks, the polling/aliasing limit on flows, and what the spatial statistics can and cannot claim — read it before building a study on the toolkit.

How to cite

See CITATION.cff. The semantic taxonomy is from the gbfs-audit-catalogue dataset paper (Fossé & Pallares, 2026).

License

MIT. Affiliated with CESI LINEACT (EA 7527), Montpellier, France.

Release history

1.3.0 (Jun 27, 2026)
1.2.0 (Jun 27, 2026)
1.1.0 (Jun 27, 2026)
1.0.1 (Jun 27, 2026, this version)
1.0.0 (Jun 27, 2026)
1.0.0rc1 (Jun 27, 2026, pre-release)

Download files

Source Distribution

gbfs_toolkit-1.0.1.tar.gz (91.4 kB)

Uploaded Jun 27, 2026 Source

Built Distribution

gbfs_toolkit-1.0.1-py3-none-any.whl (74.7 kB)

Uploaded Jun 27, 2026 Python 3

File details

Details for the file gbfs_toolkit-1.0.1.tar.gz.

File metadata

Download URL: gbfs_toolkit-1.0.1.tar.gz
Upload date: Jun 27, 2026
Size: 91.4 kB
Tags: Source
Uploaded using Trusted Publishing? Yes
Uploaded via: twine/6.1.0 CPython/3.13.12

File hashes

Hashes for gbfs_toolkit-1.0.1.tar.gz
Algorithm	Hash digest
SHA256	`ac62c648e050d159d9adb94507d6a775cc714441261a07c1c16b4a8e225f0183`
MD5	`6aa27e67d770fcfee946f27dee72f443`
BLAKE2b-256	`7598e3e81958ae4932e214f4104fa83700e561a97daeef256b04bc65778242d4`

Provenance

The following attestation bundles were made for gbfs_toolkit-1.0.1.tar.gz:

Publisher: release.yml on cycling-data-lab/gbfs-toolkit

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

Statement:
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: gbfs_toolkit-1.0.1.tar.gz
- Subject digest: ac62c648e050d159d9adb94507d6a775cc714441261a07c1c16b4a8e225f0183
- Sigstore transparency entry: 1983078674
- Sigstore integration time: Jun 27, 2026
Source repository:
- Permalink: cycling-data-lab/gbfs-toolkit@d226dab1dc9d948f9583b458045bbfd03b4d2805
- Branch / Tag: refs/tags/v1.0.1
- Owner: https://github.com/cycling-data-lab
- Access: public
Publication detail:
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: release.yml@d226dab1dc9d948f9583b458045bbfd03b4d2805
- Trigger Event: push

File details

Details for the file gbfs_toolkit-1.0.1-py3-none-any.whl.

File metadata

Download URL: gbfs_toolkit-1.0.1-py3-none-any.whl
Upload date: Jun 27, 2026
Size: 74.7 kB
Tags: Python 3
Uploaded using Trusted Publishing? Yes
Uploaded via: twine/6.1.0 CPython/3.13.12

File hashes

Hashes for gbfs_toolkit-1.0.1-py3-none-any.whl
Algorithm	Hash digest
SHA256	`f8418cb0c72ccb83cedbc7e7b5b08972627cc5bc9f57e35aad8bf1a7da083aee`
MD5	`11af37fb5a130965b4b0ef53bf60a6ee`
BLAKE2b-256	`a9bbafc81b55cb8f150f4e1f9d2e2c9e7ba16e54afaedea3913c9d3953b7c553`

Provenance

The following attestation bundles were made for gbfs_toolkit-1.0.1-py3-none-any.whl:

Publisher: release.yml on cycling-data-lab/gbfs-toolkit

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

Statement:
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: gbfs_toolkit-1.0.1-py3-none-any.whl
- Subject digest: f8418cb0c72ccb83cedbc7e7b5b08972627cc5bc9f57e35aad8bf1a7da083aee
- Sigstore transparency entry: 1983078797
- Sigstore integration time: Jun 27, 2026
Source repository:
- Permalink: cycling-data-lab/gbfs-toolkit@d226dab1dc9d948f9583b458045bbfd03b4d2805
- Branch / Tag: refs/tags/v1.0.1
- Owner: https://github.com/cycling-data-lab
- Access: public
Publication detail:
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: release.yml@d226dab1dc9d948f9583b458045bbfd03b4d2805
- Trigger Event: push
