gtfs-parquet
Parse GTFS feeds to/from Parquet via Polars — fast, compact, typed.
Features
- Parse GTFS from zip, directory, or URL
- Write to Parquet (zstd-compressed, sorted for optimal compression) or back to GTFS
- Strongly typed schemas with optimised dtypes (Float32 coords, Int16 sequences)
- Built on Polars — zero-copy reads, lazy evaluation ready
- Operations: calendar expansion, network analysis, route/stop/trip stats, timetable graphs, CSA connections
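As a rough sanity check on the narrower dtypes listed above, the sketch below estimates how far a coordinate moves when it is rounded to Float32, and notes the largest value an Int16 column can hold. It uses only the standard library. The sample coordinate and the metres-per-degree constant are illustrative assumptions, not values taken from the library.

```python
import math
import struct


def to_float32(value: float) -> float:
    """Round a 64-bit Python float to the nearest 32-bit float."""
    return struct.unpack("<f", struct.pack("<f", value))[0]


def float32_error_metres(lat: float, lon: float) -> float:
    """Approximate ground distance between a point and its Float32 rounding."""
    metres_per_degree = 111_320.0  # rough length of one degree of latitude (assumption)
    dlat = (to_float32(lat) - lat) * metres_per_degree
    dlon = (to_float32(lon) - lon) * metres_per_degree * math.cos(math.radians(lat))
    return math.hypot(dlat, dlon)


# Illustrative point in central Brussels (assumption, not from any feed)
error = float32_error_metres(50.846557, 4.351697)
print(f"Float32 rounding error: {error:.3f} m")

# Largest value representable in a signed 16-bit integer column
INT16_MAX = 2**15 - 1
print(f"Largest Int16 sequence value: {INT16_MAX}")
```

At these latitudes the rounding error stays well under a metre, which is why Float32 is usually considered adequate for stop coordinates.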
Installation
pip install gtfs-parquet
Quick start
from gtfs_parquet import parse_gtfs, write_parquet, read_parquet, write_gtfs
# Parse a GTFS zip (local path or URL)
feed = parse_gtfs("gtfs.zip")
# Write to Parquet — directory, .zip, or .tar (auto-detected by extension)
write_parquet(feed, "output/") # directory of .parquet files
write_parquet(feed, "output.zip") # zip archive (no extra compression)
write_parquet(feed, "output.tar") # tar archive (single file, no extra compression)
# Read back (same formats)
feed = read_parquet("output/")
feed = read_parquet("output.zip")
feed = read_parquet("output.tar")
# Convert back to GTFS zip
write_gtfs(feed, "roundtrip.zip")
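To try the calls above without downloading a real feed, a tiny GTFS zip can be built with the standard library. This is a minimal sketch: the rows are invented sample data, only a subset of GTFS columns is filled in, and whether `feed.validate()` accepts such a small feed has not been checked here.

```python
import csv
import io
import os
import tempfile
import zipfile

# Hypothetical sample rows; file and column names follow the GTFS Schedule spec.
MINIMAL_FEED = {
    "agency.txt": [
        ["agency_id", "agency_name", "agency_url", "agency_timezone"],
        ["A1", "Demo Transit", "https://example.com", "Europe/Brussels"],
    ],
    "stops.txt": [
        ["stop_id", "stop_name", "stop_lat", "stop_lon"],
        ["S1", "First Stop", "50.8466", "4.3517"],
        ["S2", "Second Stop", "50.8503", "4.3600"],
    ],
    "routes.txt": [
        ["route_id", "agency_id", "route_short_name", "route_type"],
        ["R1", "A1", "1", "3"],
    ],
    "trips.txt": [
        ["route_id", "service_id", "trip_id"],
        ["R1", "WK", "T1"],
    ],
    "stop_times.txt": [
        ["trip_id", "arrival_time", "departure_time", "stop_id", "stop_sequence"],
        ["T1", "08:00:00", "08:00:00", "S1", "1"],
        ["T1", "08:10:00", "08:10:00", "S2", "2"],
    ],
    "calendar.txt": [
        ["service_id", "monday", "tuesday", "wednesday", "thursday", "friday",
         "saturday", "sunday", "start_date", "end_date"],
        ["WK", "1", "1", "1", "1", "1", "0", "0", "20250101", "20251231"],
    ],
}


def write_minimal_gtfs(path: str) -> None:
    """Write each table as a CSV member of a zip archive."""
    with zipfile.ZipFile(path, "w", compression=zipfile.ZIP_DEFLATED) as archive:
        for name, rows in MINIMAL_FEED.items():
            buffer = io.StringIO()
            csv.writer(buffer, lineterminator="\n").writerows(rows)
            archive.writestr(name, buffer.getvalue())


out_path = os.path.join(tempfile.mkdtemp(), "gtfs.zip")
write_minimal_gtfs(out_path)
print(out_path)
```

The printed path can then be passed to `parse_gtfs` in place of `"gtfs.zip"`.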
Compression
Parquet output is roughly 43 % to 75 % smaller than the original GTFS zip for the feeds below, thanks to zstd compression, sorted row groups, and optimised column types:
| Feed | GTFS zip | Parquet | Saving |
|---|---|---|---|
| STIB | 5.5 MB | 3.2 MB | 42.8 % |
| TEC | 95.2 MB | 23.9 MB | 75.0 % |
| De Lijn | 195 MB | 55.0 MB | 71.8 % |
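The effect of sorting on compression can be illustrated with the standard library alone. The sketch below is not the library's code path: it uses `zlib` as a stand-in for zstd, CSV-like text instead of Parquet columns, and synthetic rows. It only shows that the same rows compress better when similar values sit next to each other.

```python
import random
import zlib

# Synthetic stop_times-like rows, generated in (trip, sequence) order
rows = [
    f"trip_{trip:04d},{seq},stop_{(trip * 7 + seq) % 500:03d}\n"
    for trip in range(200)
    for seq in range(1, 31)
]

# Same rows in a random order; the seed is fixed so the result is repeatable
shuffled = rows[:]
random.Random(0).shuffle(shuffled)

sorted_size = len(zlib.compress("".join(rows).encode()))
shuffled_size = len(zlib.compress("".join(shuffled).encode()))

print(f"sorted:   {sorted_size} bytes")
print(f"shuffled: {shuffled_size} bytes")
```

In the sorted order each row repeats most of the previous one, so the compressor finds long matches at short distances.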
Feed object
Feed is a plain dataclass with one optional polars.DataFrame attribute per GTFS
file (e.g. feed.stops, feed.routes, feed.stop_times). Only files present
in the source feed are populated.
feed.tables() # dict of all non-None tables
feed.validate() # check required files and columns
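The shape described above (one optional attribute per file, plus a helper that returns only the populated ones) can be sketched with a plain dataclass. `MiniFeed` is a hypothetical stand-in, not the library's `Feed`: it holds lists of dicts instead of `polars.DataFrame` objects and covers only three files.

```python
from dataclasses import dataclass, fields
from typing import Any, Optional

Table = list[dict[str, Any]]  # placeholder for a DataFrame in this sketch


@dataclass
class MiniFeed:
    """Hypothetical stand-in for Feed with one optional attribute per GTFS file."""

    stops: Optional[Table] = None
    routes: Optional[Table] = None
    stop_times: Optional[Table] = None

    def tables(self) -> dict[str, Table]:
        """Return only the tables that were populated."""
        return {
            f.name: getattr(self, f.name)
            for f in fields(self)
            if getattr(self, f.name) is not None
        }


mini = MiniFeed(stops=[{"stop_id": "S1"}])
print(list(mini.tables()))
```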
Operations
All operations are standalone functions that take a Feed as their first argument.
Import from the ops submodules:
# get_first_week is used below; importing it from ops.calendar is an assumption
from gtfs_parquet.ops.calendar import get_dates, get_first_week, get_active_services, compute_busiest_date
from gtfs_parquet.ops.trips import get_trips, compute_trip_stats
from gtfs_parquet.ops.routes import get_routes, compute_route_stats
from gtfs_parquet.ops.stops import get_stops, compute_stop_stats
from gtfs_parquet.ops.network import describe, compute_network_stats
from gtfs_parquet.ops.restrict import restrict_to_routes, restrict_to_dates
from gtfs_parquet.ops.clean import clean
from gtfs_parquet.ops.graph import (
    build_timetable_graph, get_service_day_counts, build_stop_lookup,
    compute_segment_frequencies, compute_connections, served_stations,
)
dates = get_dates(feed)
week = get_first_week(feed)
services = get_active_services(feed, dates[0])
trip_stats = compute_trip_stats(feed)
route_stats = compute_route_stats(feed, [dates[0]], trip_stats)
stop_stats = compute_stop_stats(feed, [dates[0]])
# Timetable graph for routing
graph = build_timetable_graph(feed, services, hour_filter=(6, 22))
# CSA-compatible connections
connections = compute_connections(feed, services)
# Segment frequencies weighted by service days
day_counts = get_service_day_counts(feed, dates)
freqs = compute_segment_frequencies(feed, services, service_day_counts=day_counts)
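For readers new to GTFS calendars, the idea behind resolving which services run on a given date can be sketched as follows. This is a simplified illustration over plain dicts, following the GTFS rules for `calendar.txt` and `calendar_dates.txt`. It is not the implementation of `get_active_services`, and the sample rows are invented.

```python
from datetime import date

WEEKDAYS = ["monday", "tuesday", "wednesday", "thursday",
            "friday", "saturday", "sunday"]


def active_services(calendar: list[dict], calendar_dates: list[dict], day: date) -> set[str]:
    """Return the service_ids running on `day`."""
    key = day.strftime("%Y%m%d")  # GTFS dates are YYYYMMDD strings, so they sort as text
    weekday = WEEKDAYS[day.weekday()]  # date.weekday(): Monday is 0

    # Regular pattern: weekday flag set and date inside the validity range
    active = {
        row["service_id"]
        for row in calendar
        if row[weekday] == "1" and row["start_date"] <= key <= row["end_date"]
    }

    # Exceptions: type 1 adds a service on that date, type 2 removes it
    for row in calendar_dates:
        if row["date"] != key:
            continue
        if row["exception_type"] == "1":
            active.add(row["service_id"])
        elif row["exception_type"] == "2":
            active.discard(row["service_id"])
    return active


# Invented sample data: a weekday service, replaced by a holiday service on 1 May 2025
calendar = [{
    "service_id": "WK", "monday": "1", "tuesday": "1", "wednesday": "1",
    "thursday": "1", "friday": "1", "saturday": "0", "sunday": "0",
    "start_date": "20250101", "end_date": "20251231",
}]
calendar_dates = [
    {"service_id": "WK", "date": "20250501", "exception_type": "2"},
    {"service_id": "HOL", "date": "20250501", "exception_type": "1"},
]

print(active_services(calendar, calendar_dates, date(2025, 5, 1)))
```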
The API is inspired by gtfs-kit, re-implemented on Polars for significantly better performance.
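"CSA" refers to the Connection Scan Algorithm, which answers earliest-arrival queries with a single pass over connections sorted by departure time. The sketch below shows that scan in its simplest form, without transfer times or footpaths. The `Connection` fields are hypothetical and may not match the columns returned by `compute_connections`.

```python
from typing import NamedTuple


class Connection(NamedTuple):
    """One vehicle hop between consecutive stops (hypothetical field names)."""

    dep_stop: str
    arr_stop: str
    dep_time: int  # seconds since midnight
    arr_time: int
    trip_id: str


def earliest_arrivals(connections: list[Connection], source: str, start_time: int) -> dict[str, int]:
    """Scan connections in departure order and relax arrival times."""
    arrival = {source: start_time}
    boarded: set[str] = set()  # trips the traveller is already able to ride
    for c in sorted(connections, key=lambda c: c.dep_time):
        reachable = arrival.get(c.dep_stop, float("inf")) <= c.dep_time
        if c.trip_id in boarded or reachable:
            boarded.add(c.trip_id)
            if c.arr_time < arrival.get(c.arr_stop, float("inf")):
                arrival[c.arr_stop] = c.arr_time
    return arrival


# Invented timetable: T2 leaves B before T1 gets there, so D is reached via C
connections = [
    Connection("A", "B", 28800, 29400, "T1"),
    Connection("B", "C", 29400, 30000, "T1"),
    Connection("B", "D", 29300, 29900, "T2"),
    Connection("C", "D", 30100, 30600, "T3"),
]
result = earliest_arrivals(connections, "A", 28700)
print(result)
```

Because the scan never looks back, the input only needs to be sorted once, which is why a flat connection table is a convenient output format.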
Performance vs gtfs-kit
Benchmarked on the STIB (Brussels) feed (~5.5 MB, ~9 000 trips):
| Operation | gtfs-kit (pandas) | gtfs-parquet (Polars) | Speedup |
|---|---|---|---|
| Load feed | 2.97 s | 0.40 s | 7x |
| compute_trip_stats | 57.46 s | 0.05 s | 1149x |
| compute_stop_stats | 9.67 s | 0.19 s | 51x |
| compute_route_stats | 2.12 s | 0.08 s | 27x |
| compute_busiest_date | 0.07 s | 0.06 s | 1x |
Peak process memory: 1020 MB (gtfs-kit) vs 744 MB (gtfs-parquet).
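The benchmark script itself is not shown here. A comparison of this kind could be reproduced with a small timing helper such as the one below, which is a sketch under assumptions: `best_of` and `speedup` are hypothetical names, and the workload is a placeholder rather than a real gtfs-kit or gtfs-parquet call.

```python
import time
from typing import Callable


def best_of(func: Callable[[], object], repeats: int = 3) -> float:
    """Run `func` several times and return the fastest wall-clock time in seconds."""
    timings = []
    for _ in range(repeats):
        start = time.perf_counter()
        func()
        timings.append(time.perf_counter() - start)
    return min(timings)


def speedup(baseline_s: float, candidate_s: float) -> float:
    """How many times faster the candidate is than the baseline."""
    return baseline_s / candidate_s


# Placeholder workload; swap in the two calls being compared
elapsed = best_of(lambda: sum(range(100_000)))
print(f"placeholder workload: {elapsed:.6f} s")

# Recomputing one row of the table above from its published timings
print(f"compute_trip_stats speedup: {speedup(57.46, 0.05):.0f}x")
```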