Parse GTFS feeds to/from Parquet via Polars — fast, compact, typed.

Project description

gtfs-parquet

Features

  • Parse GTFS from zip, directory, or URL
  • Write to Parquet (zstd-compressed, sorted for optimal compression) or back to GTFS
  • Strongly typed schemas with optimised dtypes (Float32 coords, Int16 sequences)
  • Built on Polars — zero-copy reads, lazy evaluation ready
  • Operations: calendar expansion, network analysis, route/stop/trip stats, timetable graphs, CSA connections
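
To give a feel for what the narrower dtypes trade away, the sketch below uses only the standard library (not gtfs-parquet or Polars) to round-trip a latitude through 32-bit storage and measure the resulting error. The sample coordinate and the 111,320 m-per-degree conversion are illustrative assumptions.

```python
import struct

def to_float32(value: float) -> float:
    """Round-trip a Python float (64-bit) through 32-bit IEEE 754 storage."""
    return struct.unpack("<f", struct.pack("<f", value))[0]

# Illustrative latitude near Brussels (an assumption, not taken from a feed).
lat = 50.8466
lat32 = to_float32(lat)

# One degree of latitude is roughly 111,320 m.
error_m = abs(lat32 - lat) * 111_320

# Int16 holds values up to 32767, ample for typical stop_sequence values.
INT16_MAX = 2**15 - 1

print(f"float32 latitude error: {error_m:.3f} m")
print(f"largest Int16 value: {INT16_MAX}")
```

For latitudes between 32 and 64 degrees the float32 rounding error stays below about 0.22 m, at half the storage of a 64-bit float. Whether that is acceptable depends on the application.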

Installation

pip install gtfs-parquet

Quick start

from gtfs_parquet import parse_gtfs, write_parquet, read_parquet, write_gtfs

# Parse a GTFS zip (local path or URL)
feed = parse_gtfs("gtfs.zip")

# Write to Parquet — directory, .zip, or .tar (auto-detected by extension)
write_parquet(feed, "output/")           # directory of .parquet files
write_parquet(feed, "output.zip")        # zip archive (no extra compression)
write_parquet(feed, "output.tar")        # tar archive (single file, no extra compression)

# Read back (same formats)
feed = read_parquet("output/")
feed = read_parquet("output.zip")
feed = read_parquet("output.tar")

# Convert back to GTFS zip
write_gtfs(feed, "roundtrip.zip")
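
The "auto-detected by extension" behaviour can be pictured as a simple suffix check. The helper below, `detect_container`, is hypothetical and written only for illustration; it is not part of gtfs-parquet and may differ from how the library actually decides.

```python
from pathlib import Path

def detect_container(path: str) -> str:
    """Classify an output path as 'zip', 'tar', or 'directory' by its suffix.

    Hypothetical helper for illustration; not part of gtfs-parquet.
    """
    suffix = Path(path).suffix.lower()
    if suffix == ".zip":
        return "zip"
    if suffix == ".tar":
        return "tar"
    # Anything without a recognised archive suffix is treated as a directory.
    return "directory"

for target in ("output/", "output.zip", "output.tar"):
    print(target, "->", detect_container(target))
```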

Compression

For the feeds measured below, Parquet output is roughly 43 % to 75 % smaller than the original GTFS zip, thanks to zstd compression, sorted row groups, and optimised column types:

Feed      GTFS zip   Parquet   Saving
STIB      5.5 MB     3.2 MB    42.8 %
TEC       95.2 MB    23.9 MB   75.0 %
De Lijn   195 MB     55.0 MB   71.8 %
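
Why sorting helps can be seen with a small experiment. The sketch below swaps in zlib from the standard library for zstd and a newline-joined text column for a Parquet column, so the absolute sizes say nothing about gtfs-parquet itself; only the direction of the effect carries over. The synthetic trip ids are assumptions.

```python
import random
import zlib

random.seed(0)  # fixed seed so the comparison is repeatable

# Synthetic column: 100 distinct trip ids, each repeated 200 times.
trip_ids = [f"trip_{i:03d}" for i in range(100) for _ in range(200)]

shuffled = trip_ids[:]
random.shuffle(shuffled)

def compressed_size(values: list[str]) -> int:
    """Size in bytes of the newline-joined column after zlib compression."""
    return len(zlib.compress("\n".join(values).encode("utf-8"), level=6))

size_sorted = compressed_size(sorted(trip_ids))
size_shuffled = compressed_size(shuffled)

print(f"sorted:   {size_sorted} bytes")
print(f"shuffled: {size_shuffled} bytes")
```

Sorting places identical values next to each other, so the compressor sees long runs instead of scattered repeats.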

Feed object

Feed is a plain dataclass with one optional polars.DataFrame attribute per GTFS file (e.g. feed.stops, feed.routes, feed.stop_times). Only files present in the source feed are populated.

feed.tables()    # dict of all non-None tables
feed.validate()  # check required files and columns
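
As a rough mental model of the description above, the sketch below defines a hypothetical `FeedSketch` dataclass with a `tables()` method that skips unset attributes. It is not the library's actual class: the attribute list is shortened, and plain Python lists stand in for polars.DataFrame.

```python
from dataclasses import dataclass, fields
from typing import Any, Optional

@dataclass
class FeedSketch:
    """Hypothetical stand-in for Feed; real attributes hold polars.DataFrame."""
    stops: Optional[Any] = None
    routes: Optional[Any] = None
    stop_times: Optional[Any] = None

    def tables(self) -> dict[str, Any]:
        """Return only the tables that were present in the source feed."""
        return {
            f.name: getattr(self, f.name)
            for f in fields(self)
            if getattr(self, f.name) is not None
        }

# Plain lists of dicts stand in for DataFrames here.
feed_sketch = FeedSketch(stops=[{"stop_id": "S1"}], routes=[{"route_id": "R1"}])
print(sorted(feed_sketch.tables()))  # → ['routes', 'stops']
```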

Operations

All operations are standalone functions that take a Feed as their first argument. Import from the ops submodules:

# get_first_week is used below; it is assumed here to live in ops.calendar.
from gtfs_parquet.ops.calendar import (
    get_dates, get_first_week, get_active_services, compute_busiest_date,
)
from gtfs_parquet.ops.trips import get_trips, compute_trip_stats
from gtfs_parquet.ops.routes import get_routes, compute_route_stats
from gtfs_parquet.ops.stops import get_stops, compute_stop_stats
from gtfs_parquet.ops.network import describe, compute_network_stats
from gtfs_parquet.ops.restrict import restrict_to_routes, restrict_to_dates
from gtfs_parquet.ops.clean import clean
from gtfs_parquet.ops.graph import (
    build_timetable_graph, get_service_day_counts, build_stop_lookup,
    compute_segment_frequencies, compute_connections, served_stations,
)

dates = get_dates(feed)
week = get_first_week(feed)
services = get_active_services(feed, dates[0])

trip_stats = compute_trip_stats(feed)
route_stats = compute_route_stats(feed, [dates[0]], trip_stats)
stop_stats = compute_stop_stats(feed, [dates[0]])

# Timetable graph for routing
graph = build_timetable_graph(feed, services, hour_filter=(6, 22))

# CSA-compatible connections
connections = compute_connections(feed, services)

# Segment frequencies weighted by service days
day_counts = get_service_day_counts(feed, dates)
freqs = compute_segment_frequencies(feed, services, service_day_counts=day_counts)
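
For readers new to GTFS calendars, the idea behind finding the services active on a date can be sketched with the standard library alone. The helper `active_services` below is hypothetical and works on plain dicts; it follows the GTFS rules for calendar.txt and calendar_dates.txt, but it is not the library's implementation and its signature differs from get_active_services.

```python
from datetime import date

WEEKDAYS = ("monday", "tuesday", "wednesday", "thursday",
            "friday", "saturday", "sunday")

def active_services(calendar, calendar_dates, day: date) -> set[str]:
    """Service ids running on `day` under GTFS calendar semantics.

    Hypothetical helper; rows are plain dicts rather than DataFrames.
    """
    weekday = WEEKDAYS[day.weekday()]  # date.weekday(): Monday is 0
    active = {
        row["service_id"]
        for row in calendar
        if row["start_date"] <= day <= row["end_date"] and row[weekday] == 1
    }
    for row in calendar_dates:
        if row["date"] != day:
            continue
        if row["exception_type"] == 1:    # service added for this date
            active.add(row["service_id"])
        elif row["exception_type"] == 2:  # service removed for this date
            active.discard(row["service_id"])
    return active

# Made-up rows for illustration only.
calendar = [
    {"service_id": "WEEKDAY", "monday": 1, "tuesday": 1, "wednesday": 1,
     "thursday": 1, "friday": 1, "saturday": 0, "sunday": 0,
     "start_date": date(2024, 1, 1), "end_date": date(2024, 12, 31)},
]
calendar_dates = [
    {"service_id": "WEEKDAY", "date": date(2024, 5, 1), "exception_type": 2},
    {"service_id": "HOLIDAY", "date": date(2024, 5, 1), "exception_type": 1},
]

print(active_services(calendar, calendar_dates, date(2024, 4, 30)))  # a Tuesday
print(active_services(calendar, calendar_dates, date(2024, 5, 1)))   # exceptions apply
```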

The API is inspired by gtfs-kit and re-implemented on Polars, which makes most operations considerably faster (see the benchmark below).
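
The CSA connections mentioned above feed the Connection Scan Algorithm, which answers earliest-arrival queries by scanning connections in departure order. The sketch below is a minimal version under stated assumptions: the tuple layout (dep_stop, arr_stop, dep_time, arr_time, trip_id) is hypothetical and may not match what compute_connections returns, and footpaths and minimum transfer times are left out.

```python
import math

def earliest_arrivals(connections, source, start_time):
    """Earliest arrival time per stop via a basic connection scan.

    Each connection is (dep_stop, arr_stop, dep_time, arr_time, trip_id),
    with times in seconds since midnight. Footpaths and minimum transfer
    times are ignored in this sketch.
    """
    arrival = {source: start_time}
    reached_trips = set()
    # The scan relies on connections being processed in departure order.
    for dep_stop, arr_stop, dep_time, arr_time, trip_id in sorted(
        connections, key=lambda c: c[2]
    ):
        can_board = arrival.get(dep_stop, math.inf) <= dep_time
        if trip_id in reached_trips or can_board:
            reached_trips.add(trip_id)
            if arr_time < arrival.get(arr_stop, math.inf):
                arrival[arr_stop] = arr_time
    return arrival

# Hypothetical connections, not derived from a real feed.
connections = [
    ("A", "B", 100, 200, "t1"),
    ("B", "C", 210, 300, "t1"),
    ("B", "D", 190, 250, "t2"),  # leaves B before we can get there
    ("C", "D", 310, 400, "t3"),
]
print(earliest_arrivals(connections, "A", 90))
# → {'A': 90, 'B': 200, 'C': 300, 'D': 400}
```

A single pass over the sorted connections is enough, which is why a flat, time-ordered table is a convenient input for this kind of routing.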

Performance vs gtfs-kit

Benchmarked on the STIB (Brussels) feed (~5.5 MB, ~9 000 trips):

Operation              gtfs-kit (pandas)   gtfs-parquet (Polars)   Speedup
Load feed              2.97 s              0.40 s                  7x
compute_trip_stats     57.46 s             0.05 s                  1149x
compute_stop_stats     9.67 s              0.19 s                  51x
compute_route_stats    2.12 s              0.08 s                  27x
compute_busiest_date   0.07 s              0.06 s                  1x

Peak process memory: 1020 MB (gtfs-kit) vs 744 MB (gtfs-parquet).

License

MIT

Download files

Source Distribution

gtfs_parquet-0.4.0.tar.gz (41.5 kB)

Uploaded Source

Built Distribution

gtfs_parquet-0.4.0-py3-none-any.whl (32.8 kB)

Uploaded Python 3

File details

Details for the file gtfs_parquet-0.4.0.tar.gz.

File metadata

  • Download URL: gtfs_parquet-0.4.0.tar.gz
  • Size: 41.5 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? Yes
  • Uploaded via: twine/6.1.0 CPython/3.13.12

File hashes

Hashes for gtfs_parquet-0.4.0.tar.gz
Algorithm Hash digest
SHA256 6981668e87c8914e0fc9dbde6e2875291f331429d15879b4f71a2dae8c528e04
MD5 7b7b4070c6aacf7af9a2e3f89be3565a
BLAKE2b-256 5e98825078341eca8d3fd14448a0db006f8a3995349a2c559ffb63a06b5d3ef9

Provenance

The following attestation bundles were made for gtfs_parquet-0.4.0.tar.gz:

Publisher: publish.yml on GaspardMerten/gtfs-parquet

File details

Details for the file gtfs_parquet-0.4.0-py3-none-any.whl.

File metadata

  • Download URL: gtfs_parquet-0.4.0-py3-none-any.whl
  • Size: 32.8 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? Yes
  • Uploaded via: twine/6.1.0 CPython/3.13.12

File hashes

Hashes for gtfs_parquet-0.4.0-py3-none-any.whl
Algorithm Hash digest
SHA256 2be3b8e6f9f07dcb68b50bca510c709976ec3909aa25ef9b1f3b9686f6c05ec2
MD5 4587a0b09ef58614ba18a87c2d261289
BLAKE2b-256 3cf5edeb0350e5caa183a872c3abd657ae581d9419eb345a15c586b50a4f7810

Provenance

The following attestation bundles were made for gtfs_parquet-0.4.0-py3-none-any.whl:

Publisher: publish.yml on GaspardMerten/gtfs-parquet
