Skip to main content

Pyramid (Geo)Parquet Generator

Project description

Yosegi - Pyramid (Geo)Parquet Generator

Yosegi is a tool to generate Pyramid (Geo)Parquet files - optimized for streaming large geospatial datasets.

Usage

pip install yosegi
yosegi input.parquet output.pyramid.parquet # GeoParquet
yosegi input.shp output.pyramid.parquet # you can use GDAL/OGR formats
# yosegi -h

Yosegi: Pyramid Parquet Generator

positional arguments:
  input_file            Path to the input file
  output_file           Path to the output file

options:
  -h, --help            show this help message and exit
  --minzoom MINZOOM     Minimum zoom level (default: 0)
  --maxzoom MAXZOOM     Maximum zoom level (default: 16)
  --resolution-base RESOLUTION_BASE
                        Base resolution (default: 2.5)
  --resolution-multiplier RESOLUTION_MULTIPLIER
                        Resolution multiplier (default: 2.0)
  --geometry-column GEOMETRY_COLUMN
                        Geometry column name (auto-detected if not specified)
  --parquet-row-group-size PARQUET_ROW_GROUP_SIZE
                        Parquet row group size (default: 10240)
  --sort-by SORT_BY     Sort key for feature thinning
                        (default: ST_Area DESC for polygons,
                                  ST_Length DESC for lines,
                                  hash(_uid) otherwise)
  --min-visible-size-factor MIN_VISIBLE_SIZE_FACTOR
                        Skip features at low zoom whose bbox is smaller than
                        the grid resolution times this factor (default: 1.0,
                        0 to disable)

Overview of Pyramid (Geo)Parquet

Concept

  • Pre-calculate which features are visible at each zoomlevel (density-based thinning).
  • Emit a per-row bbox covering column referenced via GeoParquet 1.1 covering metadata so spatial readers can prune row groups via column statistics.
  • Sort the output so that:
    • Each row group contains rows from exactly one zoomlevel (zoom-aligned RGs → tight zoomlevel stats).
    • Within each zoomlevel, rows follow Sort-Tile-Recursive (STR) bbox packing: bbox-center-x is divided into ~√(N/M) strips, then bbox-center-y inside each strip is ordered with boustrophedon (alternating direction). This gives tight per-RG bbox stats so that bbox-covering predicates prune effectively.

By these steps, generate Pyramid-structure in a single Parquet file, just like a GeoTIFF pyramid.

  • Like GeoTIFF, an overview of the entire dataset can be obtained quickly.
  • Unlike GeoTIFF, lower-resolution data is not repeated because this is vector.

QGIS: read Pyramid parquet on Amazon S3. Blue to Red means zoomlevel. Data: OvertureMaps

https://github.com/user-attachments/assets/4df86816-559d-4b34-b57a-2f3d4b8bd14c

QGIS: loading raw parquet (sorted only spatially)

https://github.com/user-attachments/assets/4e7a61f2-eb78-4658-a55f-8de31e2796c9

Well sorted spatially but it takes too much time to obtain an overview of the entire dataset.

Browser (DeckGL + DuckDB): load that parquet and render with GeoArrowScatterPlotLayer

https://github.com/user-attachments/assets/26e2f662-474b-4d11-ab56-f73587ef8b2e

Table Structure

Original input columns are preserved. Yosegi appends two columns:

Column Type Purpose
bbox STRUCT<xmin, ymin, xmax, ymax: DOUBLE> GeoParquet 1.1 covering bbox per feature; enables row-group pruning
zoomlevel INT32 Lowest zoom at which the feature is visible (pyramid level)

The geometry column is rewritten as WKB. GeoParquet covering.bbox metadata points to the bbox STRUCT child fields so that compliant readers automatically use it for spatial pushdown.

If the input already has a column named bbox or zoomlevel, the input column is dropped and overwritten by yosegi's value (a notice is logged).

┌──────────────────────┬──────────────────────┬───┬──────────────────────────────────────┬───────────┐
│          id          │       geometry       │ … │                 bbox                 │ zoomlevel │
│       varchar        │         blob         │   │ struct(xmin double, ymin double, …)  │   int32   │
├──────────────────────┼──────────────────────┼───┼──────────────────────────────────────┼───────────┤
│ 5e1da825-ef9b-45dd…  │ <wkb POINT>          │ … │ {xmin: 122.30, ymin: 45.40, xmax: …} │         0 │
│ 8eb4aa8c-81fb-4a9b…  │ <wkb POINT>          │ … │ {xmin: 122.29, ymin: 45.41, xmax: …} │         0 │
│ … (zoom 0 features) │ …                    │ … │ …                                    │         0 │
│ … (zoom 1 features) │ …                    │ … │ …                                    │         1 │
│ …                    │ …                    │ … │ …                                    │         · │
│ … (zoom 16 features)│ …                    │ … │ …                                    │        16 │
└──────────────────────┴──────────────────────┴───┴──────────────────────────────────────┴───────────┘

Querying tiles

Use the bbox covering column for spatial pushdown plus zoomlevel for pyramid pruning:

-- DuckDB: features visible at z<=10 inside a tile bbox
SELECT * FROM 'example.pyramid.parquet'
WHERE zoomlevel <= 10
  AND bbox.xmax >= :tile_minx AND bbox.xmin <= :tile_maxx
  AND bbox.ymax >= :tile_miny AND bbox.ymin <= :tile_maxy;

DuckDB's parquet reader prunes row groups using the bbox.* and zoomlevel column statistics, so only the matching RGs are downloaded over HTTP.

Note: geometry && ST_MakeEnvelope(...) returns the same rows but does not push down to row-group statistics, so it reads geometry pages for every candidate RG. Prefer the explicit bbox.* predicates above.

Physical layout: STR-pack

Within each zoomlevel, yosegi orders rows with Sort-Tile-Recursive (STR) bbox packing. The goal is to keep each row group's bbox as compact as possible so that bbox-covering predicates prune effectively.

The algorithm

Let N be the number of rows in this zoom and M the row group size (--parquet-row-group-size).

  1. Strip by bbox-center-x. Sort all N rows by (xmin + xmax) / 2 and chunk them into strips of strip_size = M × round(√(N/M)) rows. There are roughly √(N/M) strips, each containing roughly √(N/M) row groups.
  2. Sort by bbox-center-y inside each strip, with boustrophedon. Even-numbered strips are sorted ascending in y, odd-numbered strips descending. Consecutive features at strip boundaries stay close in y.

The SQL pattern (simplified):

ORDER BY
  FLOOR((ROW_NUMBER() OVER (ORDER BY xc) - 1) / strip_size),
  CASE WHEN strip_id % 2 = 0 THEN  yc ELSE -yc END

Strip boundaries align with RG boundaries

strip_size is always a multiple of M, so when pyarrow writes one row group every M rows, every spatial discontinuity (jumping from one strip's x-range to the next) lands on a row group boundary. No row group ever spans two strips. Per-RG bbox extent is bounded by:

  • x extent ≤ one strip width ≈ total x range / √(N/M)
  • y extent ≤ one tile height ≈ total y range / √(N/M)

That is, each RG covers a near-square tile of area ≈ M / N of the data extent — within a constant factor of the theoretical optimum.

file order →
┌──── strip 0 (low x) ────┐┌──── strip 1 (mid x) ────┐┌──── strip 2 ────
│  RG0  RG1  RG2 ... RG13 ││  RG14  RG15 ... RG27    ││  ...
│   ↑                      ││   ↓                      ││   ↑
│   y low → → → → → y high ││   y high → → → → → y low ││   y low → ...
└──────────────────────────┘└──────────────────────────┘└─────────────
                            ★ strip boundary == RG boundary; x jumps,
                              y stays at high (boustrophedon)

Why not Hilbert curve?

Earlier versions used ST_Hilbert(point, bounds) over a representative point. STR-pack consistently touches 27–37% fewer row groups for tile-bbox queries on heavy-tailed feature-size data, regardless of whether Hilbert is fed the centroid or the bbox-center. The gap is structural, not about the input point:

  • Curve jumps. Hilbert can place spatially distant points consecutively at recursive sub-quadrant boundaries. If such a jump lands inside a row group window, that RG's bbox spans both sub-quadrants. STR has no such jumps — strip and tile boundaries are contiguous by construction and are pinned to RG boundaries.
  • Aspect ratio. Hilbert quantizes to a square grid. STR's strip count adapts to the actual row count, not to the bounding box shape.
  • No worst-case bound. Hilbert's per-RG extent is small on average but unbounded in the worst case. STR gives an analytic upper bound (one strip × one tile).

Empirical comparison on 200k synthetic features with heavy-tailed sizes (RG size 1000, 240 random tile bbox queries at z=8/10/12):

layout mean RGs touched mean bytes touched ratio vs STR
STR-pack ~7–8 ~1.0 MB 1.00×
Hilbert (PointOnSurface) ~10–12 ~1.4 MB 1.30–1.37×
Hilbert (bbox-center) ~10–12 ~1.4 MB 1.33–1.38×

The two Hilbert variants are within noise of each other — the loss is intrinsic to mapping 2D to 1D, not to the choice of input point.

Demo

https://d3ks1i00ysei8.cloudfront.net/

  • Client - Lambda - S3(Pyramid Parquet) architecture.

Benefits

  • Single Parquet can be used for both storage and streaming.
  • Overview of the entire dataset is obtained quickly — much faster than with a plain spatial sort (quadkey / Hilbert) because lower zooms contain only the features needed at that resolution.
  • Efficient row-group pruning thanks to GeoParquet 1.1 bbox covering + zoom-aligned row groups, so HTTP-range reads of huge files (100s of MB) are minimized.

How is zoomlevel calculated?

A density-based clustering approach is used. At lower zoom we don't need to render every feature, so only essential features are kept at lower zoomlevels. Repeating this process from --minzoom to --maxzoom decides which features are visible at each zoom.

--sort-by controls the priority order used during thinning (default keeps larger polygons / longer lines first; smaller ones are pushed to deeper zooms).

--min-visible-size-factor additionally skips features whose bbox is too small to be visible at a given zoom's grid resolution.

Why not pre-rendered tiles?

  • Building a tileset purely for streaming is operational overhead; streaming directly from one source file is simpler.
  • Lower-zoom tile contents are wasted once you zoom in past them, and the same feature is repeated across zooms.
  • Mapbox Vector Tiles are a lossy format.

Why not FlatGeobuf?

FlatGeobuf is also oriented to single-file storage and streaming, but it has no pyramid structure, so:

  • An overview of the whole dataset can't be obtained quickly.
  • A dense area's tile-based query may pull too many features.

References

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

yosegi-0.12.0.tar.gz (11.1 kB view details)

Uploaded Source

Built Distribution

If you're not sure about the file name format, learn more about wheel file names.

yosegi-0.12.0-py3-none-any.whl (12.2 kB view details)

Uploaded Python 3

File details

Details for the file yosegi-0.12.0.tar.gz.

File metadata

  • Download URL: yosegi-0.12.0.tar.gz
  • Upload date:
  • Size: 11.1 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? Yes
  • Uploaded via: uv/0.11.12 {"installer":{"name":"uv","version":"0.11.12","subcommand":["publish"]},"python":null,"implementation":{"name":null,"version":null},"distro":{"name":"Ubuntu","version":"24.04","id":"noble","libc":null},"system":{"name":null,"release":null},"cpu":null,"openssl_version":null,"setuptools_version":null,"rustc_version":null,"ci":true}

File hashes

Hashes for yosegi-0.12.0.tar.gz
Algorithm Hash digest
SHA256 7e6c2ea8cec7cd9f5c472d932a9ad5e0d49a68a929ca43678cbe4b35fcc50c44
MD5 e1c4630b4849151511ba5a9a80fbf10f
BLAKE2b-256 ada28d7441e7bce62f3db0beed1e6bcaee859ddfc1f42270ed214c3b706b3fcd

See more details on using hashes here.

File details

Details for the file yosegi-0.12.0-py3-none-any.whl.

File metadata

  • Download URL: yosegi-0.12.0-py3-none-any.whl
  • Upload date:
  • Size: 12.2 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? Yes
  • Uploaded via: uv/0.11.12 {"installer":{"name":"uv","version":"0.11.12","subcommand":["publish"]},"python":null,"implementation":{"name":null,"version":null},"distro":{"name":"Ubuntu","version":"24.04","id":"noble","libc":null},"system":{"name":null,"release":null},"cpu":null,"openssl_version":null,"setuptools_version":null,"rustc_version":null,"ci":true}

File hashes

Hashes for yosegi-0.12.0-py3-none-any.whl
Algorithm Hash digest
SHA256 ed81f35e076702d1f1f6ae9f9a575d81f0dc280a85ac69d65417c8bb93781881
MD5 4b3a4f436eb6134f5943630f9c66014f
BLAKE2b-256 142b7bd5f257355c5eabd7ed281a8ee4be56e38ee54cc5df92772c6ab7d6f8b4

See more details on using hashes here.

Supported by

AWS Cloud computing and Security Sponsor Datadog Monitoring Depot Continuous Integration Fastly CDN Google Download Analytics Pingdom Monitoring Sentry Error logging StatusPage Status page