Pyramid (Geo)Parquet Generator
Project description
Yosegi - Pyramid (Geo)Parquet Generator
Yosegi is a tool to generate Pyramid (Geo)Parquet files - optimized for streaming large geospatial datasets.
Usage
pip install yosegi
yosegi input.parquet output.pyramid.parquet # GeoParquet
yosegi input.shp output.pyramid.parquet # you can use GDAL/OGR formats
# yosegi -h
Yosegi: Pyramid Parquet Generator
positional arguments:
input_file Path to the input file
output_file Path to the output file
options:
-h, --help show this help message and exit
--minzoom MINZOOM Minimum zoom level (default: 0)
--maxzoom MAXZOOM Maximum zoom level (default: 16)
--resolution-base RESOLUTION_BASE
Base resolution (default: 2.5)
--resolution-multiplier RESOLUTION_MULTIPLIER
Resolution multiplier (default: 2.0)
--geometry-column GEOMETRY_COLUMN
Geometry column name (auto-detected if not specified)
--parquet-row-group-size PARQUET_ROW_GROUP_SIZE
Parquet row group size (default: 10240)
--sort-by SORT_BY Sort key for feature thinning
(default: ST_Area DESC for polygons,
ST_Length DESC for lines,
hash(_uid) otherwise)
--min-visible-size-factor MIN_VISIBLE_SIZE_FACTOR
Skip features at low zoom whose bbox is smaller than
the grid resolution times this factor (default: 1.0,
0 to disable)
Overview of Pyramid (Geo)Parquet
Concept
- Pre-calculate which features are visible at each zoomlevel (density-based thinning).
- Emit a per-row
bboxcovering column referenced via GeoParquet 1.1coveringmetadata so spatial readers can prune row groups via column statistics. - Sort the output so that:
- Each row group contains rows from exactly one zoomlevel (zoom-aligned RGs → tight
zoomlevelstats). - Within each zoomlevel, rows follow Sort-Tile-Recursive (STR) bbox packing: bbox-center-x is divided into ~√(N/M) strips, then bbox-center-y inside each strip is ordered with boustrophedon (alternating direction). This gives tight per-RG bbox stats so that bbox-covering predicates prune effectively.
- Each row group contains rows from exactly one zoomlevel (zoom-aligned RGs → tight
By these steps, generate Pyramid-structure in a single Parquet file, just like a GeoTIFF pyramid.
- Like GeoTIFF, an overview of the entire dataset can be obtained quickly.
- Unlike GeoTIFF, lower-resolution data is not repeated because this is vector.
QGIS: read Pyramid parquet on Amazon S3. Blue to Red means zoomlevel. Data: OvertureMaps
https://github.com/user-attachments/assets/4df86816-559d-4b34-b57a-2f3d4b8bd14c
QGIS: loading raw parquet (sorted only spatially)
https://github.com/user-attachments/assets/4e7a61f2-eb78-4658-a55f-8de31e2796c9
Well sorted spatially but it takes too much time to obtain an overview of the entire dataset.
Browser (DeckGL + DuckDB): load that parquet and render with GeoArrowScatterPlotLayer
https://github.com/user-attachments/assets/26e2f662-474b-4d11-ab56-f73587ef8b2e
Table Structure
Original input columns are preserved. Yosegi appends two columns:
| Column | Type | Purpose |
|---|---|---|
bbox |
STRUCT<xmin, ymin, xmax, ymax: DOUBLE> |
GeoParquet 1.1 covering bbox per feature; enables row-group pruning |
zoomlevel |
INT32 |
Lowest zoom at which the feature is visible (pyramid level) |
The geometry column is rewritten as WKB. GeoParquet covering.bbox metadata points to the bbox STRUCT child fields so that compliant readers automatically use it for spatial pushdown.
If the input already has a column named bbox or zoomlevel, the input column is dropped and overwritten by yosegi's value (a notice is logged).
┌──────────────────────┬──────────────────────┬───┬──────────────────────────────────────┬───────────┐
│ id │ geometry │ … │ bbox │ zoomlevel │
│ varchar │ blob │ │ struct(xmin double, ymin double, …) │ int32 │
├──────────────────────┼──────────────────────┼───┼──────────────────────────────────────┼───────────┤
│ 5e1da825-ef9b-45dd… │ <wkb POINT> │ … │ {xmin: 122.30, ymin: 45.40, xmax: …} │ 0 │
│ 8eb4aa8c-81fb-4a9b… │ <wkb POINT> │ … │ {xmin: 122.29, ymin: 45.41, xmax: …} │ 0 │
│ … (zoom 0 features) │ … │ … │ … │ 0 │
│ … (zoom 1 features) │ … │ … │ … │ 1 │
│ … │ … │ … │ … │ · │
│ … (zoom 16 features)│ … │ … │ … │ 16 │
└──────────────────────┴──────────────────────┴───┴──────────────────────────────────────┴───────────┘
Querying tiles
Use the bbox covering column for spatial pushdown plus zoomlevel for pyramid pruning:
-- DuckDB: features visible at z<=10 inside a tile bbox
SELECT * FROM 'example.pyramid.parquet'
WHERE zoomlevel <= 10
AND bbox.xmax >= :tile_minx AND bbox.xmin <= :tile_maxx
AND bbox.ymax >= :tile_miny AND bbox.ymin <= :tile_maxy;
DuckDB's parquet reader prunes row groups using the bbox.* and zoomlevel column statistics, so only the matching RGs are downloaded over HTTP.
Note:
geometry && ST_MakeEnvelope(...)returns the same rows but does not push down to row-group statistics, so it reads geometry pages for every candidate RG. Prefer the explicitbbox.*predicates above.
Physical layout: STR-pack
Within each zoomlevel, yosegi orders rows with Sort-Tile-Recursive (STR) bbox packing. The goal is to keep each row group's bbox as compact as possible so that bbox-covering predicates prune effectively.
The algorithm
Let N be the number of rows in this zoom and M the row group size (--parquet-row-group-size).
- Strip by bbox-center-x. Sort all N rows by
(xmin + xmax) / 2and chunk them into strips ofstrip_size = M × round(√(N/M))rows. There are roughly√(N/M)strips, each containing roughly√(N/M)row groups. - Sort by bbox-center-y inside each strip, with boustrophedon. Even-numbered strips are sorted ascending in y, odd-numbered strips descending. Consecutive features at strip boundaries stay close in y.
The SQL pattern (simplified):
ORDER BY
FLOOR((ROW_NUMBER() OVER (ORDER BY xc) - 1) / strip_size),
CASE WHEN strip_id % 2 = 0 THEN yc ELSE -yc END
Strip boundaries align with RG boundaries
strip_size is always a multiple of M, so when pyarrow writes one row group every M rows, every spatial discontinuity (jumping from one strip's x-range to the next) lands on a row group boundary. No row group ever spans two strips. Per-RG bbox extent is bounded by:
- x extent ≤ one strip width ≈ total x range /
√(N/M) - y extent ≤ one tile height ≈ total y range /
√(N/M)
That is, each RG covers a near-square tile of area ≈ M / N of the data extent — within a constant factor of the theoretical optimum.
file order →
┌──── strip 0 (low x) ────┐┌──── strip 1 (mid x) ────┐┌──── strip 2 ────
│ RG0 RG1 RG2 ... RG13 ││ RG14 RG15 ... RG27 ││ ...
│ ↑ ││ ↓ ││ ↑
│ y low → → → → → y high ││ y high → → → → → y low ││ y low → ...
└──────────────────────────┘└──────────────────────────┘└─────────────
★ strip boundary == RG boundary; x jumps,
y stays at high (boustrophedon)
Why not Hilbert curve?
Earlier versions used ST_Hilbert(point, bounds) over a representative point. STR-pack consistently touches 27–37% fewer row groups for tile-bbox queries on heavy-tailed feature-size data, regardless of whether Hilbert is fed the centroid or the bbox-center. The gap is structural, not about the input point:
- Curve jumps. Hilbert can place spatially distant points consecutively at recursive sub-quadrant boundaries. If such a jump lands inside a row group window, that RG's bbox spans both sub-quadrants. STR has no such jumps — strip and tile boundaries are contiguous by construction and are pinned to RG boundaries.
- Aspect ratio. Hilbert quantizes to a square grid. STR's strip count adapts to the actual row count, not to the bounding box shape.
- No worst-case bound. Hilbert's per-RG extent is small on average but unbounded in the worst case. STR gives an analytic upper bound (one strip × one tile).
Empirical comparison on 200k synthetic features with heavy-tailed sizes (RG size 1000, 240 random tile bbox queries at z=8/10/12):
| layout | mean RGs touched | mean bytes touched | ratio vs STR |
|---|---|---|---|
| STR-pack | ~7–8 | ~1.0 MB | 1.00× |
| Hilbert (PointOnSurface) | ~10–12 | ~1.4 MB | 1.30–1.37× |
| Hilbert (bbox-center) | ~10–12 | ~1.4 MB | 1.33–1.38× |
The two Hilbert variants are within noise of each other — the loss is intrinsic to mapping 2D to 1D, not to the choice of input point.
Demo
https://d3ks1i00ysei8.cloudfront.net/
- Client - Lambda - S3(Pyramid Parquet) architecture.
Benefits
- Single Parquet can be used for both storage and streaming.
- Overview of the entire dataset is obtained quickly — much faster than with a plain spatial sort (quadkey / Hilbert) because lower zooms contain only the features needed at that resolution.
- Efficient row-group pruning thanks to GeoParquet 1.1 bbox covering + zoom-aligned row groups, so HTTP-range reads of huge files (100s of MB) are minimized.
How is zoomlevel calculated?
A density-based clustering approach is used. At lower zoom we don't need to render every feature, so only essential features are kept at lower zoomlevels. Repeating this process from --minzoom to --maxzoom decides which features are visible at each zoom.
--sort-by controls the priority order used during thinning (default keeps larger polygons / longer lines first; smaller ones are pushed to deeper zooms).
--min-visible-size-factor additionally skips features whose bbox is too small to be visible at a given zoom's grid resolution.
Why not pre-rendered tiles?
- Building a tileset purely for streaming is operational overhead; streaming directly from one source file is simpler.
- Lower-zoom tile contents are wasted once you zoom in past them, and the same feature is repeated across zooms.
- Mapbox Vector Tiles are a lossy format.
Why not FlatGeobuf?
FlatGeobuf is also oriented to single-file storage and streaming, but it has no pyramid structure, so:
- An overview of the whole dataset can't be obtained quickly.
- A dense area's tile-based query may pull too many features.
References
Project details
Release history Release notifications | RSS feed
Download files
Download the file for your platform. If you're not sure which to choose, learn more about installing packages.
Source Distribution
Built Distribution
Filter files by name, interpreter, ABI, and platform.
If you're not sure about the file name format, learn more about wheel file names.
Copy a direct link to the current filters
File details
Details for the file yosegi-0.12.0.tar.gz.
File metadata
- Download URL: yosegi-0.12.0.tar.gz
- Upload date:
- Size: 11.1 kB
- Tags: Source
- Uploaded using Trusted Publishing? Yes
- Uploaded via: uv/0.11.12 {"installer":{"name":"uv","version":"0.11.12","subcommand":["publish"]},"python":null,"implementation":{"name":null,"version":null},"distro":{"name":"Ubuntu","version":"24.04","id":"noble","libc":null},"system":{"name":null,"release":null},"cpu":null,"openssl_version":null,"setuptools_version":null,"rustc_version":null,"ci":true}
File hashes
| Algorithm | Hash digest | |
|---|---|---|
| SHA256 |
7e6c2ea8cec7cd9f5c472d932a9ad5e0d49a68a929ca43678cbe4b35fcc50c44
|
|
| MD5 |
e1c4630b4849151511ba5a9a80fbf10f
|
|
| BLAKE2b-256 |
ada28d7441e7bce62f3db0beed1e6bcaee859ddfc1f42270ed214c3b706b3fcd
|
File details
Details for the file yosegi-0.12.0-py3-none-any.whl.
File metadata
- Download URL: yosegi-0.12.0-py3-none-any.whl
- Upload date:
- Size: 12.2 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? Yes
- Uploaded via: uv/0.11.12 {"installer":{"name":"uv","version":"0.11.12","subcommand":["publish"]},"python":null,"implementation":{"name":null,"version":null},"distro":{"name":"Ubuntu","version":"24.04","id":"noble","libc":null},"system":{"name":null,"release":null},"cpu":null,"openssl_version":null,"setuptools_version":null,"rustc_version":null,"ci":true}
File hashes
| Algorithm | Hash digest | |
|---|---|---|
| SHA256 |
ed81f35e076702d1f1f6ae9f9a575d81f0dc280a85ac69d65417c8bb93781881
|
|
| MD5 |
4b3a4f436eb6134f5943630f9c66014f
|
|
| BLAKE2b-256 |
142b7bd5f257355c5eabd7ed281a8ee4be56e38ee54cc5df92772c6ab7d6f8b4
|