Pyramid (Geo)Parquet Generator

Project description

Yosegi - Pyramid (Geo)Parquet Generator

Yosegi is a tool to generate Pyramid (Geo)Parquet files - optimized for streaming large geospatial datasets.

Usage

pip install yosegi
yosegi input.parquet output.pyramid.parquet # GeoParquet
yosegi input.shp output.pyramid.parquet # you can use GDAL/OGR formats

# yosegi -h

Yosegi: Pyramid Parquet Generator

positional arguments:
  input_file            Path to the input file
  output_file           Path to the output file

options:
  -h, --help            show this help message and exit
  --minzoom MINZOOM     Minimum zoom level (default: 0)
  --maxzoom MAXZOOM     Maximum zoom level (default: 16)
  --resolution-base RESOLUTION_BASE
                        Base resolution (default: 2.5)
  --resolution-multiplier RESOLUTION_MULTIPLIER
                        Resolution multiplier (default: 2.0)
  --geometry-column GEOMETRY_COLUMN
                        Geometry column name (auto-detected if not specified)
  --parquet-row-group-size PARQUET_ROW_GROUP_SIZE
                        Parquet row group size (default: 10240)
  --sort-by SORT_BY     Sort key for feature thinning
                        (default: ST_Area DESC for polygons,
                                  ST_Length DESC for lines,
                                  hash(_uid) otherwise)
  --min-visible-size-factor MIN_VISIBLE_SIZE_FACTOR
                        Skip features at low zoom whose bbox is smaller than
                        the grid resolution times this factor (default: 1.0,
                        0 to disable)

Overview of Pyramid (Geo)Parquet

Concept

Pre-calculate which features are visible at each zoomlevel (density-based thinning).
Emit a per-row bbox covering column referenced via GeoParquet 1.1 covering metadata so spatial readers can prune row groups via column statistics.
Sort the output so that:
- Each row group contains rows from exactly one zoomlevel (zoom-aligned RGs → tight zoomlevel stats).
- Within each zoomlevel, rows follow Sort-Tile-Recursive (STR) bbox packing: bbox-center-x is divided into ~√(N/M) strips, then bbox-center-y inside each strip is ordered with boustrophedon (alternating direction). This gives tight per-RG bbox stats so that bbox-covering predicates prune effectively.

By these steps, generate Pyramid-structure in a single Parquet file, just like a GeoTIFF pyramid.

Like GeoTIFF, an overview of the entire dataset can be obtained quickly.
Unlike GeoTIFF, lower-resolution data is not repeated because this is vector.

QGIS: read Pyramid parquet on Amazon S3. Blue to Red means zoomlevel. Data: OvertureMaps

https://github.com/user-attachments/assets/4df86816-559d-4b34-b57a-2f3d4b8bd14c

QGIS: loading raw parquet (sorted only spatially)

https://github.com/user-attachments/assets/4e7a61f2-eb78-4658-a55f-8de31e2796c9

Well sorted spatially but it takes too much time to obtain an overview of the entire dataset.

Browser (DeckGL + DuckDB): load that parquet and render with GeoArrowScatterPlotLayer

https://github.com/user-attachments/assets/26e2f662-474b-4d11-ab56-f73587ef8b2e

Table Structure

Original input columns are preserved. Yosegi appends two columns:

Column	Type	Purpose
`bbox`	`STRUCT<xmin, ymin, xmax, ymax: DOUBLE>`	GeoParquet 1.1 covering bbox per feature; enables row-group pruning
`zoomlevel`	`INT32`	Lowest zoom at which the feature is visible (pyramid level)

The geometry column is rewritten as WKB. GeoParquet covering.bbox metadata points to the bbox STRUCT child fields so that compliant readers automatically use it for spatial pushdown.

If the input already has a column named bbox or zoomlevel, the input column is dropped and overwritten by yosegi's value (a notice is logged).

┌──────────────────────┬──────────────────────┬───┬──────────────────────────────────────┬───────────┐
│          id          │       geometry       │ … │                 bbox                 │ zoomlevel │
│       varchar        │         blob         │   │ struct(xmin double, ymin double, …)  │   int32   │
├──────────────────────┼──────────────────────┼───┼──────────────────────────────────────┼───────────┤
│ 5e1da825-ef9b-45dd…  │ <wkb POINT>          │ … │ {xmin: 122.30, ymin: 45.40, xmax: …} │         0 │
│ 8eb4aa8c-81fb-4a9b…  │ <wkb POINT>          │ … │ {xmin: 122.29, ymin: 45.41, xmax: …} │         0 │
│ … (zoom 0 features) │ …                    │ … │ …                                    │         0 │
│ … (zoom 1 features) │ …                    │ … │ …                                    │         1 │
│ …                    │ …                    │ … │ …                                    │         · │
│ … (zoom 16 features)│ …                    │ … │ …                                    │        16 │
└──────────────────────┴──────────────────────┴───┴──────────────────────────────────────┴───────────┘

Querying tiles

Use the bbox covering column for spatial pushdown plus zoomlevel for pyramid pruning:

-- DuckDB: features visible at z<=10 inside a tile bbox
SELECT * FROM 'example.pyramid.parquet'
WHERE zoomlevel <= 10
  AND bbox.xmax >= :tile_minx AND bbox.xmin <= :tile_maxx
  AND bbox.ymax >= :tile_miny AND bbox.ymin <= :tile_maxy;

DuckDB's parquet reader prunes row groups using the bbox.* and zoomlevel column statistics, so only the matching RGs are downloaded over HTTP.

Note: geometry && ST_MakeEnvelope(...) returns the same rows but does not push down to row-group statistics, so it reads geometry pages for every candidate RG. Prefer the explicit bbox.* predicates above.

Physical layout: STR-pack

Within each zoomlevel, yosegi orders rows with Sort-Tile-Recursive (STR) bbox packing. The goal is to keep each row group's bbox as compact as possible so that bbox-covering predicates prune effectively.

The algorithm

Let N be the number of rows in this zoom and M the row group size (--parquet-row-group-size).

Strip by bbox-center-x. Sort all N rows by (xmin + xmax) / 2 and chunk them into strips of strip_size = M × round(√(N/M)) rows. There are roughly √(N/M) strips, each containing roughly √(N/M) row groups.
Sort by bbox-center-y inside each strip, with boustrophedon. Even-numbered strips are sorted ascending in y, odd-numbered strips descending. Consecutive features at strip boundaries stay close in y.

The SQL pattern (simplified):

ORDER BY
  FLOOR((ROW_NUMBER() OVER (ORDER BY xc) - 1) / strip_size),
  CASE WHEN strip_id % 2 = 0 THEN  yc ELSE -yc END

Strip boundaries align with RG boundaries

strip_size is always a multiple of M, so when pyarrow writes one row group every M rows, every spatial discontinuity (jumping from one strip's x-range to the next) lands on a row group boundary. No row group ever spans two strips. Per-RG bbox extent is bounded by:

x extent ≤ one strip width ≈ total x range / √(N/M)
y extent ≤ one tile height ≈ total y range / √(N/M)

That is, each RG covers a near-square tile of area ≈ M / N of the data extent — within a constant factor of the theoretical optimum.

file order →
┌──── strip 0 (low x) ────┐┌──── strip 1 (mid x) ────┐┌──── strip 2 ────
│  RG0  RG1  RG2 ... RG13 ││  RG14  RG15 ... RG27    ││  ...
│   ↑                      ││   ↓                      ││   ↑
│   y low → → → → → y high ││   y high → → → → → y low ││   y low → ...
└──────────────────────────┘└──────────────────────────┘└─────────────
                            ★ strip boundary == RG boundary; x jumps,
                              y stays at high (boustrophedon)

Why not Hilbert curve?

Earlier versions used ST_Hilbert(point, bounds) over a representative point. STR-pack consistently touches 27–37% fewer row groups for tile-bbox queries on heavy-tailed feature-size data, regardless of whether Hilbert is fed the centroid or the bbox-center. The gap is structural, not about the input point:

Curve jumps. Hilbert can place spatially distant points consecutively at recursive sub-quadrant boundaries. If such a jump lands inside a row group window, that RG's bbox spans both sub-quadrants. STR has no such jumps — strip and tile boundaries are contiguous by construction and are pinned to RG boundaries.
Aspect ratio. Hilbert quantizes to a square grid. STR's strip count adapts to the actual row count, not to the bounding box shape.
No worst-case bound. Hilbert's per-RG extent is small on average but unbounded in the worst case. STR gives an analytic upper bound (one strip × one tile).

Empirical comparison on 200k synthetic features with heavy-tailed sizes (RG size 1000, 240 random tile bbox queries at z=8/10/12):

layout	mean RGs touched	mean bytes touched	ratio vs STR
STR-pack	~7–8	~1.0 MB	1.00×
Hilbert (PointOnSurface)	~10–12	~1.4 MB	1.30–1.37×
Hilbert (bbox-center)	~10–12	~1.4 MB	1.33–1.38×

The two Hilbert variants are within noise of each other — the loss is intrinsic to mapping 2D to 1D, not to the choice of input point.

Demo

https://d3ks1i00ysei8.cloudfront.net/

Client - Lambda - S3(Pyramid Parquet) architecture.

Benefits

Single Parquet can be used for both storage and streaming.
Overview of the entire dataset is obtained quickly — much faster than with a plain spatial sort (quadkey / Hilbert) because lower zooms contain only the features needed at that resolution.
Efficient row-group pruning thanks to GeoParquet 1.1 bbox covering + zoom-aligned row groups, so HTTP-range reads of huge files (100s of MB) are minimized.

How is zoomlevel calculated?

A density-based clustering approach is used. At lower zoom we don't need to render every feature, so only essential features are kept at lower zoomlevels. Repeating this process from --minzoom to --maxzoom decides which features are visible at each zoom.

--sort-by controls the priority order used during thinning (default keeps larger polygons / longer lines first; smaller ones are pushed to deeper zooms).

--min-visible-size-factor additionally skips features whose bbox is too small to be visible at a given zoom's grid resolution.

Why not pre-rendered tiles?

Building a tileset purely for streaming is operational overhead; streaming directly from one source file is simpler.
Lower-zoom tile contents are wasted once you zoom in past them, and the same feature is repeated across zooms.
Mapbox Vector Tiles are a lossy format.

Why not FlatGeobuf?

FlatGeobuf is also oriented to single-file storage and streaming, but it has no pyramid structure, so:

An overview of the whole dataset can't be obtained quickly.
A dense area's tile-based query may pull too many features.

References

Project details

Release history Release notifications | RSS feed

This version

0.12.0

May 10, 2026

0.11.0

May 6, 2026

0.9.0

May 5, 2026

0.8.0

May 4, 2026

0.7.0

May 3, 2026

0.6.0

May 3, 2026

0.5.0

May 3, 2026

0.4.0

May 1, 2026

0.3.0

May 1, 2026

0.2.1

May 1, 2026

0.2.0

Apr 18, 2026

0.1.0

Dec 24, 2025

Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

yosegi-0.12.0.tar.gz (11.1 kB view details)

Uploaded May 10, 2026 Source

Built Distribution

If you're not sure about the file name format, learn more about wheel file names.

The dropdown lists show the available interpreters, ABIs, and platforms. Enable javascript to be able to filter the list of wheel files.

yosegi-0.12.0-py3-none-any.whl (12.2 kB view details)

Uploaded May 10, 2026 Python 3

File details

Details for the file yosegi-0.12.0.tar.gz.

File metadata

Download URL: yosegi-0.12.0.tar.gz
Upload date: May 10, 2026
Size: 11.1 kB
Tags: Source
Uploaded using Trusted Publishing? Yes
Uploaded via: uv/0.11.12 {"installer":{"name":"uv","version":"0.11.12","subcommand":["publish"]},"python":null,"implementation":{"name":null,"version":null},"distro":{"name":"Ubuntu","version":"24.04","id":"noble","libc":null},"system":{"name":null,"release":null},"cpu":null,"openssl_version":null,"setuptools_version":null,"rustc_version":null,"ci":true}

File hashes

Hashes for yosegi-0.12.0.tar.gz
Algorithm	Hash digest
SHA256	`7e6c2ea8cec7cd9f5c472d932a9ad5e0d49a68a929ca43678cbe4b35fcc50c44`
MD5	`e1c4630b4849151511ba5a9a80fbf10f`
BLAKE2b-256	`ada28d7441e7bce62f3db0beed1e6bcaee859ddfc1f42270ed214c3b706b3fcd`

See more details on using hashes here.

File details

Details for the file yosegi-0.12.0-py3-none-any.whl.

File metadata

Download URL: yosegi-0.12.0-py3-none-any.whl
Upload date: May 10, 2026
Size: 12.2 kB
Tags: Python 3
Uploaded using Trusted Publishing? Yes
Uploaded via: uv/0.11.12 {"installer":{"name":"uv","version":"0.11.12","subcommand":["publish"]},"python":null,"implementation":{"name":null,"version":null},"distro":{"name":"Ubuntu","version":"24.04","id":"noble","libc":null},"system":{"name":null,"release":null},"cpu":null,"openssl_version":null,"setuptools_version":null,"rustc_version":null,"ci":true}

File hashes

Hashes for yosegi-0.12.0-py3-none-any.whl
Algorithm	Hash digest
SHA256	`ed81f35e076702d1f1f6ae9f9a575d81f0dc280a85ac69d65417c8bb93781881`
MD5	`4b3a4f436eb6134f5943630f9c66014f`
BLAKE2b-256	`142b7bd5f257355c5eabd7ed281a8ee4be56e38ee54cc5df92772c6ab7d6f8b4`

See more details on using hashes here.

yosegi 0.12.0

Navigation

Verified details

Maintainers

Unverified details

Meta

Project description

Yosegi - Pyramid (Geo)Parquet Generator

Usage

Overview of Pyramid (Geo)Parquet

Concept

QGIS: read Pyramid parquet on Amazon S3. Blue to Red means zoomlevel. Data: OvertureMaps

QGIS: loading raw parquet (sorted only spatially)

Browser (DeckGL + DuckDB): load that parquet and render with GeoArrowScatterPlotLayer

Table Structure

Querying tiles

Physical layout: STR-pack

The algorithm

Strip boundaries align with RG boundaries

Why not Hilbert curve?

Demo

Benefits

How is zoomlevel calculated?

Why not pre-rendered tiles?

Why not FlatGeobuf?

References

Project details

Verified details

Maintainers

Unverified details

Meta

Release history Release notifications | RSS feed

Download files

Source Distribution

Built Distribution

File details

File metadata

File hashes

File details

File metadata

File hashes