Declarative spatial query layer for Polars

PyCanopy

A spatial query layer for Polars. Rust core, Python API.


Note: Highly competitive on Apache SpatialBench (a single-node spatial query benchmark): fastest on 7/12 queries at SF1 and 5/12 at SF10, without ever leaving Polars-like syntax.

[Figure: PyCanopy vs SedonaDB, DuckDB, and GeoPandas on Apache SpatialBench SF1. Lower is better; bars past the cap are truncated and labelled with their value; TIMEOUT / ERROR annotated.]


Installation

pip install pycanopy

Pre-built wheels for Linux, macOS, and Windows. No Rust toolchain required.

import polars as pl
from pycanopy import SpatialFrame

sf = SpatialFrame(pl.read_parquet("cities.parquet"), x_col="lon", y_col="lat")
result = sf.lazy().filter(pl.col("population") > 100_000).range_query(-10.0, 35.0, 40.0, 70.0).collect()

Why PyCanopy

The only spatial engine with a Polars-native API, cost-model-driven index selection, and a full spatial query planner.

PyCanopy GeoPandas DuckDB SedonaDB Spatial Polars
Polars-native, no SQL or conversion ✗ (SQL) ✗ (SQL)
Spatial query planner (reorder, fuse, pushdown) ✓ (SQL)
Index vs scan decided by cost model
Adaptive index (KD-tree / R-tree / grid) ✗ STRtree ✗ R-tree ✗ Quadtree ✗ STRtree / KDTree

Operations

Point datasets

Operation Call Returns
Range query .range_query(min_x, min_y, max_x, max_y) Rows inside the bounding box
Range filter .range_filter(min_x, min_y, max_x, max_y) New SpatialFrame with only rows inside the bounding box
k-nearest neighbours .knn(x, y, k) The k rows nearest a point
kNN join .knn_join(df, x_col, y_col, k) The k nearest rows for each query point
Within-distance join .within_distance_join(df, x_col, y_col, distance) Rows within distance of each query point
Convex-hull area SpatialFrame.convex_hull_area(xs, ys) Area of the convex hull of a point set
Batch convex-hull area Engine.group_convex_hull_areas(xs_series, ys_series) Convex hull area for each group, given Polars List[Float64] columns
WKB point distance wkb_point_distance(series_a, series_b) Euclidean distance between two WKB point columns

Polygon datasets

Operation Call Returns
Point in polygon .contains(x, y) Polygons that contain the point
MBR range .range_query(min_x, min_y, max_x, max_y) Polygons whose bounding box meets the query box
Range filter .range_filter(min_x, min_y, max_x, max_y) New SpatialFrame with only polygons intersecting the bounding box
Within join .within_join(df, x_col, y_col) Polygons that contain each query point
Point-to-polygon distance join .polygon_within_distance_join(df, x_col, y_col, distance) Polygons within distance of each query point
Point-to-polygon kNN join .polygon_knn_join(df, x_col, y_col, k) The k nearest polygons for each query point
Intersects self-join .intersects_pairs(key_col=None) Intersecting polygon pairs with overlap area and IoU; key_col replaces positional indices with values from that column
Area .polygon_areas() Area of each polygon
Points near a polygon .points_within_distance_of_polygon(polygon, distance) Points within distance of a single polygon

Reductions and streaming (compose with any join)

Operation Call Returns
Aggregate-join .group_by(keys).agg(pc.agg.count/sum/mean/min/max(...)) One row per group, reduced over the join with no pair frame
Projection pushdown .select(cols) Narrows both join sides before the gather
Stream in batches .collect_batched() An iterator of result morsels, bounded memory
Stream to Parquet .sink_parquet(path) Writes the result to disk in bounded memory
Out-of-core pipeline .lazy_source() A Polars source that fuses join + sort + sink, spilling to disk

Usage

Point dataset: range and KNN

import polars as pl
from pycanopy import SpatialFrame

df = pl.read_parquet("cities.parquet")
sf = SpatialFrame(df, x_col="lon", y_col="lat")

# Bounding-box filter combined with a scalar predicate.
# Optimizer places the scalar filter first, then runs the range query
# on the reduced row set.
result = (
    sf.lazy()
    .filter(pl.col("population") > 100_000)
    .range_query(min_x=-10.0, min_y=35.0, max_x=40.0, max_y=70.0)
    .collect()
)

# k-nearest neighbours
nearest = sf.lazy().knn(x=2.35, y=48.85, k=5).collect()

Inspecting the plan

# Declare ops in any order. explain() shows what the optimizer will actually run.
lf = (
    sf.lazy()
    .range_query(min_x=-10.0, min_y=35.0, max_x=40.0, max_y=70.0)
    .filter(pl.col("population") > 100_000)
)

print(lf.explain())
# RANGE_QUERY [(-10, 35) → (40, 70)]
# FROM
#   FILTER [(col("population")) > (dyn int: 100000)]
#   FROM
#     DF [N=100,000; path: EXPR]

The optimizer flipped the declaration order. The scalar filter runs first on all rows, then the spatial query runs on the smaller survivor set. Plans follow Polars' FROM-chain convention, so the bottom runs first and the top is the final result.
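
The reordering can be pictured as a rewrite over the list of declared operations. The sketch below is a toy model and not PyCanopy's optimizer: the `Op` tuple and `push_down_filters` are hypothetical names, and the only rule it encodes is the one described above (scalar filters run before spatial operations).

```python
from typing import NamedTuple


class Op(NamedTuple):
    kind: str    # "filter" (scalar predicate) or "range_query" (spatial)
    detail: str  # human-readable description, used only for printing


def push_down_filters(declared: list[Op]) -> list[Op]:
    """Return the execution order: scalar filters first, spatial ops after.

    This is a stable partition, so the relative order inside each group
    is preserved.
    """
    scalar = [op for op in declared if op.kind == "filter"]
    spatial = [op for op in declared if op.kind != "filter"]
    return scalar + spatial


declared = [
    Op("range_query", "(-10, 35) -> (40, 70)"),
    Op("filter", "population > 100000"),
]
execution_order = push_down_filters(declared)

# Print in the same convention as explain(): the last line runs first.
for op in reversed(execution_order):
    print(op.kind.upper(), op.detail)
```

The printed order puts the range query on top and the filter below it, mirroring the FROM-chain layout of the plan above.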

Aggregate over a join

import pycanopy as pc

# Count trips per zone and average their fare, reduced over a streamed
# point-in-polygon join. The full pair frame is never materialised: each
# morsel reduces to per-group partials that combine into the final result.
stats = (
    zones.lazy()
    .within_join(trips, x_col="lon", y_col="lat")
    .group_by(["zone_id", "zone_name"])
    .agg(trip_count=pc.agg.count(), avg_fare=pc.agg.mean("fare"))
)
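
Means cannot be combined across morsels directly, so a streamed reduction has to carry sums and counts and divide only at the end. The sketch below illustrates that idea in plain Python. It is not PyCanopy's implementation; `reduce_morsel`, `combine`, and the sample data are hypothetical.

```python
from collections import defaultdict


def reduce_morsel(pairs):
    """Reduce one morsel of (zone_id, fare) join pairs to per-group partials."""
    partial = defaultdict(lambda: [0, 0.0])  # zone_id -> [count, fare_sum]
    for zone_id, fare in pairs:
        partial[zone_id][0] += 1
        partial[zone_id][1] += fare
    return partial


def combine(partials):
    """Merge per-morsel partials and finalise count and mean per group."""
    total = defaultdict(lambda: [0, 0.0])
    for partial in partials:
        for zone_id, (count, fare_sum) in partial.items():
            total[zone_id][0] += count
            total[zone_id][1] += fare_sum
    return {
        zone_id: {"trip_count": count, "avg_fare": fare_sum / count}
        for zone_id, (count, fare_sum) in total.items()
    }


# Two morsels of join output; only one morsel is held in memory at a time.
morsels = [
    [("A", 10.0), ("B", 4.0)],
    [("A", 20.0)],
]
zone_stats = combine(reduce_morsel(m) for m in morsels)
```

Memory use is bounded by the number of groups plus one morsel, not by the number of joined pairs.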

Out-of-core joins (larger than RAM)

# A join whose result exceeds memory: stream it straight to Parquet,
# bounded to one morsel at a time.
sf.lazy().polygon_knn_join(trips, "lon", "lat", k=5).sink_parquet("nearest.parquet")

# Or fuse the join with a sort and sink into a single spilling Polars
# pipeline, so even an ordered result larger than RAM never materialises.
(
    sf.lazy()
    .polygon_knn_join(trips, "lon", "lat", k=5)
    .select(["trip_id", "building_id", "distance_to_polygon"])
    .lazy_source()
    .sort("distance_to_polygon")
    .sink_parquet("nearest_sorted.parquet")
)
More examples: point and polygon joins, aggregations, branching, delta buffer, index modes

Chaining multiple spatial predicates

# Two range predicates are fused into a single index build on large datasets.
result = (
    sf.lazy()
    .range_query(0.0, 0.0, 50.0, 50.0)
    .range_query(10.0, 10.0, 40.0, 40.0)
    .collect()
)
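
For a point dataset, a row passes two chained range queries exactly when it lies in the intersection of the two boxes, so one way to picture fusion is as a box intersection. The sketch below shows only that geometric step; whether PyCanopy fuses this way internally is an assumption, and `fuse_ranges` is a hypothetical helper.

```python
from typing import Optional, Tuple

Box = Tuple[float, float, float, float]  # (min_x, min_y, max_x, max_y)


def fuse_ranges(a: Box, b: Box) -> Optional[Box]:
    """Intersect two bounding boxes; None means no row can match both."""
    min_x, min_y = max(a[0], b[0]), max(a[1], b[1])
    max_x, max_y = min(a[2], b[2]), min(a[3], b[3])
    if min_x > max_x or min_y > max_y:
        return None
    return (min_x, min_y, max_x, max_y)


fused = fuse_ranges((0.0, 0.0, 50.0, 50.0), (10.0, 10.0, 40.0, 40.0))
disjoint = fuse_ranges((0.0, 0.0, 1.0, 1.0), (2.0, 2.0, 3.0, 3.0))
```

A single query against the fused box needs one index probe instead of two.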

KNN join

query_df = pl.DataFrame({"qx": [2.35, 13.4], "qy": [48.85, 52.5]})

# For each row in query_df, find the 3 nearest rows in sf.
result = sf.lazy().knn_join(query_df, x_col="qx", y_col="qy", k=3).collect()
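
As a reference for what a kNN join computes, here is a brute-force version in plain Python. It assumes Euclidean distance and returns `(query_index, point_index)` pairs; both the helper name and that output shape are hypothetical and are not PyCanopy's result schema.

```python
import heapq
import math


def brute_force_knn_join(points, queries, k):
    """For each query point, return the indices of its k nearest points."""
    pairs = []
    for qi, (qx, qy) in enumerate(queries):
        nearest = heapq.nsmallest(
            k,
            range(len(points)),
            key=lambda i: math.hypot(points[i][0] - qx, points[i][1] - qy),
        )
        pairs.extend((qi, pi) for pi in nearest)
    return pairs


points = [(0.0, 0.0), (1.0, 0.0), (5.0, 5.0), (6.0, 5.0)]
queries = [(0.2, 0.0), (5.4, 5.0)]
pairs = brute_force_knn_join(points, queries, k=2)
```

This costs one distance computation per (query, point) pair, which is the work an index is meant to avoid.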

Polygon dataset: contains and range

from shapely.geometry import box
from pycanopy import SpatialFrame

polygons = [box(i, 0, i + 0.9, 0.9) for i in range(100_000)]
df = pl.DataFrame({"id": list(range(100_000)), "geom": polygons})
sf = SpatialFrame.from_polygons(df, geometry_col="geom")

# Which polygons contain this point?
containing = sf.lazy().contains(x=5.5, y=0.5).collect()

# Which polygon MBRs intersect this bbox?
intersecting = sf.lazy().range_query(0.0, 0.0, 10.0, 1.0).collect()

Polygon holes

from shapely.geometry import Polygon

# Interior rings (holes) are fully supported.
outer = [(0, 0), (10, 0), (10, 10), (0, 10)]
hole  = [(2, 2), (8, 2),  (8, 8),  (2, 8)]
donut = Polygon(outer, [hole])

sf = SpatialFrame.from_polygons(pl.DataFrame({"id": [0], "geom": [donut]}), geometry_col="geom")

# Point inside the hole is NOT contained.
sf.lazy().contains(x=5.0, y=5.0).collect()   # empty

# Point outside the hole but inside the outer ring IS contained.
sf.lazy().contains(x=1.0, y=1.0).collect()   # returns the polygon row
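
The hole semantics above can be reproduced with a standard even-odd ray-casting test: a point is contained if it is inside the outer ring and inside none of the holes. This is a minimal sketch, not PyCanopy's algorithm. `in_ring` and `contains` are hypothetical helpers, and points exactly on a boundary are not treated specially.

```python
def in_ring(x, y, ring):
    """Even-odd ray casting: count edge crossings of a ray going right."""
    inside = False
    n = len(ring)
    for i in range(n):
        x1, y1 = ring[i]
        x2, y2 = ring[(i + 1) % n]
        # The edge straddles the horizontal line through y.
        if (y1 > y) != (y2 > y):
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x < x_cross:
                inside = not inside
    return inside


def contains(x, y, outer, holes=()):
    """Inside the outer ring and not inside any interior ring."""
    return in_ring(x, y, outer) and not any(in_ring(x, y, h) for h in holes)


outer = [(0, 0), (10, 0), (10, 10), (0, 10)]
hole = [(2, 2), (8, 2), (8, 8), (2, 8)]

in_hole = contains(5.0, 5.0, outer, [hole])
in_body = contains(1.0, 1.0, outer, [hole])
```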

Within join

# For each query point, find which polygons in sf contain it.
query_df = pl.DataFrame({"qx": [5.5, 12.3], "qy": [0.5, 0.5]})
result = sf.lazy().within_join(query_df, x_col="qx", y_col="qy").collect()

Within-distance join

# For each query point, find all sf points within 50 km.
query_df = pl.DataFrame({"qx": [2.35, 13.4], "qy": [48.85, 52.5]})
result = sf.lazy().within_distance_join(query_df, x_col="qx", y_col="qy", distance=50.0).collect()

Point-to-polygon joins

# (polygon SpatialFrame) For each query point, the polygons within a distance
# of it. Distance is to the polygon boundary, and zero when the point is inside.
query_df = pl.DataFrame({"qx": [5.5, 12.3], "qy": [0.5, 0.5]})
near = sf.lazy().polygon_within_distance_join(query_df, x_col="qx", y_col="qy", distance=2.0).collect()

# For each query point, its k nearest polygons (adds a distance_to_polygon column).
nearest = sf.lazy().polygon_knn_join(query_df, x_col="qx", y_col="qy", k=3).collect()

Polygon aggregations

# Area of every polygon (appends an 'area' column).
areas = sf.polygon_areas()

# All intersecting polygon pairs, with overlap area and IoU.
overlaps = sf.intersects_pairs()

# (point SpatialFrame) rows whose point lies within a distance of one polygon.
from shapely.geometry import box
pts = point_sf.points_within_distance_of_polygon(box(0.0, 0.0, 1.0, 1.0), distance=0.5)
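
IoU (intersection over union) is the overlap area divided by the area of the union of the two shapes. The sketch below computes both quantities for axis-aligned boxes only, which keeps the arithmetic short; general polygons need a polygon clipping step that is not shown. `box_overlap_and_iou` is a hypothetical helper.

```python
def box_overlap_and_iou(a, b):
    """Return (overlap_area, iou) for boxes given as (min_x, min_y, max_x, max_y)."""
    overlap_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    overlap_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    overlap = overlap_w * overlap_h
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - overlap
    return overlap, (overlap / union if union > 0 else 0.0)


overlap, iou = box_overlap_and_iou((0.0, 0.0, 2.0, 2.0), (1.0, 1.0, 3.0, 3.0))
```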

Convex-hull area

import numpy as np

# Area of the convex hull of a standalone point set (no frame needed).
area = SpatialFrame.convex_hull_area(np.array([0.0, 1.0, 0.5]), np.array([0.0, 0.0, 1.0]))
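
To make the quantity concrete, the following pure-Python sketch computes a convex-hull area with Andrew's monotone chain followed by the shoelace formula. It is an independent reference, not the library's Rust implementation, and every function name in it is hypothetical.

```python
def _cross(o, a, b):
    """Z-component of (a - o) x (b - o); positive for a left turn."""
    return (a[0] - o[0]) * (b[1] - o[1]) - (a[1] - o[1]) * (b[0] - o[0])


def convex_hull(points):
    """Andrew's monotone chain; returns hull vertices counter-clockwise."""
    pts = sorted(set(points))
    if len(pts) <= 2:
        return pts
    lower, upper = [], []
    for p in pts:
        while len(lower) >= 2 and _cross(lower[-2], lower[-1], p) <= 0:
            lower.pop()
        lower.append(p)
    for p in reversed(pts):
        while len(upper) >= 2 and _cross(upper[-2], upper[-1], p) <= 0:
            upper.pop()
        upper.append(p)
    return lower[:-1] + upper[:-1]


def ring_area(ring):
    """Shoelace formula for a simple polygon."""
    twice_area = 0.0
    for i in range(len(ring)):
        x1, y1 = ring[i]
        x2, y2 = ring[(i + 1) % len(ring)]
        twice_area += x1 * y2 - x2 * y1
    return abs(twice_area) / 2.0


def hull_area(xs, ys):
    return ring_area(convex_hull(list(zip(xs, ys))))


# Same three points as the call above: a triangle with base 1 and height 1.
triangle_area = hull_area([0.0, 1.0, 0.5], [0.0, 0.0, 1.0])
```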

Index mode

# index_mode is fixed per frame.
#   "auto" (default): the cost model chooses index vs scan per query; an index
#                     is built when justified and reused for free afterwards.
#   "eager": always builds the index.
#   "none":  always scans.
sf = SpatialFrame(df, x_col="lon", y_col="lat", index_mode="auto")

Branching from a shared base

from pycanopy import SpatialFrame, SpatialLazyFrame

# Expensive filter applied once; two queries branch from the result.
base = sf.lazy().filter(pl.col("population") > 100_000).range_query(-10.0, 35.0, 40.0, 70.0)

major = base.filter(pl.col("population") > 1_000_000)
minor = base.filter(pl.col("population") <= 1_000_000)

# collect_all detects the shared prefix, caches it in Polars,
# and executes both branches in a single pass.
results = SpatialLazyFrame.collect_all([major, minor])
df_major, df_minor = results

Live updates via delta buffer

# Append new points -- visible to queries immediately, no index rebuild yet.
import numpy as np
sf.engine.append_delta(np.array([2.5]), np.array([48.9]))

# Queries probe the main index and scan the delta in parallel.
result = sf.lazy().range_query(-10.0, 35.0, 40.0, 70.0).collect()

# The buffer flushes automatically when accumulated query cost exceeds
# the estimated index rebuild cost, or when it exceeds 10% of N.
# Force a flush manually if needed.
sf.engine.flush()
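
The two flush triggers described in the comments can be sketched as a small policy object. This is an illustration under stated assumptions: the `DeltaBuffer` class, its fields, and the choice to charge each query one unit of cost per buffered row are all hypothetical. Only the two conditions (accumulated query cost above the rebuild estimate, or buffer larger than 10% of N) come from the description above.

```python
from dataclasses import dataclass


@dataclass
class DeltaBuffer:
    n_indexed: int                  # N: rows covered by the main index
    rebuild_cost: float             # estimated cost of rebuilding the index
    scan_cost_per_row: float = 1.0  # hypothetical cost unit
    n_delta: int = 0                # rows waiting in the buffer
    accumulated_cost: float = 0.0   # scan cost paid since the last flush

    def append(self, count: int) -> None:
        self.n_delta += count

    def record_query(self) -> None:
        # Every query scans the whole buffer next to the index probe.
        self.accumulated_cost += self.n_delta * self.scan_cost_per_row

    def should_flush(self) -> bool:
        return (
            self.accumulated_cost > self.rebuild_cost
            or self.n_delta > 0.10 * self.n_indexed
        )

    def flush(self) -> None:
        self.n_indexed += self.n_delta
        self.n_delta = 0
        self.accumulated_cost = 0.0


buf = DeltaBuffer(n_indexed=1_000, rebuild_cost=500.0)
buf.append(50)                       # 5% of N: below the size trigger
for _ in range(10):
    buf.record_query()
flush_after_10 = buf.should_flush()  # accumulated cost equals the estimate
buf.record_query()
flush_after_11 = buf.should_flush()  # accumulated cost now exceeds it
```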

Benchmarks

Apache SpatialBench

Run on a single m7i.2xlarge (8 vCPU, 32 GB), the same hardware used by Apache SpatialBench. PyCanopy is measured live with index_mode="auto".

SF1 (~6M trips). PyCanopy wins 7/12 testcases.

[Figure: PyCanopy vs SedonaDB, DuckDB, and GeoPandas on Apache SpatialBench SF1. Lower is better; linear axis; bars past the cap are truncated and labelled with their value; TIMEOUT / ERROR annotated.]

SF10 (~60M trips). PyCanopy wins 5/12 testcases.

[Figure: PyCanopy vs SedonaDB, DuckDB, and GeoPandas on Apache SpatialBench SF10. Lower is better; linear axis; bars past the cap are truncated and labelled with their value; TIMEOUT / ERROR annotated.]

All times in seconds. Bold = fastest on that query. SedonaDB, DuckDB, and GeoPandas baselines from published SpatialBench results.

SF1

Query  PyCanopy  SedonaDB  DuckDB   GeoPandas
q1     1.41      0.66      0.96     12.78
q2     3.94      8.07      9.95     20.74
q3     1.22      0.80      1.17     13.59
q4     10.88     8.41      9.83     25.24
q5     1.77      5.10      1.80     47.08
q6     5.57      8.59      9.36     24.43
q7     2.22      1.66      1.82     137.00
q8     1.06      1.10      1.08     16.08
q9     0.23      0.23      50.15    0.28
q10    11.62     18.79     207.84   46.13
q11    12.43     32.98     TIMEOUT  51.01
q12    14.00     14.55     ERROR    TIMEOUT

SF10

Query  PyCanopy  SedonaDB  DuckDB   GeoPandas
q1     8.59      3.04      4.58     ERROR
q2     8.95      8.89      8.26     ERROR
q3     7.12      4.09      5.17     TIMEOUT
q4     21.34     7.52      8.51     ERROR
q5     15.22     50.81     14.40    ERROR
q6     11.19     9.11      10.67    ERROR
q7     22.73     14.44     14.03    ERROR
q8     7.03      7.24      7.57     TIMEOUT
q9     0.34      0.38      942.98   0.49
q10    28.41     42.02     ERROR    ERROR
q11    37.30     97.52     ERROR    ERROR
q12    147.67    145.66    ERROR    TIMEOUT

How It Works

The engine separates query planning from execution: the SpatialOptimizer rewrites the declared chain, the SpatialExecutor runs the resulting plan, and the result is returned as a Polars DataFrame.

Query flow

flowchart LR
    A[User chain] --> B[SpatialOptimizer] --> C[SpatialExecutor] --> F[pl.DataFrame]

Logical planning

  • Predicate pushdown: scalar filters run first, reducing rows before any spatial work.
  • Fusion: consecutive range/contains predicates merge into a single operation.
  • Join side: the index is built on whichever side makes the join cheapest.
  • Projection pushdown: a terminal .select() narrows both join sides before the gather.
  • IO path: low-selectivity queries return results as a direct slice, bypassing the Polars expression pipeline.
  • EXPR path: runs the spatial engine as a Polars map_batches expression over the query set.

Cost model

index_mode determines how we use the cost model:

Mode Behaviour
auto (default) build index when cost model allows it
eager always build the selected index type, skip the cost check
none always scan

When index_mode="auto", the planner picks the minimum-cost option ($Q$ queries, $N$ items):

$$ \text{winner} = \arg\min \begin{cases} \text{Cost}_{\text{probe}}(\text{built index}) & \text{build already paid} \\ \text{Cost}_{\text{build}} + \text{Cost}_{\text{probe}}(\text{best new index}) \\ \text{Cost}_{\text{probe}}(\text{brute force}) \end{cases} $$


Selectivity (fraction of the dataset expected to match):

$$ \text{sel} = \begin{cases} \text{hist}(\text{bbox}) / N & \text{range ($32 \times 32$ density histogram)} \\ k / N & \text{kNN} \\ 1 / N & \text{contains} \end{cases} $$
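
For the range case, a density histogram turns a bounding box into an expected match count without touching the data. The sketch below is one possible reading of $\text{hist}(\text{bbox})$: it weights each cell's count by the fraction of the cell that the query box covers. That weighting, the function names, and the requirement that all points lie inside `bounds` are assumptions and are not documented behaviour.

```python
def build_histogram(xs, ys, bounds, bins=32):
    """Count points per cell of a bins x bins grid over bounds."""
    min_x, min_y, max_x, max_y = bounds
    hist = [[0] * bins for _ in range(bins)]
    for x, y in zip(xs, ys):
        col = min(int((x - min_x) / (max_x - min_x) * bins), bins - 1)
        row = min(int((y - min_y) / (max_y - min_y) * bins), bins - 1)
        hist[row][col] += 1
    return hist


def estimate_selectivity(hist, bounds, query, n):
    """Expected fraction of the n rows that fall inside the query box."""
    min_x, min_y, max_x, max_y = bounds
    q_min_x, q_min_y, q_max_x, q_max_y = query
    bins = len(hist)
    cell_w = (max_x - min_x) / bins
    cell_h = (max_y - min_y) / bins
    expected = 0.0
    for row in range(bins):
        y0 = min_y + row * cell_h
        overlap_y = max(0.0, min(y0 + cell_h, q_max_y) - max(y0, q_min_y))
        if overlap_y == 0.0:
            continue
        for col in range(bins):
            x0 = min_x + col * cell_w
            overlap_x = max(0.0, min(x0 + cell_w, q_max_x) - max(x0, q_min_x))
            covered = (overlap_x * overlap_y) / (cell_w * cell_h)
            expected += hist[row][col] * covered
    return expected / n


# A regular 64 x 64 lattice of points on the unit square.
xs = [(i + 0.5) / 64 for i in range(64) for _ in range(64)]
ys = [(j + 0.5) / 64 for _ in range(64) for j in range(64)]
bounds = (0.0, 0.0, 1.0, 1.0)
hist = build_histogram(xs, ys, bounds)
sel = estimate_selectivity(hist, bounds, (0.0, 0.0, 0.5, 0.5), n=len(xs))
```

On this evenly spread data, a query covering one quarter of the extent is estimated to match one quarter of the rows.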


Probe cost ($Q$ warm queries against a built index):

$$ \text{Cost}_{\text{probe}} = Q \times \begin{cases} N \cdot c_{\text{scan}} & \text{brute force} \\ (\log_2 N + \text{sel} \cdot N) \cdot c_{\text{tree}} & \text{KD-tree or R-tree} \\ \text{sel} \cdot N \cdot c_{\text{grid}} & \text{grid} \end{cases} $$


Build cost (paid once):

$$ \text{Cost}_{\text{build}} = \begin{cases} 0 & \text{brute force} \\ N \cdot c_{\text{build}} & \text{grid} \\ N \log_2 N \cdot c_{\text{build}} & \text{KD-tree or R-tree} \end{cases} $$

The empirical constants ($c_{\text{scan}}$, $c_{\text{tree}}$, $c_{\text{grid}}$, $c_{\text{build}}$) are calibrated from benchmark runs in bench/ops.
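
The formulas above translate directly into code. In the sketch below the four constants are placeholder values chosen for illustration, not the calibrated ones, and `probe_cost`, `build_cost`, and `choose_plan` are hypothetical names. It shows how the same query can resolve to a scan, a fresh build, or a reuse depending on $Q$ and on whether an index already exists.

```python
import math

# Placeholder constants (assumed values, not the calibrated ones).
C_SCAN, C_TREE, C_GRID, C_BUILD = 1.0, 4.0, 2.0, 3.0

TREES = ("kd-tree", "r-tree")


def probe_cost(index, q, n, sel):
    """Cost of q queries against an already-built index (or a scan)."""
    if index == "brute force":
        per_query = n * C_SCAN
    elif index in TREES:
        per_query = (math.log2(n) + sel * n) * C_TREE
    elif index == "grid":
        per_query = sel * n * C_GRID
    else:
        raise ValueError(f"unknown index type: {index}")
    return q * per_query


def build_cost(index, n):
    """One-off cost of building the index."""
    if index == "brute force":
        return 0.0
    if index == "grid":
        return n * C_BUILD
    if index in TREES:
        return n * math.log2(n) * C_BUILD
    raise ValueError(f"unknown index type: {index}")


def choose_plan(q, n, sel, candidate, built=None):
    """Pick the cheapest of: scan, build a new index, reuse a built one."""
    options = {
        "scan": probe_cost("brute force", q, n, sel),
        "build " + candidate: build_cost(candidate, n)
        + probe_cost(candidate, q, n, sel),
    }
    if built is not None:
        options["reuse " + built] = probe_cost(built, q, n, sel)
    return min(options, key=options.get)


n, sel = 1_000_000, 0.001
one_query = choose_plan(q=1, n=n, sel=sel, candidate="kd-tree")
many_queries = choose_plan(q=100, n=n, sel=sel, candidate="kd-tree")
warm_query = choose_plan(q=1, n=n, sel=sel, candidate="kd-tree", built="kd-tree")
```

With these placeholder constants a single selective query is cheaper as a scan, a batch of 100 amortises the build, and once the index exists even one query prefers it.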

Index selection

select_index is a rule-based pre-filter that picks a candidate index type:

flowchart LR
    A[Query arrives] --> B{N < 500\nor sel > 50%?}
    B -- yes --> BF[Brute force]
    B -- no --> C{kNN and\nk/N > 10%?}
    C -- yes --> BF
    C -- no --> D{Polygon\ndataset?}
    D -- yes --> RT[R-tree]
    D -- no --> E{Query type}
    E -- kNN / contains --> KD[KD-tree]
    E -- range --> F{Uniform?}
    F -- yes --> GR[Grid]
    F -- no --> KD

All index types share the same coordinate arrays with no duplication.
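
The flowchart above reads as a short chain of rules. The sketch below transcribes it into a plain function. The signature is hypothetical, and because the way uniformity is detected is not specified here, it is passed in as a boolean.

```python
def select_index(n, sel, query_type, k=0, polygon=False, uniform=False):
    """Rule-based pre-filter mirroring the flowchart.

    query_type is one of "range", "knn", "contains".
    """
    if n < 500 or sel > 0.5:
        return "brute force"
    if query_type == "knn" and k / n > 0.10:
        return "brute force"
    if polygon:
        return "r-tree"
    if query_type in ("knn", "contains"):
        return "kd-tree"
    # Range query on a point dataset.
    return "grid" if uniform else "kd-tree"


small = select_index(n=200, sel=0.01, query_type="range")
wide_knn = select_index(n=10_000, sel=0.2, query_type="knn", k=2_000)
polygons = select_index(n=10_000, sel=0.01, query_type="contains", polygon=True)
uniform_range = select_index(n=10_000, sel=0.01, query_type="range", uniform=True)
skewed_range = select_index(n=10_000, sel=0.01, query_type="range", uniform=False)
```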

Why Rust

The hot paths need packed immutable index structures, zero-copy array slices at the Python boundary, and loop-level parallelism. C++ would require a separate FFI layer and would lose the native Polars plugin integration that PyO3/Maturin provides for free.


Accepted input formats

Format Example
numpy (N, 2) array np.array([[x, y], ...])
GeoArrow PyArrow array pa.StructArray or FixedSizeList<2>
geopandas GeoSeries gdf.geometry
shapely Points / Polygons / MultiPolygons [Point(x, y), ...]
list of (x, y) tuples [(x, y), ...]
Separate coordinate sequences Engine.from_coords(xs, ys)
WKB point column (Binary) SpatialFrame.from_wkb_points(df, "geom")
WKB polygon column (Binary) SpatialFrame.from_wkb_polygons(df, "geom")

Acknowledgements

Some works that inspired this project:

  • Polars: a columnar DataFrame engine that PyCanopy builds on
  • geo-index: provides the packed, immutable, zero-copy KD-tree and R-tree structures that PyCanopy uses
  • Spatial Polars: an earlier effort to bring spatial functionality to Polars
  • Apache Sedona: state-of-the-art spatial SQL engine + benchmark for evals

License

MIT

Download files

Source Distribution

pycanopy-0.3.1.tar.gz (408.9 kB view details)

Uploaded Source

Built Distributions

pycanopy-0.3.1-cp310-abi3-win_amd64.whl (723.1 kB view details)

Uploaded CPython 3.10+Windows x86-64

pycanopy-0.3.1-cp310-abi3-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (847.5 kB view details)

Uploaded CPython 3.10+manylinux: glibc 2.17+ x86-64

pycanopy-0.3.1-cp310-abi3-manylinux_2_17_aarch64.manylinux2014_aarch64.whl (801.3 kB view details)

Uploaded CPython 3.10+manylinux: glibc 2.17+ ARM64

pycanopy-0.3.1-cp310-abi3-macosx_11_0_arm64.whl (754.0 kB view details)

Uploaded CPython 3.10+macOS 11.0+ ARM64

pycanopy-0.3.1-cp310-abi3-macosx_10_12_x86_64.whl (805.6 kB view details)

Uploaded CPython 3.10+macOS 10.12+ x86-64
