Declarative spatial query layer for Polars

PyCanopy

A spatial query layer for Polars. Rust core, Python API.


Note: warm (cached-index) queries run up to 155x faster than a naive scan on range queries, up to 1,949x on kNN, up to 1,521x on polygon contains, and up to 8,522x on within joins. See Benchmarks for the full results.


Installation

pip install pycanopy

Pre-built wheels for Linux, macOS, and Windows. No Rust toolchain required.

import polars as pl
from pycanopy import SpatialFrame

sf = SpatialFrame(pl.read_parquet("cities.parquet"), x_col="lon", y_col="lat")
result = sf.lazy().filter(pl.col("population") > 100_000).range_query(-10.0, 35.0, 40.0, 70.0).collect()

Why PyCanopy

Polars has no native spatial query support. Getting bounding-box filters, k-nearest neighbours, or point-in-polygon tests typically means converting to GeoPandas, managing an index manually, or scanning every row in Python. GeoPandas applies linear scans by default for containment and range tests; its STRtree requires explicit opt-in via .sindex and is the only available index type regardless of data distribution. KNN has no built-in path at all.

PyCanopy adds a declarative lazy query layer directly on Polars DataFrames. You describe the operations you want, and PyCanopy decides which index to build, in what order to run each operation, and what to hand off to Polars to execute. It is designed for in-memory workloads at the moment.

PyCanopy GeoPandas Manual STRtree
Works natively in Polars
Lazy / declarative API
Auto index selection ✗ (STR only)
KNN join built-in
Delta buffer (live append)
  • Lazy planning -- declare ops, the optimizer reorders and fuses them before any index is built
  • Auto index selection -- KD-tree, R-tree, uniform grid, or brute force chosen per query
  • Native Polars -- results are pl.DataFrame, no round-trip through GeoPandas
  • Rust hot paths -- zero-copy at the Python boundary, loop-level parallelism via Rayon
  • Delta buffer -- append new points and query immediately without rebuilding the index

Usage

Point dataset: range and KNN

import polars as pl
from pycanopy import SpatialFrame

df = pl.read_parquet("cities.parquet")
sf = SpatialFrame(df, x_col="lon", y_col="lat")

# Bounding-box filter combined with a scalar predicate.
# Optimizer places the scalar filter first, then runs the range query
# on the reduced row set.
result = (
    sf.lazy()
    .filter(pl.col("population") > 100_000)
    .range_query(min_x=-10.0, min_y=35.0, max_x=40.0, max_y=70.0)
    .collect()
)

# k-nearest neighbours
nearest = sf.lazy().knn(x=2.35, y=48.85, k=5).collect()

More examples: KNN join, polygon contains, within-distance join, branching, delta buffer

Chaining multiple spatial predicates

# Two range predicates are fused into a single index build on large datasets.
result = (
    sf.lazy()
    .range_query(0.0, 0.0, 50.0, 50.0)
    .range_query(10.0, 10.0, 40.0, 40.0)
    .collect()
)
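
Two chained range predicates keep only the rows that fall inside both boxes, so logically the pair is equivalent to a single query on the intersection of the boxes. The sketch below shows that equivalence in plain Python. `fuse_boxes` is a hypothetical helper written for illustration; the README does not say that PyCanopy fuses predicates this way internally.

```python
from typing import Optional, Tuple

Box = Tuple[float, float, float, float]  # (min_x, min_y, max_x, max_y)


def fuse_boxes(a: Box, b: Box) -> Optional[Box]:
    """Intersect two axis-aligned boxes; return None if they do not overlap."""
    min_x, min_y = max(a[0], b[0]), max(a[1], b[1])
    max_x, max_y = min(a[2], b[2]), min(a[3], b[3])
    if min_x > max_x or min_y > max_y:
        return None  # empty intersection: the chained query matches nothing
    return (min_x, min_y, max_x, max_y)


# The two boxes from the example above collapse to the inner one.
fused = fuse_boxes((0.0, 0.0, 50.0, 50.0), (10.0, 10.0, 40.0, 40.0))
print(fused)  # (10.0, 10.0, 40.0, 40.0)
```

An empty intersection is worth detecting early, because the chained query can then return no rows without touching any index.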

Inspecting the optimizer plan

# Declare ops in any order — explain() shows what the optimizer will actually run.
lf = (
    sf.lazy()
    .range_query(min_x=-10.0, min_y=35.0, max_x=40.0, max_y=70.0)
    .filter(pl.col("population") > 100_000)
)

print(lf.explain())
# RANGE_QUERY [(-10, 35) → (40, 70)]
# FROM
#   FILTER [(col("population")) > (dyn int: 100000)]
#   FROM
#     DF [N=100,000; path: EXPR]

print(lf.explain(optimized=False))
# FILTER [(col("population")) > (dyn int: 100000)]
# FROM
#   RANGE_QUERY [(-10, 35) → (40, 70)]
#   FROM
#     DF [N=100,000]

The output follows Polars' FROM-chain convention: the bottom node runs first and the top node is the outermost result. In the optimized plan, FILTER appears below RANGE_QUERY: the scalar filter runs first on the raw data, and RANGE_QUERY receives the already-filtered subset. explain(optimized=False) shows the declaration order for comparison.
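
The reordering shown in the two plans can be pictured as a stable partition of the declared ops: scalar filters move to the front and spatial ops follow. This is a minimal sketch under that assumption. The `Op` class and `reorder` function are hypothetical names, not PyCanopy's API, and the sketch keeps scalar filters in declared order rather than sorting them by estimated cost.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Op:
    kind: str   # "scalar" or "spatial"
    label: str  # human-readable description of the op


def reorder(ops):
    """Stable partition: scalar ops first, then spatial ops, each in declared order."""
    scalars = [op for op in ops if op.kind == "scalar"]
    spatials = [op for op in ops if op.kind == "spatial"]
    return scalars + spatials


# Declared order: spatial op first, scalar filter second.
declared = [
    Op("spatial", "RANGE_QUERY"),
    Op("scalar", "FILTER population > 100000"),
]

# Execution order after reordering: the first element runs first.
optimized = reorder(declared)
print([op.label for op in optimized])
# ['FILTER population > 100000', 'RANGE_QUERY']
```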

KNN join

query_df = pl.DataFrame({"qx": [2.35, 13.4], "qy": [48.85, 52.5]})

# For each row in query_df, find the 3 nearest rows in sf.
result = sf.lazy().knn_join(query_df, x_col="qx", y_col="qy", k=3).collect()
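
For intuition about what the join computes, here is a brute-force baseline in plain Python: for every query point it scans all candidate points and keeps the k closest. It is a sketch of the semantics only, assuming Euclidean distance; `knn_join_naive` is a hypothetical name, and this is not how PyCanopy evaluates the join.

```python
import heapq


def knn_join_naive(points, queries, k):
    """For each query, return the indices of the k nearest points (squared Euclidean distance)."""
    result = []
    for qx, qy in queries:
        nearest = heapq.nsmallest(
            k,
            range(len(points)),
            key=lambda i: (points[i][0] - qx) ** 2 + (points[i][1] - qy) ** 2,
        )
        result.append(nearest)
    return result


points = [(0.0, 0.0), (1.0, 0.0), (5.0, 5.0), (2.0, 2.0)]
queries = [(0.9, 0.1), (4.0, 4.0)]

pairs = knn_join_naive(points, queries, k=2)
print(pairs)  # [[1, 0], [2, 3]]
```

The cost grows with the product of the two input sizes, which is the work an index is meant to avoid.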

Polygon dataset: contains and range

import polars as pl
from shapely.geometry import box
from pycanopy import SpatialFrame

polygons = [box(i, 0, i + 0.9, 0.9) for i in range(100_000)]
df = pl.DataFrame({"id": list(range(100_000)), "geom": polygons})
sf = SpatialFrame.from_polygons(df, geometry_col="geom")

# Which polygons contain this point?
containing = sf.lazy().contains(x=5.5, y=0.5).collect()

# Which polygon MBRs intersect this bbox?
intersecting = sf.lazy().range_query(0.0, 0.0, 10.0, 1.0).collect()

Polygon holes

from shapely.geometry import Polygon

# Interior rings (holes) are fully supported.
outer = [(0, 0), (10, 0), (10, 10), (0, 10)]
hole  = [(2, 2), (8, 2),  (8, 8),  (2, 8)]
donut = Polygon(outer, [hole])

sf = SpatialFrame.from_polygons(pl.DataFrame({"id": [0], "geom": [donut]}), geometry_col="geom")

# Point inside the hole is NOT contained.
sf.lazy().contains(x=5.0, y=5.0).collect()   # empty

# Point outside the hole but inside the outer ring IS contained.
sf.lazy().contains(x=1.0, y=1.0).collect()   # returns the polygon row
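
The hole semantics above can be reproduced with a standard ray-casting test: a point is contained when it lies inside the outer ring and inside none of the interior rings. The sketch below is plain Python for illustration, with hypothetical helper names `in_ring` and `contains`; it does not handle points that sit exactly on a boundary, and it is not PyCanopy's implementation.

```python
def in_ring(x, y, ring):
    """Ray casting: count crossings of a ray going right from (x, y)."""
    inside = False
    n = len(ring)
    for i in range(n):
        x1, y1 = ring[i]
        x2, y2 = ring[(i + 1) % n]
        # Only edges that straddle the horizontal line through y can be crossed.
        if (y1 > y) != (y2 > y):
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x < x_cross:
                inside = not inside
    return inside


def contains(x, y, outer, holes=()):
    """Inside the outer ring and outside every hole."""
    return in_ring(x, y, outer) and not any(in_ring(x, y, h) for h in holes)


outer = [(0, 0), (10, 0), (10, 10), (0, 10)]
hole = [(2, 2), (8, 2), (8, 8), (2, 8)]

print(contains(5.0, 5.0, outer, [hole]))  # False: the point is inside the hole
print(contains(1.0, 1.0, outer, [hole]))  # True: inside the outer ring, outside the hole
```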

Within join

# For each query point, find which polygons in sf contain it.
# Here sf is the box-polygon SpatialFrame from the "Polygon dataset" example.
query_df = pl.DataFrame({"qx": [5.5, 12.3], "qy": [0.5, 0.5]})
result = sf.lazy().within_join(query_df, x_col="qx", y_col="qy").collect()

Within-distance join

# For each query point, find all sf points within 50 km.
# Here sf is the point SpatialFrame built from cities.parquet.
query_df = pl.DataFrame({"qx": [2.35, 13.4], "qy": [48.85, 52.5]})
result = sf.lazy().within_distance_join(query_df, x_col="qx", y_col="qy", distance=50.0).collect()
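
As a reference for the result shape, the sketch below computes the same kind of join by brute force. The example above states the radius in kilometres but the README does not say how PyCanopy measures distance, so the great-circle (haversine) formula used here is an assumption. `haversine_km` and `within_distance_join_naive` are hypothetical names, and the sample coordinates are illustrative.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean Earth radius


def haversine_km(lon1, lat1, lon2, lat2):
    """Great-circle distance in kilometres between two lon/lat points."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


def within_distance_join_naive(points, queries, max_km):
    """Return (query_index, point_index) pairs that lie within max_km of each other."""
    return [
        (qi, pi)
        for qi, (qx, qy) in enumerate(queries)
        for pi, (px, py) in enumerate(points)
        if haversine_km(qx, qy, px, py) <= max_km
    ]


points = [(2.29, 48.86), (13.38, 52.52), (-0.13, 51.51)]  # illustrative (lon, lat) pairs
queries = [(2.35, 48.85), (13.4, 52.5)]

pairs = within_distance_join_naive(points, queries, 50.0)
print(pairs)  # [(0, 0), (1, 1)]
```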

Branching from a shared base

from pycanopy import SpatialFrame, SpatialLazyFrame

# Expensive filter applied once; two queries branch from the result.
base = sf.lazy().filter(pl.col("population") > 100_000).range_query(-10.0, 35.0, 40.0, 70.0)

major = base.filter(pl.col("population") > 1_000_000)
minor = base.filter(pl.col("population") <= 1_000_000)

# collect_all detects the shared prefix, caches it in Polars,
# and executes both branches in a single pass.
results = SpatialLazyFrame.collect_all([major, minor])
df_major, df_minor = results

Live updates via delta buffer

# Append new points -- visible to queries immediately, no index rebuild yet.
import numpy as np
sf.engine.append_delta(np.array([2.5]), np.array([48.9]))

# Queries probe the main index and scan the delta in parallel.
result = sf.lazy().range_query(-10.0, 35.0, 40.0, 70.0).collect()

# The buffer flushes automatically when accumulated query cost exceeds
# the estimated index rebuild cost, or when it exceeds 10% of N.
# Force a flush manually if needed.
sf.engine.flush()
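
The buffering idea can be modelled in a few lines: appended points go into a small unindexed buffer, queries check both the main set and the buffer, and the buffer is merged once it grows past a threshold. The class below is a hypothetical sketch, not PyCanopy's engine. It models only the 10% size rule, scans sequentially instead of in parallel, and uses a linear scan as a stand-in for the index probe.

```python
class DeltaBufferedPoints:
    """Main point set plus an unindexed append buffer, flushed at a size threshold."""

    def __init__(self, xs, ys, max_delta_fraction=0.10):
        self.xs, self.ys = list(xs), list(ys)
        self.delta_xs, self.delta_ys = [], []
        self.max_delta_fraction = max_delta_fraction
        self.rebuilds = 0  # counts simulated index rebuilds

    def append_delta(self, xs, ys):
        self.delta_xs.extend(xs)
        self.delta_ys.extend(ys)
        # Flush once the buffer exceeds the allowed fraction of the main set.
        if len(self.delta_xs) > self.max_delta_fraction * len(self.xs):
            self.flush()

    def flush(self):
        """Merge the buffer into the main set; a real engine would rebuild its index here."""
        self.xs.extend(self.delta_xs)
        self.ys.extend(self.delta_ys)
        self.delta_xs, self.delta_ys = [], []
        self.rebuilds += 1

    def range_query(self, min_x, min_y, max_x, max_y):
        def hits(xs, ys):
            return [(x, y) for x, y in zip(xs, ys)
                    if min_x <= x <= max_x and min_y <= y <= max_y]
        # Main set first (stand-in for an index probe), then the buffer scan.
        return hits(self.xs, self.ys) + hits(self.delta_xs, self.delta_ys)


store = DeltaBufferedPoints(xs=[float(i) for i in range(20)], ys=[0.0] * 20)

store.append_delta([2.5], [0.0])  # 1 buffered point, not more than 10% of 20: no flush
found = store.range_query(2.0, -1.0, 3.0, 1.0)
print(found)  # [(2.0, 0.0), (3.0, 0.0), (2.5, 0.0)]

store.append_delta([4.5, 5.5], [0.0, 0.0])  # 3 buffered points, more than 10% of 20: flush
print(len(store.xs), store.rebuilds)  # 23 1
```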

Benchmarks

Benchmarks were run on an Apple M-series machine. Warm = query time on the second call, with the index cached. Index build = one-time cost, amortised across queries. All datasets use a uniform distribution.

Single-query operations

| Operation | N | Index build | Warm | Naive | Speedup | Index memory |
|---|---|---|---|---|---|---|
| Range query (points) | 100,000 | 1.3 ms | 29 µs | 4.4 ms | 155x | 783 KB |
| kNN k=10 | 100,000 | 9.3 ms | 3 µs | 5.4 ms | 1,949x | 1.9 MB |
| Polygon contains | 100,000 | 6.2 ms | 5 µs | 7.0 ms | 1,521x | 3.7 MB |
| Polygon range | 100,000 | 5.6 ms | 8 µs | 3.3 ms | 391x | 3.7 MB |
| kNN join k=5 | 10,000 | 7.3 ms | 2.1 ms | 5.4 s | 2,601x | 180 KB |
| Within-distance join | 10,000 | 0.5 ms | 12.6 ms | 1.3 s | 102x | |
| Within join (polygons) | 10,000 | 1.6 ms | 0.52 ms | 4.4 s | 8,522x | 354 KB |
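
Because the index build is a one-time cost, a useful way to read the single-query rows is to ask how many queries it takes before building the index pays off. The sketch below works that out from the rounded figures in the benchmark table for the four single-query operations. It treats those figures as exact, so the results are indicative only.

```python
import math

# (operation, index build ms, warm query ms, naive query ms), copied from the benchmark table
rows = [
    ("Range query (points)", 1.3, 0.029, 4.4),
    ("kNN k=10", 9.3, 0.003, 5.4),
    ("Polygon contains", 6.2, 0.005, 7.0),
    ("Polygon range", 5.6, 0.008, 3.3),
]


def break_even_queries(build_ms, warm_ms, naive_ms):
    """Smallest query count q such that build + q * warm <= q * naive."""
    return math.ceil(build_ms / (naive_ms - warm_ms))


break_even = {name: break_even_queries(b, w, n) for name, b, w, n in rows}
for name, q in break_even.items():
    print(f"{name}: index pays off after {q} query(ies)")
```

On these figures the index pays for itself within one or two queries for every single-query operation listed.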

Chained lazy queries (N = 100,000, uniform)

The optimizer moves scalar predicates ahead of spatial ops regardless of declared order, and fuses consecutive wide spatial predicates into one index pass.

| Chain | Optimizer action | Index build | Warm | GeoPandas | Speedup |
|---|---|---|---|---|---|
| circ_scalar → range³ | scalar first | 2.5 ms | 0.19 ms | 9.2 ms | 50x |
| range² → 3× scalar (spatial declared first) | scalars first | 1.0 ms | 0.23 ms | 6.0 ms | 26x |
| range⁴ at 10% selectivity | fused | 1.0 ms | 0.92 ms | 13 ms | 14x |
| wide_scalar (95%) → tight_range (1%) | scalar first | 4.1 ms | 0.30 ms | 3.1 ms | 11x |
| circ_scalar + diag_scalar → kNN k=50 | scalar first | 15 ms | 1.25 ms | 3.6 ms | 3x |

How It Works

Query flow

  sf.lazy().filter(...).range_query(...).knn_join(...).collect()
                            |
               +------------+------------+
               |     SpatialOptimizer    |
               |  * reorder ops by cost  |
               |  * fuse spatial preds   |
               |  * select index type    |
               |  * spatial join order   |
               +------------+------------+
                            |
               +------------+------------+
               |      Polars executes    |
               |  scalar filters first   |
               |  then spatial queries   |
               +------------+------------+
                            |
                      pl.DataFrame

Optimizer decisions

  • Predicate pushdown: scalar predicates are placed before spatial ones and sorted cheapest-first using AST cost estimation. They cost nothing extra and shrink the row count before any index is touched.
  • Fusion: consecutive spatial predicates on large datasets are merged into a single index build and one pass over the data.
  • Index type: selected per query based on geometry type, data distribution, and selectivity.
  • Join order: for symmetric joins (within_join, within_distance_join), the optimizer indexes the smaller side when it is less than half the size of the other. knn_join is asymmetric and always indexes the engine side.
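
The join-order rule in the last bullet can be written as a small decision function. This is a sketch of the rule as described, with a hypothetical function name. The bullet does not say which side is indexed when neither side is less than half the other, so the engine side is assumed as the default here.

```python
SYMMETRIC_JOINS = {"within_join", "within_distance_join"}


def choose_indexed_side(join_kind: str, n_engine: int, n_query: int) -> str:
    """Return "engine" or "query": the side whose rows get indexed."""
    if join_kind == "knn_join":
        return "engine"  # asymmetric join: always index the engine side
    if join_kind not in SYMMETRIC_JOINS:
        raise ValueError(f"unknown join kind: {join_kind}")
    # Index the query side only when it is less than half the engine side.
    if n_query < n_engine / 2:
        return "query"
    return "engine"  # assumed default when the query side is not much smaller


print(choose_indexed_side("within_join", n_engine=100_000, n_query=10_000))  # query
print(choose_indexed_side("within_join", n_engine=100_000, n_query=80_000))  # engine
print(choose_indexed_side("knn_join", n_engine=100_000, n_query=10))         # engine
```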

Index management

Indexes are built lazily: no index is constructed at load time. Only summary statistics (extent, point distribution, and a 32x32 histogram) are computed eagerly, and they drive index selection at the first query. The selected index is then cached for all subsequent queries.

| Condition | Index |
|---|---|
| N < 500, selectivity > 50%, or k/N > 10% | Brute force |
| Point range, uniform distribution | Uniform grid |
| Point range, clustered distribution | KD-tree |
| Point KNN or contains | KD-tree |
| Polygons, any query | R-tree |
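
The selection rules in the table can be read as a decision function. The sketch below encodes them in plain Python with a hypothetical `choose_index` function. It assumes the brute-force conditions are checked before everything else, which the table suggests but does not state, and it takes the distribution label as an input rather than deriving it from the histogram.

```python
def choose_index(n, geometry, query, selectivity=None, k=None, distribution="uniform"):
    """Map dataset and query properties to an index name, following the table above."""
    # Brute-force conditions (assumed to take precedence over the other rows).
    if n < 500:
        return "brute force"
    if selectivity is not None and selectivity > 0.5:
        return "brute force"
    if k is not None and k / n > 0.10:
        return "brute force"
    if geometry == "polygon":
        return "R-tree"
    if query == "range":
        return "uniform grid" if distribution == "uniform" else "KD-tree"
    if query in ("knn", "contains"):
        return "KD-tree"
    raise ValueError(f"unsupported combination: {geometry!r}, {query!r}")


print(choose_index(100_000, "point", "range", selectivity=0.01))            # uniform grid
print(choose_index(100_000, "point", "range", distribution="clustered"))    # KD-tree
print(choose_index(100_000, "point", "knn", k=10))                          # KD-tree
print(choose_index(100_000, "polygon", "contains"))                         # R-tree
print(choose_index(300, "point", "range"))                                  # brute force
```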

All index types share the same underlying coordinate arrays with no duplication.
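
One way to get that property is for an index to store only row numbers and to read coordinates from the caller's arrays. The uniform grid below is a plain-Python sketch of the idea, not PyCanopy's Rust implementation; the class name and the cell size are illustrative choices. It buckets row indices by cell and checks each candidate against the shared coordinate lists.

```python
import random
from collections import defaultdict


class UniformGrid:
    """Buckets row indices by grid cell; coordinates stay in the caller's lists."""

    def __init__(self, xs, ys, cell_size):
        self.xs, self.ys = xs, ys  # references to the shared lists, not copies
        self.cell = cell_size
        self.cells = defaultdict(list)
        for i, (x, y) in enumerate(zip(xs, ys)):
            self.cells[self._key(x, y)].append(i)

    def _key(self, x, y):
        return (int(x // self.cell), int(y // self.cell))

    def range_query(self, min_x, min_y, max_x, max_y):
        """Return sorted row indices of points inside the closed box."""
        cx0, cy0 = self._key(min_x, min_y)
        cx1, cy1 = self._key(max_x, max_y)
        out = []
        for cx in range(cx0, cx1 + 1):
            for cy in range(cy0, cy1 + 1):
                # Only cells overlapping the box are visited; candidates are then checked exactly.
                for i in self.cells.get((cx, cy), ()):
                    if min_x <= self.xs[i] <= max_x and min_y <= self.ys[i] <= max_y:
                        out.append(i)
        return sorted(out)


rng = random.Random(0)
xs = [rng.uniform(0.0, 100.0) for _ in range(1000)]
ys = [rng.uniform(0.0, 100.0) for _ in range(1000)]

grid = UniformGrid(xs, ys, cell_size=10.0)
hits = grid.range_query(20.0, 30.0, 45.0, 55.0)

# Cross-check against a full scan of the same shared lists.
brute = [i for i in range(len(xs)) if 20.0 <= xs[i] <= 45.0 and 30.0 <= ys[i] <= 55.0]
print(hits == brute)  # True
```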

Why Rust

The hot paths need packed immutable index structures, zero-copy array slices at the Python boundary, and loop-level parallelism. C++ would require a separate FFI layer and would lose the native Polars plugin integration that PyO3/Maturin provides for free.


Accepted input formats

| Format | Example |
|---|---|
| numpy (N, 2) array | np.array([[x, y], ...]) |
| GeoArrow PyArrow array | pa.StructArray or FixedSizeList<2> |
| geopandas GeoSeries | gdf.geometry |
| list of shapely Points or Polygons | [Point(x, y), ...] |
| list of (x, y) tuples | [(x, y), ...] |
| Separate coordinate sequences | Engine.from_coords(xs, ys) |

License

MIT
