DRB N-D lazy chunk access add-on
drb-chunk
DRB N-D lazy chunk access add-on. It exposes N-dimensional, lazily-read data
chunks (raster tiles, windows, byte ranges, time-series blocks) as first-class
DRB objects, driven declaratively from a topic's RDF/Turtle descriptor and
materialised on demand through numpy or a dask-backed xarray.
The add-on reads nothing until you ask for bytes. Selection (select,
__getitem__) only narrows a manifest; the actual I/O happens when you call
get_impl(...) / to_xarray(). Reads are windowed or byte-ranged — the
add-on refuses to silently fall back to a full read.
pip install drb-chunk
Requires Python 3.11–3.13. Materialisation pulls in rasterio (windowed raster
reads), dask and xarray (lazy assembly); these are only needed at read time.
Why it exists
DRB drivers already turn a file or URL into a tree of DrbNodes. But a single
raster band or a large array is not naturally a "node" — it is an N-D grid you
want to slice cheaply, possibly remotely, without loading the whole thing.
drb-chunk adds that missing layer:
- Declarative: a topic says, in its cortex.ttl, what its chunks are (dims, dtype, tiling, where the bytes live). No code per product.
- Lazy: narrowing a chunk is pure metadata arithmetic over the tiling scheme. I/O is deferred to materialisation.
- Driver-reusing: chunks read through the source node's existing get_impl (e.g. rasterio.io.DatasetReader, io.BytesIO). No driver is modified to support chunking.
Core concepts
| Object | Role |
|---|---|
| ChunkAddon | Entry point (drb.addon = chunk). Reads a topic's descriptors and builds Chunks from a node. |
| ChunkDescriptor | The parsed drb:chunk declaration: name, source, dims, dtype, tiling scheme, optional reader/selection. |
| Chunk | A handle on one declared chunk of a node. Slice it, then materialise it. |
| ChunkArray | Pure metadata of the array: dims, shape, dtype, scheme. |
| TilingScheme / RegularGrid | How the array is cut into tiles; resolves a Selection into tile keys. |
| ChunkManifest | Lazy Mapping[key -> ChunkRef] over the scheme's keys. |
| ChunkRef | One tile's locator: key, source node, and a window or byte_range. |
| Selection | Serialisable, declarative constraint (isel, sel, window, band, range). |
| ReaderStrategy | Turns a ChunkRef into bytes/array: RasterWindowReader, ByteRangeReader. |
Data flow:
topic (cortex.ttl) node (DrbNode)
│ drb:chunk … │
▼ │
ChunkDescriptor ──ChunkAddon.apply(node)────────► Chunk
│ .select(Selection) (lazy: subsets the manifest)
▼
Chunk (narrowed)
│ .get_impl(np.ndarray) / .to_xarray()
▼
ReaderStrategy.read(ref) ──► numpy / xarray
Architecture
The add-on is organised in four layers that separate declaration,
geometry, laziness and I/O. Each layer ignores the details of the one
below it. The guiding rule, enforced in the code: nothing is read before
layer 4 — select() is pure key arithmetic; a ReaderStrategy.read() is the
first place that touches the source.
┌─────────────────────────────────────────────────────────────────────┐
│ LAYER 1 — Declaration / discovery │
│ core.py ChunkAddon (entry point drb.addon = "chunk") │
│ descriptor.py ChunkDescriptor + retrieve_chunks() ←── cortex.ttl │
│ "which chunks exist, where the bytes are, which dtype" │
├─────────────────────────────────────────────────────────────────────┤
│ LAYER 2 — Geometry (the only layer that knows the grid) │
│ tiling.py TilingScheme / RegularGrid │
│ selection.py Selection (isel/window/band/range/sel) + aggregator │
│ "how the array is cut; how a Selection becomes tile keys" │
├─────────────────────────────────────────────────────────────────────┤
│ LAYER 3 — Laziness (metadata only, zero bytes read) │
│ model.py ChunkArray, ChunkManifest, ChunkRef │
│ chunk.py Chunk (select / __getitem__ / tiles / locator) │
│ "a handle you narrow; a lazy key→ref manifest" │
├─────────────────────────────────────────────────────────────────────┤
│ LAYER 4 — I/O (actual materialisation) │
│ readers.py ReaderStrategy / RasterWindowReader / ByteRangeReader │
│ chunk.py get_impl(numpy) / to_xarray() + interop.to_kerchunk │
│ "turn a ChunkRef into bytes via the source driver's get_impl" │
└─────────────────────────────────────────────────────────────────────┘
Layer 1 — Declaration & discovery
ChunkAddon (core.py) is a singleton registered under the drb.addon entry
point; DRB loads it into AddonManager. It implements the Addon contract
(identifier, return_type, can_apply, apply). Its core job is the build
pipeline behind apply():
topic ──retrieve_chunks()──► {name: ChunkDescriptor}
│ for the requested chunk:
cd.source.extract(node) ───────────┤ where are the bytes?
_resolve_source() ─────────────────┤ "." → the node itself ; else resolver.create(url)
_infer_shape(source_node) ─────────┤ rasterio DatasetReader.height/width
▼
ChunkArray(dims, shape, dtype, scheme)
RegularGridManifest(array, source_node)
▼
Chunk(name, array, node, manifest, reader, topic_uri)
retrieve_chunks() (descriptor.py) inherits descriptors through
rdfs:subClassOf (recursing into parents) then overrides them with the
topic's own chunks, keyed by drb:chunkName — the same pattern as
MetadataAddon. It reads the public RDF graph exposed by the topic's ManagerDao.
A descriptor-level drb:selection is rejected at build time; apply selections explicitly via Chunk.select().
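The inherit-then-override behaviour described for retrieve_chunks() can be pictured with plain dictionaries. This is a minimal sketch, not the add-on's code: `TOPICS` and `collect_chunks` are hypothetical stand-ins for the RDF graph and the real function, and the override of "data" in the child topic is added here purely for illustration.

```python
# Hypothetical stand-in for the RDF graph: topic -> parents and own chunks.
TOPICS = {
    "raster-base": {
        "parents": [],
        "chunks": {"data": {"chunk_shape": (512, 512)}},
    },
    "my-image": {
        "parents": ["raster-base"],
        "chunks": {
            "b04": {"chunk_shape": (256, 256)},
            # Illustrative override: same chunk name as the parent's chunk.
            "data": {"chunk_shape": (1024, 1024)},
        },
    },
}


def collect_chunks(topic: str) -> dict:
    """Merge descriptors from parents first, then the topic's own chunks."""
    merged: dict = {}
    for parent in TOPICS[topic]["parents"]:
        # Recurse into the parents (the rdfs:subClassOf chain).
        merged.update(collect_chunks(parent))
    # The topic's own chunks win on a name clash.
    merged.update(TOPICS[topic]["chunks"])
    return merged


chunks = collect_chunks("my-image")
print(sorted(chunks))                 # ['b04', 'data']
print(chunks["data"]["chunk_shape"])  # (1024, 1024)
```

Because parents are merged before the topic's own entries, a child only has to reuse a chunk name to replace the inherited descriptor.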
Layer 2 — Geometry
This is "the only place grid geometry lives" (TilingScheme docstring). Two
responsibilities: enumerate tile keys (keys(array)) and resolve a selection
into keys + residual (resolve(selection, array) → ResolvedSelection).
RegularGrid is the only v1 implementation:
grid_shape = ceil(shape / chunk_shape) per dim
keys = itertools.product(*(range(n) for n in grid_shape))
resolve(WindowSelection(x,y,w,h)):
x → [x, x+w) y → [y, y+h)
per dim: first = start // chunk ; last = (stop-1) // chunk
keys = product(range(first, last+1) ...)
Selection (selection.py) is deliberately pure, serialisable data
(to_dict() / parse_selection()) — it does not know how to turn itself into
keys; that is the scheme's job. This separation is what makes a Chunk
serialisable (locator) and a selection replayable. SelectionAggregator composes
several per-dimension constraints.
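The RegularGrid arithmetic outlined above can be tried in isolation. The sketch below assumes dims ordered ("y", "x") and does no bounds checking; `grid_shape`, `all_keys` and `resolve_window` are illustrative helper names, not the add-on's API.

```python
import itertools


def grid_shape(shape, chunk_shape):
    """Number of tiles per dimension: ceil(shape / chunk_shape)."""
    return tuple((s + c - 1) // c for s, c in zip(shape, chunk_shape))


def all_keys(shape, chunk_shape):
    """Every tile key of the grid, in row-major order."""
    return itertools.product(*(range(n) for n in grid_shape(shape, chunk_shape)))


def resolve_window(x, y, w, h, chunk_shape):
    """Tile keys (row, col) touched by the pixel window [y, y+h) x [x, x+w)."""
    ranges = []
    # dims are ordered ("y", "x"), so the y interval comes first.
    for start, stop, chunk in ((y, y + h, chunk_shape[0]), (x, x + w, chunk_shape[1])):
        first = start // chunk
        last = (stop - 1) // chunk
        ranges.append(range(first, last + 1))
    return list(itertools.product(*ranges))


print(grid_shape((10980, 10980), (512, 512)))        # (22, 22)
print(resolve_window(512, 0, 512, 512, (512, 512)))  # [(0, 1)]
print(resolve_window(500, 0, 100, 10, (512, 512)))   # [(0, 0), (0, 1)]
```

A window aligned on tile boundaries resolves to one key, while a window that straddles a boundary resolves to every tile it touches.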
Layer 3 — Laziness
- ChunkArray: pure metadata (dims, shape, dtype, scheme). No bytes.
- ChunkManifest: a lazy Mapping[key → ChunkRef]. RegularGridManifest.ref(key) computes the window on the fly (key * chunk_shape); it stores nothing.
- ChunkRef: one tile's locator, made of key, source node, and either a window (format-native) or a byte_range. This duality decides which reader applies.
- Chunk.select(): laziness in action. It resolves the selection into keys, then manifest.subset(resolved) returns a _SubsetManifest keeping only those keys, wrapped in a new immutable Chunk. No read happens.
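To make the "stores nothing" idea concrete, here is a minimal sketch of a lazy key-to-window mapping built on collections.abc.Mapping. `LazyGridManifest` is a hypothetical class, not the add-on's RegularGridManifest, and clipping the last tile to the array edge is an assumption of this sketch.

```python
import itertools
import math
from collections.abc import Mapping


class LazyGridManifest(Mapping):
    """Hypothetical lazy key -> window mapping; windows are computed on access."""

    def __init__(self, shape, chunk_shape, keys=None):
        self.shape = shape
        self.chunk_shape = chunk_shape
        self._grid = tuple((s + c - 1) // c for s, c in zip(shape, chunk_shape))
        # None means "the whole grid"; otherwise an explicit subset of keys.
        self._keys = tuple(keys) if keys is not None else None

    def _has(self, key):
        if self._keys is not None:
            return key in self._keys
        return len(key) == len(self._grid) and all(
            0 <= k < n for k, n in zip(key, self._grid)
        )

    def __getitem__(self, key):
        if not self._has(key):
            raise KeyError(key)
        # Per dim: [k * chunk, (k + 1) * chunk), clipped at the array edge.
        return tuple(
            (k * c, min((k + 1) * c, s))
            for k, c, s in zip(key, self.chunk_shape, self.shape)
        )

    def __iter__(self):
        if self._keys is not None:
            return iter(self._keys)
        return itertools.product(*(range(n) for n in self._grid))

    def __len__(self):
        return len(self._keys) if self._keys is not None else math.prod(self._grid)

    def subset(self, keys):
        """Return a narrowed manifest over the given keys only."""
        return LazyGridManifest(self.shape, self.chunk_shape, keys=keys)


manifest = LazyGridManifest(shape=(10980, 10980), chunk_shape=(512, 512))
print(len(manifest))       # 484
print(manifest[(0, 1)])    # ((0, 512), (512, 1024))
print(manifest[(21, 21)])  # ((10752, 10980), (10752, 10980))
narrowed = manifest.subset([(0, 1)])
print(list(narrowed))      # [(0, 1)]
```

The full manifest never builds a list of its 484 tiles; narrowing only records the surviving keys.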
Layer 4 — I/O
Materialisation is the only moment bytes are read. select_reader dispatches:
explicit hint ("raster"/"range") → named strategy
else → first strategy whose can_read(ref) is true:
RasterWindowReader if ref.window → dataset.read(window=…)
ByteRangeReader if ref.byte_range → BytesIO.seek + read(length)
else → DrbChunkError (refuses to read the whole file)
Key design point: readers reuse the source driver's existing get_impl —
get_impl(rasterio.io.DatasetReader) for raster windows, get_impl(io.BytesIO)
for byte ranges. No driver is modified to support chunking. From Chunk there
are two exits: get_impl(np.ndarray) (single-tile direct read) and to_xarray()
(dask-backed da.from_delayed(dask.delayed(reader.read)), dims ("band",) + array.dims).
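The dispatch rule above (explicit hint first, then the first strategy whose can_read accepts the ref, otherwise an error) can be sketched without any I/O. All names below (`Ref`, `RasterStrategy`, `RangeStrategy`, `ChunkReadError`) are stand-ins invented for this sketch; only the rule itself comes from the description above.

```python
from dataclasses import dataclass
from typing import Optional


class ChunkReadError(Exception):
    """Stand-in for DrbChunkError in this sketch."""


@dataclass(frozen=True)
class Ref:
    """Stand-in for ChunkRef: a tile carries a window or a byte range."""
    key: tuple
    window: Optional[tuple] = None
    byte_range: Optional[tuple] = None


class RasterStrategy:
    name = "raster"

    def can_read(self, ref: Ref) -> bool:
        return ref.window is not None


class RangeStrategy:
    name = "range"

    def can_read(self, ref: Ref) -> bool:
        return ref.byte_range is not None


STRATEGIES = [RasterStrategy(), RangeStrategy()]


def select_reader(ref: Ref, hint: Optional[str] = None):
    """Pick a strategy by explicit hint, else by what the ref carries."""
    if hint is not None:
        for strategy in STRATEGIES:
            if strategy.name == hint:
                return strategy
        raise ChunkReadError(f"unknown reader hint: {hint!r}")
    for strategy in STRATEGIES:
        if strategy.can_read(ref):
            return strategy
    # Neither a window nor a byte range: refuse instead of reading everything.
    raise ChunkReadError(f"no reader for tile {ref.key}; refusing a full read")


print(select_reader(Ref(key=(0, 0), window=((0, 512), (0, 512)))).name)  # raster
print(select_reader(Ref(key=(0,), byte_range=(0, 5))).name)              # range
```

Raising in the last branch is what keeps a chunk from silently degrading into a whole-file read.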
Design decisions to remember
| Decision | Why |
|---|---|
| Selection = pure data, resolution = scheme | makes Chunk serialisable and selections replayable; one home for geometry |
| Lazy manifest (ref() computes the window) | no materialised tile list for very large grids |
| Readers via the source driver's get_impl | zero driver changes; chunking is an additive layer |
| No fallback to a full read | guarantees a chunk stays a chunk (no accidental full read) |
| Explicit v1 deferrals with clear errors | bounded scope: regular only, single-tile; drb:selection in descriptor rejected |
Declaring chunks in a topic (cortex.ttl)
Chunks are declared on a DrbTopic with the drb:chunk predicate. Each
drb:chunk blank node describes one chunk. Descriptors are inherited through
rdfs:subClassOf and a child topic may override a parent's chunk by reusing its
drb:chunkName.
@prefix owl: <http://www.w3.org/2002/07/owl#> .
@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .
@prefix drb: <http://www.gael.fr/drb#> .
drb:raster-base a owl:Class ;
rdfs:label "raster-base" ;
drb:chunk [
drb:chunkName "data" ;
drb:source "." ; # bare literal: the node itself
drb:dims ( "y" "x" ) ;
drb:dtype "uint16" ;
drb:tilingScheme "regular" ;
drb:tileHeight 512 ;
drb:tileWidth 512 ;
drb:reader "raster"
] .
drb:my-image a owl:Class ;
rdfs:label "my-image" ;
rdfs:subClassOf drb:raster-base ; # inherits "data", adds "b04"
drb:chunk [
drb:chunkName "b04" ;
drb:source [ drb:xquery
"GRANULE/*/IMG_DATA/R10m/*[fn:matches(fn:name(),'.*_B04_10m\\.jp2$')]"
] ; # typed blank node: XQuery navigates the product
drb:dims ( "y" "x" ) ;
drb:dtype "uint16" ;
drb:chunkShape ( 256 256 ) ; # equivalent to tileHeight/tileWidth
drb:reader "raster"
] .
drb:chunk vocabulary
| Predicate | Meaning | Required |
|---|---|---|
| drb:chunkName | Unique chunk identifier within the topic. | yes |
| drb:source | Where the bytes are. Bare literal: "." / "" = the node itself; a path/URL resolved through DRB (ConstantExtractor). Typed blank node: [ drb:xquery "…" ], an XQuery evaluated against the product node, returning the band DrbNode (XQueryExtractor). Also accepts drb:python, drb:script, drb:constant via parse_extractor. Note: drb's XQuery engine uses full-match semantics, so patterns must carry a leading .* (e.g. fn:matches(fn:name(),'.*_B04_10m\\.jp2$')). | yes |
| drb:dims | RDF list of dimension names, e.g. ( "y" "x" ). | yes |
| drb:dtype | NumPy dtype string, e.g. "uint16". | yes |
| drb:tilingScheme | Tiling scheme name. v1 supports only "regular" (the default). | no |
| drb:chunkShape | RDF list giving the tile shape per dim. | one of chunkShape / (tileHeight + tileWidth) |
| drb:tileHeight, drb:tileWidth | Convenience for 2-D (y, x) grids; equivalent to drb:chunkShape ( h w ). | one of chunkShape / (tileHeight + tileWidth) |
| drb:reader | Reader hint: "raster" or "range". If omitted, a strategy is auto-selected from the ref. | no |
| drb:collection | Logical group name. Chunks sharing the same drb:collection value can be built together with apply(collection=…). | no |
| drb:selection | v1 deferral: a descriptor-level default selection is rejected at build time; apply selections explicitly via Chunk.select(). | no |
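The rules in this table (four required predicates, one of two tile-shape spellings, "regular" only, no descriptor-level selection) can be expressed as a small validation step. This is a sketch under assumptions: the raw descriptor is modelled as a plain dict keyed by predicate local names, `normalise_descriptor` is a hypothetical helper, ValueError stands in for DrbChunkError, and letting chunkShape take precedence when both spellings are present is a choice made here, not documented behaviour.

```python
REQUIRED = ("chunkName", "source", "dims", "dtype")


def normalise_descriptor(raw: dict) -> dict:
    """Validate a raw descriptor and reduce both tile spellings to chunk_shape."""
    missing = [name for name in REQUIRED if name not in raw]
    if missing:
        raise ValueError(f"missing required predicates: {missing}")
    if raw.get("tilingScheme", "regular") != "regular":
        raise ValueError("only the 'regular' tiling scheme is supported in v1")
    if "selection" in raw:
        raise ValueError("descriptor-level selection is deferred in v1")
    if "chunkShape" in raw:
        # Assumption of this sketch: chunkShape wins if both spellings appear.
        chunk_shape = tuple(raw["chunkShape"])
    elif "tileHeight" in raw and "tileWidth" in raw:
        chunk_shape = (raw["tileHeight"], raw["tileWidth"])
    else:
        raise ValueError("give chunkShape, or both tileHeight and tileWidth")
    if len(chunk_shape) != len(raw["dims"]):
        raise ValueError("chunk shape rank must match the number of dims")
    return {
        "name": raw["chunkName"],
        "dims": tuple(raw["dims"]),
        "dtype": raw["dtype"],
        "chunk_shape": chunk_shape,
        "reader": raw.get("reader"),  # None means auto-select from the ref
    }


base = {"chunkName": "data", "source": ".", "dims": ["y", "x"], "dtype": "uint16"}
a = normalise_descriptor({**base, "tileHeight": 512, "tileWidth": 512})
b = normalise_descriptor({**base, "chunkShape": [512, 512]})
print(a["chunk_shape"] == b["chunk_shape"])  # True
```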
Collections
A collection groups related chunks under a shared name so callers can build
all of them in one call. Declare it with drb:collection on each chunk that
belongs to the group:
drb:sentinel2-l2a-r10m a owl:Class ;
rdfs:label "sentinel2-l2a-r10m" ;
drb:chunk [
drb:chunkName "B02" ;
drb:source [ drb:xquery
"GRANULE/*/IMG_DATA/R10m/*[fn:matches(fn:name(),'.*_B02_10m\\.jp2$')]"
] ;
drb:dims ( "y" "x" ) ;
drb:dtype "uint16" ;
drb:tileHeight 512 ; drb:tileWidth 512 ;
drb:reader "raster" ;
drb:collection "R10m"
] ;
drb:chunk [
drb:chunkName "B04" ;
drb:source [ drb:xquery
"GRANULE/*/IMG_DATA/R10m/*[fn:matches(fn:name(),'.*_B04_10m\\.jp2$')]"
] ;
drb:dims ( "y" "x" ) ;
drb:dtype "uint16" ;
drb:tileHeight 512 ; drb:tileWidth 512 ;
drb:reader "raster" ;
drb:collection "R10m"
] .
ChunkAddon exposes two collection-aware API calls:
# Map collection name -> list of chunk names (None key = ungrouped chunks).
chunk_addon.available_collections(topic)
# {'R10m': ['B02', 'B04'], None: ['QI']}
# Build every chunk in a collection.
chunks = chunk_addon.apply(node, collection="R10m", topic=topic)
# -> [Chunk("B02"), Chunk("B04")]
apply(collection=…) raises DrbChunkError listing available collections if
the requested name is unknown. chunk_name and collection are mutually
exclusive — passing both raises DrbChunkError.
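The grouping behind available_collections() amounts to bucketing chunk names by their drb:collection value. The sketch below works on plain dicts; `group_by_collection`, `names_in_collection` and the use of LookupError in place of DrbChunkError are assumptions made for illustration.

```python
def group_by_collection(descriptors: dict) -> dict:
    """Group chunk names by collection; ungrouped chunks go under None."""
    groups: dict = {}
    for name, descriptor in descriptors.items():
        groups.setdefault(descriptor.get("collection"), []).append(name)
    return groups


def names_in_collection(descriptors: dict, collection: str) -> list:
    """Chunk names of one collection, or an error listing the known ones."""
    groups = group_by_collection(descriptors)
    if collection is None or collection not in groups:
        known = sorted(k for k in groups if k is not None)
        raise LookupError(f"unknown collection {collection!r}; available: {known}")
    return groups[collection]


DESCRIPTORS = {
    "B02": {"collection": "R10m"},
    "B04": {"collection": "R10m"},
    "QI": {},  # no drb:collection, so it lands under the None key
}
print(group_by_collection(DESCRIPTORS))          # {'R10m': ['B02', 'B04'], None: ['QI']}
print(names_in_collection(DESCRIPTORS, "R10m"))  # ['B02', 'B04']
```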
Quick start (through the DRB resolver)
from drb.topics import resolver
from drb.addons.addon import AddonManager

# Resolve any source DRB knows how to type.
topic, node = resolver.resolve("/data/S2/IMG_DATA/T31TCJ_B04.jp2")
chunk_addon = AddonManager().get_addon("chunk")  # the registered singleton

if chunk_addon.can_apply(topic):
    # What chunks does this topic declare?
    for name, scheme in chunk_addon.available_chunks(topic):
        print(name, scheme)  # e.g. data {'regular': {'chunk_shape': [512, 512]}}

    # What collections are declared? (None key = ungrouped chunks)
    print(chunk_addon.available_collections(topic))
    # {'R10m': ['B02', 'B04'], None: ['QI']}

    # Build one named chunk (or omit chunk_name to get a list of all chunks).
    chunk = chunk_addon.apply(node, chunk_name="data")

    # Build every chunk in a collection.
    chunks = chunk_addon.apply(node, collection="R10m")
apply() returns a single Chunk when chunk_name is given, a list[Chunk]
when collection is given, or a list[Chunk] of every declared chunk when
neither is given. It raises DrbChunkError if the topic declares no chunk, if
chunk_name is unknown, if collection is unknown, or if both are given.
Working with a Chunk
Inspect the grid
chunk.name # "data"
chunk.array.dims # ("y", "x")
chunk.array.shape # (10980, 10980) — inferred from the source raster
chunk.array.dtype # "uint16"
chunk.grid_shape # (22, 22) — ceil(shape / chunk_shape), RegularGrid only
list(chunk.tiles()) # [(0, 0), (0, 1), …] — tile keys
chunk.tile((0, 0)) # ChunkRef(key=(0,0), source=…, window=((0,512),(0,512)))
Select (lazy — reads nothing)
select resolves the selection into tile keys via the scheme and returns a new
Chunk over the narrowed manifest. chunk[sel] is sugar for chunk.select(sel).
from drb.chunk.selection import WindowSelection, IselSelection, BandSelection
# A pixel window (x, y, w, h); maps to the x/y dims of a RegularGrid.
roi = chunk.select(WindowSelection(x=512, y=0, w=512, h=512))
# Integer-position selection per dim: int -> single index, [start, stop] -> range.
sub = chunk.select(IselSelection({"y": [0, 1024], "x": [0, 512]}))
Materialise (this is where I/O happens)
import numpy as np
# Single tile -> numpy. Multi-tile numpy materialisation is rejected on purpose.
arr = roi.get_impl(np.ndarray) # windowed rasterio read, shape (bands, h, w)
# Lazy, dask-backed xarray.DataArray (dims = ("band",) + array.dims).
import xarray as xr
xda = roi.to_xarray()
xda = roi.get_impl(xr.DataArray)  # equivalent
get_impl(np.ndarray) requires the selection to resolve to a single tile —
otherwise it raises and points you to to_xarray(). (In v1, to_xarray() itself
also only assembles a single tile; see Limitations.)
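The single-tile rule can be pictured as a guard that runs before any read. This is only a sketch: `materialise_single` and `fake_read` are hypothetical, and RuntimeError stands in for the add-on's own error type.

```python
def materialise_single(tile_keys, read_tile):
    """Read exactly one tile; refuse before any I/O if the selection spans more."""
    keys = list(tile_keys)
    if len(keys) != 1:
        raise RuntimeError(
            f"selection spans {len(keys)} tiles; numpy materialisation "
            "needs exactly one (see to_xarray())"
        )
    return read_tile(keys[0])


reads = []


def fake_read(key):
    reads.append(key)  # record that I/O would have happened here
    return f"tile {key}"


print(materialise_single([(0, 1)], fake_read))  # tile (0, 1)
try:
    materialise_single([(0, 0), (0, 1)], fake_read)
except RuntimeError as exc:
    print(exc)
print(reads)  # [(0, 1)]
```

The rejected call leaves `reads` untouched, so a refused materialisation costs no I/O.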
More usage examples
Windowed raster read (manual construction)
When you already hold a node whose get_impl(DatasetReader) opens the raster, you
can build a Chunk directly. The windowed read returns exactly the same data as a
full slice, without loading the whole image (proven in
tests/test_integration_image.py):
from drb.chunk.chunk import Chunk
from drb.chunk.model import ChunkArray, RegularGridManifest
from drb.chunk.selection import WindowSelection
from drb.chunk.tiling import RegularGrid
import numpy as np
array = ChunkArray(dims=("y", "x"), shape=(1024, 1024), dtype="uint16",
scheme=RegularGrid(chunk_shape=(512, 512)))
chunk = Chunk(name="data", array=array, node=node, # node.get_impl(DatasetReader)
manifest=RegularGridManifest(array, node), reader="raster")
out = chunk.select(WindowSelection(x=512, y=0, w=512, h=512)).get_impl(np.ndarray)
# out == full_data[:, 0:512, 512:1024]
Remote read by byte range (HTTP partial GET)
When the source exposes io.BytesIO, the "range" reader fetches only the
requested bytes — never the whole object (see tests/test_integration_remote.py):
array = ChunkArray(dims=("x",), shape=(20,), dtype="uint8",
scheme=RegularGrid(chunk_shape=(5,)))
chunk = Chunk(name="c", array=array, node=remote_node, # ref.byte_range = (offset, length)
manifest=byte_manifest, reader="range")
raw = chunk.get_impl(np.ndarray) # BytesIO.seek(offset); read(length) → b"ABCDE"
Lazy xarray / dask compute
xda = roi.to_xarray() # dask-backed DataArray, dims ("band", "y", "x")
result = xda.mean().compute() # the deferred read fires only on .compute()
Selection types
All selections are pure, serialisable data (to_dict() / parse_selection());
turning them into tile keys is the TilingScheme's job.
| Type | to_dict() shape | Resolvable by RegularGrid in v1 |
|---|---|---|
| IselSelection(per_dim) | {"isel": {"y": [0,1024], "x": 5}} | ✅ |
| WindowSelection(x,y,w,h) | {"window": {"x":…, "y":…, "w":…, "h":…}} | ✅ (maps to x/y dims) |
| SelSelection(per_dim) | {"sel": {…}} (label/coord based) | ❌ (needs coords) |
| BandSelection(bands) | {"band": [0, 3]} | ❌ |
| RangeSelection(offset,length) | {"range": {"offset":…, "length":…}} | ❌ |
| SelectionAggregator(parts) | merged keys of its parts | per-part |
Round-tripping:
from drb.chunk.selection import parse_selection
sel = parse_selection({"window": {"x": 0, "y": 0, "w": 256, "h": 256}})
sel.to_dict() # {'window': {'x': 0, 'y': 0, 'w': 256, 'h': 256}}
# Several keys -> a SelectionAggregator composing leaf constraints.
parse_selection({"isel": {"y": [0, 512]}, "band": [0]})
Out-of-bounds or unknown-dimension selections raise DrbSelectionError.
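The "pure data" design can be illustrated with dataclasses that only know how to serialise themselves. The classes and the `parse` function below are simplified stand-ins covering two selection kinds; they are not the add-on's implementation, and ValueError stands in for DrbSelectionError.

```python
from dataclasses import dataclass


@dataclass
class Window:
    x: int
    y: int
    w: int
    h: int

    def to_dict(self) -> dict:
        return {"window": {"x": self.x, "y": self.y, "w": self.w, "h": self.h}}


@dataclass
class Isel:
    per_dim: dict

    def to_dict(self) -> dict:
        return {"isel": dict(self.per_dim)}


@dataclass
class Aggregate:
    parts: list

    def to_dict(self) -> dict:
        merged: dict = {}
        for part in self.parts:
            merged.update(part.to_dict())  # merged keys of the parts
        return merged


PARSERS = {"window": lambda v: Window(**v), "isel": lambda v: Isel(dict(v))}


def parse(data: dict):
    """One key gives a leaf selection; several keys give an Aggregate."""
    unknown = sorted(set(data) - set(PARSERS))
    if not data or unknown:
        raise ValueError(f"empty or unknown selection type(s): {unknown}")
    parts = [PARSERS[kind](value) for kind, value in data.items()]
    return parts[0] if len(parts) == 1 else Aggregate(parts)


payload = {"window": {"x": 0, "y": 0, "w": 256, "h": 256}}
print(parse(payload).to_dict() == payload)    # True
combined = {"isel": {"y": [0, 512]}, **payload}
print(type(parse(combined)).__name__)         # Aggregate
print(parse(combined).to_dict() == combined)  # True
```

Nothing here knows about tiles, which is the point: turning a selection into keys stays with the tiling scheme.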
Reader strategies
A ReaderStrategy materialises one ChunkRef. The strategy is chosen by the
drb:reader hint, or automatically from what the ref carries:
- RasterWindowReader ("raster"): for refs with a window. Calls the source's get_impl(rasterio.io.DatasetReader) and does a windowed dataset.read(window=…). The image driver is not modified.
- ByteRangeReader ("range"): for refs with a byte_range. Calls the source's get_impl(io.BytesIO), seeks, and reads exactly length bytes, e.g. an HTTP partial GET.
If no hint is given and neither a window nor a byte_range is present, the
add-on raises rather than reading the whole source.
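The byte-range path reduces to a seek followed by a bounded read. The sketch below uses an in-memory io.BytesIO in place of a remote source; `read_byte_range` is a hypothetical helper, and raising EOFError on a short read is an assumption of this sketch rather than documented behaviour.

```python
import io


def read_byte_range(stream: io.BytesIO, offset: int, length: int) -> bytes:
    """Seek to offset and read exactly length bytes; fail on a short read."""
    stream.seek(offset)
    data = stream.read(length)
    if len(data) != length:
        raise EOFError(f"wanted {length} bytes at offset {offset}, got {len(data)}")
    return data


source = io.BytesIO(b"ABCDEFGHIJKLMNOPQRST")  # 20 bytes, i.e. 4 tiles of 5
print(read_byte_range(source, 0, 5))   # b'ABCDE'
print(read_byte_range(source, 15, 5))  # b'PQRST'
```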
Locators and interop
chunk.locator()
# {'source': '/data/…/B04.tif', 'topic': 'http://…#my-image',
# 'chunk': 'b04', 'selection': {'window': {...}} | None}
from drb.chunk import to_kerchunk
to_kerchunk(chunk) # {'version': 1, 'refs': {'1.0': [path, offset, length], …}}
to_kerchunk emits a kerchunk reference-spec
v1 dict. Because kerchunk addresses chunks by byte range, it only works for
byte-range chunks; window-only (format-native) chunks raise DrbChunkError.
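A reference dict of this shape is easy to build by hand, which shows why a byte range is mandatory. The sketch below is not to_kerchunk itself: `to_reference_dict` is hypothetical, the dot-joined key format is inferred from the '1.0' example above, and ValueError stands in for DrbChunkError.

```python
def to_reference_dict(path: str, refs: dict) -> dict:
    """Build a version-1 reference dict from {tile key: (offset, length)}."""
    out = {}
    for key, byte_range in refs.items():
        if byte_range is None:
            # A window-only tile has no byte address to point at.
            raise ValueError(f"tile {key} has no byte range; cannot be referenced")
        offset, length = byte_range
        out[".".join(str(k) for k in key)] = [path, offset, length]
    return {"version": 1, "refs": out}


spec = to_reference_dict(
    "/data/a.bin",  # illustrative path
    {(0, 0): (0, 1024), (1, 0): (1024, 1024)},
)
print(spec["refs"]["1.0"])  # ['/data/a.bin', 1024, 1024]
```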
Exceptions
from drb.chunk import DrbChunkError, DrbSelectionError
- DrbChunkError: base error for the add-on (unknown chunk name, no applicable reader, unsupported get_impl, v1 deferrals, …). Subclasses drb.exceptions.core.DrbException.
- DrbSelectionError: unknown selection type or an out-of-bounds region. Subclasses DrbChunkError.
v1 limitations (intentional deferrals)
These are explicit, raise clear errors, and are tracked as follow-ups:
- Tiling scheme: only "regular" (RegularGrid) is supported.
- Multi-tile assembly: to_xarray() assembles a single tile; concat/mosaic across tiles is not yet implemented. get_impl(np.ndarray) likewise requires a single-tile selection.
- Descriptor-level drb:selection: a default selection in the descriptor is rejected at build time; apply selections explicitly via Chunk.select().
- Chunk.from_locator(): locator round-trip is deferred (locator() works).
- Label/coord & band/range resolution: SelSelection, BandSelection and RangeSelection are not resolved by RegularGrid in v1.
Public API
from drb.chunk import (
Chunk, ChunkAddon, ChunkArray, ChunkRef, ChunkManifest,
Selection, parse_selection, TilingScheme, RegularGrid,
to_kerchunk, DrbChunkError, DrbSelectionError, __version__,
)
# ChunkAddon public methods (singleton via AddonManager().get_addon("chunk")):
# .can_apply(topic) -> bool
# .available_chunks(topic) -> list[tuple[str, dict]]
# .available_collections(topic) -> dict[str | None, list[str]]
# .apply(node, *, chunk_name=…) -> Chunk
# .apply(node, *, collection=…) -> list[Chunk]
# .apply(node) -> list[Chunk] (all declared chunks)
Development
python3 -m venv venv && source venv/bin/activate
pip install -e . -r requirements-test.txt
python3 -m unittest discover # unittest suite under tests/
License
LGPLv3 — see LICENCE.txt.