DRB N-D lazy chunk access add-on

drb-chunk

DRB N-D lazy chunk access add-on. It exposes N-dimensional, lazily-read data chunks (raster tiles, windows, byte ranges, time-series blocks) as first-class DRB objects, driven declaratively from a topic's RDF/Turtle descriptor and materialised on demand through numpy or a dask-backed xarray.

The add-on reads nothing until you ask for bytes. Selection (select, __getitem__) only narrows a manifest; the actual I/O happens when you call get_impl(...) / to_xarray(). Reads are windowed or byte-ranged — the add-on refuses to silently fall back to a full read.

pip install drb-chunk

Requires Python 3.11–3.13. Materialisation pulls in rasterio (windowed raster reads), dask and xarray (lazy assembly); these are only needed at read time.


Why it exists

DRB drivers already turn a file or URL into a tree of DrbNodes. But a single raster band or a large array is not naturally a "node" — it is an N-D grid you want to slice cheaply, possibly remotely, without loading the whole thing.

drb-chunk adds that missing layer:

  • Declarative — a topic says, in its cortex.ttl, what its chunks are (dims, dtype, tiling, where the bytes live). No code per product.
  • Lazy — narrowing a chunk is pure metadata arithmetic over the tiling scheme. I/O is deferred to materialisation.
  • Driver-reusing — chunks read through the source node's existing get_impl (e.g. rasterio.io.DatasetReader, io.BytesIO). No driver is modified to support chunking.

Core concepts

Object | Role
------ | ----
ChunkAddon | Entry point (drb.addon = chunk). Reads a topic's descriptors and builds Chunks from a node.
ChunkDescriptor | The parsed drb:chunk declaration: name, source, dims, dtype, tiling scheme, optional reader/selection.
Chunk | A handle on one declared chunk of a node. Slice it, then materialise it.
ChunkArray | Pure metadata of the array: dims, shape, dtype, scheme.
TilingScheme / RegularGrid | How the array is cut into tiles; resolves a Selection into tile keys.
ChunkManifest | Lazy Mapping[key -> ChunkRef] over the scheme's keys.
ChunkRef | One tile's locator: key, source node, and a window or byte_range.
Selection | Serialisable, declarative constraint (isel, sel, window, band, range).
ReaderStrategy | Turns a ChunkRef into bytes/array: RasterWindowReader, ByteRangeReader.

Data flow:

topic (cortex.ttl)                       node (DrbNode)
      │  drb:chunk …                            │
      ▼                                          │
ChunkDescriptor ──ChunkAddon.apply(node)────────►  Chunk
                                                   │  .select(Selection)   (lazy: subsets the manifest)
                                                   ▼
                                                 Chunk (narrowed)
                                                   │  .get_impl(np.ndarray) / .to_xarray()
                                                   ▼
                                           ReaderStrategy.read(ref)  ──►  numpy / xarray

Architecture

The add-on is organised in four layers that separate declaration, geometry, laziness and I/O. Each layer ignores the details of the one below it. The guiding rule, enforced in the code: nothing is read before layer 4. select() is pure key arithmetic; ReaderStrategy.read() is the first place that touches the source.

┌─────────────────────────────────────────────────────────────────────┐
│  LAYER 1 — Declaration / discovery                                    │
│  core.py        ChunkAddon   (entry point  drb.addon = "chunk")       │
│  descriptor.py  ChunkDescriptor + retrieve_chunks()  ←── cortex.ttl   │
│        "which chunks exist, where the bytes are, which dtype"          │
├─────────────────────────────────────────────────────────────────────┤
│  LAYER 2 — Geometry (the only layer that knows the grid)              │
│  tiling.py      TilingScheme / RegularGrid                            │
│  selection.py   Selection (isel/window/band/range/sel) + aggregator   │
│        "how the array is cut; how a Selection becomes tile keys"       │
├─────────────────────────────────────────────────────────────────────┤
│  LAYER 3 — Laziness (metadata only, zero bytes read)                 │
│  model.py       ChunkArray, ChunkManifest, ChunkRef                   │
│  chunk.py       Chunk  (select / __getitem__ / tiles / locator)       │
│        "a handle you narrow; a lazy key→ref manifest"                  │
├─────────────────────────────────────────────────────────────────────┤
│  LAYER 4 — I/O (actual materialisation)                              │
│  readers.py     ReaderStrategy / RasterWindowReader / ByteRangeReader │
│  chunk.py       get_impl(numpy) / to_xarray()  + interop.to_kerchunk  │
│        "turn a ChunkRef into bytes via the source driver's get_impl"   │
└─────────────────────────────────────────────────────────────────────┘

Layer 1 — Declaration & discovery

ChunkAddon (core.py) is a singleton registered under the drb.addon entry point; DRB loads it into AddonManager. It implements the Addon contract (identifier, return_type, can_apply, apply). Its core job is the build pipeline behind apply():

topic ──retrieve_chunks()──► {name: ChunkDescriptor}
                                   │  for the requested chunk:
cd.source.extract(node) ───────────┤  where are the bytes?
_resolve_source() ─────────────────┤  "." → the node itself ; else resolver.create(url)
_infer_shape(source_node) ─────────┤  rasterio DatasetReader.height/width
                                   ▼
ChunkArray(dims, shape, dtype, scheme)
RegularGridManifest(array, source_node)
                                   ▼
              Chunk(name, array, node, manifest, reader, topic_uri)

retrieve_chunks() (descriptor.py) inherits descriptors through rdfs:subClassOf (recursing into parents) then overrides them with the topic's own chunks, keyed by drb:chunkName — the same pattern as MetadataAddon. It reads the public RDF graph exposed by the topic's ManagerDao.
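
The inherit-then-override rule can be pictured with plain dictionaries. This is only a sketch, not the add-on's implementation: the TOPICS mapping, its "parent" field and the collect_chunks helper are hypothetical stand-ins for the RDF graph and for retrieve_chunks(). Unlike the Turtle example later in this document, the child here deliberately redeclares "data" to show the override.

```python
# Hypothetical stand-in for the RDF graph: each topic has an optional parent
# (rdfs:subClassOf) and its own chunk descriptors keyed by drb:chunkName.
TOPICS = {
    "raster-base": {
        "parent": None,
        "chunks": {"data": {"dtype": "uint16", "chunk_shape": (512, 512)}},
    },
    "my-image": {
        "parent": "raster-base",
        "chunks": {
            "b04": {"dtype": "uint16", "chunk_shape": (256, 256)},
            "data": {"dtype": "uint16", "chunk_shape": (256, 256)},
        },
    },
}


def collect_chunks(topic_name, topics=TOPICS):
    """Return {chunkName: descriptor}: parents first, then the topic's own."""
    topic = topics[topic_name]
    merged = {}
    if topic["parent"] is not None:
        # Recurse into the parent so inherited descriptors come first.
        merged.update(collect_chunks(topic["parent"], topics))
    # The topic's own chunks win over inherited ones with the same name.
    merged.update(topic["chunks"])
    return merged


descriptors = collect_chunks("my-image")
print(sorted(descriptors))                  # ['b04', 'data']
print(descriptors["data"]["chunk_shape"])   # (256, 256)
```

Keying by chunk name is what makes the override a plain dictionary update: the child never has to say which parent entry it replaces.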

A descriptor-level drb:selection is rejected at build time; apply selections explicitly via Chunk.select().

Layer 2 — Geometry

This is "the only place grid geometry lives" (TilingScheme docstring). Two responsibilities: enumerate tile keys (keys(array)) and resolve a selection into keys + residual (resolve(selection, array) → ResolvedSelection). RegularGrid is the only v1 implementation:

grid_shape = ceil(shape / chunk_shape)   per dim
keys       = itertools.product(*(range(n) for n in grid_shape))

resolve(WindowSelection(x,y,w,h)):
    x → [x, x+w)   y → [y, y+h)
    per dim:  first = start // chunk ;  last = (stop-1) // chunk
    keys = product(range(first, last+1) ...)

Selection (selection.py) is deliberately pure, serialisable data (to_dict() / parse_selection()) — it does not know how to turn itself into keys; that is the scheme's job. This separation is what makes a Chunk serialisable (locator) and a selection replayable. SelectionAggregator composes several per-dimension constraints.
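
The grid arithmetic above (grid_shape, key enumeration, window resolution) is small enough to try in isolation. The following is a minimal sketch using only the standard library; the function names are illustrative and are not the add-on's API, it assumes dims ordered (y, x), and it does no bounds checking.

```python
import itertools
import math


def grid_shape(shape, chunk_shape):
    """Number of tiles along each dimension: ceil(shape / chunk_shape)."""
    return tuple(math.ceil(s / c) for s, c in zip(shape, chunk_shape))


def all_keys(shape, chunk_shape):
    """Every tile key of the grid, in row-major order."""
    counts = grid_shape(shape, chunk_shape)
    return list(itertools.product(*(range(n) for n in counts)))


def resolve_window(x, y, w, h, chunk_shape):
    """Tile keys touched by a pixel window, for dims ordered (y, x)."""
    spans = [(y, y + h), (x, x + w)]   # half-open [start, stop) per dim
    ranges = []
    for (start, stop), chunk in zip(spans, chunk_shape):
        first = start // chunk
        last = (stop - 1) // chunk     # stop is exclusive, hence the -1
        ranges.append(range(first, last + 1))
    return list(itertools.product(*ranges))


print(grid_shape((10980, 10980), (512, 512)))         # (22, 22)
print(resolve_window(512, 0, 512, 512, (512, 512)))   # [(0, 1)]
print(resolve_window(500, 0, 100, 10, (512, 512)))    # [(0, 0), (0, 1)]
```

The last call shows a window that straddles a tile boundary: it resolves to two keys, which is the case a single-tile materialisation would reject.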

Layer 3 — Laziness

  • ChunkArray — pure metadata (dims, shape, dtype, scheme). No bytes.
  • ChunkManifest — a lazy Mapping[key → ChunkRef]. RegularGridManifest.ref(key) computes the window on the fly (key * chunk_shape); it stores nothing.
  • ChunkRef — one tile's locator: key, source node, and either a window (format-native) or a byte_range. This duality decides which reader applies.
  • Chunk.select() — laziness in action: resolve the selection into keys, then manifest.subset(resolved) returns a _SubsetManifest keeping only those keys, wrapped in a new immutable Chunk. No read happens.
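
To make "it stores nothing" concrete, here is a sketch of a lazy manifest built on collections.abc.Mapping. LazyGridManifest is a hypothetical class, not the add-on's RegularGridManifest; in particular, clipping the last window to the array edge is an assumption of this sketch.

```python
from collections.abc import Mapping
import itertools
import math


class LazyGridManifest(Mapping):
    """Hypothetical lazy key -> window mapping; nothing is stored per tile."""

    def __init__(self, shape, chunk_shape, keys=None):
        self.shape = tuple(shape)
        self.chunk_shape = tuple(chunk_shape)
        self._keys = keys   # None means "every key of the grid"

    def _grid(self):
        return tuple(math.ceil(s / c)
                     for s, c in zip(self.shape, self.chunk_shape))

    def __iter__(self):
        if self._keys is not None:
            return iter(self._keys)
        return itertools.product(*(range(n) for n in self._grid()))

    def __len__(self):
        if self._keys is not None:
            return len(self._keys)
        return math.prod(self._grid())

    def __getitem__(self, key):
        grid = self._grid()
        if not isinstance(key, tuple) or len(key) != len(grid):
            raise KeyError(key)
        if any(not 0 <= k < n for k, n in zip(key, grid)):
            raise KeyError(key)
        if self._keys is not None and key not in self._keys:
            raise KeyError(key)
        # Window computed on demand: start = key * chunk, stop clipped
        # to the array edge (assumption of this sketch).
        return tuple((k * c, min((k + 1) * c, s))
                     for k, c, s in zip(key, self.chunk_shape, self.shape))

    def subset(self, keys):
        """Narrowed manifest: still no I/O, and no window is computed."""
        return LazyGridManifest(self.shape, self.chunk_shape, tuple(keys))


manifest = LazyGridManifest((10980, 10980), (512, 512))
print(len(manifest))             # 484
print(manifest[(0, 0)])          # ((0, 512), (0, 512))
print(manifest[(21, 21)])        # ((10752, 10980), (10752, 10980))
print(list(manifest.subset([(0, 1)])))   # [(0, 1)]
```

Only the shape, the chunk shape and (after narrowing) a key tuple are held in memory, so the cost of a manifest does not grow with the number of tiles until a selection lists them.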

Layer 4 — I/O

Materialisation is the only moment bytes are read. select_reader dispatches:

explicit hint ("raster"/"range")  →  named strategy
else  →  first strategy whose can_read(ref) is true:
         RasterWindowReader   if ref.window     → dataset.read(window=…)
         ByteRangeReader      if ref.byte_range → BytesIO.seek + read(length)
else  →  DrbChunkError   (refuses to read the whole file)

Key design point: readers reuse the source driver's existing get_impl, namely get_impl(rasterio.io.DatasetReader) for raster windows and get_impl(io.BytesIO) for byte ranges. No driver is modified to support chunking. From Chunk there are two exits: get_impl(np.ndarray) (single-tile direct read) and to_xarray() (dask-backed, built with da.from_delayed(dask.delayed(reader.read)), with dims ("band",) + array.dims).
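
The dispatch rule can be sketched as a small strategy lookup. Everything below is illustrative: Ref, WindowReader, NoReaderError and this select_reader are simplified stand-ins (NoReaderError plays the role of DrbChunkError), and the source is assumed to already be a seekable binary stream rather than a DRB node.

```python
import io
from dataclasses import dataclass
from typing import Optional, Tuple


class NoReaderError(Exception):
    """Stand-in for DrbChunkError in this sketch."""


@dataclass(frozen=True)
class Ref:
    """Simplified ChunkRef: carries either a window or a byte_range."""
    key: tuple
    source: object
    window: Optional[tuple] = None
    byte_range: Optional[Tuple[int, int]] = None   # (offset, length)


class WindowReader:
    name = "raster"

    def can_read(self, ref):
        return ref.window is not None

    def read(self, ref):
        # A real reader would do a windowed dataset read here.
        raise NotImplementedError("raster reads are omitted in this sketch")


class ByteRangeReader:
    name = "range"

    def can_read(self, ref):
        return ref.byte_range is not None

    def read(self, ref):
        offset, length = ref.byte_range
        ref.source.seek(offset)          # source assumed to be seekable
        return ref.source.read(length)


STRATEGIES = (WindowReader(), ByteRangeReader())


def select_reader(ref, hint=None):
    """Explicit hint first, then the first strategy that can read the ref."""
    if hint is not None:
        for strategy in STRATEGIES:
            if strategy.name == hint:
                return strategy
        raise NoReaderError(f"unknown reader hint {hint!r}")
    for strategy in STRATEGIES:
        if strategy.can_read(ref):
            return strategy
    raise NoReaderError("no window or byte_range: refusing a full read")


ref = Ref(key=(1,), source=io.BytesIO(b"0123456789ABCDEFGHIJ"),
          byte_range=(10, 5))
print(select_reader(ref).read(ref))      # b'ABCDE'
```

The final raise is the important branch: with neither a window nor a byte range there is nothing bounded to read, so the sketch fails instead of reading everything.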

Design decisions to remember

Decision | Why
-------- | ---
Selection = pure data, resolution = scheme | makes Chunk serialisable and selections replayable; one home for geometry
Lazy manifest (ref() computes the window) | no materialised tile list for very large grids
Readers via the source driver's get_impl | zero driver changes; chunking is an additive layer
No fallback to a full read | guarantees a chunk stays a chunk (no accidental full read)
Explicit v1 deferrals with clear errors | bounded scope: regular only, single-tile; drb:selection in descriptor rejected

Declaring chunks in a topic (cortex.ttl)

Chunks are declared on a DrbTopic with the drb:chunk predicate. Each drb:chunk blank node describes one chunk. Descriptors are inherited through rdfs:subClassOf and a child topic may override a parent's chunk by reusing its drb:chunkName.

@prefix owl:  <http://www.w3.org/2002/07/owl#> .
@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .
@prefix drb:  <http://www.gael.fr/drb#> .

drb:raster-base a owl:Class ;
    rdfs:label "raster-base" ;
    drb:chunk [
        drb:chunkName    "data" ;
        drb:source       "." ;          # bare literal: the node itself
        drb:dims         ( "y" "x" ) ;
        drb:dtype        "uint16" ;
        drb:tilingScheme "regular" ;
        drb:tileHeight   512 ;
        drb:tileWidth    512 ;
        drb:reader       "raster"
    ] .

drb:my-image a owl:Class ;
    rdfs:label "my-image" ;
    rdfs:subClassOf drb:raster-base ;   # inherits "data", adds "b04"
    drb:chunk [
        drb:chunkName    "b04" ;
        drb:source       [ drb:xquery
            "GRANULE/*/IMG_DATA/R10m/*[fn:matches(fn:name(),'.*_B04_10m\\.jp2$')]"
        ] ;                             # typed blank node: XQuery navigates the product
        drb:dims         ( "y" "x" ) ;
        drb:dtype        "uint16" ;
        drb:chunkShape   ( 256 256 ) ;  # equivalent to tileHeight/tileWidth
        drb:reader       "raster"
    ] .

drb:chunk vocabulary

Predicate | Meaning | Required
--------- | ------- | --------
drb:chunkName | Unique chunk identifier within the topic. | yes
drb:source | Where the bytes are. Bare literal: "." / "" = the node itself; a path/URL resolved through DRB (ConstantExtractor). Typed blank node: [ drb:xquery "…" ], an XQuery evaluated against the product node, returning the band DrbNode (XQueryExtractor). Also accepts drb:python, drb:script, drb:constant via parse_extractor. Note: drb's XQuery engine uses full-match semantics, so patterns must carry a leading .* (e.g. fn:matches(fn:name(),'.*_B04_10m\\.jp2$')). | yes
drb:dims | RDF list of dimension names, e.g. ( "y" "x" ). | yes
drb:dtype | NumPy dtype string, e.g. "uint16". | yes
drb:tilingScheme | Tiling scheme name. v1 supports only "regular" (the default). | no
drb:chunkShape | RDF list giving the tile shape per dim. | one of chunkShape / (tileHeight + tileWidth)
drb:tileHeight, drb:tileWidth | Convenience for 2-D (y, x) grids; equivalent to drb:chunkShape ( h w ). | one of chunkShape / (tileHeight + tileWidth)
drb:reader | Reader hint: "raster" or "range". If omitted, a strategy is auto-selected from the ref. | no
drb:collection | Logical group name. Chunks sharing the same drb:collection value can be built together with apply(collection=…). | no
drb:selection | v1 deferral: a descriptor-level default selection is rejected at build time; apply selections explicitly via Chunk.select(). | no

Collections

A collection groups related chunks under a shared name so callers can build all of them in one call. Declare it with drb:collection on each chunk that belongs to the group:

drb:sentinel2-l2a-r10m a owl:Class ;
    rdfs:label "sentinel2-l2a-r10m" ;
    drb:chunk [
        drb:chunkName  "B02" ;
        drb:source     [ drb:xquery
            "GRANULE/*/IMG_DATA/R10m/*[fn:matches(fn:name(),'.*_B02_10m\\.jp2$')]"
        ] ;
        drb:dims       ( "y" "x" ) ;
        drb:dtype      "uint16" ;
        drb:tileHeight 512 ; drb:tileWidth 512 ;
        drb:reader     "raster" ;
        drb:collection "R10m"
    ] ;
    drb:chunk [
        drb:chunkName  "B04" ;
        drb:source     [ drb:xquery
            "GRANULE/*/IMG_DATA/R10m/*[fn:matches(fn:name(),'.*_B04_10m\\.jp2$')]"
        ] ;
        drb:dims       ( "y" "x" ) ;
        drb:dtype      "uint16" ;
        drb:tileHeight 512 ; drb:tileWidth 512 ;
        drb:reader     "raster" ;
        drb:collection "R10m"
    ] .

ChunkAddon exposes two collection-aware API calls:

# Map collection name -> list of chunk names (None key = ungrouped chunks).
chunk_addon.available_collections(topic)
# {'R10m': ['B02', 'B04'], None: ['QI']}

# Build every chunk in a collection.
chunks = chunk_addon.apply(node, collection="R10m", topic=topic)
# -> [Chunk("B02"), Chunk("B04")]

apply(collection=…) raises DrbChunkError listing available collections if the requested name is unknown. chunk_name and collection are mutually exclusive — passing both raises DrbChunkError.
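
The grouping and the mutual-exclusion rule can be sketched without DRB at all. In the block below, DECLARED, select_names and ChunkError are hypothetical: they mimic what available_collections() and apply() are described as doing, using chunk names in place of Chunk objects.

```python
class ChunkError(Exception):
    """Stand-in for DrbChunkError in this sketch."""


# Hypothetical parsed descriptors: chunk name -> collection (None = ungrouped).
DECLARED = {"B02": "R10m", "B04": "R10m", "QI": None}


def available_collections(declared=DECLARED):
    """Group chunk names by collection, keeping declaration order."""
    groups = {}
    for name, collection in declared.items():
        groups.setdefault(collection, []).append(name)
    return groups


def select_names(chunk_name=None, collection=None, declared=DECLARED):
    """Return the chunk name(s) an apply()-like call would build."""
    if chunk_name is not None and collection is not None:
        raise ChunkError("chunk_name and collection are mutually exclusive")
    if chunk_name is not None:
        if chunk_name not in declared:
            raise ChunkError(f"unknown chunk {chunk_name!r}")
        return chunk_name                      # single name, not a list
    if collection is not None:
        groups = available_collections(declared)
        if collection not in groups:
            known = sorted(k for k in groups if k is not None)
            raise ChunkError(
                f"unknown collection {collection!r}; available: {known}")
        return groups[collection]
    return list(declared)                      # neither given: everything


print(available_collections())            # {'R10m': ['B02', 'B04'], None: ['QI']}
print(select_names(collection="R10m"))    # ['B02', 'B04']
print(select_names(chunk_name="QI"))      # QI
```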


Quick start (through the DRB resolver)

from drb.topics import resolver
from drb.addons.addon import AddonManager

# Resolve any source DRB knows how to type.
topic, node = resolver.resolve("/data/S2/IMG_DATA/T31TCJ_B04.jp2")

chunk_addon = AddonManager().get_addon("chunk")   # the registered singleton

if chunk_addon.can_apply(topic):
    # What chunks does this topic declare?
    for name, scheme in chunk_addon.available_chunks(topic):
        print(name, scheme)            # data {'regular': {'chunk_shape': [512, 512]}}

    # What collections are declared? (None key = ungrouped chunks)
    print(chunk_addon.available_collections(topic))
    # {'R10m': ['B02', 'B04'], None: ['QI']}

    # Build one named chunk (or omit chunk_name to get a list of all chunks).
    chunk = chunk_addon.apply(node, chunk_name="data")

    # Build every chunk in a collection.
    chunks = chunk_addon.apply(node, collection="R10m")

apply() returns a single Chunk when chunk_name is given, a list[Chunk] when collection is given, or a list[Chunk] of every declared chunk when neither is given. It raises DrbChunkError if the topic declares no chunk, if chunk_name is unknown, if collection is unknown, or if both are given.


Working with a Chunk

Inspect the grid

chunk.name              # "data"
chunk.array.dims        # ("y", "x")
chunk.array.shape       # (10980, 10980)   — inferred from the source raster
chunk.array.dtype       # "uint16"
chunk.grid_shape        # (22, 22)         — ceil(shape / chunk_shape), RegularGrid only

list(chunk.tiles())     # [(0, 0), (0, 1), …]   — tile keys
chunk.tile((0, 0))      # ChunkRef(key=(0,0), source=…, window=((0,512),(0,512)))

Select (lazy — reads nothing)

select resolves the selection into tile keys via the scheme and returns a new Chunk over the narrowed manifest. chunk[sel] is sugar for chunk.select(sel).

from drb.chunk.selection import WindowSelection, IselSelection, BandSelection

# A pixel window (x, y, w, h); maps to the x/y dims of a RegularGrid.
roi = chunk.select(WindowSelection(x=512, y=0, w=512, h=512))

# Integer-position selection per dim: int -> single index, [start, stop] -> range.
sub = chunk.select(IselSelection({"y": [0, 1024], "x": [0, 512]}))

Materialise (this is where I/O happens)

import numpy as np

# Single tile -> numpy. Multi-tile numpy materialisation is rejected on purpose.
arr = roi.get_impl(np.ndarray)          # windowed rasterio read, shape (bands, h, w)

# Lazy, dask-backed xarray.DataArray (dims = ("band",) + array.dims).
import xarray as xr

xda = roi.to_xarray()
xda = roi.get_impl(xr.DataArray)        # equivalent

get_impl(np.ndarray) requires the selection to resolve to a single tile — otherwise it raises and points you to to_xarray(). (In v1, to_xarray() itself also only assembles a single tile; see Limitations.)


More usage examples

Windowed raster read (manual construction)

When you already hold a node whose get_impl(DatasetReader) opens the raster, you can build a Chunk directly. The windowed read returns exactly the same data as a full slice, without loading the whole image (proven in tests/test_integration_image.py):

from drb.chunk.chunk import Chunk
from drb.chunk.model import ChunkArray, RegularGridManifest
from drb.chunk.selection import WindowSelection
from drb.chunk.tiling import RegularGrid
import numpy as np

array = ChunkArray(dims=("y", "x"), shape=(1024, 1024), dtype="uint16",
                   scheme=RegularGrid(chunk_shape=(512, 512)))
chunk = Chunk(name="data", array=array, node=node,           # node.get_impl(DatasetReader)
              manifest=RegularGridManifest(array, node), reader="raster")

out = chunk.select(WindowSelection(x=512, y=0, w=512, h=512)).get_impl(np.ndarray)
# out == full_data[:, 0:512, 512:1024]

Remote read by byte range (HTTP partial GET)

When the source exposes io.BytesIO, the "range" reader fetches only the requested bytes — never the whole object (see tests/test_integration_remote.py):

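import numpy as np
from drb.chunk.chunk import Chunk
from drb.chunk.model import ChunkArray
from drb.chunk.tiling import RegularGrid

# remote_node and byte_manifest are assumed to exist already: a node whose
# get_impl(io.BytesIO) streams the remote object, and a manifest whose refs
# carry byte ranges.
array = ChunkArray(dims=("x",), shape=(20,), dtype="uint8",
                   scheme=RegularGrid(chunk_shape=(5,)))
chunk = Chunk(name="c", array=array, node=remote_node,   # ref.byte_range = (offset, length)
              manifest=byte_manifest, reader="range")

raw = chunk.get_impl(np.ndarray)   # BytesIO.seek(offset); read(length) → b"ABCDE"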
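
How a tile key turns into a partial request can be shown with a few lines of arithmetic. This sketch is independent of the add-on: range_header and read_range are illustrative helpers, the in-memory payload stands in for a remote object, and one byte per element (uint8) is assumed so that a tile of 5 elements is 5 bytes.

```python
import io


def range_header(offset, length):
    """HTTP Range header value for `length` bytes starting at `offset`.

    Byte positions in a Range header are inclusive, hence the -1.
    """
    if offset < 0 or length <= 0:
        raise ValueError("offset must be >= 0 and length must be > 0")
    return f"bytes={offset}-{offset + length - 1}"


def read_range(stream, offset, length):
    """Seek, then read exactly the requested span from a seekable stream."""
    stream.seek(offset)
    return stream.read(length)


payload = io.BytesIO(b"ABCDEFGHIJKLMNOPQRST")   # 20 bytes, tiles of 5 bytes
tile_key = 2
offset, length = tile_key * 5, 5                # key * chunk size, uint8 data

print(range_header(offset, length))             # bytes=10-14
print(read_range(payload, offset, length))      # b'KLMNO'
```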

Lazy xarray / dask compute

xda = roi.to_xarray()              # dask-backed DataArray, dims ("band", "y", "x")
result = xda.mean().compute()      # the deferred read fires only on .compute()

Selection types

All selections are pure, serialisable data (to_dict() / parse_selection()); turning them into tile keys is the TilingScheme's job.

Type | to_dict() shape | Resolvable by RegularGrid in v1
---- | --------------- | -------------------------------
IselSelection(per_dim) | {"isel": {"y": [0,1024], "x": 5}} | ✅
WindowSelection(x,y,w,h) | {"window": {"x":…, "y":…, "w":…, "h":…}} | ✅ (maps to x/y dims)
SelSelection(per_dim) | {"sel": {…}} (label/coord based) | ❌ (needs coords)
BandSelection(bands) | {"band": [0, 3]} | ❌
RangeSelection(offset,length) | {"range": {"offset":…, "length":…}} | ❌
SelectionAggregator(parts) | merged keys of its parts | per-part

Round-tripping:

from drb.chunk.selection import parse_selection

sel = parse_selection({"window": {"x": 0, "y": 0, "w": 256, "h": 256}})
sel.to_dict()           # {'window': {'x': 0, 'y': 0, 'w': 256, 'h': 256}}

# Several keys -> a SelectionAggregator composing leaf constraints.
parse_selection({"isel": {"y": [0, 512]}, "band": [0]})

Out-of-bounds or unknown-dimension selections raise DrbSelectionError.
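
The "pure data" idea behind the round trip can be sketched with dataclasses. The classes and the parse function below are simplified, hypothetical analogues of WindowSelection, IselSelection, SelectionAggregator and parse_selection; they cover only two selection types and raise ValueError where the add-on would raise DrbSelectionError.

```python
from dataclasses import dataclass


@dataclass
class Window:
    x: int
    y: int
    w: int
    h: int

    def to_dict(self):
        return {"window": {"x": self.x, "y": self.y, "w": self.w, "h": self.h}}


@dataclass
class Isel:
    per_dim: dict   # dim -> int, or [start, stop]

    def to_dict(self):
        return {"isel": dict(self.per_dim)}


@dataclass
class Aggregate:
    parts: list

    def to_dict(self):
        # Merged keys of the parts, one top-level key per leaf constraint.
        merged = {}
        for part in self.parts:
            merged.update(part.to_dict())
        return merged


PARSERS = {
    "window": lambda body: Window(**body),
    "isel": lambda body: Isel(dict(body)),
}


def parse(data):
    """One key -> a leaf selection; several keys -> an Aggregate."""
    unknown = [key for key in data if key not in PARSERS]
    if unknown or not data:
        raise ValueError(f"unknown or empty selection: {unknown}")
    parts = [PARSERS[key](body) for key, body in data.items()]
    return parts[0] if len(parts) == 1 else Aggregate(parts)


spec = {"window": {"x": 0, "y": 0, "w": 256, "h": 256}}
print(parse(spec).to_dict() == spec)          # True

combined = {"isel": {"y": [0, 512]}, "window": {"x": 0, "y": 0, "w": 8, "h": 8}}
print(type(parse(combined)).__name__)         # Aggregate
```

Because nothing in these objects refers to a grid, the same dictionary can be stored, sent elsewhere and replayed against any chunk whose scheme knows how to resolve it.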


Reader strategies

A ReaderStrategy materialises one ChunkRef. The strategy is chosen by the drb:reader hint, or automatically from what the ref carries:

  • RasterWindowReader ("raster") — for refs with a window. Calls the source's get_impl(rasterio.io.DatasetReader) and does a windowed dataset.read(window=…). The image driver is not modified.
  • ByteRangeReader ("range") — for refs with a byte_range. Calls the source's get_impl(io.BytesIO), seeks, and reads exactly length bytes — e.g. an HTTP partial GET.

If no hint is given and neither a window nor a byte_range is present, the add-on raises rather than reading the whole source.


Locators and interop

chunk.locator()
# {'source': '/data/…/B04.tif', 'topic': 'http://…#my-image',
#  'chunk': 'b04', 'selection': {'window': {...}} | None}

from drb.chunk import to_kerchunk
to_kerchunk(chunk)      # {'version': 1, 'refs': {'1.0': [path, offset, length], …}}

to_kerchunk emits a kerchunk reference-spec v1 dict. Because kerchunk addresses chunks by byte range, it only works for byte-range chunks; window-only (format-native) chunks raise DrbChunkError.
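
The shape of such a reference dict is simple enough to build by hand. The helper below is a sketch, not the add-on's to_kerchunk: to_reference_dict, the example path and the byte ranges are made up for illustration, and joining key components with "." is assumed from the '1.0' key shown above.

```python
import json


def to_reference_dict(path, byte_ranges):
    """Build a kerchunk-style v1 reference dict.

    byte_ranges maps a tile key tuple to (offset, length).
    """
    refs = {}
    for key, (offset, length) in byte_ranges.items():
        name = ".".join(str(k) for k in key)     # (1, 0) -> "1.0"
        refs[name] = [path, offset, length]
    return {"version": 1, "refs": refs}


spec = to_reference_dict(
    "/data/blob.bin",                            # hypothetical path
    {(0, 0): (0, 5), (1, 0): (10, 5)},
)
print(json.dumps(spec, sort_keys=True))
print(spec["refs"]["1.0"])                       # ['/data/blob.bin', 10, 5]
```

Every entry needs an offset and a length, which is why a chunk that only knows a format-native window has nothing to contribute to this structure.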


Exceptions

from drb.chunk import DrbChunkError, DrbSelectionError

  • DrbChunkError — base error for the add-on (unknown chunk name, no applicable reader, unsupported get_impl, v1 deferrals, …). Subclasses drb.exceptions.core.DrbException.
  • DrbSelectionError — unknown selection type or an out-of-bounds region. Subclasses DrbChunkError.

v1 limitations (intentional deferrals)

These are explicit, raise clear errors, and are tracked as follow-ups:

  • Tiling scheme: only "regular" (RegularGrid) is supported.
  • Multi-tile assembly: to_xarray() assembles a single tile; concat/mosaic across tiles is not yet implemented. get_impl(np.ndarray) likewise requires a single-tile selection.
  • Descriptor-level drb:selection: a default selection in the descriptor is rejected at build time — apply selections explicitly via Chunk.select().
  • Chunk.from_locator(): locator round-trip is deferred (locator() works).
  • Label/coord & band/range resolution: SelSelection, BandSelection and RangeSelection are not resolved by RegularGrid in v1.

Public API

from drb.chunk import (
    Chunk, ChunkAddon, ChunkArray, ChunkRef, ChunkManifest,
    Selection, parse_selection, TilingScheme, RegularGrid,
    to_kerchunk, DrbChunkError, DrbSelectionError, __version__,
)

# ChunkAddon public methods (singleton via AddonManager().get_addon("chunk")):
#   .can_apply(topic)                  -> bool
#   .available_chunks(topic)           -> list[tuple[str, dict]]
#   .available_collections(topic)      -> dict[str | None, list[str]]
#   .apply(node, *, chunk_name=…)      -> Chunk
#   .apply(node, *, collection=…)      -> list[Chunk]
#   .apply(node)                       -> list[Chunk]  (all declared chunks)

Development

python3 -m venv venv && source venv/bin/activate
pip install -e . -r requirements-test.txt
python3 -m unittest discover        # unittest suite under tests/

License

LGPLv3 — see LICENCE.txt.
