Lightweight framework for geospatial data access and processing

pygeodata

Reproducible geospatial pipelines, locally.

pygeodata is built around the declarative programming principle: you should be able to describe what data you want, not how to produce it. You declare a spatial spec — a CRS, transform and shape — once. Every dataset and figure in your project receives that spec and knows how to adapt. The framework is the orchestrator: it decides whether to compute or return from cache, resolves reprojection, propagates invalidation when code changes, and keeps a permanent record of what ran and when.

This is the declarative model that SQL offers for databases, applied to local geospatial analysis in Python.

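from dataclasses import dataclass

from osgeo import gdal

# Assumption: `process` is exported next to `load`; adjust the import if it lives elsewhere.
from pygeodata import Data, SpatialSpec, get_config, load, process
from pygeodata.processors import Reprojector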

get_config().update(spec=SpatialSpec.from_raster_file('reference.tif'))

@dataclass
class ElevationLoader(Data):
    @property
    def processor(self):
        return Reprojector('data/elevation.tif')

@dataclass
class SlopeLoader(Data):
    def _process(self, spec):
        process(ElevationLoader(), spec)
        gdal.DEMProcessing(self.ensure_processed_path(spec), ElevationLoader().get_processed_path(spec), 'slope')

# Both lines check the cache, recompute only what's stale, and return the result.
# Change ElevationLoader's code → both caches are invalidated automatically.
elevation = load(ElevationLoader())
slope     = load(SlopeLoader())

How it works

Every Data or Figure class is fingerprinted by hashing the AST of its source code, the ASTs of all its transitive dependencies, and its parameter values. That fingerprint, combined with the spatial spec, is the cache key. If it matches what's on disk, the cached file is returned. If not, the class is reprocessed and the new fingerprint is written.

The hash is AST-based, not text-based: reformatting your code or editing a comment never triggers a rerun.
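
To illustrate why an AST-based hash ignores layout, here is a minimal sketch using only the standard library. It is not pygeodata's implementation; `ast_hash` is a hypothetical helper, and the real fingerprint may normalise the tree differently (for example, this sketch still treats a docstring edit as a change, because docstrings are part of the AST).

```python
import ast
import hashlib


def ast_hash(source: str) -> str:
    """Hash the parsed syntax tree rather than the raw text (hypothetical helper)."""
    tree = ast.parse(source)
    # ast.dump omits line/column attributes by default, and comments never
    # reach the AST, so layout-only edits produce the same dump.
    return hashlib.sha256(ast.dump(tree).encode("utf-8")).hexdigest()


original = "def slope(dem):\n    return dem * 2\n"
reformatted = "def slope( dem ):  # a comment\n\n    return dem*2\n"
changed = "def slope(dem):\n    return dem * 3\n"

same_after_reformat = ast_hash(original) == ast_hash(reformatted)
differs_after_edit = ast_hash(original) != ast_hash(changed)
```

Whitespace and comments leave the hash untouched, while changing the constant from 2 to 3 produces a new one.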

source_hash      SHA256(AST of the class)
dep_tree_hash    SHA256(source_hash of all transitive dependencies)
instance_hash    SHA256(dep_tree_hash + params)           
state_hash       SHA256(instance_hash + spec)             
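
The chain above can be sketched with `hashlib`. This is an illustration under stated assumptions, not the library's code: the helper names, the JSON serialisation of params and spec, the separator byte, and the choice to include the class's own source hash in the dependency tree are all assumptions.

```python
import hashlib
import json


def sha256(*parts: str) -> str:
    """Hash several strings, separated so ("ab", "c") differs from ("a", "bc")."""
    h = hashlib.sha256()
    for part in parts:
        h.update(part.encode("utf-8"))
        h.update(b"\x00")
    return h.hexdigest()


def canonical(obj) -> str:
    """Serialise params or a spec deterministically (sorted keys)."""
    return json.dumps(obj, sort_keys=True, default=str)


# Hypothetical inputs standing in for real AST dumps, params and specs.
source_hashes = [sha256("<AST dump of the class>"), sha256("<AST dump of a dependency>")]
params = {"moment": "mean"}
spec_a = {"crs": "EPSG:3035", "shape": [100, 200]}
spec_b = {"crs": "EPSG:4326", "shape": [100, 200]}

dep_tree_hash = sha256(*sorted(source_hashes))
instance_hash = sha256(dep_tree_hash, canonical(params))
state_a = sha256(instance_hash, canonical(spec_a))
state_b = sha256(instance_hash, canonical(spec_b))
```

Because each level feeds the next, the same class and params under two different specs share an `instance_hash` but get distinct `state_hash` values, which is what lets one loader be cached once per grid.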

Everything lands in a content-addressed store:

data_processed/
  {state_hash}/
    elevation_loader.tif
    meta.json                       ← cache key + metadata
    parameters.json                 ← params at time of run
    spec.json                       ← spec at time of run

.source/
  code/{source_hash}/
    source.py                       ← every version of every class
    source.json
  snapshots/{dep_tree_hash}/
    tree.json                       ← full dependency tree at time of run
    graph.pdf
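
A content-addressed lookup of this kind can be sketched as follows. The function `load_or_compute` and the contents of `meta.json` are hypothetical; the sketch only shows the "directory named by the hash exists, so skip the work" idea, not how pygeodata actually reads or writes its store.

```python
import json
import tempfile
from pathlib import Path


def load_or_compute(root: Path, state_hash: str, name: str, compute):
    """Return the cached file if its hash directory exists, otherwise build it."""
    entry = root / state_hash
    output = entry / name
    if output.exists():
        return output, "cached"
    entry.mkdir(parents=True, exist_ok=True)
    output.write_bytes(compute())
    (entry / "meta.json").write_text(json.dumps({"state_hash": state_hash}))
    return output, "computed"


calls = []


def fake_compute() -> bytes:
    """Stand-in for an expensive raster computation."""
    calls.append(1)
    return b"raster bytes"


with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp) / "data_processed"
    _, first = load_or_compute(root, "cc71cf42816b", "elevation_loader.tif", fake_compute)
    _, second = load_or_compute(root, "cc71cf42816b", "elevation_loader.tif", fake_compute)
    _, third = load_or_compute(root, "0123456789ab", "elevation_loader.tif", fake_compute)
```

The second call hits the existing directory and never invokes `fake_compute`; the third call uses a different hash, so it is treated as a new entry.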

Defining loaders

Subclass Data for datasets and Figure for plots. Use a processor property for standard reprojection or rasterization, or override _process directly for anything else. Parameters are plain instance attributes, so dataclasses are a natural fit.

# Delegate to the built-in Reprojector
@dataclass
class LAILoader(Data):
    moment: str   # 'min', 'mean', 'max', ...

    @property
    def processor(self):
        return Reprojector(
            f'data/lai/LAI_{self.moment}.vrt',
            resampling=Resampling.average,
            scales=0.1,
        )

# Full control
class SlopeLoader(Data):
    driver = RioXArrayDriver()

    def _process(self, spec):
        process(ElevationLoader(), spec)
        gdal.DEMProcessing(
            self.ensure_processed_path(spec),
            ElevationLoader().get_processed_path(spec),
            'slope',
        )

The SpatialSpec

SpatialSpec holds the CRS, an affine transform, and a pixel shape. It can be partially defined (for example, just a CRS with no transform), and processors can then resolve the missing fields from the source file. The spec flows through the pipeline automatically: you set it once in config and never pass it manually unless you need to.

# From an existing raster
spec = SpatialSpec.from_raster_file('reference.tif')

# Or build it explicitly
spec = SpatialSpec(
    crs=CRS.from_string('EPSG:3035'),
    transform=Affine.translation(left, top) * Affine.scale(1000, -1000),
    shape=(height, width),
)

get_config().update(spec=spec)

# After that, just ask for things
load(ElevationLoader())   # spec injected automatically
load(LAILoader('mean'))

# or ask for a specific spec
load(ElevationLoader(), spec)
load(LAILoader('mean'), spec)
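
To make the explicit transform concrete: assuming `Affine` comes from the `affine` package, `Affine.translation(left, top) * Affine.scale(1000, -1000)` maps a pixel (col, row) to x = left + 1000 * col and y = top - 1000 * row. The sketch below reproduces that arithmetic in plain Python to derive the extent covered by a given shape. The helper names and coordinate values are made up for illustration.

```python
def pixel_to_xy(col, row, left, top, xres=1000.0, yres=-1000.0):
    """Map the upper-left corner of pixel (col, row) to map coordinates."""
    return left + col * xres, top + row * yres


def bounds(left, top, shape, xres=1000.0, yres=-1000.0):
    """Return (left, bottom, right, top) for a grid of the given (height, width)."""
    height, width = shape
    right, bottom = pixel_to_xy(width, height, left, top, xres, yres)
    return left, bottom, right, top


# Hypothetical origin and grid size: 200 rows by 300 columns of 1 km pixels.
left, top = 4_000_000.0, 3_000_000.0
shape = (200, 300)
extent = bounds(left, top, shape)
# extent == (4000000.0, 2800000.0, 4300000.0, 3000000.0)
```

The negative y scale is what makes row indices grow downward from the top edge, which is the usual convention for north-up rasters.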

Composing pipelines

Data instances can be parameters of other Data instances. The full dependency graph is tracked — changing any upstream class invalidates all downstream caches automatically.

@dataclass
class FeatureCorrelationLoader(Data):
    feature: Data    # e.g. ElevationLoader()
    variable: Data

    def _process(self, spec):
        feat = load(self.feature, spec)
        var  = load(self.variable, spec)
        # ... compute correlation ...
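
How an upstream edit reaches downstream caches can be sketched with a recursive hash over the dependency graph. This is a toy model, not the library's tracker: class names are reused from this README for readability, the version strings stand in for AST hashes, and the graph is assumed to be acyclic.

```python
import hashlib


def tree_hash(name: str, sources: dict, deps: dict) -> str:
    """Hash a node's own source together with the hashes of everything upstream."""
    h = hashlib.sha256(sources[name].encode("utf-8"))
    for dep in sorted(deps.get(name, [])):
        h.update(tree_hash(dep, sources, deps).encode("ascii"))
    return h.hexdigest()


# Hypothetical graph: which class depends on which.
deps = {
    "SlopeLoader": ["ElevationLoader"],
    "FeatureCorrelationLoader": ["SlopeLoader", "LAILoader"],
}
sources = {
    "ElevationLoader": "v1",
    "LAILoader": "v1",
    "SlopeLoader": "v1",
    "FeatureCorrelationLoader": "v1",
}

before = {name: tree_hash(name, sources, deps) for name in sources}
sources["ElevationLoader"] = "v2"  # edit the upstream class
after = {name: tree_hash(name, sources, deps) for name in sources}
stale = sorted(name for name in sources if before[name] != after[name])
# stale == ['ElevationLoader', 'FeatureCorrelationLoader', 'SlopeLoader']
```

Only the classes downstream of the edit change hash; `LAILoader` sits on an unrelated branch and keeps its cache.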

Co-outputs

When one computation produces multiple outputs — a regression that yields a slope, standard error, and p-value simultaneously — _process can yield sibling artifacts instead of returning. Each yielded artifact gets its own cache entry and is independently re-usable.

from typing import Literal

@dataclass
class RegressionLoader(Data):
    param: Literal['beta', 'se', 'p']

    def _process(self, spec):
        beta, se, p = run_regression(spec)
        write(RegressionLoader('beta'), beta)
        write(RegressionLoader('se'), se)
        write(RegressionLoader('p'), p)
        yield RegressionLoader('beta')
        yield RegressionLoader('se')
        yield RegressionLoader('p')
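
The yield-based pattern can be modelled in a few lines. The sketch below is a toy, not pygeodata's runner: `Artifact`, `toy_load`, the in-memory `store`, and the regression numbers are all invented to show that one run can populate several independently retrievable cache entries.

```python
import inspect


class Artifact:
    """Hypothetical stand-in for a parametrised Data instance."""

    def __init__(self, param):
        self.param = param

    def key(self):
        return f"regression:{self.param}"


store = {}  # stand-in for files on disk, keyed per artifact
runs = []   # records each time the expensive step executes


def process_regression(requested):
    """Compute all three outputs at once and yield each sibling."""
    runs.append(requested.param)
    results = {"beta": 0.8, "se": 0.1, "p": 0.03}  # made-up numbers
    for param, value in results.items():
        sibling = Artifact(param)
        store[sibling.key()] = value
        yield sibling


def toy_load(artifact):
    """Run the processor only on a cache miss, registering every yielded sibling."""
    if artifact.key() not in store:
        outcome = process_regression(artifact)
        if inspect.isgenerator(outcome):
            for _sibling in outcome:  # draining the generator writes each sibling
                pass
    return store[artifact.key()]


beta = toy_load(Artifact("beta"))
p_value = toy_load(Artifact("p"))
```

Requesting `beta` triggers the single regression run; the later request for `p` is served from the entry written as a sibling, so `runs` contains only one item.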

Figures

Figure works identically to Data — same caching, same hashing, same registry — but outputs go to figures/ and the default extension is png. There is no load() method; figures are outputs only.

@dataclass
class FigureElevation(Figure):
    def _process(self, spec):
        da = load(ElevationLoader(), spec)
        da.plot()
        plt.savefig(self.ensure_processed_path(spec), dpi=300)
        plt.close()

The registry browser

pygeodata browse opens a local web UI that shows the full state of your project: classes, cached results, and every version of each class that has run.

  • Classes — all tracked subclasses, with staleness indicators when code or dependencies have changed since the last run
  • Entries — every (class × params × spec) combination processed, with parameter tables, spatial spec details, and links to output files
  • Code view — the full version history of any class; browse what the code looked like when a specific result was produced

Cache management

from pygeodata import load_from_hash

# Load any cached output by its (truncated) state hash — no re-running required.
da = load_from_hash('cc71cf42816b')

from pygeodata import clean_cache

clean_cache(dry_run=True)    # preview what would be deleted
clean_cache(dry_run=False)   # delete stale entries

clean_cache walks data_processed/ and figures/ and removes any directory whose state hash no longer matches the live hash of the corresponding class. Classes that have been renamed or removed from the codebase are flagged for manual confirmation.
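
The dry-run behaviour described here can be sketched with `pathlib` and `shutil`. This is an illustration only: `clean_stale` is a hypothetical function, the set of live hashes is passed in by hand rather than derived from classes, and the handling of renamed or removed classes is left out.

```python
import shutil
import tempfile
from pathlib import Path


def clean_stale(root: Path, live_hashes: set, dry_run: bool = True) -> list:
    """List hash directories that are no longer live; delete them unless dry_run."""
    stale = sorted(
        p.name for p in root.iterdir() if p.is_dir() and p.name not in live_hashes
    )
    if not dry_run:
        for name in stale:
            shutil.rmtree(root / name)
    return stale


with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp) / "data_processed"
    for name in ("aaa111", "bbb222", "ccc333"):  # made-up state hashes
        (root / name).mkdir(parents=True)
    live = {"bbb222"}

    preview = clean_stale(root, live, dry_run=True)
    count_after_preview = len(list(root.iterdir()))
    removed = clean_stale(root, live, dry_run=False)
    remaining = sorted(p.name for p in root.iterdir())
```

The preview reports the same two directories that the real pass later removes, and nothing on disk changes until `dry_run=False`.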

from pygeodata.cache import clean_source_registry

clean_source_registry(dry_run=True)    # preview orphaned .source/ entries
clean_source_registry(dry_run=False)   # delete them

clean_source_registry removes orphaned code snapshots and dependency trees from .source/. It keeps the latest snapshot per class and anything referenced by a live cache entry — everything else is prunable.
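
The retention rule (keep the latest snapshot per class, plus anything a live cache entry references) comes down to a small set computation. The sketch below assumes a hypothetical record format of (class name, source hash, timestamp); the actual layout of .source/ metadata may differ.

```python
def prunable(snapshots, referenced):
    """Return source hashes that are neither the latest for their class nor referenced."""
    latest = {}
    for class_name, source_hash, recorded_at in snapshots:
        if class_name not in latest or recorded_at > latest[class_name][1]:
            latest[class_name] = (source_hash, recorded_at)
    keep = {source_hash for source_hash, _ in latest.values()} | set(referenced)
    return sorted(h for _, h, _ in snapshots if h not in keep)


# Hypothetical history: three versions of one class, two of another.
snapshots = [
    ("ElevationLoader", "e1", 1),
    ("ElevationLoader", "e2", 2),
    ("ElevationLoader", "e3", 3),
    ("SlopeLoader", "s1", 1),
    ("SlopeLoader", "s2", 4),
]
referenced = {"e1"}  # still used by a live cache entry
to_delete = prunable(snapshots, referenced)
# to_delete == ['e2', 's1']
```

Here `e1` survives despite being the oldest version, because a cached result still points at it.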

from pygeodata.cache import rebuild_registry

rebuild_registry()   # wipe and rewrite all .source/ entries from current classes

Installation

pip install pygeodata

Optional extras:

# Quote the requirement so shells such as zsh don't expand the brackets
pip install "pygeodata[viz]"        # dependency graph plots (graphviz)
pip install "pygeodata[dashboard]"  # registry browser (Flask)

License

MIT
