ODC Dataset File Cache

Dataset Cache

Random access cache of Dataset objects backed by disk storage.

  • Uses lmdb as key value store
    • UUID is the key
    • Compressed json blob is value
  • Uses zstandard compression (with pre-trained dictionaries)
    • Compresses well: the db is roughly 3 times the size of a .tar.gz of the dataset yaml files but, unlike a tar archive, it allows random access.
  • Keeps track of Product and Metadata objects
  • Has a concept of "groups" (used for GridWorkflow)
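
To see why a shared dictionary matters for small JSON blobs, here is a minimal sketch. It is not how odc-dscache is implemented: it uses the standard library's zlib preset dictionary as a stand-in for zstandard's trained dictionaries, and a plain dict as a stand-in for lmdb. The sample documents and the names compress, decompress and store are made up for illustration.

```python
import json
import uuid
import zlib

# Hypothetical dataset documents: small JSON blobs that share most of their keys.
docs = {
    str(uuid.uuid4()): {
        "product": {"name": "wofs_albers"},
        "format": {"name": "NetCDF"},
        "grid_spatial": {"projection": {"spatial_reference": "EPSG:3577"}},
        "tile_index": [i, -i],
    }
    for i in range(100)
}

# Preset dictionary taken from one sample document. zstandard would instead
# train its dictionary from many samples; this is only a stand-in.
sample = next(iter(docs.values()))
zdict = json.dumps(sample, sort_keys=True).encode("utf-8")


def compress(doc, zdict=None):
    """Serialize a document to JSON and compress it, optionally with a dictionary."""
    raw = json.dumps(doc, sort_keys=True).encode("utf-8")
    c = zlib.compressobj(level=9, zdict=zdict) if zdict else zlib.compressobj(level=9)
    return c.compress(raw) + c.flush()


def decompress(blob, zdict=None):
    """Inverse of compress(); the same dictionary must be supplied."""
    d = zlib.decompressobj(zdict=zdict) if zdict else zlib.decompressobj()
    return json.loads(d.decompress(blob) + d.flush())


# Key-value layout: UUID string -> compressed JSON blob (dict stands in for lmdb).
store = {key: compress(doc, zdict) for key, doc in docs.items()}

plain_size = sum(len(compress(doc)) for doc in docs.values())
dict_size = sum(len(blob) for blob in store.values())
print(f"without dictionary: {plain_size} bytes, with dictionary: {dict_size} bytes")
```

Each value can be decompressed on its own given its key, which is what makes random access possible where a tar archive would have to be scanned.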

Installation

pip install odc-dscache

Exporting from Datacube

Using command line app

There is a CLI tool called slurpy that can export a set of products to a file:

> slurpy --help
Usage: slurpy [OPTIONS] OUTPUT [PRODUCTS]...

Options:
  -E, --env TEXT  Datacube environment name
  -z INTEGER      Compression setting for zstandard 1-fast, 9+ good but slow
  --help          Show this message and exit.

Note that this app is not affected by issue#542, as it implements a properly lazy SQL query using cursors.
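
The idea of a lazy, cursor-based query can be sketched as follows. This is an illustration only, not slurpy's code: it uses the standard library's sqlite3 in place of postgres, and the table layout and the stream_rows helper are hypothetical.

```python
import sqlite3


def stream_rows(conn, sql, batch_size=1000):
    """Yield rows one at a time, fetching them from the cursor in batches.

    Only `batch_size` rows are held in memory at once, rather than the
    whole result set.
    """
    cur = conn.cursor()
    try:
        cur.execute(sql)
        while True:
            rows = cur.fetchmany(batch_size)
            if not rows:
                break
            yield from rows
    finally:
        cur.close()


# Hypothetical table standing in for the datacube index.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE dataset (id TEXT PRIMARY KEY, product TEXT)")
conn.executemany(
    "INSERT INTO dataset VALUES (?, ?)",
    [(f"ds-{i:04d}", "wofs_albers") for i in range(2500)],
)

rows = stream_rows(conn, "SELECT id, product FROM dataset ORDER BY id")
first = next(rows)  # the query runs only once the generator is consumed
total = 1 + sum(1 for _ in rows)
```

With a real postgres server, a server-side (named) cursor would play the role that fetchmany plays here.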

From python code

import datacube
from odc import dscache

dc = datacube.Datacube()

# create new file db, deleting old one if exists
cache = dscache.create_cache('sample.db', truncate=True)

# example search terms only, substitute your own query
query = {"product": "wofs_albers"}

# lazy dataset stream from some query
dss = dc.find_datasets_lazy(**query)

# tee off dataset stream into db file
dss = cache.tee(dss)

# then just process the stream of datasets
for ds in dss:
    do_stuff_with(ds)

# finally you can call `.close`
cache.close()
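
The "tee" pattern used above can be sketched with a plain generator. This is a guess at the general shape, not the library's implementation; the tee function and the save callback here are hypothetical names.

```python
from typing import Callable, Iterable, Iterator, TypeVar

T = TypeVar("T")


def tee(stream: Iterable[T], save: Callable[[T], None]) -> Iterator[T]:
    """Yield every item of `stream` unchanged, calling `save` on each one first."""
    for item in stream:
        save(item)
        yield item


saved = []
lazy = tee(iter(["ds-a", "ds-b", "ds-c"]), saved.append)
nothing_saved_yet = list(saved)  # still empty: the generator has not run

# Items are saved as a side effect of the consumer iterating over the stream.
processed = [ds.upper() for ds in lazy]
```

Because saving happens only as the stream is consumed, datasets that the loop never reaches are never written.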

Reading from a file database

By default the database file is assumed to be read-only. If, however, another process is writing to the db while this process is reading, you have to pass an extra argument: open_ro(.., lock=True). Avoid doing this over a network file system.

from odc import dscache

cache = dscache.open_ro("sample.db")

# access individual dataset: returns None if not found
ds = cache.get("005b0ab7-5454-4eef-829d-ed081135aefb")
if ds is not None:
    do_stuff_with(ds)

# stream all datasets
for ds in cache.get_all():
    do_stuff_with(ds)

For more details see notebook.
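
Since .get returns None for unknown ids, a bulk lookup has to handle misses explicitly. A minimal sketch, assuming only the .get behaviour described above; fetch_many is a hypothetical helper and FakeCache exists only so the snippet runs without a database file.

```python
def fetch_many(cache, uuids):
    """Look up several ids, returning (found, missing).

    `found` maps id -> dataset, `missing` lists ids the cache does not have.
    """
    found, missing = {}, []
    for uid in uuids:
        ds = cache.get(uid)
        if ds is None:
            missing.append(uid)
        else:
            found[uid] = ds
    return found, missing


class FakeCache:
    """Stand-in for an opened cache: only implements .get(uuid)."""

    def __init__(self, datasets):
        self._datasets = dict(datasets)

    def get(self, uid):
        return self._datasets.get(uid)


cache = FakeCache({"005b0ab7-5454-4eef-829d-ed081135aefb": {"product": "wofs_albers"}})
found, missing = fetch_many(
    cache,
    ["005b0ab7-5454-4eef-829d-ed081135aefb", "00000000-0000-0000-0000-000000000000"],
)
```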

Groups

A group is a collection of datasets that are somehow related. It is essentially a simple index: a list of uuids stored under some name. For example, we might want to group all datasets that overlap a certain Albers tile into a group named albers/{x}_{y}. The .groups() method lists all group names, .put_group(name, list_of_uuids) adds a new group, and .stream_group(group_name) reads all datasets that belong to a given group.

  • Get list of group names and their population counts: .groups() -> List[(name, count)]
  • Get datasets for a given group: .stream_group(group_name) -> lazy sequence of Dataset objects
  • To get just uuids: .get_group(group_name) -> List[UUID]
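
As an illustration of the group API, the sketch below builds albers/{x}_{y} groups from (uuid, tile) pairs and stores them. The method names put_group, groups and get_group come from the text above; group_by_tile, the tile values and the FakeCache stand-in are hypothetical, and the real group name formatting may differ.

```python
from collections import defaultdict


def group_by_tile(tiles):
    """Turn (uuid, (x, y)) pairs into {group_name: [uuid, ...]}."""
    groups = defaultdict(list)
    for uid, (x, y) in tiles:
        groups[f"albers/{x}_{y}"].append(uid)
    return dict(groups)


class FakeCache:
    """In-memory stand-in that mimics the group methods described above."""

    def __init__(self):
        self._groups = {}

    def put_group(self, name, uuids):
        self._groups[name] = list(uuids)

    def groups(self):
        return [(name, len(uuids)) for name, uuids in self._groups.items()]

    def get_group(self, name):
        return self._groups[name]


# Made-up ids and tile indexes.
tiles = [("uuid-1", (15, -40)), ("uuid-2", (15, -40)), ("uuid-3", (16, -40))]

cache = FakeCache()
for name, uuids in group_by_tile(tiles).items():
    cache.put_group(name, uuids)
```

Only the uuids are stored per group; the datasets themselves stay in the key-value store and are looked up when a group is streamed.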

There is a CLI tool called dstiler that can group datasets based on GridSpec:

Usage: dstiler [OPTIONS] DBFILE

  Add spatial grouping to file db.

  Default grid is Australian Albers (EPSG:3577) with 100k by 100k tiles. But
  you can also group by Landsat path/row (--native), or Google's map tiling
  regime (--web zoom_level)

Options:
  --native         Use Landsat Path/Row as grouping
  --native-albers  When datasets are in Albers grid already
  --web INTEGER    Use web map tiling regime at supplied zoom level
  --help           Show this message and exit.
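
For reference, the web map tiling scheme assigns a point to a tile at a given zoom level with the standard Web Mercator formula sketched below. This only illustrates the tiling regime itself; how dstiler maps whole dataset footprints to tiles, and how it names the resulting groups, is not shown here, and web_tile_index is a hypothetical helper.

```python
import math


def web_tile_index(lon, lat, zoom):
    """Return the (x, y) web map tile containing a lon/lat point in degrees.

    Valid for latitudes within the Web Mercator range (about +/-85.05).
    """
    n = 2 ** zoom  # number of tiles along each axis at this zoom level
    lat_rad = math.radians(lat)
    x = int((lon + 180.0) / 360.0 * n)
    y = int((1.0 - math.asinh(math.tan(lat_rad)) / math.pi) / 2.0 * n)
    # Clamp so that points on the far edge fall into the last tile.
    return min(max(x, 0), n - 1), min(max(y, 0), n - 1)


print(web_tile_index(0.0, 0.0, 1))
```

Tile y grows southwards, so higher latitudes map to smaller y values.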

Note that unlike tools like datacube-stats --save-tasks that rely on GridWorkflow.group_into_cells, dstiler is capable of processing large datasets: it does not keep the entire Dataset object in memory for every dataset observed. Only the UUID is kept in RAM until completion, which drastically reduces RAM usage. There is also an optimization for ingested products. These are already tiled into Albers tiles, so rather than doing relatively expensive geometry overlap checks we can simply extract the Albers tile index directly from the Dataset's .metadata.grid_spatial property. To use this option, supply --native-albers to the dstiler app.

Notes on performance

It took 26 minutes to slurp 2,627,779 wofs datasets from a local postgres server on AWS (r4.xlarge). This generated a 1.4G database file.

Command being timed: "slurpy -E wofs wofs.db :all:"
User time (seconds): 1037.93
System time (seconds): 48.77
Percent of CPU this job got: 69%
Elapsed (wall clock) time (h:mm:ss or m:ss): 26:04.79

Adding Albers tile grouping to this took just over 4 minutes, a processing rate of ~10.6K datasets per second.

Command being timed: "dstiler --native-albers wofs.db"
User time (seconds): 234.57
System time (seconds): 2.65
Percent of CPU this job got: 95%
Elapsed (wall clock) time (h:mm:ss or m:ss): 4:08.70

A similar workload on a VDI node (2,747,870 wofs datasets from the main db) took 23 minutes to dump all datasets from the DB and 7 minutes to tile into the Albers grid using the "native grid" optimization. Read throughput from the file db on the VDI node is slower than on AWS, but is still a respectable 6.5K datasets per second. The database file was somewhat bigger too, 2G vs 1.4G on AWS; perhaps the zstandard library differs significantly between the two systems.

Command being timed: "slurpy wofs.db wofs_albers"
User time (seconds): 1077.74
System time (seconds): 49.75
Percent of CPU this job got: 81%
Elapsed (wall clock) time (h:mm:ss or m:ss): 23:01.20

Command being timed: "dstiler --native-albers wofs.db"
User time (seconds): 408.65
System time (seconds): 6.28
Percent of CPU this job got: 98%
Elapsed (wall clock) time (h:mm:ss or m:ss): 7:03.22
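
The throughput figures quoted in this section follow directly from the dataset counts and elapsed times in the timing output. A small sketch of the arithmetic; the elapsed_seconds helper and the run labels are just for illustration.

```python
def elapsed_seconds(text):
    """Convert an 'h:mm:ss' or 'm:ss' wall clock string to seconds."""
    seconds = 0.0
    for part in text.split(":"):
        seconds = seconds * 60 + float(part)
    return seconds


# (dataset count, elapsed wall clock time) taken from the timing output.
runs = {
    "AWS slurpy": (2_627_779, "26:04.79"),
    "AWS dstiler": (2_627_779, "4:08.70"),
    "VDI slurpy": (2_747_870, "23:01.20"),
    "VDI dstiler": (2_747_870, "7:03.22"),
}

rates = {name: count / elapsed_seconds(t) for name, (count, t) in runs.items()}
for name, rate in rates.items():
    print(f"{name}: {rate:,.0f} datasets/s")
```

The two dstiler rates come out at roughly 10.6K and 6.5K datasets per second, matching the figures in the text.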

I'd like to point out that grouping datasets into grids could well happen during the slurpy process without adding much overhead, so two-step processing is not strictly necessary.
