ODC Dataset File Cache

Dataset Cache

Random access cache of Dataset objects backed by disk storage.

  • Uses lmdb as key value store
    • UUID is the key
    • Compressed json blob is value
  • Uses zstandard compression (with pre-trained dictionaries)
    • Compresses well: the db is roughly 3 times larger than a .tar.gz of the dataset yaml files but, unlike a tar archive, it allows random access.
  • Keeps track of Product and Metadata objects
  • Has a concept of "groups" (used for GridWorkflow)
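
The storage layout described above can be illustrated with a minimal sketch that uses only the Python standard library. It is not the library's implementation: zlib with a preset dictionary stands in for zstandard with a pre-trained dictionary, a plain dict stands in for lmdb, and the sample documents and the helper names pack and unpack are hypothetical.

```python
import json
import uuid
import zlib

# Hypothetical sample documents: small JSON blobs that share most of their keys.
docs = {
    str(uuid.uuid4()): {"product": "wofs_albers", "crs": "EPSG:3577", "tile": [x, -40]}
    for x in range(5)
}

# The shared dictionary here is simply one typical document; zstandard would
# train its dictionary from many samples instead.
zdict = json.dumps({"product": "wofs_albers", "crs": "EPSG:3577", "tile": [0, -40]}).encode()


def pack(doc: dict) -> bytes:
    """Serialise to JSON and compress against the shared dictionary."""
    comp = zlib.compressobj(level=9, zdict=zdict)
    return comp.compress(json.dumps(doc).encode()) + comp.flush()


def unpack(blob: bytes) -> dict:
    """Reverse of pack(); needs the same dictionary."""
    decomp = zlib.decompressobj(zdict=zdict)
    return json.loads(decomp.decompress(blob) + decomp.flush())


# A plain dict stands in for the key-value store: UUID -> compressed blob.
store = {key: pack(doc) for key, doc in docs.items()}

# Random access: decompress one record without touching the others.
some_key = next(iter(store))
assert unpack(store[some_key]) == docs[some_key]
```

Because every record is compressed on its own, any single UUID can be looked up and decoded independently, which is what a tar archive cannot offer. The shared dictionary is what keeps such small blobs from compressing poorly.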

Installation

pip install --extra-index-url="https://packages.dea.ga.gov.au" odc_dscache

Exporting from Datacube

Using command line app

There is a CLI tool called slurpy that can export a set of products to a file:

> slurpy --help
Usage: slurpy [OPTIONS] OUTPUT [PRODUCTS]...

Options:
  -E, --env TEXT  Datacube environment name
  -z INTEGER      Compression setting for zstandard 1-fast, 9+ good but slow
  --help          Show this message and exit.

Note that this app is not affected by issue#542, as it implements a properly lazy SQL query using cursors.

From python code

import datacube
from odc import dscache

# connect to the datacube index that will be queried
dc = datacube.Datacube()

# create new file db, deleting old one if exists
cache = dscache.create_cache('sample.db', truncate=True)

# dataset stream from some query (query arguments elided)
dss = dc.find_datasets_lazy(..)

# tee off dataset stream into db file
dss = cache.tee(dss)

# then just process the stream of datasets
for ds in dss:
    do_stuff_with(ds)

# finally, close the cache once the stream has been consumed
cache.close()
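
The "tee" pattern used above can be sketched as a plain generator. This is an illustration of the idea only, not the code behind cache.tee; the function name tee_into and the string stand-ins for datasets are hypothetical.

```python
from typing import Callable, Iterable, Iterator, TypeVar

T = TypeVar("T")


def tee_into(items: Iterable[T], save: Callable[[T], None]) -> Iterator[T]:
    """Yield every item unchanged, saving each one as a side effect."""
    for item in items:
        save(item)  # record the item before handing it on
        yield item


saved = []
stream = tee_into(iter(["ds-a", "ds-b", "ds-c"]), saved.append)

# Nothing is saved until the stream is consumed.
assert saved == []

processed = [name.upper() for name in stream]
```

A consequence of this design is that items are only written as the consumer iterates, so a stream that is abandoned halfway leaves a partial copy behind.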

Reading from a file database

By default the database file is assumed to be read-only. If another process is writing to the db while this process is reading it, you have to pass the extra argument lock=True: open_ro(.., lock=True). Avoid doing this over a network file system.

from odc import dscache

cache = dscache.open_ro('sample.db')

# access individual dataset: returns None if not found
ds = cache.get('005b0ab7-5454-4eef-829d-ed081135aefb')
if ds is not None:
    do_stuff_with(ds)

# stream all datasets
for ds in cache.get_all():
    do_stuff_with(ds)

For more details see notebook.

Groups

A group is a collection of datasets that are related in some way. It is essentially a simple index: a list of UUIDs stored under a name. For example, we might group all datasets that overlap a certain Albers tile under the name albers/{x}_{y}. A new group is added with .put_group(name, list_of_uuids). Groups are read back with the following methods:

  • Get list of group names and their population counts: .groups() -> List[(name, count)]
  • Get datasets for a given group: .stream_group(group_name) -> lazy sequence of Dataset objects
  • To get just uuids: .get_group(group_name) -> List[UUID]
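
To make the "simple index" idea concrete, here is a toy in-memory model. The class ToyGroupIndex is hypothetical and is not part of odc_dscache; it only mirrors the method names listed above so the relationship between them is easy to see.

```python
from typing import Dict, Iterator, List, Tuple
from uuid import UUID


class ToyGroupIndex:
    """In-memory stand-in: datasets by UUID plus named lists of UUIDs."""

    def __init__(self, datasets: Dict[UUID, dict]) -> None:
        self._datasets = datasets
        self._groups: Dict[str, List[UUID]] = {}

    def put_group(self, name: str, uuids: List[UUID]) -> None:
        self._groups[name] = list(uuids)

    def groups(self) -> List[Tuple[str, int]]:
        return [(name, len(uuids)) for name, uuids in self._groups.items()]

    def get_group(self, name: str) -> List[UUID]:
        return list(self._groups[name])

    def stream_group(self, name: str) -> Iterator[dict]:
        # Look up each member lazily; nothing is loaded until iterated.
        for uid in self._groups[name]:
            yield self._datasets[uid]


a, b, c = (UUID(int=i) for i in (1, 2, 3))
index = ToyGroupIndex({a: {"id": "a"}, b: {"id": "b"}, c: {"id": "c"}})
index.put_group("albers/15_-40", [a, b])
index.put_group("albers/16_-40", [c])

group_sizes = index.groups()
members = [ds["id"] for ds in index.stream_group("albers/15_-40")]
```

Storing only UUIDs per group keeps the index small, and the full records are resolved on demand when a group is streamed.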

There is a CLI tool, dstiler, that can group datasets based on GridSpec:

Usage: dstiler [OPTIONS] DBFILE

  Add spatial grouping to file db.

  Default grid is Australian Albers (EPSG:3577) with 100k by 100k tiles. But
  you can also group by Landsat path/row (--native), or Google's map tiling
  regime (--web zoom_level)

Options:
  --native         Use Landsat Path/Row as grouping
  --native-albers  When datasets are in Albers grid already
  --web INTEGER    Use web map tiling regime at supplied zoom level
  --help           Show this message and exit.

Note that unlike tools such as datacube-stats --save-tasks, which rely on GridWorkflow.group_into_cells, dstiler can process large collections of datasets because it does not keep the entire Dataset object in memory for every dataset observed. Only the UUID is kept in RAM until completion, which drastically reduces RAM usage.

There is also an optimization for ingested products. These are already tiled into Albers tiles, so rather than doing relatively expensive geometry overlap checks, dstiler can extract the Albers tile index directly from the Dataset's .metadata.grid_spatial property. To use this option, supply --native-albers to the dstiler app.
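
The memory-saving approach can be sketched as a single pass over a dataset stream that retains only identifiers. This is a simplified illustration, not dstiler's code. It assumes that each record carries the lower-left corner of its tile in EPSG:3577 metres and that the tile index is the corner divided by the tile size, rounded down; the record layout and function names are hypothetical.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple

TILE_SIZE = 100_000  # metres; matches the default 100k by 100k Albers tiles


def tile_index(x: float, y: float) -> Tuple[int, int]:
    """Tile containing the point (x, y); floor division handles negatives."""
    return int(x // TILE_SIZE), int(y // TILE_SIZE)


def group_by_tile(stream: Iterable[dict]) -> Dict[str, List[str]]:
    """Consume the stream once, retaining only UUIDs per group name."""
    groups: Dict[str, List[str]] = defaultdict(list)
    for ds in stream:
        tx, ty = tile_index(ds["x"], ds["y"])
        groups[f"albers/{tx}_{ty}"].append(ds["id"])
        # The full record is not stored anywhere, so it can be freed.
    return dict(groups)


# Hypothetical records: an id plus a corner coordinate in metres.
corners = [(1_500_000, -4_000_000), (1_550_000, -3_950_001), (1_600_000, -4_000_000)]
records = ({"id": f"uuid-{i}", "x": x, "y": y} for i, (x, y) in enumerate(corners))

groups = group_by_tile(records)
```

Because the input is a generator and only short identifiers are accumulated, peak memory grows with the number of datasets times the size of a UUID rather than the size of a full Dataset object.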

Notes on performance

It took 26 minutes to slurp 2,627,779 wofs datasets from a local postgres server on AWS (r4.xlarge). This generated a 1.4G database file.

Command being timed: "slurpy -E wofs wofs.db :all:"
User time (seconds): 1037.93
System time (seconds): 48.77
Percent of CPU this job got: 69%
Elapsed (wall clock) time (h:mm:ss or m:ss): 26:04.79

Adding Albers tile grouping to this took just over 4 minutes, a processing rate of ~10.6K datasets per second.

Command being timed: "dstiler --native-albers wofs.db"
User time (seconds): 234.57
System time (seconds): 2.65
Percent of CPU this job got: 95%
Elapsed (wall clock) time (h:mm:ss or m:ss): 4:08.70

A similar workload on a VDI node (2,747,870 wofs datasets from the main db) took 23 minutes to dump all datasets from the DB and 7 minutes to tile into the Albers grid using the "native grid" optimization. Read throughput from the file db on the VDI node is slower than on AWS, but is still a respectable 6.5K datasets per second. The database file was somewhat bigger too, 2G vs 1.4G on AWS; perhaps the zstandard library differs significantly between the two systems.

Command being timed: "slurpy wofs.db wofs_albers"
User time (seconds): 1077.74
System time (seconds): 49.75
Percent of CPU this job got: 81%
Elapsed (wall clock) time (h:mm:ss or m:ss): 23:01.20

Command being timed: "dstiler --native-albers wofs.db"
User time (seconds): 408.65
System time (seconds): 6.28
Percent of CPU this job got: 98%
Elapsed (wall clock) time (h:mm:ss or m:ss): 7:03.22
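
The quoted throughput figures can be reproduced from the dataset counts and wall-clock times reported in this section. The helper names below are hypothetical; the numbers are copied from the timings above.

```python
def to_seconds(elapsed: str) -> float:
    """Convert an 'm:ss.ss' or 'h:mm:ss' wall-clock string to seconds."""
    seconds = 0.0
    for part in elapsed.split(":"):
        seconds = seconds * 60 + float(part)
    return seconds


def rate(n_datasets: int, elapsed: str) -> float:
    """Datasets processed per second of wall-clock time."""
    return n_datasets / to_seconds(elapsed)


aws_tiling = rate(2_627_779, "4:08.70")   # dstiler on AWS
vdi_tiling = rate(2_747_870, "7:03.22")   # dstiler on VDI
aws_export = rate(2_627_779, "26:04.79")  # slurpy on AWS

print(f"AWS dstiler: {aws_tiling / 1000:.1f}K datasets/s")
print(f"VDI dstiler: {vdi_tiling / 1000:.1f}K datasets/s")
print(f"AWS slurpy:  {aws_export / 1000:.1f}K datasets/s")
```

The two dstiler rates come out at roughly 10.6K and 6.5K datasets per second, matching the figures in the text, while the export from postgres runs at under 2K datasets per second.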

Note that grouping datasets into grids could just as well happen during the slurpy run without adding much overhead, so two-step processing is not strictly necessary.
