
experimental Dask array that opens/closes a resource when computing

Project description

resource-backed-dask-array


ResourceBackedDaskArray is an experimental Dask array subclass that opens/closes a resource when computing (but only once per compute call).

installation

pip install resource-backed-dask-array

motivation for this package

Consider the following class, which simulates a file reader capable of returning a dask array (using dask.array.map_blocks). The file handle must be in an open state in order to read a chunk; otherwise it segfaults (or otherwise errors):

import dask.array as da
import numpy as np


class FileReader:

    def __init__(self):
        self._closed = False

    def close(self):
        """close the imaginary file"""
        self._closed = True

    @property
    def closed(self):
        return self._closed

    def __enter__(self):
        if self.closed:
            self._closed = False  # open
        return self

    def __exit__(self, *_):
        self.close()

    def to_dask(self) -> da.Array:
        """Method that returns a dask array for this file."""
        return da.map_blocks(
            self._dask_block,
            chunks=((1,) * 4, 4, 4),  # normalizes to ((1, 1, 1, 1), (4,), (4,)): four chunks of shape (1, 4, 4)
            dtype=float,
        )

    def _dask_block(self):
        """simulate getting a single chunk of the file."""
        if self.closed:
            raise RuntimeError("Segfault!")
        return np.random.rand(1, 4, 4)

As long as the file stays open, everything works fine:

>>> fr = FileReader()
>>> dsk_ary = fr.to_dask()
>>> dsk_ary.compute().shape
(4, 4, 4)

However, if one closes the file, the dask array returned from to_dask will now fail:

>>> fr.close()
>>> dsk_ary.compute()  # RuntimeError: Segfault!

A "quick-and-dirty" solution here might be to have the _dask_block method temporarily reopen the file whenever it finds the file in the closed state. However, if opening the file takes any appreciable amount of time, this incurs significant overhead, because the file is opened and closed for every single chunk in the array.
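
For illustration only, here is a minimal sketch of that workaround, written as a hypothetical NaiveFileReader subclass (not part of this package) that reuses the context-manager methods of the FileReader above:

class NaiveFileReader(FileReader):
    """Variant that reopens the (imaginary) file for every single chunk."""

    def _dask_block(self):
        """simulate getting a single chunk, reopening the file if needed."""
        if self.closed:
            # pay the full open/close cost just for this one chunk
            with self:  # __enter__ reopens, __exit__ closes again
                return np.random.rand(1, 4, 4)
        return np.random.rand(1, 4, 4)

Every chunk in the task graph now pays that open/close cost, which is exactly the overhead described above.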

usage

ResourceBackedDaskArray.from_array

This library attempts to solve the above problem with the ResourceBackedDaskArray object. It manages the opening and closing of an underlying resource whenever .compute() is called, and it does so only once for all chunks in a single compute call's task graph.

>>> from resource_backed_dask_array import resource_backed_dask_array
>>> safe_dsk_ary = resource_backed_dask_array(dsk_ary, fr)
>>> safe_dsk_ary.compute().shape
(4, 4, 4)

>>> fr.closed  # leave it as we found it
True

The second argument passed to from_array must be a reusable context manager that additionally provides a closed attribute (like io.IOBase). In other words, it must implement the following protocol:

  1. it must have an __enter__ method that opens the underlying resource
  2. it must have an __exit__ method that closes the resource and optionally handles exceptions
  3. it must have a closed attribute that reports whether or not the resource is closed.

In the example above, the FileReader class itself implemented this protocol, and so was suitable as the second argument to ResourceBackedDaskArray.from_array.
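
As a rough sketch only (this class is not part of the library's public API), the three requirements could be expressed as a typing.Protocol, here under the assumed name ReusableResource:

from typing import Protocol


class ReusableResource(Protocol):
    """Sketch of the interface expected of the resource argument."""

    @property
    def closed(self) -> bool:
        """Whether the underlying resource is currently closed."""
        ...

    def __enter__(self) -> "ReusableResource":
        """(Re)open the underlying resource."""
        ...

    def __exit__(self, *exc_info) -> None:
        """Close the resource, optionally handling exceptions."""
        ...

A structural type checker would treat the FileReader above as compatible with this protocol, since it provides all three members.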

Important Caveats

This was created for single-process (and perhaps only single-threaded) use cases where dask's out-of-core lazy loading is still very desirable. Usage with dask.distributed is untested and may well fail. Using stateful objects (such as the reusable context manager used here) in multi-threaded or multi-process tasks is error-prone.

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

resource_backed_dask_array-0.1.0.tar.gz (10.3 kB, source)

Built Distribution

resource_backed_dask_array-0.1.0-py2.py3-none-any.whl (8.0 kB, Python 2/3 wheel)

File details

Details for the file resource_backed_dask_array-0.1.0.tar.gz.

File metadata

  • Download URL: resource_backed_dask_array-0.1.0.tar.gz
  • Size: 10.3 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/3.8.0 pkginfo/1.8.2 readme-renderer/32.0 requests/2.27.1 requests-toolbelt/0.9.1 urllib3/1.26.8 tqdm/4.62.3 importlib-metadata/4.11.1 keyring/23.5.0 rfc3986/2.0.0 colorama/0.4.4 CPython/3.10.2

File hashes

Hashes for resource_backed_dask_array-0.1.0.tar.gz
  • SHA256: 8fabcccf5c7e29059b5badd6786dd7675a258a203c58babf10077d9c90ada54f
  • MD5: dea76680ec59e13b0bc9b3df93bbf65c
  • BLAKE2b-256: 6280b8952048ae1772d33b95dbf7d7107cf364c037cc229a2690fc8fa9ee8e48


File details

Details for the file resource_backed_dask_array-0.1.0-py2.py3-none-any.whl.

File metadata

  • Download URL: resource_backed_dask_array-0.1.0-py2.py3-none-any.whl
  • Size: 8.0 kB
  • Tags: Python 2, Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/3.8.0 pkginfo/1.8.2 readme-renderer/32.0 requests/2.27.1 requests-toolbelt/0.9.1 urllib3/1.26.8 tqdm/4.62.3 importlib-metadata/4.11.1 keyring/23.5.0 rfc3986/2.0.0 colorama/0.4.4 CPython/3.10.2

File hashes

Hashes for resource_backed_dask_array-0.1.0-py2.py3-none-any.whl
  • SHA256: ec457fa72d81f0340a67ea6557a5a5919323a11cccc978a950df29fa69fe5679
  • MD5: 3da22ac0ac7f3d70d1d87639aa399a46
  • BLAKE2b-256: 0db5852f619e53fa7fb70d8915fcae66632df3958cac7e926c4ac38458958674

