Cacheable big data pipelines

Nuthatch

Nuthatch is a tool for building pure-Python big data pipelines. At its core, it enables transparent, multi-level caching and recall of results in formats that are efficient for each data type. It supports a variety of common storage backends, data processing frameworks, and their associated data types.

It also provides a framework for re-using and sharing data-type-specific post-processing, and for letting these data type processors pass hints to storage backends for more efficient storage and recall.

Nuthatch was created to alleviate the common pattern of data processing pipelines manually specifying their output storage locations, and the need for pipeline builders to use external data orchestration tools to manage the execution of their pipelines. With Nuthatch, simply tag your functions, and anyone who has access to your storage backend - you, your team, or the public - can access and build off of your most up-to-date data.

Getting started

The most basic form of Nuthatch simply stores and recalls your data, keyed by its arguments, in efficient formats:

from nuthatch import cache
import xarray as xr

@cache()
def my_first_cache():
    ds = xr.tutorial.open_dataset("air_temperature")

    # Data will automatically be saved in a zarr store and recalled
    return ds

my_first_cache()
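Conceptually, this is similar to a disk-backed memoizer keyed by the function's name and arguments. The following is a minimal pure-Python sketch of that idea - not Nuthatch's actual implementation, which adds data-type-aware formats and multiple storage backends:

```python
import functools
import hashlib
import os
import pickle

CACHE_DIR = ".cache"

def disk_cache(func):
    """Memoize results on disk, keyed by function name and arguments."""
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        # Build a stable key from the function name and its arguments
        key = hashlib.sha256(
            pickle.dumps((func.__name__, args, sorted(kwargs.items())))
        ).hexdigest()
        path = os.path.join(CACHE_DIR, f"{func.__name__}-{key}.pkl")
        if os.path.exists(path):
            with open(path, "rb") as f:
                return pickle.load(f)  # recall the cached result
        result = func(*args, **kwargs)
        os.makedirs(CACHE_DIR, exist_ok=True)
        with open(path, "wb") as f:
            pickle.dump(result, f)  # store the result for next time
        return result
    return wrapper

@disk_cache
def expensive(n):
    return sum(i * i for i in range(n))

expensive(10)  # computed and written to disk
expensive(10)  # recalled from disk
```

Nuthatch generalizes this pattern: instead of pickling everything, it picks a storage format suited to the data type (e.g. Zarr for xarray datasets) and a backend suited to your configuration.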

But it's much more powerful if you configure Nuthatch to be shared across a team:

from nuthatch import cache, set_parameter
import xarray as xr

set_parameter({'filesystem': "gs://my-datalake"})

@cache()
def my_first_cache():
    ds = xr.tutorial.open_dataset("air_temperature")

    # Data will automatically be saved in a zarr store and recalled
    return ds

my_first_cache()

Commit your code, and anyone with access to your datalake has access to a self-documenting cache of your data.

More powerful still - push your package to PyPI, and anyone who imports your code can access the data simply by calling the function (assuming they have read-only access to the storage).

Nuthatch configuration

To use Nuthatch you must configure access to some file store. At its simplest this could be your local filesystem, but it is likely more useful if it's a remote cloud bucket (like GCS, S3, etc.). Configuration can be done in three places: (1) in your pyproject.toml, (2) in a special nuthatch.toml built into your package, or (3) in your code - useful if you need to fetch secrets dynamically or configure Nuthatch on distributed workers.

Nuthatch itself and most storage backends only need access to a filesystem. Some storage backends, like databases, may need additional parameters.

Nuthatch also supports multiple storage locations. The root location is where data is stored by default, but users can also configure local locations for faster data access and mirror locations that are read-only data sources. Projects imported from other Python modules are automatically set up as mirror locations.
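A multi-location setup might look like the sketch below. Only the `root` section appears elsewhere in this document's examples; the `local` and `mirror` section names are assumptions used here for illustration:

```toml
# Hypothetical multi-location configuration; section names other
# than `root` are assumptions, not confirmed Nuthatch API.
[tool.nuthatch.root]
filesystem = "s3://my-bucket/caches"      # default shared store

[tool.nuthatch.local]
filesystem = "/scratch/nuthatch-cache"    # faster local access

[tool.nuthatch.mirror]
filesystem = "gs://partner-datalake"      # read-only data source
```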

TOML Configuration

In either pyproject.toml or src/nuthatch.toml:

[tool.nuthatch]
filesystem = "s3://my-bucket/caches"

[tool.nuthatch.filesystem_options]
key = "your_key_id"
secret = "your_secret_key"

pyproject.toml cannot be easily packaged. If you would like your caches to be accessible when your package is installed and imported by others, you must use either a nuthatch.toml file or dynamic configuration.
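If you build with setuptools, one way to ship a nuthatch.toml inside your package is the standard package-data mechanism (a sketch; `my_package` is a placeholder, and other build backends have their own equivalents):

```toml
# pyproject.toml
[tool.setuptools.package-data]
my_package = ["nuthatch.toml"]
```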

Dynamic configuration - decorators

You should not save secrets in files. To solve this problem, Nuthatch lets you fetch secrets dynamically - from a cloud secret store, an environment variable, or another location. Just make sure the module containing this configuration is imported before you run your code:

import os

from nuthatch import config_parameter

@config_parameter('filesystem_options', secret=True)
def fetch_key():
    # Fetch from secret store, environment, etc
    filesystem_options = {
        'key': os.environ['S3_KEY'],
        'secret': os.environ['S3_SECRET']
    }

    return filesystem_options

Dynamic configuration - direct setting

You can also simply set configuration parameters in code, which is sometimes necessary in distributed environments:

from nuthatch import set_parameter
set_parameter({'filesystem': "gs://my-datalake"})

Backend-specific configuration

Nuthatch backends can be individually configured - for instance, if all of your Zarr stores are too big for the datalake and need cheaper storage, you can give the Zarr backend a different filesystem location:

[tool.nuthatch.root.zarr]
filesystem = "s3://my-zarr-bucket/"

More advanced caching

Nuthatch has many more features:

  • Caches that are keyed by argument
  • Processors to enable slicing and data validation
  • Rerunning of DAGs explicitly
  • Per-data-type memoization of results (e.g. persisting an xarray dataset and recalling the compute graph from memory)
  • Caching of data locally for lower-latency access
  • Namespacing of caches to rerun the same data pipeline for multiple scenarios
For example:

# the exact import location of `timeseries` is assumed here
from nuthatch import cache, timeseries

@timeseries(timeseries='time')
@cache(cache_args=['agg_days'])
def agg_and_clip(start_time, end_time, agg_days=1):
    ds = my_first_cache()

    # aggregate based on time
    ds = ds.rolling({'time': agg_days}).mean()

    return ds

# Daily aggregate
agg_and_clip("2013-01-01", "2014-01-01", agg_days=1)

# Daily aggregate recalled, persisted in memory, and clipped to 2013-06
agg_and_clip("2013-01-01", "2013-06-01", agg_days=1, memoize=True) 

# Daily aggregate recalled from memory and clipped to 2013-06
agg_and_clip("2013-01-01", "2013-06-01", agg_days=1, memoize=True) 

# Weekly aggregate computed fresh
agg_and_clip("2013-01-01", "2014-01-01", agg_days=7)

# Weekly aggregate recomputed, overwriting the existing cache
agg_and_clip("2013-01-01", "2014-01-01", agg_days=7, recompute=True, force_overwrite=True)

# Weekly aggregate with both functions recomputed and overwritten
agg_and_clip("2013-01-01", "2014-01-01", agg_days=7, recompute=['agg_and_clip', 'my_first_cache'], force_overwrite=True) 

Nuthatch Limitations

Current limitations:

  • Arguments used to key caches must be basic types, not objects
  • There is currently no mechanism to detect cache "staleness". Automatically tracking and detecting changes is planned for future work.
  • Expanded configurability (e.g. directly from environment variables) is not yet supported
