Cacheable big data pipelines

Project description

Nuthatch

Nuthatch is a tool for building pure-python big data pipelines. At its core it enables the transparent multi-level caching and recall of results in formats that are efficient for each data type. It supports a variety of common storage backends, data processing frameworks, and their associated data types for caching.

It also provides a framework for re-using and sharing data-type specific post-processing, and for these data type processors to pass hints to storage backends for more efficient storage and recall.

Nuthatch was created to alleviate the comon pattern of data processing pipelines manually specifying their output storage locations, and the requirements of pipeline builders to use external data orchestration tools to specify the execution of their pipelines. With Nuthatch simply tag your functions and anyone who has access to your storage backend - you, your team, or the public - can access and build off of your most up-to-date data.

Getting started

The most basic form of Nuthatch simply stores and recalls your data based on its arguments in efficient formats:

from nuthatch import cache
import xarray as xr

@cache()
def my_first_cache():
    ds = xr.tutorial.open_dataset("air_temperature")

    # Data will automatically be saved in a zarr store and recalled
    return ds

my_first_cache(cache_mode='local')

But it's much more powerful if you configure nuthatch to be shared across a team:

from nuthatch import cache, set_parameter
import xarray as xr

set_parameter({'filesystem': "gs://my-datalake"})

@cache()
def my_first_cache():
    ds = xr.tutorial.open_dataset("air_temperature")

    # Data will automatically be saved in a zarr store and recalled
    return ds

my_first_cache()

Commit your code and anyone with access to your datalake has access to a self-documented cache of your data.

More powerful - push your code to pypi and anyone who imports your code can access the data simply by calling the function (assuming they have read-only access to the storage.)

Slightly more advanced use cases

Nuthatch has many more features:

Caches that are keyed by argument
Processors to enable slicing and data validation
Rerunning of DAGs explicitly
Per-data-type memoization of results (i.e. persisting an xarray and recalling the compute graph from memory)
Caching of data locally for lower-latency access
Namespacing of caches to rerun the same data pipeline for multiple scenarios
Cache versioning (to invalidate stale caches)

set_parameter({'filesystem': "gs://my-datalake"})

@timeseries(timeseries='time')
@cache(cache_args=['agg_days'],
       version="0.1.0")
def agg_and_clip(start_time, end_time, agg_days=1):
    ds = my_first_cache()

    # aggregate based on time
    ds = ds.rolling({'time': agg_days}).mean()

    return ds

# Daily aggregate
agg_and_clip("2013-01-01", "2014-01-01", agg_days=1)

# Daily aggregate recalled, persisted in memory (or cluster memory if setup), and clipped to 2013-06
agg_and_clip("2013-01-01", "2013-06-01", agg_days=1, memoize=True) 

# Daily aggregate recalled from memory and clipped to 2013-06
agg_and_clip("2013-01-01", "2013-06-01", agg_days=1, memoize=True) 

# Weekly aggregate computed fresh
agg_and_clip("2013-01-01", "2014-01-01", agg_days=7)

# Weekly aggregate recomputed and overwrite existing cache
agg_and_clip("2013-01-01", "2014-01-01", agg_days=7, recompute=True, cache_mode='overwrite')

# Weekly aggregate with both functions recomputed and overwritten
agg_and_clip("2013-01-01", "2014-01-01", agg_days=7, recompute=['agg_and_clip', 'my_first_cache'], cache_mode='overwrite') 

# Weekly aggregate with both functions recomputed and saved to a local cache for faster recall
agg_and_clip("2013-01-01", "2014-01-01", agg_days=7, recompute=['agg_and_clip', 'my_first_cache'], cache_mode='local') 

# Weekly aggregate with cache saved to a new namespace
agg_and_clip("2013-01-01", "2014-01-01", agg_days=7, namespace='experiment2')

Nuthatch caching levels

Root

The root cache is your main storage location. It's often the remote cloud bucket serving as your project's datalake, but it could also be a shared drive on a local cluster.

Local

If you use a local mode, nuthatch will automatically create a local cache for you and store data there for more efficient recall.

Mirror(s)

You can configure any number of read-only mirrors to look for your data in. If you import a project that uses nuthatch its root and all of its mirrors will be added as your mirrors so that you can fetch nuthatch data/functions that it defines.

Nuthatch cache modes

When calling nuthatch functions you can operate in several distinct modes which control which of the levels you write to and read from.

cache_mode='write'

The default mode when you have a root cache configured. By default writes to and reads from the root cache if the function is set to be cached. This mode prompts the user if a cache exists before overwriting.

cache_mode='overwrite'

Same as above but does not prompt the user before overwriting.

cache_mode='read_only'

This reads from all available cache locations but writes to no location, simply returns the results (or computes them if they do not exist). This is the default mode if you do not have a root configured. This still allows you to import external projects and read their data.

cache_mode='local'

This mode reads from all available caches, and will store results to your local cache for faster recall. This is useful for the faster recall of data from a remote source.

cache_mode='offline'

This mode only reads from local caches and doesn't read from the root cache.

Nuthatch supported backends

Nuthatch supports multiple backends for writing, and multiple engines (datatypes) for reading from those backends. The following are currently supported. Backends beyond the defaults can be configured by passing backend='name' and read data types beyond the default can be configured by passing engine=<type>. I.e. a function returning pandas dataframe for storing in parquet instead of delta could pass backend='parquet', engine=pandas.DataFrame

Backend	Default for Data Type	Parameters	Supported read engines
Basic (pickle)	Basic python types, default for unmatched types	filesystem, filesystem_options (optional)	N/A, unpickles to stored typed
Zarr	xarray.Dataset, xarray.DataArray	filesystem, filesystem_options (optional)	xarray.Dataset, xarray.DataArray
Parquet	dask.dataframe.DataFrame	filesystem, filesystem_options (optional)	pandas.DataFrame, dask.dataframe.DataFrame
Deltalake	pandas.DataFrame	filesystem, filesystem_options (optional)	pandas.DataFrame, dask.dataframe
SQL	None	host, port, database, driver, user, password, write_user (optional), write_password (optional)	pandas, dask.dataframe

Nuthatch configuration

If you are developing a Nuthatch-based project you should configure its root filestore, and possible its mirrors and your preferred local caching location. The root store most likely a remote cloud bucket (like gcs, s3, etc). Configuration can be done in three places: (1) in your pyproject.toml, (2) in a special nuthatch.toml built into your package or (3) in your code - useful if you need to access secrets dynamical or configure nuthatch on distributed workers.

Nuthatch itself and most storage backends only need access to a filesystem. Some storage backends, like databases, may need additional parameters.

TOML Configuration

In either pyproject.toml or src/nuthatch.toml:

[tool.nuthatch]
filesystem = "s3://my-bucket/caches"

[tool.nuthatch.filesystem_options]
key = "your_key_id" 
secret= "your_secret_key" #do NOT put your secret in your toml file. Use dynamic secrets below.

pyproject.toml cannot be easily packaged. If you would like your caches to be accessible when your package is installed and imported by others, you must use either a nuthatch.toml file or dynamic configuration. Make sure your nuthatch.toml is packaged with your project!

Dynamic configuration - decorators

You should not save secrets in files. To solve this problem nuthatch enables a method of fetching secrets dynamically, from a cloud secret store, or from another location like an environment variable or file. Just make sure this file is imported before you run your code

from nuthatch import config_parameter

@config_parameter('filesystem_options', secret=True)
def fetch_key():
    # Fetch from secret store, environment, etc
    filesystem_options = {
        'key': os.environ['S3_KEY'],
        'secret': os.environ['S3_SECRET']
    }

    return filesystem_options

Dynamic configuration - direct setting

You can also simply set configuration parameters in code, which is sometimes necessary for distributed environments

from nuthatch import set_parameter
set_parameter({'filesystem': "gs://my-datalake"})

Backend-specific configuration

Nuthatch backends can be individually configured - for instance if all of your Zarr's are too big for the datalake and need cheaper storage you can set the zarr backend to have a different fileysystem location:

[tool.nuthatch.root.zarr]
filesystem = "s3://my-zarr-bucket/"

Environment variables

You can configure nuthatch with environment variables in the form NUTHATCH_<PARAMETER_NAME> or NUTHATCH<PARAMETER_NAME> or NUTHATCH_<PARAMETER_NAME> (where location defalts to root.)

There is a special environment varialbe 'NUTHATCH_ALLOW_INSTALLED_PACKAGE_CONFIGURATION' that enables dynamic parameters to set the root parameters even when Nuthatch is an installed package. This is useful for running on clusters where your package is installed for execution even though it is the primary project.

Nuthatch Limitations

Current limitations:

Arguments must be basic types, not objects to key caches
There is currently no mechanism to detect cache "staleness". Automatically tracking and detecting changes is planned for future work.
We have not tested Nuthatch on S3 and Azure blob storage, only on google cloud, but that is ongoing and hopefully an update will be released soon.

Project details

Release history Release notifications | RSS feed

0.4.9

Apr 8, 2026

0.4.8

Mar 20, 2026

0.4.7

Feb 24, 2026

0.4.6

Feb 24, 2026

0.4.5

Feb 10, 2026

0.4.4

Feb 3, 2026

0.4.3

Jan 27, 2026

0.4.2

Jan 27, 2026

0.4.1

Jan 23, 2026

0.4.0

Jan 22, 2026

0.3.9

Jan 22, 2026

0.3.8

Jan 21, 2026

0.3.7

Jan 15, 2026

0.3.6

Jan 14, 2026

0.3.5

Jan 7, 2026

0.3.4

Jan 6, 2026

This version

0.3.3

Jan 6, 2026

0.3.2

Jan 6, 2026

0.3.1

Jan 6, 2026

0.3.0

Dec 12, 2025

0.2.1

Dec 9, 2025

0.1.1

Nov 17, 2025

0.1.0

Sep 4, 2025

Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

nuthatch-0.3.3.tar.gz (183.7 kB view details)

Uploaded Jan 6, 2026 Source

Built Distribution

If you're not sure about the file name format, learn more about wheel file names.

The dropdown lists show the available interpreters, ABIs, and platforms. Enable javascript to be able to filter the list of wheel files.

nuthatch-0.3.3-py3-none-any.whl (45.2 kB view details)

Uploaded Jan 6, 2026 Python 3

File details

Details for the file nuthatch-0.3.3.tar.gz.

File metadata

Download URL: nuthatch-0.3.3.tar.gz
Upload date: Jan 6, 2026
Size: 183.7 kB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: uv/0.9.21 {"installer":{"name":"uv","version":"0.9.21","subcommand":["publish"]},"python":null,"implementation":{"name":null,"version":null},"distro":{"name":"Ubuntu","version":"24.04","id":"noble","libc":null},"system":{"name":null,"release":null},"cpu":null,"openssl_version":null,"setuptools_version":null,"rustc_version":null,"ci":true}

File hashes

Hashes for nuthatch-0.3.3.tar.gz
Algorithm	Hash digest
SHA256	`6807fc85a265a562d40419afcd15ff54e74abd38c5a7b75b78f6487bb8064578`
MD5	`a51dea6d82cc73a9fea8bcb8d8e92e28`
BLAKE2b-256	`ee649086821cf1ca49800c73a64d79bf3f88fdb2b66c53455a09fd4b9f999997`

See more details on using hashes here.

File details

Details for the file nuthatch-0.3.3-py3-none-any.whl.

File metadata

Download URL: nuthatch-0.3.3-py3-none-any.whl
Upload date: Jan 6, 2026
Size: 45.2 kB
Tags: Python 3
Uploaded using Trusted Publishing? No
Uploaded via: uv/0.9.21 {"installer":{"name":"uv","version":"0.9.21","subcommand":["publish"]},"python":null,"implementation":{"name":null,"version":null},"distro":{"name":"Ubuntu","version":"24.04","id":"noble","libc":null},"system":{"name":null,"release":null},"cpu":null,"openssl_version":null,"setuptools_version":null,"rustc_version":null,"ci":true}

File hashes

Hashes for nuthatch-0.3.3-py3-none-any.whl
Algorithm	Hash digest
SHA256	`18aace5777acb1210b402053b057a186bae4108bede4f11635229f16f06ec830`
MD5	`8b3f65e7c059188ac6652c018bf26ec8`
BLAKE2b-256	`9235170312200474cde6b7dc0327736a98f3f7716e2b01598feaf16e0bf4e333`

See more details on using hashes here.

nuthatch 0.3.3

Navigation

Verified details

Maintainers

Unverified details

Meta

Project description

Nuthatch

Getting started

Slightly more advanced use cases

Nuthatch caching levels

Root

Local

Mirror(s)

Nuthatch cache modes

cache_mode='write'

cache_mode='overwrite'

cache_mode='read_only'

cache_mode='local'

cache_mode='offline'

Nuthatch supported backends

Nuthatch configuration

TOML Configuration

Dynamic configuration - decorators

Dynamic configuration - direct setting

Backend-specific configuration

Environment variables

Nuthatch Limitations

Project details

Verified details

Maintainers

Unverified details

Meta

Release history Release notifications | RSS feed

Download files

Source Distribution

Built Distribution

File details

File metadata

File hashes

File details

File metadata

File hashes