Cacheable big data pipelines
Nuthatch
Nuthatch is a tool for building pure-Python big data pipelines. At its core it enables the transparent, multi-level caching and recall of results in formats that are efficient for each data type. It supports a variety of common storage backends, data processing frameworks, and their associated data types for caching.
It also provides a framework for reusing and sharing data-type-specific post-processing, and for these data type processors to pass hints to storage backends for more efficient storage and recall.
Nuthatch was created to alleviate the common pattern of data processing pipelines manually specifying their output storage locations, and the requirement that pipeline builders use external data orchestration tools to specify the execution of their pipelines. With Nuthatch, simply tag your functions, and anyone who has access to your storage backend - you, your team, or the public - can access and build off of your most up-to-date data.
Using Nuthatch
Configuration
To use Nuthatch you must configure access to a file store. At its most basic this could be your local filesystem, but it is likely more useful with a remote cloud bucket (e.g. GCS or S3). Configuration is done in your pyproject.toml, e.g.:
```toml
[tool.nuthatch]
filesystem = "s3://my-bucket/caches"

[tool.nuthatch.filesystem_options]
key = "your_key_id"
secret = "your_secret_key"
```
This is sufficient to enable Nuthatch to store data in all file-like backends. Other backends, such as databases, will require additional configuration parameters.
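For local experimentation, the filesystem setting presumably accepts a plain local path as well - an assumption based on the note above that the local filesystem is supported; check the project docs for the exact form:

```toml
[tool.nuthatch]
# Hypothetical local-path configuration (assumed form)
filesystem = "/tmp/nuthatch-caches"
```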
Dynamic Secrets
You should not save secrets in pyproject.toml. To avoid this, Nuthatch provides a way to fetch secrets dynamically - from a cloud secret store, or from another location like an environment variable or a file. Just make sure the module containing this function is imported before you run your code:
```python
import os

from nuthatch import config_parameter

@config_parameter('filesystem_options', secret=True)
def fetch_key():
    # Fetch from a secret store, environment variable, etc.
    filesystem_options = {
        'key': os.environ['S3_KEY'],
        'secret': os.environ['S3_SECRET']
    }
    return filesystem_options
```
Your first cache
Now you can create your first cache! Simply tag your function with `@cache`, and Nuthatch will do its best to store your data efficiently.
```python
import xarray as xr

from nuthatch import cache
from nuthatch.processors import timeseries

@cache()
def my_first_cache():
    ds = xr.tutorial.open_dataset("air_temperature")
    # Data will automatically be saved in a Zarr store and recalled
    return ds

my_first_cache()
```
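The mechanics behind `@cache` can be pictured with a toy, stdlib-only decorator that pickles a function's result to disk: the body runs once, and later calls load the stored result instead. This is a deliberately simplified sketch - Nuthatch actually chooses type-appropriate formats (such as Zarr for xarray) and supports remote backends:

```python
import functools
import os
import pickle
import tempfile

# Fresh temporary directory standing in for a configured storage backend
CACHE_DIR = tempfile.mkdtemp(prefix="mini-cache-")

def cache():
    """Toy transparent disk cache: run the body once, pickle the result,
    and serve later calls from disk."""
    def wrap(fn):
        @functools.wraps(fn)
        def inner(*args, **kwargs):
            path = os.path.join(CACHE_DIR, fn.__name__ + ".pkl")
            if os.path.exists(path):
                with open(path, "rb") as f:
                    return pickle.load(f)
            result = fn(*args, **kwargs)
            with open(path, "wb") as f:
                pickle.dump(result, f)
            return result
        return inner
    return wrap

calls = 0

@cache()
def expensive():
    global calls
    calls += 1
    return list(range(10))

expensive()
expensive()  # served from disk; the body does not run again
```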
Nuthatch has many more features too:
- Caches that are keyed by argument
- Processors to enable slicing and data validation
- Rerunning of DAGs explicitly
- Per-data-type memoization of results (i.e. persisting an xarray and recalling the compute graph from memory)
- Caching of data locally for lower-latency access
- Namespacing of caches to rerun the same data pipeline for multiple scenarios
For example:

```python
@timeseries(timeseries='time')
@cache(cache_args=['agg_days'])
def agg_and_clip(start_time, end_time, agg_days=1):
    ds = my_first_cache()
    # Aggregate based on time
    ds = ds.rolling({'time': agg_days}).mean()
    return ds

# Daily aggregate
agg_and_clip("2013-01-01", "2014-01-01", agg_days=1)
# Daily aggregate recalled, persisted in memory, and clipped to 2013-06
agg_and_clip("2013-01-01", "2013-06-01", agg_days=1, memoize=True)
# Daily aggregate recalled from memory and clipped to 2013-06
agg_and_clip("2013-01-01", "2013-06-01", agg_days=1, memoize=True)
# Weekly aggregate computed fresh
agg_and_clip("2013-01-01", "2014-01-01", agg_days=7)
# Weekly aggregate recomputed, overwriting the existing cache
agg_and_clip("2013-01-01", "2014-01-01", agg_days=7, recompute=True, force_overwrite=True)
# Weekly aggregate with both functions recomputed and overwritten
agg_and_clip("2013-01-01", "2014-01-01", agg_days=7, recompute=['agg_and_clip', 'my_first_cache'], force_overwrite=True)
```
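The `cache_args` behavior - keying a cache entry by only a subset of arguments - can be sketched with a small stdlib-only decorator. Again a toy illustration, not Nuthatch's implementation; note how two calls that differ only in non-keyed arguments share one entry:

```python
import functools
import inspect

def cache(cache_args=None):
    """Toy in-memory cache keyed only by the arguments named in cache_args."""
    def wrap(fn):
        store = {}
        sig = inspect.signature(fn)

        @functools.wraps(fn)
        def inner(*args, **kwargs):
            bound = sig.bind(*args, **kwargs)
            bound.apply_defaults()
            # Only the named arguments participate in the cache key
            key = tuple(bound.arguments[name] for name in (cache_args or []))
            if key not in store:
                store[key] = fn(*args, **kwargs)
            return store[key]
        return inner
    return wrap

runs = []

@cache(cache_args=['agg_days'])
def agg(start, end, agg_days=1):
    runs.append(agg_days)
    return f"{start}..{end} @ {agg_days}d"

agg("2013-01-01", "2014-01-01", agg_days=1)
agg("2013-06-01", "2014-01-01", agg_days=1)  # same key (agg_days=1): cache hit
agg("2013-01-01", "2014-01-01", agg_days=7)  # new key (agg_days=7): runs again
```

In the real library the time range is handled separately (by the `@timeseries` processor, which clips the cached data), which is presumably why `start_time` and `end_time` are excluded from the cache key above.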
Nuthatch Limitations
Current limitations:
- Cache-key arguments must be basic types (strings, numbers, etc.), not objects
- There is currently no mechanism to detect cache "staleness". Automatically tracking and detecting changes is planned for future work.
- Expanded configurability (e.g. directly from environment variables) is not yet supported
- Nuthatch's metastore (i.e. the database that tracks the caches and their versions) is still in flux. It currently adds a couple of seconds of overhead to each write. Future work will try to eliminate this.
File details
Details for the file nuthatch-0.1.1.tar.gz.
File metadata
- Download URL: nuthatch-0.1.1.tar.gz
- Upload date:
- Size: 317.2 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: uv/0.9.10 {"installer":{"name":"uv","version":"0.9.10"},"python":null,"implementation":{"name":null,"version":null},"distro":{"name":"Ubuntu","version":"24.04","id":"noble","libc":null},"system":{"name":null,"release":null},"cpu":null,"openssl_version":null,"setuptools_version":null,"rustc_version":null,"ci":true}
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | `bc4e2e1fcfebbdc889b884eb374375641b56a42942d191cde256ac7339725980` |
| MD5 | `424b4e1e4bccebc995f93adbfe5355da` |
| BLAKE2b-256 | `a744578edcad620509bc51bf90da0cd5a4af5fecf3e1a24a58ce0252502bc157` |
File details
Details for the file nuthatch-0.1.1-py3-none-any.whl.
File metadata
- Download URL: nuthatch-0.1.1-py3-none-any.whl
- Upload date:
- Size: 35.9 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: uv/0.9.10 {"installer":{"name":"uv","version":"0.9.10"},"python":null,"implementation":{"name":null,"version":null},"distro":{"name":"Ubuntu","version":"24.04","id":"noble","libc":null},"system":{"name":null,"release":null},"cpu":null,"openssl_version":null,"setuptools_version":null,"rustc_version":null,"ci":true}
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | `bacfcd2dad5c5c8308ef0310485194167220cebbaf6bb24a2aebbba287f406eb` |
| MD5 | `cdd4ff62c675e0860cf4b1ff17f62e8c` |
| BLAKE2b-256 | `73e65e4c32ed1d7c27d69ddd1f1d5989853e10a44dc4665f7344e4f5c53049d8` |