cashet

Content-addressable compute cache with git semantics
Run a function once. Get the same result instantly every time after that.

Install · Quick Start · Why · Use Cases · CLI · API · How It Works

Install

Global CLI tool (recommended):

uv tool install cashet
# or
pipx install cashet

Then use the CLI anywhere:

cashet --help

In a project (library + CLI):

uv add cashet
# or
pip install cashet

This installs cashet as both an importable Python library (from cashet import Client) and a project-local CLI (uv run cashet).

Develop / contribute:

git clone https://github.com/jolovicdev/cashet.git
cd cashet
uv sync
uv run pytest

Quick Start

from cashet import Client

client = Client()  # creates .cashet/ in current directory

def expensive_transform(data, scale=1.0):
    # imagine this takes 10 minutes
    return [x * scale for x in data]

# First call: runs the function
ref = client.submit(expensive_transform, [1, 2, 3], scale=2.0)
print(ref.load())  # [2.0, 4.0, 6.0]

# Second call with same args: instant — returns cached result
ref2 = client.submit(expensive_transform, [1, 2, 3], scale=2.0)
print(ref2.load())  # [2.0, 4.0, 6.0] — no re-computation

You can also use Client as a context manager to ensure the store connection is closed cleanly:

with Client() as client:
    ref = client.submit(expensive_transform, [1, 2, 3], scale=2.0)
    print(ref.load())

Chain tasks into a pipeline where each step's output feeds into the next:

from cashet import Client

client = Client()

def load_dataset(path):
    return list(range(100))

def normalize(data):
    max_val = max(data)
    return [x / max_val for x in data]

def train_model(data, lr=0.01):
    return {"loss": 0.05, "lr": lr, "samples": len(data)}

# Step 1: load
raw = client.submit(load_dataset, "data/train.csv")

# Step 2: normalize (receives raw output as input)
normalized = client.submit(normalize, raw)

# Step 3: train (receives normalized output)
model = client.submit(train_model, normalized, lr=0.001)

print(model.load())  # {'loss': 0.05, 'lr': 0.001, 'samples': 100}

Re-run the script — everything returns instantly from cache. Change one argument and only that step (and downstream) re-runs.

Why

You already have caches (functools.lru_cache, joblib.Memory). Here's what's different:

	lru_cache	joblib.Memory	cashet
AST-normalized hashing	No	No	Yes (comments/formatting don't break cache)
DAG resolution (chain outputs)	No	No	Yes
Content-addressable storage	No	No	Yes (like git blobs)
CLI to inspect history	No	No	Yes
Diff two runs	No	No	Yes
Garbage collection / eviction	No	Partial (Memory.reduce_size)	Yes
Pluggable serialization	No	No	Yes
Explicit cache opt-out	No	Partial	Yes
Pluggable store / executor	No	No	Yes
Persists across restarts	No	Yes	Yes

The core idea: hash the function's AST-normalized source + arguments = unique cache key. Comments, docstrings, and formatting changes don't invalidate the cache — only semantic changes do. Same function + same args = same result, stored immutably on disk. The result is a git-like blob you can inspect, diff, and chain.
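The idea can be sketched with the standard library alone. This is an illustration, not cashet's implementation: the helper names (normalized_source_hash, cache_key) and the choice of repr() as the canonical argument form are assumptions made for the example.

```python
import ast
import hashlib


def normalized_source_hash(source: str) -> str:
    """Hash source code by its AST so comments, docstrings, and layout are ignored."""
    tree = ast.parse(source)
    for node in ast.walk(tree):
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef, ast.ClassDef, ast.Module)):
            body = node.body
            # A docstring is a bare string expression at the start of a body.
            if (
                body
                and isinstance(body[0], ast.Expr)
                and isinstance(body[0].value, ast.Constant)
                and isinstance(body[0].value.value, str)
            ):
                node.body = body[1:] or [ast.Pass()]
    # ast.dump omits line/column attributes by default, so layout does not leak in.
    return hashlib.sha256(ast.dump(tree).encode("utf-8")).hexdigest()


def cache_key(source: str, args: tuple, kwargs: dict) -> str:
    """Combine the function hash with a hash of a canonical form of the arguments."""
    canonical_args = repr((args, sorted(kwargs.items())))
    args_hash = hashlib.sha256(canonical_args.encode("utf-8")).hexdigest()
    return f"{normalized_source_hash(source)}:{args_hash}"


ORIGINAL = '''
def scale(data, factor=1.0):
    return [x * factor for x in data]
'''

REFORMATTED = '''
def scale(data, factor=1.0):
    """Multiply every element by factor."""
    # comments are dropped by the parser
    return [x * factor
            for x in data]
'''

CHANGED = '''
def scale(data, factor=1.0):
    return [x + factor for x in data]
'''

same = normalized_source_hash(ORIGINAL) == normalized_source_hash(REFORMATTED)
different = normalized_source_hash(ORIGINAL) != normalized_source_hash(CHANGED)
```

Here `same` and `different` are both true: the reformatted variant hashes identically, while swapping `*` for `+` produces a new key.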

Use Cases

1. ML Experiment Tracking Without the Bloat

You run 200 hyperparameter sweeps overnight. Half crash. You fix a bug and re-run. Without cashet, you re-process the dataset 200 times. With cashet:

from cashet import Client, TaskError, TaskRef

client = Client()

def preprocess(dataset_path, image_size):
    # 45 minutes of image resizing
    ...

def train(data, learning_rate, dropout):
    ...

# Batch submit with topological ordering
# TaskRef(0) refers to the first task's output
results = client.submit_many([
    (preprocess, ("s3://my-bucket/images", 224)),
    (train, (TaskRef(0), 0.01, 0.2)),
    (train, (TaskRef(0), 0.01, 0.5)),
    (train, (TaskRef(0), 0.001, 0.2)),
    (train, (TaskRef(0), 0.001, 0.5)),
    (train, (TaskRef(0), 0.0001, 0.2)),
    (train, (TaskRef(0), 0.0001, 0.5)),
])

preprocess runs once — all 6 training jobs reuse its cached output. Re-run the script tomorrow and even the training results come from cache (same function + same args = instant).

2. Data Pipeline Debugging

Your ETL pipeline fails at step 3. You fix a typo. Now you need to re-run steps 3-4, but steps 1-2 are unchanged and expensive:

from cashet import Client

client = Client()

raw = client.submit(load_s3, "s3://logs/2024-05-01/")
clean = client.submit(remove_pii, raw)
enriched = client.submit(join_crm, clean, "select * from users")
report = client.submit(generate_report, enriched)

Fix the join_crm function and re-run the script. Steps 1-2 return instantly from cache. Only step 3 onward re-executes. This works because cashet tracks which function produced which output — changing a function's source code changes its hash, invalidating downstream cache entries.

3. Reproducible Notebook Results

cashet is designed to work in Jupyter notebooks and IPython sessions. Share a result with a colleague and they can verify exactly how it was produced:

# your notebook
ref = client.submit(generate_forecast, date="2024-01-01", model="v3")
print(f"Commit hash: {ref.commit_hash}")

# their terminal — inspect provenance
cashet show <hash>

# Output:
# Hash:     a3b4c5d6...
# Function: generate_forecast
# Source:   def generate_forecast(date, model): ...
# Args:     (('2024-01-01',), {'model': 'v3'})
# Created:  2024-05-01T10:32:17

# Retrieve the actual result
cashet get <hash> -o forecast.csv

4. Incremental Computation

Process a large dataset in chunks. Already-processed chunks return instantly:

from cashet import Client

client = Client()

def process_chunk(chunk_id, source_file):
    # expensive per-chunk processing
    ...

results = []
for chunk_id in range(100):
    ref = client.submit(process_chunk, chunk_id, "huge_file.parquet")
    results.append(ref)

First run processes all 100 chunks. Second run (even after restarting Python) returns all 100 results instantly. Add a new chunk? Only that one runs.

CLI

# Show commit history
cashet log

# Filter by function name
cashet log --func "preprocess"

# Filter by tag
cashet log --tag env=prod --tag experiment=run-1

# Show full commit details (source code, args, error)
cashet show <hash>

# Retrieve a result (pretty-prints strings/dicts/lists)
cashet get <hash>

# Write a result to file
cashet get <hash> -o output.bin

# Compare two commits
cashet diff <hash_a> <hash_b>

# Show lineage of a result (same function+args over time)
cashet history <hash>

# Delete a specific commit
cashet rm <hash>

# Evict old cache entries and orphaned blobs
cashet gc --older-than 30

# Evict oldest entries until under a size limit
cashet gc --max-size 1GB

# Clear everything (alias for gc --older-than 0)
cashet clear

# Storage statistics (includes disk size)
cashet stats

API

`Client`

from cashet import Client

client = Client(
    store_dir=".cashet",       # where to store blobs + metadata (SQLiteStore)
                               # falls back to $CASHET_DIR env var if set
    store=None,                # or inject any Store implementation
    executor=None,             # or inject any Executor implementation
    serializer=None,           # defaults to PickleSerializer
    max_workers=1,             # max parallelism for submit_many (default: 1, sequential)
)

Pluggable Backends

Everything is protocol-based. Swap the store, executor, or serializer without touching your task code:

from pathlib import Path

from cashet import Client, Store, Executor, Serializer
from cashet.store import SQLiteStore
from cashet.executor import LocalExecutor

# These are equivalent (the defaults):
client = Client(store_dir=".cashet")

# Explicit injection:
client = Client(
    store=SQLiteStore(Path(".cashet")),
    executor=LocalExecutor(),
)

Store protocol — implement this to use RocksDB, Redis, S3, or anything else:

from cashet.protocols import Store

class RedisStore:
    def __init__(self, url: str) -> None: ...
    def put_blob(self, data: bytes) -> ObjectRef: ...
    def get_blob(self, ref: ObjectRef) -> bytes: ...
    def put_commit(self, commit: Commit) -> None: ...
    def get_commit(self, hash: str) -> Commit | None: ...
    def find_by_fingerprint(self, fingerprint: str) -> Commit | None: ...
    def find_running_by_fingerprint(self, fingerprint: str) -> Commit | None: ...
    def list_commits(self, ...) -> list[Commit]: ...
    def get_history(self, hash: str) -> list[Commit]: ...
    def stats(self) -> dict[str, int]: ...
    def evict(self, older_than: datetime) -> int: ...
    def delete_commit(self, hash: str) -> bool: ...
    def close(self) -> None: ...

client = Client(store=RedisStore("redis://localhost"))
# Everything else works identically

Executor protocol — implement this for distributed execution (Celery, Kafka, RQ):

from cashet.protocols import Executor

class CeleryExecutor:
    def submit(self, func, args, kwargs, task_def, store, serializer):
        # Push to Celery, poll for result
        ...

client = Client(
    store=RedisStore("redis://localhost"),
    executor=CeleryExecutor(),
)

Serializer protocol: see Custom Serialization below.

`client.submit(func, *args, **kwargs) -> ResultRef`

Submit a function for execution. Returns a ResultRef — a lazy handle to the result.

ref = client.submit(my_func, arg1, arg2, key="value")
ref.hash         # content hash of the result blob
ref.commit_hash  # commit hash (use this for show/history/rm/get)
ref.size         # size in bytes
ref.load()       # deserialize and return the result

If the same function + same arguments have been submitted before, returns the cached result without re-executing.

`client.clear()`

Remove all cache entries and orphaned blobs. Equivalent to client.gc(timedelta(days=0)).

client.clear()

`client.submit_many(tasks) -> list[ResultRef]`

Submit a batch of tasks with automatic topological ordering. Use TaskRef(index) to wire outputs between tasks in the batch.

from cashet import TaskRef

refs = client.submit_many([
    step1_func,
    (step2_func, (TaskRef(0),)),
    (step3_func, (TaskRef(1), "extra_arg")),
], max_workers=4)  # run independent tasks in parallel

This enables parallel fan-out and ensures each task only runs after its dependencies.
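To make the ordering step concrete, here is a small sequential sketch built on the standard library's graphlib. It is not cashet's scheduler: `Ref` is a hypothetical stand-in for TaskRef, and `run_batch` is an illustrative name.

```python
from dataclasses import dataclass
from graphlib import TopologicalSorter


@dataclass(frozen=True)
class Ref:
    """Hypothetical stand-in for TaskRef: points at another task in the batch by index."""
    index: int


def run_batch(tasks):
    """Run (func, args) pairs in dependency order and return results by position."""
    # Map each task index to the set of indices it depends on.
    graph = {
        i: {a.index for a in args if isinstance(a, Ref)}
        for i, (_, args) in enumerate(tasks)
    }
    results = {}
    # static_order() yields every node after all of its predecessors.
    for i in TopologicalSorter(graph).static_order():
        func, args = tasks[i]
        resolved = [results[a.index] if isinstance(a, Ref) else a for a in args]
        results[i] = func(*resolved)
    return [results[i] for i in range(len(tasks))]


def load():
    return [1, 2, 3]


def double(xs):
    return [x * 2 for x in xs]


def total(xs, offset):
    return sum(xs) + offset


# Tasks are listed out of order on purpose; the sorter still runs load() first.
batch = [
    (total, (Ref(1), 10)),
    (double, (Ref(2),)),
    (load, ()),
]
outputs = run_batch(batch)
```

Tasks whose dependency sets are already satisfied are independent of each other, which is what makes running them in parallel safe.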

Opt out of caching:

# Per-call
ref = client.submit(non_deterministic_func, _cache=False)

# Per-function via decorator
import random

@client.task(cache=False)
def random_score():
    return random.random()

Force re-execution (skip cache, always run):

# Per-call
ref = client.submit(my_func, arg, _force=True)

# Per-function via decorator
@client.task(force=True)
def always_rerun():
    ...

Tag commits:

# Per-call
ref = client.submit(train, data, lr=0.01, _tags={"experiment": "v1"})

# Per-function via decorator
@client.task(tags={"team": "ml"})
def preprocess(raw):
    ...

Tags are not part of the cache key — they are metadata for organization and filtering.

Retry flaky operations:

# Per-call
ref = client.submit(fetch_api, url, _retries=3)

# Per-function via decorator
@client.task(retries=3)
def fetch_api(url):
    ...

Retries wait briefly between attempts. When retries are exhausted, client.submit raises TaskError with the original traceback included in the message.
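The retry behaviour described here can be approximated as follows. This is a sketch under assumptions: `TaskFailed` is a hypothetical stand-in for TaskError, the fixed `delay` is illustrative, and `retries=3` is taken to mean three extra attempts after the first, which the text above does not specify.

```python
import time
import traceback


class TaskFailed(Exception):
    """Hypothetical stand-in for cashet's TaskError."""


def run_with_retries(func, *args, retries=0, delay=0.01, **kwargs):
    """Call func, retrying up to `retries` extra times after the first failure."""
    last_trace = ""
    for attempt in range(retries + 1):
        try:
            return func(*args, **kwargs)
        except Exception:
            last_trace = traceback.format_exc()
            if attempt < retries:
                time.sleep(delay)  # brief pause before the next attempt
    # Every attempt failed: surface the last traceback in the error message.
    raise TaskFailed(f"failed after {retries + 1} attempts:\n{last_trace}")


calls = {"count": 0}


def flaky():
    calls["count"] += 1
    if calls["count"] < 3:
        raise ConnectionError("temporary failure")
    return "ok"


result = run_with_retries(flaky, retries=3)
```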

Task timeouts:

# Per-call (seconds)
ref = client.submit(slow_func, _timeout=30)

# Per-function via decorator
@client.task(timeout=30)
def slow_func():
    ...

Timeouts can be combined with retries — a timed-out attempt counts as a failure and will be retried.
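One way to picture the interaction between timeouts and retries is the sketch below. It uses a worker thread, which is an assumption made for the example and not necessarily how cashet enforces timeouts; the function names are illustrative.

```python
import time
from concurrent.futures import ThreadPoolExecutor, TimeoutError as FutureTimeout


def run_with_timeout(func, timeout, *args, **kwargs):
    """Run func in a worker thread and stop waiting after `timeout` seconds."""
    pool = ThreadPoolExecutor(max_workers=1)
    try:
        future = pool.submit(func, *args, **kwargs)
        return future.result(timeout=timeout)  # raises FutureTimeout when too slow
    finally:
        # Do not block on a still-running worker; a thread cannot be killed.
        pool.shutdown(wait=False)


def run_with_timeout_and_retries(func, timeout, retries):
    """Treat a timed-out attempt like any other failure and try again."""
    for attempt in range(retries + 1):
        try:
            return run_with_timeout(func, timeout)
        except FutureTimeout:
            if attempt == retries:
                raise


attempts = []


def slow_then_fast():
    attempts.append(1)
    if len(attempts) == 1:
        time.sleep(0.6)  # the first attempt exceeds the timeout
    return "done"


result = run_with_timeout_and_retries(slow_then_fast, timeout=0.2, retries=2)
```

Note the limitation of this approach: the timed-out attempt keeps running in the background until it finishes, so it only suits work that is safe to abandon.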

`@client.task`

@client.task
def my_func(x):
    return x * 2

ref = my_func(5)  # Returns ResultRef, same as client.submit(my_func, 5)
ref.load()        # 10

@client.task(cache=False, name="custom_task_name", tags={"env": "prod"})
def other_func(x):
    return x + 1

client.submit(my_func, 5) still works identically.

`client.log()`, `client.show()`, `client.get()`, `client.diff()`, `client.history()`, `client.rm()`, `client.gc()`

# List commits
commits = client.log(func_name="preprocess", limit=10)

# Filter by status
commits = client.log(status="failed")

# Filter by tags
commits = client.log(tags={"experiment": "v1"})

# Get commit details
commit = client.show(hash)
commit.task_def.func_source  # the source code
commit.task_def.args_snapshot  # the serialized args
commit.parent_hash  # previous commit for same func+args
commit.created_at

# Load a result by commit hash
result = client.get(hash)

# Diff two commits
diff = client.diff(hash_a, hash_b)
# {'func_changed': True, 'args_changed': False, 'output_changed': True, ...}

# Get lineage (all runs of same func+args)
history = client.history(hash)

# Evict old entries (default: 30 days)
evicted = client.gc()
# Evict entries older than 7 days
from datetime import timedelta
evicted = client.gc(older_than=timedelta(days=7))
# Evict oldest entries until under size limit
evicted = client.gc(max_size_bytes=1024 * 1024 * 1024)  # 1GB

# Storage stats
stats = client.stats()
# {
#     'total_commits': 42,
#     'completed_commits': 40,
#     'stored_objects': 38,      # blob_objects + inline_objects
#     'disk_bytes': 9439232,     # blob_bytes + inline_bytes
#     'blob_objects': 35,
#     'blob_bytes': 9437184,
#     'inline_objects': 3,
#     'inline_bytes': 2048,
# }
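The two eviction modes (by age, and oldest-first until under a size limit) can be sketched as a pure planning function. `Entry` and `plan_eviction` are hypothetical names for illustration; cashet's actual eviction lives in the store and may differ in detail.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class Entry:
    """Hypothetical cache record: just enough fields to decide what to evict."""
    key: str
    size: int
    created_at: datetime


def plan_eviction(entries, now, older_than=None, max_size_bytes=None):
    """Return keys to evict: expired entries first, then oldest until under the size limit."""
    evict = []
    kept = sorted(entries, key=lambda e: e.created_at)  # oldest first
    if older_than is not None:
        cutoff = now - older_than
        evict += [e.key for e in kept if e.created_at < cutoff]
        kept = [e for e in kept if e.created_at >= cutoff]
    if max_size_bytes is not None:
        total = sum(e.size for e in kept)
        while kept and total > max_size_bytes:
            oldest = kept.pop(0)
            total -= oldest.size
            evict.append(oldest.key)
    return evict


now = datetime(2024, 5, 1)
entries = [
    Entry("a", 100, now - timedelta(days=40)),
    Entry("b", 500, now - timedelta(days=10)),
    Entry("c", 300, now - timedelta(days=5)),
    Entry("d", 400, now - timedelta(days=1)),
]
by_age = plan_eviction(entries, now, older_than=timedelta(days=30))
by_size = plan_eviction(entries, now, max_size_bytes=800)
```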

Jupyter & Notebook Support

cashet works seamlessly in Jupyter notebooks, IPython, and the Python REPL. It uses a tiered source-resolution strategy:

inspect.getsource() — for normal .py files
dill.source.getsource() — for interactive sessions with live history
dis.Bytecode fallback — for any live function, even after a kernel restart

This means you can define functions in a notebook cell, rerun the cell with changes, and cashet will correctly invalidate the cache based on the new code.

# In a notebook cell
client = Client()

def preprocess(data):
    return [x * 2 for x in data]

ref = client.submit(preprocess, [1, 2, 3])

Change the cell body and rerun — the cache invalidates automatically.
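A reduced version of the tiered lookup, using only the standard library, might look like this. It skips the dill tier, and the names `function_identity` and `_code_key` are illustrative assumptions; the point is only to show how a bytecode fallback can still tell functions apart when no source text exists.

```python
import hashlib
import inspect
import types


def _code_key(code: types.CodeType) -> str:
    """Describe a code object by its instructions, constants, and names (not line numbers)."""
    consts = tuple(
        _code_key(c) if isinstance(c, types.CodeType) else repr(c)
        for c in code.co_consts
    )
    return repr((code.co_code, consts, code.co_names, code.co_varnames))


def function_identity(func) -> str:
    """Prefer real source text; fall back to bytecode when no source is available."""
    try:
        kind, text = "source", inspect.getsource(func)
    except (OSError, TypeError):
        # No file to read from, e.g. code compiled from a string.
        kind, text = "bytecode", _code_key(func.__code__)
    return f"{kind}:{hashlib.sha256(text.encode('utf-8')).hexdigest()}"


def define(source: str):
    """Create a function named `f` from a string, so inspect cannot find its source."""
    namespace = {}
    exec(compile(source, "<generated>", "exec"), namespace)
    return namespace["f"]


plain = define("def f(x):\n    return x + 1\n")
commented = define("def f(x):\n    # same logic, extra comment\n    return x + 1\n")
changed = define("def f(x):\n    return x + 2\n")
```

In this sketch `plain` and `commented` get the same identity because comments never reach the bytecode, while `changed` gets a different one.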

Thread Safety

cashet is safe to use from multiple threads and processes sharing the same store directory. Concurrent submissions of the same uncached task are deduplicated: the function executes exactly once and all callers receive the same cached result. This works across multiprocessing.Process, ProcessPoolExecutor, and multiple independent Python interpreters.

Note: Cross-process dedup uses a 5-minute timeout by default. If a process dies while running a task, its claim is reclaimed after that timeout so other workers are not blocked forever. You can adjust this via LocalExecutor(running_ttl=...):

from datetime import timedelta

from cashet import Client
from cashet.executor import LocalExecutor

client = Client(executor=LocalExecutor(running_ttl=timedelta(minutes=10)))

For example, ten threads that each open their own Client on the same store:

import threading

def worker():
    c = Client()  # separate Client instance, same store
    c.submit(expensive_func, arg)

threads = [threading.Thread(target=worker) for _ in range(10)]
for t in threads:
    t.start()
for t in threads:
    t.join()
# expensive_func ran only once
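The deduplication idea can be illustrated for the single-process case with a shared Future per fingerprint. This is a sketch only: `InFlightDeduper` is a hypothetical name, and it does not cover the cross-process case, which needs claims recorded in the shared store.

```python
import threading
from concurrent.futures import Future


class InFlightDeduper:
    """Run each fingerprint at most once; concurrent callers share one Future."""

    def __init__(self):
        self._lock = threading.Lock()
        self._futures = {}  # fingerprint -> Future holding the result

    def run(self, fingerprint, func, *args):
        with self._lock:
            future = self._futures.get(fingerprint)
            owner = future is None
            if owner:
                future = Future()
                self._futures[fingerprint] = future
        if owner:
            # Only the first caller executes; everyone else waits on the Future.
            try:
                future.set_result(func(*args))
            except Exception as exc:
                future.set_exception(exc)
        return future.result()


executions = []


def expensive(x):
    executions.append(x)
    return x * x


deduper = InFlightDeduper()
results = []
results_lock = threading.Lock()


def worker():
    value = deduper.run("expensive:7", expensive, 7)
    with results_lock:
        results.append(value)


threads = [threading.Thread(target=worker) for _ in range(10)]
for t in threads:
    t.start()
for t in threads:
    t.join()
```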

`ResultRef`

A lazy reference to a stored result. Pass it as an argument to chain tasks:

step1 = client.submit(func_a, input_data)
step2 = client.submit(func_b, step1)  # step1 auto-resolves to its output

Custom Serialization

from cashet import Client, PickleSerializer, SafePickleSerializer, JsonSerializer

# Default: pickle (handles arbitrary Python objects)
client = Client(serializer=PickleSerializer())

# Safe pickle: restricts deserialization to an allowlist of known types
client = Client(serializer=SafePickleSerializer())

# Allow custom classes through the allowlist
client = Client(serializer=SafePickleSerializer(extra_classes=[MyClass]))

# For JSON-safe data (dicts, lists, primitives)
client = Client(serializer=JsonSerializer())

# Or implement the Serializer protocol
from cashet.hashing import Serializer

class MySerializer:
    def dumps(self, obj) -> bytes:
        ...
    def loads(self, data: bytes):
        ...
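As a concrete instance of the two-method shape above, here is a serializer for JSON-safe data that also compresses its output. `SerializerLike` is a local stand-in written for this example so that it runs without cashet installed; `CompressedJsonSerializer` is likewise an illustrative name.

```python
import json
import zlib
from typing import Any, Protocol


class SerializerLike(Protocol):
    """Local stand-in for the two-method protocol shown above."""

    def dumps(self, obj: Any) -> bytes: ...
    def loads(self, data: bytes) -> Any: ...


class CompressedJsonSerializer:
    """Serialize JSON-safe data as zlib-compressed UTF-8 JSON."""

    def dumps(self, obj: Any) -> bytes:
        # sort_keys makes equal dicts produce identical bytes, which helps blob dedup.
        text = json.dumps(obj, sort_keys=True, separators=(",", ":"))
        return zlib.compress(text.encode("utf-8"))

    def loads(self, data: bytes) -> Any:
        return json.loads(zlib.decompress(data).decode("utf-8"))


serializer: SerializerLike = CompressedJsonSerializer()
payload = {"loss": 0.05, "lr": 0.001, "samples": 100}
restored = serializer.loads(serializer.dumps(payload))
```

Deterministic output matters for a content-addressable store: if equal results serialize to different bytes, they hash to different blobs and are stored twice.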

How It Works

client.submit(func, arg1, arg2)
         │
         ▼
  ┌─────────────────┐
  │  Hash function   │  SHA256(AST-normalized source + dep versions + referenced user helpers)
  │  Hash arguments  │  SHA256(canonical repr of args/kwargs)
  └────────┬────────┘
           │
           ▼
  ┌─────────────────┐
  │  Fingerprint     │  func_hash:args_hash
  │  cache lookup    │  ← Store protocol (SQLiteStore, RedisStore, ...)
  └────────┬────────┘
           │
     ┌─────┴─────┐
     │            │
  CACHED       MISS
     │            │
     ▼            ▼
  Return ref   ← Executor protocol (LocalExecutor, CeleryExecutor, ...)
               Execute function
               Store result as blob → Store protocol
               Record commit with parent lineage
               Return ref
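The flow in the diagram can be modelled in memory in a few lines. This is a deliberately simplified sketch: `TinyCache` is a hypothetical class, the function is identified by name instead of AST-normalized source, and pickle plus repr() stand in for the serializer and the canonical argument form.

```python
import hashlib
import pickle


def sha256_hex(data: bytes) -> str:
    return hashlib.sha256(data).hexdigest()


class TinyCache:
    """In-memory model of the flow: fingerprint lookup, execute on miss, record lineage."""

    def __init__(self):
        self.blobs = {}    # blob hash -> serialized result
        self.history = {}  # fingerprint -> list of commits, oldest first

    def submit(self, func, *args, force=False):
        # Simplification: the function is identified by name, not by its source.
        func_hash = sha256_hex(func.__qualname__.encode("utf-8"))
        args_hash = sha256_hex(repr(args).encode("utf-8"))
        fingerprint = f"{func_hash}:{args_hash}"

        commits = self.history.setdefault(fingerprint, [])
        if commits and not force:
            return pickle.loads(self.blobs[commits[-1]["blob"]])  # cache hit

        data = pickle.dumps(func(*args))  # cache miss: execute
        blob_hash = sha256_hex(data)
        self.blobs[blob_hash] = data      # identical results share one blob
        parent = commits[-1]["id"] if commits else None
        commits.append({"id": len(commits), "blob": blob_hash, "parent": parent})
        return pickle.loads(data)


calls = []


def square(x):
    calls.append(x)
    return x * x


cache = TinyCache()
first = cache.submit(square, 4)              # runs square
second = cache.submit(square, 4)             # served from the cache
third = cache.submit(square, 4, force=True)  # runs again, linked to the first commit
```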

Architecture (protocol-based):

Protocol	Default	Implement for
`Store`	`SQLiteStore`	RocksDB, Redis, S3, Postgres
`Executor`	`LocalExecutor`	Celery, Kafka, RQ, subprocess
`Serializer`	`PickleSerializer`	JSON, MessagePack, custom formats

Storage layout (in .cashet/):

.cashet/
├── objects/          # content-addressable blobs (like git objects)
│   ├── a3/
│   │   └── b4c5d6... # compressed result blob
│   └── e7/
│       └── f8a9b0...
└── meta.db           # SQLite: commits, fingerprints, provenance, inline_objects

Small objects (<1KB) are stored inline in meta.db instead of the filesystem. This reduces inode overhead for caches with many tiny results. Larger objects are stored as compressed blobs in objects/ as usual.
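The layout and the inline rule can be sketched as a small content-addressable store. Several details here are assumptions made for the example: zlib as the compression codec, hashing the uncompressed bytes, a dict standing in for the inline table in meta.db, and the names `BlobStore` and `INLINE_LIMIT`.

```python
import hashlib
import tempfile
import zlib
from pathlib import Path

INLINE_LIMIT = 1024  # bytes; mirrors the "<1KB" rule described above


class BlobStore:
    """Content-addressable store: small payloads inline, large ones as compressed files."""

    def __init__(self, root: Path):
        self.objects = root / "objects"
        self.objects.mkdir(parents=True, exist_ok=True)
        self.inline = {}  # stands in for the inline_objects table in meta.db

    def _path(self, digest: str) -> Path:
        # The first two hex characters form the directory, as in git's object store.
        return self.objects / digest[:2] / digest[2:]

    def put(self, data: bytes) -> str:
        digest = hashlib.sha256(data).hexdigest()
        if len(data) < INLINE_LIMIT:
            self.inline[digest] = data
        else:
            path = self._path(digest)
            if not path.exists():  # same content, same path: nothing to rewrite
                path.parent.mkdir(exist_ok=True)
                path.write_bytes(zlib.compress(data))
        return digest

    def get(self, digest: str) -> bytes:
        if digest in self.inline:
            return self.inline[digest]
        return zlib.decompress(self._path(digest).read_bytes())


with tempfile.TemporaryDirectory() as tmp:
    store = BlobStore(Path(tmp) / ".cashet")
    small = store.put(b"tiny result")
    big_a = store.put(b"x" * 5000)
    big_b = store.put(b"x" * 5000)  # duplicate content
    files_on_disk = sorted(p for p in store.objects.rglob("*") if p.is_file())
    round_trip = store.get(big_a)
    small_round_trip = store.get(small)
```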

Key design decisions:

Closure variables are not hashed and emit a ClosureWarning if present. Function identity is source code, not runtime state. If you need cache invalidation based on a value, pass it as an explicit argument.
Referenced user-defined helper functions are hashed recursively. Change an imported helper in your own code and the caller's cache invalidates correctly. Builtin and third-party library functions are skipped.
Blobs are deduplicated by content hash. Identical results share one blob on disk.
Source is hashed as an AST. Comments, docstrings, and whitespace changes don't invalidate the cache.
Non-cached tasks get unique commit hashes (timestamp salt) so they always re-execute but still record lineage.
Parent tracking: Each commit records the hash of the previous commit for the same function+args, forming a history chain you can traverse.
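The closure rule in the first point is easy to demonstrate. In this sketch `ClosureCaptureWarning` is a hypothetical stand-in for ClosureWarning and `check_closure` is an illustrative helper; it only shows how captured variables can be detected and why passing the value as an argument is the safer pattern.

```python
import warnings


class ClosureCaptureWarning(UserWarning):
    """Hypothetical stand-in for the ClosureWarning mentioned above."""


def check_closure(func):
    """Warn when func reads variables captured from an enclosing scope."""
    free_vars = func.__code__.co_freevars
    if free_vars:
        warnings.warn(
            f"{func.__name__} captures {', '.join(free_vars)}; "
            "captured values are not part of the cache key",
            ClosureCaptureWarning,
            stacklevel=2,
        )
    return free_vars


def make_scaler(factor):
    def scale(x):
        return x * factor  # `factor` lives in the closure, not in the arguments
    return scale


def scale_explicit(x, factor):
    return x * factor  # same logic, but `factor` is visible as an argument


with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    captured = check_closure(make_scaler(3))
    explicit = check_closure(scale_explicit)
```

`make_scaler(3)` and `make_scaler(4)` have identical source, so a source-based key cannot tell them apart; `scale_explicit(x, 3)` and `scale_explicit(x, 4)` differ in their arguments and therefore in their keys.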

Project Status

Beta. The core (hashing, DAG resolution, fingerprint dedup) is stable. The defaults work reliably for single-machine and multiprocess workflows. The protocol layer (Store, Executor, Serializer) is ready for alternative backends — implementing a Redis store or Celery executor is a single-file job.

Built-in: SQLiteStore + LocalExecutor + PickleSerializer. Not yet built: Redis, RocksDB, S3 stores; Celery/Kafka executors. PRs welcome.

License

MIT

Release history

0.4.4

May 11, 2026

0.4.3

May 1, 2026

0.4.2

May 1, 2026

0.4.1

May 1, 2026

0.4.0

Apr 30, 2026

0.3.2

Apr 29, 2026

0.3.1

Apr 28, 2026

0.3.0

Apr 26, 2026

This version

0.2.0

Apr 25, 2026

0.1.3

Apr 19, 2026

0.1.2

Apr 11, 2026

0.1.1

Apr 11, 2026

0.1.0

Apr 11, 2026

Download files

Source Distribution

cashet-0.2.0.tar.gz (61.1 kB)

Uploaded Apr 25, 2026 Source

Built Distribution

cashet-0.2.0-py3-none-any.whl (30.9 kB)

Uploaded Apr 25, 2026 Python 3

File details

Details for the file cashet-0.2.0.tar.gz.

File metadata

Download URL: cashet-0.2.0.tar.gz
Upload date: Apr 25, 2026
Size: 61.1 kB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: uv/0.11.7

File hashes

Hashes for cashet-0.2.0.tar.gz
Algorithm	Hash digest
SHA256	`8598189bd709fc3941dbf736ed2551dc1e8a7c0274294d9f7bdfc4b209524bef`
MD5	`ae34bd92c840983fa1d3548a1dac99b0`
BLAKE2b-256	`fede4ad789949559ae440f9bf98bc9969f2ba3b2982187c5264acdbe1e237267`

File details

Details for the file cashet-0.2.0-py3-none-any.whl.

File metadata

Download URL: cashet-0.2.0-py3-none-any.whl
Upload date: Apr 25, 2026
Size: 30.9 kB
Tags: Python 3
Uploaded using Trusted Publishing? No
Uploaded via: uv/0.11.7

File hashes

Hashes for cashet-0.2.0-py3-none-any.whl
Algorithm	Hash digest
SHA256	`3d53f3bcab185c865c1b47aa156e2cf89bc6c37bfb04893ec5e39eec4e06a6b7`
MD5	`3aafbce0e8f065c073f3e2dae4323cfe`
BLAKE2b-256	`55f71adaaa5e1ef7580b27a9b3d3117e18e56ae8d7034aabae7a7c8b6b663534`
