decache

A custom store backend for joblib that de-duplicates large NumPy arrays to save significant disk space in scientific computing and machine learning workflows.

The Problem

joblib.Memory is an invaluable tool for caching the results of expensive function calls. It works by hashing a function's arguments and pickling its return value into a file-system cache. When the function is called again with the same arguments, the result is loaded from the cache instead of being recomputed.
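
For reference, the standard pattern with the default backend looks like this (a minimal, self-contained sketch; the path and function are illustrative):

import joblib
import numpy as np

memory = joblib.Memory("/tmp/joblib_demo", verbose=0)

@memory.cache
def simulate(seed):
    """An 'expensive' computation whose result is worth caching."""
    return np.random.default_rng(seed).normal(size=1_000_000)

first = simulate(42)   # computed and written to the cache
second = simulate(42)  # same argument: loaded from disk, not recomputed
assert np.array_equal(first, second)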

However, the standard caching mechanism operates at the function call level. If two different function calls (or calls to different functions) happen to return the exact same large NumPy array, joblib will save this array to disk twice, once for each cache entry.

In data-intensive fields, it's common for various preprocessing or simulation steps to produce identical, multi-gigabyte arrays. This redundancy can lead to massive consumption of disk space, especially in long-running projects with extensive caching.
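
The redundancy is easy to observe with the stock backend. In this small experiment (paths and sizes are illustrative), two different cached functions return the same ~8 MiB array, and the cache ends up holding two full copies of it:

import joblib
import numpy as np
from pathlib import Path

CACHE = "/tmp/joblib_dup_demo"
memory = joblib.Memory(CACHE, verbose=0)
BIG = np.arange(1024 * 1024, dtype=np.float64)  # ~8 MiB of data

@memory.cache
def step_a():
    return BIG.copy()

@memory.cache
def step_b():
    return BIG.copy()

step_a()
step_b()

# Each call directory holds its own full copy of the array, so the cache
# weighs roughly twice the size of the array itself.
total = sum(f.stat().st_size for f in Path(CACHE).rglob("*") if f.is_file())
print(f"cache size: {total / 2**20:.1f} MiB")  # ~16 MiB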

The Solution: Content-Addressable Storage

This project introduces the ContentAddressableStoreBackend (CAS), a drop-in replacement backend for joblib.Memory that solves this problem through content-addressable storage.

Instead of storing everything in a single file per function call, it intelligently separates large NumPy arrays and stores them based on the hash of their content.

How It Works

  1. Interception: The backend intercepts the output of a cached function before it's saved to disk.
  2. Traversal & Identification: It recursively scans the output for NumPy arrays that exceed a configurable size threshold (e.g., 16 MB).
  3. Hashing & De-duplication: For each large array, it computes a unique hash of its data. The array is then saved to a central blobs directory, using its hash as the filename. If a file with that name already exists, it means the exact same array has been seen before, and no new data is written.
  4. Replacement: In the original function output, the large array is replaced with a small, lightweight placeholder object. This placeholder contains the array's hash and metadata (like its shape and data type).
  5. Reconstruction: When a result is loaded from the cache, the backend reverses the process. It loads the structure containing placeholders, and for each placeholder, it reads the corresponding array from the blobs directory to seamlessly reconstruct the original object.

The result is that any given large array is only ever stored once on disk, regardless of how many times it appears in your cache.
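
The heart of steps 2 through 5 fits in a few dozen lines. The following is a minimal sketch of the idea only, not decache's actual code; Placeholder, dehydrate, and rehydrate are hypothetical names:

import hashlib
from dataclasses import dataclass
from pathlib import Path

import numpy as np

BLOBS = Path("/tmp/cas-demo/blobs")
THRESHOLD = 16 * 1024 * 1024  # arrays at or above 16 MB are externalized

@dataclass
class Placeholder:
    """Stands in for a large array inside the stored function output."""
    digest: str
    shape: tuple
    dtype: str

def dehydrate(obj):
    """Steps 2-4: walk the output, write each large array once, swap in placeholders."""
    if isinstance(obj, np.ndarray) and obj.nbytes >= THRESHOLD:
        # A real implementation would typically mix shape and dtype into the hash too.
        digest = hashlib.sha256(obj.tobytes()).hexdigest()
        blob = BLOBS / f"{digest}.npy"
        if not blob.exists():  # de-duplication: identical content is never written twice
            BLOBS.mkdir(parents=True, exist_ok=True)
            with blob.open("wb") as f:
                np.save(f, obj)
        return Placeholder(digest, obj.shape, str(obj.dtype))
    if isinstance(obj, dict):
        return {k: dehydrate(v) for k, v in obj.items()}
    if isinstance(obj, (list, tuple)):
        return type(obj)(dehydrate(v) for v in obj)
    return obj

def rehydrate(obj):
    """Step 5: the reverse pass applied when a result is loaded from the cache."""
    if isinstance(obj, Placeholder):
        return np.load(BLOBS / f"{obj.digest}.npy")
    if isinstance(obj, dict):
        return {k: rehydrate(v) for k, v in obj.items()}
    if isinstance(obj, (list, tuple)):
        return type(obj)(rehydrate(v) for v in obj)
    return obj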

Features

  • Automatic De-duplication: Drastically reduces disk space usage when caching functions that return identical large NumPy arrays.
  • Transparent Integration: Works as a drop-in backend for joblib.Memory with no changes needed to your existing cached functions.
  • Configurable Threshold: You can easily define what constitutes a "large" array via backend_options.
  • Handles Nested Structures: Correctly processes large arrays nested within lists, tuples, and dictionaries.

Installation

pip install decache

Usage

Using the backend is straightforward: register it with joblib once, then pass it as the backend when instantiating joblib.Memory.

Here is a complete example demonstrating the de-duplication feature:

import joblib
import numpy as np
import shutil
from pathlib import Path

# 1. Import and register the custom backend
from decache.store_backend import register_cas_store_backend
register_cas_store_backend()

# 2. Define the cache directory and create the Memory object
#    Specify 'cas' as the backend and configure the threshold.
CACHE_DIR = "/tmp/decache"
memory = joblib.Memory(
    location=CACHE_DIR,
    backend="cas",
    backend_options={'large_array_threshold': 1 * 1024 * 1024},  # 1 MB
    verbose=10
)

# This array is large (~8 MB) and its content is identical every time.
IDENTICAL_LARGE_ARRAY = np.arange(1024 * 1024, dtype=np.float64)

@memory.cache
def process_data_source_a(source_id):
    """A cached function that returns a large, constant array."""
    print(f"Executing process_data_source_a for source '{source_id}'...")
    return IDENTICAL_LARGE_ARRAY.copy()

@memory.cache
def process_data_source_b(config_dict):
    """A completely different cached function that returns the same large array."""
    print(f"Executing process_data_source_b for config '{config_dict}'...")
    return IDENTICAL_LARGE_ARRAY.copy()

if __name__ == '__main__':
    # Call the first function. It will run and store the result.
    # A single blob for IDENTICAL_LARGE_ARRAY will be created.
    result_a = process_data_source_a("source1")

    # Call the second function. Its input is different, so it will also run.
    # However, since its output array is identical, it will *reuse* the existing blob.
    result_b = process_data_source_b({'user': 'test', 'version': 2})

    print("\n--- Cache Inspection ---")

    blobs_dir = Path(CACHE_DIR) / "joblib" / "blobs"
    blob_count = len(list(blobs_dir.iterdir()))
    print(f"Number of blobs stored: {blob_count}")

    # Although two functions were cached, only one blob file was created.
    assert blob_count == 1

    # Clean up the demo cache so the example can be re-run from scratch
    shutil.rmtree(CACHE_DIR)
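
On a fresh cache, both functions execute (each printing its "Executing..." message) and the inspection reports a single blob: the second function's output is byte-for-byte identical to the first, so it reuses the blob already on disk.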

Running Tests

The backend comes with a comprehensive pytest-based test suite. To run the tests, first install pytest:

pip install pytest

Then, from the root of the project, simply run:

pytest

Limitations and Future Work

  • Garbage Collection: This implementation does not automatically clean up "orphaned" blobs. If cache entries are cleared (e.g., via memory.clear()), the corresponding files in the blobs directory are not removed. A separate garbage-collection script could scan all cache entries and remove any unreferenced blobs, along the lines of the sketch below.
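
One possible shape for such a script follows. It leans on several assumptions that are not part of decache's documented API: that each cached result lives in an output.pkl file, that placeholders expose the blob hash via a digest attribute, and that blob files are named by their hash; running it also requires decache to be importable so the placeholders can be unpickled.

import joblib
from pathlib import Path

def sweep_orphan_blobs(cache_root: Path) -> None:
    """Delete every blob that no surviving cache entry references (sketch)."""
    referenced = set()
    for output_file in cache_root.rglob("output.pkl"):
        stack = [joblib.load(output_file)]
        while stack:  # walk nested containers looking for placeholders
            obj = stack.pop()
            if hasattr(obj, "digest"):  # assumed placeholder attribute
                referenced.add(obj.digest)
            elif isinstance(obj, dict):
                stack.extend(obj.values())
            elif isinstance(obj, (list, tuple)):
                stack.extend(obj)
    for blob in (cache_root / "blobs").iterdir():
        if blob.name not in referenced:  # assumes blobs are named by their hash
            blob.unlink()  # orphaned: no surviving cache entry points at it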

License

This project is licensed under the MIT License.
