decache
A custom store backend for joblib that de-duplicates large NumPy arrays to save significant disk space in scientific computing and machine learning workflows.
The Problem
joblib.Memory is an invaluable tool for caching the results of expensive function calls. It works by serializing the function's inputs and outputs and storing the output in a file-system cache. When the function is called again with the same inputs, the output is retrieved from the cache, avoiding re-computation.
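As a refresher, the basic joblib.Memory workflow looks like the sketch below (the function name and cache location are illustrative):

```python
import tempfile

import joblib
import numpy as np

cache_dir = tempfile.mkdtemp()
memory = joblib.Memory(cache_dir, verbose=0)

@memory.cache
def expensive(n):
    # Stand-in for a slow computation
    return np.arange(n, dtype=np.float64) ** 2

first = expensive(1_000)   # executes the function and writes the result to disk
second = expensive(1_000)  # same inputs: the result is read back from the cache
```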
However, the standard caching mechanism operates at the function call level. If two different function calls (or calls to different functions) happen to return the exact same large NumPy array, joblib will save this array to disk twice, once for each cache entry.
In data-intensive fields, it's common for various preprocessing or simulation steps to produce identical, multi-gigabyte arrays. This redundancy can lead to massive consumption of disk space, especially in long-running projects with extensive caching.
The Solution: Content-Addressable Storage
This project introduces the ContentAddressableStoreBackend (CAS), a drop-in replacement backend for joblib.Memory that solves this problem through content-addressable storage.
Instead of storing everything in a single file per function call, it intelligently separates large NumPy arrays and stores them based on the hash of their content.
How It Works
- Interception: The backend intercepts the output of a cached function before it's saved to disk.
- Traversal & Identification: It recursively scans the output for NumPy arrays that exceed a configurable size threshold (e.g., 16 MB).
- Hashing & De-duplication: For each large array, it computes a unique hash of its data. The array is then saved to a central blobs directory, using its hash as the filename. If a file with that name already exists, the exact same array has been seen before and no new data is written.
- Replacement: In the original function output, the large array is replaced with a small, lightweight placeholder object. This placeholder contains the array's hash and metadata (such as its shape and data type).
- Reconstruction: When a result is loaded from the cache, the backend reverses the process. It loads the structure containing placeholders and, for each placeholder, reads the corresponding array from the blobs directory to seamlessly reconstruct the original object.
The result is that any given large array is only ever stored once on disk, regardless of how many times it appears in your cache.
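The hashing and de-duplication steps above can be sketched roughly as follows. This is a minimal illustration, not the backend's actual internals; blob_hash, store_blob, and the file layout are assumptions:

```python
import hashlib
from pathlib import Path

import numpy as np

def blob_hash(arr: np.ndarray) -> str:
    # Content hash over dtype, shape, and the raw bytes, so two arrays
    # collide only if they are truly identical
    h = hashlib.sha256()
    h.update(str(arr.dtype).encode())
    h.update(str(arr.shape).encode())
    h.update(np.ascontiguousarray(arr).tobytes())
    return h.hexdigest()

def store_blob(arr: np.ndarray, blobs_dir: Path) -> str:
    # Write the array at most once; identical content maps to the same file
    blobs_dir.mkdir(parents=True, exist_ok=True)
    digest = blob_hash(arr)
    path = blobs_dir / f"{digest}.npy"
    if not path.exists():  # de-duplication: skip the write if the blob exists
        np.save(path, arr)
    return digest
```

Because the filename is derived from the content, writing the same array from two different cache entries is a no-op the second time.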
Features
- Automatic De-duplication: Drastically reduces disk space usage when caching functions that return identical large NumPy arrays.
- Transparent Integration: Works as a drop-in backend for joblib.Memory with no changes needed to your existing cached functions.
- Configurable Threshold: You can easily define what constitutes a "large" array via backend_options.
- Handles Nested Structures: Correctly processes large arrays nested within lists, tuples, and dictionaries.
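The nested-structure handling can be pictured as a recursive sweep like the one below. The Placeholder class, the threshold value, and the store callback are hypothetical names for illustration, not the package's API:

```python
import numpy as np

THRESHOLD = 1024  # bytes; stands in for the real large_array_threshold option

class Placeholder:
    # Hypothetical stand-in recording what the backend would keep inline
    def __init__(self, digest, shape, dtype):
        self.digest, self.shape, self.dtype = digest, shape, dtype

def replace_large_arrays(obj, store):
    # Recurse through lists, tuples, and dicts; swap big arrays for placeholders
    if isinstance(obj, np.ndarray) and obj.nbytes >= THRESHOLD:
        return Placeholder(store(obj), obj.shape, obj.dtype)
    if isinstance(obj, list):
        return [replace_large_arrays(v, store) for v in obj]
    if isinstance(obj, tuple):
        return tuple(replace_large_arrays(v, store) for v in obj)
    if isinstance(obj, dict):
        return {k: replace_large_arrays(v, store) for k, v in obj.items()}
    return obj  # small arrays and other objects pass through untouched
```

Reconstruction on load is the mirror image: walk the structure and swap each placeholder back for the array read from its blob.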
Installation
pip install decache
Usage
Using the backend is straightforward. First, you must register it with joblib, and then you can instantiate joblib.Memory with the new backend.
Here is a complete example demonstrating the de-duplication feature:
import joblib
import numpy as np
import shutil
from pathlib import Path

# 1. Import and register the custom backend
from decache.store_backend import register_cas_store_backend
register_cas_store_backend()

# 2. Define the cache directory and create the Memory object.
#    Specify 'cas' as the backend and configure the threshold.
CACHE_DIR = "/tmp/decache"
memory = joblib.Memory(
    location=CACHE_DIR,
    backend="cas",
    backend_options={'large_array_threshold': 1 * 1024 * 1024},  # 1 MB
    verbose=10,
)

# This array is large (~8 MB) and its content is identical every time.
IDENTICAL_LARGE_ARRAY = np.arange(1024 * 1024, dtype=np.float64)

@memory.cache
def process_data_source_a(source_id):
    """A cached function that returns a large, constant array."""
    print(f"Executing process_data_source_a for source '{source_id}'...")
    return IDENTICAL_LARGE_ARRAY.copy()

@memory.cache
def process_data_source_b(config_dict):
    """A completely different cached function that returns the same large array."""
    print(f"Executing process_data_source_b for config '{config_dict}'...")
    return IDENTICAL_LARGE_ARRAY.copy()

if __name__ == '__main__':
    # Call the first function. It will run and store the result.
    # A single blob for IDENTICAL_LARGE_ARRAY will be created.
    result_a = process_data_source_a("source1")

    # Call the second function. Its input is different, so it will also run.
    # However, since its output array is identical, it will *reuse* the existing blob.
    result_b = process_data_source_b({'user': 'test', 'version': 2})

    print("\n--- Cache Inspection ---")
    blobs_dir = Path(CACHE_DIR) / "joblib" / "blobs"
    blob_count = len(list(blobs_dir.iterdir()))
    print(f"Number of blobs stored: {blob_count}")

    # Although two functions were cached, only one blob file was created.
    assert blob_count == 1

    # Clean up
    # shutil.rmtree(CACHE_DIR)
Running Tests
The backend comes with a comprehensive test suite using pytest. To run the tests, first install the dependencies:
pip install pytest
Then, from the root of the project, simply run:
pytest
Limitations and Future Work
- Garbage Collection: This implementation does not automatically clean up "orphaned" blobs. If cache entries are cleared (e.g., via memory.clear()), the corresponding files in the blobs directory are not removed. A separate garbage collection script could be implemented to scan all cache entries and remove any unreferenced blobs.
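The sweep half of such a garbage collector could look like the sketch below. How the set of referenced hashes is gathered depends on the backend's entry format, so here it is simply passed in; collect_garbage is a hypothetical name:

```python
from pathlib import Path

def collect_garbage(blobs_dir: Path, referenced: set) -> int:
    # Remove blob files whose hash is not referenced by any cache entry.
    # `referenced` holds the hex digests still in use (the mark phase,
    # scanning all cache entries, is assumed to have run already).
    removed = 0
    for blob in blobs_dir.iterdir():
        if blob.stem not in referenced:
            blob.unlink()
            removed += 1
    return removed
```

Note this is only safe while no cached function is writing concurrently, which is one reason the mark-and-sweep pass is left to a separate script.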
License
This project is licensed under the MIT License.
Project details
Download files
Download the file for your platform.
Source Distribution
Built Distribution
File details
Details for the file decache-0.1.7.tar.gz.
File metadata
- Download URL: decache-0.1.7.tar.gz
- Upload date:
- Size: 10.7 kB
- Tags: Source
- Uploaded using Trusted Publishing? Yes
- Uploaded via: pdm/2.26.1 CPython/3.14.0 Linux/6.11.0-1018-azure
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | aca4c62f248ea461ebc269330e8be298b0eca8402399e4f95b475057ef522a3b |
| MD5 | 3e34a1ca07aab7b1865fa6a6f562a33a |
| BLAKE2b-256 | ac4e41c2aeba78c937b331b2048a171245b6ac0d1f0fb3c012ba297835faf5c0 |
File details
Details for the file decache-0.1.7-py3-none-any.whl.
File metadata
- Download URL: decache-0.1.7-py3-none-any.whl
- Upload date:
- Size: 7.2 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? Yes
- Uploaded via: pdm/2.26.1 CPython/3.14.0 Linux/6.11.0-1018-azure
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 77b799eb7b666ded8bbbdcc9acb8f545d2b2c4bef4590d0b3c43ac974f5cf450 |
| MD5 | 0c78e2c3eedd67432f31b65e8ef02a09 |
| BLAKE2b-256 | c97a7ee9739c380ef32526e32995cd78e3d4866418055a3894d7f479bc1c750d |