Cache Saver
A Modular Framework for Efficient, Affordable, and Reproducible LLM Inference
Accepted at EMNLP 2025 (Findings)
Cache Saver is a modular, plug-and-play, and asynchronous framework that facilitates high-level inference optimizations, integrating cleanly into existing systems without requiring changes to the end-user application logic or the underlying LLM. At its heart is a namespace-aware list-valued cache that ensures statistical integrity of LLM responses by generating i.i.d. responses within a namespace while enabling response reuse across namespaces, all while guaranteeing full reproducibility.
On average across five reasoning strategies, five benchmark tasks, and three LLMs, Cache Saver reduces USD cost by ~25% and CO2 emissions by ~35%. In practical scenarios such as benchmarking and ablation analysis, savings reach up to 60%.
# Just change the import — everything else stays the same
from cachesaver.models.openai import AsyncOpenAI
client = AsyncOpenAI()
response = await client.chat.completions.create(
    model="gpt-4.1-nano",
    messages=[{"role": "user", "content": "What is the capital of France?"}],
)
Installation
pip install cachesaver
For local HuggingFace Transformers inference:
pip install cachesaver[transformers]
Quick Start
Replace your LLM client import with Cache Saver's — the rest of your code is unchanged:
# Before
from openai import AsyncOpenAI
# After
from cachesaver.models.openai import AsyncOpenAI
client = AsyncOpenAI()
response = await client.chat.completions.create(
    model="gpt-4.1-nano",
    messages=[{"role": "user", "content": "What's the meaning of life?"}],
)
# Run again → A new sample is generated
response = await client.chat.completions.create(
    model="gpt-4.1-nano",
    messages=[{"role": "user", "content": "What's the meaning of life?"}],
)
# Re-initialize the client and run again → The responses are retrieved from the cache
client = AsyncOpenAI()
response = await client.chat.completions.create(
    model="gpt-4.1-nano",
    messages=[{"role": "user", "content": "What's the meaning of life?"}],
)
A synchronous client is also available:
from cachesaver.models.openai import OpenAI
client = OpenAI()
response = client.chat.completions.create(
    model="gpt-4.1-nano",
    messages=[{"role": "user", "content": "What's the meaning of life?"}],
)
Supported Providers
| Provider | Import |
|---|---|
| OpenAI | from cachesaver.models.openai import AsyncOpenAI, OpenAI |
| Anthropic | from cachesaver.models.anthropic import AsyncAnthropic, Anthropic |
| Google Gemini | from cachesaver.models.gemini import AsyncGemini, Gemini |
| Together AI | from cachesaver.models.together import AsyncTogether |
| Groq | from cachesaver.models.groq import AsyncGroq, Groq |
| OpenRouter | from cachesaver.models.openrouter import AsyncOpenRouter, OpenRouter |
| HuggingFace (Inference Providers) | from cachesaver.models.huggingface import AsyncHuggingFace, HuggingFace |
| vLLM | from cachesaver.models.vllm import AsyncVLLM, VLLM |
| HuggingFace Transformers | from cachesaver.models.transformers import AsyncHFTransformers, HFTransformers |
All cloud providers use the same interface as their original SDK. Just change the import.
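For example, switching to Anthropic only changes the import. The snippet below is a minimal sketch that assumes the drop-in client mirrors the official Anthropic SDK's messages.create call; the model name is illustrative:
from cachesaver.models.anthropic import AsyncAnthropic
client = AsyncAnthropic()
response = await client.messages.create(
    model="claude-3-5-haiku-latest",
    max_tokens=256,
    messages=[{"role": "user", "content": "What is the capital of France?"}],
)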
Key Features
Statistical Integrity via Namespaced Caching
Unlike naive key-value caches, Cache Saver uses a list-valued cache managed through namespaces. Within a namespace, all responses to a given prompt are guaranteed to be i.i.d. — a response is never reused within the same namespace. Across namespaces, responses are reused via stochastic coupling, which is what drives the cost savings. This is critical for scenarios like stochastic sampling, uncertainty estimation, and policy diversity, where multiple independent responses to the same prompt are required.
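To make the mechanism concrete, here is a minimal sketch of a namespace-aware list-valued cache in plain Python. It illustrates the caching principle rather than Cache Saver's internal API (all names here are illustrative): each prompt maps to a list of samples, and each namespace tracks how many of them it has already consumed, so a namespace never sees the same cached sample twice while other namespaces are free to reuse it.
from collections import defaultdict

class ListValuedCache:
    """Illustrative sketch, not Cache Saver's internals."""

    def __init__(self):
        self.samples = defaultdict(list)   # prompt -> list of generated samples
        self.consumed = defaultdict(int)   # (namespace, prompt) -> samples already handed out

    async def get(self, prompt, namespace, sample_fn):
        idx = self.consumed[(namespace, prompt)]
        self.consumed[(namespace, prompt)] += 1
        if idx < len(self.samples[prompt]):
            # Reuse a sample generated for another namespace (stochastic coupling).
            return self.samples[prompt][idx]
        # Otherwise draw a fresh i.i.d. sample and remember it for future reuse.
        response = await sample_fn(prompt)
        self.samples[prompt].append(response)
        return response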
Reproducibility
Namespaces track which cached responses have been consumed, so re-running an experiment from scratch replays the exact same results in the exact same order — even for duplicate prompts.
# Run 1 — calls the API
results_run1 = await classify(sentences, namespace="experiment_v1")
# Run 2 — new namespace, identical results from cache
results_run2 = await classify(sentences, namespace="experiment_v2")
assert results_run1 == results_run2 # Always true
Error Recovery
Crash on item 7 of 10? Re-run and items 1–6 are served from cache instantly. Only items 7–10 hit the API.
# Attempt 1 — crashes at item 7
try:
    results = await process(items, namespace="my_exp")
except RuntimeError:
    pass  # Items 1-6 are cached
# Attempt 2 — items 1-6 from cache, only 7-10 call API
results = await process(items, namespace="my_exp")
Async Parallelism
Fully async-native. Use asyncio.gather for concurrent requests:
import asyncio

results = await asyncio.gather(*[
    client.chat.completions.create(
        model="gpt-4.1-nano",
        messages=[{"role": "user", "content": prompt}],
    )
    for prompt in prompts
])
Deterministic Async Ordering
When multiple async agents process the same prompt concurrently, Cache Saver keys cached responses by a stable request id rather than by arrival or completion order. A built-in reordering module ensures replays are deterministic regardless of which task finishes first.
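The following sketch illustrates the idea with plain asyncio rather than Cache Saver's built-in reorderer (function names are illustrative): every request is tagged with a stable id, results are collected as tasks finish, and the list is sorted back by id so the returned order never depends on completion order.
import asyncio

async def collect_in_request_order(prompts, call):
    async def tagged(request_id, prompt):
        return request_id, await call(prompt)

    tasks = [asyncio.create_task(tagged(i, p)) for i, p in enumerate(prompts)]
    results = []
    for finished in asyncio.as_completed(tasks):   # completion order is nondeterministic
        results.append(await finished)
    results.sort(key=lambda pair: pair[0])         # the stable request id restores determinism
    return [response for _, response in results]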
Why It Works: Reuse Potential in LLM Reasoning
Multi-step reasoning strategies (Tree-of-Thought, ReAct, RAP, FoA, ReST-MCTS*, etc.) are highly repetitive — ~50% of prompts are duplicates both within a single method execution and across methods on the same task. Cache Saver exploits this redundancy across three practical scenarios:
Three practical scenarios using GPT-4.1-Nano across the benchmarks of Game of 24, HumanEval, and SciBench.
The figure shows Cache Saver's impact across three practical ML scenarios. A1-Hyperparameter tuning: grid search over Tree-of-Thought configurations (tree width, depth, number of evaluations). A2-Ablation analysis: testing three variations of the FoA algorithm (removing the selection phase, backtracking, or resampling). A3-Benchmarking: comparing entirely different reasoning strategies (ToT, GoT, FoA).
The blue bars show the cost without Cache Saver. The orange bars show the average cost with Cache Saver; because experiments share prompts, cached responses are reused and the average cost drops significantly. The green bars show the marginal cost, i.e., the added cost of incorporating one more configuration, variation, or method into the experiment.
The reuse potential depends on how similar the experiments are: hyperparameter tuning (A1) achieves the highest savings (6x lower cost, tokens, and latency) since different configurations of the same method share most prompts. Ablation analysis (A2) achieves 2.5x savings. Finally, benchmarking across different methods (A3) still achieves 2x savings, a notable finding since even structurally different reasoning strategies share significant prompt overlap. These savings are on top of existing platform-level optimizations (paged attention, KV caching, prefix sharing, etc.).
Architecture
Cache Saver composes four async pipeline components around your model:
| Component | Role |
|---|---|
| Cacher | Namespace-aware list-valued cache with per-key async mutexes. Tracks per-namespace usage counts for i.i.d. sampling. |
| Deduplicator | Merges duplicate prompts within a batch by (hash, namespace), combines n values, redistributes responses. |
| Reorderer | Sorts by stable identifier before processing, restores original order after. Ensures deterministic results. |
| Batcher | Async producer-consumer queue. Groups requests by batch_size with timeout. |
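As a rough sketch of the Deduplicator's job (the data shapes and helper names below are illustrative, not Cache Saver's internals): requests in a batch are grouped by (prompt hash, namespace), their requested sample counts are summed so each unique prompt is sent upstream once, and the returned samples are then redistributed to the original requests.
import hashlib
from collections import defaultdict

def group_requests(batch):
    """batch: list of dicts with 'prompt', 'namespace', and 'n' (number of samples)."""
    groups = defaultdict(lambda: {"n": 0, "members": []})
    for i, request in enumerate(batch):
        key = (hashlib.sha256(request["prompt"].encode()).hexdigest(), request["namespace"])
        groups[key]["n"] += request["n"]        # one upstream call covers every duplicate
        groups[key]["members"].append(i)
    return groups

def redistribute(groups, batch, samples_by_key):
    """Hand each original request its share of the samples returned for its group."""
    results = [None] * len(batch)
    for key, group in groups.items():
        samples = iter(samples_by_key[key])
        for i in group["members"]:
            results[i] = [next(samples) for _ in range(batch[i]["n"])]
    return results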
Local Model Inference
For HuggingFace Transformers models running on local GPUs:
from cachesaver.models.transformers import AsyncHFTransformers
client = AsyncHFTransformers(
    model_name="meta-llama/Llama-3.2-1B-Instruct",
    namespace="local_exp",
    cachedir="./cache",
    batch_size=8,
)
response = await client.chat.completions.create(
    messages=[{"role": "user", "content": "What is the capital of France?"}],
    max_new_tokens=20,
)
Examples
See the examples/ directory:
- tutorial.ipynb — Full walkthrough: quickstart, reproducibility, error recovery, parallelism, ReAct agents, Tree-of-Thought, and RAG pipelines.
- providers_example.ipynb — Usage examples for all supported providers.
Requirements
- Python >= 3.10
Citation
@inproceedings{potamitis2025cache,
  title={Cache Saver: A Modular Framework for Efficient, Affordable, and Reproducible {LLM} Inference},
  author={Nearchos Potamitis and Lars Henning Klein and Bardia Mohammadi and Chongyang Xu and Attreyee Mukherjee and Niket Tandon and Laurent Bindschaedler and Akhil Arora},
  booktitle={The 2025 Conference on Empirical Methods in Natural Language Processing},
  year={2025},
  url={https://openreview.net/forum?id=2Nxih3ySSi}
}
File details
Details for the file cachesaver-0.0.5.tar.gz.
File metadata
- Download URL: cachesaver-0.0.5.tar.gz
- Upload date:
- Size: 38.7 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.10.16
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 83a9df96fd9db827c7c9399553006d56e6b60714e2a6d2bf732a5b040fadec0b |
| MD5 | d3a09fb3ebcf1fa960f84108ab0bb2bd |
| BLAKE2b-256 | 18d12ad258fe2366651c48c2c6d9fd0a25961ed72204e97508f4ab48b38c0c7d |
File details
Details for the file cachesaver-0.0.5-py3-none-any.whl.
File metadata
- Download URL: cachesaver-0.0.5-py3-none-any.whl
- Upload date:
- Size: 25.6 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.10.16
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 62c2da0ae2aea82f93b36b68f262eb65eae9f051571183caebeaf6c32719ea61 |
| MD5 | c04d5b37d294ddcbe41ce934ecc157e7 |
| BLAKE2b-256 | 3d125c551430381d87e04207322eabfc809427268106c5b3a465ef88fe43509b |