
Cache Saver

A Modular Framework for Efficient, Affordable, and Reproducible LLM Inference

Accepted at EMNLP 2025 (Findings)


Cache Saver is a modular, plug-and-play, and asynchronous framework that facilitates high-level inference optimizations, integrating cleanly into existing systems without requiring changes to the end-user application logic or the underlying LLM. At its heart is a namespace-aware list-valued cache that ensures statistical integrity of LLM responses by generating i.i.d. responses within a namespace while enabling response reuse across namespaces, all while guaranteeing full reproducibility.

On average across five reasoning strategies, five benchmark tasks, and three LLMs, Cache Saver reduces USD cost by ~25% and CO2 emissions by ~35%. In practical scenarios such as benchmarking and ablation analysis, savings reach up to 60%.

# Just change the import — everything else stays the same
from cachesaver.models.openai import AsyncOpenAI

client = AsyncOpenAI()

response = await client.chat.completions.create(
    model="gpt-4.1-nano",
    messages=[{"role": "user", "content": "What is the capital of France?"}],
)

Installation

pip install cachesaver

For local HuggingFace Transformers inference:

pip install "cachesaver[transformers]"

Quick Start

Replace your LLM client import with Cache Saver's — the rest of your code is unchanged:

# Before
from openai import AsyncOpenAI

# After
from cachesaver.models.openai import AsyncOpenAI

client = AsyncOpenAI()

response = await client.chat.completions.create(
    model="gpt-4.1-nano",
    messages=[{"role": "user", "content": "What's the meaning of life?"}],
)

# Run again → A new sample is generated
response = await client.chat.completions.create(
    model="gpt-4.1-nano",
    messages=[{"role": "user", "content": "What's the meaning of life?"}],
)

# Re-initialize the client and run again → The responses are retrieved from the cache
client = AsyncOpenAI()

response = await client.chat.completions.create(
    model="gpt-4.1-nano",
    messages=[{"role": "user", "content": "What's the meaning of life?"}],
)

A synchronous client is also available:

from cachesaver.models.openai import OpenAI

client = OpenAI()

response = client.chat.completions.create(
    model="gpt-4.1-nano",
    messages=[{"role": "user", "content": "What's the meaning of life?"}],
)

Supported Providers

Provider                             Import
OpenAI                               from cachesaver.models.openai import AsyncOpenAI, OpenAI
Anthropic                            from cachesaver.models.anthropic import AsyncAnthropic, Anthropic
Google Gemini                        from cachesaver.models.gemini import AsyncGemini, Gemini
Together AI                          from cachesaver.models.together import AsyncTogether
Groq                                 from cachesaver.models.groq import AsyncGroq, Groq
OpenRouter                           from cachesaver.models.openrouter import AsyncOpenRouter, OpenRouter
HuggingFace (Inference Providers)    from cachesaver.models.huggingface import AsyncHuggingFace, HuggingFace
vLLM                                 from cachesaver.models.vllm import AsyncVLLM, VLLM
HuggingFace Transformers             from cachesaver.models.transformers import AsyncHFTransformers, HFTransformers

All cloud providers use the same interface as their original SDK. Just change the import.

Key Features

Statistical Integrity via Namespaced Caching

Unlike naive key-value caches, Cache Saver uses a list-valued cache managed through namespaces. Within a namespace, all responses to a given prompt are guaranteed to be i.i.d. — a response is never reused within the same namespace. Across namespaces, responses are reused via stochastic coupling, which is what drives the cost savings. This is critical for scenarios like stochastic sampling, uncertainty estimation, and policy diversity, where multiple independent responses to the same prompt are required.
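
The mechanism can be sketched in a few lines of plain Python. This is a toy illustration of the idea, not Cache Saver's actual implementation; the class and method names here are invented for the example:

```python
from collections import defaultdict

class ListValuedCache:
    """Toy namespace-aware list-valued cache: each prompt maps to a growing
    list of responses, and each namespace tracks how many it has consumed."""

    def __init__(self, generate):
        self.generate = generate              # fallback that calls the model
        self.responses = defaultdict(list)    # prompt -> [response, ...]
        self.consumed = defaultdict(int)      # (namespace, prompt) -> entries used

    def sample(self, prompt, namespace):
        used = self.consumed[(namespace, prompt)]
        if used < len(self.responses[prompt]):
            # Another namespace already paid for this sample: reuse it.
            response = self.responses[prompt][used]
        else:
            # This namespace exhausted the list: draw a fresh i.i.d. sample.
            response = self.generate(prompt)
            self.responses[prompt].append(response)
        self.consumed[(namespace, prompt)] += 1
        return response

calls = []
def model(prompt):
    calls.append(prompt)
    return f"{prompt}/sample{len(calls)}"

cache = ListValuedCache(model)
a1 = cache.sample("2+2?", "ns_a")   # cache miss: model call
a2 = cache.sample("2+2?", "ns_a")   # same namespace: forced fresh sample
b1 = cache.sample("2+2?", "ns_b")   # new namespace: reuses a1, no model call
```

Only two model calls are made: the second namespace reuses the first namespace's sample, while within `ns_a` the two draws are guaranteed to be distinct, independent samples.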

Reproducibility

Namespaces track which cached responses have been consumed, so re-running an experiment from scratch replays the exact same results in the exact same order — even for duplicate prompts.

# Run 1 — calls the API
results_run1 = await classify(sentences, namespace="experiment_v1")

# Run 2 — new namespace, identical results from cache
results_run2 = await classify(sentences, namespace="experiment_v2")
assert results_run1 == results_run2  # Always true

Error Recovery

Crash on item 7 of 10? Re-run and items 1–6 are served from cache instantly. Only items 7–10 hit the API.

# Attempt 1 — crashes at item 7
try:
    results = await process(items, namespace="my_exp")
except RuntimeError:
    pass  # Items 1-6 are cached

# Attempt 2 — items 1-6 from cache, only 7-10 call API
results = await process(items, namespace="my_exp")

Async Parallelism

Fully async-native. Use asyncio.gather for concurrent requests:

results = await asyncio.gather(*[
    client.chat.completions.create(
        model="gpt-4.1-nano",
        messages=[{"role": "user", "content": prompt}],
    )
    for prompt in prompts
])

Deterministic Async Ordering

When multiple async agents process the same prompt concurrently, Cache Saver caches by request id — not request or completion order. A built-in reordering module ensures replays are deterministic regardless of which task finishes first.
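
The idea behind the reordering can be illustrated with plain asyncio (an illustrative sketch, not Cache Saver's Reorderer): tag each request with a stable id at dispatch time, let tasks finish in any order, then sort results by id before returning them:

```python
import asyncio
import random

async def fake_model(request_id: int, prompt: str) -> tuple[int, str]:
    # Simulate nondeterministic completion order with a random delay.
    await asyncio.sleep(random.random() * 0.01)
    return request_id, f"response to {prompt}"

async def run(prompts: list[str]) -> list[str]:
    # Tag each request with a stable id at dispatch time ...
    coros = [fake_model(i, p) for i, p in enumerate(prompts)]
    # ... collect results in whatever order tasks happen to finish ...
    finished = [await fut for fut in asyncio.as_completed(coros)]
    # ... then restore the original request order by stable id.
    finished.sort(key=lambda pair: pair[0])
    return [response for _, response in finished]

prompts = ["a", "b", "c", "d"]
results = asyncio.run(run(prompts))
# results is ordered by request id, no matter which task finished first
```

Because the final order depends only on the ids assigned at dispatch, a replay of the same requests produces the same ordering every time.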

Why It Works: Reuse Potential in LLM Reasoning

Multi-step reasoning strategies (Tree-of-Thought, ReAct, RAP, FoA, ReST-MCTS*, etc.) are highly repetitive — ~50% of prompts are duplicates both within a single method execution and across methods on the same task. Cache Saver exploits this redundancy across three practical scenarios:

[Figure] Practical application results across cost, tokens, latency, and throughput: three practical scenarios using GPT-4.1-Nano on the Game of 24, HumanEval, and SciBench benchmarks.

The figure shows Cache Saver's impact across three practical ML scenarios. A1-Hyperparameter tuning: grid search over Tree-of-Thought configurations (tree width, depth, number of evaluations). A2-Ablation analysis: testing three variations of the FoA algorithm (removing the selection phase, backtracking, or resampling). A3-Benchmarking: comparing entirely different reasoning strategies (ToT, GoT, FoA).

The blue bars show the cost without Cache Saver. The orange bars show the average cost with Cache Saver: because the experiments share prompts, cached responses are reused and the average cost drops significantly. The green bars show the marginal cost, i.e., the added cost of incorporating one more configuration, variation, or method into the experiment.

The reuse potential depends on how similar the experiments are: hyperparameter tuning (A1) achieves the highest savings (6x lower cost, token usage, and latency), since different configurations of the same method share most prompts. Ablation analysis (A2) achieves 2.5x savings. Benchmarking across different methods (A3) still achieves 2x savings, a notable finding since even structurally different reasoning strategies share significant prompt overlap. These savings come on top of existing platform-level optimizations (paged attention, KV caching, prefix sharing, etc.).

Architecture

Cache Saver composes four async pipeline components around your model:

Component      Role
Cacher         Namespace-aware list-valued cache with per-key async mutexes. Tracks per-namespace usage counts for i.i.d. sampling.
Deduplicator   Merges duplicate prompts within a batch by (hash, namespace), combines n values, redistributes responses.
Reorderer      Sorts by stable identifier before processing, restores original order after. Ensures deterministic results.
Batcher        Async producer-consumer queue. Groups requests by batch_size with a timeout.
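
To make the Deduplicator's role concrete, here is a toy sketch (function names and data shapes are invented for illustration, not Cache Saver's internals): duplicate requests in a batch are merged by (prompt, namespace), their requested sample counts n are summed, and the merged response list is later split back to the original requests:

```python
from collections import defaultdict

def deduplicate(batch):
    """Merge duplicate requests by (prompt, namespace), summing their n."""
    merged = {}                 # (prompt, namespace) -> total n to request
    slots = defaultdict(list)   # (prompt, namespace) -> [(batch index, n), ...]
    for i, (prompt, namespace, n) in enumerate(batch):
        key = (prompt, namespace)
        merged[key] = merged.get(key, 0) + n
        slots[key].append((i, n))
    return merged, slots

def redistribute(merged_responses, slots, batch_len):
    """Split each merged response list back to the original requests."""
    out = [None] * batch_len
    for key, responses in merged_responses.items():
        cursor = 0
        for i, n in slots[key]:
            out[i] = responses[cursor:cursor + n]
            cursor += n
    return out

batch = [("p1", "ns", 2), ("p2", "ns", 1), ("p1", "ns", 1)]
merged, slots = deduplicate(batch)
# "p1" is requested twice, so only one merged call for 3 samples is needed.
# Pretend the model returned 3 samples for p1 and 1 for p2:
responses = {("p1", "ns"): ["r1", "r2", "r3"], ("p2", "ns"): ["r4"]}
out = redistribute(responses, slots, len(batch))
```

Each original request gets back exactly the number of samples it asked for, while the model sees only one request per distinct (prompt, namespace) pair.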

Local Model Inference

For HuggingFace Transformers models running on local GPUs:

from cachesaver.models.transformers import AsyncHFTransformers

client = AsyncHFTransformers(
    model_name="meta-llama/Llama-3.2-1B-Instruct",
    namespace="local_exp",
    cachedir="./cache",
    batch_size=8,
)

response = await client.chat.completions.create(
    messages=[{"role": "user", "content": "What is the capital of France?"}],
    max_new_tokens=20,
)

Examples

See the examples/ directory:

  • tutorial.ipynb — Full walkthrough: quickstart, reproducibility, error recovery, parallelism, ReAct agents, Tree-of-Thought, and RAG pipelines.
  • providers_example.ipynb — Usage examples for all supported providers.

Requirements

  • Python >= 3.10

Citation

@inproceedings{potamitis2025cache,
  title     = {Cache Saver: A Modular Framework for Efficient, Affordable, and Reproducible {LLM} Inference},
  author    = {Nearchos Potamitis and Lars Henning Klein and Bardia Mohammadi and Chongyang Xu and Attreyee Mukherjee and Niket Tandon and Laurent Bindschaedler and Akhil Arora},
  booktitle = {The 2025 Conference on Empirical Methods in Natural Language Processing},
  year      = {2025},
  url       = {https://openreview.net/forum?id=2Nxih3ySSi}
}

