
Cache Saver

A Modular Framework for Efficient, Affordable, and Reproducible LLM Inference

Accepted at EMNLP 2025 (Findings)


Cache Saver is a modular, plug-and-play, and asynchronous framework that facilitates high-level inference optimizations, integrating cleanly into existing systems without requiring changes to the end-user application logic or the underlying LLM. At its heart is a namespace-aware list-valued cache that ensures statistical integrity of LLM responses by generating i.i.d. responses within a namespace while enabling response reuse across namespaces, all while guaranteeing full reproducibility.

On average across five reasoning strategies, five benchmark tasks, and three LLMs, Cache Saver reduces USD cost by ~25% and CO2 emissions by ~35%. In practical scenarios such as benchmarking and ablation analysis, savings reach up to 60%.

# Just change the import — everything else stays the same
from cachesaver.models.openai import AsyncOpenAI

client = AsyncOpenAI()

response = await client.chat.completions.create(
    model="gpt-4.1-nano",
    messages=[{"role": "user", "content": "What is the capital of France?"}],
)

Installation

pip install cachesaver

For local HuggingFace Transformers inference:

pip install "cachesaver[transformers]"

Quick Start

Replace your LLM client import with Cache Saver's — the rest of your code is unchanged:

# Before
from openai import AsyncOpenAI

# After
from cachesaver.models.openai import AsyncOpenAI

client = AsyncOpenAI()

response = await client.chat.completions.create(
    model="gpt-4.1-nano",
    messages=[{"role": "user", "content": "What's the meaning of life?"}],
)

# Run again → A new sample is generated
response = await client.chat.completions.create(
    model="gpt-4.1-nano",
    messages=[{"role": "user", "content": "What's the meaning of life?"}],
)

# Re-initialize the client and run again → The responses are retrieved from the cache
client = AsyncOpenAI()

response = await client.chat.completions.create(
    model="gpt-4.1-nano",
    messages=[{"role": "user", "content": "What's the meaning of life?"}],
)

A synchronous client is also available:

from cachesaver.models.openai import OpenAI

client = OpenAI()

response = client.chat.completions.create(
    model="gpt-4.1-nano",
    messages=[{"role": "user", "content": "What's the meaning of life?"}],
)

Supported Providers

Provider                             Import
OpenAI                               from cachesaver.models.openai import AsyncOpenAI, OpenAI
Anthropic                            from cachesaver.models.anthropic import AsyncAnthropic, Anthropic
Google Gemini                        from cachesaver.models.gemini import AsyncGemini, Gemini
Together AI                          from cachesaver.models.together import AsyncTogether
Groq                                 from cachesaver.models.groq import AsyncGroq, Groq
OpenRouter                           from cachesaver.models.openrouter import AsyncOpenRouter, OpenRouter
HuggingFace (Inference Providers)    from cachesaver.models.huggingface import AsyncHuggingFace, HuggingFace
vLLM                                 from cachesaver.models.vllm import AsyncVLLM, VLLM
HuggingFace Transformers             from cachesaver.models.transformers import AsyncHFTransformers, HFTransformers

All cloud providers use the same interface as their original SDK. Just change the import.

Key Features

Statistical Integrity via Namespaced Caching

Unlike naive key-value caches, Cache Saver uses a list-valued cache managed through namespaces. Within a namespace, all responses to a given prompt are guaranteed to be i.i.d. — a response is never reused within the same namespace. Across namespaces, responses are reused via stochastic coupling, which is what drives the cost savings. This is critical for scenarios like stochastic sampling, uncertainty estimation, and policy diversity, where multiple independent responses to the same prompt are required.
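
The mechanism can be sketched in a few lines of plain Python. This is a toy illustration of the idea, not Cache Saver's actual implementation; the class and method names here are invented for the example:

```python
from collections import defaultdict

class ListValuedCache:
    """Toy namespace-aware list-valued cache: each prompt maps to a growing
    list of responses, and each namespace tracks how many it has consumed."""

    def __init__(self, generate):
        self.generate = generate              # fallback that calls the model
        self.responses = defaultdict(list)    # prompt -> [response, ...]
        self.consumed = defaultdict(int)      # (namespace, prompt) -> entries used

    def sample(self, prompt, namespace):
        used = self.consumed[(namespace, prompt)]
        if used < len(self.responses[prompt]):
            # Another namespace already paid for this sample: reuse it.
            response = self.responses[prompt][used]
        else:
            # This namespace exhausted the list: draw a fresh i.i.d. sample.
            response = self.generate(prompt)
            self.responses[prompt].append(response)
        self.consumed[(namespace, prompt)] += 1
        return response

calls = []
def model(prompt):
    calls.append(prompt)
    return f"{prompt}/sample{len(calls)}"

cache = ListValuedCache(model)
a1 = cache.sample("2+2?", "ns_a")   # cache miss: model call
a2 = cache.sample("2+2?", "ns_a")   # same namespace: forced fresh sample
b1 = cache.sample("2+2?", "ns_b")   # new namespace: reuses a1, no model call
```

Only two model calls are made: the second namespace reuses the first namespace's sample, while within `ns_a` the two draws are guaranteed to be distinct, independent samples.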

Reproducibility

Namespaces track which cached responses have been consumed, so re-running an experiment from scratch replays the exact same results in the exact same order — even for duplicate prompts.

# Run 1 — calls the API
results_run1 = await classify(sentences, namespace="experiment_v1")

# Run 2 — new namespace, identical results from cache
results_run2 = await classify(sentences, namespace="experiment_v2")
assert results_run1 == results_run2  # Always true

Error Recovery

Crash on item 7 of 10? Re-run and items 1–6 are served from cache instantly. Only items 7–10 hit the API.

# Attempt 1 — crashes at item 7
try:
    results = await process(items, namespace="my_exp")
except RuntimeError:
    pass  # Items 1-6 are cached

# Attempt 2 — items 1-6 from cache, only 7-10 call API
results = await process(items, namespace="my_exp")

Async Parallelism

Fully async-native. Use asyncio.gather for concurrent requests:

results = await asyncio.gather(*[
    client.chat.completions.create(
        model="gpt-4.1-nano",
        messages=[{"role": "user", "content": prompt}],
    )
    for prompt in prompts
])

Deterministic Async Ordering

When multiple async agents process the same prompt concurrently, Cache Saver caches by request id — not request or completion order. A built-in reordering module ensures replays are deterministic regardless of which task finishes first.
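
The idea behind the reordering can be illustrated with plain asyncio (an illustrative sketch, not Cache Saver's Reorderer): tag each request with a stable id at dispatch time, let tasks finish in any order, then sort results by id before returning them:

```python
import asyncio
import random

async def fake_model(request_id: int, prompt: str) -> tuple[int, str]:
    # Simulate nondeterministic completion order with a random delay.
    await asyncio.sleep(random.random() * 0.01)
    return request_id, f"response to {prompt}"

async def run(prompts: list[str]) -> list[str]:
    # Tag each request with a stable id at dispatch time ...
    coros = [fake_model(i, p) for i, p in enumerate(prompts)]
    # ... collect results in whatever order tasks happen to finish ...
    finished = [await fut for fut in asyncio.as_completed(coros)]
    # ... then restore the original request order by stable id.
    finished.sort(key=lambda pair: pair[0])
    return [response for _, response in finished]

prompts = ["a", "b", "c", "d"]
results = asyncio.run(run(prompts))
# results is ordered by request id, no matter which task finished first
```

Because the final order depends only on the ids assigned at dispatch, a replay of the same requests produces the same ordering every time.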

Why It Works: Reuse Potential in LLM Reasoning

Multi-step reasoning strategies (Tree-of-Thought, ReAct, RAP, FoA, ReST-MCTS*, etc.) are highly repetitive — ~50% of prompts are duplicates both within a single method execution and across methods on the same task. Cache Saver exploits this redundancy across three practical scenarios:

[Figure] Practical application results across cost, tokens, latency, and throughput: three practical scenarios using GPT-4.1-Nano on the Game of 24, HumanEval, and SciBench benchmarks.

The figure shows Cache Saver's impact across three practical ML scenarios. A1-Hyperparameter tuning: grid search over Tree-of-Thought configurations (tree width, depth, number of evaluations). A2-Ablation analysis: testing three variations of the FoA algorithm (removing the selection phase, backtracking, or resampling). A3-Benchmarking: comparing entirely different reasoning strategies (ToT, GoT, FoA).

The blue bars show the cost without Cache Saver. The orange bars show the average cost with Cache Saver: because the experiments share prompts, cached responses are reused and the average cost drops significantly. The green bars show the marginal cost, i.e., the added cost of incorporating one more configuration, variation, or method into the experiment.

The reuse potential depends on how similar the experiments are: hyperparameter tuning (A1) achieves the highest savings (6x lower cost, token usage, and latency), since different configurations of the same method share most prompts. Ablation analysis (A2) achieves 2.5x savings. Benchmarking across different methods (A3) still achieves 2x savings, a notable finding since even structurally different reasoning strategies share significant prompt overlap. These savings come on top of existing platform-level optimizations (paged attention, KV caching, prefix sharing, etc.).

Architecture

Cache Saver composes four async pipeline components around your model:

Component      Role
Cacher         Namespace-aware list-valued cache with per-key async mutexes. Tracks per-namespace usage counts for i.i.d. sampling.
Deduplicator   Merges duplicate prompts within a batch by (hash, namespace), combines n values, redistributes responses.
Reorderer      Sorts by stable identifier before processing, restores original order after. Ensures deterministic results.
Batcher        Async producer-consumer queue. Groups requests by batch_size with a timeout.
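
To make the Deduplicator's role concrete, here is a toy sketch (function names and data shapes are invented for illustration, not Cache Saver's internals): duplicate requests in a batch are merged by (prompt, namespace), their requested sample counts n are summed, and the merged response list is later split back to the original requests:

```python
from collections import defaultdict

def deduplicate(batch):
    """Merge duplicate requests by (prompt, namespace), summing their n."""
    merged = {}                 # (prompt, namespace) -> total n to request
    slots = defaultdict(list)   # (prompt, namespace) -> [(batch index, n), ...]
    for i, (prompt, namespace, n) in enumerate(batch):
        key = (prompt, namespace)
        merged[key] = merged.get(key, 0) + n
        slots[key].append((i, n))
    return merged, slots

def redistribute(merged_responses, slots, batch_len):
    """Split each merged response list back to the original requests."""
    out = [None] * batch_len
    for key, responses in merged_responses.items():
        cursor = 0
        for i, n in slots[key]:
            out[i] = responses[cursor:cursor + n]
            cursor += n
    return out

batch = [("p1", "ns", 2), ("p2", "ns", 1), ("p1", "ns", 1)]
merged, slots = deduplicate(batch)
# "p1" is requested twice, so only one merged call for 3 samples is needed.
# Pretend the model returned 3 samples for p1 and 1 for p2:
responses = {("p1", "ns"): ["r1", "r2", "r3"], ("p2", "ns"): ["r4"]}
out = redistribute(responses, slots, len(batch))
```

Each original request gets back exactly the number of samples it asked for, while the model sees only one request per distinct (prompt, namespace) pair.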

Local Model Inference

For HuggingFace Transformers models running on local GPUs:

from cachesaver.models.transformers import AsyncHFTransformers

client = AsyncHFTransformers(
    model_name="meta-llama/Llama-3.2-1B-Instruct",
    namespace="local_exp",
    cachedir="./cache",
    batch_size=8,
)

response = await client.chat.completions.create(
    messages=[{"role": "user", "content": "What is the capital of France?"}],
    max_new_tokens=20,
)

Examples

See the examples/ directory:

  • tutorial.ipynb — Full walkthrough: quickstart, reproducibility, error recovery, parallelism, ReAct agents, Tree-of-Thought, and RAG pipelines.
  • providers_example.ipynb — Usage examples for all supported providers.

Requirements

  • Python >= 3.10

Citation

@inproceedings{potamitis2025cache,
  title     = {Cache Saver: A Modular Framework for Efficient, Affordable, and Reproducible {LLM} Inference},
  author    = {Nearchos Potamitis and Lars Henning Klein and Bardia Mohammadi and Chongyang Xu and Attreyee Mukherjee and Niket Tandon and Laurent Bindschaedler and Akhil Arora},
  booktitle = {The 2025 Conference on Empirical Methods in Natural Language Processing},
  year      = {2025},
  url       = {https://openreview.net/forum?id=2Nxih3ySSi}
}

