cachelm 🌟
Your Smart Caching Layer for LLM Applications
Stop wasting money on redundant API calls. cachelm intelligently caches LLM responses by understanding the meaning of queries, not just the exact words. Slash costs, accelerate response times, and build smarter, faster AI products.
The Problem: Repetitive Queries are Expensive
LLMs are powerful but costly. Users often ask the same question in slightly different ways, leading to identical, expensive API calls that traditional key-value caches can't detect.
"Explain quantum computing" vs. "Break down quantum computing basics"
A traditional cache sees two different requests. cachelm sees one.
By understanding semantic intent, cachelm serves a cached response for the second query, saving you money and delivering the answer in milliseconds.
Why Use cachelm?
| Feature | Benefit |
|---|---|
| Semantic Caching | Intelligently handles paraphrased queries to maximize cache hits. |
| Cost & Latency Reduction | Cut LLM API costs by 20-40% and slash response times from seconds to milliseconds. |
| Seamless Integration | Drop it into your existing `openai` client code with just a few lines. No major refactoring needed. |
| Pluggable Architecture | Modular design lets you easily swap vector databases (Chroma, Redis, etc.) and models. |
| Streaming Support | Full, out-of-the-box compatibility with streaming chat completions. |
| Production-Ready | Battle-tested and built for scale with enterprise-grade integrations. |
Perfect For:
- High-traffic LLM applications where API costs are a concern.
- Real-time chatbots and virtual assistants that require instant responses.
- Cost-sensitive production deployments and internal tools.
How It Works
cachelm intercepts your LLM API calls and adds a smart caching layer.
1. Intercept: A user sends a prompt through the cachelm-enhanced client.
2. Vectorize: The prompt is converted into a numerical representation (an embedding) that captures its semantic meaning.
3. Search: cachelm searches your vector database for a similar, previously cached prompt.
4. Decision:
   - Cache Hit: If a semantically similar prompt is found within a configurable threshold, the cached response is returned instantly. ⚡
   - Cache Miss: If no match is found, the request is sent to the LLM provider (e.g., OpenAI). The new response is then vectorized and stored in the cache for future use.
🛠️ Quick Start
1. Installation
Install cachelm with the default dependencies (ChromaDB & FastEmbed):
```shell
pip install "cachelm[chroma,fastembed]"
```
2. Basic Usage
Enhance your OpenAI client with caching in just a few lines.
```python
from openai import OpenAI
from cachelm import OpenAIAdaptor, ChromaDatabase, FastEmbedVectorizer

# 1. Initialize the caching components
# By default, ChromaDatabase runs in-memory.
database = ChromaDatabase(
    vectorizer=FastEmbedVectorizer(),
    distance_threshold=0.1,  # Lower = stricter matching; higher = looser matching
)

# 2. Create the adaptor and get your enhanced client
client = OpenAI(api_key="sk-...")  # Replace with your actual OpenAI API key
adaptor = OpenAIAdaptor(module=client, database=database)
smart_client = adaptor.get_adapted()

# 3. Use the client as you normally would!
print("--- First call (will be slow and hit the API) ---")
response = smart_client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[{"role": "user", "content": "Explain the basics of quantum computing in simple terms."}],
)
print(response.choices[0].message.content)

print("\n--- Second call (will be fast and served from cache) ---")
# This query is phrased differently but has the same meaning
cached_response = smart_client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[{"role": "user", "content": "Could you break down how quantum computers work?"}],
)
print(cached_response.choices[0].message.content)

# The 'x-cachelm-status' header confirms if the response was a cache HIT or MISS
print(f"\nCache status: {cached_response.headers.get('x-cachelm-status')}")
```
Customizing Your Cache (e.g., Persistent Storage)
To persist your cache across application restarts, configure your database.
```python
import chromadb

# Configure ChromaDB to save data to disk
persistent_settings = chromadb.config.Settings(
    is_persistent=True,
    persist_directory="./my_llm_cache",  # Directory to store the database
)

database = ChromaDatabase(
    vectorizer=FastEmbedVectorizer(),
    chromaSettings=persistent_settings,
    distance_threshold=0.1,
)

# The rest of your setup remains the same!
# adaptor = OpenAIAdaptor(...)
```
Middleware: Customize Caching Behavior
The middleware system lets you hook into the caching process to modify or filter data. This is perfect for handling variable data (like names or IDs) and protecting sensitive information.
Key hooks:
- `pre_cache_save`: Runs before a new response is saved to the cache.
- `post_cache_retrieval`: Runs after a response is retrieved from the cache.
Example: Normalizing Data with Replacer
Imagine your prompts contain usernames that change but the core question is the same. The Replacer middleware substitutes placeholders to ensure these queries result in a cache hit.
```python
from cachelm.middlewares.replacer import Replacer, Replacement

# Define replacements: "Anmol" will be treated as {{name}} for caching
replacements = [
    Replacement(key="{{name}}", value="Anmol"),
    Replacement(key="{{user_id}}", value="user_12345"),
]

adaptor = OpenAIAdaptor(
    ...,
    middlewares=[Replacer(replacements)],
)

# Now these two queries will map to the same cache entry:
# 1. "My name is Anmol. What's my order status for ID user_12345?"
# 2. "My name is Bob. What's my order status for ID user_67890?"
#    (if Bob and user_67890 are also in replacements)
```
Before caching, "Anmol" becomes {{name}}. After retrieval, {{name}} is changed back to "Anmol". This dramatically improves cache hits for template-like queries.
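The substitute-then-restore idea is simple enough to sketch standalone. This is an illustrative sketch only, assuming nothing about cachelm's internals beyond the behavior described above; the `Replacement` dataclass mirrors the fields used in the example, while `normalize` and `denormalize` are hypothetical helper names standing in for the `pre_cache_save` and `post_cache_retrieval` hooks.

```python
from dataclasses import dataclass


@dataclass
class Replacement:
    key: str    # placeholder written into the cache, e.g. "{{name}}"
    value: str  # concrete value seen in live traffic, e.g. "Anmol"


def normalize(prompt: str, replacements: list[Replacement]) -> str:
    """Before caching: swap concrete values for placeholders so that
    prompts differing only in those values collide on one cache entry."""
    for r in replacements:
        prompt = prompt.replace(r.value, r.key)
    return prompt


def denormalize(response: str, replacements: list[Replacement]) -> str:
    """After retrieval: restore the concrete values for the current user."""
    for r in replacements:
        response = response.replace(r.key, r.value)
    return response
```

Run through the earlier example, `normalize("My name is Anmol...", replacements)` yields `"My name is {{name}}..."`, so Bob's equivalent query (with Bob's own replacements) normalizes to the identical string and hits the same entry before `denormalize` personalizes the cached response again.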
Supported Integrations & Installation
cachelm is designed to be modular. Install only what you need.
| Category | Technology | pip install "cachelm[...]" |
|---|---|---|
| Databases | ChromaDB | `[chroma]` |
| | Redis | `[redis]` |
| | ClickHouse | `[clickhouse]` |
| | Qdrant | `[qdrant]` |
| Vectorizers | FastEmbed | `[fastembed]` |
| | RedisVL | `[redis]` |
| | Text2Vec-Chroma | `[chroma]` |
| LLMs | OpenAI | (Included by default) |
More integrations for providers like Anthropic and Cohere are coming soon!
Enterprise & High-Performance Setups
cachelm is ready for demanding production environments.
Redis + RedisVL for High Throughput
```python
from cachelm.databases.redis import RedisDatabase
from cachelm.vectorizers.redisvl import RedisVLVectorizer

# Assumes you have a Redis instance with the RediSearch module
database = RedisDatabase(
    vectorizer=RedisVLVectorizer(model="sentence-transformers/all-MiniLM-L6-v2"),
    redis_url="redis://localhost:6379",
    index_name="llm_cache_prod",
)
```
ClickHouse for Cloud-Scale Analytics
```python
from cachelm.databases.clickhouse import ClickHouse
from cachelm.vectorizers.fastembed import FastEmbedVectorizer

# Connect to a self-hosted or ClickHouse Cloud instance
database = ClickHouse(
    vectorizer=FastEmbedVectorizer(),
    host="your.clickhouse.cloud.host",
    port=8443,
    username="default",
    password="your-password",
)
```
Extending cachelm & Contributing
We welcome contributions! The modular design makes it easy to add new components.
1. Add a New Vectorizer
Implement the Vectorizer interface to support a new embedding model.
```python
from cachelm.vectorizers.vectorizer import Vectorizer

class MyVectorizer(Vectorizer):
    def embed(self, text: str) -> list[float]:
        return my_embedding_model.encode(text).tolist()

    def embed_many(self, texts: list[str]) -> list[list[float]]:
        return my_embedding_model.encode(texts).tolist()
```
2. Add a New Vector Database
Implement the Database interface to connect to a different vector store.
```python
from cachelm.databases.database import Database
from cachelm.types.chat_history import Message

class MyDatabase(Database):
    def find(self, history: list[Message], distance_threshold=0.1) -> Message | None:
        # Your logic to search for a similar history vector
        ...

    def write(self, history: list[Message], response: Message):
        # Your logic to store the history vector and response
        ...

    # ... also implement connect() and disconnect()
```
See our Contribution Guide to get started. We're excited to see what you build!
License
cachelm is licensed under the MIT License. It is free for both personal and commercial use.
Ready to Accelerate Your LLM Workloads? Report an Issue | View the Source