LMCache: prefill your long contexts only once
Project description
💡 What is LMCache?
LMCache lets LLMs prefill each text only once. By storing the KV caches of all reusable texts, LMCache can reuse the KV cache of any previously seen text (not necessarily a prefix) in any serving engine instance. It thus reduces prefill delay, i.e., time to first token (TTFT), and saves precious GPU cycles.
By combining LMCache with vLLM, LMCache achieves 3-10x delay savings and GPU cycle reduction in many LLM use cases, including multi-round QA and RAG.
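To make the idea concrete, here is a toy sketch (not the LMCache API; `prefill` is a stand-in) of keying KV caches by text chunk so that any previously seen chunk, prefix or not, skips prefill:

```python
# Toy illustration only -- not the LMCache API. The idea: key KV caches by
# text chunk so any previously seen chunk (not just a shared prefix) can
# skip the expensive prefill step.
import hashlib

kv_store = {}   # chunk hash -> (mock) KV cache

def prefill(chunk: str) -> str:
    """Stand-in for the expensive GPU prefill of one text chunk."""
    return f"kv[{chunk[:12]}...]"   # a real engine would store K/V tensors

def kv_for_chunks(chunks):
    caches = []
    for chunk in chunks:
        key = hashlib.sha256(chunk.encode()).hexdigest()
        if key not in kv_store:          # first occurrence: pay the prefill cost
            kv_store[key] = prefill(chunk)
        caches.append(kv_store[key])     # reused chunks skip prefill
    return caches
```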
Try LMCache with pre-built vLLM Docker images here.
🚀 Performance snapshot
💻 Quickstart
We provide a Docker-based quickstart demo in the folder examples/. This quickstart lets you start a serving engine (vLLM) with LMCache and then query the serving engine with a long context.
- Prerequisites
First, clone and cd into the LMCache repo with
git clone https://github.com/LMCache/LMCache && cd LMCache
To run the quickstart demo, your server needs one GPU and a Docker environment with the NVIDIA container runtime (nvidia-runtime) installed.
You may need sudo access to run Docker, depending on your server configuration.
This demo uses ports 8000 (for vLLM) and 8501 (for the frontend).
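(Optional) If you want to verify that those two ports are free before starting, a quick check like the one below works; it only assumes the port numbers listed above:

```python
# Optional pre-flight check: confirm the demo's ports (8000 for vLLM,
# 8501 for the frontend) are not already in use on this machine.
import socket

for port in (8000, 8501):
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
        in_use = s.connect_ex(("127.0.0.1", port)) == 0
        print(f"port {port}: {'in use' if in_use else 'free'}")
```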
- Start the serving engine with LMCache
Start the Docker-based serving engine with:
bash examples/quickstart.sh
The vLLM serving engine is ready once its startup lines appear in the log.
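If you prefer a programmatic readiness check over watching the log, one option is to poll the server's OpenAI-compatible /v1/models endpoint. This is a sketch that assumes the quickstart exposes vLLM's OpenAI-compatible API on port 8000, as noted above:

```python
# Readiness probe sketch: polls the OpenAI-compatible endpoint that the
# quickstart's vLLM server is expected to expose on port 8000.
import time
import requests

def wait_for_vllm(url="http://localhost:8000/v1/models", timeout_s=600):
    deadline = time.time() + timeout_s
    while time.time() < deadline:
        try:
            if requests.get(url, timeout=5).status_code == 200:
                print("vLLM server is ready")
                return True
        except requests.RequestException:
            pass  # server is still starting up
        time.sleep(5)
    return False

if __name__ == "__main__":
    wait_for_vllm()
```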
- Start the frontend
The quickstart comes with a frontend. To run the frontend, use:
pip install openai streamlit
streamlit run examples/quickstart-frontend.py
You should be able to access the frontend from your browser at http://<your server's IP>:8501
The first query has a long TTFT because the server needs to prefill the long context. But once the first query finishes, the TTFT of all subsequent queries will be much lower, as LMCache shares the KV cache with vLLM, which can then skip prefilling the long context.
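You can also observe this effect without the frontend by timing queries against the server directly. The sketch below assumes the quickstart serves an OpenAI-compatible API on port 8000; the model name and the long-context file are placeholders to adapt to your setup:

```python
# Minimal TTFT comparison sketch. Assumptions: OpenAI-compatible API on
# port 8000; "YOUR_MODEL_NAME" and the document path are placeholders.
import time
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="dummy")
long_context = open("my_long_document.txt").read()    # placeholder document

def time_to_first_token(prompt: str) -> float:
    start = time.time()
    stream = client.chat.completions.create(
        model="YOUR_MODEL_NAME",                       # placeholder
        messages=[{"role": "user", "content": prompt}],
        stream=True,
    )
    next(iter(stream))          # first streamed chunk ~ first token
    return time.time() - start

# The first query pays the full prefill; later queries over the same long
# context should show a much lower TTFT once the KV cache is reused.
print("TTFT #1:", time_to_first_token(long_context + "\n\nSummarize the document."))
print("TTFT #2:", time_to_first_token(long_context + "\n\nWhat is its main conclusion?"))
```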
- What's next
We provide multiple demos in the 🔗LMCache-demos repo. The demos cover the following use cases:
- Share KV caches across multiple serving engines (🔗link)
- Loading non-prefix KV caches for RAG (🔗link)
🛣️ Project Milestones
- First release of LMCache
- Support installation through pip install
- Integration with latest vLLM
📖 Blogs and papers
LMCache is built on two key techniques:
- CacheGen [SIGCOMM'24]: A KV-cache compression system that encodes KV caches into compact bitstreams.
- CacheBlend [EuroSys'25]: A KV-cache blending system that dynamically composes new KV caches from smaller ones.
Please read our blog posts for more details.
Project details
Download files
Source Distributions
Built Distribution
File details
Details for the file lmcache-0.1.2-py3-none-any.whl.
File metadata
- Download URL: lmcache-0.1.2-py3-none-any.whl
- Upload date:
- Size: 47.3 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/5.1.1 CPython/3.10.14
File hashes
Algorithm | Hash digest
---|---
SHA256 | 2946fafa950100c70719b9c4ae478dff074e5ac920919f050eb103ce010bc846
MD5 | 0e2c0824264c7c346f073c182262811f
BLAKE2b-256 | 29c3dba4665b3bf5b28cc8e73e9884998cc091aa518043e44a97403c856b9839