
LMCache: prefill your long contexts only once

Project description


💡 What is LMCache?

LMCache lets LLMs prefill each text only once. By storing the KV caches of all reusable texts, LMCache can reuse the KV cache of any repeated text (not necessarily a prefix) in any serving engine instance. This reduces prefill delay, i.e., time to first token (TTFT), and saves precious GPU cycles.
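
Conceptually, LMCache behaves like a store keyed by chunks of tokens whose values are those chunks' KV caches, so a serving engine can fetch a reused chunk's cache instead of recomputing it. The toy sketch below only illustrates that idea; the class and method names are hypothetical and are not the LMCache API.

import hashlib

# Toy illustration (not the LMCache API): map a chunk of token IDs to its
# precomputed KV cache, so repeated text -- even in the middle of a prompt --
# can skip prefill.
class ToyKVStore:
    def __init__(self):
        self._cache = {}  # chunk fingerprint -> KV tensors

    @staticmethod
    def _key(token_ids):
        return hashlib.sha256(str(token_ids).encode()).hexdigest()

    def put(self, token_ids, kv_tensors):
        self._cache[self._key(token_ids)] = kv_tensors

    def get(self, token_ids):
        # Returns the stored KV cache, or None if this chunk must be prefilled.
        return self._cache.get(self._key(token_ids))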

By combining LMCache with vLLM, LMCache achieves 3-10x savings in TTFT and GPU cycles in many LLM use cases, including multi-round QA and RAG.

Try LMCache with the pre-built vLLM Docker images here.

🚀 Performance snapshot

[Figure: performance snapshot]

💻 Quickstart

We provide a Docker-based quickstart demo in the examples/ folder. The quickstart starts a serving engine (vLLM) with LMCache and then queries it with a long context.

- Prerequisites

First, clone and cd into the LMCache repo with

git clone https://github.com/LMCache/LMCache && cd LMCache

To run the quickstart demo, your server needs one GPU and a Docker environment with the NVIDIA container runtime (nvidia-runtime) installed.

You may need sudo access to run Docker, depending on your server configuration.

The demo uses ports 8000 (for vLLM) and 8501 (for the frontend).
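
A quick way to sanity-check these prerequisites from Python (a sketch: it assumes PyTorch is installed on the host and only checks GPU visibility, Docker availability, and whether the two ports are free):

import shutil
import socket

import torch  # assumption: PyTorch is installed on the host

assert shutil.which("docker") is not None, "Docker is not on PATH"
assert torch.cuda.device_count() >= 1, "no CUDA GPU visible"

# Warn if the ports the demo needs are already taken.
for port in (8000, 8501):
    with socket.socket() as s:
        if s.connect_ex(("127.0.0.1", port)) == 0:
            print(f"warning: port {port} is already in use")
print("prerequisites look OK")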

- Start the serving engine with LMCache

Start the Docker-based serving engine with:

bash examples/quickstart.sh

The vLLM serving engine is ready once the corresponding startup lines appear in its log.
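
If you prefer to script the wait instead of watching the log, you can poll the OpenAI-compatible endpoint that vLLM exposes (a sketch; it assumes the demo serves on localhost:8000):

import time
import urllib.request

# Poll the vLLM OpenAI-compatible server until it answers (assumes port 8000).
while True:
    try:
        with urllib.request.urlopen("http://localhost:8000/v1/models", timeout=2) as resp:
            if resp.status == 200:
                print("vLLM serving engine is ready")
                break
    except OSError:
        pass
    time.sleep(2)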

- Start the frontend

The quickstart comes with a frontend. To run the frontend, use:

pip install openai streamlit
streamlit run examples/quickstart-frontend.py

You should be able to access the frontend from your browser at http://<your server's IP>:8501

The first query has a long TTFT because the server needs to prefill the long context. Once the first query finishes, however, the TTFT of all subsequent queries is much lower, because LMCache shares the stored KV cache with vLLM, which can then skip prefilling the long context.
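
You can also observe this effect without the frontend by timing the first streamed token with the openai client installed above (a sketch; the base URL matches the demo's port 8000, while the model name and the long context are placeholders you would replace):

import time
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")
long_context = "<paste your long document here>"  # placeholder

def ttft(prompt):
    # Time until the first streamed chunk arrives.
    start = time.time()
    stream = client.completions.create(
        model="<served-model-name>",  # placeholder: the model the demo serves
        prompt=prompt,
        max_tokens=32,
        stream=True,
    )
    next(iter(stream))
    return time.time() - start

question = long_context + "\n\nQuestion: summarize the document."
print("first query TTFT: %.2fs" % ttft(question))   # pays the prefill cost
print("second query TTFT: %.2fs" % ttft(question))  # reuses the KV cache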

- What's next

We provide multiple demos in the 🔗LMCache-demos repo. The demos cover the following use cases:

  • Share KV caches across multiple serving engines (🔗link)
  • Loading non-prefix KV caches for RAG (🔗link)

🛣️ Project Milestones

  • First release of LMCache
  • Support installation through pip install
  • Integration with latest vLLM

📖 Blogs and papers

LMCache is built on two key techniques:

  1. CacheGen [SIGCOMM'24]: A KV-cache compression system that encodes KV caches into compact bitstreams.
  2. CacheBlend [EuroSys'25]: A KV-cache blending system that dynamically composes new KV caches from smaller ones.
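
As a rough illustration of the CacheGen idea only (a toy sketch, not the codec described in the paper), a KV tensor can be quantized and packed into a much smaller bitstream for storage or transfer:

import numpy as np

# Toy sketch of CacheGen-style compression: quantize a float16 KV tensor
# to int8 plus a scale factor, then pack it into bytes.
def encode(kv: np.ndarray) -> tuple[bytes, float]:
    scale = max(float(np.abs(kv).max()) / 127.0, 1e-8)
    return (kv / scale).round().astype(np.int8).tobytes(), scale

def decode(blob: bytes, scale: float, shape) -> np.ndarray:
    return np.frombuffer(blob, dtype=np.int8).reshape(shape).astype(np.float16) * scale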

Please read our blog posts for more details.

Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distributions

No source distribution files are available for this release. See the tutorial on generating distribution archives.

Built Distribution

lmcache-0.1.2-py3-none-any.whl (47.3 kB)

Uploaded for Python 3

File details

Details for the file lmcache-0.1.2-py3-none-any.whl.

File metadata

  • Download URL: lmcache-0.1.2-py3-none-any.whl
  • Upload date:
  • Size: 47.3 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/5.1.1 CPython/3.10.14

File hashes

Hashes for lmcache-0.1.2-py3-none-any.whl:

  • SHA256: 2946fafa950100c70719b9c4ae478dff074e5ac920919f050eb103ce010bc846
  • MD5: 0e2c0824264c7c346f073c182262811f
  • BLAKE2b-256: 29c3dba4665b3bf5b28cc8e73e9884998cc091aa518043e44a97403c856b9839

See more details on using hashes here.
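
To check a manually downloaded wheel against the SHA256 digest above (standard-library only):

import hashlib

expected = "2946fafa950100c70719b9c4ae478dff074e5ac920919f050eb103ce010bc846"
with open("lmcache-0.1.2-py3-none-any.whl", "rb") as f:
    digest = hashlib.sha256(f.read()).hexdigest()
print("SHA256 matches:", digest == expected)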
