
LMCache: prefill your long contexts only once

💡 What is LMCache?

LMCache lets LLMs prefill each text only once. It stores the KV caches of all reusable texts, so any serving engine instance can reuse the KV cache of a text that appears again, even when that text is not a prefix. This reduces prefill delay, i.e., time to first token (TTFT), and saves GPU cycles.
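The idea can be illustrated with a toy sketch. It is not LMCache's implementation: `ToyChunkCache` and its methods are hypothetical names, the "KV cache" is a placeholder string, and the sketch ignores how cached chunks are combined into one context. It only shows that a chunk seen before, in any position, does not need to be prefilled again.

```python
import hashlib


class ToyChunkCache:
    """Toy stand-in for a KV-cache store, keyed by a hash of the text chunk."""

    def __init__(self):
        self._store = {}
        self.prefill_calls = 0

    def _prefill(self, chunk):
        # Placeholder for the expensive GPU prefill; it only counts calls.
        self.prefill_calls += 1
        return f"kv({len(chunk)} chars)"

    def get_or_prefill(self, chunk):
        key = hashlib.sha256(chunk.encode("utf-8")).hexdigest()
        if key not in self._store:
            self._store[key] = self._prefill(chunk)
        return self._store[key]


cache = ToyChunkCache()
doc_a = "Document A: long retrieved passage ..."
doc_b = "Document B: another long passage ..."
doc_c = "Document C: a new passage ..."

# First request: both documents are new, so both are prefilled.
for chunk in (doc_a, doc_b):
    cache.get_or_prefill(chunk)

# Second request: doc_b reappears in a non-prefix position and is a cache hit.
for chunk in (doc_c, doc_b):
    cache.get_or_prefill(chunk)

print(cache.prefill_calls)  # 3 prefills for 4 chunk lookups
```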

Combined with vLLM, LMCache achieves 3-10x delay savings and GPU cycle reduction in many LLM use cases, including multi-round QA and RAG.

Try LMCache with pre-built vLLM Docker images here.

🚀 Performance snapshot

[Figure: performance snapshot]

💻 Quickstart

LMCache integrates with the latest vLLM (0.6.2). To install LMCache, run:

# requires python >= 3.10 and nvcc >= 12.1
pip install lmcache lmcache_vllm

LMCache has the same interface as vLLM for both online serving and offline inference. For online serving, start an OpenAI API-compatible vLLM server with LMCache via:

lmcache_vllm serve lmsys/longchat-7b-16k --gpu-memory-utilization 0.8
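Once the server is up, it can be queried like any OpenAI API-compatible endpoint. The sketch below uses only the Python standard library. It assumes the server listens on `http://localhost:8000` (vLLM's default port) and exposes `/v1/completions`; the helper names `build_completion_request` and `complete` are illustrative and not part of LMCache.

```python
import json
import urllib.request


def build_completion_request(base_url, model, prompt, max_tokens=32):
    """Build a POST request for the OpenAI-compatible /v1/completions endpoint."""
    payload = {"model": model, "prompt": prompt, "max_tokens": max_tokens}
    return urllib.request.Request(
        url=f"{base_url}/v1/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def complete(base_url, model, prompt, max_tokens=32):
    """Send the request and return the generated text of the first choice."""
    request = build_completion_request(base_url, model, prompt, max_tokens)
    with urllib.request.urlopen(request) as response:
        body = json.load(response)
    return body["choices"][0]["text"]


# Assumed address; adjust if the server was started with a different --port.
request = build_completion_request(
    "http://localhost:8000", "lmsys/longchat-7b-16k", "Hello, my name is"
)
print(request.full_url)
```

With the server running, `complete("http://localhost:8000", "lmsys/longchat-7b-16k", "Hello, my name is")` returns the generated text.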

To use vLLM's offline inference with LMCache, prefix the imports of vLLM components with lmcache_vllm. For example:

import lmcache_vllm.vllm as vllm
from lmcache_vllm.vllm import LLM 
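A fuller offline example might look like the sketch below. It assumes that `lmcache_vllm.vllm` re-exports `SamplingParams` alongside `LLM`, as the "same interface as vLLM" statement suggests; the helper names and the prompt template are illustrative. The prompts share one long context, which is the case where KV cache reuse is expected to help.

```python
def build_prompts(shared_context, questions):
    """Build prompts that all start with the same long, reusable context."""
    return [f"{shared_context}\n\nQuestion: {q}\nAnswer:" for q in questions]


def run_offline(prompts, model="lmsys/longchat-7b-16k"):
    """Generate one completion per prompt through the LMCache-wrapped vLLM."""
    # Imported lazily so the helper above works without a GPU installation.
    # Assumption: SamplingParams is importable from lmcache_vllm.vllm.
    from lmcache_vllm.vllm import LLM, SamplingParams

    llm = LLM(model=model, gpu_memory_utilization=0.8)
    params = SamplingParams(temperature=0.0, max_tokens=64)
    outputs = llm.generate(prompts, params)
    return [output.outputs[0].text for output in outputs]


prompts = build_prompts(
    "<long document text>",
    ["What is the main claim?", "Which datasets are used?"],
)
```

Calling `run_offline(prompts)` on a machine with LMCache and a GPU runs the generation.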

More detailed documentation will be available soon.

- Sharing KV cache across multiple vLLM instances

LMCache supports sharing KV caches across different vLLM instances through the lmcache.server module. Here is a quick guide:

# Start lmcache server
lmcache_server localhost 65432

Then, start two vLLM instances with the LMCache config file:

wget https://raw.githubusercontent.com/LMCache/LMCache/refs/heads/dev/examples/example.yaml

# start the first vLLM instance
LMCACHE_CONFIG_FILE=example.yaml CUDA_VISIBLE_DEVICES=0 lmcache_vllm serve lmsys/longchat-7b-16k --gpu-memory-utilization 0.8 --port 8000

# start the second vLLM instance
LMCACHE_CONFIG_FILE=example.yaml CUDA_VISIBLE_DEVICES=1 lmcache_vllm serve lmsys/longchat-7b-16k --gpu-memory-utilization 0.8 --port 8001
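One rough way to check that the two instances share KV caches is to send the same long prompt to each in turn and compare latencies. The sketch below requests a single token so that the request latency approximates TTFT. It assumes both servers expose `/v1/completions` on the ports used above; the function names are illustrative, and actual timings depend on the hardware and on the settings in example.yaml.

```python
import json
import time
import urllib.request


def timed_completion(base_url, model, prompt):
    """Return the seconds taken to generate one token (a rough TTFT proxy)."""
    payload = {"model": model, "prompt": prompt, "max_tokens": 1}
    request = urllib.request.Request(
        url=f"{base_url}/v1/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    start = time.perf_counter()
    with urllib.request.urlopen(request) as response:
        response.read()
    return time.perf_counter() - start


def compare_instances(
    prompt,
    model="lmsys/longchat-7b-16k",
    urls=("http://localhost:8000", "http://localhost:8001"),
    timer=timed_completion,
):
    """Send the same prompt to each instance in order and record the latency."""
    return {url: timer(url, model, prompt) for url in urls}
```

With both instances running, `compare_instances(long_prompt)` queries port 8000 first and port 8001 second. If the second instance can load the KV cache stored by the first, its latency should be lower than that of a cold prefill.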

- What's next

We also provide multiple Docker-based demos in the 🔗LMCache-demos repo. The demos cover the following use cases:

  • Sharing KV caches across multiple serving engines (🔗link)
  • Loading non-prefix KV caches for RAG (🔗link)

🛣️ Incoming Milestones

  • First release of LMCache
  • Support installation through pip install and integrate with latest vLLM
  • Stable support for non-prefix KV caches
  • User and developer documentation

📖 Blogs and papers

LMCache is built on two key techniques:

  1. CacheGen [SIGCOMM'24]: A KV-cache compression system that encodes KV caches into compact bitstreams.
  2. CacheBlend [EuroSys'25]: A KV-cache blending system that dynamically composes new KV caches from smaller ones.

Please read our blog posts for more details.
