Skip to main content

A LLM serving engine extension to reduce TTFT and increase throughput, especially under long-context scenarios.

Project description

lmcache logo

A KV Cache Management Layer for Scalable LLM Inference


Blog | Documentation | Join Slack | Community Meeting | Roadmap

PyPI PyPI - Downloads GitHub commit activity Ask DeepWiki

Updates

  • [2026/05] 🔥 Agentic workload benchmark on AMD MI300X (blog).
  • [2026/04] 🔥 LMCache's new multiprocess(MP) architecture release (blog).
  • [2026/03] LMCache at GTC 2026 (post).
  • [2026/01] LMCache multi-node P2P CPU memory sharing, from experimental feature to production (blog).
More
  • [2025/11] LMCache x CoreWeave accelerate efficient LLM inference for Cohere (blog).
  • [2025/10] LMCache joins the PyTorch Foundation and Tensormesh unveiled (blog, PyTorch).
  • [2025/09] NVIDIA Dynamo integrates LMCache, accelerating LLM inference (blog).
  • [2025/08] 🎉 LMCache hits 5,000+ GitHub stars (blog).
  • [2025/08] LMCache supports gpt-oss (20B/120B) on day 1 (blog).
  • [2025/07] Get faster LLM inference and cheaper responses with LMCache and Redis (Redis blog).
  • [2025/07] LMCache extends its turbo-boost to multimodal models in vLLM V1 (blog).
  • [2025/06] LLM Production Stack goes cross-hardware: AMD, Arm and Ascend (blog).

About

LMCache is a KV cache management layer for LLM inference. It turns KV cache from a temporary state into reusable AI-native knowledge that can be stored persistently, reused across multiple serving engines, monitored with an observability stack, and transformed for better generation quality. As a result, LMCache reduces TTFT (time-to-first-token) and improves throughput, especially for long-context agentic, multi-turn conversation, and knowledge-augmented workloads (e.g., RAG).

LMCache is vendor-neutral. It can be used as a KV cache layer for a range of mainstream open-source serving engines, inference frameworks, hardware vendors, storage systems, and infrastructure providers. The vendor neutrality allows users to freely switch between serving engines and storage vendors, while reusing the stored KV caches.

LMCache Deployment Modes

Key features

  • Engine-independent deployment: LMCache, as a standalone daemon process, manages KV cache independently from the inference engine process, so that KV cache will not be lost even if the inference engine crashes (i.e., no fate-sharing with engines).

  • Persistent, tiered KV cache offloading and reuse: Move KV caches out of GPU memory into a tiered storage hierarchy spanning CPU memory, local storage, and remote backends, enabling reuse across requests, sessions, and engine instances to reduce repeated prefill computation and improve TTFT.

  • Production-level KV cache observability: LMCache provides a rich set of KV cache observability metrics, including typical Kubernetes metrics (health monitoring, performance diagnostics), KV-cache-specific metrics (request-level and token-level prefix cache hits, lifecycle, request-level KV cache performance), management metrics (user-specific usage), and more.

  • Pluggable storage and transport backends: Easily integrate remote storage and KV transfer backends through a unified interface, enabling KV cache offloading and sharing across storage providers. Through this interface, LMCache supports storage backends including CPU RAM, local disk (SSD), Redis/Valkey, Mooncake, InfiniStore, S3-compatible object storage, NIXL, and GDS.

  • Non-prefix KV reuse: Extend KV reuse beyond prefix caching by reusing cached KV blocks at any position in the prompt. This leverages CacheBlend to selectively recompute tokens for quality recovery.

  • PD disaggregation and KV transfer: Support KV cache transfer from prefill workers to decode workers over NVLink, RDMA, or TCP through transport layers such as NIXL.

  • Pluggable KV transformation: A simple interface for researchers to write compression, token dropping, and custom serialization through a flexible SERDE interface.

LMCache is becoming an integral layer in the LLM inference ecosystem, with community-driven integration with serving engines, inference frameworks, hardware vendors, storage systems, and infrastructure providers:

LMCache ecosystem

Getting Started

To use LMCache, simply install lmcache from your package manager, e.g. pip:

pip install lmcache

For more setup options and examples, see:

Contributing

We welcome and value contributions and collaborations. Join us in improving LMCache. Check out the Contributing Guide or join our Slack community to get started.

Adoption and Partnerships

LMCache has a growing community of developers, researchers, industry adopters, and partners building the next generation of efficient LLM inference systems.

LMCache Adoption and Partnerships

As an independent open-source project, LMCache is becoming the de-facto standard for KV Cache management in LLM inference. Its continued development and community work are supported in part by Tensormesh.

Citation

LMCache builds on research in KV cache management, including cache reuse, offloading, compression, and serving optimization. If you use LMCache in your research, please cite the LMCache paper and related work.

@article{cheng2025lmcache,
  title={LMCache: An Efficient KV Cache Layer for Enterprise-Scale LLM Inference},
  author={Cheng, Yihua and Liu, Yuhan and Yao, Jiayi and An, Yuwei and Chen, Xiaokun and Feng, Shaoting and Huang, Yuyang and Shen, Samuel and Du, Kuntai and Jiang, Junchen},
  journal={arXiv preprint arXiv:2510.09665},
  year={2025}
}
Related papers
@inproceedings{liu2024cachegen,
  title={Cachegen: Kv cache compression and streaming for fast large language model serving},
  author={Liu, Yuhan and Li, Hanchen and Cheng, Yihua and Ray, Siddhant and Huang, Yuyang and Zhang, Qizheng and Du, Kuntai and Yao, Jiayi and Lu, Shan and Ananthanarayanan, Ganesh and others},
  booktitle={Proceedings of the ACM SIGCOMM 2024 Conference},
  pages={38--56},
  year={2024}
}

@inproceedings{yao2025cacheblend,
  title={Cacheblend: Fast large language model serving for rag with cached knowledge fusion},
  author={Yao, Jiayi and Li, Hanchen and Liu, Yuhan and Ray, Siddhant and Cheng, Yihua and Zhang, Qizheng and Du, Kuntai and Lu, Shan and Jiang, Junchen},
  booktitle={Proceedings of the twentieth European conference on computer systems},
  pages={94--109},
  year={2025}
}

License

The LMCache codebase is licensed under Apache License 2.0. See the LICENSE file for details.

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

lmcache-0.4.7.tar.gz (6.2 MB view details)

Uploaded Source

Built Distributions

If you're not sure about the file name format, learn more about wheel file names.

lmcache-0.4.7-cp313-cp313-manylinux_2_27_x86_64.manylinux_2_28_x86_64.whl (13.3 MB view details)

Uploaded CPython 3.13manylinux: glibc 2.27+ x86-64manylinux: glibc 2.28+ x86-64

lmcache-0.4.7-cp312-cp312-manylinux_2_27_x86_64.manylinux_2_28_x86_64.whl (13.3 MB view details)

Uploaded CPython 3.12manylinux: glibc 2.27+ x86-64manylinux: glibc 2.28+ x86-64

lmcache-0.4.7-cp311-cp311-manylinux_2_27_x86_64.manylinux_2_28_x86_64.whl (13.3 MB view details)

Uploaded CPython 3.11manylinux: glibc 2.27+ x86-64manylinux: glibc 2.28+ x86-64

lmcache-0.4.7-cp310-cp310-manylinux_2_27_x86_64.manylinux_2_28_x86_64.whl (13.2 MB view details)

Uploaded CPython 3.10manylinux: glibc 2.27+ x86-64manylinux: glibc 2.28+ x86-64

File details

Details for the file lmcache-0.4.7.tar.gz.

File metadata

  • Download URL: lmcache-0.4.7.tar.gz
  • Upload date:
  • Size: 6.2 MB
  • Tags: Source
  • Uploaded using Trusted Publishing? Yes
  • Uploaded via: twine/6.1.0 CPython/3.13.7

File hashes

Hashes for lmcache-0.4.7.tar.gz
Algorithm Hash digest
SHA256 15c6c4808e37f9ff22b46905bec6db2da65e82ce5558aab58418576ce825dfe3
MD5 facf820b4d77fbe58adff7dcb8c83e3a
BLAKE2b-256 95e906c0e037dcff3b0a946eade3809eb9e9ac985f7fa9bc0632456e418b4995

See more details on using hashes here.

Provenance

The following attestation bundles were made for lmcache-0.4.7.tar.gz:

Publisher: publish.yml on LMCache/LMCache

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

File details

Details for the file lmcache-0.4.7-cp313-cp313-manylinux_2_27_x86_64.manylinux_2_28_x86_64.whl.

File metadata

File hashes

Hashes for lmcache-0.4.7-cp313-cp313-manylinux_2_27_x86_64.manylinux_2_28_x86_64.whl
Algorithm Hash digest
SHA256 a8d251fa10e8e8e0df91eeef056d473929f38ac7ad8d771c6fbe656da228ca89
MD5 4029cbbdcc03893c5921677a887c1950
BLAKE2b-256 f75dda409bfdc241abeef867953148fd060e2759047155d2f6ff3e4f4d65b41a

See more details on using hashes here.

Provenance

The following attestation bundles were made for lmcache-0.4.7-cp313-cp313-manylinux_2_27_x86_64.manylinux_2_28_x86_64.whl:

Publisher: publish.yml on LMCache/LMCache

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

File details

Details for the file lmcache-0.4.7-cp312-cp312-manylinux_2_27_x86_64.manylinux_2_28_x86_64.whl.

File metadata

File hashes

Hashes for lmcache-0.4.7-cp312-cp312-manylinux_2_27_x86_64.manylinux_2_28_x86_64.whl
Algorithm Hash digest
SHA256 767f45e295d1bd2f4e686af3b5b89fc6c45d66f64e9f945e9b1081d7ede37e08
MD5 1f4a488aad852080c7c89348b695f34d
BLAKE2b-256 fb298f0e23b69509daa590e532fe708a6694f7bcdc0ef8100504a008eee01926

See more details on using hashes here.

Provenance

The following attestation bundles were made for lmcache-0.4.7-cp312-cp312-manylinux_2_27_x86_64.manylinux_2_28_x86_64.whl:

Publisher: publish.yml on LMCache/LMCache

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

File details

Details for the file lmcache-0.4.7-cp311-cp311-manylinux_2_27_x86_64.manylinux_2_28_x86_64.whl.

File metadata

File hashes

Hashes for lmcache-0.4.7-cp311-cp311-manylinux_2_27_x86_64.manylinux_2_28_x86_64.whl
Algorithm Hash digest
SHA256 79022445f3129350d37ef949961c4c3ca3f528c73ed5b45ee8c2b01916e6071d
MD5 6aedd32b26de12581c2e53091583dac7
BLAKE2b-256 c4f43d9ec1d40d9db5f10d89e83eac8e08f0c41baacbbcd940f3bb470ed19619

See more details on using hashes here.

Provenance

The following attestation bundles were made for lmcache-0.4.7-cp311-cp311-manylinux_2_27_x86_64.manylinux_2_28_x86_64.whl:

Publisher: publish.yml on LMCache/LMCache

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

File details

Details for the file lmcache-0.4.7-cp310-cp310-manylinux_2_27_x86_64.manylinux_2_28_x86_64.whl.

File metadata

File hashes

Hashes for lmcache-0.4.7-cp310-cp310-manylinux_2_27_x86_64.manylinux_2_28_x86_64.whl
Algorithm Hash digest
SHA256 f031e26b47996b0c3b341b8fa9ff2d80fb0b3f464dcf6429d2c9c4d2196339c4
MD5 cee580c955b9e1994b515568f42007e6
BLAKE2b-256 31e0dafe29a91c1b54f74fdc6b122a392875e1a976f7442122ea9a5dcbb9af56

See more details on using hashes here.

Provenance

The following attestation bundles were made for lmcache-0.4.7-cp310-cp310-manylinux_2_27_x86_64.manylinux_2_28_x86_64.whl:

Publisher: publish.yml on LMCache/LMCache

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

Supported by

AWS Cloud computing and Security Sponsor Datadog Monitoring Depot Continuous Integration Fastly CDN Google Download Analytics Pingdom Monitoring Sentry Error logging StatusPage Status page