
An LLM serving engine extension to reduce TTFT and increase throughput, especially under long-context scenarios.

Project description



Blog | Documentation | Join Slack | Interest Form | Roadmap

🔥 NEW: For enterprise-scale deployment of LMCache and vLLM, please check out the vLLM Production Stack. LMCache is also officially supported in llm-d and KServe!

Summary

LMCache is an LLM serving engine extension that reduces TTFT and increases throughput, especially under long-context scenarios. By storing the KV caches of reusable texts across multiple tiers (GPU, CPU DRAM, local disk), LMCache can reuse the KV cache of any repeated text (not necessarily a prefix) in any serving engine instance, saving precious GPU cycles and reducing user response delay.

By combining LMCache with vLLM, developers achieve 3-10x savings in response delay and GPU cycles in many LLM use cases, including multi-round QA and RAG.

[Performance comparison figure]
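
To make the integration concrete, the sketch below shows one way to enable LMCache as vLLM's KV cache connector. This is a minimal sketch based on the vLLM v1 KV-connector interface; the model name is only an example, and exact option names may vary across vLLM versions, so treat the Quickstart Examples in the docs as authoritative.

# Minimal sketch: vLLM v1 with LMCache as the KV cache connector.
# Assumes LMCache and a recent vLLM are installed; option names may
# differ across versions -- see the LMCache docs for your setup.
from vllm import LLM, SamplingParams
from vllm.config import KVTransferConfig

llm = LLM(
    model="meta-llama/Meta-Llama-3.1-8B-Instruct",  # example; any supported model
    kv_transfer_config=KVTransferConfig(
        kv_connector="LMCacheConnectorV1",  # route KV cache load/store through LMCache
        kv_role="kv_both",                  # this instance both saves and reuses KV caches
    ),
)

outputs = llm.generate(
    ["Summarize the following document: ..."],
    SamplingParams(temperature=0.0, max_tokens=64),
)
print(outputs[0].outputs[0].text)

With this in place, a later request that shares text with an earlier one can load the cached KV entries instead of recomputing prefill.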

Features

  • 🔥 Integration with vLLM v1, with the following features:
    • High-performance CPU KV cache offloading (see the configuration sketch after this list)
    • Disaggregated prefill
    • P2P KV cache sharing
  • Integration with SGLang for KV cache offloading
  • LMCache is supported in the vLLM Production Stack, llm-d, and KServe
  • Stable support for non-prefix KV caches
  • Support for multiple storage backends (GPU, CPU DRAM, local disk)
  • Installation through pip, compatible with the latest vLLM
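
As a concrete illustration of the CPU offloading feature above, the sketch below configures LMCache through its LMCACHE_* environment variables. The variable names follow the LMCache documentation, but treat the exact names and defaults as assumptions to verify against the docs for your installed version.

import os

# Hedged sketch: configure CPU DRAM offloading before starting the engine.
os.environ["LMCACHE_CHUNK_SIZE"] = "256"          # tokens per KV cache chunk
os.environ["LMCACHE_LOCAL_CPU"] = "True"          # enable offloading to CPU DRAM
os.environ["LMCACHE_MAX_LOCAL_CPU_SIZE"] = "5.0"  # CPU cache budget, in GB

# Start vLLM (or another supported engine) after setting these, so that
# LMCache picks up the configuration at initialization.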

Installation

To use LMCache, simply install lmcache from your package manager, e.g. pip:

pip install lmcache

LMCache works on Linux with NVIDIA GPUs.

More detailed installation instructions are available in the docs, particularly if you are not using the latest stable version of vLLM or are using another serving engine with different dependencies. Any "undefined symbol" errors or torch version mismatches can be resolved by following the documentation.
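
If you do hit such an error, a quick first step is to confirm that the installed versions line up, since "undefined symbol" usually indicates that compiled extensions were built against a different torch or vLLM. A minimal check (assuming lmcache exposes __version__ like most packages; otherwise use pip show lmcache):

import torch
import vllm
import lmcache

print("torch  :", torch.__version__)
print("vllm   :", vllm.__version__)
print("lmcache:", lmcache.__version__)  # compare against the versions the docs recommend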

Getting started

The best way to get started is to check out the Quickstart Examples in the docs.

Documentation

Check out the LMCache documentation, which is available online.

We also post regularly on the LMCache blog.

Examples

Go hands-on with our examples, which demonstrate how to address different use cases with LMCache.

Interested in Connecting?

Fill out the interest form, sign up for our newsletter, join the LMCache Slack, check out the LMCache website, or drop us an email, and our team will reach out to you!

Community meeting

The LMCache community meeting is hosted bi-weekly over Zoom. All are welcome to join!

Meetings are held on Tuesdays at 9:00 AM PT – Add to Google Calendar

We keep notes from each meeting in this document, summarizing standups, discussions, and action items.

Recordings of meetings are available on the YouTube LMCache channel.

Contributing

We welcome and value all contributions and collaborations. Please check out the Contributing Guide to learn how to contribute.

We continually update the [Onboarding] Welcoming contributors issue with good first issues!

Citation

If you use LMCache for your research, please cite our papers:

@inproceedings{liu2024cachegen,
  title={CacheGen: KV Cache Compression and Streaming for Fast Large Language Model Serving},
  author={Liu, Yuhan and Li, Hanchen and Cheng, Yihua and Ray, Siddhant and Huang, Yuyang and Zhang, Qizheng and Du, Kuntai and Yao, Jiayi and Lu, Shan and Ananthanarayanan, Ganesh and others},
  booktitle={Proceedings of the ACM SIGCOMM 2024 Conference},
  pages={38--56},
  year={2024}
}

@article{cheng2024large,
  title={Do Large Language Models Need a Content Delivery Network?},
  author={Cheng, Yihua and Du, Kuntai and Yao, Jiayi and Jiang, Junchen},
  journal={arXiv preprint arXiv:2409.13761},
  year={2024}
}

@inproceedings{10.1145/3689031.3696098,
  title={CacheBlend: Fast Large Language Model Serving for RAG with Cached Knowledge Fusion},
  author={Yao, Jiayi and Li, Hanchen and Liu, Yuhan and Ray, Siddhant and Cheng, Yihua and Zhang, Qizheng and Du, Kuntai and Lu, Shan and Jiang, Junchen},
  booktitle={Proceedings of the Twentieth European Conference on Computer Systems},
  pages={94--109},
  year={2025},
  url={https://doi.org/10.1145/3689031.3696098},
  doi={10.1145/3689031.3696098}
}

@article{cheng2025lmcache,
  title={LMCache: An Efficient KV Cache Layer for Enterprise-Scale LLM Inference},
  author={Cheng, Yihua and Liu, Yuhan and Yao, Jiayi and An, Yuwei and Chen, Xiaokun and Feng, Shaoting and Huang, Yuyang and Shen, Samuel and Du, Kuntai and Jiang, Junchen},
  journal={arXiv preprint arXiv:2510.09665},
  year={2025}
}

Socials

LinkedIn | Twitter | YouTube

License

The LMCache codebase is licensed under Apache License 2.0. See the LICENSE file for details.

Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

lmcache-0.3.9.post1.tar.gz (1.1 MB)

Uploaded Source

Built Distributions

If you're not sure about the file name format, learn more about wheel file names.

lmcache-0.3.9.post1-cp313-cp313-manylinux_2_24_x86_64.manylinux_2_28_x86_64.whl (3.9 MB)

Uploaded: CPython 3.13 · manylinux: glibc 2.24+ x86-64 · manylinux: glibc 2.28+ x86-64

lmcache-0.3.9.post1-cp312-cp312-manylinux_2_24_x86_64.manylinux_2_28_x86_64.whl (3.9 MB)

Uploaded: CPython 3.12 · manylinux: glibc 2.24+ x86-64 · manylinux: glibc 2.28+ x86-64

lmcache-0.3.9.post1-cp311-cp311-manylinux_2_24_x86_64.manylinux_2_28_x86_64.whl (3.9 MB)

Uploaded: CPython 3.11 · manylinux: glibc 2.24+ x86-64 · manylinux: glibc 2.28+ x86-64

lmcache-0.3.9.post1-cp310-cp310-manylinux_2_24_x86_64.manylinux_2_28_x86_64.whl (3.9 MB)

Uploaded: CPython 3.10 · manylinux: glibc 2.24+ x86-64 · manylinux: glibc 2.28+ x86-64

File details

Details for the file lmcache-0.3.9.post1.tar.gz.

File metadata

  • Download URL: lmcache-0.3.9.post1.tar.gz
  • Upload date:
  • Size: 1.1 MB
  • Tags: Source
  • Uploaded using Trusted Publishing? Yes
  • Uploaded via: twine/6.1.0 CPython/3.13.7

File hashes

Hashes for lmcache-0.3.9.post1.tar.gz:
  • SHA256: afb0b9a8da47f29dbba2ecef2d08b29eda2ad44cade6e7394e28d26a39dbeef2
  • MD5: a39c06465cccbde14f2fde969c4b6345
  • BLAKE2b-256: 688edd7e4a48a78241bbd1ce3d0a1964f6c97b4fbeae85f14ace25041e8311a3

See more details on using hashes here.
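
For example, a downloaded sdist can be verified against the SHA256 digest above using only the standard library (a minimal sketch; the filename assumes the archive sits in the current directory):

import hashlib

EXPECTED = "afb0b9a8da47f29dbba2ecef2d08b29eda2ad44cade6e7394e28d26a39dbeef2"

with open("lmcache-0.3.9.post1.tar.gz", "rb") as f:
    digest = hashlib.sha256(f.read()).hexdigest()

assert digest == EXPECTED, f"hash mismatch: {digest}"
print("sdist hash verified")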

Provenance

The following attestation bundles were made for lmcache-0.3.9.post1.tar.gz:

Publisher: publish.yml on LMCache/LMCache

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

File details

Details for the file lmcache-0.3.9.post1-cp313-cp313-manylinux_2_24_x86_64.manylinux_2_28_x86_64.whl.

File metadata

File hashes

Hashes for lmcache-0.3.9.post1-cp313-cp313-manylinux_2_24_x86_64.manylinux_2_28_x86_64.whl:
  • SHA256: 09f51d579477b915e8572fb30426358051636f563c2c34d2504612ca59069010
  • MD5: 6c17e104f1950bf5af3ec403a757148b
  • BLAKE2b-256: 7c9eb0b25a6cccbe0cecbd79364156131744e9c64a46fe2e088a19f327106ede

See more details on using hashes here.

Provenance

The following attestation bundles were made for lmcache-0.3.9.post1-cp313-cp313-manylinux_2_24_x86_64.manylinux_2_28_x86_64.whl:

Publisher: publish.yml on LMCache/LMCache

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

File details

Details for the file lmcache-0.3.9.post1-cp312-cp312-manylinux_2_24_x86_64.manylinux_2_28_x86_64.whl.

File metadata

File hashes

Hashes for lmcache-0.3.9.post1-cp312-cp312-manylinux_2_24_x86_64.manylinux_2_28_x86_64.whl:
  • SHA256: ab32b9f6c58c644aefb58718d2b58137eab7ae075f8468a4c2ada103a23de876
  • MD5: 84626c33fe497639c679cb0d16823b6d
  • BLAKE2b-256: 0c91cccc49fbd9b904e24682f31eb4a0d9ffd2ccfe93d9cd358e81d2bb104d8f

See more details on using hashes here.

Provenance

The following attestation bundles were made for lmcache-0.3.9.post1-cp312-cp312-manylinux_2_24_x86_64.manylinux_2_28_x86_64.whl:

Publisher: publish.yml on LMCache/LMCache

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

File details

Details for the file lmcache-0.3.9.post1-cp311-cp311-manylinux_2_24_x86_64.manylinux_2_28_x86_64.whl.

File metadata

File hashes

Hashes for lmcache-0.3.9.post1-cp311-cp311-manylinux_2_24_x86_64.manylinux_2_28_x86_64.whl:
  • SHA256: 9acef7d187611e71ca5adea4bd944fc1d99ff2e317a960a41a7f28850f99f5f0
  • MD5: 310fee877e16ac44f35388d3b99f086c
  • BLAKE2b-256: f841841948fde14008f540449cc3515ffed64303b0c818a552a4beaa78a9c1b1

See more details on using hashes here.

Provenance

The following attestation bundles were made for lmcache-0.3.9.post1-cp311-cp311-manylinux_2_24_x86_64.manylinux_2_28_x86_64.whl:

Publisher: publish.yml on LMCache/LMCache

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

File details

Details for the file lmcache-0.3.9.post1-cp310-cp310-manylinux_2_24_x86_64.manylinux_2_28_x86_64.whl.

File metadata

File hashes

Hashes for lmcache-0.3.9.post1-cp310-cp310-manylinux_2_24_x86_64.manylinux_2_28_x86_64.whl:
  • SHA256: 740a5b867327dcbff43727d99a99ce13f78a86358e0670f4e55e5f863c553db6
  • MD5: ab764ddf61ce22a87e258f161a1605e8
  • BLAKE2b-256: 1fb845a5d84f6224aa80d61a6c4acb04bd6bb870008e7220ab6d0ef6151bf389

See more details on using hashes here.

Provenance

The following attestation bundles were made for lmcache-0.3.9.post1-cp310-cp310-manylinux_2_24_x86_64.manylinux_2_28_x86_64.whl:

Publisher: publish.yml on LMCache/LMCache

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.
