A high-throughput and memory-efficient inference and serving engine for LLMs

Project description

vLLM

Easy, fast, and cheap LLM serving for everyone

| Documentation | Blog | Paper | Discord |


Ray Summit CFP is open (June 4th to June 20th)!

There will be a track for vLLM at the Ray Summit (09/30-10/02, SF) this year! If you have cool projects related to vLLM or LLM inference, we would love to see your proposals. This will be a great chance for everyone in the community to get together and learn. Please submit your proposal here.


Latest News 🔥

  • [2024/06] We hosted the fourth vLLM meetup with Cloudflare and BentoML! Please find the meetup slides here.
  • [2024/04] We hosted the third vLLM meetup with Roblox! Please find the meetup slides here.
  • [2024/01] We hosted the second vLLM meetup in SF! Please find the meetup slides here.
  • [2024/01] Added ROCm 6.0 support to vLLM.
  • [2023/12] Added ROCm 5.7 support to vLLM.
  • [2023/10] We hosted the first vLLM meetup in SF! Please find the meetup slides here.
  • [2023/09] We created our Discord server! Join us to discuss vLLM and LLM serving! We will also post the latest announcements and updates there.
  • [2023/09] We released our PagedAttention paper on arXiv!
  • [2023/08] We would like to express our sincere gratitude to Andreessen Horowitz (a16z) for providing a generous grant to support the open-source development and research of vLLM.
  • [2023/07] Added support for LLaMA-2! You can run and serve 7B/13B/70B LLaMA-2s on vLLM with a single command!
  • [2023/06] Serving vLLM On any Cloud with SkyPilot. Check out a 1-click example to start the vLLM demo, and the blog post for the story behind vLLM development on the clouds.
  • [2023/06] We officially released vLLM! FastChat-vLLM integration has powered LMSYS Vicuna and Chatbot Arena since mid-April. Check out our blog post.

About

vLLM is a fast and easy-to-use library for LLM inference and serving.

vLLM is fast with:

  • State-of-the-art serving throughput
  • Efficient management of attention key and value memory with PagedAttention
  • Continuous batching of incoming requests
  • Fast model execution with CUDA/HIP graph
  • Quantization: GPTQ, AWQ, SqueezeLLM, FP8 KV Cache (see the example after this list)
  • Optimized CUDA kernels
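
The quantization methods above are selected through the engine's quantization option. A minimal sketch, assuming an AWQ-quantized checkpoint (the model name is illustrative; GPTQ or SqueezeLLM checkpoints work analogously with quantization="gptq" or quantization="squeezellm"):

from vllm import LLM

# Illustrative AWQ checkpoint; any AWQ-quantized Hugging Face model should work here.
llm = LLM(model="TheBloke/Llama-2-7B-Chat-AWQ", quantization="awq")
print(llm.generate(["Quantized models use"])[0].outputs[0].text)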

vLLM is flexible and easy to use with:

  • Seamless integration with popular Hugging Face models
  • High-throughput serving with various decoding algorithms, including parallel sampling, beam search, and more
  • Tensor parallelism support for distributed inference
  • Streaming outputs
  • OpenAI-compatible API server (see the example after this list)
  • Support for NVIDIA GPUs, AMD GPUs, and Intel CPUs and GPUs
  • (Experimental) Prefix caching support
  • (Experimental) Multi-LoRA support
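
For example, once the OpenAI-compatible server is running, it can be queried with the official openai client. A minimal sketch, assuming the server was started with python -m vllm.entrypoints.openai.api_server --model meta-llama/Llama-2-7b-chat-hf and is listening on the default port 8000 (the model name is illustrative):

from openai import OpenAI

# Point the client at the local vLLM server; the API key is unused but required by the client.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")
completion = client.completions.create(
    model="meta-llama/Llama-2-7b-chat-hf",  # must match the model the server was started with
    prompt="San Francisco is a",
    max_tokens=32,
)
print(completion.choices[0].text)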

vLLM seamlessly supports most popular open-source models on Hugging Face, including:

  • Transformer-like LLMs (e.g., Llama)
  • Mixture-of-Experts LLMs (e.g., Mixtral)
  • Multi-modal LLMs (e.g., LLaVA)

Find the full list of supported models here.
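
As an illustration, a Mixture-of-Experts model such as Mixtral can be loaded by its Hugging Face name and sharded across GPUs with tensor parallelism. A minimal sketch (the model name and GPU count are illustrative):

from vllm import LLM

# Illustrative: shard a MoE model across 2 GPUs using tensor parallelism.
llm = LLM(model="mistralai/Mixtral-8x7B-Instruct-v0.1", tensor_parallel_size=2)
print(llm.generate(["Mixture-of-Experts models route tokens"])[0].outputs[0].text)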

Getting Started

Install vLLM with pip or from source:

pip install vllm
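
Once installed, a few lines of Python are enough to run offline batched inference. A minimal sketch (the model name is illustrative; any supported Hugging Face model can be used):

from vllm import LLM, SamplingParams

prompts = ["Hello, my name is", "The capital of France is"]
sampling_params = SamplingParams(temperature=0.8, top_p=0.95)

llm = LLM(model="facebook/opt-125m")  # illustrative small model
outputs = llm.generate(prompts, sampling_params)

for output in outputs:
    print(output.prompt, output.outputs[0].text)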

Visit our documentation to learn more.

Contributing

We welcome and value any contributions and collaborations. Please check out CONTRIBUTING.md for how to get involved.

Sponsors

vLLM is a community project. Our compute resources for development and testing are supported by the following organizations. Thank you for your support!

  • a16z
  • AMD
  • Anyscale
  • AWS
  • Crusoe Cloud
  • Databricks
  • DeepInfra
  • Dropbox
  • Lambda Labs
  • NVIDIA
  • Replicate
  • Roblox
  • RunPod
  • Sequoia Capital
  • Trainy
  • UC Berkeley
  • UC San Diego
  • ZhenFund

We also have an official fundraising venue through OpenCollective. We plan to use the fund to support the development, maintenance, and adoption of vLLM.

Citation

If you use vLLM for your research, please cite our paper:

@inproceedings{kwon2023efficient,
  title={Efficient Memory Management for Large Language Model Serving with PagedAttention},
  author={Woosuk Kwon and Zhuohan Li and Siyuan Zhuang and Ying Sheng and Lianmin Zheng and Cody Hao Yu and Joseph E. Gonzalez and Hao Zhang and Ion Stoica},
  booktitle={Proceedings of the ACM SIGOPS 29th Symposium on Operating Systems Principles},
  year={2023}
}

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

vllm-0.5.1.tar.gz (790.6 kB)

Uploaded Source

Built Distributions

vllm-0.5.1-cp311-cp311-manylinux1_x86_64.whl (146.9 MB)

Uploaded CPython 3.11

vllm-0.5.1-cp310-cp310-manylinux1_x86_64.whl (146.9 MB)

Uploaded CPython 3.10

vllm-0.5.1-cp39-cp39-manylinux1_x86_64.whl (146.9 MB)

Uploaded CPython 3.9

vllm-0.5.1-cp38-cp38-manylinux1_x86_64.whl (146.9 MB)

Uploaded CPython 3.8

File details

Details for the file vllm-0.5.1.tar.gz.

File metadata

  • Download URL: vllm-0.5.1.tar.gz
  • Upload date:
  • Size: 790.6 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/5.1.1 CPython/3.9.19

File hashes

Hashes for vllm-0.5.1.tar.gz:
  • SHA256: c7b6d01ec0644dd251e00a52bd08f9a172da2185dce87538deb56c577120e0eb
  • MD5: 712ce027e5af01e73b04a4013e0db846
  • BLAKE2b-256: fcd238993991ee61bacaccd49182ce597bf4496bc4ac96e92a3b011fcde90bb7

See more details on using hashes here.
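
To verify a downloaded artifact against the published SHA256 digest before installing, a minimal sketch (the digest below is the one published for vllm-0.5.1.tar.gz above):

import hashlib

expected = "c7b6d01ec0644dd251e00a52bd08f9a172da2185dce87538deb56c577120e0eb"
with open("vllm-0.5.1.tar.gz", "rb") as f:
    digest = hashlib.sha256(f.read()).hexdigest()
print("OK" if digest == expected else "MISMATCH")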

File details

Details for the file vllm-0.5.1-cp311-cp311-manylinux1_x86_64.whl.

File metadata

File hashes

Hashes for vllm-0.5.1-cp311-cp311-manylinux1_x86_64.whl:
  • SHA256: 8b7483adcfc115dc7237663636a0976943fa65cde6f616a3839f0de69a419b5c
  • MD5: ee26a24ff21bb008762b918b8c560bd0
  • BLAKE2b-256: 1379e212b0eae62716be0d75994b71b7f97f4a45a7ffd8533408bcd3cde6a7e0

See more details on using hashes here.

File details

Details for the file vllm-0.5.1-cp310-cp310-manylinux1_x86_64.whl.

File metadata

File hashes

Hashes for vllm-0.5.1-cp310-cp310-manylinux1_x86_64.whl:
  • SHA256: 562e3dd1d54ad0afb299e64412322b372976faaebb70a7db8e4a6c4ef5ed67c4
  • MD5: 3e2c3e89952f245a1b50f96e894ee98f
  • BLAKE2b-256: 8243a0fbf45a2ccf038ea05774fba715633c70f7a26c7b3270c337038f9f77c0

See more details on using hashes here.

File details

Details for the file vllm-0.5.1-cp39-cp39-manylinux1_x86_64.whl.

File metadata

File hashes

Hashes for vllm-0.5.1-cp39-cp39-manylinux1_x86_64.whl:
  • SHA256: 6e6cad87335713cfc6710bc44c2688a6158abc3184d3530731e89b948617b2c7
  • MD5: 3ff86162634224c47bb9b413b090b345
  • BLAKE2b-256: 7289671a176610d4ffe04216dccde6bffd07a8c67f4c0c45ce8be60266d7f574

See more details on using hashes here.

File details

Details for the file vllm-0.5.1-cp38-cp38-manylinux1_x86_64.whl.

File metadata

File hashes

Hashes for vllm-0.5.1-cp38-cp38-manylinux1_x86_64.whl:
  • SHA256: 4dd42231beddf1546eb8321df0cdca621863842fa6b0d5ad736338a8ea65e531
  • MD5: 79d4a13dfc8aab60fee8f5dec62de21a
  • BLAKE2b-256: c3e0f950ff3900f98b1436272ac34dbf9664db63d47702451209ed495fb378f9

See more details on using hashes here.
