
A high-throughput and memory-efficient inference and serving engine for LLMs


vLLM

Easy, fast, and cheap LLM serving for everyone

| Documentation | Blog | Paper | Discord |


Latest News 🔥

  • [2024/01] We hosted the second vLLM meetup in SF! Please find the meetup slides here.
  • [2024/01] Added ROCm 6.0 support to vLLM.
  • [2023/12] Added ROCm 5.7 support to vLLM.
  • [2023/10] We hosted the first vLLM meetup in SF! Please find the meetup slides here.
  • [2023/09] We created our Discord server! Join us to discuss vLLM and LLM serving! We will also post the latest announcements and updates there.
  • [2023/09] We released our PagedAttention paper on arXiv!
  • [2023/08] We would like to express our sincere gratitude to Andreessen Horowitz (a16z) for providing a generous grant to support the open-source development and research of vLLM.
  • [2023/07] Added support for LLaMA-2! You can run and serve 7B/13B/70B LLaMA-2s on vLLM with a single command!
  • [2023/06] Serving vLLM On any Cloud with SkyPilot. Check out a 1-click example to start the vLLM demo, and the blog post for the story behind vLLM development on the clouds.
  • [2023/06] We officially released vLLM! FastChat-vLLM integration has powered LMSYS Vicuna and Chatbot Arena since mid-April. Check out our blog post.

About

vLLM is a fast and easy-to-use library for LLM inference and serving.

vLLM is fast with:

  • State-of-the-art serving throughput
  • Efficient management of attention key and value memory with PagedAttention
  • Continuous batching of incoming requests
  • Fast model execution with CUDA/HIP graph
  • Quantization: GPTQ, AWQ, SqueezeLLM, FP8 KV Cache
  • Optimized CUDA kernels
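
The core PagedAttention idea can be sketched in heavily simplified form: the KV cache is split into fixed-size blocks, and each sequence keeps a block table mapping its logical token positions to physical blocks, so memory is allocated on demand instead of being reserved for the maximum sequence length. The sketch below is illustrative only; the block size, class names, and allocator are hypothetical, and vLLM's actual implementation lives in optimized CUDA kernels.

```python
# Illustrative sketch of PagedAttention-style KV-cache block management.
# BLOCK_SIZE, BlockAllocator, and Sequence are hypothetical names, not vLLM APIs.

BLOCK_SIZE = 16  # tokens stored per KV-cache block (illustrative value)

class BlockAllocator:
    """Hands out fixed-size physical blocks from a shared pool."""
    def __init__(self, num_blocks: int):
        self.free = list(range(num_blocks))

    def allocate(self) -> int:
        return self.free.pop()

    def release(self, block: int) -> None:
        self.free.append(block)

class Sequence:
    """One request's block table: logical token slots -> physical blocks."""
    def __init__(self, allocator: BlockAllocator):
        self.allocator = allocator
        self.block_table: list[int] = []
        self.num_tokens = 0

    def append_token(self) -> None:
        # A new physical block is claimed only when the last one is full,
        # so waste is bounded by one partially filled block per sequence.
        if self.num_tokens % BLOCK_SIZE == 0:
            self.block_table.append(self.allocator.allocate())
        self.num_tokens += 1

    def free(self) -> None:
        # Finished sequences return their blocks to the shared pool.
        for b in self.block_table:
            self.allocator.release(b)
        self.block_table.clear()

allocator = BlockAllocator(num_blocks=64)
seq = Sequence(allocator)
for _ in range(20):          # 20 tokens -> ceil(20 / 16) = 2 blocks
    seq.append_token()
print(len(seq.block_table))  # 2
```

Because blocks are freed as soon as a sequence finishes, the pool can be shared across many concurrent requests, which is what makes continuous batching memory-efficient.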

vLLM is flexible and easy to use with:

  • Seamless integration with popular Hugging Face models
  • High-throughput serving with various decoding algorithms, including parallel sampling, beam search, and more
  • Tensor parallelism support for distributed inference
  • Streaming outputs
  • OpenAI-compatible API server
  • Support for NVIDIA GPUs and AMD GPUs
  • (Experimental) Prefix caching support
  • (Experimental) Multi-LoRA support
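
Because the server speaks the OpenAI completions schema, existing OpenAI clients and tooling work against it unchanged. As a concrete illustration, a minimal request body might look like the following; the model name, prompt, and sampling values are placeholders, and the server address is assumed to be wherever you launched vLLM.

```python
import json

# Hypothetical request body for a vLLM OpenAI-compatible server.
# It follows the standard OpenAI /v1/completions schema.
request = {
    "model": "facebook/opt-125m",   # whichever model the server was launched with
    "prompt": "San Francisco is a",
    "max_tokens": 16,
    "temperature": 0.8,
}
body = json.dumps(request)
print(body)
```

The same payload could be sent with `curl` or any OpenAI client library pointed at the server's base URL; per the OpenAI schema, adding `"stream": true` requests streamed token-by-token output.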

vLLM seamlessly supports many Hugging Face models, including the following architectures:

  • Aquila & Aquila2 (BAAI/AquilaChat2-7B, BAAI/AquilaChat2-34B, BAAI/Aquila-7B, BAAI/AquilaChat-7B, etc.)
  • Baichuan & Baichuan2 (baichuan-inc/Baichuan2-13B-Chat, baichuan-inc/Baichuan-7B, etc.)
  • BLOOM (bigscience/bloom, bigscience/bloomz, etc.)
  • ChatGLM (THUDM/chatglm2-6b, THUDM/chatglm3-6b, etc.)
  • DeciLM (Deci/DeciLM-7B, Deci/DeciLM-7B-instruct, etc.)
  • Falcon (tiiuae/falcon-7b, tiiuae/falcon-40b, tiiuae/falcon-rw-7b, etc.)
  • Gemma (google/gemma-2b, google/gemma-7b, etc.)
  • GPT-2 (gpt2, gpt2-xl, etc.)
  • GPT BigCode (bigcode/starcoder, bigcode/gpt_bigcode-santacoder, etc.)
  • GPT-J (EleutherAI/gpt-j-6b, nomic-ai/gpt4all-j, etc.)
  • GPT-NeoX (EleutherAI/gpt-neox-20b, databricks/dolly-v2-12b, stabilityai/stablelm-tuned-alpha-7b, etc.)
  • InternLM (internlm/internlm-7b, internlm/internlm-chat-7b, etc.)
  • InternLM2 (internlm/internlm2-7b, internlm/internlm2-chat-7b, etc.)
  • LLaMA & LLaMA-2 (meta-llama/Llama-2-70b-hf, lmsys/vicuna-13b-v1.3, young-geng/koala, openlm-research/open_llama_13b, etc.)
  • Mistral (mistralai/Mistral-7B-v0.1, mistralai/Mistral-7B-Instruct-v0.1, etc.)
  • Mixtral (mistralai/Mixtral-8x7B-v0.1, mistralai/Mixtral-8x7B-Instruct-v0.1, etc.)
  • MPT (mosaicml/mpt-7b, mosaicml/mpt-30b, etc.)
  • OLMo (allenai/OLMo-1B, allenai/OLMo-7B, etc.)
  • OPT (facebook/opt-66b, facebook/opt-iml-max-30b, etc.)
  • Phi (microsoft/phi-1_5, microsoft/phi-2, etc.)
  • Qwen (Qwen/Qwen-7B, Qwen/Qwen-7B-Chat, etc.)
  • Qwen2 (Qwen/Qwen2-7B-beta, Qwen/Qwen2-7B-Chat-beta, etc.)
  • StableLM (stabilityai/stablelm-3b-4e1t, stabilityai/stablelm-base-alpha-7b-v2, etc.)
  • Yi (01-ai/Yi-6B, 01-ai/Yi-34B, etc.)

Install vLLM with pip or from source:

pip install vllm

Getting Started

Visit our documentation to get started.

Contributing

We welcome and value any contributions and collaborations. Please check out CONTRIBUTING.md for how to get involved.

Citation

If you use vLLM for your research, please cite our paper:

@inproceedings{kwon2023efficient,
  title={Efficient Memory Management for Large Language Model Serving with PagedAttention},
  author={Woosuk Kwon and Zhuohan Li and Siyuan Zhuang and Ying Sheng and Lianmin Zheng and Cody Hao Yu and Joseph E. Gonzalez and Hao Zhang and Ion Stoica},
  booktitle={Proceedings of the ACM SIGOPS 29th Symposium on Operating Systems Principles},
  year={2023}
}


Download files

Download the file for your platform.

Source Distributions

No source distribution files are available for this release.

Built Distributions

  • vllm-0.3.2-cp311-cp311-manylinux1_x86_64.whl (41.4 MB, CPython 3.11)
  • vllm-0.3.2-cp310-cp310-manylinux1_x86_64.whl (41.4 MB, CPython 3.10)
  • vllm-0.3.2-cp39-cp39-manylinux1_x86_64.whl (41.4 MB, CPython 3.9)
  • vllm-0.3.2-cp38-cp38-manylinux1_x86_64.whl (41.4 MB, CPython 3.8)

File hashes

vllm-0.3.2-cp311-cp311-manylinux1_x86_64.whl
  SHA256      82b974b5c26c5d2db126f7ebc6e863227d654e4da7c878cc808e25dde827e687
  MD5         3889db2dd7329cd55434ffd971d6ad10
  BLAKE2b-256 1cf302e0f00c9e539c89c670e187eebe122ba315d751f035722ceda979680673

vllm-0.3.2-cp310-cp310-manylinux1_x86_64.whl
  SHA256      29d250c38719b6b7d4436f7ad319cdc5696eedf5a9abab77fc55b7d177540fa4
  MD5         680dd0b31ca5869d4c8ae2ce1b100b5c
  BLAKE2b-256 db6dd280d5c179f9e83d65056bea80b285721f6c508debc1b4bae0fb8f154bb3

vllm-0.3.2-cp39-cp39-manylinux1_x86_64.whl
  SHA256      ab1df996cc117e54f1fb6ede8b0b0b61f4cf398d908f4f79eefa2e35177771cd
  MD5         2929b81c1f0d5af0ce1e2e776975a988
  BLAKE2b-256 5e5c5e9b7552bbe37683aca180d354f0c75001a224c172f41a4a4428493a4068

vllm-0.3.2-cp38-cp38-manylinux1_x86_64.whl
  SHA256      9e05dbbf9fcbcd29e14497d45163eac7c7be532bc08cf4fc8fda8f6228be96bc
  MD5         f96ca74ee955dfa399a1e020443b16fa
  BLAKE2b-256 1c54ca91d90b14b59b3bbe7735bbc62972a9e7977021147f0785a5f832617b2a
