A high-throughput and memory-efficient inference and serving engine for LLMs

Project description

vLLM

Easy, fast, and cheap LLM serving for everyone

| Documentation | Blog | Paper | Discord |


Latest News 🔥

  • [2023/12] Added ROCm support to vLLM.
  • [2023/10] We hosted the first vLLM meetup in SF! Please find the meetup slides here.
  • [2023/09] We created our Discord server! Join us to discuss vLLM and LLM serving! We will also post the latest announcements and updates there.
  • [2023/09] We released our PagedAttention paper on arXiv!
  • [2023/08] We would like to express our sincere gratitude to Andreessen Horowitz (a16z) for providing a generous grant to support the open-source development and research of vLLM.
  • [2023/07] Added support for LLaMA-2! You can run and serve 7B/13B/70B LLaMA-2s on vLLM with a single command!
  • [2023/06] Serving vLLM on any cloud with SkyPilot. Check out the 1-click example to start the vLLM demo, and the blog post for the story behind vLLM development in the cloud.
  • [2023/06] We officially released vLLM! FastChat-vLLM integration has powered LMSYS Vicuna and Chatbot Arena since mid-April. Check out our blog post.

About

vLLM is a fast and easy-to-use library for LLM inference and serving.

vLLM is fast with:

  • State-of-the-art serving throughput
  • Efficient management of attention key and value memory with PagedAttention
  • Continuous batching of incoming requests
  • Fast model execution with CUDA/HIP graph
  • Quantization: GPTQ, AWQ, SqueezeLLM (see the sketch after this list)
  • Optimized CUDA kernels
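
As a rough illustration of the quantization option above, the following sketch loads an AWQ-quantized checkpoint. The model name is just one example of an AWQ checkpoint on the Hugging Face Hub, not something this package ships; any AWQ-quantized model of a supported architecture should work the same way.

from vllm import LLM

# Load an AWQ-quantized checkpoint. The model name is an example from
# the Hugging Face Hub; substitute any AWQ model of a supported architecture.
llm = LLM(model="TheBloke/Llama-2-7B-AWQ", quantization="awq")

outputs = llm.generate("The capital of France is")
print(outputs[0].outputs[0].text)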

vLLM is flexible and easy to use with:

  • Seamless integration with popular Hugging Face models
  • High-throughput serving with various decoding algorithms, including parallel sampling, beam search, and more
  • Tensor parallelism support for distributed inference
  • Streaming outputs
  • OpenAI-compatible API server (see the example after this list)
  • Support for NVIDIA and AMD GPUs
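
For example, a minimal sketch of querying the OpenAI-compatible server. The model name and the default port 8000 are illustrative, and the launch command also shows the tensor-parallelism flag used for distributed inference on multiple GPUs:

# Launch the server first, e.g. on two GPUs:
#   python -m vllm.entrypoints.openai.api_server \
#       --model meta-llama/Llama-2-7b-hf --tensor-parallel-size 2
import requests  # third-party HTTP client, assumed installed

# Query the OpenAI-compatible completions endpoint.
resp = requests.post(
    "http://localhost:8000/v1/completions",
    json={
        "model": "meta-llama/Llama-2-7b-hf",
        "prompt": "San Francisco is a",
        "max_tokens": 32,
        "temperature": 0.7,
    },
)
print(resp.json()["choices"][0]["text"])

Setting "stream": true in the request body should enable the streaming output mode listed above, mirroring the OpenAI API's streaming protocol.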

vLLM seamlessly supports many Hugging Face models, including the following architectures (a quickstart sketch follows the list):

  • Aquila & Aquila2 (BAAI/AquilaChat2-7B, BAAI/AquilaChat2-34B, BAAI/Aquila-7B, BAAI/AquilaChat-7B, etc.)
  • Baichuan & Baichuan2 (baichuan-inc/Baichuan2-13B-Chat, baichuan-inc/Baichuan-7B, etc.)
  • BLOOM (bigscience/bloom, bigscience/bloomz, etc.)
  • ChatGLM (THUDM/chatglm2-6b, THUDM/chatglm3-6b, etc.)
  • DeciLM (Deci/DeciLM-7B, Deci/DeciLM-7B-instruct, etc.)
  • Falcon (tiiuae/falcon-7b, tiiuae/falcon-40b, tiiuae/falcon-rw-7b, etc.)
  • GPT-2 (gpt2, gpt2-xl, etc.)
  • GPT BigCode (bigcode/starcoder, bigcode/gpt_bigcode-santacoder, etc.)
  • GPT-J (EleutherAI/gpt-j-6b, nomic-ai/gpt4all-j, etc.)
  • GPT-NeoX (EleutherAI/gpt-neox-20b, databricks/dolly-v2-12b, stabilityai/stablelm-tuned-alpha-7b, etc.)
  • InternLM (internlm/internlm-7b, internlm/internlm-chat-7b, etc.)
  • LLaMA & LLaMA-2 (meta-llama/Llama-2-70b-hf, lmsys/vicuna-13b-v1.3, young-geng/koala, openlm-research/open_llama_13b, etc.)
  • Mistral (mistralai/Mistral-7B-v0.1, mistralai/Mistral-7B-Instruct-v0.1, etc.)
  • Mixtral (mistralai/Mixtral-8x7B-v0.1, mistralai/Mixtral-8x7B-Instruct-v0.1, etc.)
  • MPT (mosaicml/mpt-7b, mosaicml/mpt-30b, etc.)
  • OPT (facebook/opt-66b, facebook/opt-iml-max-30b, etc.)
  • Phi (microsoft/phi-1_5, microsoft/phi-2, etc.)
  • Qwen (Qwen/Qwen-7B, Qwen/Qwen-7B-Chat, etc.)
  • Yi (01-ai/Yi-6B, 01-ai/Yi-34B, etc.)
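
As a quickstart sketch for offline inference with one of these models (facebook/opt-125m is used here only because it is small; any architecture from the list should work the same way):

from vllm import LLM, SamplingParams

prompts = [
    "Hello, my name is",
    "The future of AI is",
]
sampling_params = SamplingParams(temperature=0.8, top_p=0.95)

# Download the model from the Hugging Face Hub and run batched generation.
llm = LLM(model="facebook/opt-125m")
outputs = llm.generate(prompts, sampling_params)

for output in outputs:
    print(f"Prompt: {output.prompt!r}, generated: {output.outputs[0].text!r}")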

Install vLLM with pip, or build from source as sketched below:

pip install vllm
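
To build from source instead, a sketch of the usual editable install (this assumes a CUDA toolchain compatible with your PyTorch build):

git clone https://github.com/vllm-project/vllm.git
cd vllm
pip install -e .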

Getting Started

Visit our documentation to get started.

Contributing

We welcome and value any contributions and collaborations. Please check out CONTRIBUTING.md for how to get involved.

Citation

If you use vLLM for your research, please cite our paper:

@inproceedings{kwon2023efficient,
  title={Efficient Memory Management for Large Language Model Serving with PagedAttention},
  author={Woosuk Kwon and Zhuohan Li and Siyuan Zhuang and Ying Sheng and Lianmin Zheng and Cody Hao Yu and Joseph E. Gonzalez and Hao Zhang and Ion Stoica},
  booktitle={Proceedings of the ACM SIGOPS 29th Symposium on Operating Systems Principles},
  year={2023}
}

Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

vllm-0.2.7.tar.gz (170.8 kB)

Uploaded Source

Built Distributions

vllm-0.2.7-cp311-cp311-manylinux1_x86_64.whl (10.2 MB)

Uploaded CPython 3.11

vllm-0.2.7-cp310-cp310-manylinux1_x86_64.whl (10.2 MB)

Uploaded CPython 3.10

vllm-0.2.7-cp39-cp39-manylinux1_x86_64.whl (10.2 MB)

Uploaded CPython 3.9

vllm-0.2.7-cp38-cp38-manylinux1_x86_64.whl (10.2 MB)

Uploaded CPython 3.8

File details

Details for the file vllm-0.2.7.tar.gz.

File metadata

  • Download URL: vllm-0.2.7.tar.gz
  • Upload date:
  • Size: 170.8 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/4.0.2 CPython/3.9.18

File hashes

Hashes for vllm-0.2.7.tar.gz:
  SHA256: 41a7266be66b2887be1afd879a77c3e4062c5f30fd3159eace2e8e271fe21271
  MD5: 81f2ffd8906267877ed03676dfe86845
  BLAKE2b-256: 315457a0be8c28c3f5ab399657d4353deeb3fa5c20ba34ae72d6a43c95ba97fe

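You can verify a downloaded file against these digests before installing. For example, a minimal SHA256 check of the sdist in Python (assuming the file was downloaded to the current directory):

import hashlib

# Expected digest, copied from the table above.
expected = "41a7266be66b2887be1afd879a77c3e4062c5f30fd3159eace2e8e271fe21271"

with open("vllm-0.2.7.tar.gz", "rb") as f:
    actual = hashlib.sha256(f.read()).hexdigest()

assert actual == expected, f"SHA256 mismatch: {actual}"
print("vllm-0.2.7.tar.gz: SHA256 OK")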

File details

Details for the file vllm-0.2.7-cp311-cp311-manylinux1_x86_64.whl.

File hashes

Hashes for vllm-0.2.7-cp311-cp311-manylinux1_x86_64.whl:
  SHA256: adb96ec1c18b40f59cefb849425be4d1a247ab6489b69ff072867d62ffe83ce4
  MD5: 54d4519d937fa7de275bbba94c1515ad
  BLAKE2b-256: d91b36580f965575977d0c7eb186dfafa8765784f3deb65e0261eebaf64a21c0


File details

Details for the file vllm-0.2.7-cp310-cp310-manylinux1_x86_64.whl.

File hashes

Hashes for vllm-0.2.7-cp310-cp310-manylinux1_x86_64.whl:
  SHA256: e4c355b3cd99aac0f87fc2977531c8be69b76414391f74d025e63b1ab884334a
  MD5: d64299be0a068ccff5159d9d65d44539
  BLAKE2b-256: c7c3ee8545cae25c517bae939fd103d3a31fe9c68c96e48a8a67ddbaa7bb86a3


File details

Details for the file vllm-0.2.7-cp39-cp39-manylinux1_x86_64.whl.

File hashes

Hashes for vllm-0.2.7-cp39-cp39-manylinux1_x86_64.whl:
  SHA256: 8bf6cff299dcbcd498ab46c357b781eb70207d11523c730cc88c8447eb5bbc0a
  MD5: 507a72e65b0a613dccad235eb5eb26fc
  BLAKE2b-256: 042760fa5759587e0458b2d5b2db2bd9bccc77757e90fa69fe97d1aa639f4be8


File details

Details for the file vllm-0.2.7-cp38-cp38-manylinux1_x86_64.whl.

File hashes

Hashes for vllm-0.2.7-cp38-cp38-manylinux1_x86_64.whl:
  SHA256: e8bb37a5f4435941e67bcc9bbe4048048af67786fa2a2a2ae9c117b52c36d36d
  MD5: 93af007d2b4dca1c97a12b1e6edfbf78
  BLAKE2b-256: 7d4dfefcc2891f56aa98fe17d896ff1016d7eaed7f4b0bfa9263d21f532bb144

