A high-throughput and memory-efficient inference and serving engine for LLMs

Project description

vLLM

Easy, fast, and cheap LLM serving for everyone

| Documentation | Blog | Paper | Discord |

Latest News 🔥

  • [2024/04] We hosted the third vLLM meetup with Roblox! Please find the meetup slides here.
  • [2024/01] We hosted the second vLLM meetup in SF! Please find the meetup slides here.
  • [2024/01] Added ROCm 6.0 support to vLLM.
  • [2023/12] Added ROCm 5.7 support to vLLM.
  • [2023/10] We hosted the first vLLM meetup in SF! Please find the meetup slides here.
  • [2023/09] We created our Discord server! Join us to discuss vLLM and LLM serving! We will also post the latest announcements and updates there.
  • [2023/09] We released our PagedAttention paper on arXiv!
  • [2023/08] We would like to express our sincere gratitude to Andreessen Horowitz (a16z) for providing a generous grant to support the open-source development and research of vLLM.
  • [2023/07] Added support for LLaMA-2! You can run and serve 7B/13B/70B LLaMA-2s on vLLM with a single command!
  • [2023/06] Serving vLLM On any Cloud with SkyPilot. Check out a 1-click example to start the vLLM demo, and the blog post for the story behind vLLM development on the clouds.
  • [2023/06] We officially released vLLM! FastChat-vLLM integration has powered LMSYS Vicuna and Chatbot Arena since mid-April. Check out our blog post.

About

vLLM is a fast and easy-to-use library for LLM inference and serving.

vLLM is fast with:

  • State-of-the-art serving throughput
  • Efficient management of attention key and value memory with PagedAttention
  • Continuous batching of incoming requests
  • Fast model execution with CUDA/HIP graph
  • Quantization: GPTQ, AWQ, SqueezeLLM, FP8 KV Cache
  • Optimized CUDA kernels

vLLM is flexible and easy to use with:

  • Seamless integration with popular Hugging Face models (see the usage sketch after this list)
  • High-throughput serving with various decoding algorithms, including parallel sampling, beam search, and more
  • Tensor parallelism support for distributed inference
  • Streaming outputs
  • OpenAI-compatible API server
  • Support for NVIDIA GPUs and AMD GPUs
  • (Experimental) Prefix caching support
  • (Experimental) Multi-LoRA support
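
For illustration, here is a minimal offline-inference sketch using vLLM's documented LLM and SamplingParams Python API (the model name and sampling values are placeholders; any supported Hugging Face model works):

from vllm import LLM, SamplingParams

# Load any supported Hugging Face model; gpt2 is chosen here only because it is small.
# For multi-GPU inference, pass e.g. tensor_parallel_size=2.
llm = LLM(model="gpt2")

# Illustrative sampling settings; beam search and parallel sampling are also available.
sampling_params = SamplingParams(temperature=0.8, top_p=0.95, max_tokens=64)

# Requests are continuously batched by the engine.
prompts = [
    "The capital of France is",
    "The future of AI is",
]
outputs = llm.generate(prompts, sampling_params)
for output in outputs:
    print(output.prompt, "->", output.outputs[0].text)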

vLLM seamlessly supports many Hugging Face models, including the following architectures:

  • Aquila & Aquila2 (BAAI/AquilaChat2-7B, BAAI/AquilaChat2-34B, BAAI/Aquila-7B, BAAI/AquilaChat-7B, etc.)
  • Baichuan & Baichuan2 (baichuan-inc/Baichuan2-13B-Chat, baichuan-inc/Baichuan-7B, etc.)
  • BLOOM (bigscience/bloom, bigscience/bloomz, etc.)
  • ChatGLM (THUDM/chatglm2-6b, THUDM/chatglm3-6b, etc.)
  • Command-R (CohereForAI/c4ai-command-r-v01, etc.)
  • DBRX (databricks/dbrx-base, databricks/dbrx-instruct etc.)
  • DeciLM (Deci/DeciLM-7B, Deci/DeciLM-7B-instruct, etc.)
  • Falcon (tiiuae/falcon-7b, tiiuae/falcon-40b, tiiuae/falcon-rw-7b, etc.)
  • Gemma (google/gemma-2b, google/gemma-7b, etc.)
  • GPT-2 (gpt2, gpt2-xl, etc.)
  • GPT BigCode (bigcode/starcoder, bigcode/gpt_bigcode-santacoder, etc.)
  • GPT-J (EleutherAI/gpt-j-6b, nomic-ai/gpt4all-j, etc.)
  • GPT-NeoX (EleutherAI/gpt-neox-20b, databricks/dolly-v2-12b, stabilityai/stablelm-tuned-alpha-7b, etc.)
  • InternLM (internlm/internlm-7b, internlm/internlm-chat-7b, etc.)
  • InternLM2 (internlm/internlm2-7b, internlm/internlm2-chat-7b, etc.)
  • Jais (core42/jais-13b, core42/jais-13b-chat, core42/jais-30b-v3, core42/jais-30b-chat-v3, etc.)
  • LLaMA, Llama 2, and Meta Llama 3 (meta-llama/Meta-Llama-3-8B-Instruct, meta-llama/Meta-Llama-3-70B-Instruct, meta-llama/Llama-2-70b-hf, lmsys/vicuna-13b-v1.3, young-geng/koala, openlm-research/open_llama_13b, etc.)
  • MiniCPM (openbmb/MiniCPM-2B-sft-bf16, openbmb/MiniCPM-2B-dpo-bf16, etc.)
  • Mistral (mistralai/Mistral-7B-v0.1, mistralai/Mistral-7B-Instruct-v0.1, etc.)
  • Mixtral (mistralai/Mixtral-8x7B-v0.1, mistralai/Mixtral-8x7B-Instruct-v0.1, mistral-community/Mixtral-8x22B-v0.1, etc.)
  • MPT (mosaicml/mpt-7b, mosaicml/mpt-30b, etc.)
  • OLMo (allenai/OLMo-1B-hf, allenai/OLMo-7B-hf, etc.)
  • OPT (facebook/opt-66b, facebook/opt-iml-max-30b, etc.)
  • Orion (OrionStarAI/Orion-14B-Base, OrionStarAI/Orion-14B-Chat, etc.)
  • Phi (microsoft/phi-1_5, microsoft/phi-2, etc.)
  • Phi-3 (microsoft/Phi-3-mini-4k-instruct, microsoft/Phi-3-mini-128k-instruct, etc.)
  • Qwen (Qwen/Qwen-7B, Qwen/Qwen-7B-Chat, etc.)
  • Qwen2 (Qwen/Qwen1.5-7B, Qwen/Qwen1.5-7B-Chat, etc.)
  • Qwen2MoE (Qwen/Qwen1.5-MoE-A2.7B, Qwen/Qwen1.5-MoE-A2.7B-Chat, etc.)
  • StableLM (stabilityai/stablelm-3b-4e1t, stabilityai/stablelm-base-alpha-7b-v2, etc.)
  • Starcoder2 (bigcode/starcoder2-3b, bigcode/starcoder2-7b, bigcode/starcoder2-15b, etc.)
  • Xverse (xverse/XVERSE-7B-Chat, xverse/XVERSE-13B-Chat, xverse/XVERSE-65B-Chat, etc.)
  • Yi (01-ai/Yi-6B, 01-ai/Yi-34B, etc.)

Install vLLM with pip or from source:

pip install vllm
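
To install from source instead, a sketch of the usual editable-install flow (a CUDA-ready build environment is assumed; see the documentation for full prerequisites):

git clone https://github.com/vllm-project/vllm.git
cd vllm
pip install -e .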

Getting Started

Visit our documentation to get started.
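
As a quick example, once the OpenAI-compatible server is running (launched in this release via python -m vllm.entrypoints.openai.api_server --model gpt2, with gpt2 as a placeholder model), it can be queried with the standard openai Python client. A minimal sketch, assuming the openai package is installed:

# Launch the server first, in a separate shell:
#   python -m vllm.entrypoints.openai.api_server --model gpt2
from openai import OpenAI

# vLLM's server speaks the OpenAI API; point the client at the local endpoint.
# The server does not check API keys unless configured to, so a placeholder works.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

completion = client.completions.create(
    model="gpt2",
    prompt="San Francisco is a",
    max_tokens=16,
)
print(completion.choices[0].text)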

Contributing

We welcome and value any contributions and collaborations. Please check out CONTRIBUTING.md for how to get involved.

Citation

If you use vLLM for your research, please cite our paper:

@inproceedings{kwon2023efficient,
  title={Efficient Memory Management for Large Language Model Serving with PagedAttention},
  author={Woosuk Kwon and Zhuohan Li and Siyuan Zhuang and Ying Sheng and Lianmin Zheng and Cody Hao Yu and Joseph E. Gonzalez and Hao Zhang and Ion Stoica},
  booktitle={Proceedings of the ACM SIGOPS 29th Symposium on Operating Systems Principles},
  year={2023}
}

Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

vllm-0.4.2.tar.gz (588.8 kB)

Uploaded Source

Built Distributions

vllm-0.4.2-cp311-cp311-manylinux1_x86_64.whl (67.7 MB)

Uploaded CPython 3.11

vllm-0.4.2-cp310-cp310-manylinux1_x86_64.whl (67.7 MB)

Uploaded CPython 3.10

vllm-0.4.2-cp39-cp39-manylinux1_x86_64.whl (67.7 MB)

Uploaded CPython 3.9

vllm-0.4.2-cp38-cp38-manylinux1_x86_64.whl (67.7 MB)

Uploaded CPython 3.8

File details

Details for the file vllm-0.4.2.tar.gz.

File metadata

  • Download URL: vllm-0.4.2.tar.gz
  • Upload date:
  • Size: 588.8 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/5.0.0 CPython/3.10.12

File hashes

Hashes for vllm-0.4.2.tar.gz
Algorithm Hash digest
SHA256 0b857f3084b507cbdd3bfcbaae19d171c55df9955eb3ac41c9c711768e852772
MD5 e4aa667215e559390f8a710c51ea3279
BLAKE2b-256 5286c493b975f36d48939cc575d22813000d2c60d519283d1c61ff744fb440e8

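As a quick way to use these digests, a downloaded file can be verified locally. A minimal sketch, assuming the sdist was downloaded to the current directory:

import hashlib

# Compare the local file's SHA256 digest against the value published above.
expected = "0b857f3084b507cbdd3bfcbaae19d171c55df9955eb3ac41c9c711768e852772"
with open("vllm-0.4.2.tar.gz", "rb") as f:
    digest = hashlib.sha256(f.read()).hexdigest()
assert digest == expected, "hash mismatch: the download may be corrupt or tampered with"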

File details

Details for the file vllm-0.4.2-cp311-cp311-manylinux1_x86_64.whl.

File metadata

File hashes

Hashes for vllm-0.4.2-cp311-cp311-manylinux1_x86_64.whl
Algorithm Hash digest
SHA256 94afce6af9a2adb1e72081a245b116325073a9266e88044169fbaf6707d847c0
MD5 1a189a7a022afcc65361eb913af07e8a
BLAKE2b-256 f9e778a1565187c00da962321ed32ed3d08015e513c9d018c7135f95cb7c21e7

File details

Details for the file vllm-0.4.2-cp310-cp310-manylinux1_x86_64.whl.

File metadata

File hashes

Hashes for vllm-0.4.2-cp310-cp310-manylinux1_x86_64.whl
Algorithm Hash digest
SHA256 26ad388c1c6ec0348c19495318c1e10911d5fd4c67aa53c868c7ee773ecf1163
MD5 9f866da9dac7b2808c008529d2d564b4
BLAKE2b-256 b9a4a54bf01bcb5949f86e15a47c5059ae098c1488ae56d8fa89b06b803897b6

File details

Details for the file vllm-0.4.2-cp39-cp39-manylinux1_x86_64.whl.

File metadata

File hashes

Hashes for vllm-0.4.2-cp39-cp39-manylinux1_x86_64.whl
Algorithm Hash digest
SHA256 0fc6220591f40b6f715bd040a293b9e0732fdad100bd25c5a02eb4b461680223
MD5 d2a773d360aa5932fd3c05a8c86d2279
BLAKE2b-256 0b07be4524d4d98c6006c895f6996f9b4cc3fade45e879ed43a99634c32be766

File details

Details for the file vllm-0.4.2-cp38-cp38-manylinux1_x86_64.whl.

File metadata

File hashes

Hashes for vllm-0.4.2-cp38-cp38-manylinux1_x86_64.whl
Algorithm Hash digest
SHA256 27b395dff451c10e2ce699980a047c52d7aa8737b59fbd8e333e824c94401a16
MD5 93b560e8f9368b3049e352e5136bd58b
BLAKE2b-256 e9f07ab85e23c37bbe76042caea0f2004e4fa4f57203b42ad2184af9e2f7967c
