A high-throughput and memory-efficient inference and serving engine for LLMs

Project description

vLLM

Easy, fast, and cheap LLM serving for everyone

| Documentation | Blog | Paper | Discord |


Latest News 🔥

  • [2023/12] Added ROCm support to vLLM.
  • [2023/10] We hosted the first vLLM meetup in SF! Please find the meetup slides here.
  • [2023/09] We created our Discord server! Join us to discuss vLLM and LLM serving! We will also post the latest announcements and updates there.
  • [2023/09] We released our PagedAttention paper on arXiv!
  • [2023/08] We would like to express our sincere gratitude to Andreessen Horowitz (a16z) for providing a generous grant to support the open-source development and research of vLLM.
  • [2023/07] Added support for LLaMA-2! You can run and serve 7B/13B/70B LLaMA-2s on vLLM with a single command!
  • [2023/06] Serving vLLM On any Cloud with SkyPilot. Check out a 1-click example to start the vLLM demo, and the blog post for the story behind vLLM development on the clouds.
  • [2023/06] We officially released vLLM! FastChat-vLLM integration has powered LMSYS Vicuna and Chatbot Arena since mid-April. Check out our blog post.

vLLM is a fast and easy-to-use library for LLM inference and serving.

vLLM is fast with:

  • State-of-the-art serving throughput
  • Efficient management of attention key and value memory with PagedAttention
  • Continuous batching of incoming requests
  • Optimized CUDA kernels

vLLM is flexible and easy to use with:

  • Seamless integration with popular Hugging Face models
  • High-throughput serving with various decoding algorithms, including parallel sampling, beam search, and more (see the sketch after this list)
  • Tensor parallelism support for distributed inference
  • Streaming outputs
  • OpenAI-compatible API server
  • Support for NVIDIA CUDA and AMD ROCm
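
As a quick illustration of the offline API, here is a minimal sketch of batched generation with parallel sampling. The model name and prompts are placeholders, and it assumes vLLM is installed and the model fits on your GPU:

from vllm import LLM, SamplingParams

# Placeholder prompts; any list of strings works.
prompts = ["Hello, my name is", "The capital of France is"]

# Parallel sampling: n=2 returns two completions per prompt.
# Other decoding modes (e.g. beam search) are selected via SamplingParams as well.
sampling_params = SamplingParams(n=2, temperature=0.8, top_p=0.95, max_tokens=64)

# Any supported Hugging Face model name can be passed here (see the list below).
llm = LLM(model="facebook/opt-125m")

outputs = llm.generate(prompts, sampling_params)
for output in outputs:
    for completion in output.outputs:
        print(output.prompt, completion.text)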

vLLM seamlessly supports many Hugging Face models, including the following architectures (a serving sketch follows the list):

  • Aquila & Aquila2 (BAAI/AquilaChat2-7B, BAAI/AquilaChat2-34B, BAAI/Aquila-7B, BAAI/AquilaChat-7B, etc.)
  • Baichuan & Baichuan2 (baichuan-inc/Baichuan2-13B-Chat, baichuan-inc/Baichuan-7B, etc.)
  • BLOOM (bigscience/bloom, bigscience/bloomz, etc.)
  • ChatGLM (THUDM/chatglm2-6b, THUDM/chatglm3-6b, etc.)
  • Falcon (tiiuae/falcon-7b, tiiuae/falcon-40b, tiiuae/falcon-rw-7b, etc.)
  • GPT-2 (gpt2, gpt2-xl, etc.)
  • GPT BigCode (bigcode/starcoder, bigcode/gpt_bigcode-santacoder, etc.)
  • GPT-J (EleutherAI/gpt-j-6b, nomic-ai/gpt4all-j, etc.)
  • GPT-NeoX (EleutherAI/gpt-neox-20b, databricks/dolly-v2-12b, stabilityai/stablelm-tuned-alpha-7b, etc.)
  • InternLM (internlm/internlm-7b, internlm/internlm-chat-7b, etc.)
  • LLaMA & LLaMA-2 (meta-llama/Llama-2-70b-hf, lmsys/vicuna-13b-v1.3, young-geng/koala, openlm-research/open_llama_13b, etc.)
  • Mistral (mistralai/Mistral-7B-v0.1, mistralai/Mistral-7B-Instruct-v0.1, etc.)
  • Mixtral (mistralai/Mixtral-8x7B-v0.1, mistralai/Mixtral-8x7B-Instruct-v0.1, etc.)
  • MPT (mosaicml/mpt-7b, mosaicml/mpt-30b, etc.)
  • OPT (facebook/opt-66b, facebook/opt-iml-max-30b, etc.)
  • Phi-1.5 (microsoft/phi-1_5, etc.)
  • Qwen (Qwen/Qwen-7B, Qwen/Qwen-7B-Chat, etc.)
  • Yi (01-ai/Yi-6B, 01-ai/Yi-34B, etc.)
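
For online serving, the OpenAI-compatible API server can be launched with any of the models above. The sketch below queries such a server over plain HTTP; the launch command and default port 8000 follow the vLLM documentation for this release, and the model name and prompt are placeholders:

# Launch the server first (from a shell), for example:
#   python -m vllm.entrypoints.openai.api_server --model mistralai/Mistral-7B-v0.1
import requests

response = requests.post(
    "http://localhost:8000/v1/completions",
    json={
        "model": "mistralai/Mistral-7B-v0.1",
        "prompt": "San Francisco is a",
        "max_tokens": 32,
        "temperature": 0.7,
    },
)
# The response follows the OpenAI completions format.
print(response.json()["choices"][0]["text"])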

Install vLLM with pip or from source:

pip install vllm
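
To build from source instead, the usual flow is to clone the GitHub repository (https://github.com/vllm-project/vllm) and run pip install -e . inside it; see the documentation for prerequisites such as a CUDA toolchain.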

Getting Started

Visit our documentation to get started.

Contributing

We welcome and value any contributions and collaborations. Please check out CONTRIBUTING.md for how to get involved.

Citation

If you use vLLM for your research, please cite our paper:

@inproceedings{kwon2023efficient,
  title={Efficient Memory Management for Large Language Model Serving with PagedAttention},
  author={Woosuk Kwon and Zhuohan Li and Siyuan Zhuang and Ying Sheng and Lianmin Zheng and Cody Hao Yu and Joseph E. Gonzalez and Hao Zhang and Ion Stoica},
  booktitle={Proceedings of the ACM SIGOPS 29th Symposium on Operating Systems Principles},
  year={2023}
}

Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

vllm-0.2.5.tar.gz (154.2 kB)

Uploaded Source

Built Distributions

vllm-0.2.5-cp311-cp311-manylinux1_x86_64.whl (9.8 MB)

Uploaded CPython 3.11

vllm-0.2.5-cp310-cp310-manylinux1_x86_64.whl (9.8 MB)

Uploaded CPython 3.10

vllm-0.2.5-cp39-cp39-manylinux1_x86_64.whl (9.8 MB)

Uploaded CPython 3.9

vllm-0.2.5-cp38-cp38-manylinux1_x86_64.whl (9.8 MB)

Uploaded CPython 3.8

File details

Details for the file vllm-0.2.5.tar.gz.

File metadata

  • Download URL: vllm-0.2.5.tar.gz
  • Size: 154.2 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/4.0.2 CPython/3.9.18

File hashes

Hashes for vllm-0.2.5.tar.gz:

  • SHA256: d78d1b68d1e581fd6992ea09d68d8019f86550086ad76d475769b741b96bc23a
  • MD5: 2623138089027edc2062f45176699864
  • BLAKE2b-256: 1da719668dec39263f633587446e6f8d18753315b5642b4aaf7ad7d5fa477cce

See more details on using hashes here.
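
To verify a downloaded file against the digest above, here is a minimal sketch; it assumes vllm-0.2.5.tar.gz is in the current directory, and the expected value is the SHA256 from the table:

import hashlib

# Expected SHA256 digest, copied from the hash table above.
expected = "d78d1b68d1e581fd6992ea09d68d8019f86550086ad76d475769b741b96bc23a"

with open("vllm-0.2.5.tar.gz", "rb") as f:
    digest = hashlib.sha256(f.read()).hexdigest()

print("OK" if digest == expected else "MISMATCH")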

File details

Details for the file vllm-0.2.5-cp311-cp311-manylinux1_x86_64.whl.

File hashes

Hashes for vllm-0.2.5-cp311-cp311-manylinux1_x86_64.whl:

  • SHA256: 68040c17d0c76e597dc83b41945e4ca7a7ff6afc100b311dca6f2d9db72a3947
  • MD5: c0571798a456c871f2fab61b461121b5
  • BLAKE2b-256: 80c6f97ff26b20129570692291f534dd22777aee2e611b9e1830561f994c0549

File details

Details for the file vllm-0.2.5-cp310-cp310-manylinux1_x86_64.whl.

File hashes

Hashes for vllm-0.2.5-cp310-cp310-manylinux1_x86_64.whl:

  • SHA256: 5470503f37d219280ad91f1a3dcd6391d06c4a5d319ca74d6a9a1f138550b1db
  • MD5: b0de43201f74612c7a4d58f91b7fbf29
  • BLAKE2b-256: 2a39eb463229e14824d90b42975c67fe294e00909dfde8c4b6169fd52251b5f7

File details

Details for the file vllm-0.2.5-cp39-cp39-manylinux1_x86_64.whl.

File hashes

Hashes for vllm-0.2.5-cp39-cp39-manylinux1_x86_64.whl:

  • SHA256: b455f7d0986884864e6da1a4f01022198a3daad756437475ef86edda41f13164
  • MD5: a2cbd4b6df236da813c19230ac4260b6
  • BLAKE2b-256: 5a6bba88c2e7c2cef60facdb68e6f5da157828bf11dd93d4be8385166710b6d6

File details

Details for the file vllm-0.2.5-cp38-cp38-manylinux1_x86_64.whl.

File hashes

Hashes for vllm-0.2.5-cp38-cp38-manylinux1_x86_64.whl:

  • SHA256: 603367c60c0c44c01963c3b8fd8c1aa6c9b18f08917712290ce86c1a4e7cd211
  • MD5: a4efe4941c08ccb80a7a28b833798ea6
  • BLAKE2b-256: 09d638bafd3e92ae2f51a72c7068300b38202dd54b72532f8662b76499718564
