A high-throughput and memory-efficient inference and serving engine for LLMs

Project description

vLLM

Easy, fast, and cheap LLM serving for everyone

| Documentation | Blog | Paper | Discord |


Latest News 🔥

  • [2023/10] We hosted the first vLLM meetup in SF! The meetup slides are available online.
  • [2023/09] We created our Discord server! Join us to discuss vLLM and LLM serving! We will also post the latest announcements and updates there.
  • [2023/09] We released our PagedAttention paper on arXiv!
  • [2023/08] We would like to express our sincere gratitude to Andreessen Horowitz (a16z) for providing a generous grant to support the open-source development and research of vLLM.
  • [2023/07] Added support for LLaMA-2! You can run and serve 7B/13B/70B LLaMA-2s on vLLM with a single command!
  • [2023/06] Serving vLLM on any cloud with SkyPilot. Check out a 1-click example to start the vLLM demo, and the blog post for the story behind vLLM development in the cloud.
  • [2023/06] We officially released vLLM! FastChat-vLLM integration has powered LMSYS Vicuna and Chatbot Arena since mid-April. Check out our blog post.

vLLM is a fast and easy-to-use library for LLM inference and serving.

vLLM is fast with:

  • State-of-the-art serving throughput
  • Efficient management of attention key and value memory with PagedAttention
  • Continuous batching of incoming requests
  • Optimized CUDA kernels

vLLM is flexible and easy to use with:

  • Seamless integration with popular Hugging Face models
  • High-throughput serving with various decoding algorithms, including parallel sampling, beam search, and more (see the sketch after this list)
  • Tensor parallelism support for distributed inference
  • Streaming outputs
  • OpenAI-compatible API server
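
For concreteness, here is a rough sketch of what offline inference looks like with the Python API at the time of this release; the model name and parameter values are placeholders, so consult the documentation for the authoritative details:

from vllm import LLM, SamplingParams

# Parallel sampling: request n completions per prompt. Beam search is
# also available by setting use_beam_search=True on SamplingParams.
sampling_params = SamplingParams(temperature=0.8, top_p=0.95, n=2, max_tokens=64)

# Load any supported Hugging Face model by name. Setting
# tensor_parallel_size > 1 shards the model across GPUs for
# distributed inference.
llm = LLM(model="facebook/opt-125m", tensor_parallel_size=1)

prompts = ["The capital of France is", "The future of AI is"]
outputs = llm.generate(prompts, sampling_params)

for output in outputs:
    for completion in output.outputs:
        print(f"{output.prompt!r} -> {completion.text!r}")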

vLLM seamlessly supports many Hugging Face models, including the following architectures:

  • Aquila & Aquila2 (BAAI/AquilaChat2-7B, BAAI/AquilaChat2-34B, BAAI/Aquila-7B, BAAI/AquilaChat-7B, etc.)
  • Baichuan (baichuan-inc/Baichuan-7B, baichuan-inc/Baichuan-13B-Chat, etc.)
  • BLOOM (bigscience/bloom, bigscience/bloomz, etc.)
  • ChatGLM (THUDM/chatglm2-6b, THUDM/chatglm3-6b, etc.)
  • Falcon (tiiuae/falcon-7b, tiiuae/falcon-40b, tiiuae/falcon-rw-7b, etc.)
  • GPT-2 (gpt2, gpt2-xl, etc.)
  • GPT BigCode (bigcode/starcoder, bigcode/gpt_bigcode-santacoder, etc.)
  • GPT-J (EleutherAI/gpt-j-6b, nomic-ai/gpt4all-j, etc.)
  • GPT-NeoX (EleutherAI/gpt-neox-20b, databricks/dolly-v2-12b, stabilityai/stablelm-tuned-alpha-7b, etc.)
  • InternLM (internlm/internlm-7b, internlm/internlm-chat-7b, etc.)
  • LLaMA & LLaMA-2 (meta-llama/Llama-2-70b-hf, lmsys/vicuna-13b-v1.3, young-geng/koala, openlm-research/open_llama_13b, etc.)
  • Mistral (mistralai/Mistral-7B-v0.1, mistralai/Mistral-7B-Instruct-v0.1, etc.)
  • MPT (mosaicml/mpt-7b, mosaicml/mpt-30b, etc.)
  • OPT (facebook/opt-66b, facebook/opt-iml-max-30b, etc.)
  • Phi-1.5 (microsoft/phi-1_5, etc.)
  • Qwen (Qwen/Qwen-7B, Qwen/Qwen-7B-Chat, etc.)
  • Yi (01-ai/Yi-6B, 01-ai/Yi-34B, etc.)

Install vLLM with pip or from source:

pip install vllm
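
Once installed, the OpenAI-compatible server mentioned above can be launched with the entrypoint this release shipped; the model name below is only a placeholder, and the server listens on port 8000 by default:

python -m vllm.entrypoints.openai.api_server --model facebook/opt-125m

The running server can then be queried with any OpenAI-style client, for example:

curl http://localhost:8000/v1/completions \
    -H "Content-Type: application/json" \
    -d '{"model": "facebook/opt-125m", "prompt": "San Francisco is a", "max_tokens": 16}'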

Getting Started

Visit our documentation to get started.

Contributing

We welcome and value any contributions and collaborations. Please check out CONTRIBUTING.md for how to get involved.

Citation

If you use vLLM for your research, please cite our paper:

@inproceedings{kwon2023efficient,
  title={Efficient Memory Management for Large Language Model Serving with PagedAttention},
  author={Woosuk Kwon and Zhuohan Li and Siyuan Zhuang and Ying Sheng and Lianmin Zheng and Cody Hao Yu and Joseph E. Gonzalez and Hao Zhang and Ion Stoica},
  booktitle={Proceedings of the ACM SIGOPS 29th Symposium on Operating Systems Principles},
  year={2023}
}

Download files

Download the file for your platform. If you're not sure which to choose, see the Python Packaging User Guide to learn more about installing packages.

Source Distribution

vllm-0.2.2.tar.gz (144.9 kB)

Uploaded: Source

Built Distributions

vllm-0.2.2-cp311-cp311-manylinux1_x86_64.whl (29.1 MB)

Uploaded: CPython 3.11

vllm-0.2.2-cp310-cp310-manylinux1_x86_64.whl (29.0 MB)

Uploaded: CPython 3.10

vllm-0.2.2-cp39-cp39-manylinux1_x86_64.whl (29.0 MB)

Uploaded: CPython 3.9

vllm-0.2.2-cp38-cp38-manylinux1_x86_64.whl (29.0 MB)

Uploaded: CPython 3.8

File details

Details for the file vllm-0.2.2.tar.gz.

File metadata

  • Download URL: vllm-0.2.2.tar.gz
  • Upload date:
  • Size: 144.9 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/4.0.2 CPython/3.9.18

File hashes

Hashes for vllm-0.2.2.tar.gz
Algorithm    Hash digest
SHA256       d047bec3c28a93325e8c7f3bc4db01aff20e64c77d56fbb53a1542cde34c6fdc
MD5          a48b58dbdef442f7820971e48040da66
BLAKE2b-256  05a0a01420bb4e0fa6d00d94a2af7ae16c5b23ea98dc541ad620132db60005c5

See the pip documentation for more details on using hashes.
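
As a quick sketch of how these digests can be used in practice, the downloaded sdist can be checked locally with standard tools, or pip can enforce the pin through a requirements file (the requirements.txt shown is hypothetical):

# Verify the downloaded archive against the SHA256 digest above:
sha256sum vllm-0.2.2.tar.gz

# Or pin the hash in requirements.txt and let pip enforce it:
#   vllm==0.2.2 --hash=sha256:d047bec3c28a93325e8c7f3bc4db01aff20e64c77d56fbb53a1542cde34c6fdc
pip install --require-hashes -r requirements.txt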

File details

Details for the file vllm-0.2.2-cp311-cp311-manylinux1_x86_64.whl.

File hashes

Hashes for vllm-0.2.2-cp311-cp311-manylinux1_x86_64.whl
Algorithm    Hash digest
SHA256       025e34c57f86c5ed94d2d10566040d6bb468f7a769f71035c10e94e9b578079e
MD5          0de7633609610eecf3b9a3820f614101
BLAKE2b-256  e1309f5d2bf9e37c7cee94f59d4400f396ba9015decad9cfd12ed452649f9de4

File details

Details for the file vllm-0.2.2-cp310-cp310-manylinux1_x86_64.whl.

File hashes

Hashes for vllm-0.2.2-cp310-cp310-manylinux1_x86_64.whl
Algorithm    Hash digest
SHA256       2dec13a2e8cdd319d9279d33a211bbb67aba80623362e1e7bb42f60b797731af
MD5          2d8b86d91750b2e2eb418684f310d4c6
BLAKE2b-256  c37ccba2121976f8914691bfd345961a5bd6a0467a0fa4efae1c3468d590d39c

File details

Details for the file vllm-0.2.2-cp39-cp39-manylinux1_x86_64.whl.

File hashes

Hashes for vllm-0.2.2-cp39-cp39-manylinux1_x86_64.whl
Algorithm    Hash digest
SHA256       6bbc27a56eee597b60a630867c0381f4bbefd40d2f4b32c1e968d9652df140b9
MD5          df4c4f5a3cfc342d9fdb85db931c8c9e
BLAKE2b-256  c0ef7d057799186cf1f73baae96badf4b66a3f660c7f68a5fe6247b0d0055304

File details

Details for the file vllm-0.2.2-cp38-cp38-manylinux1_x86_64.whl.

File hashes

Hashes for vllm-0.2.2-cp38-cp38-manylinux1_x86_64.whl
Algorithm    Hash digest
SHA256       2dac3b3dae522452c77f264a85e726a2752e968fe21e970d91700355edfe2f7d
MD5          65f3d5c1d243eade8609965d60d9c37b
BLAKE2b-256  ada539d0b6de76ef25cd4b10bddd4c093179278b49924e91714753ca0096a700
