A high-throughput and memory-efficient inference and serving engine for LLMs

vLLM

Easy, fast, and cheap LLM serving for everyone

| Documentation | Blog | Discussions |


Latest News 🔥

  • [2023/07] Added support for LLaMA-2! You can run and serve the 7B/13B/70B LLaMA-2 models on vLLM with a single command!
  • [2023/06] Serving vLLM on any cloud with SkyPilot. Check out the 1-click example to start the vLLM demo, and the blog post for the story behind vLLM development in the cloud.
  • [2023/06] We officially released vLLM! FastChat-vLLM integration has powered LMSYS Vicuna and Chatbot Arena since mid-April. Check out our blog post.

vLLM is a fast and easy-to-use library for LLM inference and serving.

vLLM is fast with:

  • State-of-the-art serving throughput
  • Efficient management of attention key and value memory with PagedAttention (see the sketch after this list)
  • Continuous batching of incoming requests
  • Optimized CUDA kernels
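
PagedAttention borrows the idea of paging from operating systems: each sequence's KV cache is stored in fixed-size blocks, and a per-sequence block table maps logical block positions to physical blocks, so memory is allocated on demand instead of being reserved contiguously up front. The Python sketch below illustrates only that bookkeeping; the class names and block size are hypothetical, and vLLM's real implementation lives in optimized CUDA kernels.

BLOCK_SIZE = 16  # tokens per KV-cache block (hypothetical value)

class KVCacheAllocator:
    """Owns the shared pool of physical KV-cache blocks."""
    def __init__(self, num_physical_blocks):
        self.free_blocks = list(range(num_physical_blocks))

    def allocate(self):
        return self.free_blocks.pop()

    def free(self, block):
        self.free_blocks.append(block)

class Sequence:
    """Tracks one request's logical-to-physical block mapping."""
    def __init__(self, allocator):
        self.allocator = allocator
        self.block_table = []  # logical block index -> physical block id
        self.num_tokens = 0

    def append_token(self):
        # A new physical block is taken only when the last one is full,
        # so at most BLOCK_SIZE - 1 slots per sequence are ever wasted.
        if self.num_tokens % BLOCK_SIZE == 0:
            self.block_table.append(self.allocator.allocate())
        self.num_tokens += 1

    def release(self):
        # A finished request returns its blocks to the shared pool,
        # freeing room for queued requests immediately.
        for block in self.block_table:
            self.allocator.free(block)
        self.block_table.clear()

allocator = KVCacheAllocator(num_physical_blocks=1024)
seq = Sequence(allocator)
for _ in range(40):       # generate 40 tokens
    seq.append_token()
print(seq.block_table)    # 3 blocks cover 40 tokens at 16 tokens/block
seq.release()

Because blocks are uniform and pooled, sequences can grow one block at a time and short and long requests can share the same GPU memory without fragmentation.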

vLLM is flexible and easy to use with:

  • Seamless integration with popular HuggingFace models
  • High-throughput serving with various decoding algorithms, including parallel sampling, beam search, and more
  • Tensor parallelism support for distributed inference
  • Streaming outputs
  • OpenAI-compatible API server
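
As a quick illustration of the last point, the API server ships as a module entrypoint and speaks the OpenAI completions protocol, so the stock openai Python client (its pre-1.0 API) can talk to it. This is a minimal sketch under a few assumptions: the model name is a placeholder, and the server is assumed to be running on its default local host and port.

# Start the server first, in a shell:
#   python -m vllm.entrypoints.openai.api_server --model facebook/opt-125m
import openai

openai.api_key = "EMPTY"                      # no real key needed for the local server
openai.api_base = "http://localhost:8000/v1"  # assumed default host/port

completion = openai.Completion.create(
    model="facebook/opt-125m",                # must match the served model
    prompt="San Francisco is a",
    max_tokens=16,
    temperature=0.0,
)
print(completion.choices[0].text)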

vLLM seamlessly supports many HuggingFace models, including the following architectures:

  • Aquila (BAAI/Aquila-7B, BAAI/AquilaChat-7B, etc.)
  • Baichuan (baichuan-inc/Baichuan-7B, baichuan-inc/Baichuan-13B-Chat, etc.)
  • BLOOM (bigscience/bloom, bigscience/bloomz, etc.)
  • Falcon (tiiuae/falcon-7b, tiiuae/falcon-40b, tiiuae/falcon-rw-7b, etc.)
  • GPT-2 (gpt2, gpt2-xl, etc.)
  • GPT BigCode (bigcode/starcoder, bigcode/gpt_bigcode-santacoder, etc.)
  • GPT-J (EleutherAI/gpt-j-6b, nomic-ai/gpt4all-j, etc.)
  • GPT-NeoX (EleutherAI/gpt-neox-20b, databricks/dolly-v2-12b, stabilityai/stablelm-tuned-alpha-7b, etc.)
  • InternLM (internlm/internlm-7b, internlm/internlm-chat-7b, etc.)
  • LLaMA & LLaMA-2 (meta-llama/Llama-2-70b-hf, lmsys/vicuna-13b-v1.3, young-geng/koala, openlm-research/open_llama_13b, etc.)
  • MPT (mosaicml/mpt-7b, mosaicml/mpt-30b, etc.)
  • OPT (facebook/opt-66b, facebook/opt-iml-max-30b, etc.)
  • Qwen (Qwen/Qwen-7B, Qwen/Qwen-7B-Chat, etc.)

Install vLLM with pip or from source:

pip install vllm

Getting Started

Visit our documentation to get started.
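
For offline batched inference, the core entry point is the LLM class together with SamplingParams. The following minimal example follows the project's quickstart; the model and sampling values are just placeholders.

from vllm import LLM, SamplingParams

prompts = [
    "Hello, my name is",
    "The capital of France is",
]
sampling_params = SamplingParams(temperature=0.8, top_p=0.95)

llm = LLM(model="facebook/opt-125m")  # any supported HuggingFace model
outputs = llm.generate(prompts, sampling_params)

for output in outputs:
    # Each result carries the prompt and one or more generated completions.
    print(output.prompt, "->", output.outputs[0].text)

Under the hood, generate schedules all prompts through the engine's continuous-batching loop rather than running them one at a time, which is one source of the throughput gains described below.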

Performance

vLLM outperforms HuggingFace Transformers (HF) by up to 24x and Text Generation Inference (TGI) by up to 3.5x in terms of throughput. For details, check out our blog post.


Figure: Serving throughput when each request asks for 1 output completion.

Figure: Serving throughput when each request asks for 3 output completions.

Contributing

We welcome and value any contributions and collaborations. Please check out CONTRIBUTING.md for how to get involved.

Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

  • vllm-0.1.4.tar.gz (107.8 kB, Source)

Built Distributions

  • vllm-0.1.4-cp311-cp311-manylinux1_x86_64.whl (18.8 MB, CPython 3.11)
  • vllm-0.1.4-cp310-cp310-manylinux1_x86_64.whl (18.8 MB, CPython 3.10)
  • vllm-0.1.4-cp39-cp39-manylinux1_x86_64.whl (18.8 MB, CPython 3.9)
  • vllm-0.1.4-cp38-cp38-manylinux1_x86_64.whl (18.8 MB, CPython 3.8)

File details

Details for the file vllm-0.1.4.tar.gz.

File metadata

  • Download URL: vllm-0.1.4.tar.gz
  • Upload date:
  • Size: 107.8 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/4.0.2 CPython/3.9.17

File hashes

  • SHA256: 98f75c602545e9faad771d440d2d0e0019f38f312ca98e75b8cc17b7968d2ac2
  • MD5: 80be4986bae4e95bba3ab505ce1302df
  • BLAKE2b-256: e40df3ddb7f8ff73fa1eab4043d9d3feee74ea2d572e6ca16785892696574b9f

File details

Details for the file vllm-0.1.4-cp311-cp311-manylinux1_x86_64.whl.

File hashes

  • SHA256: 2afea825f4e6b3cef00e35c51762f3f7c1aa461e20e3af89f7ffd99d69816f08
  • MD5: 6d5c738bbaf701dfd790ab6bda81c99e
  • BLAKE2b-256: da2a62decf006f7ed170a1c8f4bc2a1b1f114af743ef2e9eb60874b34921ea76

File details

Details for the file vllm-0.1.4-cp310-cp310-manylinux1_x86_64.whl.

File hashes

  • SHA256: fc2c7a3f6b57c0c3dcc20b3be25f8df0ad5a34251ffef60248f58260a823bdab
  • MD5: 3650053c22cc98da803a70b49b4c9d7e
  • BLAKE2b-256: 5125169816837b6bd3a0657e55ac35196add9376f6c8f059e4d9bf080ef8283d

File details

Details for the file vllm-0.1.4-cp39-cp39-manylinux1_x86_64.whl.

File hashes

  • SHA256: 96d0d9e31caa0174095f85a28de38fdb594fdfbe928b902664c997110ab90979
  • MD5: 930a2e4ea7a2b6c3a1bddafd828bc8b1
  • BLAKE2b-256: 7c564ba812d6428aed6038f1af4a9dd356a4d46762a76b6ffe0a9f1b0e273d26

File details

Details for the file vllm-0.1.4-cp38-cp38-manylinux1_x86_64.whl.

File hashes

  • SHA256: ca66e0af62450dac1d35981c4c91494b6434452ca43215871710d0914727bd40
  • MD5: c5db894199a6a56b14d5b4162347d3b1
  • BLAKE2b-256: e6deb4baee3476de5df8e5f779da03811aacd1fdb962c73d9fb5cd0dc09b4bb5
