A high-throughput and memory-efficient inference and serving engine for LLMs
vLLM: Easy, Fast, and Cheap LLM Serving for Everyone
| Documentation | Blog |
vLLM is a fast and easy-to-use library for LLM inference and serving.
Latest News 🔥
- [2023/06] We officially released vLLM! vLLM has powered LMSYS Vicuna and Chatbot Arena since mid-April. Check out our blog post.
Getting Started
Visit our documentation to get started.
- Installation: pip install vllm
- Quickstart
- Supported Models
Key Features
vLLM's key features include:
- State-of-the-art performance in serving throughput
- Efficient management of attention key and value memory with PagedAttention
- Seamless integration with popular HuggingFace models
- Dynamic batching of incoming requests
- Optimized CUDA kernels
- High-throughput serving with various decoding algorithms, including parallel sampling and beam search
- Tensor parallelism support for distributed inference
- Streaming outputs
- OpenAI-compatible API server
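To make the PagedAttention bullet above more concrete, here is a minimal sketch of the underlying idea: key/value memory is split into fixed-size blocks, and each sequence keeps a block table that maps its logical token positions to physical blocks, so memory is claimed one block at a time rather than reserved up front. This is a conceptual illustration only, not vLLM's implementation. The names `BlockAllocator`, `PagedKVCache`, `Sequence`, and `BLOCK_SIZE` are hypothetical and chosen for this example.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple

# Tokens per KV block. Hypothetical value, kept small so the example is easy to trace.
BLOCK_SIZE = 4


class BlockAllocator:
    """Hands out fixed-size physical blocks from a shared pool (illustrative only)."""

    def __init__(self, num_blocks: int) -> None:
        self.free_blocks: List[int] = list(range(num_blocks))

    def allocate(self) -> int:
        if not self.free_blocks:
            raise MemoryError("KV cache pool exhausted")
        return self.free_blocks.pop()

    def free(self, block_id: int) -> None:
        self.free_blocks.append(block_id)


@dataclass
class Sequence:
    """Tracks how many tokens a sequence holds and which blocks store them."""

    seq_id: int
    num_tokens: int = 0
    block_table: List[int] = field(default_factory=list)


class PagedKVCache:
    """Maps each sequence's logical token positions to physical block slots."""

    def __init__(self, num_blocks: int) -> None:
        self.allocator = BlockAllocator(num_blocks)
        self.sequences: Dict[int, Sequence] = {}

    def add_sequence(self, seq_id: int) -> None:
        self.sequences[seq_id] = Sequence(seq_id)

    def append_token(self, seq_id: int) -> Tuple[int, int]:
        """Reserve a slot for one more token; return (physical_block, offset)."""
        seq = self.sequences[seq_id]
        offset = seq.num_tokens % BLOCK_SIZE
        if offset == 0:
            # No block yet, or the last block is full: claim a new one on demand.
            seq.block_table.append(self.allocator.allocate())
        seq.num_tokens += 1
        return seq.block_table[-1], offset

    def release(self, seq_id: int) -> None:
        """Return all of a finished sequence's blocks to the pool."""
        seq = self.sequences.pop(seq_id)
        for block_id in seq.block_table:
            self.allocator.free(block_id)


# Usage: a 5-token sequence needs two 4-token blocks, not a max-length reservation.
cache = PagedKVCache(num_blocks=8)
cache.add_sequence(0)
slots = [cache.append_token(0) for _ in range(5)]
blocks_in_use = len(cache.sequences[0].block_table)
cache.release(0)
```

In this sketch the physical blocks of a sequence need not be contiguous, and at most one partially filled block per sequence is wasted. A real engine would also handle sharing blocks across parallel samples and eviction, which are left out here.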
Performance
vLLM achieves up to 24x higher throughput than HuggingFace Transformers (HF) and up to 3.5x higher throughput than Text Generation Inference (TGI). For details, check out our blog post.
Serving throughput when each request asks for 1 output completion.
Serving throughput when each request asks for 3 output completions.
Contributing
We welcome and value any contributions and collaborations. Please check out CONTRIBUTING.md for how to get involved.
Download files
Source Distribution
File details
Details for the file vllm-0.0.1.tar.gz.
File metadata
- Download URL: vllm-0.0.1.tar.gz
- Upload date:
- Size: 82.4 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/4.0.2 CPython/3.8.17
File hashes
Algorithm | Hash digest
---|---
SHA256 | 1a23f23a01bbec98b1a632b885507241616ebcf966ad486c33cbfad48fa590fe
MD5 | e8219e9afd3cabd62593e83a0e227552
BLAKE2b-256 | 126b8e55a4fbd4e9c3caee2041f4dc732d3bf66964aebff53155e012f5bf7815