A high-throughput and memory-efficient inference and serving engine for LLMs

vLLM: Easy, Fast, and Cheap LLM Serving for Everyone

| Documentation | Blog |

vLLM is a fast and easy-to-use library for LLM inference and serving.

Getting Started

Visit our documentation to get started.

Key Features

vLLM's key features include:

  • State-of-the-art performance in serving throughput
  • Efficient management of attention key and value memory with PagedAttention
  • Seamless integration with popular HuggingFace models
  • Dynamic batching of incoming requests
  • Optimized CUDA kernels
  • High-throughput serving with various decoding algorithms, including parallel sampling and beam search
  • Tensor parallelism support for distributed inference
  • Streaming outputs
  • OpenAI-compatible API server
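
To make the PagedAttention bullet above more concrete, here is a toy sketch of the general idea of paging the key/value cache: memory is split into fixed-size blocks, and each request keeps a table that maps its tokens to whichever physical blocks it was given. This is not vLLM's implementation. The class names, `BLOCK_SIZE`, and the request IDs are all hypothetical and chosen only for illustration, and the sketch tracks block bookkeeping only, not the actual tensors.

```python
from dataclasses import dataclass, field
from typing import Dict, List

BLOCK_SIZE = 4  # tokens per block (hypothetical value, kept small for illustration)


class BlockAllocator:
    """Hands out fixed-size physical blocks from a shared pool."""

    def __init__(self, num_blocks: int) -> None:
        self.free_blocks: List[int] = list(range(num_blocks))

    def allocate(self) -> int:
        if not self.free_blocks:
            raise MemoryError("no free KV-cache blocks")
        return self.free_blocks.pop()

    def free(self, block_id: int) -> None:
        self.free_blocks.append(block_id)


@dataclass
class Sequence:
    """Tracks one request's logical-to-physical block mapping."""
    num_tokens: int = 0
    block_table: List[int] = field(default_factory=list)


class PagedKVCache:
    """Toy bookkeeping for a paged key/value cache (no tensors involved)."""

    def __init__(self, num_blocks: int) -> None:
        self.allocator = BlockAllocator(num_blocks)
        self.sequences: Dict[str, Sequence] = {}

    def append_token(self, seq_id: str) -> None:
        seq = self.sequences.setdefault(seq_id, Sequence())
        # A new physical block is needed only when the last one is full,
        # so at most BLOCK_SIZE - 1 slots per sequence are ever wasted.
        if seq.num_tokens % BLOCK_SIZE == 0:
            seq.block_table.append(self.allocator.allocate())
        seq.num_tokens += 1

    def release(self, seq_id: str) -> None:
        # Finished requests return their blocks to the pool for reuse.
        seq = self.sequences.pop(seq_id)
        for block_id in seq.block_table:
            self.allocator.free(block_id)


cache = PagedKVCache(num_blocks=8)
for _ in range(6):
    cache.append_token("request-a")  # 6 tokens need 2 blocks
for _ in range(3):
    cache.append_token("request-b")  # 3 tokens need 1 block

blocks_a = len(cache.sequences["request-a"].block_table)
blocks_b = len(cache.sequences["request-b"].block_table)
cache.release("request-a")
free_after_release = len(cache.allocator.free_blocks)
```

Because blocks are allocated on demand and need not be contiguous, memory is not reserved up front for a request's maximum possible length, which leaves room to batch more requests together.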

Performance

In terms of throughput, vLLM outperforms HuggingFace Transformers (HF) by up to 24x and Text Generation Inference (TGI) by up to 3.5x. For details, see our blog post.


Serving throughput when each request asks for 1 output completion.


Serving throughput when each request asks for 3 output completions.
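
As a rough illustration of what "each request asks for 3 output completions" could look like from a client, the sketch below builds (but does not send) a request in the OpenAI completions format, which the OpenAI-compatible API server listed under Key Features is meant to accept. The server address `http://localhost:8000`, the `/v1/completions` path, the model name, and the helper `build_completion_request` are assumptions made for this example and are not taken from the text above.

```python
import json
import urllib.request


def build_completion_request(base_url, model, prompt, n=3, max_tokens=64):
    """Build a POST request asking for `n` completions of one prompt.

    Hypothetical helper: it only constructs the request object and
    performs no network I/O.
    """
    payload = {
        "model": model,
        "prompt": prompt,
        "n": n,                    # number of output completions per request
        "max_tokens": max_tokens,  # cap on generated tokens per completion
    }
    return urllib.request.Request(
        url=f"{base_url}/v1/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


# Assumed values: a locally running server and a small example model.
request = build_completion_request(
    "http://localhost:8000", "facebook/opt-125m", "San Francisco is a"
)
body = json.loads(request.data)
```

With a server actually running at that address, passing `request` to `urllib.request.urlopen` would send it; the response is then expected to contain one entry per requested completion.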

Contributing

We welcome and value any contributions and collaborations. Please check out CONTRIBUTING.md for how to get involved.
