🚀 vLLM.rs – A Minimalist vLLM in Rust
A blazing-fast ⚡, lightweight Rust 🦀 implementation of vLLM.
✨ Key Features
- 🔧 Pure Rust Backend – Absolutely no PyTorch required
- 🚀 High Performance (with context cache and PD disaggregation) – Outperforms its Python counterparts
- 🧠 Minimalist Core – Core logic written in **~ 2000 lines** of clean Rust
- 💻 Cross-Platform – Supports CUDA (Linux/Windows) and Metal (macOS)
- 🤖 Built-in Chatbot/API Server – Native Rust server for both CUDA and Metal
- 🐍 Lightweight Python Interface – PyO3-powered bindings for chat completion
- 🤝 Open for Contributions – PRs, issues, and stars are welcome!
Chat Performance
A100 (Single Card, 40G)
| Model | Format | Size | Decoding Speed |
|---|---|---|---|
| Llama-3.1-8B | ISQ (BF16->Q4K) | 8B | 90.19 tokens/s |
| DeepSeek-R1-Distill-Llama-8B | Q2_K | 8B | 94.47 tokens/s |
| DeepSeek-R1-0528-Qwen3-8B | Q4_K_M | 8B | 95 tokens/s |
| GLM-4-9B-0414 | Q4_K_M | 9B | 70.38 tokens/s |
| QwQ-32B | Q4_K_M | 32B | 35.69 tokens/s |
| Qwen3-30B-A3B | Q4_K_M | 30B (MoE) | 75.91 tokens/s |
Performance of vLLM.rs on Metal (Apple Silicon, M4)
Models: Qwen3-0.6B (BF16), Qwen3-4B (Q4_K_M), Qwen3-8B (Q2_K); Concurrent Requests: 1 - 128; Max Model Length: 512 - 2048; Max Output Tokens / Request: 512 - 2048;
| Model | Batch Size | Output Tokens | Time (s) | Throughput (tokens/s) |
|---|---|---|---|---|
| Qwen3-0.6B (BF16) | 128 | 63488 | 83.13 | 763.73 |
| Qwen3-0.6B (BF16) | 32 | 15872 | 23.53 | 674.43 |
| Qwen3-0.6B (BF16) | 1 | 456 | 9.23 | 49.42 |
| Qwen3-4B (Q4_K_M) | 1 | 1683 | 52.62 | 31.98 |
| Qwen3-8B (Q2_K) | 1 | 1300 | 80.88 | 16.07 |
Performance Comparison
Model: Qwen3-0.6B (BF16); Concurrent Requests: 256; Max Model Length: 1024; Max Output Tokens / Request: 1024
| Inference Engine | Tokens | Time (s) | Throughput (tokens/s) |
|---|---|---|---|
| vLLM (RTX 4070) (Reference) | 133,966 | 98.37 | 1361.84 |
| Nano-vLLM (RTX 4070) (Reference) | 133,966 | 93.41 | 1434.13 |
| vLLM.rs (A100) | 262,144 | 23.88 | 10977.55 (40%+ faster than Nano-vLLM) |
| Nano-vLLM (A100) | 262,144 | 34.22 | 7660.26 |
🧠 Supported Architectures
- ✅ LLaMa (LLaMa2, LLaMa3)
- ✅ Qwen (Qwen2, Qwen3)
- ✅ Qwen2 Moe
- ✅ Qwen3 Moe
- ✅ Mistral
- ✅ GLM4 (0414, Not ChatGLM)
Supports both Safetensors (including GPTQ and AWQ quantized models) and GGUF formats.
📽️ Demo Video
Watch it in action 🎉
📘 Usage in Rust
Run with `--i` for interactive chat 🤖 or `--server` for server mode 🌐; use `--m` to specify a Hugging Face model ID, `--w` to specify a local safetensors model path, or `--f` to specify a local GGUF file:
# Naive CUDA chat mode (single card only, optional `--fp8-kvcache`)
cargo run --release --features cuda,nccl -- --i --d 0 --m unsloth/Qwen3-30B-A3B-Instruct-2507-GGUF --f Qwen3-30B-A3B-Instruct-2507-Q4_K_M.gguf --context-cache
# Multi-GPU chat mode (+Flash Attention, +CUDA graph; this script helps build the runner)
./run.sh --release --features cuda,nccl,graph,flash-attn -- --i --d 0,1 --w /path/Qwen3-30B-A3B-Instruct-2507 --isq q4k --max-model-len 262144 --context-cache
# Multi-GPU server mode (unquantized models)
./run.sh --release --features cuda,nccl,graph,flash-attn -- --d 0,1,2,3 --w /path/Qwen3-30B-A3B-Instruct-2507 --max-model-len 262144 --max-num-seqs 2 --server --port 8000
# Multi-GPU server mode (load as Q4K format, with `--fp8-kvcache` or `--context-cache`)
./run.sh --release --features cuda,nccl,graph,flash-attn -- --d 0,1 --w /path/Qwen3-30B-A3B-Instruct-2507 --isq q4k --max-model-len 262144 --max-num-seqs 2 --server --port 8000 --fp8-kvcache
# Multi-GPU server mode (with `--context-cache`; Flash Attention used in both prefill and decode, longer build time)
./run.sh --release --features cuda,nccl,flash-context -- --d 0,1 --w /path/Qwen3-30B-A3B-Instruct-2507 --isq q4k --max-model-len 262144 --max-num-seqs 2 --server --port 8000 --context-cache
# macOS chat mode (Metal, or `--server` for server mode)
cargo run --release --features metal -- --i --f /path/DeepSeek-R1-Distill-Llama-8B-Q2_K.gguf
# macOS (Metal, ISQ)
cargo run --release --features metal -- --i --w /path/Qwen3-0.6B --isq q4k --context-cache
🚀 Prefill-decode Disaggregation
# Start the PD server (`--port` not required since it does not directly serve requests)
# Option 1: Rust
./run.sh --release --features cuda,nccl,graph,flash-attn -- --d 0,1 --w /path/Qwen3-30B-A3B-Instruct-2507 --isq q4k --max-model-len 262144 --max-num-seqs 2 --server --pd-server
# Option 2: Python (dependencies: pip install vllm_rs fastapi uvicorn)
python3 -m vllm_rs.server --w /path/Qwen3-30B-A3B-Instruct-2507 --isq q4k --max-model-len 262144 --max-num-seqs 2 --d 0,1 --pd-server
# Start the corresponding PD client
# Option 1: Rust
./run.sh --release --features cuda,nccl,graph,flash-attn -- --d 2,3 --w /path/Qwen3-30B-A3B-Instruct-2507 --isq q4k --max-model-len 200000 --max-num-seqs 2 --server --port 8000 --pd-client
# Option 2: Python
python3 -m vllm_rs.server --w /path/Qwen3-30B-A3B-Instruct-2507 --isq q4k --max-model-len 200000 --max-num-seqs 2 --d 2,3 --port 8000 --pd-client
# The PD server and client must use the same number of ranks and the same model (formats may differ, e.g., unquantized safetensors on the server but GGUF on the client).
# If `--pd-url` is provided (e.g., server side: 0.0.0.0:8100, client side: server_ip:8100), the PD server binds to the given address and the client connects to it, so the server and client can run on different machines. This feature is experimental. On Metal, `--pd-url` is required because the platform does not support LocalIPC for transferring GPU memory handles.
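A sketch of the cross-machine pairing described above, reusing the PD commands from this section. The IP address and port are illustrative placeholders; substitute your own values.

```shell
# On machine A (PD server): bind the transfer endpoint on all interfaces
./run.sh --release --features cuda,nccl,graph,flash-attn -- \
  --d 0,1 --w /path/Qwen3-30B-A3B-Instruct-2507 --isq q4k \
  --max-model-len 262144 --max-num-seqs 2 --server --pd-server \
  --pd-url 0.0.0.0:8100

# On machine B (PD client): connect to machine A's address (example IP)
./run.sh --release --features cuda,nccl,graph,flash-attn -- \
  --d 0,1 --w /path/Qwen3-30B-A3B-Instruct-2507 --isq q4k \
  --max-model-len 200000 --max-num-seqs 2 --server --port 8000 --pd-client \
  --pd-url 192.0.2.10:8100
```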
📘 Usage in Python
📦 Install with pip
💡 1. A manual build is required for CUDA compute capability < 8.0 (e.g., V100; no flash-attn support)
💡 2. The prebuilt package has a native context-cache feature that does not rely on Flash Attention; a manual build is required for the flash-context feature.
python3 -m pip install vllm_rs fastapi uvicorn
🌐✨ API Server Mode
💡 You can use any client compatible with the OpenAI API.
🤖 Client-side usage of the context cache is shown below
# Start OpenAI API Server (default http://0.0.0.0:8000)
# openai.base_url = "http://localhost:8000/v1/"
# openai.api_key = "EMPTY"
# Local gguf file (`--f`), max output tokens for each request (`--max-tokens`), FP8 KV Cache (`--fp8-kvcache`, slight accuracy degradation)
python -m vllm_rs.server --f /path/Qwen3-30B-A3B-Instruct-2507-Q4_K_M.gguf --host 0.0.0.0 --port 8000 --max-tokens 32768 --max-model-len 128000 --fp8-kvcache
# Use model weights from huggingface (`--m`: model_id, `--f`: gguf file)
python -m vllm_rs.server --m unsloth/Qwen3-30B-A3B-Instruct-2507-GGUF --f Qwen3-30B-A3B-Instruct-2507-Q4_K_M.gguf --host 0.0.0.0 --port 8000
# Multi-GPU (`--d`)
python -m vllm_rs.server --f /path/Qwen3-30B-A3B-Instruct-2507-Q4_K_M.gguf --d 0,1 --host 0.0.0.0 --port 8000 --max-model-len 64000
# Multi-GPU for safetensors model: local safetensors model (`--w`) with in-situ quant to Q4K during model loading (enable maximum context length)
python -m vllm_rs.server --w /path/Qwen3-30B-A3B-Instruct-2507 --d 0,1 --host 0.0.0.0 --port 8000 --isq q4k --max-model-len 262144 --max-num-seqs 1
# Multi-GPU inference + context caching for a GGUF model. To cache context, include a `session_id` in the `extra_body` field when making a request through the OpenAI API. The `session_id` should remain the same throughout a conversation, and a new `session_id` should be used for each new conversation; unused session caches are cleared. No other API settings need to change.
python -m vllm_rs.server --m unsloth/Qwen3-30B-A3B-Instruct-2507-GGUF --f Qwen3-30B-A3B-Instruct-2507-Q4_K_M.gguf --d 0,1 --host 0.0.0.0 --port 8000 --max-model-len 64000 --max-num-seqs 8 --context-cache
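A minimal client sketch for the context-cache flow above, assuming a server started as shown on localhost:8000. The `session_id` value is arbitrary; it only needs to stay stable within one conversation. With the official OpenAI Python client you would pass it via `extra_body={"session_id": ...}`; the sketch below builds the equivalent request body by hand using only the standard library.

```python
import json
import uuid
from urllib import request

BASE_URL = "http://localhost:8000/v1"  # assumption: server running as above


def chat_payload(messages, session_id, model="default"):
    # `extra_body` entries merge into the request body, so the cached
    # conversation is identified by a top-level `session_id` field.
    return {"model": model, "messages": messages, "session_id": session_id}


def chat(messages, session_id):
    # POST to the OpenAI-compatible chat completions endpoint.
    data = json.dumps(chat_payload(messages, session_id)).encode()
    req = request.Request(
        f"{BASE_URL}/chat/completions",
        data=data,
        headers={"Content-Type": "application/json",
                 "Authorization": "Bearer EMPTY"},
    )
    with request.urlopen(req) as resp:
        return json.load(resp)


# One conversation = one session_id; start a new conversation with a fresh one.
session_id = uuid.uuid4().hex
```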
Interactive Chat and completion
# Interactive chat
# Load with model id
python -m vllm_rs.chat --m unsloth/Qwen3-30B-A3B-Instruct-2507-GGUF --f Qwen3-30B-A3B-Instruct-2507-Q4_K_M.gguf --fp8-kvcache
# Local GGUF file on the second device (`--d 1`)
python -m vllm_rs.chat --d 1 --f /path/Qwen3-30B-A3B-Instruct-2507-Q4_K_M.gguf
# Load unquantized safetensors model as GGUF quantized (e.g., q4k), with maximum model context length
python -m vllm_rs.chat --d 0 --w /path/Qwen3-30B-A3B-Instruct-2507 --isq q4k --max-model-len 262144 --max-num-seqs 1 --max-tokens 16384
# Enable context cache for fast response (CUDA)
python -m vllm_rs.chat --d 0,1 --m unsloth/Qwen3-30B-A3B-Instruct-2507-GGUF --f Qwen3-30B-A3B-Instruct-2507-Q4_K_M.gguf --max-model-len 262144 --max-num-seqs 1 --context-cache
# ISQ q4k (macOS/Metal recommended, optional `--context-cache`)
python -m vllm_rs.chat --w /path/Qwen3-0.6B --isq q4k
# Chat completion
python -m vllm_rs.completion --f /path/qwq-32b-q4_k_m.gguf --prompts "How are you? | How to make money?"
# Chat completion (Multi-GPU, CUDA)
python -m vllm_rs.completion --w /home/GLM-4-9B-0414 --d 0,1 --batch 8 --max-model-len 1024 --max-tokens 1024
🐍 Python API
from vllm_rs import Engine, EngineConfig, SamplingParams, Message
cfg = EngineConfig(weight_path="/path/Qwen3-8B-Q2_K.gguf", max_model_len=4096)
engine = Engine(cfg, "bf16")
params = SamplingParams(temperature=0.6, max_tokens=256)
prompt = engine.apply_chat_template([Message("user", "How are you?")], True)
# Synchronous generation for batched input
outputs = engine.generate_sync([params,params], [prompt, prompt])
print(outputs)
params.session_id = xxx  # pass a session id to use the context cache
# Streaming generation for a single request
(seq_id, prompt_length, stream) = engine.generate_stream(params, prompt)
for item in stream:
    # item.datatype == "TOKEN"
    print(item.data)
🔨 Build Python Package from source (Optional)
⚠️ The first build may take a while if Flash Attention is enabled.
⚠️ When enabling context caching or multi-GPU inference, you also need to compile the Runner (using `build.sh` or `run.sh`).
🛠️ Prerequisites
- Install the Rust toolchain
- On macOS, install Xcode command line tools
- For Python bindings, install Maturin
Building steps
- Install Maturin
# install build dependencies (Linux)
sudo apt install libssl-dev pkg-config -y
pip install maturin
pip install maturin[patchelf] # For Linux/Windows
- Build the Python package
# Naive CUDA (single GPU only)
maturin build --release --features cuda,python
# Naive CUDA (+CUDA Graph, experimental)
./build.sh --release --features cuda,graph,python
# CUDA (with context-cache and FP8 KV Cache, no Flash Attention)
./build.sh --release --features cuda,nccl,python
# CUDA (+Flash Attention, only used in prefill stage)
./build.sh --release --features cuda,nccl,flash-attn,python
# CUDA (+Flash Attention, used in both prefill and decode stage, long time to build)
./build.sh --release --features cuda,nccl,flash-context,python
# macOS (Metal, single GPU only, with context cache and FP8 KV cache)
maturin build --release --features metal,python
- Install packages
# the package you built
pip install target/wheels/vllm_rs-*-cp38-abi3-*.whl --force-reinstall
pip install fastapi uvicorn
⚙️ Command Line Arguments
| Flag | Description |
|---|---|
| `--m` | Hugging Face model ID |
| `--w` | Path to safetensors model |
| `--f` | GGUF filename (when a model ID is given) or local GGUF file path |
| `--d` | Device ID(s) (e.g. `--d 0`) |
| `--max-num-seqs` | Maximum number of concurrent requests (default: 32; 8 on macOS) |
| `--max-tokens` | Max tokens per response (default: 4096, up to `max_model_len`) |
| `--batch` | Only used for benchmarking (replaces `max-num-seqs` and ignores `--prompts`) |
| `--prompts` | Prompts separated by `\|` |
| `--dtype` | KV cache dtype: `bf16` (default), `f16`, or `f32` |
| `--isq` | Load an unquantized model in a GGUF quantized format such as `q2k`, `q4k`, etc. |
| `--temperature` | Controls randomness: lower (0.0) → deterministic; higher (1.0) → creative/random |
| `--top-k` | Limits choices to the top-k highest-probability tokens: smaller k → more stable; larger k → more random |
| `--top-p` | Dynamically chooses the smallest set of tokens whose cumulative probability ≥ p. Typical range: 0.8–0.95 |
| `--presence-penalty` | Controls whether the model avoids reusing tokens that have already appeared. Range [-2, 2]: higher positive values → more likely to introduce new tokens; negative values → more likely to repeat previously used tokens |
| `--frequency-penalty` | Controls whether the model reduces the probability of tokens that appear too often. Range [-2, 2]: higher positive values → stronger penalty for frequently repeated tokens; negative values → encourage repetition |
| `--server` | Server mode in the Rust CLI; in Python use `python -m vllm_rs.server` |
| `--fp8-kvcache` | Use an FP8 KV cache (when `flash-context` is not enabled) |
| `--cpu-mem-fold` | CPU KV cache size relative to the GPU KV cache (default 1.0, range 0.1 to 10.0) |
| `--pd-server` | With PD disaggregation, make this instance the PD server (used only for prefill) |
| `--pd-client` | With PD disaggregation, make this instance the PD client (sends long-context prefill requests to the PD server) |
| `--pd-url` | With PD disaggregation, communicate over TCP/IP at this address (used when the PD server and client are on different machines) |
🗜️ In-Situ Quantization (GGUF Conversion during loading)
💡 Run any unquantized model in a GGUF quantized format; note that `--isq` options other than `q4k` and `q8_0` may take a few minutes to convert.
# macOS
cargo run --release --features metal -- --w /path/Qwen3-0.6B/ --isq q4k --prompts "How are you today?"
# CUDA
cargo run --release --features cuda,flash-attn -- --w /path/Qwen3-8B/ --isq q4k --prompts "How are you today?"
📌 Project Status
🚧 Under active development – breaking changes may occur!
🛠️ Roadmap
- Batched inference (Metal)
- GGUF format support
- FlashAttention (CUDA)
- CUDA Graph
- OpenAI-compatible API (streaming support)
- Continuous batching
- Multi-gpu inference (Safetensors, GPTQ, AWQ, GGUF)
- Speedup prompt processing on Metal/macOS
- Chunked Prefill
- Session-based context cache (available on CUDA when `context-cache` is enabled)
- Model loading from Hugging Face Hub
- Model loading from ModelScope (China)
- Context cache for Metal/macOS
- FP8 KV Cache (CUDA)
- FP8 KV Cache (Metal)
- FP8 KV Cache (with Flash-Attn)
- Additional model support (GLM 4.6, Kimi K2 Thinking, etc.)
- CPU KV Cache Offloading
- Prefill-decode Disaggregation (CUDA)
- Prefill-decode Disaggregation (Metal)
📚 References
- Candle-vLLM
- Python nano-vllm
💡 Like this project? Give it a ⭐ and contribute!