vserve
A CLI for managing LLM inference on GPU workstations.
Download models. Auto-tune limits. Serve with one command. Multiple backends.
Beta Release: 0.5.2b1
vserve is now in beta.
Highlights in 0.5.2b1:
- lifecycle coordination is now backend-wide: `run` stops conflicting backends before launch, and `stop` drains all detected active backends
- startup handling is stricter for automation and friendlier for operators: interactive `run` can return while a backend is still warming, while non-interactive `run` still requires a healthy API
- `vserve run` now polls health every 3 seconds and shows the latest 5 `journalctl` lines while waiting, which makes JIT/kernel warmup much easier to follow
- config, profile, limits, and local state parsing is hardened against malformed files
- doctor/update/runtime paths are more defensive around probe failures, unreadable files, and backend/service uncertainty
Current beta caveats:
- non-interactive startup remains intentionally strict: if the backend never reaches a healthy API state within the timeout window, `run` exits nonzero even if the service is still warming
- multi-user coordination is best-effort operational safety, not a security boundary
Install
```bash
uv tool install vserve
```

Or with pip:

```bash
pip install vserve
```

For llama.cpp GGUF tuning support:

```bash
pip install 'vserve[llamacpp]'
```
Quick Start
```bash
vserve init                             # scan GPU, backends, CUDA, systemd — write config
vserve add                              # search HuggingFace, pick variant, download
vserve run <model>                      # auto-tune + interactive config + serve
vserve run <model> --tools              # enable tool calling (auto-detected)
vserve run <model> --backend llamacpp   # force a specific backend
```
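Once a model is serving, both backends expose an OpenAI-compatible HTTP API. The following minimal sketch queries it with only the standard library; it assumes the default port 8888 from the configuration section below, and the model id `"my-model"` is a placeholder (GET `/v1/models` returns the real one):

```python
# Minimal sketch: query the served model through its OpenAI-compatible API.
# Assumes the default port 8888; "my-model" is a placeholder model id.
import json
import urllib.request

payload = {
    "model": "my-model",
    "messages": [{"role": "user", "content": "Say hello in one sentence."}],
}
req = urllib.request.Request(
    "http://localhost:8888/v1/chat/completions",
    data=json.dumps(payload).encode(),
    headers={"Content-Type": "application/json"},
)
with urllib.request.urlopen(req) as resp:
    print(json.load(resp)["choices"][0]["message"]["content"])
```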
Backends
vserve auto-detects the right backend from the model format:
| Format | Backend | Engine |
|---|---|---|
| safetensors, GPTQ, AWQ, FP8 | vLLM | PagedAttention, continuous batching |
| GGUF | llama.cpp | CPU/GPU offload, quantized inference |
No configuration needed — download a model and vserve run picks the right engine.
vLLM
The default for transformer models in safetensors format. Optimized for high-throughput serving with PagedAttention, KV cache management, and automatic batching.
- Auto-tunes `--max-model-len`, `--max-num-seqs`, `--kv-cache-dtype` based on your GPU (see the sketch after this list)
- Tool calling with parser auto-detection (Qwen, Llama, Mistral, DeepSeek, Gemma, GPT-OSS)
- Systemd service management via `vllm.service`
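The auto-tuning boils down to a VRAM budget: whatever is left after the weights and a safety margin is divided between KV-cache tokens. Here is a rough, illustrative sketch of that arithmetic; the shapes, margin, and batch size below are assumptions for a worked example, not vserve's actual heuristics:

```python
# Rough sketch of KV-cache budgeting behind --max-model-len / --max-num-seqs tuning.
# All concrete numbers here are illustrative assumptions.
def kv_bytes_per_token(layers: int, kv_heads: int, head_dim: int, dtype_bytes: int) -> int:
    # K and V per layer, each kv_heads * head_dim elements.
    return 2 * layers * kv_heads * head_dim * dtype_bytes

vram_total = 24 * 1024**3            # e.g. a 24 GB card
weights    = 15 * 1024**3            # e.g. ~15 GB of model weights
margin     = int(0.10 * vram_total)  # headroom for activations, CUDA graphs, etc.

per_token     = kv_bytes_per_token(layers=28, kv_heads=4, head_dim=128, dtype_bytes=2)
budget_tokens = (vram_total - weights - margin) // per_token

max_num_seqs  = 8
max_model_len = budget_tokens // max_num_seqs
print(f"~{budget_tokens} KV tokens -> max_model_len ~{max_model_len} at {max_num_seqs} seqs")
```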
llama.cpp
For GGUF quantized models. Serves via llama-server with an OpenAI-compatible API.
- Auto-calculates `--n-gpu-layers`, `--ctx-size`, `--parallel` based on VRAM
- Partial GPU offload — serve models that don't fully fit in VRAM
- Tool calling via `--jinja` (no parser configuration needed)
- Systemd service management via `llama-cpp.service`
What It Does
vserve manages the full lifecycle of serving LLMs on a GPU workstation:
- Download — search HuggingFace, see available weight variants (FP8, NVFP4, BF16, GGUF) with sizes, download only what you need
- Auto-tune — calculate exactly what context lengths and concurrency your GPU can handle, based on model architecture and available VRAM
- Tool calling — auto-detects the correct parser from the model's chat template (vLLM) or uses `--jinja` (llama.cpp)
- Run/Stop — interactive config wizard, systemd service management, health check with timeout
- Fan control — temperature-based curve daemon with quiet hours, or hold a fixed speed
- Multi-user — best-effort session coordination warns other `vserve` users before they disrupt your running model
- Doctor — diagnose GPU, CUDA, backend, systemd issues with actionable fix suggestions
Commands
| Command | Description |
|---|---|
| `vserve` | Dashboard — GPU, models, status |
| `vserve init` | Auto-discover backends and write config |
| `vserve list [name]` | List models with backend, tools, and limits |
| `vserve add [model]` | Search and download from HuggingFace with variant picker |
| `vserve rm <name>` | Remove a downloaded model |
| `vserve tune [model]` | Calculate context/concurrency limits |
| `vserve run [model]` | Configure and start serving (auto-tunes if needed) |
| `vserve run <model> --tools` | Start with tool calling enabled |
| `vserve run <model> --backend llamacpp` | Force a specific backend |
| `vserve stop` | Stop the running server |
| `vserve status` | Show current serving config |
| `vserve fan [auto\|off\|30-100]` | GPU fan control with temp-based curve |
| `vserve doctor` | Check system readiness |
| `vserve cache clean [--all]` | Clean stale sockets and JIT caches |
| `vserve version` | Show current version and check for updates |
| `vserve update` | Update vserve to the latest version |
All commands support fuzzy matching — `vserve run qwen fp8` finds the right model.
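As an illustration of what such matching can look like (this is an assumption, not vserve's actual implementation), treating the query as a bag of tokens that must all appear in the model name is enough to resolve `qwen fp8`:

```python
# Illustrative token-subset matcher; not vserve's actual fuzzy-matching code.
def fuzzy_match(query: str, names: list[str]) -> str | None:
    tokens = query.lower().split()
    for name in names:
        if all(tok in name.lower() for tok in tokens):
            return name
    return None

models = ["Qwen2.5-32B-Instruct-FP8", "Llama-3.3-70B-Instruct-GGUF"]
print(fuzzy_match("qwen fp8", models))  # -> Qwen2.5-32B-Instruct-FP8
```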
Tool Calling
vLLM
Auto-detects the correct vLLM parser by reading the model's chat template:
| Model Family | Tool Parser | Reasoning Parser |
|---|---|---|
| Qwen 2.5 | `hermes` | — |
| Qwen 3 | `hermes` | `qwen3` |
| Qwen 3.5 | `qwen3_coder` | `qwen3` |
| Llama 3.1 / 3.2 / 3.3 | `llama3_json` | — |
| Llama 4 | `llama4_pythonic` | — |
| Mistral / Mixtral | `mistral` | `mistral` |
| DeepSeek V3 / R1 | `deepseek_v3` | `deepseek_r1` |
| Gemma 4 | `gemma4` | `gemma4` |
| GPT-OSS | `openai` | `openai_gptoss` |
Detection is template-based (not model-name regex), so it works for fine-tunes and community uploads.
llama.cpp
Uses `--jinja` to read the model's chat template directly. No parser selection needed — one flag covers all model families.
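With either backend, tool calls come back through the standard OpenAI chat-completions schema. Here is a minimal sketch using the `openai` Python client; the port, model id, and tool definition are placeholders for illustration, and the server is assumed to have been started with `vserve run <model> --tools`:

```python
# Minimal sketch of a tool-calling request; assumes `pip install openai`,
# the default port 8888, and a placeholder model id.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8888/v1", api_key="unused")

tools = [{
    "type": "function",
    "function": {
        "name": "get_weather",
        "description": "Return the current weather for a city.",
        "parameters": {
            "type": "object",
            "properties": {"city": {"type": "string"}},
            "required": ["city"],
        },
    },
}]

resp = client.chat.completions.create(
    model="my-model",
    messages=[{"role": "user", "content": "What's the weather in Berlin?"}],
    tools=tools,
)
print(resp.choices[0].message.tool_calls)
```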
Prerequisites
| Requirement | Check | Install |
|---|---|---|
| NVIDIA GPU + drivers | `nvidia-smi` | nvidia.com/drivers |
| CUDA toolkit | `nvcc --version` | `sudo apt install nvidia-cuda-toolkit` |
| systemd | (most Linux servers) | See troubleshooting |
| sudo access | for systemctl, fan control | |
For vLLM backend:
| Requirement | Check | Install |
|---|---|---|
| vLLM 0.19+ | `vllm --version` | docs.vllm.ai |
For llama.cpp backend:
| Requirement | Check | Install |
|---|---|---|
| llama-server | `llama-server --version` | github.com/ggml-org/llama.cpp |
Configuration
Auto-discovered on first run. Override at `~/.config/vserve/config.yaml`:

```yaml
# Shared
port: 8888

# vLLM
vllm_root: /opt/vllm
cuda_home: /usr/local/cuda
service_name: vllm
service_user: vllm

# llama.cpp (optional)
llamacpp_root: /opt/llama-cpp
llamacpp_service_name: llama-cpp
llamacpp_service_user: llama-cpp
```
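Since the config is plain YAML, scripts around vserve can read it with any YAML loader. A sketch of reading it and falling back to defaults follows; the defaults dict and merge logic are illustrative assumptions, not vserve's internals, and PyYAML is required:

```python
# Illustrative config reader; not vserve's actual loading code.
from pathlib import Path
import yaml

DEFAULTS = {"port": 8888, "vllm_root": "/opt/vllm", "cuda_home": "/usr/local/cuda"}

def load_config(path: Path = Path.home() / ".config/vserve/config.yaml") -> dict:
    overrides = yaml.safe_load(path.read_text()) if path.exists() else {}
    return {**DEFAULTS, **(overrides or {})}

print(load_config())
```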
Directory Layout
```
/opt/vllm/                    # vLLM backend
├── venv/bin/vllm             # Python venv
├── models/                   # safetensors models
├── configs/                  # limits + profiles
└── logs/

/opt/llama-cpp/               # llama.cpp backend
├── bin/llama-server          # compiled binary
├── models/                   # GGUF models
├── configs/                  # JSON configs
└── logs/
```
Fan Control
```bash
vserve fan        # show status, interactive menu
vserve fan auto   # temp-based curve with quiet hours
vserve fan 80     # hold at 80% (persistent daemon)
vserve fan off    # stop daemon, restore NVIDIA auto
```
The auto curve ramps with temperature and caps fan speed during quiet hours (configurable). Emergency override at 88C ignores quiet hours.
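The curve itself is simple to picture: map temperature to a duty cycle, clamp it, and cap it during quiet hours unless the emergency threshold is hit. The breakpoints and cap in this sketch are illustrative assumptions that mirror the description above, not vserve's actual values:

```python
# Illustrative temperature-to-fan-speed curve with a quiet-hours cap.
def fan_speed(temp_c: float, quiet_hours: bool) -> int:
    if temp_c >= 88:               # emergency override ignores quiet hours
        return 100
    # Linear ramp from 30% at 40°C to 90% at 80°C (example breakpoints).
    speed = 30 + (temp_c - 40) / (80 - 40) * (90 - 30)
    speed = max(30, min(90, int(speed)))
    if quiet_hours:
        speed = min(speed, 50)     # example quiet-hours cap
    return speed

print(fan_speed(72, quiet_hours=True))   # -> 50
print(fan_speed(90, quiet_hours=True))   # -> 100
```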
Architecture
vserve uses a Backend Protocol pattern. Each inference engine implements the same interface:
```
Backend Protocol
├── VllmBackend     — safetensors, AWQ, FP8, GPTQ
├── LlamaCppBackend — GGUF
└── (future: SGLang, etc.)
```
The registry auto-detects the right backend from the model format. All CLI commands work through the protocol — no backend-specific code in the command layer.
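In code terms, the pattern looks roughly like a `typing.Protocol` with a small registry in front of it. The method names and signatures below are illustrative assumptions rather than vserve's actual interface:

```python
# Sketch of the Backend Protocol idea; names and signatures are assumptions.
from typing import Protocol

class Backend(Protocol):
    name: str

    def detect(self, model_path: str) -> bool:
        """Return True if this backend can serve the model at model_path."""
        ...

    def tune(self, model_path: str, vram_bytes: int) -> dict:
        """Calculate serving limits for the model given available VRAM."""
        ...

    def start(self, model_path: str, limits: dict) -> None:
        """Write the service config and start the systemd unit."""
        ...

    def stop(self) -> None:
        """Stop the systemd unit for this backend."""
        ...

def pick_backend(model_path: str, registry: list[Backend]) -> Backend:
    # The command layer only sees the protocol; format detection lives in each backend.
    for backend in registry:
        if backend.detect(model_path):
            return backend
    raise ValueError(f"No backend matches {model_path}")
```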
Development
```bash
git clone https://github.com/Gavin-Qiao/vserve.git
cd vserve
uv sync --dev

uv run pytest tests/            # 314 tests
uv run ruff check src/ tests/   # lint
uv run mypy src/vserve/         # type check
```
License
File details
Details for the file vserve-0.5.2b1.tar.gz.
File metadata
- Download URL: vserve-0.5.2b1.tar.gz
- Upload date:
- Size: 131.0 kB
- Tags: Source
- Uploaded using Trusted Publishing? Yes
- Uploaded via: twine/6.1.0 CPython/3.13.7
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | `42e1f288d7613e809d0972154e34cdc6f65506aec5e9e3f8bff4e2ca59e6b77c` |
| MD5 | `24defce9fa837f53b58f8977380a6b0e` |
| BLAKE2b-256 | `dab6b078fd7ccee705878db49518b2627b153344418a9fde79becf4c9f76237a` |
Provenance

The following attestation bundles were made for vserve-0.5.2b1.tar.gz:

Publisher: publish.yml on Gavin-Qiao/vserve

- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: vserve-0.5.2b1.tar.gz
- Subject digest: 42e1f288d7613e809d0972154e34cdc6f65506aec5e9e3f8bff4e2ca59e6b77c
- Sigstore transparency entry: 1239398983
- Sigstore integration time:
- Permalink: Gavin-Qiao/vserve@7cb0a97a1952ff877fec1ecbb214023b47645f17
- Branch / Tag: refs/tags/v0.5.2b1
- Owner: https://github.com/Gavin-Qiao
- Access: public
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: publish.yml@7cb0a97a1952ff877fec1ecbb214023b47645f17
- Trigger Event: push
File details
Details for the file vserve-0.5.2b1-py3-none-any.whl.
File metadata
- Download URL: vserve-0.5.2b1-py3-none-any.whl
- Upload date:
- Size: 68.4 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? Yes
- Uploaded via: twine/6.1.0 CPython/3.13.7
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | `aa131ed806ab56fd83c0697c383393b8b7912ba5e81db81e2c7fbecf4e26d123` |
| MD5 | `e099475a72b385c1d861a69c88f3f38a` |
| BLAKE2b-256 | `fd8c973c5e9c0e2eb43760ab436d50b4877adcb227f41a15b973de0ffef8ddc7` |
Provenance

The following attestation bundles were made for vserve-0.5.2b1-py3-none-any.whl:

Publisher: publish.yml on Gavin-Qiao/vserve

- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: vserve-0.5.2b1-py3-none-any.whl
- Subject digest: aa131ed806ab56fd83c0697c383393b8b7912ba5e81db81e2c7fbecf4e26d123
- Sigstore transparency entry: 1239398984
- Sigstore integration time:
- Permalink: Gavin-Qiao/vserve@7cb0a97a1952ff877fec1ecbb214023b47645f17
- Branch / Tag: refs/tags/v0.5.2b1
- Owner: https://github.com/Gavin-Qiao
- Access: public
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: publish.yml@7cb0a97a1952ff877fec1ecbb214023b47645f17
- Trigger Event: push