vserve
A CLI for managing LLM inference on GPU workstations.
Download models. Auto-tune limits. Serve with one command. Multiple backends.
Beta Release: 0.5.2b2
vserve is now in beta.
Highlights in 0.5.2b2:
- Lifecycle coordination is now backend-wide: `run` stops conflicting backends before launch, and `stop` drains all detected active backends
- Startup handling is stricter for automation and friendlier for operators: interactive `run` can return while a backend is still warming, while non-interactive `run` still requires a healthy API
- `vserve run` now polls health every 3 seconds and shows the latest 5 `journalctl` lines while waiting, which makes JIT/kernel warmup much easier to follow
- Config, profile, limits, and local state parsing is hardened against malformed files
- doctor/update/runtime paths are more defensive around probe failures, unreadable files, and backend/service uncertainty
- vLLM runtime support is pinned to stable `>=0.20,<0.21`; `vserve runtime check vllm` reports installed vLLM, torch, CUDA, Transformers, and dependency health
- Tuning caches are versioned and tied to model, GPU, backend, and runtime fingerprints, so stale limits are recalculated after runtime drift
Current beta caveats:
- Non-interactive startup remains intentionally strict: if the backend never reaches a healthy API state within the timeout window, `run` exits nonzero even if the service is still warming
- Multi-user coordination is best-effort operational safety, not a security boundary
Install
Beta/pre-release channel (primary while vserve is in beta):
uv tool install --prerelease allow vserve
pip install --pre vserve
vserve update --nightly
For llama.cpp GGUF tuning support:
pip install 'vserve[llamacpp]'
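A quick sanity check after installing (both commands are described in the table below):
vserve version   # confirm the installed build and check for updates
vserve doctor    # verify GPU, CUDA, backends, and systemd readiness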
Quick Start
vserve init # scan GPU, backends, CUDA, systemd — write config
vserve runtime check vllm # verify the external vLLM runtime
vserve add # search HuggingFace, pick variant, download
vserve run <model> # auto-tune + interactive config + serve
vserve run <model> --tools # enable tool calling (auto-detected)
vserve run <model> --backend llamacpp # force a specific backend
Scriptable serving:
vserve run qwen fp8 --yes --context 32768 --slots 4 --kv-cache-dtype fp8 --port 8888
vserve run qwen fp8 --yes --replace # safe non-interactive restart
vserve run qwen fp8 --save-profile fast --yes
vserve run qwen fp8 --profile fast
vserve run --profile /opt/vllm/configs/models/provider--Model.fast.yaml --yes
Runtime repair and GGUF-only setup:
vserve runtime check vllm
vserve runtime upgrade vllm --stable
vserve add TheBloke/some-model-GGUF
vserve run some model q4 --backend llamacpp --yes --gpu-layers 999
Automation:
vserve run qwen fp8 --profile fast --yes
vserve status
vserve stop
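Because non-interactive `run` exits nonzero when the API never becomes healthy, it composes cleanly with shell error handling. A minimal sketch (the profile and model names are the placeholders used above, and `jq` is assumed to be installed separately):
set -euo pipefail
vserve run qwen fp8 --profile fast --yes --replace   # with set -e, the script aborts if startup fails
vserve status --json | jq .                          # machine-readable state for follow-up tooling
vserve stop                                          # drain the backend when the batch job is done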
Backends
vserve auto-detects the right backend from the model format:
| Format | Backend | Engine |
|---|---|---|
| safetensors, GPTQ, AWQ, FP8 | vLLM | PagedAttention, continuous batching |
| GGUF | llama.cpp | CPU/GPU offload, quantized inference |
No configuration needed — download a model and vserve run picks the right engine.
vLLM
The default for transformer models in safetensors format. Optimized for high-throughput serving with PagedAttention, KV cache management, and automatic batching.
- Auto-tunes `--max-model-len`, `--max-num-seqs`, and `--kv-cache-dtype` based on your GPU (see the sketch below)
- Tool calling with parser auto-detection (Qwen, Llama, Mistral, DeepSeek, Gemma, GPT-OSS)
- Systemd service management via `vllm.service`
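For orientation, the tuned values map onto standard `vllm serve` flags. A hand-written invocation roughly equivalent to what a generated profile encodes might look like this (illustrative only — the model path, values, and parser are placeholders, and the exact command vserve writes may differ):
# the auto-tuned settings expressed as a manual vllm serve call
vllm serve /opt/vllm/models/<model-dir> \
    --max-model-len 32768 \
    --max-num-seqs 4 \
    --kv-cache-dtype fp8 \
    --gpu-memory-utilization 0.91 \
    --enable-auto-tool-choice --tool-call-parser hermes \
    --port 8888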
llama.cpp
For GGUF quantized models. Serves via llama-server with an OpenAI-compatible API.
- Auto-calculates `--n-gpu-layers`, `--ctx-size`, and `--parallel` based on VRAM (see the sketch below)
- Partial GPU offload — serve models that don't fully fit in VRAM
- Tool calling via `--jinja` (no parser configuration needed)
- Systemd service management via `llama-cpp.service`
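Likewise for GGUF, the calculated settings correspond to ordinary `llama-server` flags. An illustrative launch (paths and values are placeholders; the launch script vserve actually writes may differ):
llama-server \
    --model /opt/llama-cpp/models/<model-dir>/<weights>.gguf \
    --n-gpu-layers 999 \
    --ctx-size 32768 \
    --parallel 4 \
    --jinja \
    --port 8080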
What It Does
vserve manages the full lifecycle of serving LLMs on a GPU workstation:
- Download — search HuggingFace, see available weight variants (FP8, NVFP4, BF16, GGUF) with sizes, download only one backend format at a time, and materialize each runnable variant into its own model root
- Auto-tune — calculate exactly what context lengths and concurrency your GPU can handle, based on model architecture and available VRAM
- Tool calling — auto-detects the correct parser from the model's chat template (vLLM) or uses
--jinja(llama.cpp) - Run/Stop — interactive config wizard, systemd service management, health check with timeout
- Fan control — temperature-based curve daemon with quiet hours, or hold a fixed speed
- Multi-user — best-effort session coordination warns other `vserve` users before they disrupt your running model
- Doctor — diagnose GPU, CUDA, backend, systemd issues with actionable fix suggestions
Commands
| Command | Description |
|---|---|
| `vserve` | Dashboard — GPU, models, status |
| `vserve init` | Auto-discover backends and write config |
| `vserve list [name]` | List models with backend, tools, and limits |
| `vserve add [model]` | Search and download from HuggingFace with variant picker |
| `vserve rm <name>` | Remove a downloaded model |
| `vserve tune [model]` | Calculate context/concurrency limits |
| `vserve run [model]` | Configure and start serving (auto-tunes if needed) |
| `vserve run MODEL... --yes --context N --slots N` | Non-interactive serving from flags |
| `vserve run MODEL... --yes --replace` | Non-interactive restart; without `--replace`, running backends are refused |
| `vserve run MODEL... --profile NAME_OR_PATH` | Serve a saved profile by name or explicit path |
| `vserve run MODEL... --tools --tool-parser hermes --reasoning-parser qwen3` | Start with explicit parsers |
| `vserve run MODEL... --trust-remote-code` | Opt in to vLLM remote model code execution |
| `vserve run MODEL... --backend llamacpp --gpu-layers 999` | Force llama.cpp for GGUF |
| `vserve profile list\|show\|rm` | Manage saved serving profiles |
| `vserve stop` | Stop the running server |
| `vserve status [--json]` | Show current serving config and probe uncertainty |
| `vserve fan [auto\|off\|30-100]` | GPU fan control with temp-based curve |
| `vserve doctor [--json] [--strict]` | Check system readiness; strict exits nonzero on failures |
| `vserve cache clean [--dry-run] [--all] [--yes]` | Preview or clean stale sockets and JIT caches |
| `vserve runtime check vllm` | Check vLLM version/dependency compatibility |
| `vserve runtime upgrade vllm --stable` | Reinstall vserve's pinned stable vLLM runtime |
| `vserve version` | Show current version and check for updates |
| `vserve update [--nightly]` | Update vserve, optionally allowing pre-releases |
Model-taking commands support fuzzy matching — vserve run qwen fp8 finds the right model.
Profile rules: names saved with `--save-profile` must match `[A-Za-z0-9._-]+` and cannot be `.`, `..`, or include path separators. Profile names resolve inside configured vserve profile roots. Explicit external `--profile` paths are accepted only by `run`, which infers the backend from the YAML/JSON when possible. `profile show` and `profile rm` never read or delete arbitrary external paths, even with `--force`.
Automation note: `run --yes` is fully non-interactive. If it needs to stop or start systemd services, it uses non-prompting service operations; configure passwordless service control for the vserve operator or run without `--yes`.
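One way to provide that passwordless control is a narrow sudoers entry scoped to the two service units. A sketch — `ops` is a placeholder username, and the `systemctl` path may be `/bin/systemctl` on some distributions:
# /etc/sudoers.d/vserve-operator  (edit with visudo)
ops ALL=(root) NOPASSWD: /usr/bin/systemctl start vllm.service, \
    /usr/bin/systemctl stop vllm.service, \
    /usr/bin/systemctl restart vllm.service, \
    /usr/bin/systemctl start llama-cpp.service, \
    /usr/bin/systemctl stop llama-cpp.service, \
    /usr/bin/systemctl restart llama-cpp.service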
Tool Calling
vLLM
Auto-detects the correct vLLM parser by reading the model's chat template:
| Model Family | Tool Parser | Reasoning Parser |
|---|---|---|
| Qwen 2.5 | `hermes` | — |
| Qwen 3 | `hermes` | `qwen3` |
| Qwen 3.5 | `qwen3_coder` | `qwen3` |
| Llama 3.1 / 3.2 / 3.3 | `llama3_json` | — |
| Llama 4 | `llama4_pythonic` | — |
| Mistral / Mixtral | `mistral` | `mistral` |
| DeepSeek V3 / R1 | `deepseek_v3` | `deepseek_r1` |
| Gemma 4 | `gemma4` | `gemma4` |
| GPT-OSS | `openai` | `openai_gptoss` |
Detection is template-based (not model-name regex), so it works for fine-tunes and community uploads.
Remote model code is disabled by default. Use --trust-remote-code only for repositories you trust; generated profiles include trust-remote-code only when that flag is explicitly set.
llama.cpp
Uses --jinja to read the model's chat template directly. No parser selection needed — one flag covers all model families.
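Either backend then serves an OpenAI-compatible API, so a tool-calling request has the same shape in both cases. A minimal sketch (the port, model name, and function are placeholders):
curl -s http://localhost:8888/v1/chat/completions \
  -H 'Content-Type: application/json' \
  -d '{
    "model": "served-model",
    "messages": [{"role": "user", "content": "What is the weather in Berlin?"}],
    "tools": [{
      "type": "function",
      "function": {
        "name": "get_weather",
        "description": "Look up current weather for a city",
        "parameters": {
          "type": "object",
          "properties": {"city": {"type": "string"}},
          "required": ["city"]
        }
      }
    }]
  }'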
Prerequisites
| Requirement | Check | Install |
|---|---|---|
| NVIDIA GPU + drivers | `nvidia-smi` | nvidia.com/drivers |
| CUDA toolkit | `nvcc --version` | `sudo apt install nvidia-cuda-toolkit` |
| systemd | (most Linux servers) | See troubleshooting |
| sudo access | for systemctl, fan control | — |
For vLLM backend:
| Requirement | Check | Install |
|---|---|---|
| stable vLLM 0.20.x | `vserve runtime check vllm` | `vserve runtime upgrade vllm --stable` or docs.vllm.ai |
For llama.cpp backend:
| Requirement | Check | Install |
|---|---|---|
| llama-server | `llama-server --version` | github.com/ggml-org/llama.cpp |
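If `llama-server` is not already installed, building it from source with CUDA enabled is one option (a sketch, assuming the CUDA toolkit from the first table is present):
git clone https://github.com/ggml-org/llama.cpp
cd llama.cpp
cmake -B build -DGGML_CUDA=ON             # enable CUDA offload
cmake --build build --config Release -j
build/bin/llama-server --version          # then place the binary where your config's llamacpp root expects it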
Configuration
Auto-discovered on first run. Override at ~/.config/vserve/config.yaml:
schema_version: 2
cuda_home: /usr/local/cuda
gpu:
  index: 0
  memory_utilization: 0.91
backends:
  vllm:
    root: /opt/vllm
    service_name: vllm
    service_user: vllm
    port: 8888
  llamacpp:
    root: /opt/llama-cpp
    service_name: llama-cpp
    service_user: llama-cpp
Legacy top-level vllm_root, service_name, llamacpp_root, and GPU memory keys still load, but newly saved config uses the backend-indexed schema above.
gpu.index is part of runtime truth, not only a tuning hint. vserve records it in active manifests and tuning fingerprints. llama.cpp launch scripts export CUDA_VISIBLE_DEVICES=<index>. vLLM writes configs/.env with the same value and doctor expects the systemd unit to load that environment file.
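Concretely, with `gpu.index: 0` the plumbing looks roughly like this (a sketch — the real `.env` may carry additional variables, and the exact systemd wiring is up to your unit file):
# /opt/vllm/configs/.env  (written by vserve)
CUDA_VISIBLE_DEVICES=0

# vllm.service is then expected to load it, e.g.:
# [Service]
# EnvironmentFile=/opt/vllm/configs/.env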
Directory Layout
/opt/vllm/ # vLLM backend
├── venv/bin/vllm # Python venv
├── .venv/bin/vllm # alternate Python venv location
├── models/ # safetensors models
├── configs/
│ ├── .env # service environment
│ ├── active.yaml # active profile symlink
│ └── models/ # limits + YAML profiles
├── tmp/ # RPC sockets / runtime temp files
├── .cache/
│ ├── flashinfer/ # FlashInfer JIT cache
│ ├── torch_extensions/ # torch extension cache
│ └── vllm/ # vLLM/torch.compile cache
├── run/
│ └── active-manifest.json # active backend state
└── logs/
/opt/llama-cpp/ # llama.cpp backend
├── bin/llama-server # compiled binary
├── models/ # GGUF models
├── configs/
│ ├── active.sh # active launch script symlink
│ ├── active.json # active config symlink
│ └── models/ # JSON profiles
├── run/
│ └── active-manifest.json # active backend state
└── logs/
GGUF downloads create one runnable model root per selected quant/subdirectory, so Q4_K_M and Q8_0 variants do not share a directory. Source roots left only for materialization are ignored by model scanning.
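For example, downloading Q4_K_M and Q8_0 from the same GGUF repo might materialize as two independent roots like the following (names are purely illustrative of the one-root-per-variant rule, not vserve's exact naming scheme):
/opt/llama-cpp/models/
├── some-model-GGUF-Q4_K_M/
│   └── some-model.Q4_K_M.gguf
└── some-model-GGUF-Q8_0/
    └── some-model.Q8_0.gguf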
Fan Control
vserve fan # show status, interactive menu
vserve fan auto # temp-based curve with quiet hours
vserve fan 80 # hold at 80% (persistent daemon)
vserve fan off # stop daemon, restore NVIDIA auto
The auto curve ramps with temperature and caps fan speed during quiet hours (configurable). Emergency override at 88C ignores quiet hours.
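To sanity-check what the daemon is doing, a standard NVIDIA query works alongside it (plain `nvidia-smi`, not a vserve command):
nvidia-smi --query-gpu=temperature.gpu,fan.speed,power.draw --format=csv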
Architecture
vserve uses a Backend Protocol pattern. Each inference engine implements the same interface:
Backend Protocol
├── VllmBackend — safetensors, AWQ, FP8, GPTQ
├── LlamaCppBackend — GGUF
└── (future: SGLang, etc.)
The registry auto-detects the right backend from the model format. Runtime checks, tuning fingerprints, profile/config generation, service lifecycle, active manifests, and status summaries live behind the backend protocol so the command layer can stay focused on user workflows.
Development
git clone https://github.com/Gavin-Qiao/vserve.git
cd vserve
uv sync --dev
uv run pytest tests/ # 495 tests
uv run ruff check src/ tests/ # lint
uv run mypy src/vserve/ --ignore-missing-imports --check-untyped-defs
License
Download files
Download the file for your platform.
Source Distribution
Built Distribution
File details
Details for the file vserve-0.5.2b2.tar.gz.
File metadata
- Download URL: vserve-0.5.2b2.tar.gz
- Upload date:
- Size: 173.8 kB
- Tags: Source
- Uploaded using Trusted Publishing? Yes
- Uploaded via: twine/6.1.0 CPython/3.13.12
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | `24c08226296f1f4283b785016fcd2a328473e1acca555a01ed61d649bdb1fdac` |
| MD5 | `640efe02728890b32987a8e64c0e5e31` |
| BLAKE2b-256 | `7dc0de8fc2f050210a35c72f30db2c0425f9594a68cd6ede1959bc1b69cfcb40` |
Provenance
The following attestation bundles were made for vserve-0.5.2b2.tar.gz:
Publisher: publish.yml on Gavin-Qiao/vserve
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: vserve-0.5.2b2.tar.gz
- Subject digest: 24c08226296f1f4283b785016fcd2a328473e1acca555a01ed61d649bdb1fdac
- Sigstore transparency entry: 1398948226
- Sigstore integration time:
- Permalink: Gavin-Qiao/vserve@1cdeecb07611f45922502ec6db6e12ff7108735f
- Branch / Tag: refs/tags/v0.5.2b2
- Owner: https://github.com/Gavin-Qiao
- Access: public
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: publish.yml@1cdeecb07611f45922502ec6db6e12ff7108735f
- Trigger Event: push
File details
Details for the file vserve-0.5.2b2-py3-none-any.whl.
File metadata
- Download URL: vserve-0.5.2b2-py3-none-any.whl
- Upload date:
- Size: 94.9 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? Yes
- Uploaded via: twine/6.1.0 CPython/3.13.12
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | `78e0eea493ed79392b2b0d54755b80b4bf20857844db05637927e2b50f13be43` |
| MD5 | `5eef6fb3d2827042b0d306ec78760e6b` |
| BLAKE2b-256 | `f995fd6a3e49302c789fabbcb9208e2c47e11d947c5dddc5b617c9e94578c582` |
Provenance
The following attestation bundles were made for vserve-0.5.2b2-py3-none-any.whl:
Publisher: publish.yml on Gavin-Qiao/vserve
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: vserve-0.5.2b2-py3-none-any.whl
- Subject digest: 78e0eea493ed79392b2b0d54755b80b4bf20857844db05637927e2b50f13be43
- Sigstore transparency entry: 1398948239
- Sigstore integration time:
- Permalink: Gavin-Qiao/vserve@1cdeecb07611f45922502ec6db6e12ff7108735f
- Branch / Tag: refs/tags/v0.5.2b2
- Owner: https://github.com/Gavin-Qiao
- Access: public
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: publish.yml@1cdeecb07611f45922502ec6db6e12ff7108735f
- Trigger Event: push