A CLI tool to conveniently serve LLMs with vLLM
Project description
vLLM CLI
A command-line interface tool for serving Large Language Models using vLLM. Provides both interactive and command-line modes with features for configuration profiles, model management, and server monitoring.
Interactive terminal interface with GPU status and system overview
Tip: You can customize the GPU stats bar in settings
Features
- ๐ฏ Interactive Mode - Rich terminal interface with menu-driven navigation
- โก Command-Line Mode - Direct CLI commands for automation and scripting
- ๐ค Model Management - Automatic discovery of local models with HuggingFace and Ollama support
- ๐ง Configuration Profiles - Pre-configured and custom server profiles for different use cases
- ๐ Server Monitoring - Real-time monitoring of active vLLM servers
- ๐ฅ๏ธ System Information - GPU, memory, and CUDA compatibility checking
- ๐ Advanced Configuration - Full control over vLLM parameters with validation
Quick Links: ๐ Docs | ๐ Quick Start | ๐ธ Screenshots | ๐ Usage Guide | โ Troubleshooting | ๐บ๏ธ Roadmap
What's New in v0.2.5
Multi-Model Proxy Server (Experimental)
The Multi-Model Proxy is a new experimental feature that enables serving multiple LLMs through a single unified API endpoint. This feature is currently under active development and available for testing.
What It Does:
- Single Endpoint - All your models accessible through one API
- Live Management - Add or remove models without stopping the service
- Dynamic GPU Management - Efficient GPU resource distribution through vLLM's sleep/wake functionality
- Interactive Setup - User-friendly wizard guides you through configuration
Note: This is an experimental feature under active development. Your feedback helps us improve! Please share your experience through GitHub Issues.
For complete documentation, see the ๐ Multi-Model Proxy Guide.
What's New in v0.2.4
๐ Hardware-Optimized Profiles for GPT-OSS Models
New built-in profiles specifically optimized for serving GPT-OSS models on different GPU architectures:
gpt_oss_ampere- Optimized for NVIDIA A100 GPUsgpt_oss_hopper- Optimized for NVIDIA H100/H200 GPUsgpt_oss_blackwell- Optimized for NVIDIA Blackwell GPUs
Based on official vLLM GPT recipes for maximum performance.
โก Shortcuts System
Save and quickly launch your favorite model + profile combinations:
vllm-cli serve --shortcut my-gpt-server
๐ฆ Full Ollama Integration
- Automatic discovery of Ollama models
- GGUF format support (experimental)
- System and user directory scanning
๐ง Enhanced Configuration
- Environment Variables - Universal and profile-specific environment variable management
- GPU Selection - Choose specific GPUs for model serving (
--device 0,1) - Enhanced System Info - vLLM feature detection with attention backend availability
See CHANGELOG.md for detailed release notes.
Quick Start
Important: vLLM Installation Notes
โ ๏ธ Binary Compatibility Warning: vLLM contains pre-compiled CUDA kernels that must match your PyTorch version exactly. Installing mismatched versions will cause errors.
vLLM-CLI will not install vLLM or Pytorch by default.
Installation
Option 1: Install vLLM seperately and then install vLLM CLI (Recommended)
# Install vLLM -- Skip this step if you have vllm installed in your environment
uv venv --python 3.12 --seed
source .venv/bin/activate
uv pip install vllm --torch-backend=auto
# Or specify a backend: uv pip install vllm --torch-backend=cu128
# Install vLLM CLI
uv pip install --upgrade vllm-cli
uv run vllm-cli
# If you are using conda:
# Activate the environment you have vllm installed in
pip install vllm-cli
vllm-cli
Option 2: Install vLLM CLI + vLLM
# Install vLLM CLI + vLLM
pip install vllm-cli[vllm]
vllm-cli
Option 3: Build from source (You still need to install vLLM seperately)
git clone https://github.com/Chen-zexi/vllm-cli.git
cd vllm-cli
pip install -e .
Option 4: For Isolated Installation (pipx/system packages)
โ ๏ธ Compatibility Note: pipx creates isolated environments which may have compatibility issues with vLLM's CUDA dependencies. Consider using uv or conda (see above) for better PyTorch/CUDA compatibility.
# If you do not want to use virtual environment and want to install vLLM along with vLLM CLI
pipx install "vllm-cli[vllm]"
# If you want to install pre-release version
pipx install --pip-args="--pre" "vllm-cli[vllm]"
Prerequisites
- Python 3.9+
- CUDA-compatible GPU (recommended)
- vLLM package installed
- For dependency issues, see Troubleshooting Guide
Basic Usage
# Interactive mode - menu-driven interface
vllm-cl
# Serve a model
vllm-cli serve --model openai/gpt-oss-20b
# Use a shortcut
vllm-cli serve --shortcut my-model
For detailed usage instructions, see the ๐ Usage Guide and ๐ Multi-Model Proxy Guide.
Configuration
Built-in Profiles
vLLM CLI includes 7 optimized profiles for different use cases:
General Purpose:
standard- Minimal configuration with smart defaultshigh_throughput- Maximum performance configurationlow_memory- Memory-constrained environmentsmoe_optimized- Optimized for Mixture of Experts models
Hardware-Specific (GPT-OSS):
gpt_oss_ampere- NVIDIA A100 GPUsgpt_oss_hopper- NVIDIA H100/H200 GPUsgpt_oss_blackwell- NVIDIA Blackwell GPUs
See ๐ Profiles Guide for detailed information.
Configuration Files
- Main Config:
~/.config/vllm-cli/config.yaml - User Profiles:
~/.config/vllm-cli/user_profiles.json - Shortcuts:
~/.config/vllm-cli/shortcuts.json
Documentation
- ๐ Usage Guide - Complete usage instructions
- ๐ Multi-Model Proxy - Serve multiple models simultaneously
- ๐ Profiles Guide - Built-in profiles details
- โ Troubleshooting - Common issues and solutions
- ๐ธ Screenshots - Visual feature overview
- ๐ Model Discovery - Model management guide
- ๐ฆ Ollama Integration - Using Ollama models
- โ๏ธ Custom Models - Serving custom models
- ๐บ๏ธ Roadmap - Future development plans
Integration with hf-model-tool
vLLM CLI uses hf-model-tool for model discovery:
- Comprehensive model scanning
- Ollama model support
- Shared configuration
Development
Project Structure
src/vllm_cli/
โโโ cli/ # CLI command handling
โโโ config/ # Configuration management
โโโ models/ # Model management
โโโ server/ # Server lifecycle
โโโ ui/ # Terminal interface
โโโ schemas/ # JSON schemas
Contributing
Contributions are welcome! Please feel free to open an issue or submit a pull request.
License
MIT License - see LICENSE file for details.
Project details
Release history Release notifications | RSS feed
Download files
Download the file for your platform. If you're not sure which to choose, learn more about installing packages.
Source Distribution
Built Distribution
Filter files by name, interpreter, ABI, and platform.
If you're not sure about the file name format, learn more about wheel file names.
Copy a direct link to the current filters
File details
Details for the file vllm_cli-0.2.5.tar.gz.
File metadata
- Download URL: vllm_cli-0.2.5.tar.gz
- Upload date:
- Size: 230.4 kB
- Tags: Source
- Uploaded using Trusted Publishing? Yes
- Uploaded via: twine/6.1.0 CPython/3.12.9
File hashes
| Algorithm | Hash digest | |
|---|---|---|
| SHA256 |
4b384958a82f40650e63b4bdf5538c543c35c3680422ab66ec2ccba7ce0b032d
|
|
| MD5 |
70393b1acb280ea54a880cc1034446ab
|
|
| BLAKE2b-256 |
ad734ebb1471317a6dfa55bc9c8c21fe5d2d108f22aaa8357702b13a4d9069e8
|
Provenance
The following attestation bundles were made for vllm_cli-0.2.5.tar.gz:
Publisher:
python-publish.yml on Chen-zexi/vllm-cli
-
Statement:
-
Statement type:
https://in-toto.io/Statement/v1 -
Predicate type:
https://docs.pypi.org/attestations/publish/v1 -
Subject name:
vllm_cli-0.2.5.tar.gz -
Subject digest:
4b384958a82f40650e63b4bdf5538c543c35c3680422ab66ec2ccba7ce0b032d - Sigstore transparency entry: 429887159
- Sigstore integration time:
-
Permalink:
Chen-zexi/vllm-cli@b49de5b4635b20dbf81cc985fb7b913551c5166a -
Branch / Tag:
refs/tags/v0.2.5 - Owner: https://github.com/Chen-zexi
-
Access:
public
-
Token Issuer:
https://token.actions.githubusercontent.com -
Runner Environment:
github-hosted -
Publication workflow:
python-publish.yml@b49de5b4635b20dbf81cc985fb7b913551c5166a -
Trigger Event:
release
-
Statement type:
File details
Details for the file vllm_cli-0.2.5-py3-none-any.whl.
File metadata
- Download URL: vllm_cli-0.2.5-py3-none-any.whl
- Upload date:
- Size: 249.6 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? Yes
- Uploaded via: twine/6.1.0 CPython/3.12.9
File hashes
| Algorithm | Hash digest | |
|---|---|---|
| SHA256 |
b100dd8b001adda684e8035ee306fc024604317200b1db3d8cfebbdd4d9f2df8
|
|
| MD5 |
0e48b8caeaaed2468b1a0b5ec9aa2a9e
|
|
| BLAKE2b-256 |
8c1238f0a8d045ab23dac015d37a268bcc480e0738a71d13005ab66609b7486a
|
Provenance
The following attestation bundles were made for vllm_cli-0.2.5-py3-none-any.whl:
Publisher:
python-publish.yml on Chen-zexi/vllm-cli
-
Statement:
-
Statement type:
https://in-toto.io/Statement/v1 -
Predicate type:
https://docs.pypi.org/attestations/publish/v1 -
Subject name:
vllm_cli-0.2.5-py3-none-any.whl -
Subject digest:
b100dd8b001adda684e8035ee306fc024604317200b1db3d8cfebbdd4d9f2df8 - Sigstore transparency entry: 429887164
- Sigstore integration time:
-
Permalink:
Chen-zexi/vllm-cli@b49de5b4635b20dbf81cc985fb7b913551c5166a -
Branch / Tag:
refs/tags/v0.2.5 - Owner: https://github.com/Chen-zexi
-
Access:
public
-
Token Issuer:
https://token.actions.githubusercontent.com -
Runner Environment:
github-hosted -
Publication workflow:
python-publish.yml@b49de5b4635b20dbf81cc985fb7b913551c5166a -
Trigger Event:
release
-
Statement type: