Skip to main content

A CLI tool to conveniently serve LLMs with vLLM

Project description

vLLM CLI

CI Release PyPI version License: MIT Python 3.9+ PyPI Downloads

A command-line interface tool for serving Large Language Models using vLLM. Provides both interactive and command-line modes with features for configuration profiles, model management, and server monitoring.

vLLM CLI Welcome Screen Interactive terminal interface with GPU status and system overview
Tip: You can customize the GPU stats bar in settings

Features

  • ๐ŸŽฏ Interactive Mode - Rich terminal interface with menu-driven navigation
  • โšก Command-Line Mode - Direct CLI commands for automation and scripting
  • ๐Ÿค– Model Management - Automatic discovery of local models with HuggingFace and Ollama support
  • ๐Ÿ”ง Configuration Profiles - Pre-configured and custom server profiles for different use cases
  • ๐Ÿ“Š Server Monitoring - Real-time monitoring of active vLLM servers
  • ๐Ÿ–ฅ๏ธ System Information - GPU, memory, and CUDA compatibility checking
  • ๐Ÿ“ Advanced Configuration - Full control over vLLM parameters with validation

Quick Links: ๐Ÿ“– Docs | ๐Ÿš€ Quick Start | ๐Ÿ“ธ Screenshots | ๐Ÿ“˜ Usage Guide | โ“ Troubleshooting | ๐Ÿ—บ๏ธ Roadmap

What's New in v0.2.5

Multi-Model Proxy Server (Experimental)

The Multi-Model Proxy is a new experimental feature that enables serving multiple LLMs through a single unified API endpoint. This feature is currently under active development and available for testing.

What It Does:

  • Single Endpoint - All your models accessible through one API
  • Live Management - Add or remove models without stopping the service
  • Dynamic GPU Management - Efficient GPU resource distribution through vLLM's sleep/wake functionality
  • Interactive Setup - User-friendly wizard guides you through configuration

Note: This is an experimental feature under active development. Your feedback helps us improve! Please share your experience through GitHub Issues.

For complete documentation, see the ๐ŸŒ Multi-Model Proxy Guide.

What's New in v0.2.4

๐Ÿš€ Hardware-Optimized Profiles for GPT-OSS Models

New built-in profiles specifically optimized for serving GPT-OSS models on different GPU architectures:

  • gpt_oss_ampere - Optimized for NVIDIA A100 GPUs
  • gpt_oss_hopper - Optimized for NVIDIA H100/H200 GPUs
  • gpt_oss_blackwell - Optimized for NVIDIA Blackwell GPUs

Based on official vLLM GPT recipes for maximum performance.

โšก Shortcuts System

Save and quickly launch your favorite model + profile combinations:

vllm-cli serve --shortcut my-gpt-server

๐Ÿฆ™ Full Ollama Integration

  • Automatic discovery of Ollama models
  • GGUF format support (experimental)
  • System and user directory scanning

๐Ÿ”ง Enhanced Configuration

  • Environment Variables - Universal and profile-specific environment variable management
  • GPU Selection - Choose specific GPUs for model serving (--device 0,1)
  • Enhanced System Info - vLLM feature detection with attention backend availability

See CHANGELOG.md for detailed release notes.

Quick Start

Important: vLLM Installation Notes

โš ๏ธ Binary Compatibility Warning: vLLM contains pre-compiled CUDA kernels that must match your PyTorch version exactly. Installing mismatched versions will cause errors.

vLLM-CLI will not install vLLM or Pytorch by default.

Installation

Option 1: Install vLLM seperately and then install vLLM CLI (Recommended)

# Install vLLM -- Skip this step if you have vllm installed in your environment
uv venv --python 3.12 --seed
source .venv/bin/activate
uv pip install vllm --torch-backend=auto
# Or specify a backend: uv pip install vllm --torch-backend=cu128

# Install vLLM CLI
uv pip install --upgrade vllm-cli
uv run vllm-cli

# If you are using conda:
# Activate the environment you have vllm installed in
pip install vllm-cli
vllm-cli

Option 2: Install vLLM CLI + vLLM

# Install vLLM CLI + vLLM
pip install vllm-cli[vllm]
vllm-cli

Option 3: Build from source (You still need to install vLLM seperately)

git clone https://github.com/Chen-zexi/vllm-cli.git
cd vllm-cli
pip install -e .

Option 4: For Isolated Installation (pipx/system packages)

โš ๏ธ Compatibility Note: pipx creates isolated environments which may have compatibility issues with vLLM's CUDA dependencies. Consider using uv or conda (see above) for better PyTorch/CUDA compatibility.

# If you do not want to use virtual environment and want to install vLLM along with vLLM CLI
pipx install "vllm-cli[vllm]"

# If you want to install pre-release version
pipx install --pip-args="--pre" "vllm-cli[vllm]"

Prerequisites

  • Python 3.9+
  • CUDA-compatible GPU (recommended)
  • vLLM package installed
  • For dependency issues, see Troubleshooting Guide

Basic Usage

# Interactive mode - menu-driven interface
vllm-cl
# Serve a model
vllm-cli serve --model openai/gpt-oss-20b

# Use a shortcut
vllm-cli serve --shortcut my-model

For detailed usage instructions, see the ๐Ÿ“˜ Usage Guide and ๐ŸŒ Multi-Model Proxy Guide.

Configuration

Built-in Profiles

vLLM CLI includes 7 optimized profiles for different use cases:

General Purpose:

  • standard - Minimal configuration with smart defaults
  • high_throughput - Maximum performance configuration
  • low_memory - Memory-constrained environments
  • moe_optimized - Optimized for Mixture of Experts models

Hardware-Specific (GPT-OSS):

  • gpt_oss_ampere - NVIDIA A100 GPUs
  • gpt_oss_hopper - NVIDIA H100/H200 GPUs
  • gpt_oss_blackwell - NVIDIA Blackwell GPUs

See ๐Ÿ“‹ Profiles Guide for detailed information.

Configuration Files

  • Main Config: ~/.config/vllm-cli/config.yaml
  • User Profiles: ~/.config/vllm-cli/user_profiles.json
  • Shortcuts: ~/.config/vllm-cli/shortcuts.json

Documentation

Integration with hf-model-tool

vLLM CLI uses hf-model-tool for model discovery:

  • Comprehensive model scanning
  • Ollama model support
  • Shared configuration

Development

Project Structure

src/vllm_cli/
โ”œโ”€โ”€ cli/           # CLI command handling
โ”œโ”€โ”€ config/        # Configuration management
โ”œโ”€โ”€ models/        # Model management
โ”œโ”€โ”€ server/        # Server lifecycle
โ”œโ”€โ”€ ui/            # Terminal interface
โ””โ”€โ”€ schemas/       # JSON schemas

Contributing

Contributions are welcome! Please feel free to open an issue or submit a pull request.

License

MIT License - see LICENSE file for details.

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

vllm_cli-0.2.5.tar.gz (230.4 kB view details)

Uploaded Source

Built Distribution

If you're not sure about the file name format, learn more about wheel file names.

vllm_cli-0.2.5-py3-none-any.whl (249.6 kB view details)

Uploaded Python 3

File details

Details for the file vllm_cli-0.2.5.tar.gz.

File metadata

  • Download URL: vllm_cli-0.2.5.tar.gz
  • Upload date:
  • Size: 230.4 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? Yes
  • Uploaded via: twine/6.1.0 CPython/3.12.9

File hashes

Hashes for vllm_cli-0.2.5.tar.gz
Algorithm Hash digest
SHA256 4b384958a82f40650e63b4bdf5538c543c35c3680422ab66ec2ccba7ce0b032d
MD5 70393b1acb280ea54a880cc1034446ab
BLAKE2b-256 ad734ebb1471317a6dfa55bc9c8c21fe5d2d108f22aaa8357702b13a4d9069e8

See more details on using hashes here.

Provenance

The following attestation bundles were made for vllm_cli-0.2.5.tar.gz:

Publisher: python-publish.yml on Chen-zexi/vllm-cli

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

File details

Details for the file vllm_cli-0.2.5-py3-none-any.whl.

File metadata

  • Download URL: vllm_cli-0.2.5-py3-none-any.whl
  • Upload date:
  • Size: 249.6 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? Yes
  • Uploaded via: twine/6.1.0 CPython/3.12.9

File hashes

Hashes for vllm_cli-0.2.5-py3-none-any.whl
Algorithm Hash digest
SHA256 b100dd8b001adda684e8035ee306fc024604317200b1db3d8cfebbdd4d9f2df8
MD5 0e48b8caeaaed2468b1a0b5ec9aa2a9e
BLAKE2b-256 8c1238f0a8d045ab23dac015d37a268bcc480e0738a71d13005ab66609b7486a

See more details on using hashes here.

Provenance

The following attestation bundles were made for vllm_cli-0.2.5-py3-none-any.whl:

Publisher: python-publish.yml on Chen-zexi/vllm-cli

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

Supported by

AWS Cloud computing and Security Sponsor Datadog Monitoring Depot Continuous Integration Fastly CDN Google Download Analytics Pingdom Monitoring Sentry Error logging StatusPage Status page