
NorthServing

A one-click LLM serving deployment tool for Kubernetes with Volcano job scheduling.

Overview

NorthServing (ๅŒ—ๆœ) is a Python-based tool that simplifies the deployment and management of Large Language Model (LLM) serving infrastructure on Kubernetes. It provides a unified command-line interface for deploying models using various backends (vLLM, SGLang, etc.) with support for multi-node, multi-GPU configurations.

Features

  • 🚀 One-Click Deployment: Launch LLM serving with a single command
  • 🔄 Multiple Backends: Support for vLLM, SGLang, and other inference engines
  • 📊 Performance Benchmarking: Built-in benchmarking tools with Feishu reporting
  • 🌐 Multi-Cluster Support: Deploy across different Kubernetes clusters
  • ⚡ Advanced Configurations:
    • PD (Prefill-Decode) separation mode
    • Multi-node deployments with Ray
    • Tensor/Pipeline parallelism
    • Custom resource scheduling
  • 🧪 Well-Tested: Comprehensive test suite with >80% coverage

Installation

Prerequisites

  • Python >= 3.8
  • Kubernetes cluster with Volcano scheduler
  • kubectl configured
  • Access to Infrawave API (for job management)

Install from PyPI

# Install from internal PyPI server
pip install northserve -i http://10.51.6.7:31624/simple/ --trusted-host 10.51.6.7 --extra-index-url https://mirrors.tuna.tsinghua.edu.cn/pypi/web/simple

# Or configure pip once (recommended)
mkdir -p ~/.pip
cat > ~/.pip/pip.conf << 'EOF'
[global]
index-url = http://10.51.6.7:31624/simple/
trusted-host = 10.51.6.7
EOF

# Then install normally
pip install northserve

Install from Source

git clone https://github.com/china-qijizhifeng/NorthServing.git
cd NorthServing

# Install dependencies from Tsinghua mirror
pip install -r requirements.txt -i https://mirrors.tuna.tsinghua.edu.cn/pypi/web/simple

# Install in development mode
pip install -e .

For a detailed installation guide, see INSTALL.md.

Configuration

Set up your credentials using environment variables:

# Add to your ~/.bashrc or ~/.zshrc
export INFRAWAVES_USERNAME='your_username'
export INFRAWAVES_PASSWORD='your_password'

# Apply changes
source ~/.bashrc  # or source ~/.zshrc

Quick Start

Launch a Model

northserve launch \
  --model-name qwen2-72b-instruct \
  --model-path /gpfs/models/huggingface.co/Qwen/Qwen2-72B-Instruct/ \
  --replicas 1 \
  --gpus-per-pod 8 \
  --profile generation
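Query the Model

Once the deployment is ready it serves an OpenAI-compatible API (the default --protocol openai). The sketch below builds a minimal chat-completion request; the service address, port, and endpoint path are assumptions, so substitute whatever address your deployment actually exposes:

```python
import json
import urllib.request

# The base URL is an assumption -- use the address your deployment actually
# exposes (for example the NodePort created with --standalone).
BASE_URL = "http://<service-address>:8000"

def build_chat_request(model: str, prompt: str) -> dict:
    """Build a minimal OpenAI-style chat-completion payload."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": 128,
    }

payload = build_chat_request("qwen2-72b-instruct", "Hello!")
# Sending the request (uncomment once BASE_URL points at a live service):
# req = urllib.request.Request(
#     BASE_URL + "/v1/chat/completions",
#     data=json.dumps(payload).encode("utf-8"),
#     headers={"Content-Type": "application/json"},
# )
# with urllib.request.urlopen(req) as resp:
#     print(json.load(resp)["choices"][0]["message"]["content"])
```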

List Running Models

northserve list

Stop a Model

northserve stop --model-name qwen2-72b-instruct

Command Reference

northserve launch

Launch a new LLM serving deployment.

Required Options:

  • --model-name: Model name used to identify the deployment
  • --model-path: Path to model weights (may be omitted for some backends)

Common Options:

  • --backend: Inference backend (default: vllm)
    • vllm: vLLM inference engine
    • sglang: SGLang inference engine
    • bp-vllm: BP-optimized vLLM
    • crossing: Crossing inference engine
  • --protocol: API protocol (default: openai)
    • openai: OpenAI-compatible API
    • anthropic: Anthropic-compatible API
  • --replicas: Number of replicas (default: 1)
  • --gpus-per-pod: GPUs per pod (default: 1)
  • --pods-per-job: Pods per job for multi-node (default: 1)
  • --gpu-type: GPU type (one of gpu, h20, 4090d)
  • --namespace: Kubernetes namespace (default: qiji)
  • --priority-class-name: Priority class (default: low-priority-job)

Advanced Options:

  • --extra-cmds: Additional command-line arguments for the engine
  • --extra-envs: Extra environment variables (KEY=value KEY2=value2)
  • --tensor-parallel-size: Tensor parallelism (defaults to gpus-per-pod)
  • --pipeline-parallel-size: Pipeline parallelism (default: 1)
  • --prefill-nodes: Prefill nodes for PD separation (SGLang only)
  • --decode-nodes: Decode nodes for PD separation (SGLang only)
  • --use-host-network: Use host network
  • --standalone: Create standalone service with NodePort
  • -y, --yes: Skip confirmation prompts
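
The --extra-envs value is a single space-separated string of KEY=value pairs. As an illustrative sketch of how that format maps to a dict of environment variables (this is not NorthServing's internal parser):

```python
def parse_extra_envs(spec: str) -> dict:
    """Split a space-separated 'KEY=value KEY2=value2' string into a dict.

    Illustrates the documented --extra-envs format; a sketch only,
    not the tool's actual implementation.
    """
    envs = {}
    for pair in spec.split():
        key, sep, value = pair.partition("=")
        if not sep:
            raise ValueError(f"expected KEY=value, got {pair!r}")
        envs[key] = value
    return envs

# e.g. --extra-envs "CUDA_LAUNCH_BLOCKING=1 HF_HUB_OFFLINE=1"
envs = parse_extra_envs("CUDA_LAUNCH_BLOCKING=1 HF_HUB_OFFLINE=1")
```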

Examples:

Simple deployment:

northserve launch --model-name llama2-7b --model-path /gpfs/models/llama2-7b --gpus-per-pod 1

Multi-GPU deployment:

northserve launch \
  --model-name qwen2-72b \
  --model-path /gpfs/models/qwen2-72b \
  --gpus-per-pod 8 \
  --replicas 2

With custom arguments:

northserve launch \
  --model-name mistral-large \
  --model-path /gpfs/models/mistral-large \
  --gpus-per-pod 8 \
  --extra-cmds "--max-num-batched-tokens=16384 --max-model-len=16384 --enforce-eager"

PD separation mode (SGLang):

northserve launch \
  --model-name qwen2-72b \
  --model-path /gpfs/models/qwen2-72b \
  --backend sglang \
  --gpus-per-pod 8 \
  --prefill-nodes 2 \
  --decode-nodes 4 \
  --minilb-replicas 4

northserve stop

Stop a running deployment.

Options:

  • --model-name: Model name to stop (required)
  • --backend: Backend type (default: vllm)
  • --namespace: Kubernetes namespace (default: qiji)
  • --standalone: Stop standalone service
  • -y, --yes: Skip confirmation

Example:

northserve stop --model-name qwen2-72b-instruct

northserve list

List all deployed models and their status.

Example:

northserve list

northserve benchmark

Performance benchmarking commands.

northserve benchmark launch

Launch a benchmark test on a running deployment.

Options:

  • --model-name: Model name to benchmark (required)
  • --model-path: Path to model weights (required)
  • --backend: Backend type
  • --namespace: Kubernetes namespace

Example:

northserve benchmark launch \
  --model-name qwen2-72b \
  --model-path /gpfs/models/qwen2-72b \
  --backend vllm

northserve benchmark report

Report benchmark results to Feishu.

Options:

  • --log-path: Path to benchmark logs (required)
  • --config-file: Path to Feishu config file (required)

Example:

northserve benchmark report \
  --log-path ~/.northserve/logs/qwen2-72b-vllm-server-0 \
  --config-file ~/.northserve/feishu.json
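
The exact schema of the Feishu config file is not documented here; a typical webhook-style config might look like the following, but the key names are assumptions, so check your deployment's documentation:

```json
{
  "webhook_url": "https://open.feishu.cn/open-apis/bot/v2/hook/<token>",
  "secret": "<signing-secret>"
}
```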

northserve launch_north_llm_api

Launch the North LLM API service.

Options:

  • --version: Version to deploy (default: v0.2.3)
  • --replicas: Number of replicas (default: 1)
  • --namespace: Kubernetes namespace (default: qiji)

Example:

northserve launch_north_llm_api --version v0.2.3 --replicas 2

northserve stop_north_llm_api

Stop the North LLM API service.

Example:

northserve stop_north_llm_api --version v0.2.3

Architecture

NorthServing follows a modular architecture:

┌──────────────────────────────────────┐
│        CLI Interface (Click)         │
└──────────────────┬───────────────────┘
                   │
        ┌──────────┴──────────┐
        │                     │
   ┌────▼────┐         ┌──────▼──────┐
   │Commands │         │ Core Logic  │
   └────┬────┘         └──────┬──────┘
        │                     │
        │    ┌────────────────┴────────────────┐
        │    │                                 │
        │ ┌──▼──────────┐          ┌───────────▼──────┐
        │ │ Job Manager │          │ Config Builder   │
        │ └──┬──────────┘          └───────────┬──────┘
        │    │                                 │
        │ ┌──▼────────────┐        ┌───────────▼──────┐
        └─┤ API Clients   │        │Template Renderer │
          └───────────────┘        └──────────────────┘

Key Components

  • CLI Layer: Click-based command-line interface
  • Commands: Individual command implementations (launch, stop, list, etc.)
  • Core Logic:
    • JobManager: Orchestrates deployment lifecycle
    • ConfigBuilder: Builds deployment configurations
    • TemplateRenderer: Jinja2 template rendering
    • BenchmarkEngine: Performance testing
  • API Clients:
    • InfrawaveClient: Infrawave API integration
    • KubernetesClient: Direct kubectl operations
  • Models: Type-safe data models with validation
  • Utils: Validators, logger, helpers
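
As a much-simplified illustration of how these layers compose (all class and method names below are invented for illustration, string.Template stands in for Jinja2, and job submission is stubbed out):

```python
from string import Template

class ConfigBuilder:
    """Builds a deployment configuration, deriving documented defaults."""
    def build(self, model_name: str, gpus_per_pod: int) -> dict:
        return {
            "model_name": model_name,
            "gpus_per_pod": gpus_per_pod,
            # tensor-parallel-size defaults to gpus-per-pod, as documented above
            "tensor_parallel_size": gpus_per_pod,
        }

class TemplateRenderer:
    """Renders a (toy) manifest; the real package uses Jinja2 templates."""
    TEMPLATE = Template("job: $model_name\ngpus: $gpus_per_pod")
    def render(self, config: dict) -> str:
        return self.TEMPLATE.substitute(
            model_name=config["model_name"],
            gpus_per_pod=config["gpus_per_pod"],
        )

class JobManager:
    """Orchestrates the flow: build config -> render manifest -> submit."""
    def __init__(self):
        self.builder = ConfigBuilder()
        self.renderer = TemplateRenderer()
    def launch(self, model_name: str, gpus_per_pod: int) -> str:
        config = self.builder.build(model_name, gpus_per_pod)
        # The real JobManager would submit this as a Volcano job via the
        # Infrawave API; here we just return the rendered manifest.
        return self.renderer.render(config)

manifest = JobManager().launch("qwen2-72b", 8)
```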

Development

Running Tests

# Install dev dependencies
pip install -r requirements-dev.txt

# Run all tests
pytest

# Run with coverage
pytest --cov=northserve --cov-report=html

# Run specific test file
pytest tests/test_core/test_config_builder.py

Code Quality

# Format code
black northserve tests

# Sort imports
isort northserve tests

# Lint
flake8 northserve tests

# Type checking
mypy northserve

Migration from Shell Version

The Python version maintains backward compatibility with the shell-based version:

  • Same Command Interface: All commands work the same way
  • Same Configuration Files: YAML configs and templates unchanged
  • Same Output: Identical deployment behavior

To use the new version, simply install it and use northserve instead of the old shell script.

Troubleshooting

Common Issues

"Config file not found"

  • Ensure ~/.config/northjob/userinfo.conf exists with valid credentials

"Failed to create job"

  • Check Infrawave API connectivity
  • Verify your credentials are correct
  • Ensure you have permissions for the namespace

"Template not found"

  • Make sure you installed from the repository root
  • YAML templates should be in yaml_templates/ directory

"Invalid backend"

  • Use one of: vllm, sglang, bp-vllm, crossing
  • Note: nla-vllm is deprecated, use bp-vllm

Debug Mode

Enable debug logging:

export NORTHSERVE_LOG_LEVEL=DEBUG
northserve launch ...

Skip auto-update checks:

export NORTHSERVE_SKIP_UPDATE=1
northserve launch ...

Contributing

Contributions are welcome! Please:

  1. Fork the repository
  2. Create a feature branch
  3. Make your changes with tests
  4. Run the test suite
  5. Submit a pull request

License

See LICENSE file for details.

Support

For issues and questions, please open an issue on the GitHub repository (https://github.com/china-qijizhifeng/NorthServing).

Why NorthServing?

  • ✅ Training-Inference Unified Scheduling: Uses Volcano jobs compatible with training workloads
  • ✅ Multi-Backend Support: Unified interface for different inference engines
  • ✅ Cross-Cluster Deployment: Deploy to multiple clusters with unified ingress
  • ✅ Production Ready: Mature codebase with comprehensive testing
  • ✅ Easy Automation: Command-line interface perfect for CI/CD pipelines

Download files

Source Distribution

  • northserve-2.0.5.tar.gz (57.9 kB, uploaded via twine/6.2.0 CPython/3.12.12)
    • SHA256: 42f4fe60c7a63f05c856b8645af2da5e87267ef186ee68478a6a713d45be82af
    • MD5: ae876f60dd225fe7b3b5912a7f94c574
    • BLAKE2b-256: fb575b4e5289d58929653d09c77bb37fa45cab9b14596d268a8fec5330743ec0

Built Distribution

  • northserve-2.0.5-py3-none-any.whl (76.2 kB, Python 3, uploaded via twine/6.2.0 CPython/3.12.12)
    • SHA256: 1dfeef03890c290181876a73e61e1bcd381f072851f9d1e296ad1919f8b30712
    • MD5: cb9168f70f22965ffa2989d206b160c8
    • BLAKE2b-256: 9ae058a71347a2ab102b41f1c616379810e063e544d471f1ae91a5d0e052af68
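
To check a downloaded file against the published digests before installing, a streaming SHA256 computation with the standard library suffices:

```python
import hashlib

def sha256_of(path: str) -> str:
    """Compute the SHA256 hex digest of a file in 1 MiB chunks."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest()

# Compare against the published digest, e.g.:
# sha256_of("northserve-2.0.5.tar.gz") should equal
# "42f4fe60c7a63f05c856b8645af2da5e87267ef186ee68478a6a713d45be82af"
```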
