
vLLM Kunlun3 backend plugin

Project description

vLLM Kunlun Logo

📖 Documentation | 🚀 Quick Start | 📦 Installation | 💬 Slack

GitHub License GitHub Stars GitHub Forks GitHub Issues Python Version


Latest News 🔥

  • [2026/02] 🧠 GLM model family support — Added GLM5, GLM-4.7 MTP (Multi-Token Prediction), and a GLM-4.7 tool parser with a thinking/non-thinking mode toggle
  • [2026/02] ⚡ Performance optimizations — Fused MoE for small batches, faster attention-metadata building, and Multi-LoRA inference reaching 80%+ of non-LoRA throughput
  • [2026/02] 🔧 DeepSeek-V3.2 MTP support — Added Multi-Token Prediction (MTP) for DeepSeek-V3.2, with RoPE and decode-stage kernel optimizations
  • [2026/01] 🔢 New quantization methods — Support for compressed-tensors W4A16, AWQ MoE W4A16, and DeepSeek-V3.2 W8A8 quantization
  • [2026/01] 🛠️ CI/CD overhaul — Added E2E tests, unit-test CI, ruff format checks, and a modular CI workflow refactor
  • [2025/12] 🎉 v0.11.0rc1 released — Added Qwen3-Omni, Qwen3-Next, and Seed-OSS support (Release Notes)
  • [2025/12] 📦 v0.10.1.1 released — 5+ multimodal models, AWQ/GPTQ quantization for dense models, Piecewise CUDA Graph, the vLLM V1 engine, and FlashInfer Top-K/Top-P sampling with a 10-100× speedup (Release Notes)
  • [2025/12] 🌟 Initial release of vLLM Kunlun — Open-sourced on Dec 8, 2025

Overview

vLLM Kunlun (vllm-kunlun) is a community-maintained hardware plugin that runs vLLM seamlessly on the Kunlun XPU. It is the recommended approach for integrating the Kunlun backend within the vLLM community, following the principles outlined in the Hardware Pluggable RFC.

The plugin provides a hardware-pluggable interface that decouples Kunlun XPU integration from vLLM itself. With vLLM Kunlun, popular open-source models — including Transformer-style, Mixture-of-Experts (MoE), embedding, and multimodal LLMs — run effortlessly on the Kunlun XPU.

✨ Key Features

  • Seamless Plugin Integration — Works as a standard vLLM platform plugin via Python entry points; no changes to the vLLM source are needed
  • Broad Model Support — 15+ mainstream LLMs, including Qwen, Llama, DeepSeek, Kimi-K2, and multimodal models
  • Quantization Support — INT8 and other quantization methods for MoE and dense models
  • LoRA Fine-Tuning — LoRA adapter support for Qwen-series models
  • Piecewise Kunlun Graph — Hardware-accelerated graph optimization for high-performance inference
  • FlashMLA Attention — Optimized multi-head latent attention for DeepSeek MLA architectures
  • Tensor Parallelism — Multi-device parallel inference with distributed execution support
  • OpenAI-Compatible API — Serve models through the standard OpenAI API interface
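
The plugin-integration point above relies on a standard Python entry point. A sketch of what the registration looks like, assuming the `vllm.platform_plugins` entry-point group from vLLM's plugin system (the module path shown is illustrative, not necessarily this package's actual layout):

```toml
# pyproject.toml (illustrative fragment)
[project.entry-points."vllm.platform_plugins"]
# vLLM invokes this callable at startup; it returns the import path of the
# Kunlun platform class, so no changes to the vLLM source are required.
kunlun = "vllm_kunlun:register"
```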

Prerequisites

  • Hardware: Kunlun3 P800
  • OS: Ubuntu 22.04
  • Software:
    • Python >= 3.10
    • PyTorch >= 2.5.1
    • vLLM (same version as vllm-kunlun)
    • transformers >= 4.57.0
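
These requirements can be sanity-checked from the target environment with only the standard library; a minimal sketch (`vllm-kunlun` is assumed to be the installed distribution name):

```python
import sys
from importlib.metadata import version, PackageNotFoundError

def installed(dist):
    """Return the installed version of a distribution, or None if absent."""
    try:
        return version(dist)
    except PackageNotFoundError:
        return None

# Python itself must be >= 3.10.
print("python ok:", sys.version_info >= (3, 10))

# Report what is installed; compare against the minimums listed above.
for dist in ("torch", "vllm", "vllm-kunlun", "transformers"):
    print(f"{dist}: {installed(dist)}")
```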

Supported Models

Generative Models

Model Support Quantization LoRA Kunlun Graph
Qwen2 ✅ ✅ ✅ ✅
Qwen2.5 ✅ ✅ ✅ ✅
Qwen3 ✅ ✅ ✅ ✅
Qwen3-Moe ✅ ✅ ✅
Qwen3-Next ✅ ✅ ✅
Qwen3.5 ✅ ✅ ✅
MiMo-V2-Flash ✅ ✅ ✅
Llama2 ✅ ✅ ✅ ✅
Llama3 ✅ ✅ ✅ ✅
Llama3.1 ✅ ✅ ✅
gpt-oss ✅ ✅
GLM4.5 ✅ ✅ ✅
GLM4.5Air ✅ ✅ ✅
GLM4.7 ✅ ✅ ✅
GLM5 ✅ ✅ ✅
DeepSeek-R1 ✅ ✅ ✅
DeepSeek-V3 ✅ ✅ ✅
DeepSeek-V3.2 ✅ ✅ ✅
Kimi-K2 ✅ ✅ ✅
Minimax-M2.5 ✅ ✅ ✅

Multimodal Language Models

Model Support Quantization LoRA Kunlun Graph
Qwen2-VL ✅ ✅ ✅
Qwen2.5-VL ✅ ✅ ✅
Qwen3-VL ✅ ✅ ✅
Qwen3-VL-MoE ✅ ✅ ✅
Qwen3-Omni-MoE ✅ ✅
InternVL-2.5 ✅ ✅
InternVL-3.5 ✅ ✅
InternS1 ✅ ✅
Kimi-K2.5 ✅ ✅ ✅

Performance Visualization 🚀

High-performance computing at work: how different models perform on the Kunlun3 P800.

Test environment: 16-way concurrency, input/output length 2048 tokens.

[Chart: throughput (tgs, tokens generated per second) by model]

Quick Start

Start an OpenAI-Compatible API Server

python -m vllm.entrypoints.openai.api_server \
    --host 0.0.0.0 \
    --port 8356 \
    --model <your-model-path> \
    --gpu-memory-utilization 0.9 \
    --trust-remote-code \
    --max-model-len 32768 \
    --tensor-parallel-size 1 \
    --dtype float16 \
    --max-num-seqs 128 \
    --max-num-batched-tokens 32768 \
    --block-size 128 \
    --distributed-executor-backend mp \
    --served-model-name <your-model-name>

Send a Request

curl http://localhost:8356/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "<your-model-name>",
    "messages": [{"role": "user", "content": "Hello!"}],
    "max_tokens": 512
  }'
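
The same request can be sent from Python using only the standard library; a minimal sketch mirroring the curl call above (the host, port, and model name are the same placeholders, so substitute your own deployment's values):

```python
import json
import urllib.request

BASE_URL = "http://localhost:8356/v1"  # matches the server command above

payload = {
    "model": "<your-model-name>",
    "messages": [{"role": "user", "content": "Hello!"}],
    "max_tokens": 512,
}
body = json.dumps(payload).encode()
print(json.loads(body)["messages"][0]["content"])  # -> Hello!

# With the server running, send the request and read the reply:
# req = urllib.request.Request(
#     f"{BASE_URL}/chat/completions",
#     data=body,
#     headers={"Content-Type": "application/json"},
# )
# with urllib.request.urlopen(req) as resp:
#     print(json.loads(resp.read())["choices"][0]["message"]["content"])
```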

Version Matrix

Version Release Type Documentation
v0.11.0 Latest stable version Quick Start ยท Installation

Architecture

vllm-kunlun/
├── vllm_kunlun/               # Core plugin package
│   ├── platforms/             # Kunlun XPU platform implementation
│   ├── models/                # Model implementations (DeepSeek, Qwen, Llama, etc.)
│   ├── ops/                   # Custom operators (attention, linear, sampling, etc.)
│   │   ├── attention/         # FlashMLA, paged attention, merge attention states
│   │   ├── fla/               # Flash linear attention operations
│   │   └── sample/            # Sampling operators
│   ├── v1/                    # vLLM V1 engine adaptations
│   ├── compilation/           # Torch compile wrapper for Kunlun Graph
│   ├── csrc/                  # C++ extensions (custom CUDA-compatible kernels)
│   └── config/                # Model configuration overrides
├── tests/                     # Test suite
├── docs/                      # Documentation (Sphinx-based, hosted on ReadTheDocs)
├── ci/                        # CI pipeline configurations
├── setup.py                   # Legacy build script (with C++ extensions)
└── pyproject.toml             # Modern Python build configuration (hatchling)
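
At runtime, vLLM locates the platform implementation under platforms/ through the entry point this package registers. Its presence can be checked without importing vLLM at all, assuming the `vllm.platform_plugins` group name from vLLM's plugin system (on a machine without the plugin installed, the listing is simply empty):

```python
from importlib.metadata import entry_points

GROUP = "vllm.platform_plugins"  # group vLLM scans for out-of-tree platforms

try:
    plugins = entry_points(group=GROUP)          # Python >= 3.10 API
except TypeError:
    plugins = entry_points().get(GROUP, [])      # Python 3.9 fallback

# List every registered platform plugin and where it points.
for ep in plugins:
    print(ep.name, "->", ep.value)
print("scan complete")
```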

Contributing

We welcome contributions from the community! Please read our Contributing Guide before submitting a PR.

PR Classification

Use the following prefixes for PR titles:

  • [Attention] — Attention mechanism features/optimizations
  • [Core] — Core vllm-kunlun logic (platform, attention, communicators, model runner)
  • [Kernel] — Compute kernels and ops
  • [Bugfix] — Bug fixes
  • [Doc] — Documentation improvements
  • [Test] — Tests
  • [CI] — CI/CD improvements
  • [Misc] — Other changes

Star History 🔥

We open-sourced the project on Dec 8, 2025. We love open source and collaboration ❤️

Star History Chart


Sponsors 👋

We sincerely appreciate the KunLunXin team for their support in providing XPU resources, which enabled efficient model adaptation debugging, comprehensive end-to-end testing, and broader model compatibility.


License

Apache License 2.0, as found in the LICENSE file.

Project details


Download files

Download the file for your platform.

Source Distribution

vllm_kunlun-0.11.1.tar.gz (661.3 kB)


Built Distribution


vllm_kunlun-0.11.1-py3-none-any.whl (741.1 kB)


File details

Details for the file vllm_kunlun-0.11.1.tar.gz.

File metadata

  • Download URL: vllm_kunlun-0.11.1.tar.gz
  • Upload date:
  • Size: 661.3 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.2.0 CPython/3.9.6

File hashes

Hashes for vllm_kunlun-0.11.1.tar.gz
Algorithm Hash digest
SHA256 1d45c87c7e76128a66ba6707eccf3571dcbecf9369f6be032444c8c26a20ff53
MD5 c990124431f044a523b4743dccdf4bc5
BLAKE2b-256 fce8d1cd0adbaa895c44e5ace0d94b6881595ab142834f970ca2cb355a960d03
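
To verify a downloaded file against the digests above, the standard library suffices. A minimal sketch (the demo hashes a throwaway temp file; for the real check, point `sha256_of` at the downloaded archive and compare with the SHA256 listed above):

```python
import hashlib
import os
import tempfile

def sha256_of(path, chunk=1 << 20):
    """Stream a file through SHA-256 and return its hex digest."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        while data := f.read(chunk):
            h.update(data)
    return h.hexdigest()

# Demo on a throwaway file; in practice, hash vllm_kunlun-0.11.1.tar.gz
# and compare the result against the SHA256 value in the table above.
with tempfile.NamedTemporaryFile(delete=False) as f:
    f.write(b"hello")
    tmp = f.name
print(sha256_of(tmp) == hashlib.sha256(b"hello").hexdigest())  # -> True
os.remove(tmp)
```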


File details

Details for the file vllm_kunlun-0.11.1-py3-none-any.whl.

File metadata

  • Download URL: vllm_kunlun-0.11.1-py3-none-any.whl
  • Upload date:
  • Size: 741.1 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.2.0 CPython/3.9.6

File hashes

Hashes for vllm_kunlun-0.11.1-py3-none-any.whl
Algorithm Hash digest
SHA256 b90e548395f4809a0b4aa592e2ba28f3871abc12938e46216524321bacfde8ab
MD5 e478a24f520e26879ba2437715663519
BLAKE2b-256 ddb26eb7e38fd09a4e3154c9a726eb2f5ae0f300027b125c2a29dae9cd0deab9

