vLLM Kunlun3 backend plugin
Documentation | Quick Start | Installation | Slack
Latest News
- [2026/02] GLM model family support: Added GLM5, GLM-4.7 MTP (Multi-Token Prediction), and GLM-47 tool parser with thinking/non-thinking mode toggle
- [2026/02] Performance optimizations: Fused MoE with small batches, optimized attention metadata building, Multi-LoRA inference achieves 80%+ of non-LoRA performance
- [2026/02] DeepSeek-V3.2 MTP support: Added MTP (Multi-Token Prediction) for DeepSeek-V3.2, with RoPE and decoding stage kernel optimizations
- [2026/01] New quantization methods: Support for compressed-tensors W4A16, AWQ MoE W4A16, and DeepSeek-V3.2 W8A8 quantization
- [2026/01] CI/CD overhaul: Added E2E tests, unit test CI, ruff format checks, and modular CI workflow refactoring
- [2025/12] v0.11.0rc1 released: Added Qwen3-Omni, Qwen3-Next, Seed-OSS support (Release Notes)
- [2025/12] v0.10.1.1 released: 5+ multimodal models, AWQ/GPTQ quantization for dense models, Piecewise CUDA Graph, vLLM V1 engine, Flash-Infer Top-K/Top-P sampling with 10-100x speedup (Release Notes)
- [2025/12] Initial release of vLLM Kunlun: Open sourced on Dec 8, 2025
Overview
vLLM Kunlun (vllm-kunlun) is a community-maintained hardware plugin designed to seamlessly run vLLM on the Kunlun XPU. It is the recommended approach for integrating the Kunlun backend within the vLLM community, adhering to the principles outlined in the RFC Hardware Pluggable.
This plugin provides a hardware-pluggable interface that decouples the integration of the Kunlun XPU from vLLM. With vLLM Kunlun, popular open-source models (including Transformer-like, Mixture-of-Experts (MoE), Embedding, and Multi-modal LLMs) can run on the Kunlun XPU without changes to vLLM itself.
Key Features
- Seamless Plugin Integration: Works as a standard vLLM platform plugin via Python entry points, no need to modify vLLM source code
- Broad Model Support: Supports 15+ mainstream LLMs including Qwen, Llama, DeepSeek, Kimi-K2, and multimodal models
- Quantization Support: INT8 and other quantization methods for MoE and dense models
- LoRA Fine-Tuning: LoRA adapter support for Qwen series models
- Piecewise Kunlun Graph: Hardware-accelerated graph optimization for high-performance inference
- FlashMLA Attention: Optimized multi-head latent attention for DeepSeek MLA architectures
- Tensor Parallelism: Multi-device parallel inference with distributed execution support
- OpenAI-Compatible API: Serve models with the standard OpenAI API interface
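The plugin integration feature above relies on Python entry points. As a rough sketch of how that registration can be inspected, the snippet below lists whatever is installed under the entry-point group `vllm.platform_plugins`. The group name comes from vLLM's plugin system rather than from this README, and `list_platform_plugins` is a hypothetical helper name, so treat both as assumptions and check them against your installed vLLM version.

```python
from importlib.metadata import entry_points

# Assumed entry-point group for vLLM platform plugins (not stated in this README).
PLATFORM_PLUGIN_GROUP = "vllm.platform_plugins"


def list_platform_plugins(group=PLATFORM_PLUGIN_GROUP):
    """Return sorted (name, target) pairs for plugins registered in `group`."""
    eps = entry_points()
    if hasattr(eps, "select"):
        # Python 3.10+: entry points are selectable by group
        selected = eps.select(group=group)
    else:
        # Older Pythons return a plain dict keyed by group name
        selected = eps.get(group, [])
    return sorted((ep.name, ep.value) for ep in selected)


# Print every registered platform plugin; prints nothing if none are installed.
for name, target in list_platform_plugins():
    print(f"{name} -> {target}")
```

If vllm-kunlun is installed correctly, one of the printed entries would be expected to point into the `vllm_kunlun` package.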
Prerequisites
- Hardware: Kunlun3 P800
- OS: Ubuntu 22.04
- Software:
  - Python >= 3.10
  - PyTorch >= 2.5.1
  - vLLM (same version as vllm-kunlun)
  - transformers >= 4.57.0
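The software minimums above can be checked before installing the plugin. The following is a minimal sketch, not an official tool: `check_environment`, `meets_minimum`, and `release_tuple` are hypothetical helper names, the distribution names `torch` and `transformers` are assumed to be how PyTorch and transformers are installed, and the comparison looks only at numeric release parts (pre-release and local tags such as `+cu121` are ignored). The vLLM version match is left out because the README does not say how strictly the versions must agree.

```python
import re
import sys
from importlib.metadata import PackageNotFoundError, version

# Minimum versions copied from the prerequisites list above.
MINIMUMS = {"torch": "2.5.1", "transformers": "4.57.0"}


def release_tuple(ver):
    """Leading numeric part of a version string, e.g. '2.5.1+cu121' -> (2, 5, 1)."""
    match = re.match(r"\d+(?:\.\d+)*", ver or "")
    return tuple(int(p) for p in match.group(0).split(".")) if match else ()


def meets_minimum(installed, minimum):
    """Compare release numbers only; pre-release and local tags are ignored."""
    return release_tuple(installed) >= release_tuple(minimum)


def check_environment():
    """Return {requirement: (found_version, ok)} for Python and each package."""
    py = ".".join(str(n) for n in sys.version_info[:3])
    report = {"python": (py, sys.version_info >= (3, 10))}
    for dist, minimum in MINIMUMS.items():
        try:
            found = version(dist)
            report[dist] = (found, meets_minimum(found, minimum))
        except PackageNotFoundError:
            report[dist] = (None, False)
    return report


# Print a one-line status per requirement.
for name, (found, ok) in check_environment().items():
    print(f"{name}: {found or 'not installed'} [{'ok' if ok else 'check'}]")
```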
Supported Models
Generative Models
| Model | Support | Quantization | LoRA | Kunlun Graph |
|---|---|---|---|---|
| Qwen2 | ✅ | ✅ | ✅ | ✅ |
| Qwen2.5 | ✅ | ✅ | ✅ | ✅ |
| Qwen3 | ✅ | ✅ | ✅ | ✅ |
| Qwen3-Moe | ✅ | ✅ | ✅ | |
| Qwen3-Next | ✅ | ✅ | ✅ | |
| Qwen3.5 | ✅ | ✅ | ✅ | |
| MiMo-V2-Flash | ✅ | ✅ | ✅ | |
| Llama2 | ✅ | ✅ | ✅ | ✅ |
| Llama3 | ✅ | ✅ | ✅ | ✅ |
| Llama3.1 | ✅ | ✅ | ✅ | |
| gpt-oss | ✅ | ✅ | | |
| GLM4.5 | ✅ | ✅ | ✅ | |
| GLM4.5Air | ✅ | ✅ | ✅ | |
| GLM4.7 | ✅ | ✅ | ✅ | |
| GLM5 | ✅ | ✅ | ✅ | |
| DeepSeek-R1 | ✅ | ✅ | ✅ | |
| DeepSeek-V3 | ✅ | ✅ | ✅ | |
| DeepSeek-V3.2 | ✅ | ✅ | ✅ | |
| Kimi-K2 | ✅ | ✅ | ✅ | |
| Minimax-M2.5 | ✅ | ✅ | ✅ | |
Multimodal Language Models
| Model | Support | Quantization | LoRA | Kunlun Graph |
|---|---|---|---|---|
| Qwen2-VL | ✅ | ✅ | ✅ | |
| Qwen2.5-VL | ✅ | ✅ | ✅ | |
| Qwen3-VL | ✅ | ✅ | ✅ | |
| Qwen3-VL-MoE | ✅ | ✅ | ✅ | |
| Qwen3-Omni-MoE | ✅ | ✅ | | |
| InternVL-2.5 | ✅ | ✅ | | |
| InternVL-3.5 | ✅ | ✅ | | |
| InternS1 | ✅ | ✅ | | |
| Kimi-K2.5 | ✅ | ✅ | ✅ | |
Performance Visualization
High-performance computing at work: How different models perform on the Kunlun3 P800.
Current environment: 16-way concurrency, input/output size 2048.
Quick Start
Start an OpenAI-Compatible API Server
python -m vllm.entrypoints.openai.api_server \
    --host 0.0.0.0 \
    --port 8356 \
    --model <your-model-path> \
    --gpu-memory-utilization 0.9 \
    --trust-remote-code \
    --max-model-len 32768 \
    --tensor-parallel-size 1 \
    --dtype float16 \
    --max-num-seqs 128 \
    --max-num-batched-tokens 32768 \
    --block-size 128 \
    --distributed-executor-backend mp \
    --served-model-name <your-model-name>
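Loading a model can take a while, so it may help to wait for the server before sending traffic. The sketch below polls the server until it answers. It assumes the OpenAI-style `/v1/models` route is exposed on the port used above; `server_ready` and `wait_until_ready` are hypothetical helper names, and the attempt count and delay are arbitrary defaults.

```python
import time
import urllib.error
import urllib.request


def server_ready(base_url, timeout_s=2.0):
    """Return True if GET {base_url}/v1/models answers with HTTP 200."""
    try:
        with urllib.request.urlopen(f"{base_url}/v1/models", timeout=timeout_s) as resp:
            return resp.status == 200
    except (urllib.error.URLError, OSError):
        # Connection refused, timeout, or an HTTP error: not ready yet
        return False


def wait_until_ready(base_url, attempts=60, delay_s=5.0, probe=server_ready):
    """Call `probe` up to `attempts` times, sleeping `delay_s` after each failure."""
    for _ in range(attempts):
        if probe(base_url):
            return True
        time.sleep(delay_s)
    return False
```

For the command above, `wait_until_ready("http://localhost:8356")` would return `True` once the model list is served, or `False` after the attempts run out.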
Send a Request
curl http://localhost:8356/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{
        "model": "<your-model-name>",
        "messages": [{"role": "user", "content": "Hello!"}],
        "max_tokens": 512
    }'
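The same request can be sent from Python using only the standard library. This is a minimal sketch that mirrors the curl call above; `build_chat_request` and `chat` are hypothetical helper names, and the response parsing assumes the usual OpenAI chat-completions shape (`choices[0].message.content`).

```python
import json
import urllib.request


def build_chat_request(base_url, model, prompt, max_tokens=512):
    """Build the POST request that mirrors the curl command above."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
    }
    return urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def chat(base_url, model, prompt, max_tokens=512, timeout_s=120):
    """Send the request and return the assistant's reply text."""
    request = build_chat_request(base_url, model, prompt, max_tokens)
    with urllib.request.urlopen(request, timeout=timeout_s) as response:
        body = json.load(response)
    # OpenAI-style responses carry the text under choices[0].message.content
    return body["choices"][0]["message"]["content"]
```

With the server from the previous step running, `chat("http://localhost:8356", "<your-model-name>", "Hello!")` would return the generated reply as a string.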
Version Matrix
| Version | Release Type | Documentation |
|---|---|---|
| v0.11.0 | Latest stable version | Quick Start · Installation |
Architecture
vllm-kunlun/
├── vllm_kunlun/          # Core plugin package
│   ├── platforms/        # Kunlun XPU platform implementation
│   ├── models/           # Model implementations (DeepSeek, Qwen, Llama, etc.)
│   ├── ops/              # Custom operators (attention, linear, sampling, etc.)
│   │   ├── attention/    # FlashMLA, paged attention, merge attention states
│   │   ├── fla/          # Flash linear attention operations
│   │   └── sample/       # Sampling operators
│   ├── v1/               # vLLM V1 engine adaptations
│   ├── compilation/      # Torch compile wrapper for Kunlun Graph
│   ├── csrc/             # C++ extensions (custom CUDA-compatible kernels)
│   └── config/           # Model configuration overrides
├── tests/                # Test suite
├── docs/                 # Documentation (Sphinx-based, ReadTheDocs hosted)
├── ci/                   # CI pipeline configurations
├── setup.py              # Legacy build script (with C++ extensions)
└── pyproject.toml        # Modern Python build configuration (hatchling)
Contributing
We welcome contributions from the community! Please read our Contributing Guide before submitting a PR.
PR Classification
Use the following prefixes for PR titles:
- [Attention]: Attention mechanism features/optimizations
- [Core]: Core vllm-kunlun logic (platform, attention, communicators, model runner)
- [Kernel]: Compute kernels and ops
- [Bugfix]: Bug fixes
- [Doc]: Documentation improvements
- [Test]: Tests
- [CI]: CI/CD improvements
- [Misc]: Other changes
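A title can be checked against these prefixes before opening a PR. The snippet below is only an illustrative sketch: `has_valid_prefix` is a hypothetical helper, not part of the project's CI, and it assumes a title starts with exactly one of the listed prefixes followed by a description.

```python
import re

# Prefixes copied from the PR classification list above.
ALLOWED_PREFIXES = ("Attention", "Core", "Kernel", "Bugfix", "Doc", "Test", "CI", "Misc")

# A valid title starts with "[Prefix]" and is followed by some description text.
_TITLE_PATTERN = re.compile(r"^\[(?:%s)\]\s*\S" % "|".join(ALLOWED_PREFIXES))


def has_valid_prefix(title):
    """Return True if the PR title begins with one of the allowed prefixes."""
    return _TITLE_PATTERN.match(title) is not None
```

For example, `has_valid_prefix("[Bugfix] Fix sampler crash")` is true, while a title without a bracketed prefix, or with a prefix outside the list, is rejected.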
Star History
We open-sourced the project on Dec 8, 2025. We love open source and collaboration ❤️
Sponsors
We sincerely appreciate the KunLunXin team for their support in providing XPU resources, which enabled efficient model adaptation and debugging, comprehensive end-to-end testing, and broader model compatibility.
License
Apache License 2.0, as found in the LICENSE file.
Download files
File details
Details for the file vllm_kunlun-0.11.1.tar.gz.
File metadata
- Download URL: vllm_kunlun-0.11.1.tar.gz
- Upload date:
- Size: 661.3 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.9.6
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 1d45c87c7e76128a66ba6707eccf3571dcbecf9369f6be032444c8c26a20ff53 |
| MD5 | c990124431f044a523b4743dccdf4bc5 |
| BLAKE2b-256 | fce8d1cd0adbaa895c44e5ace0d94b6881595ab142834f970ca2cb355a960d03 |
File details
Details for the file vllm_kunlun-0.11.1-py3-none-any.whl.
File metadata
- Download URL: vllm_kunlun-0.11.1-py3-none-any.whl
- Upload date:
- Size: 741.1 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.9.6
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | b90e548395f4809a0b4aa592e2ba28f3871abc12938e46216524321bacfde8ab |
| MD5 | e478a24f520e26879ba2437715663519 |
| BLAKE2b-256 | ddb26eb7e38fd09a4e3154c9a726eb2f5ae0f300027b125c2a29dae9cd0deab9 |
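A downloaded file can be compared against the published SHA256 digest before installing it. This is a small sketch using only the standard library; `sha256_of` and `matches_published` are hypothetical helper names, and the local file path is whatever location you saved the download to.

```python
import hashlib


def sha256_of(path, chunk_size=1 << 20):
    """Return the hex SHA256 digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_published(path, expected_hex):
    """Compare a file's SHA256 with a published digest (case-insensitive)."""
    return sha256_of(path) == expected_hex.strip().lower()
```

Pass the SHA256 value listed for the file you downloaded as `expected_hex`; a `False` result suggests the download is incomplete or does not match the published file.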