A lightweight vLLM implementation built from scratch that runs on NPU.

Project description

Nano-vLLM-NPU

A lightweight vLLM implementation built from scratch.

Nano-vllm-npu started as a fork of nano-vllm and adapts it to run on Ascend NPU.

Installation

pip install git+https://github.com/voidvelocity/nano-vllm-npu.git

Model Download

To download the model weights manually, use the following command:

huggingface-cli download --resume-download Qwen/Qwen3-0.6B \
  --local-dir ~/huggingface/Qwen3-0.6B/ \
  --local-dir-use-symlinks False
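
The same download can probably be done from Python. The following is a minimal sketch, assuming the huggingface_hub package is installed; `download_model` is a hypothetical helper name, and the sketch leaves out the two extra CLI flags and relies on the library defaults instead.

```python
from pathlib import Path


def download_model(
    repo_id: str = "Qwen/Qwen3-0.6B",
    local_dir: str = "~/huggingface/Qwen3-0.6B/",
) -> Path:
    """Download a model snapshot and return the local directory (hypothetical helper)."""
    # Imported lazily so that defining this helper does not require
    # huggingface_hub to be installed.
    from huggingface_hub import snapshot_download

    target = Path(local_dir).expanduser()
    # Fetch every file of the repository into the target directory.
    snapshot_download(repo_id=repo_id, local_dir=target)
    return target
```

Calling `download_model()` needs network access; the returned path can then be passed to `LLM(...)` as the model path.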

Quick Start

See example.py for usage. The API mirrors vLLM's interface with minor differences in the LLM.generate method:

import torch._dynamo
torch._dynamo.config.suppress_errors = True

from nanovllm import LLM, SamplingParams
llm = LLM("/YOUR/MODEL/PATH", enforce_eager=True, tensor_parallel_size=1)
sampling_params = SamplingParams(temperature=0.6, max_tokens=256)
prompts = ["Hello, Nano-vLLM."]
outputs = llm.generate(prompts, sampling_params)
outputs[0]["text"]

Note:

In my tests, enforce_eager always had to be set to True (I don't know why yet).
I exported ASCEND_LAUNCH_BLOCKING=1, but I'm not sure whether it is necessary.
torch._dynamo.config.suppress_errors = True is needed to suppress errors from _dynamo.
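
The notes above can be collected into a small setup snippet placed at the top of a script. This is only a sketch: setting the environment variable before importing torch is a conservative assumption rather than a confirmed requirement, and `DYNAMO_CONFIGURED` is a name introduced here for illustration.

```python
import os

# The author exported this variable but was unsure whether it is required.
# Setting it before torch is imported is the conservative choice (assumption).
os.environ.setdefault("ASCEND_LAUNCH_BLOCKING", "1")

try:
    import torch._dynamo

    # When compilation fails, fall back to eager execution instead of raising.
    torch._dynamo.config.suppress_errors = True
    DYNAMO_CONFIGURED = True  # illustrative flag, not part of nano-vllm-npu
except ImportError:
    # torch is not installed in this environment; nothing to configure.
    DYNAMO_CONFIGURED = False
```

`setdefault` keeps any value that was already exported in the shell, so the snippet does not override an existing setting.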

Compare with `nano-vllm`

Main changes compared with nano-vllm

nano-vllm runs on GPU, while nano-vllm-npu runs on NPU. In the code, torch.cuda is replaced by torch.npu and CUDAGraph by NPUGraph.
nano-vllm uses triton and flash-attention to optimize KV-cache storage and attention, which improves performance. For now, nano-vllm-npu implements attention in plain PyTorch, which is much slower; we plan to optimize it in the future. For details, see nanovllmnpu/layers/attention.py and compare it with nanovllm/layers/attention.py.
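
To give a rough idea of what attention written in plain PyTorch looks like, here is a minimal sketch. It is not the code from the repository's attention.py: it handles a single sequence, ignores the paged KV cache, and the names `pick_device` and `naive_causal_attention` are made up for illustration.

```python
import math

import torch


def pick_device() -> torch.device:
    """Return an NPU device when torch_npu is usable, otherwise fall back to CPU."""
    try:
        import torch_npu  # noqa: F401  (importing it registers torch.npu)

        if torch.npu.is_available():
            return torch.device("npu")
    except ImportError:
        pass
    return torch.device("cpu")


def naive_causal_attention(q: torch.Tensor, k: torch.Tensor, v: torch.Tensor) -> torch.Tensor:
    """Causal attention without fused kernels.

    q has shape (heads, q_len, head_dim); k and v have shape
    (heads, kv_len, head_dim) with kv_len >= q_len, so the same function
    covers prefill (q_len == kv_len) and decode (q_len == 1).
    """
    head_dim = q.size(-1)
    scores = q @ k.transpose(-2, -1) / math.sqrt(head_dim)  # (heads, q_len, kv_len)
    q_len, kv_len = scores.shape[-2:]
    # Query i sits at absolute position kv_len - q_len + i and may only
    # attend to keys at positions up to and including its own.
    allowed = torch.tril(
        torch.ones(q_len, kv_len, device=q.device), diagonal=kv_len - q_len
    ).bool()
    scores = scores.masked_fill(~allowed, float("-inf"))
    return torch.softmax(scores, dim=-1) @ v
```

The full score matrix is materialized in memory here, which is one reason an approach like this is slower than fused kernels such as flash-attention.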

Demo

Hardware:
CANN Version:
PyTorch Version:
Torch NPU Version:

Output:

# git clone https://github.com/voidvelocity/nano-vllm-npu.git
# cd nano-vllm-npu
# python example.py
[rank0]:[2026-01-31 17:52:38,326] torch._dynamo.convert_frame: [WARNING] WON'T CONVERT rms_forward /home/my_demo/nano-vllm-npu/nanovllm/layers/layernorm.py line 16
[rank0]:[2026-01-31 17:52:38,326] torch._dynamo.convert_frame: [WARNING] due to:
[rank0]:[2026-01-31 17:52:38,326] torch._dynamo.convert_frame: [WARNING] Traceback (most recent call last):
[rank0]:[2026-01-31 17:52:38,326] torch._dynamo.convert_frame: [WARNING]   File "/usr/local/lib/python3.11/site-packages/torch_npu/utils/_dynamo.py", line 428, in _check_wrapper_exist
[rank0]:[2026-01-31 17:52:38,326] torch._dynamo.convert_frame: [WARNING]     raise AssertionError(f"Device {device_type} not supported" + pta_error(ErrCode.NOT_SUPPORT))
[rank0]:[2026-01-31 17:52:38,326] torch._dynamo.convert_frame: [WARNING] torch._dynamo.exc.BackendCompilerFailed: backend='inductor' raised:
[rank0]:[2026-01-31 17:52:38,326] torch._dynamo.convert_frame: [WARNING] AssertionError: Device npu not supported
[rank0]:[2026-01-31 17:52:38,326] torch._dynamo.convert_frame: [WARNING] [ERROR] 2026-01-31-17:52:38 (PID:3862154, Device:0, RankID:0) ERR00007 PTA feature not supported
[rank0]:[2026-01-31 17:52:38,326] torch._dynamo.convert_frame: [WARNING]
[rank0]:[2026-01-31 17:52:38,326] torch._dynamo.convert_frame: [WARNING] Set TORCH_LOGS="+dynamo" and TORCHDYNAMO_VERBOSE=1 for more information
[rank0]:[2026-01-31 17:52:38,326] torch._dynamo.convert_frame: [WARNING]
...
[rank0]:[2026-01-31 17:52:41,046] torch._dynamo.convert_frame: [WARNING]
Generating: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 2/2 [04:39<00:00, 139.61s/it, Prefill=68tok/s, Decode=0tok/s]


Prompt: '<|im_start|>user\nintroduce yourself<|im_end|>\n<|im_start|>assistant\n'
Completion: "<think>\nOkay, the user wants me to introduce myself. First, I need to provide a general and friendly description. I should mention my name, age, and background. But since I'm an AI, I don't have a real name, so I'll say I'm an AI assistant. I should also mention my purpose, like helping users with questions. I should keep it simple and positive. Let me make sure I'm not using any technical terms and keep it conversational. Alright, that should cover it.\n</think>\n\nHello! I'm an AI assistant designed to help you with questions, tasks, and support. I can assist with a wide range of topics, from general knowledge to specific queries. How can I assist you today?<|im_end|>"


Prompt: '<|im_start|>user\nlist all prime numbers within 100<|im_end|>\n<|im_start|>assistant\n'
Completion: "<think>\nOkay, so I need to list all the prime numbers between 100. Let me think about how to approach this. First, I remember that a prime number is a number greater than 1 that has no positive divisors other than 1 and itself. So, starting from 100, I need to check each number and see if it's prime.\n\nLet me start by recalling some prime numbers. The first few primes are 2, 3, 5, 7, 11, 13, 17, 19, 23, 29, 31, 37, 41, 43, 47, 53, 59, 61, 67, 71, 73, 79, 83, 89, 97. But wait, these are all primes up to 100. So, I need to make sure that I don't miss any.\n\nLet me start checking from 100. Since 100 is even, it's not prime. The next number is 101. Let me check if 101"
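
The two prompts above share the same chat layout. As a minimal sketch, such a string can be built by hand for a single user turn; `build_chat_prompt` is a hypothetical helper, and a real script would more likely rely on the tokenizer's own chat template.

```python
def build_chat_prompt(user_message: str) -> str:
    """Wrap one user message in the chat markers seen in the demo prompts."""
    return (
        "<|im_start|>user\n"
        f"{user_message}<|im_end|>\n"
        "<|im_start|>assistant\n"
    )


prompts = [
    build_chat_prompt("introduce yourself"),
    build_chat_prompt("list all prime numbers within 100"),
]
```

The resulting `prompts` list can be passed to `llm.generate` in place of the plain-text prompt from the Quick Start.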

Conclusion: Although the current version is not well optimized for performance, at least it works 😀

Project details

Release history

0.0.2 (Feb 2, 2026)

0.0.1.post1 (Feb 1, 2026), this version

Download files

Source Distribution

nano_vllm_npu-0.0.1.post1.tar.gz (18.8 kB view details)

Uploaded Feb 1, 2026 Source

Built Distribution

nano_vllm_npu-0.0.1.post1-py3-none-any.whl (24.7 kB view details)

Uploaded Feb 1, 2026 Python 3

File details

Details for the file nano_vllm_npu-0.0.1.post1.tar.gz.

File metadata

Download URL: nano_vllm_npu-0.0.1.post1.tar.gz
Upload date: Feb 1, 2026
Size: 18.8 kB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.1.0 CPython/3.13.7

File hashes

Hashes for nano_vllm_npu-0.0.1.post1.tar.gz
Algorithm	Hash digest
SHA256	`d79abc73e0a4eade05eee69ed03c532cc7139ac749cfebaebb0285b7a8e2b00c`
MD5	`20869840c8f6bdda4691dd03b5daf217`
BLAKE2b-256	`fdfcf8e56afeb40e2d1d3f9e2ed6766cd0b2c001b1d2562abca96b03d6a251f0`
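
A downloaded archive can be checked against the SHA256 digest listed above. This is a minimal sketch using only the standard library; `sha256_of` and `matches_expected` are hypothetical helper names, and the file path is whatever location the archive was saved to.

```python
import hashlib

# SHA256 digest listed above for nano_vllm_npu-0.0.1.post1.tar.gz
EXPECTED_SHA256 = "d79abc73e0a4eade05eee69ed03c532cc7139ac749cfebaebb0285b7a8e2b00c"


def sha256_of(path: str, chunk_size: int = 1 << 20) -> str:
    """Return the hex SHA256 digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_expected(path: str, expected: str = EXPECTED_SHA256) -> bool:
    """Compare a file's digest with the expected hex string."""
    return sha256_of(path) == expected.lower()
```

For example, `matches_expected("nano_vllm_npu-0.0.1.post1.tar.gz")` should return True for an intact copy of the source distribution.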

File details

Details for the file nano_vllm_npu-0.0.1.post1-py3-none-any.whl.

File metadata

Download URL: nano_vllm_npu-0.0.1.post1-py3-none-any.whl
Upload date: Feb 1, 2026
Size: 24.7 kB
Tags: Python 3
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.1.0 CPython/3.13.7

File hashes

Hashes for nano_vllm_npu-0.0.1.post1-py3-none-any.whl
Algorithm	Hash digest
SHA256	`b85ea4f57c474113c1774e5513c724ffb80910c55acc7cb5a48017422d9727bb`
MD5	`5a782b8b7851544ad7d61d03ae48cfdb`
BLAKE2b-256	`1e77371a28e24adb78c64c60a5613a078998a693e16b820b5920d72f8e8ef78b`

nano-vllm-npu 0.0.1.post1
