Skip to main content

a lightweight vLLM implementation built from scratch and runs on NPU.

Project description

Nano-vLLM-NPU

A lightweight vLLM implementation built from scratch.

Nano-vllm-npu originally forks from nano-vllm, then adapts it to Ascend NPU.

changes

  • 2026/02/02 Using triton-ascend to optimize kv-cache load and store in file nanovllmnpu/layers/attention.py. generate in example.py time varies from 219 seconds to 43 seconds, 5x faster.

Installation

pip install git+https://github.com/voidvelocity/nano-vllm-npu.git

Model Download

To download the model weights manually, use the following command:

huggingface-cli download --resume-download Qwen/Qwen3-0.6B \
  --local-dir ~/huggingface/Qwen3-0.6B/ \
  --local-dir-use-symlinks False

Quick Start

See example.py for usage. The API mirrors vLLM's interface with minor differences in the LLM.generate method:

import torch._dynamo
torch._dynamo.config.suppress_errors = True

from nanovllm import LLM, SamplingParams
llm = LLM("/YOUR/MODEL/PATH", enforce_eager=True, tensor_parallel_size=1)
sampling_params = SamplingParams(temperature=0.6, max_tokens=256)
prompts = ["Hello, Nano-vLLM."]
outputs = llm.generate(prompts, sampling_params)
outputs[0]["text"]

Note:

  • In my test, enforce_eager should always set as True. (I don't why now)
  • I export ASCEND_LAUNCH_BLOCKING=1, I'm not sure if it's necessary.
  • torch._dynamo.config.suppress_errors = True is needed to suppress errors from _dynamo.

Compare with nano-vllm

Main changes compared with nano-vllm

Demo

  • Hardware:
  • CANN Version:
  • PyTorch Version:
  • Torch NPU Version:

Output:

# git clone https://github.com/voidvelocity/nano-vllm-npu.git
# cd nano-vllm-npu
# python example.py
[rank0]:[2026-01-31 17:52:38,326] torch._dynamo.convert_frame: [WARNING] WON'T CONVERT rms_forward /home/my_demo/nano-vllm-npu/nanovllm/layers/layernorm.py line 16
[rank0]:[2026-01-31 17:52:38,326] torch._dynamo.convert_frame: [WARNING] due to:
[rank0]:[2026-01-31 17:52:38,326] torch._dynamo.convert_frame: [WARNING] Traceback (most recent call last):
[rank0]:[2026-01-31 17:52:38,326] torch._dynamo.convert_frame: [WARNING]   File "/usr/local/lib/python3.11/site-packages/torch_npu/utils/_dynamo.py", line 428, in _check_wrapper_exist
[rank0]:[2026-01-31 17:52:38,326] torch._dynamo.convert_frame: [WARNING]     raise AssertionError(f"Device {device_type} not supported" + pta_error(ErrCode.NOT_SUPPORT))
[rank0]:[2026-01-31 17:52:38,326] torch._dynamo.convert_frame: [WARNING] torch._dynamo.exc.BackendCompilerFailed: backend='inductor' raised:
[rank0]:[2026-01-31 17:52:38,326] torch._dynamo.convert_frame: [WARNING] AssertionError: Device npu not supported
[rank0]:[2026-01-31 17:52:38,326] torch._dynamo.convert_frame: [WARNING] [ERROR] 2026-01-31-17:52:38 (PID:3862154, Device:0, RankID:0) ERR00007 PTA feature not supported
[rank0]:[2026-01-31 17:52:38,326] torch._dynamo.convert_frame: [WARNING]
[rank0]:[2026-01-31 17:52:38,326] torch._dynamo.convert_frame: [WARNING] Set TORCH_LOGS="+dynamo" and TORCHDYNAMO_VERBOSE=1 for more information
[rank0]:[2026-01-31 17:52:38,326] torch._dynamo.convert_frame: [WARNING]
...
[rank0]:[2026-01-31 17:52:41,046] torch._dynamo.convert_frame: [WARNING]
Generating: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 2/2 [04:39<00:00, 139.61s/it, Prefill=68tok/s, Decode=0tok/s]


Prompt: '<|im_start|>user\nintroduce yourself<|im_end|>\n<|im_start|>assistant\n'
Completion: "<think>\nOkay, the user wants me to introduce myself. First, I need to provide a general and friendly description. I should mention my name, age, and background. But since I'm an AI, I don't have a real name, so I'll say I'm an AI assistant. I should also mention my purpose, like helping users with questions. I should keep it simple and positive. Let me make sure I'm not using any technical terms and keep it conversational. Alright, that should cover it.\n</think>\n\nHello! I'm an AI assistant designed to help you with questions, tasks, and support. I can assist with a wide range of topics, from general knowledge to specific queries. How can I assist you today?<|im_end|>"


Prompt: '<|im_start|>user\nlist all prime numbers within 100<|im_end|>\n<|im_start|>assistant\n'
Completion: "<think>\nOkay, so I need to list all the prime numbers between 100. Let me think about how to approach this. First, I remember that a prime number is a number greater than 1 that has no positive divisors other than 1 and itself. So, starting from 100, I need to check each number and see if it's prime.\n\nLet me start by recalling some prime numbers. The first few primes are 2, 3, 5, 7, 11, 13, 17, 19, 23, 29, 31, 37, 41, 43, 47, 53, 59, 61, 67, 71, 73, 79, 83, 89, 97. But wait, these are all primes up to 100. So, I need to make sure that I don't miss any.\n\nLet me start checking from 100. Since 100 is even, it's not prime. The next number is 101. Let me check if 101"

Conclusion: Although current version is well optimized for performance, at least it works 😀

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

nano_vllm_npu-0.0.2.tar.gz (19.1 kB view details)

Uploaded Source

Built Distribution

If you're not sure about the file name format, learn more about wheel file names.

nano_vllm_npu-0.0.2-py3-none-any.whl (24.8 kB view details)

Uploaded Python 3

File details

Details for the file nano_vllm_npu-0.0.2.tar.gz.

File metadata

  • Download URL: nano_vllm_npu-0.0.2.tar.gz
  • Upload date:
  • Size: 19.1 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.1.0 CPython/3.13.7

File hashes

Hashes for nano_vllm_npu-0.0.2.tar.gz
Algorithm Hash digest
SHA256 da5d8d0f0f7091aecd48877286f9d62b82c1ddb27cc38e795e730dff2c739df4
MD5 6ea88f3b03f02fc66bee64b053734cbe
BLAKE2b-256 07039ac4d540e3a47ed9dd61a44831ace74ae9762e67b2f08049132bfec8780b

See more details on using hashes here.

File details

Details for the file nano_vllm_npu-0.0.2-py3-none-any.whl.

File metadata

  • Download URL: nano_vllm_npu-0.0.2-py3-none-any.whl
  • Upload date:
  • Size: 24.8 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.1.0 CPython/3.13.7

File hashes

Hashes for nano_vllm_npu-0.0.2-py3-none-any.whl
Algorithm Hash digest
SHA256 14ad6de3a8ba42103489c0f540db25f86f02390548f43871a62533e99d4875d2
MD5 568f798f19b24e643efda29499adc7f0
BLAKE2b-256 41895f818b6527e3a9b9e2b36da82fc6f9f9c52c2683b2c569261798b8834e4c

See more details on using hashes here.

Supported by

AWS Cloud computing and Security Sponsor Datadog Monitoring Depot Continuous Integration Fastly CDN Google Download Analytics Pingdom Monitoring Sentry Error logging StatusPage Status page