A lightweight vLLM implementation built from scratch that runs on NPU.

Project description

Nano-vLLM-NPU

A lightweight vLLM implementation built from scratch.

Nano-vLLM-NPU is a fork of nano-vllm, adapted to run on Ascend NPU.

Installation

pip install git+https://github.com/voidvelocity/nano-vllm-npu.git

Model Download

To download the model weights manually, use the following command:

huggingface-cli download --resume-download Qwen/Qwen3-0.6B \
  --local-dir ~/huggingface/Qwen3-0.6B/ \
  --local-dir-use-symlinks False
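
The same download can also be scripted from Python. The sketch below assumes the huggingface_hub package (the library behind huggingface-cli) is installed. The helper names resolve_local_dir and download_model are hypothetical and not part of nano-vllm-npu.

```python
from pathlib import Path


def resolve_local_dir(local_dir: str) -> Path:
    """Expand '~' so the path matches what the shell command would use."""
    return Path(local_dir).expanduser()


def download_model(repo_id: str = "Qwen/Qwen3-0.6B",
                   local_dir: str = "~/huggingface/Qwen3-0.6B/") -> str:
    """Download a model snapshot and return the directory holding the weights."""
    # Imported lazily so this module can be loaded without huggingface_hub.
    from huggingface_hub import snapshot_download

    target = resolve_local_dir(local_dir)
    target.mkdir(parents=True, exist_ok=True)
    # snapshot_download fetches every file of the repo into local_dir.
    return snapshot_download(repo_id=repo_id, local_dir=str(target))
```

Calling download_model() returns a path that can be passed to LLM(...) in place of /YOUR/MODEL/PATH.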

Quick Start

See example.py for usage. The API mirrors vLLM's interface with minor differences in the LLM.generate method:

import torch._dynamo
torch._dynamo.config.suppress_errors = True

from nanovllm import LLM, SamplingParams
llm = LLM("/YOUR/MODEL/PATH", enforce_eager=True, tensor_parallel_size=1)
sampling_params = SamplingParams(temperature=0.6, max_tokens=256)
prompts = ["Hello, Nano-vLLM."]
outputs = llm.generate(prompts, sampling_params)
outputs[0]["text"]

Note:

  • In my tests, enforce_eager always had to be set to True. (I don't know why yet.)
  • I exported ASCEND_LAUNCH_BLOCKING=1, but I'm not sure whether it is necessary.
  • torch._dynamo.config.suppress_errors = True is needed to suppress errors raised by _dynamo.
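
The notes above can be gathered into one setup helper that runs before the engine is created. This is a minimal sketch: configure_npu_runtime is a hypothetical name, and whether ASCEND_LAUNCH_BLOCKING is actually required remains unverified, as stated in the notes.

```python
import importlib
import os


def configure_npu_runtime(launch_blocking: bool = True) -> dict:
    """Apply the workarounds from the notes and return the engine kwargs to use."""
    if launch_blocking:
        # Assumption: the variable should be set before the NPU runtime starts.
        # setdefault keeps any value the user has already exported.
        os.environ.setdefault("ASCEND_LAUNCH_BLOCKING", "1")

    try:
        dynamo = importlib.import_module("torch._dynamo")
        # Fall back to eager execution instead of raising when compilation fails.
        dynamo.config.suppress_errors = True
    except ImportError:
        # torch is not installed in this environment; nothing to configure.
        pass

    # enforce_eager=True was the only setting that worked in the author's tests.
    return {"enforce_eager": True, "tensor_parallel_size": 1}
```

With this helper, the engine could be created as LLM("/YOUR/MODEL/PATH", **configure_npu_runtime()).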

Comparison with nano-vllm

Main changes compared with nano-vllm

Demo

  • Hardware:
  • CANN Version:
  • PyTorch Version:
  • Torch NPU Version:

Output:

# git clone https://github.com/voidvelocity/nano-vllm-npu.git
# cd nano-vllm-npu
# python example.py
[rank0]:[2026-01-31 17:52:38,326] torch._dynamo.convert_frame: [WARNING] WON'T CONVERT rms_forward /home/my_demo/nano-vllm-npu/nanovllm/layers/layernorm.py line 16
[rank0]:[2026-01-31 17:52:38,326] torch._dynamo.convert_frame: [WARNING] due to:
[rank0]:[2026-01-31 17:52:38,326] torch._dynamo.convert_frame: [WARNING] Traceback (most recent call last):
[rank0]:[2026-01-31 17:52:38,326] torch._dynamo.convert_frame: [WARNING]   File "/usr/local/lib/python3.11/site-packages/torch_npu/utils/_dynamo.py", line 428, in _check_wrapper_exist
[rank0]:[2026-01-31 17:52:38,326] torch._dynamo.convert_frame: [WARNING]     raise AssertionError(f"Device {device_type} not supported" + pta_error(ErrCode.NOT_SUPPORT))
[rank0]:[2026-01-31 17:52:38,326] torch._dynamo.convert_frame: [WARNING] torch._dynamo.exc.BackendCompilerFailed: backend='inductor' raised:
[rank0]:[2026-01-31 17:52:38,326] torch._dynamo.convert_frame: [WARNING] AssertionError: Device npu not supported
[rank0]:[2026-01-31 17:52:38,326] torch._dynamo.convert_frame: [WARNING] [ERROR] 2026-01-31-17:52:38 (PID:3862154, Device:0, RankID:0) ERR00007 PTA feature not supported
[rank0]:[2026-01-31 17:52:38,326] torch._dynamo.convert_frame: [WARNING]
[rank0]:[2026-01-31 17:52:38,326] torch._dynamo.convert_frame: [WARNING] Set TORCH_LOGS="+dynamo" and TORCHDYNAMO_VERBOSE=1 for more information
[rank0]:[2026-01-31 17:52:38,326] torch._dynamo.convert_frame: [WARNING]
...
[rank0]:[2026-01-31 17:52:41,046] torch._dynamo.convert_frame: [WARNING]
Generating: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 2/2 [04:39<00:00, 139.61s/it, Prefill=68tok/s, Decode=0tok/s]


Prompt: '<|im_start|>user\nintroduce yourself<|im_end|>\n<|im_start|>assistant\n'
Completion: "<think>\nOkay, the user wants me to introduce myself. First, I need to provide a general and friendly description. I should mention my name, age, and background. But since I'm an AI, I don't have a real name, so I'll say I'm an AI assistant. I should also mention my purpose, like helping users with questions. I should keep it simple and positive. Let me make sure I'm not using any technical terms and keep it conversational. Alright, that should cover it.\n</think>\n\nHello! I'm an AI assistant designed to help you with questions, tasks, and support. I can assist with a wide range of topics, from general knowledge to specific queries. How can I assist you today?<|im_end|>"


Prompt: '<|im_start|>user\nlist all prime numbers within 100<|im_end|>\n<|im_start|>assistant\n'
Completion: "<think>\nOkay, so I need to list all the prime numbers between 100. Let me think about how to approach this. First, I remember that a prime number is a number greater than 1 that has no positive divisors other than 1 and itself. So, starting from 100, I need to check each number and see if it's prime.\n\nLet me start by recalling some prime numbers. The first few primes are 2, 3, 5, 7, 11, 13, 17, 19, 23, 29, 31, 37, 41, 43, 47, 53, 59, 61, 67, 71, 73, 79, 83, 89, 97. But wait, these are all primes up to 100. So, I need to make sure that I don't miss any.\n\nLet me start checking from 100. Since 100 is even, it's not prime. The next number is 101. Let me check if 101"
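
The prompts in the output above use the ChatML-style layout that Qwen chat models expect. A minimal sketch of building such a prompt by hand follows. format_chat_prompt is a hypothetical helper, not part of nano-vllm-npu. It is an assumption that example.py produces these prompts through the tokenizer's own chat template, which is usually the safer route in practice.

```python
def format_chat_prompt(user_message: str) -> str:
    """Wrap a single user turn in the ChatML markers seen in the demo output."""
    return (
        "<|im_start|>user\n"
        f"{user_message}<|im_end|>\n"
        "<|im_start|>assistant\n"
    )


# The two user messages from the demo run.
prompts = [
    format_chat_prompt(message)
    for message in ("introduce yourself", "list all prime numbers within 100")
]
```

The resulting list can be passed to llm.generate(prompts, sampling_params) as in Quick Start.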

Conclusion: Although the current version is not well optimized for performance, at least it works 😀

Download files

Source Distribution

nano_vllm_npu-0.0.1.post1.tar.gz (18.8 kB view details)

Uploaded Source

Built Distribution

nano_vllm_npu-0.0.1.post1-py3-none-any.whl (24.7 kB view details)

Uploaded Python 3

File details

Details for the file nano_vllm_npu-0.0.1.post1.tar.gz.

File metadata

  • Download URL: nano_vllm_npu-0.0.1.post1.tar.gz
  • Upload date:
  • Size: 18.8 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.1.0 CPython/3.13.7

File hashes

Hashes for nano_vllm_npu-0.0.1.post1.tar.gz
Algorithm Hash digest
SHA256 d79abc73e0a4eade05eee69ed03c532cc7139ac749cfebaebb0285b7a8e2b00c
MD5 20869840c8f6bdda4691dd03b5daf217
BLAKE2b-256 fdfcf8e56afeb40e2d1d3f9e2ed6766cd0b2c001b1d2562abca96b03d6a251f0
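
A downloaded archive can be checked against the SHA256 digest listed above. The following is a minimal sketch using only the standard library. The helper names sha256_of and matches_expected are hypothetical, and the file is assumed to have been downloaded separately.

```python
import hashlib
from pathlib import Path

# SHA256 digest listed for nano_vllm_npu-0.0.1.post1.tar.gz.
EXPECTED_SHA256 = "d79abc73e0a4eade05eee69ed03c532cc7139ac749cfebaebb0285b7a8e2b00c"


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Hash the file in chunks so large archives do not need to fit in memory."""
    digest = hashlib.sha256()
    with path.open("rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_expected(path: Path, expected: str = EXPECTED_SHA256) -> bool:
    """Return True when the file's digest equals the published one."""
    return sha256_of(path) == expected.lower()
```

For example, matches_expected(Path("nano_vllm_npu-0.0.1.post1.tar.gz")) should return True for an intact download.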

File details

Details for the file nano_vllm_npu-0.0.1.post1-py3-none-any.whl.

File hashes

Hashes for nano_vllm_npu-0.0.1.post1-py3-none-any.whl
Algorithm Hash digest
SHA256 b85ea4f57c474113c1774e5513c724ffb80910c55acc7cb5a48017422d9727bb
MD5 5a782b8b7851544ad7d61d03ae48cfdb
BLAKE2b-256 1e77371a28e24adb78c64c60a5613a078998a693e16b820b5920d72f8e8ef78b
