An out-of-tree vLLM plugin for Mobilint NPU runtime integration.
Project description
vLLM MBLT
vllm-mblt is an out-of-tree vLLM plugin that integrates Mobilint NPU runtime support into the vLLM serving and benchmarking stack.
It provides a custom vLLM platform, worker, and model registry hooks so Mobilint-optimized LLM/VLM artifacts can be served through familiar vLLM commands and OpenAI-compatible APIs.
Highlights
- Out-of-tree vLLM plugin: registers the
mbltplatform without patching vLLM itself. - Mobilint NPU worker: dispatches text-generation and multimodal execution to Mobilint runtime models.
- Model registry integration: supports Mobilint wrappers for Llama, HyperCLOVAX, EXAONE/EXAONE4, Qwen2/3, and Qwen2/3-VL families.
- Runtime-aware scheduling: reads model-configured
npu_prefill_chunk_sizeandmax_batch_sizevalues to tune chunked prefill and scheduler concurrency automatically. - vLLM benchmark compatibility: works with
vllm serve,vllm bench serve, andvllm bench throughput.
Requirements
- Python 3.10+
vllm==0.11.2mblt-model-zoo[transformers] >= 1.5.1- A Mobilint NPU environment. If you are not yet a Mobilint customer, please contact tech-support@mobilint.com.
The package pins vLLM for compatibility:
vllm>=0.11.2,<=0.11.2
Installation
Install from PyPI:
pip install vllm-mblt
Or install the latest source checkout:
git clone https://github.com/mobilint/vllm-mblt.git
cd vllm-mblt
python -m venv .venv
source .venv/bin/activate
pip install -U pip
pip install -e .
Quick Start
1. Verify Plugin Registration
After installation, run:
vllm --help
You should see plugin logs indicating that the Mobilint mblt platform plugin has been discovered and activated.
2. Serve a Text Model
vllm serve mobilint/Llama-3.2-1B-Instruct --trust-remote-code
Then query the OpenAI-compatible endpoint:
curl http://127.0.0.1:8000/v1/models
3. Serve a VLM Model
Qwen2-VL and Qwen3-VL Mobilint models can be loaded through the same vLLM server path:
vllm serve mobilint/Qwen2-VL-2B-Instruct --trust-remote-code
vllm serve mobilint/Qwen3-VL-2B-Instruct --trust-remote-code
Current Mobilint Qwen2/3-VL notes:
- The worker loads VLMs through
AutoModelForImageTextToText. - Image inputs are processed through vLLM's multimodal pipeline and merged into Mobilint language-model prompt embeddings inside the custom worker.
- The NPU path currently supports exactly one image in the initial multimodal request.
- Subsequent turns in the same session must be text-only or reuse the same image-token position.
- Video inputs are not supported by the current Mobilint Qwen2/3-VL NPU path.
Runtime Tuning
Runtime Layout Overrides
By default, vllm-mblt follows the runtime layout encoded in the Mobilint model artifact/config. Use
--model-loader-extra-config only when you intentionally want to override runtime placement or testing knobs.
Runtime settings such as dev_no, target_cores, target_clusters, core_mode, and max_batch_size are
forwarded to from_pretrained(...) through --model-loader-extra-config.
For detailed core_mode and multicore runtime layout guidance, see the
Mobilint multicore documentation.
vllm serve mobilint/Llama-3.2-1B-Instruct \
--trust-remote-code \
--model-loader-extra-config '{"dev_no": 0, "target_cores": ["1:0"]}'
Chunked Prefill Auto-Tuning
If a model config includes npu_prefill_chunk_size, vllm-mblt uses it to tune vLLM chunked prefill.
- Integer values are used directly.
- Dict values are selected by
core_mode. core_modeis resolved from--model-loader-extra-configfirst, then from the model config default.- The selected value is applied to vLLM's
max_num_batched_tokensfor chunked prefill. - If no matching value is found,
vllm-mbltfalls back to128. - For batch-compiled models with
max_batch_size > 1, the effective chunked prefill limit is clamped to128to match the qbruntime batch execution limit used by the worker.
Example model config:
{
"npu_prefill_chunk_size": {
"single": 64,
"global4": 256,
"global8": 512
}
}
With this command, vllm-mblt selects 256 for global4:
vllm serve mobilint/YourModel \
--trust-remote-code \
--model-loader-extra-config '{"dev_no": 0, "core_mode": "global4", "target_clusters": [0]}'
If you also pass --max-num-batched-tokens, the effective value becomes the smaller of the user-provided value
and the model-configured npu_prefill_chunk_size.
Use --block-size only when you intentionally want to override the model-configured/default block size:
vllm serve mobilint/Llama-3.2-1B-Instruct \
--trust-remote-code \
--block-size 64
Model-Configured Batch Capacity
If a model config includes max_batch_size, vllm-mblt uses that value to support batch-compiled Mobilint models.
- The worker uses
max_batch_sizefor KV cache memory sizing. - The platform applies it to vLLM
max_num_seqsautomatically. - You do not need to pass
--max-num-seqsunless you intentionally want a smaller scheduler cap. max_batch_sizealso supports the samecore_modekeyed dict form asnpu_prefill_chunk_size.- For local testing,
--model-loader-extra-config '{"max_batch_size": 32}'overrides the model config value.
Example:
vllm serve mobilint/Llama-3.2-1B-Instruct-Batch32 --trust-remote-code
For batch-compiled MXQs such as mobilint/Llama-3.2-1B-Instruct-Batch32, the plugin also caps the effective
chunked prefill limit to 128, even when the model config advertises a larger npu_prefill_chunk_size.
Benchmarking
This repository includes sonnet.txt, which can be used with vLLM benchmark commands.
Serve Benchmark
Terminal 1:
vllm serve --model mobilint/Llama-3.2-1B-Instruct --trust-remote-code
Terminal 2:
vllm bench serve --model mobilint/Llama-3.2-1B-Instruct \
--trust-remote-code \
--port 8000 \
--num-warmups 1 \
--dataset-name sonnet \
--dataset-path sonnet.txt \
--num-prompts 10
Throughput Benchmark
vllm bench throughput --model mobilint/Llama-3.2-1B-Instruct \
--trust-remote-code \
--dataset-name sonnet \
--dataset-path sonnet.txt \
--num-prompts 10
Notes:
vllm bench serveuses a separate server process;vllm bench throughputruns the engine directly.vllm bench serve --max-concurrencyis a benchmark client load setting, not the server-side scheduler limit.- Reported latency and throughput are environment-dependent. Capture results from your target board for documentation or performance comparisons.
Supported Model Families
vllm-mblt registers Mobilint model wrappers for:
| Family | Registry class |
|---|---|
| Llama / HyperCLOVAX-compatible text models | MobilintLlamaForCausalLM |
| EXAONE | MobilintExaoneForCausalLM |
| EXAONE4 | MobilintExaone4ForCausalLM |
| Qwen2 | MobilintQwen2ForCausalLM |
| Qwen3 | MobilintQwen3ForCausalLM |
| Qwen2-VL | MobilintQwen2VLForConditionalGeneration |
| Qwen3-VL | MobilintQwen3VLForConditionalGeneration |
Model artifacts are available through Mobilint model repositories such as the Mobilint Hugging Face Hub.
Cache Behavior
MbltWorker uses snapshot-based KV cache reuse with these policies:
- Event-driven dump, not every step.
- Reuse live cache for same-request continuous decode.
- Keep finished-session snapshots for prefix reuse.
- Evict finished snapshots with an LRU cap of 16 sessions.
Implementation file: vllm_mblt/mblt_worker.py
Tests
python -m pytest tests
Project Structure
vllm_mblt/
├── __init__.py # vLLM plugin and model registration entry points
├── mblt_platform.py # platform config overrides and runtime-aware defaults
├── mblt_worker.py # custom worker, prefill/decode flow, KV snapshot logic
└── models/ # Mobilint model wrappers for LLM/VLM families
tests/
├── test_kv_cache_swap_spec.py
├── test_mblt_platform_prefill.py
└── test_mblt_worker_optimizations.py
Project details
Release history Release notifications | RSS feed
Download files
Download the file for your platform. If you're not sure which to choose, learn more about installing packages.
Source Distribution
Built Distribution
Filter files by name, interpreter, ABI, and platform.
If you're not sure about the file name format, learn more about wheel file names.
Copy a direct link to the current filters
File details
Details for the file vllm_mblt-0.1.0.tar.gz.
File metadata
- Download URL: vllm_mblt-0.1.0.tar.gz
- Upload date:
- Size: 37.5 kB
- Tags: Source
- Uploaded using Trusted Publishing? Yes
- Uploaded via: twine/6.1.0 CPython/3.13.12
File hashes
| Algorithm | Hash digest | |
|---|---|---|
| SHA256 |
d3d0dc3bb2a131c565580d21d7249d0bbd1e0c6cae43e035aad1ad2f5abd966b
|
|
| MD5 |
d3697dc03746d9e7c941ccfeb6c26dc3
|
|
| BLAKE2b-256 |
5dcf7f013ae05f9fc1af51787fe5dc5f0be3d1505ec2894ef1f7bdcdb1405e5d
|
Provenance
The following attestation bundles were made for vllm_mblt-0.1.0.tar.gz:
Publisher:
publish.yml on mobilint/vllm-mblt
-
Statement:
-
Statement type:
https://in-toto.io/Statement/v1 -
Predicate type:
https://docs.pypi.org/attestations/publish/v1 -
Subject name:
vllm_mblt-0.1.0.tar.gz -
Subject digest:
d3d0dc3bb2a131c565580d21d7249d0bbd1e0c6cae43e035aad1ad2f5abd966b - Sigstore transparency entry: 1833967319
- Sigstore integration time:
-
Permalink:
mobilint/vllm-mblt@fd2596b33f700718d5b409f179ecac0b3ba965f6 -
Branch / Tag:
refs/tags/v0.1.0 - Owner: https://github.com/mobilint
-
Access:
public
-
Token Issuer:
https://token.actions.githubusercontent.com -
Runner Environment:
github-hosted -
Publication workflow:
publish.yml@fd2596b33f700718d5b409f179ecac0b3ba965f6 -
Trigger Event:
release
-
Statement type:
File details
Details for the file vllm_mblt-0.1.0-py3-none-any.whl.
File metadata
- Download URL: vllm_mblt-0.1.0-py3-none-any.whl
- Upload date:
- Size: 31.6 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? Yes
- Uploaded via: twine/6.1.0 CPython/3.13.12
File hashes
| Algorithm | Hash digest | |
|---|---|---|
| SHA256 |
36defb516cb1138dd381e8699ffaa12d15bb4859701d400665af8bb66d4058f8
|
|
| MD5 |
1092ea0fa536fe2c144ee1c1ce26d479
|
|
| BLAKE2b-256 |
567277a8ae38894d9d2686a863d58e5efcd7acc9ec2753347b93036b16c70ddc
|
Provenance
The following attestation bundles were made for vllm_mblt-0.1.0-py3-none-any.whl:
Publisher:
publish.yml on mobilint/vllm-mblt
-
Statement:
-
Statement type:
https://in-toto.io/Statement/v1 -
Predicate type:
https://docs.pypi.org/attestations/publish/v1 -
Subject name:
vllm_mblt-0.1.0-py3-none-any.whl -
Subject digest:
36defb516cb1138dd381e8699ffaa12d15bb4859701d400665af8bb66d4058f8 - Sigstore transparency entry: 1833967390
- Sigstore integration time:
-
Permalink:
mobilint/vllm-mblt@fd2596b33f700718d5b409f179ecac0b3ba965f6 -
Branch / Tag:
refs/tags/v0.1.0 - Owner: https://github.com/mobilint
-
Access:
public
-
Token Issuer:
https://token.actions.githubusercontent.com -
Runner Environment:
github-hosted -
Publication workflow:
publish.yml@fd2596b33f700718d5b409f179ecac0b3ba965f6 -
Trigger Event:
release
-
Statement type: