vLLM plugin: out-of-tree registration of canon-layer architectures (e.g. LlamaCanonForCausalLM from PhysicsLM4)
Project description
vllm-canon
An out-of-tree vLLM plugin that adds support for the `LlamaCanonForCausalLM` architecture — the "canon layer" variant of Llama introduced in Zeyuan Allen-Zhu's PhysicsLM4 / Canon Layers work.
A canon layer is a depthwise causal short convolution (kernel=4 by default) inserted at up to four positions in each decoder block:

- canonA — on the residual stream after `input_layernorm`, before attention
- canonB — on the fused qkv stream, before RoPE
- canonC — on the residual stream after `post_attention_layernorm`, before MLP
- canonD — on the fused gate_up stream, before `silu * mul`
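To make the operation concrete, here is a minimal sketch of a depthwise causal short convolution in plain PyTorch. The function name `canon_conv` and the tensor layout are illustrative, not the package's actual API; the causal left-padding and the depthwise (`groups=dim`) structure are the points being shown:

```python
import torch
import torch.nn.functional as F

def canon_conv(x: torch.Tensor, weight: torch.Tensor) -> torch.Tensor:
    """Depthwise causal short convolution over the sequence dimension.

    x:      (batch, seq_len, dim) activations
    weight: (dim, kernel) depthwise filter, kernel=4 by default
    """
    dim, kernel = weight.shape
    x = x.transpose(1, 2)                 # (batch, dim, seq_len)
    # Left-pad by kernel-1 zeros so position t only sees positions <= t.
    x = F.pad(x, (kernel - 1, 0))
    # groups=dim makes the conv depthwise: each channel has its own filter.
    y = F.conv1d(x, weight.unsqueeze(1), groups=dim)
    return y.transpose(1, 2)              # (batch, seq_len, dim)
```

Because the kernel is so short, this is cheap relative to attention, and the per-position output depends on at most the last four inputs of the same channel.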
Install
pip install vllm-canon
After install, vLLM auto-discovers the plugin via its `vllm.general_plugins` entry point.
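For readers unfamiliar with vLLM's plugin mechanism, the wiring looks roughly like the sketch below. The module path `vllm_canon.model` and the entry-point name are illustrative assumptions, not the package's verified layout; the `ModelRegistry.register_model` call is vLLM's documented out-of-tree registration API:

```python
# pyproject.toml declares the entry point that vLLM discovers at startup:
#
#   [project.entry-points."vllm.general_plugins"]
#   canon = "vllm_canon:register"
#
# vLLM imports and calls every "vllm.general_plugins" entry point.

def register() -> None:
    """Register the out-of-tree architecture with vLLM's ModelRegistry."""
    from vllm import ModelRegistry  # lazy import: keep plugin load cheap

    # The string form "module:Class" defers the class import until a model
    # whose config lists architecture "LlamaCanonForCausalLM" is loaded.
    ModelRegistry.register_model(
        "LlamaCanonForCausalLM",
        "vllm_canon.model:LlamaCanonForCausalLM",  # hypothetical module path
    )
```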
Use
Pass trust_remote_code=True so HuggingFace autoloads the custom
LlamaCanonConfig from your model directory:
```python
from vllm import LLM, SamplingParams

llm = LLM(
    model="/path/to/your/canon-model",
    trust_remote_code=True,
    tensor_parallel_size=1,
    dtype="bfloat16",
    enforce_eager=True,
)
out = llm.generate(["hello"], SamplingParams(temperature=0, max_tokens=32))
print(out[0].outputs[0].text)
```
Or start a vLLM server:
```shell
vllm serve /path/to/your/canon-model \
  --trust-remote-code --tensor-parallel-size 1 --dtype bfloat16 \
  --enforce-eager --port 8000 --served-model-name canon
```
What the plugin does
- Registers `LlamaCanonForCausalLM` in `ModelRegistry` via the `vllm.general_plugins` entry point — no edits to the vLLM source tree.
- Rebuilds the Llama block with vLLM primitives (`QKVParallelLinear`, `MergedColumnParallelLinear`, paged attention, partial RoPE via `partial_rotary_factor`) and inserts the four canon convolutions at the HF reference positions.
- Each canon conv is a `MambaBase` with `mamba_type="short_conv"`, so vLLM's V1 engine allocates a per-request `(kernel-1, dim)` rolling state alongside the KV cache. The model declares `HasInnerState` and `IsHybrid` so the engine plumbs that state correctly.
- The conv forward is written in pure PyTorch (`F.conv1d` for prefill, shift-append + dot for decode). The Triton kernel (`causal_conv1d_fn` / `causal_conv1d_update`) produced state-update results that diverged from the reference in this setting; the canon width is tiny, so the pure-torch path is fine.
- The HF checkpoint loads via vLLM's standard stacked-weight mapping: `q_proj`/`k_proj`/`v_proj` → `qkv_proj`, `gate_proj`/`up_proj` → `gate_up_proj`, canon weights by name. `lm_head` is tied to `embed_tokens` when `tie_word_embeddings=True`.
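The "shift-append + dot" decode path mentioned above can be sketched as follows. The function name and signature are illustrative, not the plugin's actual code; what it shows is how a `(kernel-1, dim)` rolling state makes each decode step equivalent to the causal `F.conv1d` prefill at that position:

```python
import torch

def conv_decode_step(
    state: torch.Tensor, x_t: torch.Tensor, weight: torch.Tensor
) -> tuple[torch.Tensor, torch.Tensor]:
    """One decode step of the causal conv with a per-request rolling state.

    state:  (kernel-1, dim) — the last kernel-1 inputs seen for this request
    x_t:    (dim,)          — the new token's activations
    weight: (dim, kernel)   — depthwise filter
    Returns (y_t, new_state).
    """
    # Append the new input to the window of the last kernel-1 inputs.
    window = torch.cat([state, x_t.unsqueeze(0)], dim=0)  # (kernel, dim)
    # Depthwise dot product: each channel d uses its own filter weight[d].
    y_t = (window * weight.t()).sum(dim=0)                # (dim,)
    # Shift: drop the oldest row; the rest becomes the next state.
    return y_t, window[1:]
```

A fresh request starts from a zero state, which matches the zero left-padding of the prefill convolution.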
Limitations
- `tensor_parallel_size=1` only. Canon B and canon D operate on fused QKV / gate_up streams; per-shard weight layouts under TP>1 need separate work.
- `rope_version='huggingface'` only. Lingua-style interleaved RoPE is not supported.
- `enforce_eager=True` recommended. The model class is not decorated with `@support_torch_compile`; adding it would require an explicit `dynamic_arg_dims`.
Compatibility
- vLLM `>=0.15,<0.17` (tested on 0.15.1)
- transformers `>= 4.57`
- PyTorch `>= 2.5`
- Python `>= 3.10`
Parity
Verified against HuggingFace .generate() on the
qwen1.5-0.5b-newtok-canon PhysicsLM4 checkpoint: 16/16 greedy tokens
match for both a 1-token prompt (exercises the decode-path conv state
update) and a 12-token prompt (exercises the prefill conv).
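A parity check of this kind can be sketched as below. This is a hedged outline, not the project's actual test script: the model directory is a placeholder, and greedy decoding is forced on both sides so token ids are directly comparable:

```python
def check_greedy_parity(model_dir: str, prompt: str, max_tokens: int = 16) -> bool:
    """Compare greedy continuations from HF transformers and vLLM."""
    from transformers import AutoModelForCausalLM, AutoTokenizer
    from vllm import LLM, SamplingParams

    tok = AutoTokenizer.from_pretrained(model_dir, trust_remote_code=True)
    hf = AutoModelForCausalLM.from_pretrained(model_dir, trust_remote_code=True)
    ids = tok(prompt, return_tensors="pt").input_ids
    # Greedy decode in HF: do_sample=False picks the argmax token each step.
    hf_ids = hf.generate(ids, do_sample=False, max_new_tokens=max_tokens)
    hf_new = hf_ids[0, ids.shape[1]:].tolist()

    llm = LLM(model=model_dir, trust_remote_code=True, enforce_eager=True)
    # temperature=0 is vLLM's greedy mode.
    out = llm.generate([prompt], SamplingParams(temperature=0, max_tokens=max_tokens))
    return hf_new == list(out[0].outputs[0].token_ids)
```

Running this with both a 1-token and a multi-token prompt exercises the decode-path state update and the prefill conv respectively.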
License
Apache 2.0. See LICENSE.
Project details
Download files
Source Distribution
Built Distribution
File details
Details for the file vllm_canon-0.1.0.tar.gz.
File metadata
- Download URL: vllm_canon-0.1.0.tar.gz
- Upload date:
- Size: 21.1 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.13.12
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | `420386c7bfdea9e1d64d126258806300edad4c9cafe40df93da6a938aed809e3` |
| MD5 | `f7d0bdf4f5de93d8c5194aef008cd1dc` |
| BLAKE2b-256 | `f2d88cef6e98fa6068310f6bb9d20982ceb558b9e966534ddf4ac475ad8a7a1d` |
File details
Details for the file vllm_canon-0.1.0-py3-none-any.whl.
File metadata
- Download URL: vllm_canon-0.1.0-py3-none-any.whl
- Upload date:
- Size: 19.5 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.13.12
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | `475ae62e9d95128dfe5716bc1a4ff29125122eb21f1e9f9917775a8b5708c0f9` |
| MD5 | `54122ae2d6bc1ba3284dbec395be4f5d` |
| BLAKE2b-256 | `c9d7fba9e3458c165ef7f7c2e10099add02486a00ace08eb0024178499773086` |