vLLM plugin: out-of-tree registration of canon-layer architectures (e.g. LlamaCanonForCausalLM from PhysicsLM4)

vllm-canon

An out-of-tree vLLM plugin that adds support for the LlamaCanonForCausalLM architecture — the "canon layer" variant of Llama introduced in Zeyuan Allen-Zhu's PhysicsLM4 / Canon Layers work.

A canon layer is a depthwise causal short convolution (kernel=4 by default) inserted at up to four positions in each decoder block:

  • canonA — on the residual stream after input_layernorm, before attention
  • canonB — on the fused qkv stream before RoPE
  • canonC — on the residual stream after post_attention_layernorm, before MLP
  • canonD — on the fused gate_up stream before the gated activation (silu(gate) * up)
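The conv itself is simple: a per-channel (depthwise) causal convolution whose output at position t only sees positions t-3..t. A minimal pure-PyTorch sketch (the class name `CanonConv` and the bias-free init are illustrative, not the plugin's actual API):

```python
import torch
import torch.nn.functional as F


class CanonConv(torch.nn.Module):
    """Depthwise causal short convolution, kernel=4 by default (sketch)."""

    def __init__(self, dim: int, kernel: int = 4):
        super().__init__()
        self.kernel = kernel
        # one length-`kernel` filter per channel (depthwise), no bias here
        self.weight = torch.nn.Parameter(torch.randn(dim, 1, kernel))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq_len, dim); left-pad by kernel-1 so position t
        # never sees positions > t (causal)
        x = x.transpose(1, 2)                       # (batch, dim, seq_len)
        x = F.pad(x, (self.kernel - 1, 0))
        y = F.conv1d(x, self.weight, groups=x.shape[1])  # depthwise
        return y.transpose(1, 2)                    # (batch, seq_len, dim)


conv = CanonConv(16)
x = torch.randn(1, 8, 16)
y = conv(x)
print(y.shape)  # torch.Size([1, 8, 16]) — same shape in, same shape out
```

Because the conv is depthwise and short, it adds only dim × kernel parameters per insertion point.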

Install

pip install vllm-canon

After install, vLLM auto-discovers the plugin via its vllm.general_plugins entry point.
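For reference, an out-of-tree plugin advertises itself to vLLM through a `vllm.general_plugins` entry point in its packaging metadata. A hypothetical pyproject.toml excerpt (the module and function names are illustrative, not necessarily what this package uses):

```toml
# Hypothetical pyproject.toml excerpt: vLLM scans this entry-point group
# at startup and calls the referenced function, which registers the
# architecture with ModelRegistry.
[project.entry-points."vllm.general_plugins"]
vllm_canon = "vllm_canon:register"
```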

Use

Pass trust_remote_code=True so the Transformers library auto-loads the custom LlamaCanonConfig from your model directory:

from vllm import LLM, SamplingParams

llm = LLM(
    model="/path/to/your/canon-model",
    trust_remote_code=True,
    tensor_parallel_size=1,
    dtype="bfloat16",
    enforce_eager=True,
)
print(llm.generate(["hello"], SamplingParams(temperature=0, max_tokens=32))[0].outputs[0].text)

Or start a vLLM server:

vllm serve /path/to/your/canon-model \
  --trust-remote-code --tensor-parallel-size 1 --dtype bfloat16 \
  --enforce-eager --port 8000 --served-model-name canon

What the plugin does

  • Registers LlamaCanonForCausalLM in ModelRegistry via the vllm.general_plugins entry point — no edits to the vLLM source tree.
  • Rebuilds the Llama block with vLLM primitives (QKVParallelLinear, MergedColumnParallelLinear, paged attention, partial RoPE via partial_rotary_factor) and inserts the four canon convolutions at the HF reference positions.
  • Each canon conv is a MambaBase with mamba_type="short_conv" so vLLM's V1 engine allocates a per-request (kernel-1, dim) rolling state alongside the KV cache. The model declares HasInnerState and IsHybrid so the engine plumbs that state correctly.
  • The conv forward is written in pure PyTorch (F.conv1d for prefill, shift-append + dot for decode). The Triton kernels (causal_conv1d_fn / causal_conv1d_update) produced state-update results that diverged from the reference in this setting; since the canon kernel width is tiny, the pure-torch path costs little.
  • The HF checkpoint loads via vLLM's standard stacked-weight mapping: q_proj/k_proj/v_proj → qkv_proj, gate_proj/up_proj → gate_up_proj, canon weights by name. lm_head is tied to embed_tokens when tie_word_embeddings=True.
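The two conv paths above can be sketched and cross-checked in pure PyTorch: F.conv1d over the whole prompt at prefill, and a shift-append + per-channel dot against a (kernel-1, dim) rolling state at decode. Variable names here are illustrative, not the plugin's internals:

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
dim, kernel, seq = 4, 4, 6
w = torch.randn(dim, 1, kernel)            # depthwise filters
x = torch.randn(seq, dim)                  # one sequence, no batch dim

# Prefill path: causal depthwise conv over the full sequence at once
xp = F.pad(x.t().unsqueeze(0), (kernel - 1, 0))        # (1, dim, seq+3)
prefill = F.conv1d(xp, w, groups=dim).squeeze(0).t()   # (seq, dim)

# Decode path: a (kernel-1, dim) rolling state, updated one token at a time
state = torch.zeros(kernel - 1, dim)
decode = []
for t in range(seq):
    window = torch.cat([state, x[t:t + 1]], dim=0)     # (kernel, dim)
    decode.append((window.t() * w.squeeze(1)).sum(-1)) # per-channel dot
    state = window[1:]                                 # shift-append
decode = torch.stack(decode)

# Both paths compute the same causal conv, so outputs should match
print(torch.allclose(prefill, decode, atol=1e-5))
```

This equivalence is exactly what a per-request (kernel-1, dim) state buys: decode never re-reads the prompt, only the last kernel-1 inputs.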

Limitations

  • tensor_parallel_size=1 only. Canon B and canon D operate on fused QKV / gate_up streams; per-shard weight layouts under TP>1 need separate work.
  • rope_version='huggingface' only. Lingua-style interleaved RoPE is not supported.
  • enforce_eager=True recommended. The model class is not decorated with @support_torch_compile; adding it would require an explicit dynamic_arg_dims.

Compatibility

  • vLLM >=0.15,<0.17 (tested on 0.15.1)
  • transformers >= 4.57
  • PyTorch >= 2.5
  • Python >= 3.10

Parity

Verified against HuggingFace .generate() on the qwen1.5-0.5b-newtok-canon PhysicsLM4 checkpoint: 16/16 greedy tokens match for both a 1-token prompt (exercises the decode-path conv state update) and a 12-token prompt (exercises the prefill conv).

License

Apache 2.0. See LICENSE.
