vLLM plugin for Qwerky AI MambaInLlama hybrid models

Project description

Qwerky vLLM Models

A vLLM plugin for serving Qwerky AI's MambaInLlama hybrid models without the --trust-remote-code flag.

Installation

pip install vllm qwerky-vllm-models

Usage

After installing, serve Qwerky models with vLLM:

vllm serve QwerkyAI/Qwerky-Llama3.2-Mamba-3B-Llama3.3-70B-base-distill --max-model-len 4096

The plugin automatically registers the model architecture with vLLM on import.
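
The same registration applies to vLLM's Python API, so offline inference should also work without extra flags. A minimal sketch, assuming a standard vLLM installation with this plugin installed (the prompt and sampling settings are illustrative):

from vllm import LLM, SamplingParams

# The plugin registers the architecture when vLLM loads its plugins, so no
# --trust-remote-code equivalent is needed here.
llm = LLM(
    model="QwerkyAI/Qwerky-Llama3.2-Mamba-3B-Llama3.3-70B-base-distill",
    max_model_len=4096,
)
sampling = SamplingParams(temperature=0.7, max_tokens=128)
outputs = llm.generate(["Summarize what a hybrid Mamba/attention model is."], sampling)
print(outputs[0].outputs[0].text)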

Supported Models

  • QwerkyAI/Qwerky-Llama3.2-Mamba-3B-Llama3.3-70B-base-distill

How It Works

This package uses vLLM's plugin system (the vllm.general_plugins entry point) to register the MambaInLlama model architecture; a sketch of the registration hook follows the list below. This means:

  • No fork of vLLM required
  • No --trust-remote-code flag needed
  • Works with standard vLLM installation
  • Uses vLLM's native Triton-accelerated Mamba kernels
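
Concretely, vLLM calls every function exposed under the vllm.general_plugins entry-point group at startup, and that hook can register out-of-tree architectures with vLLM's ModelRegistry. The sketch below illustrates the mechanism; the module path and architecture name are assumptions for illustration, not necessarily this package's actual layout.

# Hypothetical registration hook exposed via the vllm.general_plugins entry point.
def register():
    from vllm import ModelRegistry
    # The lazy "module:Class" form defers importing the model code until it is needed.
    ModelRegistry.register_model(
        "MambaInLlamaForCausalLM",                              # architecture name (assumed)
        "qwerky_vllm_models.modeling:MambaInLlamaForCausalLM",  # import path (assumed)
    )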

Requirements

  • Python >= 3.10
  • vLLM >= 0.14.0
  • PyTorch >= 2.0.0

Changelog

0.2.25

  • MAJOR: Conform to vLLM's caching style for CUDA graph compatibility
  • MambaInLlamaMambaMixer now inherits from vLLM's MambaBase class
  • Implements get_state_shape(), get_state_dtype(), and the mamba_type property (see the interface sketch after this entry)
  • Registers layers in static_forward_context for CUDA graph support
  • Added state_indices support for proper batch indexing via attn_metadata
  • Added copy_inputs_before_cuda_graphs() and get_seqlen_agnostic_capture_inputs()
  • Passes attn_metadata through the model forward chain
  • Should fix state persistence issues causing output degeneration/repetition
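
A rough sketch of the interface conformance described in this entry; the state shapes, constructor arguments, and mamba_type value are illustrative assumptions, not the plugin's exact code.

import torch

class MambaInLlamaMambaMixer:  # in the plugin this inherits from vLLM's MambaBase
    def __init__(self, d_inner: int, d_state: int, d_conv: int, dtype: torch.dtype):
        self.d_inner, self.d_state, self.d_conv = d_inner, d_state, d_conv
        self.dtype = dtype

    @property
    def mamba_type(self) -> str:
        return "mamba1"  # identifies the kernel/cache family to vLLM (assumed value)

    def get_state_dtype(self):
        # (conv state dtype, SSM state dtype)
        return self.dtype, self.dtype

    def get_state_shape(self):
        # per-request cache shapes: rolling conv window and SSM hidden state (illustrative)
        return (self.d_inner, self.d_conv - 1), (self.d_inner, self.d_state)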

0.2.24

  • FIX: Restore double bias in dt/delta computation
  • Reference implementation intentionally applies dt_proj.bias twice:
    1. Once in dt_proj(dt) (Linear includes bias)
    2. Again in softplus(dt + bias) before discretization
  • The model was trained with this double-bias behavior, so we must match it (see the sketch after this entry)
  • This fixes repetition issues from v0.2.22-0.2.23
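
A minimal sketch of the double-bias computation this entry restores; the dimensions and tensor names are illustrative.

import torch
import torch.nn as nn
import torch.nn.functional as F

dt_rank, d_inner = 64, 2048
dt_proj = nn.Linear(dt_rank, d_inner, bias=True)
dt_low_rank = torch.randn(4, dt_rank)

dt = dt_proj(dt_low_rank)           # bias applied once inside the Linear
dt = F.softplus(dt + dt_proj.bias)  # bias applied a second time before discretization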

0.2.23

  • CRITICAL FIX: Wrong in_proj split order causing gibberish output
  • Reference implementation uses: [z(d_inner), x(d_xb), B(d_xb), C(d_inner), dt(dt_rank)]
  • Our code incorrectly had: [z(d_inner), x(d_inner), B(d_xb), C(d_xb), dt(dt_rank)]
  • x is d_xb (needs repeat_kv expansion), C is d_inner (already full size)
  • Fixed _prefill and _decode_step to handle x/C dimensions correctly (see the split sketch after this entry)
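
A sketch of the corrected split order; the sizes below are illustrative, not the model's actual dimensions.

import torch

d_inner, d_xb, dt_rank = 2048, 1024, 64
projected = torch.randn(4, 2 * d_inner + 2 * d_xb + dt_rank)

# Reference order: [z(d_inner), x(d_xb), B(d_xb), C(d_inner), dt(dt_rank)]
z, x, B, C, dt = torch.split(projected, [d_inner, d_xb, d_xb, d_inner, dt_rank], dim=-1)
# x and B are d_xb-sized and need repeat_kv-style expansion; C is already d_inner.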

0.2.22

  • FIX: Attempted to fix double bias (WRONG - model was trained with double bias)
  • Removed redundant bias addition - this broke the model

0.2.21

  • FIX: Dtype mismatch in rotary position embeddings
  • Cast cos/sin to match q's dtype before applying rotation
  • Fixes RuntimeError: expected scalar type Float but found BFloat16 in Q×K matmul

0.2.20

  • FIX: Dtype mismatch in attention matmul
  • After softmax (computed in float32), convert to v.dtype instead of q.dtype (see the sketch after this entry)
  • Fixes RuntimeError: expected scalar type Float but found BFloat16
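
A sketch of the pattern this fix uses, with illustrative shapes: compute softmax in float32 for numerical stability, then cast the probabilities to v's dtype before the value matmul.

import torch

q = torch.randn(1, 8, 16, 64, dtype=torch.bfloat16)
k = torch.randn(1, 8, 16, 64, dtype=torch.bfloat16)
v = torch.randn(1, 8, 16, 64, dtype=torch.bfloat16)

scores = torch.matmul(q, k.transpose(-1, -2)) / (64 ** 0.5)
probs = torch.softmax(scores, dim=-1, dtype=torch.float32)  # stable softmax in float32
out = torch.matmul(probs.to(v.dtype), v)                    # cast to v.dtype to avoid the dtype mismatch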

0.2.19

  • FIX: Handle vLLM warmup where seq_len exceeds KV cache size
  • During warmup/autotune, max_num_batched_tokens=8192 but cache only holds 2048
  • Skip KV caching when tokens don't fit, allowing warmup to complete

0.2.18

  • Added extensive debug logging to diagnose attention layer shape issue
  • Logs: input shape, batch_size, seq_len, Q/K/V shapes, rotary output, KV cache shapes

0.2.17

  • Added debug logging in MHADecoderLayer to trace tensor shapes

0.2.16

  • Fixed attention layer to handle vLLM's flattened 2D tensor format
  • vLLM passes [total_tokens, hidden] but attention needs [batch, seq, hidden]
  • Added automatic batch dimension handling in MHADecoderLayer

0.2.15

  • Fixed attention layer KV cache shape mismatch
  • Removed incorrect tensor transpositions in KV cache assignment

0.2.14

  • Fixed mamba_config.json loading - removed local_files_only=True restriction
  • Now properly downloads mamba_config.json from HuggingFace Hub if not cached
  • Added more detailed logging for config loading

0.2.13

  • CRITICAL FIX: Load mamba_config.json for attn_layers, d_inner, d_xb
  • MambaInLlama models store Mamba-specific config in a separate mamba_config.json file (a loading sketch follows this entry)
  • Main config.json has model_type: "llama" without Mamba params
  • Fixed: Model was treating ALL layers as Mamba (attn_layers=[]) because config wasn't loaded
  • Added better logging for weight loading diagnostics
  • Attention layers at indices [3, 8, 13, 18, 23, 27] now properly recognized
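
A sketch of the kind of config loading this entry describes; the helper call shown and the key names are assumptions based on the fields listed above.

import json
from huggingface_hub import hf_hub_download

# Download (or reuse from cache) the auxiliary config that config.json does not contain.
path = hf_hub_download(
    repo_id="QwerkyAI/Qwerky-Llama3.2-Mamba-3B-Llama3.3-70B-base-distill",
    filename="mamba_config.json",
)
with open(path) as f:
    mamba_cfg = json.load(f)

attn_layers = mamba_cfg.get("attn_layers", [])  # e.g. [3, 8, 13, 18, 23, 27]
d_inner = mamba_cfg.get("d_inner")
d_xb = mamba_cfg.get("d_xb")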

0.2.12

  • CRITICAL FIX: Corrected d_xb default to match qwerky-distill PR #81
  • d_xb = num_key_value_heads * head_dim (GQA-style, e.g., 8×128=1024 for 8B; see the worked sketch after this entry)
  • Fixed in_proj split: [z(d_inner), x(d_inner), B(d_xb), C(d_xb), dt(dt_rank)]
  • Added repeat_kv expansion for C (same as B) in Mamba1 architecture
  • Fixed head count: num_heads = d_inner // d_state after B/C expansion
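
A worked sizing sketch using the 8B numbers quoted above; d_inner is illustrative (per 0.2.11 it equals hidden_size).

num_key_value_heads, head_dim, d_state = 8, 128, 16
d_inner = 4096                         # hidden_size of an 8B Llama model (illustrative)

d_xb = num_key_value_heads * head_dim  # 8 * 128 = 1024
num_heads = d_inner // d_state         # 4096 // 16 = 256 heads after B/C expansion
n_rep = d_inner // d_xb                # 4096 // 1024 = 4, the repeat_kv expansion factor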

0.2.11

  • CRITICAL FIX: Changed d_inner default from intermediate_size to hidden_size
  • MambaInLlama Mamba layers use d_inner = hidden_size, not intermediate_size
  • Fixed d_xb default: hidden_size // 16 (was hidden_size // 4)
  • This fixes the shape mismatch for all Mamba layer weights (A_log, D, conv1d, dt_proj, in_proj, out_proj)

0.2.10

  • Added debug logging to weight loading to diagnose parameter mapping issues
  • Logs first 20 model params, first 20 checkpoint weights, and all skipped weights

0.2.9

  • Fixed weight loading: split fused mha.in_proj into separate q/k/v projections
  • Renamed mha.out_proj to o_proj for checkpoint compatibility
  • Should now load all ~395 parameters instead of just 163

0.2.8

  • Fixed dtype mismatch in SSM scan: F.softplus/torch.exp compute in float32, now cast back to original dtype
  • This caused "expected BFloat16 but found Float" error in einsum

0.2.7

  • Fixed tensor broadcasting bug in _ssm_scan: A.unsqueeze(0).unsqueeze(-1) -> A.unsqueeze(0).unsqueeze(2)
  • This caused shape mismatch (8192 vs 16) during SSM discretization

0.2.6

  • Added embed_input_ids method required by vLLM's VllmModelForTextGeneration interface
  • This was the root cause of "This model does not support --runner generate" error

0.2.5

  • Fixed vLLM runner detection: added MambaInLlamaMambaForCausalLM alias for HF config compatibility
  • Added proper protocol inheritance (HasInnerState, IsHybrid) from vllm.model_executor.models.interfaces
  • Fixed class variable type hints (ClassVar[Literal[True]]) for vLLM model inspection
  • Simplified model registration code

0.2.4

  • Complete architecture rewrite with explicit state cache management
  • Separate prefill and decode paths for Mamba layers
  • Grouped-head Mamba support (num_xb_head, num_C_head, repeat_group)
  • Pure PyTorch SSM implementation (preparing for vLLM Triton op integration)

0.2.3

  • Fixed d_xb default value computation in configuration
  • Removed unsupported device/dtype kwargs from RMSNorm calls

0.2.2

  • Fixed vLLM 0.14+ compatibility issues with Mamba ops API

0.2.1

  • Updated README, removed SFT model reference

0.2.0

  • Initial public release with vLLM plugin system integration

License

Apache 2.0

Download files

Source Distribution

qwerky_vllm_models-0.2.25.tar.gz (24.6 kB)

Built Distribution

qwerky_vllm_models-0.2.25-py3-none-any.whl (23.6 kB)

File details

Details for the file qwerky_vllm_models-0.2.25.tar.gz.

File metadata

  • Size: 24.6 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.2.0 CPython/3.12.3

File hashes

Hashes for qwerky_vllm_models-0.2.25.tar.gz

  • SHA256: 7c702f4454660c5de30633ab79c678d1f5097509d98820048fa9da4bcaeb2e37
  • MD5: bb861389b27810f5d904689a57bfc6f9
  • BLAKE2b-256: 3f3046b9aa561ddda4d209f5554098e09a9fc44d76ad905dc22ff328f4416f32

File details

Details for the file qwerky_vllm_models-0.2.25-py3-none-any.whl.

File hashes

Hashes for qwerky_vllm_models-0.2.25-py3-none-any.whl

  • SHA256: c769ae8677c0cb3270a4c303edc5eed9e108ff2af3a9bb0d825c53dfa915d7bf
  • MD5: 3043c491b13df53acb4974a1c3c2217d
  • BLAKE2b-256: 658628e32b80a8ecb6a5cb1a59d6b088d770e2115c4a48db4862f1bd94722d39
