vLLM plugin for Qwerky AI MambaInLlama hybrid models

Qwerky vLLM Models

A vLLM plugin for serving Qwerky AI's MambaInLlama hybrid models without the --trust-remote-code flag.

Installation

pip install vllm qwerky-vllm-models

Usage

After installing, serve Qwerky models with vLLM:

vllm serve QwerkyAI/Qwerky-Llama3.2-Mamba-3B-Llama3.3-70B-base-distill --max-model-len 4096

The plugin registers the MambaInLlama architecture with vLLM automatically when vLLM loads its plugins at startup, so no extra import or flag is needed.
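
Once the server is up, it can be queried like any other vLLM deployment. The sketch below assumes the server is reachable at vLLM's default OpenAI-compatible address, http://localhost:8000/v1. The helper names (build_completion_request, complete) are illustrative and not part of this package.

```python
import json
import urllib.request

MODEL = "QwerkyAI/Qwerky-Llama3.2-Mamba-3B-Llama3.3-70B-base-distill"
# Assumption: the server runs locally on vLLM's default port.
BASE_URL = "http://localhost:8000/v1"


def build_completion_request(prompt: str, max_tokens: int = 64) -> urllib.request.Request:
    """Build a POST request for the OpenAI-compatible /completions route."""
    payload = {"model": MODEL, "prompt": prompt, "max_tokens": max_tokens}
    return urllib.request.Request(
        f"{BASE_URL}/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def complete(prompt: str, max_tokens: int = 64) -> str:
    """Send the request and return the text of the first completion."""
    with urllib.request.urlopen(build_completion_request(prompt, max_tokens)) as resp:
        body = json.load(resp)
    return body["choices"][0]["text"]
```

With the server running, complete("State space models are") returns the generated continuation as a string. The model is a base distill, so the plain completions route is used here rather than a chat route.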

Supported Models

  • QwerkyAI/Qwerky-Llama3.2-Mamba-3B-Llama3.3-70B-base-distill

How It Works

This package uses vLLM's plugin system (vllm.general_plugins entry point) to register the MambaInLlama model architecture. This means:

  • No fork of vLLM required
  • No --trust-remote-code flag needed
  • Works with standard vLLM installation
  • Uses vLLM's native Triton-accelerated Mamba kernels

Requirements

  • Python >= 3.10
  • vLLM >= 0.14.0
  • PyTorch >= 2.0.0

Changelog

0.2.11

  • CRITICAL FIX: Changed d_inner default from intermediate_size to hidden_size
  • MambaInLlama Mamba layers use d_inner = hidden_size, not intermediate_size
  • Fixed d_xb default: hidden_size // 16 (was hidden_size // 4)
  • This fixes the shape mismatch for all Mamba layer weights (A_log, D, conv1d, dt_proj, in_proj, out_proj)
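
The corrected defaults can be summarized in a few lines. This is a minimal sketch, not the package's configuration class: MambaDims is a hypothetical stand-in, and the sizes used in the example are illustrative only.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class MambaDims:
    """Illustrative stand-in for the Mamba-related config fields."""

    hidden_size: int
    intermediate_size: int
    d_inner: Optional[int] = None
    d_xb: Optional[int] = None

    def __post_init__(self) -> None:
        # 0.2.11: d_inner falls back to hidden_size, not intermediate_size.
        if self.d_inner is None:
            self.d_inner = self.hidden_size
        # 0.2.11: d_xb falls back to hidden_size // 16 (previously // 4).
        if self.d_xb is None:
            self.d_xb = self.hidden_size // 16


dims = MambaDims(hidden_size=3072, intermediate_size=8192)  # example sizes only
print(dims.d_inner, dims.d_xb)
```

Values set explicitly in a checkpoint's config would take precedence over these fallbacks.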

0.2.10

  • Added debug logging to weight loading to diagnose parameter mapping issues
  • Logs first 20 model params, first 20 checkpoint weights, and all skipped weights

0.2.9

  • Fixed weight loading: split fused mha.in_proj into separate q/k/v projections
  • Renamed mha.out_proj to o_proj for checkpoint compatibility
  • Should now load all ~395 parameters instead of just 163
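
To illustrate the kind of split described in 0.2.9, here is a toy version on plain Python lists. It assumes the fused weight stacks the query, key, and value rows in that order; the actual layout, sizes, and parameter names in the checkpoint may differ, and split_fused_in_proj is a hypothetical helper.

```python
from typing import Dict, List, Sequence

Row = List[float]


def split_fused_in_proj(
    fused: Sequence[Row], q_rows: int, kv_rows: int
) -> Dict[str, List[Row]]:
    """Split a fused [q; k; v] weight (one row per output feature) into three blocks."""
    expected = q_rows + 2 * kv_rows
    if len(fused) != expected:
        raise ValueError(f"expected {expected} rows, got {len(fused)}")
    return {
        "q_proj": list(fused[:q_rows]),
        "k_proj": list(fused[q_rows:q_rows + kv_rows]),
        "v_proj": list(fused[q_rows + kv_rows:]),
    }


# Toy example: 4 query rows and 2 rows each for key and value.
fused_weight = [[float(i)] * 3 for i in range(8)]
parts = split_fused_in_proj(fused_weight, q_rows=4, kv_rows=2)
print({name: len(rows) for name, rows in parts.items()})
```

With real tensors the same split would typically be done with torch.split along dimension 0.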

0.2.8

  • Fixed dtype mismatch in SSM scan: F.softplus and torch.exp compute in float32, and their results are now cast back to the original dtype
  • The mismatch caused an "expected BFloat16 but found Float" error in einsum

0.2.7

  • Fixed tensor broadcasting bug in _ssm_scan: A.unsqueeze(0).unsqueeze(-1) -> A.unsqueeze(0).unsqueeze(2)
  • The old broadcasting caused a shape mismatch (8192 vs 16) during SSM discretization
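
The effect of moving the inserted axis can be checked on shapes alone. The sketch below mimics unsqueeze and the standard broadcasting rule on tuples. The layouts are assumptions for illustration: A is taken as (d_inner, d_state) and the time-step tensor as (batch, d_inner, seqlen), with toy sizes; the package's real tensors may be laid out differently.

```python
from typing import Tuple

Shape = Tuple[int, ...]


def unsqueeze(shape: Shape, dim: int) -> Shape:
    """Mimic Tensor.unsqueeze on a shape tuple (negative dims allowed)."""
    if dim < 0:
        dim += len(shape) + 1
    return shape[:dim] + (1,) + shape[dim:]


def broadcast(a: Shape, b: Shape) -> Shape:
    """Apply the standard right-aligned broadcasting rule to two shapes."""
    n = max(len(a), len(b))
    a = (1,) * (n - len(a)) + a
    b = (1,) * (n - len(b)) + b
    out = []
    for x, y in zip(a, b):
        if x != y and 1 not in (x, y):
            raise ValueError(f"cannot broadcast {x} with {y}")
        out.append(y if x == 1 else x)
    return tuple(out)


A = (8, 4)                       # assumed (d_inner, d_state), toy sizes
dt = unsqueeze((2, 8, 5), -1)    # assumed (batch, d_inner, seqlen) plus a trailing axis

old_A = unsqueeze(unsqueeze(A, 0), -1)  # (1, d_inner, d_state, 1)
new_A = unsqueeze(unsqueeze(A, 0), 2)   # (1, d_inner, 1, d_state)

print(broadcast(dt, new_A))  # batch, d_inner, seqlen, d_state
```

Under these assumed layouts, broadcast(dt, old_A) raises an error because seqlen is lined up against d_state, which is the same class of mismatch the fix addresses.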

0.2.6

  • Added the embed_input_ids method required by vLLM's VllmModelForTextGeneration interface
  • The missing method was the root cause of the "This model does not support --runner generate" error

0.2.5

  • Fixed vLLM runner detection: added MambaInLlamaMambaForCausalLM alias for HF config compatibility
  • Added proper protocol inheritance (HasInnerState, IsHybrid) from vllm.model_executor.models.interfaces
  • Fixed class variable type hints (ClassVar[Literal[True]]) for vLLM model inspection
  • Simplified model registration code

0.2.4

  • Complete architecture rewrite with explicit state cache management
  • Separate prefill and decode paths for Mamba layers
  • Grouped-head Mamba support (num_xb_head, num_C_head, repeat_group)
  • Pure PyTorch SSM implementation (preparing for vLLM Triton op integration)

0.2.3

  • Fixed d_xb default value computation in configuration
  • Removed unsupported device/dtype kwargs from RMSNorm calls

0.2.2

  • Fixed vLLM 0.14+ compatibility issues with Mamba ops API

0.2.1

  • Updated README, removed SFT model reference

0.2.0

  • Initial public release with vLLM plugin system integration

License

Apache 2.0
