vLLM plugin for Qwerky AI MambaInLlama hybrid models

Qwerky vLLM Models

A vLLM plugin for serving Qwerky AI's MambaInLlama hybrid models without the --trust-remote-code flag.

Installation

pip install vllm qwerky-vllm-models

Usage

After installing, serve Qwerky models with vLLM:

vllm serve QwerkyAI/Qwerky-Llama3.2-Mamba-3B-Llama3.3-70B-base-distill --max-model-len 4096

No extra setup is needed: vLLM discovers the plugin through its entry point, and the model architecture is registered automatically.
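Once the server is running, it can be queried over vLLM's OpenAI-compatible HTTP API. The sketch below assumes the server is listening on vLLM's default address (http://localhost:8000) and uses the /v1/completions route, which suits a base model. The helper names build_completion_request and complete are illustrative.

```python
import json
import urllib.request

MODEL = "QwerkyAI/Qwerky-Llama3.2-Mamba-3B-Llama3.3-70B-base-distill"
# Assumption: vLLM's default bind address; change it if you pass --host/--port.
BASE_URL = "http://localhost:8000"


def build_completion_request(prompt: str, max_tokens: int = 64) -> urllib.request.Request:
    """Build a POST request for the OpenAI-compatible /v1/completions route."""
    payload = {"model": MODEL, "prompt": prompt, "max_tokens": max_tokens}
    return urllib.request.Request(
        f"{BASE_URL}/v1/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def complete(prompt: str, max_tokens: int = 64) -> str:
    """Send the request and return the text of the first completion."""
    with urllib.request.urlopen(build_completion_request(prompt, max_tokens)) as resp:
        body = json.load(resp)
    return body["choices"][0]["text"]
```

With the server from the command above running, `print(complete("The capital of France is"))` prints the model's continuation.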

Supported Models

  • QwerkyAI/Qwerky-Llama3.2-Mamba-3B-Llama3.3-70B-base-distill

How It Works

This package uses vLLM's plugin system (vllm.general_plugins entry point) to register the MambaInLlama model architecture. This means:

  • No fork of vLLM required
  • No --trust-remote-code flag needed
  • Works with a standard vLLM installation
  • Uses vLLM's native Triton-accelerated Mamba kernels
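The registration mechanism described above can be sketched as follows. This is not this package's actual source: the module path in MODEL_CLASS and the entry point name in the comment are hypothetical, and the architecture name is the alias mentioned in the 0.2.5 changelog entry. The calls ModelRegistry.get_supported_archs and ModelRegistry.register_model are the ones vLLM documents for out-of-tree models.

```python
# A plugin package exposes a function through the "vllm.general_plugins"
# entry point group, for example in pyproject.toml (hypothetical names):
#
#   [project.entry-points."vllm.general_plugins"]
#   register_mamba_in_llama = "my_vllm_plugin:register"

ARCHITECTURE = "MambaInLlamaMambaForCausalLM"
# Hypothetical "module:Class" location of the model implementation.
MODEL_CLASS = "my_vllm_plugin.modeling:MambaInLlamaMambaForCausalLM"


def register() -> None:
    """Entry point target: vLLM calls this when it loads general plugins."""
    # Imported lazily so that importing this module does not pull in vLLM.
    from vllm import ModelRegistry

    # Registering by "module:Class" string defers importing the model code
    # until vLLM actually needs the architecture.
    if ARCHITECTURE not in ModelRegistry.get_supported_archs():
        ModelRegistry.register_model(ARCHITECTURE, MODEL_CLASS)
```

Because the architecture is registered in-process by installed, trusted code, vLLM does not need to download and execute modeling code from the model repository, which is why --trust-remote-code is unnecessary.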

Requirements

  • Python >= 3.10
  • vLLM >= 0.14.0
  • PyTorch >= 2.0.0
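A quick way to check an environment against these minimums is sketched below, using only the standard library. The helper names version_tuple and check_requirements are illustrative. The parser looks only at the leading X.Y.Z digits, so a pre-release such as 0.14.0rc1 is treated as 0.14.0.

```python
import re
import sys
from importlib.metadata import PackageNotFoundError, version

# Minimums copied from the Requirements list above, keyed by distribution name.
MINIMUMS = {"vllm": (0, 14, 0), "torch": (2, 0, 0)}


def version_tuple(text: str) -> tuple[int, ...]:
    """Parse the leading 'X.Y.Z' of a version string; missing parts become 0."""
    match = re.match(r"(\d+)(?:\.(\d+))?(?:\.(\d+))?", text)
    if match is None:
        raise ValueError(f"unrecognised version string: {text!r}")
    return tuple(int(part or 0) for part in match.groups())


def check_requirements() -> list[str]:
    """Return human-readable problems; an empty list means every check passed."""
    problems = []
    if sys.version_info < (3, 10):
        problems.append(f"Python {sys.version.split()[0]} is older than 3.10")
    for dist, minimum in MINIMUMS.items():
        try:
            installed = version(dist)
        except PackageNotFoundError:
            problems.append(f"{dist} is not installed")
            continue
        if version_tuple(installed) < minimum:
            wanted = ".".join(map(str, minimum))
            problems.append(f"{dist} {installed} is older than {wanted}")
    return problems
```

Calling `check_requirements()` in the target environment returns one message per unmet requirement.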

Changelog

0.2.6

  • Added the embed_input_ids method required by vLLM's VllmModelForTextGeneration interface
  • Its absence was the root cause of the "This model does not support --runner generate" error

0.2.5

  • Fixed vLLM runner detection: added MambaInLlamaMambaForCausalLM alias for HF config compatibility
  • Added proper protocol inheritance (HasInnerState, IsHybrid) from vllm.model_executor.models.interfaces
  • Fixed class variable type hints (ClassVar[Literal[True]]) for vLLM model inspection
  • Simplified model registration code

0.2.4

  • Complete architecture rewrite with explicit state cache management
  • Separate prefill and decode paths for Mamba layers
  • Grouped-head Mamba support (num_xb_head, num_C_head, repeat_group)
  • Pure PyTorch SSM implementation (preparing for vLLM Triton op integration)

0.2.3

  • Fixed d_xb default value computation in configuration
  • Removed unsupported device/dtype kwargs from RMSNorm calls

0.2.2

  • Fixed vLLM 0.14+ compatibility issues with Mamba ops API

0.2.1

  • Updated README, removed SFT model reference

0.2.0

  • Initial public release with vLLM plugin system integration

License

Apache 2.0


