Granite Switch: Composable model building

Granite Switch — Build AI models like you build software

| Browse adapter functions | Pre-composed Models on HF | Tutorials |

Software is built from libraries — you pick the ones you need, compose them, and ship. Granite Switch brings this to AI models: choose adapter functions for RAG, safety, factuality, and more, compose them into a single model, and deploy with one command. Swap or upgrade any component independently, just like updating a dependency.

An adapter function is a LoRA adapter trained to a specific input/output contract — a score, a decision, a rewritten query — with the output schema enforced at the token level by Mellea. This is what makes them composable as software: each function has a known signature, not just a general-purpose text output.
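
To make the idea of a "known signature" concrete, here is a minimal sketch of two output contracts (a score and a decision) written as plain Python parsers. Everything in it is hypothetical: `Contract`, `parse_score`, and `parse_decision` are illustrative names, not part of Granite Switch or Mellea, and the 0 to 1 score range and yes/no vocabulary are assumptions. It also checks the text after generation, whereas the description above says the schema is enforced at the token level during decoding.

```python
from dataclasses import dataclass
from typing import Callable, Generic, TypeVar

T = TypeVar("T")


@dataclass(frozen=True)
class Contract(Generic[T]):
    """Hypothetical description of one adapter function's output contract."""
    name: str
    parse: Callable[[str], T]  # raises ValueError if the text violates the contract


def parse_score(text: str) -> float:
    """Accept only a number in [0, 1] (assumed range, for illustration)."""
    value = float(text)  # float() raises ValueError on non-numeric text
    if not 0.0 <= value <= 1.0:
        raise ValueError(f"score out of range: {value}")
    return value


def parse_decision(text: str) -> bool:
    """Accept only 'yes' or 'no' (assumed vocabulary, for illustration)."""
    mapping = {"yes": True, "no": False}
    key = text.strip().lower()
    if key not in mapping:
        raise ValueError(f"not a decision: {text!r}")
    return mapping[key]


SCORE = Contract("score", parse_score)
DECISION = Contract("decision", parse_decision)

print(SCORE.parse("0.964"))   # a float the caller can compare or threshold
print(DECISION.parse("Yes"))  # a bool the caller can branch on
```

Because each contract yields a typed value rather than free text, downstream code can branch on the result without string matching.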

Small models with the right adapter functions consistently outperform much larger generalist models on targeted tasks. Activated LoRA (aLoRA) makes this practical at scale: all adapter functions share one KV cache, activating on demand — so one deployment serves many capabilities with no memory or latency overhead.

[Figure: Granite Switch: adapters stack, accuracy improves]

Key Features

  • Composable — Combine independently developed adapter functions into one checkpoint, whether IBM's or yours. Swap, upgrade, or customize without retraining.
  • Fast — Built on IBM's Activated LoRA technology for efficient KV cache reuse, low latency, and high inference throughput.
  • Accurate — Task-specific adapter functions can match and even surpass the accuracy of significantly larger generalist models, while requiring only a fraction of the serving cost. See the adapter function catalog for benchmark comparisons across all 12 adapter functions.
  • Inference-ready — Deploy with vLLM for production or HuggingFace for prototyping. Same checkpoint, no conversion step.

[Figure: aLoRA vs LoRA live race telemetry. aLoRA has completed 10 of 16 queries with a 73% KV cache hit rate and 0.64s TTFT, while LoRA has completed 1 of 16 with a 7% KV cache hit rate and 2.08s TTFT. Same model, same hardware, different adapter technology.]
Reproduce it yourself on Colab →

Quick Start

Install

pip install "granite-switch[vllm]"

Other install options depending on your use case:

pip install "granite-switch[compose]"   # Compose modular models
pip install "granite-switch[hf]"        # HuggingFace inference
pip install "granite-switch[vllm20]"    # vLLM 0.20+ (requires CUDA 13+)
pip install "granite-switch[dev]"       # Everything

Requires Python 3.9+ and PyTorch 2.0+. Two vLLM extras are available: granite-switch[vllm] targets vLLM 0.19.x for broad CUDA 12.x compatibility, and granite-switch[vllm20] targets vLLM 0.20+ for the latest performance improvements (requires CUDA 13+).
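
As a rough aid for choosing between the two vLLM extras, the sketch below maps a CUDA major version to an install command. The helper names `choose_vllm_extra` and `install_command` are hypothetical, and the mapping only encodes what is stated above (CUDA 12.x for the vllm extra, CUDA 13+ for vllm20). Other CUDA versions are treated as not covered rather than guessed at.

```python
def choose_vllm_extra(cuda_major: int) -> str:
    """Return the extra name for a CUDA major version (hypothetical helper)."""
    if cuda_major >= 13:
        return "vllm20"  # vLLM 0.20+ is described as requiring CUDA 13+
    if cuda_major == 12:
        return "vllm"    # vLLM 0.19.x is described as the CUDA 12.x option
    raise ValueError(f"CUDA {cuda_major}.x is not covered by the install notes")


def install_command(cuda_major: int) -> str:
    """Build the pip command line for the chosen extra."""
    return f'pip install "granite-switch[{choose_vllm_extra(cuda_major)}]"'


print(install_command(12))
print(install_command(13))
```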

Compose a Model

Compose a base Granite model with adapter libraries into a single deployable checkpoint:

python -m granite_switch.composer.compose_granite_switch \
  --base-model ibm-granite/granite-4.1-3b \
  --adapters ibm-granite/granitelib-core-r1.0 ibm-granite/granitelib-rag-r1.0 ibm-granite/granitelib-guardian-r1.0 \
  --output ./my-model

Use the adapter function composer to browse available adapter functions, compare benchmarks, and generate a ready-to-run compose command.

The compose command downloads the base model, embeds compatible LoRA adapters (with a preference for activated LoRA), adds control tokens and a chat template, and produces a model directory that works with both HuggingFace and vLLM.
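
If you want to drive composition from a script, the command line can be assembled as an argument list. This is only a sketch: `compose_argv` is a hypothetical helper, and it uses only the module path and the three flags shown in the command above. It prints the command rather than running it. Passing the list to `subprocess.run(argv, check=True)` would execute it.

```python
import shlex
import sys
from typing import List, Sequence


def compose_argv(base_model: str, adapters: Sequence[str], output: str) -> List[str]:
    """Build the argument list for the composer (hypothetical helper)."""
    if not adapters:
        raise ValueError("at least one adapter library is required")
    return [
        sys.executable, "-m", "granite_switch.composer.compose_granite_switch",
        "--base-model", base_model,
        "--adapters", *adapters,
        "--output", output,
    ]


argv = compose_argv(
    "ibm-granite/granite-4.1-3b",
    ["ibm-granite/granitelib-core-r1.0", "ibm-granite/granitelib-guardian-r1.0"],
    "./my-model",
)
print(shlex.join(argv))  # shell-quoted form, handy for logging
```

Building a list instead of a single string avoids shell quoting problems when a path contains spaces.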

Or skip composition and use a pre-composed model, such as the ibm-granite/granite-switch-4.1-3b-preview checkpoint used in the inference example.

Run Inference

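Install Mellea and start a vLLM server for the pre-composed model:

pip install mellea
python -m vllm.entrypoints.openai.api_server --model ibm-granite/granite-switch-4.1-3b-preview --port 8000

Then, once the server is up, call the guardian adapter function from Python:
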
from mellea.backends.openai import OpenAIBackend
from mellea.stdlib.components.chat import Message
from mellea.stdlib.components.intrinsic.guardian import guardian_check
from mellea.stdlib.context import ChatContext

backend = OpenAIBackend(
    model_id="ibm-granite/granite-switch-4.1-3b-preview",
    base_url="http://localhost:8000/v1",
    api_key="unused",
)
backend.register_embedded_adapter_model("ibm-granite/granite-switch-4.1-3b-preview")

ctx = ChatContext().add(Message("user", "Group X people are all lazy."))
score = guardian_check(ctx, backend, "social_bias", scoring_schema="user_prompt")
print(f"social_bias score: {score:.3f}")
# => social_bias score: 0.964
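
The client code above assumes the server is already accepting requests, and loading a model can take a while. A minimal readiness check, assuming the server exposes the OpenAI-compatible model listing at /v1/models under the same base URL and needs no API key, could look like the sketch below. The function names `server_ready` and `wait_for_server` are hypothetical.

```python
import time
import urllib.request


def server_ready(base_url: str, timeout: float = 2.0) -> bool:
    """Return True if GET <base_url>/models answers with HTTP 200."""
    url = base_url.rstrip("/") + "/models"
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status == 200
    except OSError:  # covers URLError, HTTPError, refused connections, timeouts
        return False


def wait_for_server(base_url: str, attempts: int = 30, delay: float = 2.0) -> bool:
    """Poll until the server answers or the attempts run out."""
    for _ in range(attempts):
        if server_ready(base_url):
            return True
        time.sleep(delay)
    return False


if __name__ == "__main__":
    print(wait_for_server("http://localhost:8000/v1", attempts=3, delay=1.0))
```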

How It Works

With standard LoRA, each adapter is trained against its own KV distribution, so switching adapter functions within a multi-step flow means discarding and recomputing the KV cache at every step. aLoRA adapter functions are instead trained against a common normalized KV cache, so they can all coexist in a single checkpoint and activate on demand without cross-contamination:

  1. Control tokens — Each adapter function has a dedicated control token (e.g., <guardian>, <query_rewrite>). Placing the token in the input sequence is what triggers activation — the adapter function's LoRA weights apply from that position forward.
  2. KV cache normalization — Because all adapter functions are trained against the same normalized KV cache, they never interfere with each other's internal state. Each activates on top of the shared base KV cache, which is what makes independent development, benchmarking, and composition possible without joint training.
  3. Prefill reuse — LoRA weights are selected per token position, not per request. Because all adapter functions share the same normalized KV cache, the prefill from earlier steps is reused rather than recomputed — eliminating the main latency cost of multi-adapter complex flow control.
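
The control-token step in the list above can be pictured as a per-position lookup. The toy sketch below is not the Granite Switch implementation. It works on string tokens, and it assumes that the control token's own position counts as active and that a later control token switches to a different adapter function. The token names come from the examples above, and `active_adapter_per_position` is a hypothetical name.

```python
from typing import Dict, List, Optional

# Control token -> adapter function name (token names taken from the examples above).
CONTROL_TOKENS: Dict[str, str] = {
    "<guardian>": "guardian",
    "<query_rewrite>": "query_rewrite",
}


def active_adapter_per_position(tokens: List[str]) -> List[Optional[str]]:
    """For each position, name the adapter whose weights would apply (None = base)."""
    active: Optional[str] = None
    result: List[Optional[str]] = []
    for token in tokens:
        if token in CONTROL_TOKENS:
            active = CONTROL_TOKENS[token]  # activation starts at this position
        result.append(active)
    return result


tokens = ["Is", "this", "safe", "?", "<guardian>", "score", ":"]
print(active_adapter_per_position(tokens))
```

Positions before the control token map to None, meaning base-model weights only. In this toy model, that is why their KV entries can be shared by every adapter function.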

Like functions in a software library, adapter functions can be developed and benchmarked independently or jointly. They compose into one deployable model that contains all capabilities, in analogy to statically linked object code.
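
To put rough numbers on the prefill reuse point, the sketch below counts how many tokens must be prefilled across a multi-step flow under two simplified policies: a cache that is discarded whenever the adapter changes, and a cache shared by all adapters. It is a back-of-the-envelope model only. It ignores generated tokens, batching, and cache eviction, and the step sizes and the "base" label are made up for illustration.

```python
from typing import List, Tuple


def prefill_tokens(steps: List[Tuple[str, int]], reuse_across_adapters: bool) -> int:
    """Count tokens prefilled over a flow of (adapter, new_tokens) steps."""
    total = 0
    context = 0   # tokens in the conversation so far
    cached = 0    # tokens whose KV entries are still usable
    previous = None
    for adapter, new_tokens in steps:
        if not reuse_across_adapters and adapter != previous:
            cached = 0  # KV cache from a different adapter is discarded
        context += new_tokens
        total += context - cached  # only uncached tokens need prefill
        cached = context
        previous = adapter
    return total


# Hypothetical flow: a 1000-token context, then three short adapter calls.
steps = [("base", 1000), ("query_rewrite", 20), ("answerability", 20), ("guardian", 20)]
print(prefill_tokens(steps, reuse_across_adapters=False))  # 4120
print(prefill_tokens(steps, reuse_across_adapters=True))   # 1060
```

In this toy flow the shared cache prefills each token once (1060 in total), while the per-adapter cache reprocesses the growing context at every switch (4120).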

Tutorials

New here? Start with a 5-minute notebook and work your way up:

| Notebook | What you'll build | Time |
| Hello Mellea | Call adapters through a clean Python API | 5 min |
| RAG Flow | Query rewrite + answerability + citations in one model | 30 min |
| Compose Your Own | Build a custom checkpoint from adapter function libraries | 15 min |

All notebooks run on Colab. See tutorials/README.md for the full list and guided learning paths.

Ecosystem

Granite Switch is part of a coordinated stack:

  • Granite Models — The base models that Granite Switch builds on. Granite 4.1 is available in 3B, 8B, and 30B parameter sizes on Hugging Face.
  • Granite Libraries — Pre-trained adapter functions for RAG, safety, and core capabilities, published on Hugging Face. These are the components you compose into a Switch model.
  • Mellea — Reliable, testable LLM output for Python. Type hints become schemas, docstrings become prompts, and valid output is enforced at the token level — not retried into existence. Mellea orchestrates Granite Switch adapter functions through an API built for complex flow control, handling control tokens and constrained decoding so you work with typed function calls, not raw tokens.
  • Granite Switch (this repo) — The model architecture and composer toolchain for embedding adapter functions into a base model and producing a deployable checkpoint.
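
The Mellea entry above says that type hints become schemas and docstrings become prompts. The standard-library sketch below illustrates that general idea only. It is not Mellea's API, and `describe` and `answerability` are hypothetical names chosen for the example.

```python
import inspect
from typing import Any, Callable, Dict, get_type_hints


def describe(fn: Callable[..., Any]) -> Dict[str, Any]:
    """Derive a small spec from a function's signature and docstring."""
    hints = get_type_hints(fn)
    returns = hints.pop("return", None)
    return {
        "name": fn.__name__,
        "prompt": inspect.getdoc(fn) or "",                     # docstring as prompt
        "inputs": {k: v.__name__ for k, v in hints.items()},    # parameter types
        "output": getattr(returns, "__name__", str(returns)),   # return type as schema
    }


def answerability(question: str, passage: str) -> bool:
    """Decide whether the passage contains enough information to answer the question."""
    raise NotImplementedError  # hypothetical: a model call would produce the result


spec = describe(answerability)
print(spec)
```

A framework can use a spec like this to restrict decoding to values of the declared return type. That enforcement is the part this sketch leaves out.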

Contributing

Granite Switch was started by IBM Research and is developed in the open. We welcome bug reports, feature requests, and pull requests — see CONTRIBUTING.md for guidelines or open an issue.

License

Apache-2.0 — see LICENSE.
