
Granite Switch: Composable model building


Granite Switch — Build AI models like you build software


| Browse Adapters | Pre-composed Models on HF | Tutorials |

Software is built from libraries — you pick the ones you need, compose them, and ship. Granite Switch brings this to AI models, starting with the Granite family: choose adapters for RAG, safety, factuality, and more, compose them into a single model, and deploy with one command. Swap or upgrade any component independently, just like updating a dependency.

Small models with the right adapters consistently outperform much larger generalist models on targeted tasks. Activated LoRA (aLoRA) makes this practical at scale: all adapters share one KV cache, activating on demand — so one deployment serves many capabilities with no memory or latency overhead.

Granite Switch: adapters stack, accuracy improves

Key Features

  • Composable — Combine independently developed adapters into one checkpoint, whether IBM's or yours. Swap, upgrade, or customize without retraining.
  • Fast — Built on IBM's Activated LoRA technology for efficient KV cache reuse, low latency, and high inference throughput.
  • Accurate — Task-specific adapters can match and even surpass the accuracy of significantly larger generalist models, while requiring only a fraction of the serving cost. See the adapter catalog for benchmark comparisons across all 12 adapters.
  • Inference-ready — Deploy with vLLM for production or HuggingFace for prototyping. Same checkpoint, no conversion step.

aLoRA vs LoRA live race — aLoRA finishes first with KV cache reuse

aLoRA completes 20 of 32 RAG queries while standard LoRA is still waiting — same model, same hardware, different adapter technology.
Reproduce it yourself on Colab →

Quick Start

Install

pip install "granite-switch[vllm]"

Other install options depending on your use case:

pip install "granite-switch[compose]"   # Compose modular models
pip install "granite-switch[hf]"        # HuggingFace inference
pip install "granite-switch[vllm20]"    # vLLM 0.20+ (requires CUDA 13+)
pip install "granite-switch[dev]"       # Everything

Requires Python 3.9+ and PyTorch 2.0+. Two vLLM backends are available: the granite-switch[vllm] extra (vLLM 0.19.x) for broad CUDA 12.x compatibility, and the granite-switch[vllm20] extra (vLLM 0.20+) for the latest performance improvements on CUDA 13+.
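The version requirements above can be checked before installing. The following is a minimal sketch using only the standard library; the helper names (`major_minor`, `meets_minimum`, `installed_version`) are illustrative and not part of Granite Switch, and `torch` is assumed to be the installed distribution name for PyTorch.

```python
import re
import sys
from importlib import metadata


def major_minor(version: str) -> tuple:
    """Parse the leading 'MAJOR.MINOR' of a version string such as '2.1.0+cu121'."""
    match = re.match(r"(\d+)\.(\d+)", version)
    if match is None:
        raise ValueError(f"unrecognized version string: {version!r}")
    return int(match.group(1)), int(match.group(2))


def meets_minimum(version: str, minimum: tuple) -> bool:
    """True if the version is at least the (major, minor) minimum."""
    return major_minor(version) >= minimum


def installed_version(dist_name: str):
    """Return the installed version of a distribution, or None if it is absent."""
    try:
        return metadata.version(dist_name)
    except metadata.PackageNotFoundError:
        return None


# Report on the two requirements stated above: Python 3.9+ and PyTorch 2.0+.
python_ok = sys.version_info[:2] >= (3, 9)
torch_version = installed_version("torch")  # assumed distribution name
torch_ok = torch_version is not None and meets_minimum(torch_version, (2, 0))
print(f"Python >= 3.9: {python_ok}")
print(f"PyTorch >= 2.0: {torch_ok} (found: {torch_version})")
```

The check only reads package metadata, so it does not import PyTorch or touch the GPU.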

Compose a Model

Compose a base Granite model with adapter libraries into a single deployable checkpoint:

python -m granite_switch.composer.compose_granite_switch \
  --base-model ibm-granite/granite-4.1-3b \
  --adapters ibm-granite/granitelib-core-r1.0 ibm-granite/granitelib-rag-r1.0 ibm-granite/granitelib-guardian-r1.0 \
  --output ./my-model

Use the Adapter Composer to browse available adapters, compare benchmarks, and generate a ready-to-run compose command.

This downloads the base model, embeds compatible LoRA adapters (with a preference towards activated LoRA), adds control tokens and a chat template, and produces a model directory that works with both HuggingFace and vLLM.
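If composition is part of a build script, the same command can be assembled from Python. This is a sketch that only wraps the CLI invocation shown above with `subprocess`; the function names `build_compose_command` and `compose` are hypothetical helpers, and only the three flags from the command above (`--base-model`, `--adapters`, `--output`) are used.

```python
import shlex
import subprocess
import sys


def build_compose_command(base_model, adapters, output_dir):
    """Build the argv list for the compose CLI shown above."""
    return [
        sys.executable, "-m", "granite_switch.composer.compose_granite_switch",
        "--base-model", base_model,
        "--adapters", *adapters,
        "--output", output_dir,
    ]


def compose(base_model, adapters, output_dir, dry_run=True):
    """Print the command (dry run) or execute it, raising on a non-zero exit."""
    cmd = build_compose_command(base_model, adapters, output_dir)
    if dry_run:
        print(shlex.join(cmd))
    else:
        subprocess.run(cmd, check=True)
    return cmd


# Dry run: prints the command without downloading or composing anything.
compose(
    "ibm-granite/granite-4.1-3b",
    ["ibm-granite/granitelib-core-r1.0", "ibm-granite/granitelib-rag-r1.0"],
    "./my-model",
)
```

Passing `dry_run=False` runs the composer in the current interpreter's environment, which assumes `granite-switch[compose]` is installed there.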

Or skip composition and use a pre-composed model from Hugging Face.

Run Inference

Install Mellea and start a vLLM server:

pip install mellea
python -m vllm.entrypoints.openai.api_server --model ibm-granite/granite-switch-4.1-3b-preview --port 8000

Then call an adapter from Python:

from mellea.backends.openai import OpenAIBackend
from mellea.stdlib.components.chat import Message
from mellea.stdlib.components.intrinsic.guardian import guardian_check
from mellea.stdlib.context import ChatContext

backend = OpenAIBackend(
    model_id="ibm-granite/granite-switch-4.1-3b-preview",
    base_url="http://localhost:8000/v1",
    api_key="unused",
)
backend.register_embedded_adapter_model("ibm-granite/granite-switch-4.1-3b-preview")

ctx = ChatContext().add(Message("user", "Group X people are all lazy."))
score = guardian_check(ctx, backend, "social_bias", target_role="user")
print(f"social_bias score: {score:.3f}")
# => social_bias score: 0.964
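Before running the client code, it can help to confirm that the server is up and serving the expected model. The sketch below queries the OpenAI-compatible `/v1/models` endpoint with the standard library; the helper names are illustrative, and the base URL is assumed to match the `--port 8000` server started above.

```python
import http.client
import json
import urllib.request


def models_url(base_url: str) -> str:
    """Join the API base URL (e.g. http://localhost:8000/v1) with /models."""
    return base_url.rstrip("/") + "/models"


def parse_model_ids(payload: dict) -> list:
    """Extract model IDs from an OpenAI-style model list response."""
    return [entry["id"] for entry in payload.get("data", [])]


def served_models(base_url: str, timeout: float = 2.0):
    """Return the served model IDs, or None if the server is unreachable."""
    try:
        with urllib.request.urlopen(models_url(base_url), timeout=timeout) as resp:
            payload = json.load(resp)
    except (OSError, ValueError, http.client.HTTPException):
        # Connection refused, timeout, HTTP error, or a non-JSON body.
        return None
    return parse_model_ids(payload)


def check_server(base_url: str, model_id: str) -> bool:
    """True if the server responds and lists the given model ID."""
    ids = served_models(base_url)
    return ids is not None and model_id in ids
```

For the setup above, `check_server("http://localhost:8000/v1", "ibm-granite/granite-switch-4.1-3b-preview")` should return `True` once the server has finished loading.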

How It Works

With standard LoRA, switching adapters in a multi-step pipeline means discarding and recomputing the KV cache for each step. Granite Switch embeds all adapters in a single checkpoint and activates them on demand via control tokens — a technique called Activated LoRA (aLoRA):

  1. Control tokens — Each adapter has a dedicated token (e.g., <guardian>, <query_rewrite>). When the token appears in the input, its adapter activates for subsequent positions.
  2. KV cache isolation — Adapters never see each other's internal state. Every adapter reads from the base model's KV cache only, which is what allows independent development and composition without joint training.
  3. Per-position routing — LoRA weights are selected per token position, not per request. This means the same KV cache is reused across adapter invocations, eliminating redundant prefill and enabling high-throughput multi-step pipelines.
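The routing idea in the three points above can be illustrated with a toy model. This is not the Granite Switch implementation: it is a pure-Python sketch that assumes an adapter becomes active on the positions after its control token (as described in point 1), and the token-to-adapter mapping is hypothetical.

```python
from typing import Dict, List, Optional

# Hypothetical mapping from control token to adapter name.
CONTROL_TOKENS: Dict[str, str] = {
    "<guardian>": "guardian",
    "<query_rewrite>": "query_rewrite",
}


def route_positions(
    tokens: List[str],
    control_tokens: Dict[str, str] = CONTROL_TOKENS,
) -> List[Optional[str]]:
    """Return, for each position, the active adapter name (None = base weights)."""
    active: Optional[str] = None
    routes: List[Optional[str]] = []
    for token in tokens:
        # The current position uses whatever was active before this token...
        routes.append(active)
        # ...and a control token switches routing for the positions after it.
        if token in control_tokens:
            active = control_tokens[token]
    return routes


def base_only_prefix(routes: List[Optional[str]]) -> int:
    """Count leading positions processed with base weights only."""
    count = 0
    for route in routes:
        if route is not None:
            break
        count += 1
    return count


tokens = ["Is", "this", "safe", "?", "<guardian>", "score", ":"]
routes = route_positions(tokens)
print(routes)
print("positions whose KV entries any adapter could reuse:", base_only_prefix(routes))
```

In this toy model, every position routed to `None` is computed with base weights alone, so its KV cache entries do not depend on which adapter runs later. That is the property that lets several adapter invocations share one cache.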

The technique is architecture-general; Granite is the first supported family. Adapters are developed, benchmarked, and published independently — yet compose into one model that loads in vLLM with zero code changes and serves all capabilities through a single KV cache.
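A back-of-envelope calculation shows why sharing one KV cache matters in a multi-step pipeline. The sketch below is a simplified cost model, not a benchmark: it assumes every step operates on the same shared context plus a few step-specific tokens, counts only prefill tokens, and ignores generated tokens and any same-adapter prefix caching. The numbers are made up for illustration.

```python
def prefill_without_reuse(context_len: int, step_lens: list) -> int:
    """Each step recomputes the shared context under its own adapter."""
    return sum(context_len + step_len for step_len in step_lens)


def prefill_with_reuse(context_len: int, step_lens: list) -> int:
    """The shared context is prefilled once; each step adds only its own tokens."""
    return context_len + sum(step_lens)


# Illustrative numbers: a 2,000-token context and three adapter steps
# (for example query rewrite, answerability, citations) of 8 tokens each.
context_len = 2000
step_lens = [8, 8, 8]

no_reuse = prefill_without_reuse(context_len, step_lens)
reuse = prefill_with_reuse(context_len, step_lens)
print(f"prefill tokens without cache reuse: {no_reuse}")
print(f"prefill tokens with cache reuse:    {reuse}")
print(f"ratio: {no_reuse / reuse:.2f}x")
```

Under these assumptions the saving grows with both the context length and the number of adapter steps, since the context term is paid once instead of once per step.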

Tutorials

New here? Start with a 5-minute notebook and work your way up:

| Notebook | What you'll build | Time |
| Hello Mellea | Call adapters through a clean Python API | 5 min |
| RAG Pipeline | Query rewrite + answerability + citations in one model | 30 min |
| Compose Your Own | Build a custom checkpoint from adapter libraries | 15 min |

All notebooks run on Colab. See tutorials/README.md for the full list and guided learning paths.

Ecosystem

Granite Switch is part of a coordinated stack:

  • Granite Libraries — Pre-trained adapters for RAG, safety, and core capabilities, published on Hugging Face. These are the components you compose into a Switch model.
  • Mellea — Reliable, testable LLM output for Python. Type hints become schemas, docstrings become prompts, and valid output is enforced at the token level — not retried into existence. Mellea orchestrates Granite Switch adapters through a pipeline-oriented API, handling control tokens and constrained decoding so you work with typed function calls, not raw tokens.
  • Granite Switch (this repo) — The composition and serving layer that brings libraries and inference together into one deployable model.

Contributing

Granite Switch was started by IBM Research and is developed in the open. We welcome bug reports, feature requests, and pull requests — see CONTRIBUTING.md for guidelines or open an issue.

License

Apache-2.0 — see LICENSE.
