Chain small language models to outperform large ones — runs locally in 8 GB of RAM
S.U.T.R.A
Structured Universal Transfer via Retrieval Adaptation
A lightweight Python library for orchestrating structured demonstration handoffs between small language models — enabling chains of 2B–7B models to collectively approach SOTA-level performance.
Architecture
```
┌─────────────────────────────────────────────────────────────────┐
│                       AgentHandoff.run()                        │
├─────────────────────────────────────────────────────────────────┤
│                                                                 │
│  ┌──────────┐     ┌──────────┐     ┌──────────┐    ┌──────────┐ │
│  │  Query   │────▶│ Model A  │────▶│  Parser  │───▶│  Cache   │ │
│  │          │     │   (3B)   │     │          │    │          │ │
│  └──────────┘     └──────────┘     └──────────┘    └────┬─────┘ │
│                                                         │       │
│  ┌──────────┐     ┌──────────┐                          │       │
│  │  Output  │◀────│ Model B  │◀─────────────────────────┘       │
│  │          │     │   (7B)   │     demonstrations               │
│  └──────────┘     └──────────┘                                  │
│                                                                 │
└─────────────────────────────────────────────────────────────────┘
```
Model A generates an answer and 3 behavioral demonstrations.
The parser extracts those demonstrations into a HandoffPacket.
Model B receives the demonstrations as few-shot examples alongside the original query — producing a higher-quality final answer.
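For reference, here is a minimal sketch of the two dataclasses involved. The HandoffResult fields match the Quick Start example below; the HandoffPacket fields are an assumption (protocol.py holds the real definitions):

```python
from dataclasses import dataclass, field

@dataclass
class HandoffPacket:
    # Hypothetical shape -- see protocol.py for the real definition.
    query: str                  # the original user query
    answer: str                 # Model A's own answer
    demonstrations: list[str] = field(default_factory=list)  # extracted examples

@dataclass
class HandoffResult:
    # Field names as used in the Quick Start section below.
    answer: str                 # Model B's final answer
    demonstrations: list[str]   # demonstrations that were forwarded
    latency_ms: float           # end-to-end pipeline latency
    cache_hit: bool             # True if demonstrations came from cache
```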
Before / After
❌ Raw Context Passing
```python
import ollama

# Just forward the raw query — Model B has no style guidance
response = ollama.generate(model="mistral:7b", prompt="Write a safe Rust function to read a file")
print(response["response"])
```
✅ Structured Handoff (S.U.T.R.A)
```python
from agent_handoff import AgentHandoff

handoff = AgentHandoff(model_a="llama3.2:3b", model_b="mistral:7b")
result = handoff.run("Write a safe Rust function to read a file")
print(result)  # Model B's answer, guided by Model A's demonstrations
```
Model B receives curated demonstrations that prime it for the right style, format, and reasoning — without any extra training.
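To make the handoff concrete, here is an illustrative round trip through the parser. The raw text below is invented (the default tags are assumed to follow the same <answer>/<demonstrations> scheme shown under Custom Templates), and the regexes only sketch what parser.py does:

```python
import re

# Invented Model A output for illustration purposes.
raw = """<answer>Use std::fs::read_to_string and propagate errors with ?.</answer>
<demonstrations>
<demonstration>Return io::Result<String> instead of unwrapping.</demonstration>
<demonstration>Validate the path before opening the file.</demonstration>
<demonstration>Prefer buffered reads for large files.</demonstration>
</demonstrations>"""

answer = re.search(r"<answer>(.*?)</answer>", raw, re.DOTALL).group(1).strip()
demos = [d.strip() for d in re.findall(r"<demonstration>(.*?)</demonstration>", raw, re.DOTALL)]
print(f"answer: {answer}")
print(f"{len(demos)} demonstrations extracted")  # -> 3 demonstrations extracted
```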
Installation
```bash
# Clone and install
git clone <your-repo-url>
cd S.U.T.R.A
pip install -r requirements.txt

# Ensure Ollama is running with your models
ollama pull llama3.2:3b
ollama pull mistral:7b
```
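The package is also published on PyPI as sutra_llm, so installing the release directly should work as well:

```bash
pip install sutra-llm
```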
Requirements: Python 3.10+, Ollama running locally.
Quick Start
```python
from agent_handoff import AgentHandoff, DemonstrationCache

# Basic usage
handoff = AgentHandoff(model_a="llama3.2:3b", model_b="mistral:7b")
answer = handoff.run("Explain the CAP theorem with real-world analogies")
print(answer)

# Detailed result with metrics
result = handoff.run_detailed("Write a Python merge sort")
print(f"Answer: {result.answer[:200]}…")
print(f"Latency: {result.latency_ms:.0f} ms")
print(f"Cache hit: {result.cache_hit}")
print(f"Demonstrations used: {len(result.demonstrations)}")
```
Manual Chaining (Advanced)
```python
# Step 1: Generate demonstrations only
packet = handoff.generate_demonstrations("Implement a binary search tree in Python")
print(f"Got {len(packet.demonstrations)} demonstrations")

# Step 2: Use demonstrations with any model
final = handoff.generate_final("Implement a binary search tree in Python", packet.demonstrations)
print(final)
```
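Because the two steps are exposed separately, longer chains are just composition. Below is a minimal sketch of a hypothetical three-model chain; the staging shown is one way to reuse the API, not a built-in mode (see "Extensible" under Core Principles):

```python
from agent_handoff import AgentHandoff

# Stage 1: a 3B model demonstrates, a second 3B model drafts.
# Stage 2: the second 3B model demonstrates again, a 7B model finalizes.
stage1 = AgentHandoff(model_a="llama3.2:3b", model_b="qwen2.5:3b")
stage2 = AgentHandoff(model_a="qwen2.5:3b", model_b="mistral:7b")

query = "Implement a binary search tree in Python"

packet = stage1.generate_demonstrations(query)               # llama3.2:3b
draft = stage1.generate_final(query, packet.demonstrations)  # qwen2.5:3b

# Feed the draft back in so the refined demonstrations can improve on it.
refined = stage2.generate_demonstrations(f"{query}\n\nDraft to improve:\n{draft}")
final = stage2.generate_final(query, refined.demonstrations)  # mistral:7b
print(final)
```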
Caching
Demonstrations are cached by default (SHA-256 hash of the query, 1-hour TTL).
```python
from agent_handoff import AgentHandoff, DemonstrationCache

# Custom cache with 24h TTL
cache = DemonstrationCache(default_ttl=86400)
handoff = AgentHandoff(model_a="llama3.2:3b", model_b="mistral:7b", cache=cache)

# First call generates & caches demonstrations
handoff.run("Explain monads")

# Second call is faster — demonstrations served from cache
handoff.run("Explain monads")

# Persist cache to disk
cache.save("demo_cache.json")

# Load cache in a new session
new_cache = DemonstrationCache()
new_cache.load("demo_cache.json")
```
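For reference, the key derivation described above amounts to hashing the query text. The exact form of hash_query (listed under utils.py) is an assumption, but a SHA-256 hex digest would look like this:

```python
import hashlib

def hash_query(query: str) -> str:
    # Stable cache key: hex digest of the UTF-8 encoded query text.
    return hashlib.sha256(query.encode("utf-8")).hexdigest()

print(hash_query("Explain monads"))  # identical queries -> identical keys -> cache hit
```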
Benchmark
Compare handoff vs. raw context passing across 10 test queries:
```bash
# Default models
python -m agent_handoff.benchmark

# Custom models
python -m agent_handoff.benchmark --model-a qwen2.5:3b --model-b deepseek-coder:6.7b

# Coding queries only
python -m agent_handoff.benchmark --coding-only

# Custom Ollama host
python -m agent_handoff.benchmark --host http://192.168.1.100:11434
```
Output:
```
Query                                                   Method    Latency(ms)  Prompt Tok  Compl Tok
────────────────────────────────────────────────────────────────────────────────────────────────────
Write a Python function that merges two sorted lists…  raw            1234.5          45        312
Write a Python function that merges two sorted lists…  handoff        3456.7         182        298
...
AVG RAW                                                               1456.3          52        287
AVG HANDOFF                                                           3891.2         195        301
```
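As the sample output suggests, the handoff roughly doubles to triples latency and inflates the prompt token count, since Model A must run first and its demonstrations are injected into Model B's prompt. The benchmark exists to make that quality-versus-cost trade-off measurable for your own model pair.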
Custom Templates
Override the default prompts to tailor the handoff for your domain:
```python
handoff = AgentHandoff(
    model_a="llama3.2:3b",
    model_b="mistral:7b",
    templates={
        "prompt_a": "You are a security expert. Answer in <answer> tags, "
                    "then give 3 vulnerability examples in <demonstration> tags "
                    "inside <demonstrations>.\nQuery: {query}",
        "prompt_b": "Security examples:\n{demonstrations}\n\n"
                    "Apply the same security mindset to answer:\n{query}",
    },
)
```
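Note the placeholders: judging by the example above, prompt_a is a format string that receives {query}, while prompt_b receives both {demonstrations} and {query}. Custom templates should keep those placeholders so the orchestrator can fill them in at each step.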
Interactive CLI
Launch the Claude Code-style interactive terminal:
```bash
# Auto-detect models from Ollama
python -m agent_handoff

# Or specify models directly
python -m agent_handoff --model-a llama3.2:3b --model-b mistral:7b

# Run benchmarks instead
python -m agent_handoff benchmark
```
The CLI will:
- Auto-detect all models pulled in your local Ollama instance
- Let you pick Model A (demonstration generator) and Model B (final responder)
- Drop you into an interactive REPL where you type queries and watch the full pipeline run in real time
Slash commands inside the REPL:
| Command | Description |
|---|---|
| /models | Show current model pair |
| /swap | Swap Model A ↔ Model B |
| /cache | Show cache stats |
| /clear | Clear demonstration cache |
| /help | Show help |
| /exit | Quit |
Project Structure
```
agent_handoff/
├── __init__.py        # Package exports
├── __main__.py        # python -m agent_handoff (CLI) or benchmark
├── cli.py             # Interactive Claude Code-style terminal REPL
├── protocol.py        # HandoffPacket, HandoffResult dataclasses
├── parser.py          # XML + regex extraction of answer/demonstrations
├── cache.py           # In-memory cache with TTL + file persistence
├── templates.py       # Prompt templates for Model A and Model B
├── handoff.py         # AgentHandoff orchestrator
├── benchmark.py       # Latency/token comparison script
└── utils.py           # hash_query, truncate_text helpers

tests/
├── test_parser.py     # Parser unit tests
├── test_cache.py      # Cache unit tests
├── test_protocol.py   # Protocol unit tests
└── test_utils.py      # Utils unit tests
```
Core Principles
| Principle | Detail |
|---|---|
| Zero-training | Pure inference — no fine-tuning, no RLHF |
| Low-resource | Designed for 8 GB RAM, CPU-only with 2B–7B models |
| Extensible | Chain 3+ models by composing generate_demonstrations / generate_final |
| Production-ready | Type hints, docstrings, logging, error handling throughout |
Future Plans
- Multi-hop chaining — Chain N models where each refines the previous demonstrations
- Quality scoring — Automated evaluation of handoff vs. raw outputs using a judge model
- Async pipeline — Parallel demonstration generation across multiple queries
- Demonstration selection — Intelligent filtering of the most relevant demonstrations
- Council mode — Multiple models vote on the best demonstrations before forwarding
License
MIT