Chain small language models to outperform large ones — runs locally in 8 GB of RAM
S.U.T.R.A
Structured Universal Transfer via Retrieval Adaptation
A lightweight Python library for orchestrating structured demonstration handoffs between small language models — enabling chains of 2B–7B models to collectively approach SOTA-level performance.
Architecture
```
┌─────────────────────────────────────────────────────────────────┐
│                       AgentHandoff.run()                        │
├─────────────────────────────────────────────────────────────────┤
│                                                                 │
│  ┌──────────┐     ┌──────────┐     ┌──────────┐    ┌──────────┐ │
│  │  Query   │────▶│ Model A  │────▶│  Parser  │───▶│  Cache   │ │
│  │          │     │   (3B)   │     │          │    │          │ │
│  └──────────┘     └──────────┘     └──────────┘    └────┬─────┘ │
│                                                         │       │
│  ┌──────────┐     ┌──────────┐                          │       │
│  │  Output  │◀────│ Model B  │◀─────────────────────────┘       │
│  │          │     │   (7B)   │     demonstrations               │
│  └──────────┘     └──────────┘                                  │
│                                                                 │
└─────────────────────────────────────────────────────────────────┘
```
Model A generates an answer and 3 behavioral demonstrations.
The parser extracts those demonstrations into a HandoffPacket.
Model B receives the demonstrations as few-shot examples alongside the original query — producing a higher-quality final answer.
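For reference, here is a minimal sketch of the two dataclasses involved. The HandoffResult fields match the Quick Start example below; the HandoffPacket fields are an assumption (protocol.py holds the real definitions):

```python
from dataclasses import dataclass, field

@dataclass
class HandoffPacket:
    # Hypothetical shape -- see protocol.py for the real definition.
    query: str                  # the original user query
    answer: str                 # Model A's own answer
    demonstrations: list[str] = field(default_factory=list)  # extracted examples

@dataclass
class HandoffResult:
    # Field names as used in the Quick Start section below.
    answer: str                 # Model B's final answer
    demonstrations: list[str]   # demonstrations that were forwarded
    latency_ms: float           # end-to-end pipeline latency
    cache_hit: bool             # True if demonstrations came from cache
```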
Before / After
❌ Raw Context Passing
```python
import ollama

# Just forward the raw query — Model B has no style guidance
response = ollama.generate(model="mistral:7b", prompt="Write a safe Rust function to read a file")
print(response["response"])
```
✅ Structured Handoff (S.U.T.R.A)
```python
from agent_handoff import AgentHandoff

handoff = AgentHandoff(model_a="llama3.2:3b", model_b="mistral:7b")
result = handoff.run("Write a safe Rust function to read a file")
print(result)  # Model B's answer, guided by Model A's demonstrations
```
Model B receives curated demonstrations that prime it for the right style, format, and reasoning — without any extra training.
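To make the handoff concrete, here is an illustrative round trip through the parser. The raw text below is invented (the default tags are assumed to follow the same <answer>/<demonstrations> scheme shown under Custom Templates), and the regexes only sketch what parser.py does:

```python
import re

# Invented Model A output for illustration purposes.
raw = """<answer>Use std::fs::read_to_string and propagate errors with ?.</answer>
<demonstrations>
<demonstration>Return io::Result<String> instead of unwrapping.</demonstration>
<demonstration>Validate the path before opening the file.</demonstration>
<demonstration>Prefer buffered reads for large files.</demonstration>
</demonstrations>"""

answer = re.search(r"<answer>(.*?)</answer>", raw, re.DOTALL).group(1).strip()
demos = [d.strip() for d in re.findall(r"<demonstration>(.*?)</demonstration>", raw, re.DOTALL)]
print(f"answer: {answer}")
print(f"{len(demos)} demonstrations extracted")  # -> 3 demonstrations extracted
```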
Installation
```bash
# Clone and install
git clone <your-repo-url>
cd S.U.T.R.A
pip install -r requirements.txt

# Ensure Ollama is running with your models
ollama pull llama3.2:3b
ollama pull mistral:7b
```
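The package is also published on PyPI as sutra_llm, so installing the release directly should work as well:

```bash
pip install sutra-llm
```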
Requirements: Python 3.10+, Ollama running locally.
Quick Start
```python
from agent_handoff import AgentHandoff, DemonstrationCache

# Basic usage
handoff = AgentHandoff(model_a="llama3.2:3b", model_b="mistral:7b")
answer = handoff.run("Explain the CAP theorem with real-world analogies")
print(answer)

# Detailed result with metrics
result = handoff.run_detailed("Write a Python merge sort")
print(f"Answer: {result.answer[:200]}…")
print(f"Latency: {result.latency_ms:.0f} ms")
print(f"Cache hit: {result.cache_hit}")
print(f"Demonstrations used: {len(result.demonstrations)}")
```
Manual Chaining (Advanced)
```python
# Step 1: Generate demonstrations only
packet = handoff.generate_demonstrations("Implement a binary search tree in Python")
print(f"Got {len(packet.demonstrations)} demonstrations")

# Step 2: Use demonstrations with any model
final = handoff.generate_final("Implement a binary search tree in Python", packet.demonstrations)
print(final)
```
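Because the two steps are exposed separately, longer chains are just composition. Below is a minimal sketch of a hypothetical three-model chain; the staging shown is one way to reuse the API, not a built-in mode (see "Extensible" under Core Principles):

```python
from agent_handoff import AgentHandoff

# Stage 1: a 3B model demonstrates, a second 3B model drafts.
# Stage 2: the second 3B model demonstrates again, a 7B model finalizes.
stage1 = AgentHandoff(model_a="llama3.2:3b", model_b="qwen2.5:3b")
stage2 = AgentHandoff(model_a="qwen2.5:3b", model_b="mistral:7b")

query = "Implement a binary search tree in Python"

packet = stage1.generate_demonstrations(query)               # llama3.2:3b
draft = stage1.generate_final(query, packet.demonstrations)  # qwen2.5:3b

# Feed the draft back in so the refined demonstrations can improve on it.
refined = stage2.generate_demonstrations(f"{query}\n\nDraft to improve:\n{draft}")
final = stage2.generate_final(query, refined.demonstrations)  # mistral:7b
print(final)
```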
Caching
Demonstrations are cached by default (SHA-256 hash of the query, 1-hour TTL).
```python
from agent_handoff import AgentHandoff, DemonstrationCache

# Custom cache with 24h TTL
cache = DemonstrationCache(default_ttl=86400)
handoff = AgentHandoff(model_a="llama3.2:3b", model_b="mistral:7b", cache=cache)

# First call generates & caches demonstrations
handoff.run("Explain monads")

# Second call is faster — demonstrations served from cache
handoff.run("Explain monads")

# Persist cache to disk
cache.save("demo_cache.json")

# Load cache in a new session
new_cache = DemonstrationCache()
new_cache.load("demo_cache.json")
```
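For reference, the key derivation described above amounts to hashing the query text. The exact form of hash_query (listed under utils.py) is an assumption, but a SHA-256 hex digest would look like this:

```python
import hashlib

def hash_query(query: str) -> str:
    # Stable cache key: hex digest of the UTF-8 encoded query text.
    return hashlib.sha256(query.encode("utf-8")).hexdigest()

print(hash_query("Explain monads"))  # identical queries -> identical keys -> cache hit
```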
Benchmark
Compare handoff vs. raw context passing across 10 test queries:
```bash
# Default models
python -m agent_handoff.benchmark

# Custom models
python -m agent_handoff.benchmark --model-a qwen2.5:3b --model-b deepseek-coder:6.7b

# Coding queries only
python -m agent_handoff.benchmark --coding-only

# Custom Ollama host
python -m agent_handoff.benchmark --host http://192.168.1.100:11434
```
Output:
```
Query                                                   Method    Latency(ms)  Prompt Tok  Compl Tok
────────────────────────────────────────────────────────────────────────────────────────────────────
Write a Python function that merges two sorted lists…  raw            1234.5          45        312
Write a Python function that merges two sorted lists…  handoff        3456.7         182        298
...
AVG RAW                                                               1456.3          52        287
AVG HANDOFF                                                           3891.2         195        301
```
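As the sample output suggests, the handoff roughly doubles to triples latency and inflates the prompt token count, since Model A must run first and its demonstrations are injected into Model B's prompt. The benchmark exists to make that quality-versus-cost trade-off measurable for your own model pair.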
Custom Templates
Override the default prompts to tailor the handoff for your domain:
```python
handoff = AgentHandoff(
    model_a="llama3.2:3b",
    model_b="mistral:7b",
    templates={
        "prompt_a": "You are a security expert. Answer in <answer> tags, "
                    "then give 3 vulnerability examples in <demonstration> tags "
                    "inside <demonstrations>.\nQuery: {query}",
        "prompt_b": "Security examples:\n{demonstrations}\n\n"
                    "Apply the same security mindset to answer:\n{query}",
    },
)
```
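Note the placeholders: judging by the example above, prompt_a is a format string that receives {query}, while prompt_b receives both {demonstrations} and {query}. Custom templates should keep those placeholders so the orchestrator can fill them in at each step.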
Interactive CLI
Launch the Claude Code-style interactive terminal:
```bash
# Auto-detect models from Ollama
python -m agent_handoff

# Or specify models directly
python -m agent_handoff --model-a llama3.2:3b --model-b mistral:7b

# Run benchmarks instead
python -m agent_handoff benchmark
```
The CLI will:
- Auto-detect all models pulled in your local Ollama instance
- Let you pick Model A (demonstration generator) and Model B (final responder)
- Drop you into an interactive REPL where you type queries and watch the full pipeline run in real time
Slash commands inside the REPL:
| Command | Description |
|---|---|
| /models | Show current model pair |
| /swap | Swap Model A ↔ Model B |
| /cache | Show cache stats |
| /clear | Clear demonstration cache |
| /help | Show help |
| /exit | Quit |
Project Structure
```
agent_handoff/
├── __init__.py        # Package exports
├── __main__.py        # python -m agent_handoff (CLI) or benchmark
├── cli.py             # Interactive Claude Code-style terminal REPL
├── protocol.py        # HandoffPacket, HandoffResult dataclasses
├── parser.py          # XML + regex extraction of answer/demonstrations
├── cache.py           # In-memory cache with TTL + file persistence
├── templates.py       # Prompt templates for Model A and Model B
├── handoff.py         # AgentHandoff orchestrator
├── benchmark.py       # Latency/token comparison script
└── utils.py           # hash_query, truncate_text helpers

tests/
├── test_parser.py     # Parser unit tests
├── test_cache.py      # Cache unit tests
├── test_protocol.py   # Protocol unit tests
└── test_utils.py      # Utils unit tests
```
Core Principles
| Principle | Detail |
|---|---|
| Zero-training | Pure inference — no fine-tuning, no RLHF |
| Low-resource | Designed for 8 GB RAM, CPU-only with 2B–7B models |
| Extensible | Chain 3+ models by composing generate_demonstrations / generate_final |
| Production-ready | Type hints, docstrings, logging, error handling throughout |
Future Plans
- Multi-hop chaining — Chain N models where each refines the previous demonstrations
- Quality scoring — Automated evaluation of handoff vs. raw outputs using a judge model
- Async pipeline — Parallel demonstration generation across multiple queries
- Demonstration selection — Intelligent filtering of the most relevant demonstrations
- Council mode — Multiple models vote on the best demonstrations before forwarding
License
MIT