Chain small language models to outperform large ones — runs locally on 8GB RAM

S.U.T.R.A

 ███████╗ ██╗   ██╗ ████████╗ ██████╗   █████╗
 ██╔════╝ ██║   ██║ ╚══██╔══╝ ██╔══██╗ ██╔══██╗
 ███████╗ ██║   ██║    ██║    ██████╔╝ ███████║
 ╚════██║ ██║   ██║    ██║    ██╔══██╗ ██╔══██║
 ███████║ ╚██████╔╝    ██║    ██║  ██║ ██║  ██║
 ╚══════╝  ╚═════╝     ╚═╝    ╚═╝  ╚═╝ ╚═╝  ╚═╝

Structured Universal Transfer via Retrieval Adaptation by Shubham Kumar

ok so idk how to explain this properly but basically i got tired of paying for API calls and also tired of my laptop dying trying to run a 70B model so i built this thing.

the idea came from a weird place — i was reading about how LLMs do in-context learning and thought, what if instead of trying to make one model smarter, you just... make multiple small models argue with each other and then have a slightly bigger one read the argument and write the answer. turns out that actually works. not always. but enough.

the numbers first

because that's what you're actually here for.

ran the full HumanEval benchmark. 164 python coding problems. standard eval everyone uses.

SUTRA vs cloud models (HumanEval pass@1)

SUTRA uses llama-3.3-70b as the backbone + two llama-3.1-8b council passes. cloud model scores from published benchmarks. this is not a parameter-count comparison — SUTRA costs 4 inference calls vs 1, but runs on consumer hardware with no API subscription.

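SUTRA's 0.854 sits above the published score for the original GPT-4. heads up though: these numbers come from the 70b backbone setup described above, not from the 8GB offline setup. rerunning the same benchmark on fully local models (no Groq) is still on the roadmap.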

full benchmark breakdown

method	pass@1	solved	delta
raw `llama-3.3-70b` (baseline)	0.805	132/164	—
SUTRA council	0.854	140/164	+8 problems

council only ran on the 32 problems the baseline failed. rescued 8 of them. zero regressions — if baseline passed it, council never ran on it so it literally cannot make things worse.

rescued: HumanEval/8 /32 /64 /65 /83 /86 /93 /121
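
if you want to sanity-check the arithmetic, here's a small sketch. the counts are copied from the table above; `evaluate`, `baseline`, `council` and `passes` are made-up names for illustration, not the actual benchmark harness.

```python
# counts copied from the benchmark table
TOTAL = 164
BASELINE_SOLVED = 132
COUNCIL_RESCUED = 8

baseline_failed = TOTAL - BASELINE_SOLVED           # problems the council actually saw
council_solved = BASELINE_SOLVED + COUNCIL_RESCUED  # baseline passes + rescues

baseline_pass1 = round(BASELINE_SOLVED / TOTAL, 3)
council_pass1 = round(council_solved / TOTAL, 3)
rescue_rate = COUNCIL_RESCUED / baseline_failed


def evaluate(problems, baseline, council, passes):
    """Count solved problems, calling council only where baseline fails.

    baseline/council map a problem to a candidate solution and
    passes(problem, solution) runs the tests (all hypothetical callables).
    A baseline pass is never re-run, so it cannot regress.
    """
    solved = 0
    for problem in problems:
        if passes(problem, baseline(problem)):
            solved += 1
        elif passes(problem, council(problem)):
            solved += 1
    return solved
```

the fallback shape is why the full-benchmark score can only go up: council output is only ever checked on problems that were already counted as failures.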

hard problems specifically (20 problems, HumanEval 32–163 range)

this is where it actually matters. easy problems don't need council — the large model handles them fine alone.

method	pass@1	solved
raw `llama-3.3-70b`	0.600	12/20
SUTRA council	0.800	16/20
delta	+0.200	+4 problems

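+20 percentage points on hard problems (0.600 → 0.800), with 4 of the 8 baseline failures rescued. on the full benchmark the rescue rate is 25% (8 of 32 failures). that's the real signal.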

easy problems (same 20-problem test)

method	pass@1	solved
raw `llama-3.3-70b`	0.900	18/20
SUTRA council	0.800	16/20
delta	-0.100	-2 problems

council makes easy problems slightly worse. the 8b critique adds noise when the 70b already knows the answer. that's why /quick exists — skip council when you don't need it.

raw data in benchmark/sutra_full_results.json.

what even is this

it's a pipeline. four stages. runs entirely locally on ollama. never puts two models in RAM at the same time because 8GB doesn't go very far.

your query
  ↓
small model (temp 0.3)  →  answer A
small model (temp 0.8)  →  answer B   ← same model, different temp = genuinely different answer
small model (temp 0.3)  →  critiques both
large model (temp 0.2)  →  reads everything, writes final answer

the temperature thing matters. running the same model twice at different temps gives you one conservative answer and one exploratory one. the critique finds where they disagree or where both went wrong. the large model synthesizes something that (hopefully) avoids both failure modes.
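
a minimal sketch of the four stages, assuming a `generate(model, prompt, temperature)` callable that you plug in yourself. the function names and prompt wording here are made up for illustration (the real prompts live in templates.py). `build_request` shows one way to shape a non-streaming Ollama `/api/generate` request body; `keep_alive: 0` asks Ollama to unload the model right after the call, which is one way to avoid holding two models in RAM. whether SUTRA does it exactly like this is an assumption.

```python
import json
from typing import Callable

# Ollama's default local endpoint
OLLAMA_URL = "http://localhost:11434/api/generate"

Generate = Callable[[str, str, float], str]


def build_request(model: str, prompt: str, temperature: float) -> bytes:
    """JSON body for a single non-streaming /api/generate call."""
    payload = {
        "model": model,
        "prompt": prompt,
        "stream": False,
        "options": {"temperature": temperature},
        "keep_alive": 0,  # unload the model as soon as the response is done
    }
    return json.dumps(payload).encode("utf-8")


def council(query: str, small: str, large: str, generate: Generate) -> str:
    """Run answer A, answer B, critique, synthesis, strictly one call at a time."""
    answer_a = generate(small, query, 0.3)  # conservative pass
    answer_b = generate(small, query, 0.8)  # exploratory pass

    # hypothetical prompt wording, not the project's actual templates
    critique_prompt = (
        f"Question:\n{query}\n\nAnswer A:\n{answer_a}\n\nAnswer B:\n{answer_b}\n\n"
        "Point out where the answers disagree and where either is wrong."
    )
    critique = generate(small, critique_prompt, 0.3)

    synthesis_prompt = (
        f"{critique_prompt}\n\nCritique:\n{critique}\n\n"
        "Write the final answer, fixing the problems the critique found."
    )
    return generate(large, synthesis_prompt, 0.2)
```

passing `generate` in as a parameter keeps the pipeline testable without a running Ollama server: swap in a fake function and you can check the call order and temperatures.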

is this just fancy prompting? kind of. but it's fancy prompting with a real benchmark behind it so.

is it fast

lmaooo no. on CPU with 8GB RAM you're looking at 5-10 minutes per query in council mode. on a machine with a GPU it drops to like 1-2 minutes. on Groq's free API it runs in ~30 seconds.

the architecture is sound — the slowness is a hardware constraint not a design flaw. /quick mode skips council and just asks the large model directly when you need speed.

install

you need Ollama running. that's it.

git clone https://github.com/metalheadshubham/SUTRA
cd SUTRA
pip install -r requirements.txt
ollama pull qwen2.5-coder:3b
ollama pull qwen2.5-coder:7b
python -m agent_handoff

Python 3.10+. 8GB RAM minimum. works on Windows, Mac, Linux (Mac is faster bc Apple Silicon).

models

default setup i benchmarked:

small model (stages 1, 2, 3): qwen2.5-coder:3b
large model (synthesis): qwen2.5-coder:7b

if you have more RAM or a GPU:

deepseek-r1:8b as large — actual chain-of-thought reasoning, significantly better on hard problems but very slow on CPU
llama3.1:8b as large — good general purpose alternative

the CLI auto-detects your Ollama models and recommends roles based on parameter count.
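
roughly how role recommendation by parameter count could work. this is a sketch, assuming the size can be read from the tag part of an Ollama model name (the `3b` in `qwen2.5-coder:3b`). `param_billions` and `recommend_roles` are hypothetical names, not the CLI's actual functions.

```python
import re
from typing import Optional


def param_billions(model_name: str) -> Optional[float]:
    """Pull the parameter count (in billions) out of a tag like 'name:7b'."""
    tag = model_name.split(":")[-1].lower()
    match = re.search(r"(\d+(?:\.\d+)?)b\b", tag)
    return float(match.group(1)) if match else None


def recommend_roles(model_names):
    """Smallest sized model becomes 'small', largest becomes 'large'.

    Models whose tag carries no size (e.g. ':latest') are ignored.
    Returns None when no model has a readable size.
    """
    sized = [(param_billions(name), name) for name in model_names]
    sized = [(size, name) for size, name in sized if size is not None]
    if not sized:
        return None
    sized.sort()
    return {"small": sized[0][1], "large": sized[-1][1]}
```

with only one sized model installed, both roles point at the same model, which is just the /quick path with extra steps.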

commands

/quick <query>      skip council, ask large model directly (fast path)
/council <query>    force council mode explicitly
/save               save last output — asks you where
/load <file>        attach a file to context for next query
/unload             remove attached files
/project            see files saved + loaded this session
/logs               last 5 runs with timing
/plugins            loaded plugins + hooks
/install <url>      install a plugin from a .py URL
/help               all commands

multiline input — type """ to open a block, paste whatever, """ to send. good for pasting code you want fixed.
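
a sketch of how slash commands and `"""` blocks could be parsed. `parse_input` and `read_message` are hypothetical helpers, and labelling plain text as `"query"` is an assumption about how the CLI treats input without a slash.

```python
def parse_input(line: str):
    """Split '/quick sort a list' into ('quick', 'sort a list').

    Plain text (no leading slash) is returned as ('query', text).
    """
    line = line.strip()
    if line.startswith("/"):
        name, _, arg = line[1:].partition(" ")
        return name, arg.strip()
    return "query", line


def read_message(lines):
    """Return one message from an iterable of input lines.

    If the first line is a bare triple quote, collect everything up to
    the next bare triple quote and keep the inner line breaks.
    """
    it = iter(lines)
    first = next(it).rstrip("\n")
    if first.strip() != '"""':
        return first
    block = []
    for line in it:
        if line.strip() == '"""':
            break
        block.append(line.rstrip("\n"))
    return "\n".join(block)
```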

workspace

load files into context so the council reads them without you copy-pasting anything.

❯ implement a FastAPI router for user auth

❯ /save
  Save to (path or filename): src/auth.py
  ✓ Saved to C:\project\src\auth.py · registered in workspace

❯ /load auth.py
  ✓ Loaded auth.py (~340 tokens) · attached to context

❯ add rate limiting to this
  [council now sees auth.py automatically]

if the file is large (>800 tokens) it warns you to switch to a bigger model before continuing.
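
the token count is an estimate (note the `~`). a minimal sketch of that kind of check, assuming a rough 4-characters-per-token heuristic. the heuristic, the helper names and the warning text are assumptions, only the 800-token threshold comes from above.

```python
WARN_THRESHOLD = 800   # tokens, from the workspace section
CHARS_PER_TOKEN = 4    # rough heuristic, an assumption


def estimate_tokens(text: str) -> int:
    """Cheap token estimate that needs no tokenizer."""
    return len(text) // CHARS_PER_TOKEN


def load_report(filename: str, text: str):
    """Return (token_estimate, status_message) for a file being attached."""
    tokens = estimate_tokens(text)
    message = f"Loaded {filename} (~{tokens} tokens)"
    if tokens > WARN_THRESHOLD:
        message += " · large file, consider a bigger model"
    return tokens, message
```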

plugins

drop a .py file into ~/.sutra/plugins/. loaded on startup. hooks:

def pre_query(query): ...
def post_answer_a(text): ...
def post_answer_b(text): ...
def post_critique(text): ...
def post_synthesis(text): ...

built-in logger saves every run to ~/.sutra/logs/YYYY-MM-DD.jsonl automatically.
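
an example of what a plugin file might look like. the hook names are the ones listed above; everything else is a sketch. in particular, returning the text unchanged from each hook is an assumption, since what SUTRA does with a hook's return value isn't specified here.

```python
# hypothetical plugin: ~/.sutra/plugins/word_count.py
# tracks how long each stage's output is for the current query

stats = {}


def _record(stage: str, text: str) -> str:
    """Store the word count for a stage and hand the text back untouched."""
    stats[stage] = len(text.split())
    return text


def pre_query(query):
    stats.clear()  # start fresh for every query
    return _record("query", query)


def post_answer_a(text):
    return _record("answer_a", text)


def post_answer_b(text):
    return _record("answer_b", text)


def post_critique(text):
    return _record("critique", text)


def post_synthesis(text):
    return _record("synthesis", text)
```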

roadmap

difficulty router — auto-detect when to use council vs quick
MCP server — expose council as an MCP tool so Claude/Cursor can call it
math + reasoning benchmarks — HumanEval is coding only, want GSM8K
ablation study — prove the critique step is doing something (vs just calling twice)
same benchmark on fully local models (no Groq)

project structure

agent_handoff/
├── cli.py          # terminal UI — where you spend your time
├── handoff.py      # AgentHandoff (old) + CouncilHandoff (new)
├── templates.py    # prompts for each council stage
├── protocol.py     # HandoffPacket, CouncilResult dataclasses
├── parser.py       # extracts structured output from model responses
├── cache.py        # SHA-256 keyed cache with TTL
└── utils.py        # helpers
benchmark/
└── sutra_full_results.json   # raw data behind the numbers above
tests/              # 56 tests, all pass
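
for a rough idea of what a SHA-256 keyed cache with TTL looks like, here's a minimal in-memory sketch. the class name, the key layout (model + temperature + prompt) and the default TTL are assumptions for illustration, not the actual contents of cache.py.

```python
import hashlib
import time


class TTLCache:
    """In-memory cache keyed by a SHA-256 digest, with per-entry expiry."""

    def __init__(self, ttl_seconds: float = 3600.0, clock=time.monotonic):
        self.ttl = ttl_seconds
        self.clock = clock  # injectable so expiry can be tested without sleeping
        self._store = {}    # key -> (expires_at, value)

    @staticmethod
    def make_key(model: str, prompt: str, temperature: float) -> str:
        """Hash everything that affects the output into one hex digest."""
        raw = f"{model}\n{temperature}\n{prompt}".encode("utf-8")
        return hashlib.sha256(raw).hexdigest()

    def put(self, model, prompt, temperature, value):
        key = self.make_key(model, prompt, temperature)
        self._store[key] = (self.clock() + self.ttl, value)

    def get(self, model, prompt, temperature):
        """Return the cached value, or None if missing or expired."""
        key = self.make_key(model, prompt, temperature)
        entry = self._store.get(key)
        if entry is None:
            return None
        expires_at, value = entry
        if self.clock() >= expires_at:
            del self._store[key]  # drop stale entries lazily on read
            return None
        return value
```

including the temperature in the key matters for this pipeline: answer A and answer B use the same model and prompt, so without it the second call would just return the first answer from cache.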

license

MIT. do whatever. just give me credit.

shit works.

Release history

0.2.1 (this version): Apr 2, 2026
0.1.1: Mar 28, 2026
0.1.0: Mar 26, 2026

Download files

Source Distribution

sutra_llm-0.2.1.tar.gz (47.7 kB)

Uploaded Apr 2, 2026 Source

Built Distribution

sutra_llm-0.2.1-py3-none-any.whl (43.1 kB)

Uploaded Apr 2, 2026 Python 3

File details

Details for the file sutra_llm-0.2.1.tar.gz.

File metadata

Download URL: sutra_llm-0.2.1.tar.gz
Upload date: Apr 2, 2026
Size: 47.7 kB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.2.0 CPython/3.10.11

File hashes

Hashes for sutra_llm-0.2.1.tar.gz
Algorithm	Hash digest
SHA256	`cdb5dcb194a50a5a2e8679a63b9ff251daa06d6bceea823e24b77d715ffd80e6`
MD5	`188e3932979edd9a2c938c4abf1df61f`
BLAKE2b-256	`e9826e06a92ed7b21475199e5ee1d9985a4b184ef4f06146deee3942c40079e8`

File details

Details for the file sutra_llm-0.2.1-py3-none-any.whl.

File metadata

Download URL: sutra_llm-0.2.1-py3-none-any.whl
Upload date: Apr 2, 2026
Size: 43.1 kB
Tags: Python 3
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.2.0 CPython/3.10.11

File hashes

Hashes for sutra_llm-0.2.1-py3-none-any.whl
Algorithm	Hash digest
SHA256	`efe8b58548bf0cc4bb1ef4ba9147d65d4dfcc8a697993a3b075cfe6ca5f7c6fb`
MD5	`8acb7e8d980b61de2394345e32b7ad02`
BLAKE2b-256	`da476efe618d871b023d6a7a7aef7d765e1809ee8f3bde42a516f08ce3bdfcee`

sutra-llm 0.2.1
