Skip to main content

Chain small language models to outperform large ones — runs locally on 8GB RAM

Project description

S.U.T.R.A

 ███████╗ ██╗   ██╗ ████████╗ ██████╗   █████╗
 ██╔════╝ ██║   ██║ ╚══██╔══╝ ██╔══██╗ ██╔══██╗
 ███████╗ ██║   ██║    ██║    ██████╔╝ ███████║
 ╚════██║ ██║   ██║    ██║    ██╔══██╗ ██╔══██║
 ███████║ ╚██████╔╝    ██║    ██║  ██║ ██║  ██║
 ╚══════╝  ╚═════╝     ╚═╝    ╚═╝  ╚═╝ ╚═╝  ╚═╝

Structured Universal Transfer via Retrieval Adaptation by Shubham Kumar


ok so idk how to explain this properly but basically i got tired of paying for API calls and also tired of my laptop dying trying to run a 70B model so i built this thing.

the idea came from a weird place — i was reading about how LLMs do in-context learning and thought, what if instead of trying to make one model smarter, you just... make multiple small models argue with each other and then have a slightly bigger one read the argument and write the answer. turns out that actually works. not always. but enough.


the numbers first

because that's what you're actually here for.

ran the full HumanEval benchmark. 164 python coding problems. standard eval everyone uses.

SUTRA vs cloud models (HumanEval pass@1)

Benchmark

SUTRA uses llama-3.3-70b as the backbone + two llama-3.1-8b council passes. cloud model scores from published benchmarks. this is not a parameter-count comparison — SUTRA costs 4 inference calls vs 1, but runs on consumer hardware with no API subscription.

SUTRA sits above GPT-4 original. on a machine with 8GB RAM. running entirely offline.

full benchmark breakdown

method pass@1 solved delta
raw llama-3.3-70b (baseline) 0.805 132/164
SUTRA council 0.854 140/164 +8 problems

council only ran on the 32 problems the baseline failed. rescued 8 of them. zero regressions — if baseline passed it, council never ran on it so it literally cannot make things worse.

rescued: HumanEval/8 /32 /64 /65 /83 /86 /93 /121

hard problems specifically (20 problems, HumanEval 32–163 range)

this is where it actually matters. easy problems don't need council — the large model handles them fine alone.

method pass@1 solved
raw llama-3.3-70b 0.600 12/20
SUTRA council 0.800 16/20
delta +0.200 +4 problems

+20% on hard problems. 25% rescue rate on failures. that's the real signal.

easy problems (same 20-problem test)

method pass@1 solved
raw llama-3.3-70b 0.900 18/20
SUTRA council 0.800 16/20
delta -0.100 -2 problems

council makes easy problems slightly worse. the 8b critique adds noise when the 70b already knows the answer. that's why /quick exists — skip council when you don't need it.

raw data in benchmark/sutra_full_results.json.


what even is this

it's a pipeline. four stages. runs entirely locally on ollama. never puts two models in RAM at the same time because 8GB doesn't go very far.

your query
  ↓
small model (temp 0.3)  →  answer A
small model (temp 0.8)  →  answer B   ← same model, different temp = genuinely different answer
small model (temp 0.3)  →  critiques both
large model (temp 0.2)  →  reads everything, writes final answer

the temperature thing matters. running the same model twice at different temps gives you one conservative answer and one exploratory one. the critique finds where they disagree or where both went wrong. the large model synthesizes something that (hopefully) avoids both failure modes.

is this just fancy prompting? kind of. but it's fancy prompting with a real benchmark behind it so.


is it fast

lmaooo no. on CPU with 8GB RAM you're looking at 5-10 minutes per query in council mode. on a machine with a GPU it drops to like 1-2 minutes. on Groq's free API it runs in ~30 seconds.

the architecture is sound — the slowness is a hardware constraint not a design flaw. /quick mode skips council and just asks the large model directly when you need speed.


install

you need Ollama running. that's it.

git clone https://github.com/metalheadshubham/SUTRA
cd SUTRA
pip install -r requirements.txt
ollama pull qwen2.5-coder:3b
ollama pull qwen2.5-coder:7b
python -m agent_handoff

Python 3.10+. 8GB RAM minimum. works on Windows, Mac, Linux (Mac is faster bc Apple Silicon).


models

default setup i benchmarked:

  • small model (stages 1, 2, 3): qwen2.5-coder:3b
  • large model (synthesis): qwen2.5-coder:7b

if you have more RAM or a GPU:

  • deepseek-r1:8b as large — actual chain-of-thought reasoning, significantly better on hard problems but very slow on CPU
  • llama3.1:8b as large — good general purpose alternative

the CLI auto-detects your Ollama models and recommends roles based on parameter count.


commands

/quick <query>      skip council, ask large model directly (fast path)
/council <query>    force council mode explicitly
/save               save last output — asks you where
/load <file>        attach a file to context for next query
/unload             remove attached files
/project            see files saved + loaded this session
/logs               last 5 runs with timing
/plugins            loaded plugins + hooks
/install <url>      install a plugin from a .py URL
/help               all commands

multiline input — type """ to open a block, paste whatever, """ to send. good for pasting code you want fixed.


workspace

load files into context so the council reads them without you copy-pasting anything.

 implement a FastAPI router for user auth

❯ /save
  Save to (path or filename): src/auth.py
   Saved to C:\project\src\auth.py · registered in workspace

❯ /load auth.py
   Loaded auth.py (~340 tokens) · attached to context

❯ add rate limiting to this
  [council now sees auth.py automatically]

if the file is large (>800 tokens) it warns you to switch to a bigger model before continuing.


plugins

drop a .py file into ~/.sutra/plugins/. loaded on startup. hooks:

def pre_query(query): ...
def post_answer_a(text): ...
def post_answer_b(text): ...
def post_critique(text): ...
def post_synthesis(text): ...

built-in logger saves every run to ~/.sutra/logs/YYYY-MM-DD.jsonl automatically.


roadmap

  • difficulty router — auto-detect when to use council vs quick
  • MCP server — expose council as an MCP tool so Claude/Cursor can call it
  • math + reasoning benchmarks — HumanEval is coding only, want GSM8K
  • ablation study — prove the critique step is doing something (vs just calling twice)
  • same benchmark on fully local models (no Groq)

project structure

agent_handoff/
├── cli.py          # terminal UI — where you spend your time
├── handoff.py      # AgentHandoff (old) + CouncilHandoff (new)
├── templates.py    # prompts for each council stage
├── protocol.py     # HandoffPacket, CouncilResult dataclasses
├── parser.py       # extracts structured output from model responses
├── cache.py        # SHA-256 keyed cache with TTL
└── utils.py        # helpers
benchmark/
└── sutra_full_results.json   # raw data behind the numbers above
tests/              # 56 tests, all pass

license

MIT. do whatever.just give me the credits


shit works.

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

sutra_llm-0.2.1.tar.gz (47.7 kB view details)

Uploaded Source

Built Distribution

If you're not sure about the file name format, learn more about wheel file names.

sutra_llm-0.2.1-py3-none-any.whl (43.1 kB view details)

Uploaded Python 3

File details

Details for the file sutra_llm-0.2.1.tar.gz.

File metadata

  • Download URL: sutra_llm-0.2.1.tar.gz
  • Upload date:
  • Size: 47.7 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.2.0 CPython/3.10.11

File hashes

Hashes for sutra_llm-0.2.1.tar.gz
Algorithm Hash digest
SHA256 cdb5dcb194a50a5a2e8679a63b9ff251daa06d6bceea823e24b77d715ffd80e6
MD5 188e3932979edd9a2c938c4abf1df61f
BLAKE2b-256 e9826e06a92ed7b21475199e5ee1d9985a4b184ef4f06146deee3942c40079e8

See more details on using hashes here.

File details

Details for the file sutra_llm-0.2.1-py3-none-any.whl.

File metadata

  • Download URL: sutra_llm-0.2.1-py3-none-any.whl
  • Upload date:
  • Size: 43.1 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.2.0 CPython/3.10.11

File hashes

Hashes for sutra_llm-0.2.1-py3-none-any.whl
Algorithm Hash digest
SHA256 efe8b58548bf0cc4bb1ef4ba9147d65d4dfcc8a697993a3b075cfe6ca5f7c6fb
MD5 8acb7e8d980b61de2394345e32b7ad02
BLAKE2b-256 da476efe618d871b023d6a7a7aef7d765e1809ee8f3bde42a516f08ce3bdfcee

See more details on using hashes here.

Supported by

AWS Cloud computing and Security Sponsor Datadog Monitoring Depot Continuous Integration Fastly CDN Google Download Analytics Pingdom Monitoring Sentry Error logging StatusPage Status page