Portable task-to-code pipeline that works with any LLM. Turn a one-line task into a verified code change — diff + test + verify loop. +55 pts on a 156-check benchmark, 21% faster, ~same tokens.

simplicio-cli

Complete your coding tasks with up to 99% accuracy using any LLM (Claude, DeepSeek, Codex, Gemini, Hermes, OpenClaw, Cursor).

simplicio-cli pipeline hero: one-line task to verified code change

"hide the Delete button for non-admins" → diff + test + applied + verified. Works with OpenRouter, OpenAI, Anthropic, GLM, DeepSeek, Ollama — one env var.

pip install simplicio-cli

Why it works — the numbers

Same model. Same task. Only the prompt changes. Measured, reproducible, deterministic. Fourteen models tested across three runs — five sub-4B tiny models, six frontier 2026 models, and three mid-tier 7B–12B open models. Every one gained at least +14 points when wrapped in simplicio's 6-layer contract.

Tiny models — sub-4B, run on 2026-05-26 (50 runs/side, 260 checks)

| Model | Without simplicio | With simplicio | Gain |
|---|---|---|---|
| Gemma 3 4B (google/gemma-3-4b-it) | 38% | 96% | +58 pts |
| Llama 3.2 3B (meta-llama/llama-3.2-3b-instruct) | 28% | 73% | +45 pts |
| Gemma 3n e4B (google/gemma-3n-e4b-it) | 44% | 88% | +44 pts |
| Phi-4 mini (microsoft/phi-4-mini-instruct) | 36% | 73% | +37 pts |
| Llama 3.2 1B (meta-llama/llama-3.2-1b-instruct) | 26% | 40% | +14 pts |
| Tiny avg (5 models · 10 cases · 260 checks) | 35% | 74% | +39 pts (+112%) |

Not hosted on OpenRouter (requested but skipped): Gemma 3 270M, Gemma 3 1B, Gemma 2 2B, Qwen3 0.6B, Qwen3 1.7B, Qwen2.5 0.5B, Qwen2.5 1.5B, Qwen 3B, Nemotron Nano 4B (OpenRouter's smallest Nemotron is 9B). The sub-4B substitutes listed above were used instead. simplicio still gains +14 to +58 points across these tiny models, including +14 points on a 1B-param model.

Frontier 2026 models — run on 2026-05-26 (60 runs/side, 312 checks)

| Model | Without simplicio | With simplicio | Gain |
|---|---|---|---|
| GPT-5.5 (openai/gpt-5.5) | 38% | 100% | +62 pts |
| Kimi K2.6 (moonshotai/kimi-k2.6) | 40% | 100% | +60 pts |
| Gemini 3.5 Flash (google/gemini-3.5-flash) | 42% | 100% | +58 pts |
| Qwen 3.7 Max (qwen/qwen3.7-max) | 44% | 100% | +56 pts |
| Claude Opus 4.7 (anthropic/claude-opus-4.7) | 42% | 98% | +56 pts |
| DeepSeek V4 Pro (deepseek/deepseek-v4-pro) | 44% | 96% | +52 pts |
| Frontier avg (6 models · 10 cases · 312 checks) | 41% | 99% | +58 pts (+136%) |

Mid-tier 7B–12B open models — earlier run (v0.2.2, 30 runs/side, 156 checks)

| Model | Without simplicio | With simplicio | Gain |
|---|---|---|---|
| Gemma 3 12B (google/gemma-3-12b-it) | 34% | 92% | +58 pts |
| Llama 3.1 8B (meta-llama/llama-3.1-8b-instruct) | 36% | 90% | +54 pts |
| Qwen 2.5 7B (qwen/qwen-2.5-7b-instruct) | 34% | 88% | +54 pts |
| Mid-tier avg (3 models · 10 cases · 156 checks) | 35% | 90% | +55 pts (+156%) |

Across all 14 models and three runs, the average gain is +51 points. Smallest: +14 pts (Llama 3.2 1B; the contract still moves a 1B-param model). Largest: +62 pts (GPT-5.5). The contract helps tiny sub-4B models, frontier reasoning models, and mid-tier 7B–12B models alike: four of the six frontier models hit a 100% pass-rate, and the other two reached 98% and 96%.
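The headline figures can be re-derived from the per-model gains in the three tables above. A small Python check, using only the numbers already listed (the variable names are illustrative):

```python
# Per-model gains in points, copied from the three tables above.
tiny = [58, 45, 44, 37, 14]
frontier = [62, 60, 58, 56, 56, 52]
mid_tier = [58, 54, 54]

gains = tiny + frontier + mid_tier
average_gain = sum(gains) / len(gains)

print(len(gains))              # 14 models
print(round(average_gain))     # 51 (708 / 14 = 50.57...)
print(min(gains), max(gains))  # 14 62
```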

Output-quality signals (rate across all 60 frontier runs)

| Signal | Raw prompt | With simplicio |
|---|---|---|
| DIFF block present | 36% | 98% |
| Target file mentioned | 1% | 100% |
| TEST block present | 88% | 98% |

Cost — tokens & wall-clock (measured, not estimated)

Same provider, same models, same cases. Token counts pulled from the API usage field; latency from time.perf_counter() around each call.

| Side | Tokens / run | Wall-clock / run | Total tokens (60 runs) | Total time |
|---|---|---|---|---|
| Raw prompt | 1,967 | 46.1s | 118,040 | 46m 07s |
| With simplicio | 3,168 | 57.6s | 190,119 | 57m 33s |
| Δ | +61% | +24% | +72,079 | +11m 26s |

simplicio wraps the objective in a 6-layer contract — more input tokens up front, longer completions because the model produces the full DIFF + TEST + EVIDENCE the contract demands instead of a one-line guess. The bill goes up, but so does the pass-rate (41% → 99%) and the DIFF-block rate (36% → 98%) — useful tokens, not chat.

Six frontier models (GPT-5.5, Kimi K2.6, Gemini 3.5 Flash, Qwen 3.7 Max, Claude Opus 4.7, DeepSeek V4 Pro) gained +52 to +62 points when wrapped in simplicio's 6-layer contract, without changing the model and without fine-tuning. Four of six landed at 100% pass-rate with simplicio; the other two reached 98% and 96%.

Full report: bench/results.md · bench/results.pdf · raw outputs under .simplicio/bench_runs/.


How it works

mapper        WHERE   project structure + latest state
precedent     HOW-1   the real snippet in THIS repo that already does it
skill-router  HOW-2   the ONE mapper skill that matches (ranked, not all)
simplicio     BUILD   stacks the 6 layers into one prompt (cache-friendly)
test          JUDGE   contract written as testable states
verify        PROOF   ran it — did it actually pass? loop-fix up to 3x

The idea in one line: don't ask the model to guess — hand it the path. Each layer terminates one decision the model would otherwise hallucinate. Relevant > complete — inject the right context, never all of it.
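One way to picture "stack the layers, keep it cache-friendly" is sketched below. This is an illustration only: the layer names, their order, the sample layer contents, and the `build_prompt` helper are all assumptions, not simplicio's actual template in templates/simplicio_prompt.md. The sketch puts the layers that rarely change first, so two tasks on the same repo share a long identical prefix that a provider-side prompt cache could reuse.

```python
import os.path

# Hypothetical layer order: most stable first, most task-specific last.
LAYER_ORDER = ["WHERE", "HOW-2", "HOW-1", "OBJECTIVE", "CRITERIA", "CONSTRAINTS"]


def build_prompt(layers: dict) -> str:
    """Join the layers in a fixed order under labelled headers."""
    missing = [name for name in LAYER_ORDER if name not in layers]
    if missing:
        raise ValueError(f"missing layers: {missing}")
    return "\n\n".join(f"## {name}\n{layers[name].strip()}" for name in LAYER_ORDER)


# Repo-level context (made-up values) is identical for every task on this repo.
repo_context = {
    "WHERE": "Angular app; screens live under src/app/screen/",
    "HOW-2": "skill: angular-permissions",
}
task_a = build_prompt({
    **repo_context,
    "HOW-1": "*ngIf guard on the Edit button in toolbar.component.html",
    "OBJECTIVE": "hide Delete button for non-admins",
    "CRITERIA": "- no admin perm: button absent from DOM",
    "CONSTRAINTS": "- don't touch save flow",
})
task_b = build_prompt({
    **repo_context,
    "HOW-1": "date pipe usage in list.component.html",
    "OBJECTIVE": "show created date in the list",
    "CRITERIA": "- date visible in each row",
    "CONSTRAINTS": "- build passes",
})

# The shared prefix covers the stable layers and stops at the first task-specific one.
shared = os.path.commonprefix([task_a, task_b])
print(shared)
```

Placing volatile text last is a design choice of this sketch; whether a given provider caches prompt prefixes at all depends on that provider.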


Install

pip install simplicio-cli           # from PyPI
# or
pip install -e .                    # from this repo

Auto-activation in Claude Code (often zero-step)

pip install puts simplicio on your PATH. To make Claude Code automatically route code-edit tasks through simplicio, a skill + hook need to land in ~/.claude/.

Zero-step path (recommended). The first time you run any simplicio command after install, if Claude Code is present (~/.claude/ exists) and the hook is missing, simplicio installs both for you and prints one stderr line. PEP 517 wheels can't execute code on pip install, so this is the closest equivalent that works on every machine.

pip install simplicio-cli
simplicio smoke         # ← first call also installs skill + hook (idempotent)
# stderr: "simplicio: auto-activation installed in Claude Code …"

Opt out before the first call:

export SIMPLICIO_SKIP_AUTO_INIT=1

Explicit path. Same effect, no auto-magic:

simplicio init                 # idempotent
simplicio init --dry-run       # preview only
simplicio init --claude-home <path>   # override target dir

Either way, two files land in ~/.claude/:

| File | Purpose |
|---|---|
| ~/.claude/skills/simplicio-cli/SKILL.md | Skill the agent matches by description when your prompt looks like a code edit |
| ~/.claude/hooks/simplicio-userpromptsubmit.sh + entry in ~/.claude/settings.json | UserPromptSubmit hook that runs simplicio detect on every prompt and injects a hint when the heuristic catches a code-edit task the skill could miss |

A backup of your previous settings.json is written to settings.json.bak before any merge.
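A minimal sketch of what an idempotent merge into settings.json could look like. The nested `hooks` → `UserPromptSubmit` → `hooks` layout is assumed here to follow Claude Code's hook settings format, and `merge_hook` is a hypothetical helper, not simplicio's own code. The backup step is left out so the example stays a pure function on dictionaries.

```python
import copy

HOOK_CMD = "~/.claude/hooks/simplicio-userpromptsubmit.sh"


def merge_hook(settings: dict, command: str = HOOK_CMD) -> dict:
    """Return a copy of settings with the hook command registered exactly once."""
    merged = copy.deepcopy(settings)
    groups = merged.setdefault("hooks", {}).setdefault("UserPromptSubmit", [])
    already_present = any(
        hook.get("command") == command
        for group in groups
        for hook in group.get("hooks", [])
    )
    if not already_present:
        groups.append({"hooks": [{"type": "command", "command": command}]})
    return merged


# "theme" is an arbitrary pre-existing setting, used to show nothing else is touched.
existing = {"theme": "dark", "hooks": {"PreToolUse": []}}
once = merge_hook(existing)
twice = merge_hook(once)

print(once == twice)                            # True: second merge is a no-op
print(len(twice["hooks"]["UserPromptSubmit"]))  # 1
```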

How it works at runtime

After install, every prompt you type in Claude Code flows through two layers:

  1. Skill layer (semantic). Claude reads the SKILL.md description. When your prompt looks like a programming task ("add X to Y.tsx", "fix the auth bug in middleware.py"), Claude considers using simplicio task instead of writing code directly.
  2. Hook layer (deterministic). Every prompt fires simplicio detect via the UserPromptSubmit hook. The classifier scores the prompt (verbs + file extensions + code nouns − read-only cues). Score ≥ 3 → it emits a [SIMPLICIO_PROMPT_HINT] block on stderr. Claude sees the hint alongside your prompt — a hard nudge toward simplicio task <prompt> <repo>.

The layers are complementary. Skill = "Claude might pick simplicio". Hook = "Claude sees the hint regardless".
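The hook-layer scoring described above can be sketched as follows. Only the overall shape (verbs + file extensions + code nouns − read-only cues, threshold 3) comes from the text; the word lists, the weights, and the function names are assumptions made for illustration and will differ from what simplicio detect actually ships.

```python
import re

# Illustrative vocabularies; the real classifier's lists are not documented here.
EDIT_VERBS = {"add", "fix", "hide", "remove", "rename", "refactor", "implement", "update"}
CODE_NOUNS = {"button", "bug", "function", "component", "endpoint", "class"}
READ_ONLY_CUES = {"explain", "why", "what", "describe", "summarize"}
FILE_EXT = re.compile(r"[\w./-]+\.(?:py|ts|tsx|js|jsx|html|css)\b", re.IGNORECASE)
THRESHOLD = 3  # "Score >= 3" from the description above


def score_prompt(prompt: str) -> int:
    """Sum positive cues and subtract read-only cues (weights are assumed)."""
    words = re.findall(r"[a-z]+", prompt.lower())
    score = sum(1 for w in words if w in EDIT_VERBS)
    score += 2 * len(FILE_EXT.findall(prompt))  # a named file is a strong signal
    score += sum(1 for w in words if w in CODE_NOUNS)
    score -= sum(1 for w in words if w in READ_ONLY_CUES)
    return score


def looks_like_code_edit(prompt: str) -> bool:
    return score_prompt(prompt) >= THRESHOLD


print(score_prompt("fix the auth bug in middleware.py"))  # 4
print(score_prompt("explain what middleware.py does"))    # 0
```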

Why UserPromptSubmit and not PreToolUse

UserPromptSubmit fires once, before Claude decides which tool to call — exactly when we want to steer. PreToolUse fires after the decision is made, and again for every tool call in the turn, with no access to the original user prompt. UserPromptSubmit is the right pre-hook for routing decisions.

Disable / re-enable

| Goal | How |
|---|---|
| Block the auto-bootstrap | export SIMPLICIO_SKIP_AUTO_INIT=1 before the first simplicio call |
| Disable hook permanently | Delete ~/.claude/hooks/simplicio-userpromptsubmit.sh and its entry in ~/.claude/settings.json |
| Re-install / repair | simplicio init (idempotent; won't double-write) |
| Preview without writing | simplicio init --dry-run |
| Skill-only (no hook) | Copy .skills/simplicio-cli/SKILL.md to ~/.claude/skills/simplicio-cli/SKILL.md manually, skip simplicio init |

Configure — any LLM, nothing hardcoded

| Provider | SIMPLICIO_MODEL | SIMPLICIO_BASE_URL |
|---|---|---|
| OpenRouter | anthropic/claude-opus-4 | https://openrouter.ai/api/v1 |
| GLM (z.ai) | glm-4.6 | https://api.z.ai/api/paas/v4 |
| DeepSeek | deepseek-chat | https://api.deepseek.com |
| OpenAI | gpt-4.1 | https://api.openai.com/v1 |
| Local (Ollama) | llama3 | http://localhost:11434/v1 |
| Anthropic native | claude-opus-4-7 | (leave unset) |

If SIMPLICIO_BASE_URL is unset and the key is ANTHROPIC_API_KEY, it uses the native Anthropic SDK. Otherwise it uses an OpenAI-compatible client pointed at your base_url — so any OpenAI-like provider works without code changes.
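The selection rule above can be written as a small pure function. This is a sketch of the rule as stated, not the code in providers.py: `pick_client` and the returned labels are hypothetical, and no SDK is imported or called.

```python
from typing import Mapping, Optional


def pick_client(env: Mapping[str, str]) -> dict:
    """Decide which kind of client to build from environment variables."""
    base_url: Optional[str] = env.get("SIMPLICIO_BASE_URL", "").strip() or None
    model = env.get("SIMPLICIO_MODEL", "")
    if base_url is None and env.get("ANTHROPIC_API_KEY"):
        # No base URL and an Anthropic key: use the native Anthropic path.
        return {"client": "anthropic-native", "model": model, "base_url": None}
    # Everything else goes through an OpenAI-compatible client at base_url.
    return {"client": "openai-compatible", "model": model, "base_url": base_url}


print(pick_client({"ANTHROPIC_API_KEY": "sk-example", "SIMPLICIO_MODEL": "claude-opus-4-7"}))
print(pick_client({"SIMPLICIO_BASE_URL": "http://localhost:11434/v1", "SIMPLICIO_MODEL": "llama3"}))
```

Passing the environment in as a mapping (rather than reading os.environ inside the function) keeps the rule easy to test.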

simplicio smoke      # prints provider config + one test call

Use

# index once (caches embeddings; re-run after big changes)
simplicio index --stack angular

# run a task
simplicio task "hide Delete button for non-admins" \
  --stack angular \
  --target src/app/screen/screen.component.html \
  --criteria "- no admin perm: button absent from DOM
- with admin perm: button present" \
  --constraints "- don't touch save flow
- build passes"

Each task: precedent (from cache) → skill match → 6 layers → LLM generates (diff + test + Playwright) → apply → run SIMPLICIO_TEST_CMD → pass? done : send the error back → fix → retry (up to 3x).
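The generate → test → fix loop at the end of that chain can be sketched like this. The function and parameter names are hypothetical, the two callables stand in for the LLM call and for "apply + run SIMPLICIO_TEST_CMD", and reading "up to 3x" as three attempts in total is an assumption.

```python
from typing import Callable, Optional, Tuple


def run_task(
    generate: Callable[[Optional[str]], str],
    apply_and_test: Callable[[str], Tuple[bool, str]],
    max_attempts: int = 3,
) -> Tuple[bool, int]:
    """Generate, test, and feed failures back until a pass or the attempt cap."""
    feedback: Optional[str] = None
    for attempt in range(1, max_attempts + 1):
        change = generate(feedback)           # first call gets no feedback
        passed, log = apply_and_test(change)  # e.g. apply the diff, run the tests
        if passed:
            return True, attempt
        feedback = log                        # send the error back on the next try
    return False, max_attempts


# Toy stand-ins: the "model" only gets it right once it has seen an error.
def fake_generate(feedback: Optional[str]) -> str:
    return "good patch" if feedback else "bad patch"


def fake_test(change: str) -> Tuple[bool, str]:
    return change == "good patch", "AssertionError: button still in DOM"


print(run_task(fake_generate, fake_test))  # (True, 2)
```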


Cache — why it doesn't re-map every time

Embeddings are keyed by content hash, stored in .simplicio/. Unchanged code block → vector reused. Change one file → only that block re-embeds.

| Run | Blocks embedded | Time |
|---|---|---|
| 1st (cold cache) | 3 | ~baseline |
| 2nd (no change) | 0 | ~instant |
| after editing 1 file | 1 | partial |
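A minimal in-memory sketch of a content-hash cache reproduces the 3 / 0 / 1 pattern in the table. The class name, the SHA-256 choice, and the toy embedder are assumptions; the real cache in cache.py persists to .simplicio/ and may key blocks differently.

```python
import hashlib
from typing import Callable, Dict, List


class EmbeddingCache:
    """Maps the hash of a code block's content to its embedding vector."""

    def __init__(self, embed: Callable[[str], List[float]]):
        self._embed = embed
        self._vectors: Dict[str, List[float]] = {}

    def embed_all(self, blocks: List[str]) -> int:
        """Embed any block not seen before; return how many were embedded."""
        embedded = 0
        for block in blocks:
            key = hashlib.sha256(block.encode("utf-8")).hexdigest()
            if key not in self._vectors:
                self._vectors[key] = self._embed(block)
                embedded += 1
        return embedded


cache = EmbeddingCache(embed=lambda text: [float(len(text))])  # toy embedder
blocks = ["def a(): pass", "def b(): pass", "def c(): pass"]
print(cache.embed_all(blocks))  # 3: cold cache
print(cache.embed_all(blocks))  # 0: nothing changed
blocks[1] = "def b(): return 1"
print(cache.embed_all(blocks))  # 1: only the edited block
```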

Benchmark — reproduce in 30 seconds

OPENROUTER_API_KEY= \
  BENCH_MODELS="deepseek/deepseek-v4-pro,qwen/qwen3.7-max,moonshotai/kimi-k2.6,openai/gpt-5.5,anthropic/claude-opus-4.7,google/gemini-3.5-flash" \
  python3 bench/run_offline.py

No project required, stdlib only, deterministic regex scoring — no LLM judges the LLM. Each case runs twice on the same model: raw one-line objective vs simplicio's 6-layer contract. Outputs scored on target-file mention, DIFF block, TEST block, contract-state words. Full numbers in bench/results.md.
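A rough sketch of what deterministic regex scoring over those four signals could look like. The exact patterns used by bench/run_offline.py are not shown here, so the `DIFF` / `TEST` line markers, the `score_output` name, and the sample strings are assumptions.

```python
import re
from typing import Dict, List


def score_output(output: str, target_file: str, state_words: List[str]) -> Dict[str, bool]:
    """Check one model output against four regex/substring signals."""
    return {
        "target_file": target_file in output,
        "diff_block": re.search(r"^DIFF\b", output, re.MULTILINE) is not None,
        "test_block": re.search(r"^TEST\b", output, re.MULTILINE) is not None,
        "state_words": all(
            re.search(rf"\b{re.escape(word)}\b", output, re.IGNORECASE)
            for word in state_words
        ),
    }


contract_style = "\n".join([
    "DIFF",
    "--- a/src/app/screen/screen.component.html",
    "+++ b/src/app/screen/screen.component.html",
    "TEST",
    "it('button absent without admin perm', ...)",
])
one_line_guess = "Just add *ngIf to the button."

target = "src/app/screen/screen.component.html"
words = ["absent", "admin"]
print(sum(score_output(contract_style, target, words).values()))  # 4
print(sum(score_output(one_line_guess, target, words).values()))  # 0
```

Because every check is a fixed pattern, the same output always gets the same score, which is the property the benchmark relies on.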

Full harness (your real project, your real tests)

simplicio bench --cases bench/cases.json --stack angular

Runs each case two ways and runs your real test command (e.g. ng test --watch=false) on each output. Writes the true pass-rate to bench/results.md.

4-quadrant bench — agent × simplicio matrix

Adds the second axis: not just "does the 6-layer wrap help one call?" but "does it still help inside a retry loop?". Same model, same cases — only the cell logic changes.

| | no simplicio | with simplicio |
|---|---|---|
| no agent (1 call) | Q1: baseline | Q2: current bench |
| with agent (loop) | Q3: loop only | Q4: composition |

pip install -e ".[bench]"          # adds fpdf2 for PDF report
OPENROUTER_API_KEY= \
  BENCH_MODELS="google/gemma-3-4b-it" \
  BENCH_MAX_ITERS=3 \
  python3 bench/run_4quadrant.py

Outputs bench/results_4quadrant.{md,pdf,json} + SVG charts under bench/charts/4q_*.svg + per-iteration raw outputs under .simplicio/bench_4q/<model>/case_NN/q*_iter*.txt. Methodology and hypothesis decomposition: docs/benchmark-4quadrant.md.

The matrix decomposes:

  • Prompt effect alone: Q2 − Q1
  • Loop effect alone: Q3 − Q1
  • Prompt effect inside loop: Q4 − Q3 (does simplicio still matter once you loop?)
  • Composition gain over best single axis: Q4 − max(Q2, Q3)
  • Synergy vs linear stacking: Q4 − (Q1 + (Q2−Q1) + (Q3−Q1))
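The five differences above are plain arithmetic on the four pass rates. A short helper (the name `decompose` and the dictionary keys are illustrative), fed with the Run 1 pass rates reported in this section (0%, 60%, 40%, 80%):

```python
def decompose(q1: float, q2: float, q3: float, q4: float) -> dict:
    """Split the 4-quadrant pass rates (in points) into the effects listed above."""
    return {
        "prompt_alone": q2 - q1,
        "loop_alone": q3 - q1,
        "prompt_inside_loop": q4 - q3,
        "gain_over_best_axis": q4 - max(q2, q3),
        "synergy_vs_linear": q4 - (q1 + (q2 - q1) + (q3 - q1)),
    }


run1 = decompose(q1=0, q2=60, q3=40, q4=80)
for name, delta in run1.items():
    print(f"{name}: {delta:+d} pts")
```

A negative `synergy_vs_linear` means the combined gain is smaller than the sum of the two single-axis gains.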

Run 1 — focused single-model, google/gemma-3-4b-it, 5 cases, max_iters=3 (2026-05-26)

| Quadrant | Prompt | Execution | Pass rate | Avg iters | Tokens / pass |
|---|---|---|---|---|---|
| Q1 | raw goal | 1-shot | 0/5 (0%) | 1.00 | 4,683 |
| Q2 | simplicio 6-layer | 1-shot | 3/5 (60%) | 1.00 | 800 |
| Q3 | raw goal | loop w/ feedback | 2/5 (40%) | 3.00 | 3,135 |
| Q4 | simplicio 6-layer | loop w/ feedback | 4/5 (80%) | 1.80 | 1,018 |

Decomposition (rejection threshold |Δ| ≥ 5 pts):

| Hypothesis | Δ | Verdict |
|---|---|---|
| Loop alone closes the gap (simplicio unnecessary once you loop) | Q4 − Q3 = +40 pts | rejected |
| Simplicio alone is enough (loop is overkill) | Q4 − Q2 = +20 pts | rejected |
| Gains stack linearly (no synergy) | Q4 − linear = −20 pts | rejected |

Cost per passing case: Q1 = 4,683 tok / 236s — Q2 = 800 tok / 21s — Q3 = 3,135 tok / 109s — Q4 = 1,018 tok / 20s. Full table + charts in bench/results_4quadrant.md.

Run 2 — wider multi-model, 3 models × 10 cases (partial), max_iters=5 (2026-05-26)

Replicated the matrix across more models and more cases. qwen-2.5-7b covers only the first 5 of 10 cases (wide run was killed mid-execution); claude-3.5-haiku not reached. Aggregate counts every observed (model × case × quadrant) tuple as one observation:

| Quadrant | Prompt | Execution | Pass rate | Avg iters | Tokens / pass | ms / pass |
|---|---|---|---|---|---|---|
| Q1 | raw goal | 1-shot | 0/25 (0%) | 1.00 | 22,387 | 817,437 |
| Q2 | simplicio 6-layer | 1-shot | 16/25 (64%) | 1.00 | 1,093 | 14,797 |
| Q3 | raw goal | loop w/ feedback | 11/25 (44%) | 4.00 | 7,154 | 106,382 |
| Q4 | simplicio 6-layer | loop w/ feedback | 19/25 (76%) | 2.44 | 1,914 | 24,170 |

Per-model breakdown:

| Model | Cases | Q1 | Q2 | Q3 | Q4 |
|---|---|---|---|---|---|
| google/gemma-3-4b-it | 10/10 | 0/10 (0%) | 7/10 (70%) | 4/10 (40%) | 8/10 (80%) |
| meta-llama/llama-3.2-3b-instruct | 10/10 | 0/10 (0%) | 5/10 (50%) | 4/10 (40%) | 6/10 (60%) |
| qwen/qwen-2.5-7b-instruct | 5/10 | 0/5 (0%) | 4/5 (80%) | 3/5 (60%) | 5/5 (100%) |

Decomposition (rejection threshold |Δ| ≥ 5 pts):

| Hypothesis | Δ | Verdict |
|---|---|---|
| Loop alone closes the gap (simplicio unnecessary once you loop) | Q4 − Q3 = +32 pts | rejected |
| Simplicio alone is enough (loop is overkill) | Q4 − Q2 = +12 pts | rejected |
| Gains stack linearly (no synergy) | Q4 − linear = −32 pts | rejected |

Same picture at every scale: Q4 (composition) wins on pass-rate, and Q4 stays close to Q2 on cost (1.9k tok / 24s per pass vs. Q2's 1.1k / 15s) while Q3 burns 7.2k tok / 106s per pass for fewer passes. Full table + per-case breakdown in bench/results_4quadrant_wide.md.


Plug points (stubs marked in code)

| File | Replace with |
|---|---|
| prompt.py::_mapper | your real llm-project-mapper |
| pipeline.py::_aplicar_e_testar | extract diff → git apply → parse test result |
| skill_router.py | point SIMPLICIO_SKILLS_DIR at your mapper's skills |

Layout

simplicio/
  cli.py          # index | task | bench | smoke
  cache.py        # content-hash embedding cache
  precedent.py    # grep + semantic rank (uses cache)
  skill_router.py # picks the ONE matching skill
  prompt.py       # stacks the 6 layers
  providers.py    # any OpenAI-compatible endpoint + Anthropic native
  pipeline.py     # generate → test → fix loop
  bench.py        # with-vs-without harness
  templates/simplicio_prompt.md
bench/
  run_offline.py  # stdlib-only multi-model benchmark
  cases.json      # your benchmark tasks
  cases_offline.json
  results.md      # filled by `simplicio bench` / `run_offline.py`
  charts/         # SVG: overall, delta, by_case, by_stack

License

MIT

