Multi-axis bisect engine for finding LLM-agent regressions across prompt, model, tool-schema, and RAG-corpus changes
Sornaris
git bisect for LLM-agent regressions. When your agent's success rate
drops, sornaris binary-searches which change broke it — a prompt edit, a
silent model upgrade, a tool-schema diff, or a RAG-corpus refresh — in
log₂(N) eval runs instead of N.
Why this matters
Every eval framework can tell you that your agent regressed. None tell you
which of the many things you changed last week caused it. Real agents move on
four axes at once — the prompt, the model, the tool schema, and the retrieval
corpus — and bisecting them by hand means re-running your eval set over and
over. sornaris does the binary search for you and names the culprit version.
Zero runtime dependencies — pure standard library (with an optional sqlite
response cache). Bring your own provider, or use the built-in OpenAI / Anthropic
adapters (stdlib urllib, API key from the environment).
Install
pip install sornaris
Quickstart (offline, no API key)
import re

from sornaris import (
    BaseProvider,
    EvalExample,
    ExactMatchScorer,
    ModelVersion,
    PromptVersion,
    bisect_single_axis,
    run_eval,
)

# Eight prompt versions in time order; a regression was introduced at v5.
prompts = [
    PromptVersion(version_id=f"v{i}", content=f"build {i}: answer the user")
    for i in range(8)
]
examples = [EvalExample(example_id=f"e{i}", input=f"q{i}", expected="ok") for i in range(5)]


class DemoProvider(BaseProvider):  # stand-in for a real LLM call
    def generate(self, prompt: str, model_id: str) -> str:
        build = int(re.search(r"build (\d+)", prompt).group(1))
        return "ok" if build < 5 else "BROKEN"  # broke at build 5


model, provider, scorer = ModelVersion(model_id="demo"), DemoProvider(), ExactMatchScorer()


def evaluate(pv):
    _, mean = run_eval(pv, model, examples, provider, scorer)
    return mean


report = bisect_single_axis(prompts, evaluate, baseline_idx=0, current_idx=7, threshold=0.5)
print(report.version_id)  # -> "v5"
print(len(report.steps))  # -> ~3 probe rounds, not 8
Runnable versions (single-axis and multi-axis) live in examples/.
CLI
Bisect prompt versions against a real model (the provider reads its key from the environment):
export OPENAI_API_KEY=sk-...
sornaris run \
  --prompts examples/prompt_versions.jsonl \
  --evals examples/eval.jsonl \
  --provider openai \
  --model-id gpt-4o-mini \
  --scorer contains \
  --threshold 0.75 \
  --report bisect_report.json
- --provider: fake (offline, for wiring checks), openai, or anthropic.
- --scorer: exact or contains.
- --cache PATH: sqlite response cache; repeated runs over the same eval set get cheaper.
- --models models.jsonl: also bisect the model axis (prompt + model).
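To illustrate why a response cache makes repeated runs cheaper, here is a minimal sketch of a sqlite-backed cache keyed on the prompt and model id. It is not the library's BisectCache: the class name ResponseCache, the table layout, and the get_or_call method are all hypothetical, chosen only to show the idea with the standard library.

```python
import hashlib
import sqlite3
from typing import Callable


class ResponseCache:
    """Hypothetical cache: stores one response per (model_id, prompt) pair."""

    def __init__(self, path: str = ":memory:") -> None:
        # ":memory:" keeps the demo self-contained; pass a file path to persist.
        self.conn = sqlite3.connect(path)
        self.conn.execute(
            "CREATE TABLE IF NOT EXISTS responses "
            "(key TEXT PRIMARY KEY, response TEXT NOT NULL)"
        )

    @staticmethod
    def _key(prompt: str, model_id: str) -> str:
        # Hash the pair so arbitrarily long prompts fit in a fixed-size key.
        return hashlib.sha256(f"{model_id}\x00{prompt}".encode("utf-8")).hexdigest()

    def get_or_call(self, prompt: str, model_id: str,
                    call: Callable[[str, str], str]) -> str:
        key = self._key(prompt, model_id)
        row = self.conn.execute(
            "SELECT response FROM responses WHERE key = ?", (key,)
        ).fetchone()
        if row is not None:
            return row[0]  # cache hit: no provider call
        response = call(prompt, model_id)
        self.conn.execute(
            "INSERT INTO responses (key, response) VALUES (?, ?)", (key, response)
        )
        self.conn.commit()
        return response


calls = []


def fake_model(prompt: str, model_id: str) -> str:
    calls.append(prompt)  # count how often the "provider" is really invoked
    return prompt.upper()


cache = ResponseCache()
first = cache.get_or_call("hello", "demo", fake_model)
second = cache.get_or_call("hello", "demo", fake_model)
print(first, second, len(calls))  # -> HELLO HELLO 1
```

The second lookup is served from sqlite, so the stand-in provider is called only once. A bisect revisits the same (prompt, model) pairs across runs, which is where the savings would come from.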
prompts / models / evals are JSONL, one object per line:
// versions.jsonl
{"version_id": "v1", "content": "Be concise.", "parent_id": null, "timestamp": 1.0}
// models.jsonl
{"model_id": "gpt-4o-mini", "provider": "openai"}
// eval.jsonl
{"example_id": "e1", "input": "what is 2+2?", "expected": "4"}
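If you generate these files from your own tooling, a quick sanity check can catch malformed lines before a run. The sketch below is an illustration, not part of sornaris: iter_jsonl is a hypothetical helper, and the second sample record is made up for the demo. Note that the `//` labels above only name the files; JSON has no comment syntax, so real files should contain just the objects.

```python
import json
from typing import Iterable, Iterator


def iter_jsonl(lines: Iterable[str]) -> Iterator[dict]:
    """Yield one dict per non-blank line, reporting the line number on bad input."""
    for lineno, raw in enumerate(lines, start=1):
        line = raw.strip()
        if not line:
            continue  # tolerate blank lines between records
        try:
            record = json.loads(line)
        except json.JSONDecodeError as exc:
            raise ValueError(f"line {lineno}: invalid JSON ({exc.msg})") from exc
        if not isinstance(record, dict):
            raise ValueError(f"line {lineno}: expected a JSON object")
        yield record


# In practice you would pass an open file object; a list of strings works the same way.
sample = [
    '{"version_id": "v1", "content": "Be concise.", "parent_id": null, "timestamp": 1.0}',
    "",
    '{"version_id": "v2", "content": "Be concise and cite sources.", "parent_id": "v1", "timestamp": 2.0}',
]
records = list(iter_jsonl(sample))
print([r["version_id"] for r in records])  # -> ['v1', 'v2']
```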
How it works
A regression introduced somewhere in an ordered list of N versions is treated
as monotonic: the score is good before the culprit and bad from the culprit on.
That is exactly the precondition for binary search, so sornaris localizes the
culprit in about log₂(N) evaluations. The multi-axis orchestrator pins every
other axis at its latest value and walks one axis at a time, so it can say "the
model axis is the cause, the prompt axis is innocent." With the sqlite cache,
repeated bisects on the same eval set reuse prior scores.
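The binary search described above can be sketched in a few lines. This is a generic illustration under the stated monotonicity assumption, not the library's bisect_single_axis: the function name find_first_bad, its return shape, and the toy score function are hypothetical.

```python
from typing import Callable, List, Sequence, Tuple


def find_first_bad(
    versions: Sequence[str],
    evaluate: Callable[[str], float],
    threshold: float,
) -> Tuple[str, List[str]]:
    """Return (first version scoring below threshold, versions probed).

    Assumes versions[0] is known good, versions[-1] is known bad, and the
    score never recovers once it has dropped.
    """
    lo, hi = 0, len(versions) - 1  # invariant: versions[lo] good, versions[hi] bad
    probed: List[str] = []
    while hi - lo > 1:
        mid = (lo + hi) // 2
        probed.append(versions[mid])
        if evaluate(versions[mid]) >= threshold:
            lo = mid  # still good: the culprit is later
        else:
            hi = mid  # already bad: the culprit is here or earlier
    return versions[hi], probed


versions = [f"v{i}" for i in range(8)]


def score(version: str) -> float:
    # Toy eval: everything from v5 onward is broken.
    return 1.0 if int(version[1:]) < 5 else 0.0


culprit, probed = find_first_bad(versions, score, threshold=0.5)
print(culprit, probed)  # -> v5 ['v3', 'v5', 'v4']
```

With eight versions and known-good and known-bad endpoints, three probes are enough, which matches the log₂(8) = 3 figure.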
Multi-axis is deliberately a one-axis-at-a-time search (other axes pinned at current), not a full grid — it finds the single axis that, rolled back, recovers the score. That covers the common "what did I change?" case cheaply.
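As a rough sketch of that one-axis-at-a-time idea: roll back a single axis to its baseline while every other axis stays at its current value, and check whether the score recovers. The names below (blame_axes, the axes mapping, the toy score function) are hypothetical and do not mirror the signature of bisect_multi_axis.

```python
from typing import Callable, Dict, List, Tuple

Config = Dict[str, str]


def blame_axes(
    axes: Dict[str, Tuple[str, str]],  # axis name -> (baseline value, current value)
    evaluate: Callable[[Config], float],
    threshold: float,
) -> List[str]:
    """Return the axes whose rollback alone lifts the score back to the threshold."""
    current = {name: cur for name, (_, cur) in axes.items()}
    guilty: List[str] = []
    for name, (baseline, _) in axes.items():
        trial = {**current, name: baseline}  # roll back exactly one axis
        if evaluate(trial) >= threshold:
            guilty.append(name)
    return guilty


axes = {"prompt": ("p1", "p4"), "model": ("m1", "m2")}


def score(cfg: Config) -> float:
    # Toy eval: the new model is the only thing that hurts.
    return 0.4 if cfg["model"] == "m2" else 0.9


print(blame_axes(axes, score, threshold=0.75))  # -> ['model']
```

This costs one eval pass per axis rather than one per combination. The trade-off is that a regression which only disappears when two axes are rolled back together would go unnoticed by a check like this.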
Modules
- models: value objects PromptVersion, ModelVersion, EvalExample, EvalResult, BisectStep, BisectReport, AxisType.
- scoring: ExactMatchScorer, ContainsScorer, RegexScorer, CallableScorer.
- cache: BisectCache (sqlite-backed, on-disk or in-memory).
- providers: BaseProvider, offline FakeProvider / ScriptedProvider, real OpenAIProvider / AnthropicProvider, and build_provider(name, ...).
- runner: run_eval(prompt, model, examples, provider, scorer, cache=None).
- search: bisect_single_axis(versions, evaluate_fn, baseline_idx, current_idx, threshold).
- multi: bisect_multi_axis(axes, evaluate_fn, threshold, priority=None).
- cli: the sornaris command-line entry point.
Roadmap
- v0.1 — single- and multi-axis bisect, OpenAI/Anthropic adapters, sqlite cache, CLI.
- v0.2 — async providers, tool-schema and RAG-corpus axes wired into the CLI, richer scorers (LLM-judge), JSON-schema for reports.
- v1.0 — hosted dashboard (track regressions over time), CI action, and optional signed bisect reports.
Verifying a release
Every release is built and signed in CI via PyPI Trusted Publishing — no long-lived tokens, no hand-uploaded files. You can confirm an artifact is exactly what the workflow produced:
# 1. PyPI provenance (PEP 740 attestations) is shown on the project's PyPI page.
# Note: pip does not check attestations at install time, so inspect them on
# PyPI or verify the Sigstore bundle as in step 2.
pip install sornaris
# 2. Sigstore signatures — each wheel/sdist is signed (keyless, OIDC) and the
# .sigstore.json bundles are attached to the GitHub Release. Verify with:
python -m pip install sigstore
python -m sigstore verify identity \
  --cert-identity "https://github.com/Sergiipis/sornaris/.github/workflows/publish.yml@refs/tags/v0.1.0" \
  --cert-oidc-issuer "https://token.actions.githubusercontent.com" \
  sornaris-0.1.0-py3-none-any.whl
# 3. Checksums — SHA256SUMS is attached to each GitHub Release.
sha256sum -c SHA256SUMS
A CycloneDX SBOM (sbom.cdx.json / .xml) is attached to every release.
Builds set SOURCE_DATE_EPOCH from the tag commit, so the wheel is reproducible.
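On platforms without sha256sum, the checksum step can be reproduced with the standard library. The helpers below (sha256_of, matches) are hypothetical names used for illustration, and the demo hashes a throwaway temp file rather than a real release artifact.

```python
import hashlib
import tempfile
from pathlib import Path


def sha256_of(path, chunk_size: int = 1 << 20) -> str:
    """Hash a file in chunks so large artifacts need not fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches(path, expected_hex: str) -> bool:
    """Compare a file's SHA256 with a published hex digest (case-insensitive)."""
    return sha256_of(path) == expected_hex.strip().lower()


# Self-contained demo on a temporary file.
with tempfile.TemporaryDirectory() as tmp:
    demo = Path(tmp) / "demo.bin"
    demo.write_bytes(b"sornaris")
    expected = hashlib.sha256(b"sornaris").hexdigest()
    ok = matches(demo, expected)
    bad = matches(demo, "0" * 64)
print(ok, bad)  # -> True False

# For a downloaded wheel, pass the digest published for that file, e.g.:
# matches("sornaris-0.1.0-py3-none-any.whl",
#         "3c88ab926ad25a8653e99ebac6b2acf949818ea32b7ec2ae1429cb0c8c5d5348")
```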
License
MIT — see LICENSE. Free for any use, including commercial.
For paid consulting, custom features or integrations, contact @Sergiipis on GitHub.
File details
Details for the file sornaris-0.1.0.tar.gz.
File metadata
- Download URL: sornaris-0.1.0.tar.gz
- Upload date:
- Size: 23.3 kB
- Tags: Source
- Uploaded using Trusted Publishing? Yes
- Uploaded via: twine/6.1.0 CPython/3.13.12
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | b62d5772e46e20d37d6997bb5278129076fcef2b695f614ddec58cceb9a27594 |
| MD5 | d962fc4cbeff80a974f3b183a020633d |
| BLAKE2b-256 | 8e38dea06eacfab4a1f70245a6858c17f97b5091e84357f7f0435f177ed3ecc4 |
Provenance
The following attestation bundles were made for sornaris-0.1.0.tar.gz:
Publisher: publish.yml on Sergiipis/sornaris
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: sornaris-0.1.0.tar.gz
- Subject digest: b62d5772e46e20d37d6997bb5278129076fcef2b695f614ddec58cceb9a27594
- Sigstore transparency entry: 1676740820
- Sigstore integration time:
- Permalink: Sergiipis/sornaris@af8918609e1e02c1081651bcc506644cc22f3a55
- Branch / Tag: refs/tags/v0.1.0
- Owner: https://github.com/Sergiipis
- Access: public
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: publish.yml@af8918609e1e02c1081651bcc506644cc22f3a55
- Trigger Event: push
File details
Details for the file sornaris-0.1.0-py3-none-any.whl.
File metadata
- Download URL: sornaris-0.1.0-py3-none-any.whl
- Upload date:
- Size: 16.0 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? Yes
- Uploaded via: twine/6.1.0 CPython/3.13.12
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 3c88ab926ad25a8653e99ebac6b2acf949818ea32b7ec2ae1429cb0c8c5d5348 |
| MD5 | bf97d0336c2ccea6d917eb5b48053ad0 |
| BLAKE2b-256 | 14720c1fbe062d000a8d0bc34b6e01ea736e45c4cb47c06335756385a8efc0e0 |
Provenance
The following attestation bundles were made for sornaris-0.1.0-py3-none-any.whl:
Publisher: publish.yml on Sergiipis/sornaris
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: sornaris-0.1.0-py3-none-any.whl
- Subject digest: 3c88ab926ad25a8653e99ebac6b2acf949818ea32b7ec2ae1429cb0c8c5d5348
- Sigstore transparency entry: 1676740825
- Sigstore integration time:
- Permalink: Sergiipis/sornaris@af8918609e1e02c1081651bcc506644cc22f3a55
- Branch / Tag: refs/tags/v0.1.0
- Owner: https://github.com/Sergiipis
- Access: public
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: publish.yml@af8918609e1e02c1081651bcc506644cc22f3a55
- Trigger Event: push