ROOST — deterministic trust layer for AI-written code: calibrated, read-only risk scores for every change + an honest outcome ledger
ROOST
Know which commit is going to break production — before you merge it.
The deterministic trust layer for the age of AI-written code: a read-only, calibrated
risk score for every change, plus an honest ledger of what actually happened.
LLM-free at the core. It never touches production.
Try it on your own repo — one command
No clone, no signup, no CI, no API key. With uv:
uvx --from roost-ai roost score-repo https://github.com/<you>/<your-repo>
uvx pulls the roost-ai package into a throwaway env, mines your repo's recent
history, and scores its riskiest commits with the baked-in cold-start model —
deterministic, LLM-free, and nothing leaves your machine. Prefer a permanent
install? pipx install roost-ai (or pip install roost-ai), then
roost score-repo <url> — the CLI is always just roost.
Contents
- Try it on your own repo — one command
- What it looks like
- Why this exists
- Why you can trust the number
- Quick start
- Reproduce the pipeline on your own corpus
- How it works
- Risk tiers
- The Pellet ledger
- Command reference
- The bigger picture
- Project layout
- Contributing
- License
What it looks like
Open a pull request. Before a human even reads the diff, ROOST posts a verdict:
Augur risk: 73% — tier network (acme/api@9f31c2ab)
Top factors:
- 480 lines added (top 4% in this repo's history) → increases risk
- 5 subsystems touched → increases risk
- 12 prior changes to these files → increases risk
Review facts (from this repo's history):
- ⚠ 7 code file(s) changed, no test files touched
- hottest touched file: core/auth/session.py — 6 prior fix commits
- config/limits.yaml changed together with core/api/rates.py in 9/11 past changes — untouched here
Splitting into 2 focused changes would score ~31% per part (42pp lower).
Tier network: 89% of past changes at this tier were fix-inducing; the tier flags ~4% of all changes. Calibration: changes scored 70%–80% were fix-inducing ~73% of the time (n=55).
No LLM wrote that. Every line is a calibrated probability or a checkable fact, reproducible from a fixed seed, byte-for-byte:
- the score comes from the trained model — when ROOST says 73%, ~73% of changes like it really did get fixed later;
- the review facts answer what a senior reviewer asks before opening the diff: tests alongside the code? a fix-history hotspot touched? a usually-coupled file forgotten?
- the split counterfactual is the same model re-scoring the change as if it were two focused parts — an actionable number, not an opinion;
- each factor is anchored against this repo's own history, not a global average.
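The review facts are plain counts over git history, so they can be recomputed by hand. The sketch below shows one way the "no test files touched" and "usually-coupled file forgotten" checks could be derived. The function names, the test-path heuristic, and the `min_rate` / `min_support` cut-offs are illustrative assumptions, not ROOST's actual implementation.

```python
from collections import Counter
from pathlib import PurePosixPath


def is_test_file(path: str) -> bool:
    """Heuristic (an assumption): test directories or test-like file names."""
    p = PurePosixPath(path)
    dirs = [part.lower() for part in p.parts[:-1]]
    stem = p.name.lower().rsplit(".", 1)[0]
    return "tests" in dirs or "test" in dirs or stem.startswith("test_") or stem.endswith("_test")


def missing_tests(changed: list[str]) -> bool:
    """True when files changed but none of them is a test file."""
    return bool(changed) and not any(is_test_file(f) for f in changed)


def cochange_gaps(history, changed, min_rate=0.8, min_support=5):
    """Files that usually change together with a touched file but are absent here.

    history: list of sets of paths, one set per past commit.
    Returns (touched_file, missing_partner, together, total) tuples.
    """
    touched = set(changed)
    gaps = []
    for f in sorted(touched):
        commits_with_f = [c for c in history if f in c]
        total = len(commits_with_f)
        if total < min_support:
            continue  # too little history to call it a pattern
        partners = Counter(p for c in commits_with_f for p in c if p != f)
        for partner, together in sorted(partners.items()):
            if partner not in touched and together / total >= min_rate:
                gaps.append((f, partner, together, total))
    return gaps


# Toy history mirroring the verdict above: 9 of 11 past changes touched both files.
history = [{"core/api/rates.py", "config/limits.yaml"}] * 9 + [{"core/api/rates.py"}] * 2
print(cochange_gaps(history, ["core/api/rates.py"]))
# [('core/api/rates.py', 'config/limits.yaml', 9, 11)]
```

Because both checks are counts over recorded history, a reviewer can confirm them with `git log` rather than taking them on trust.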
Why this exists
AI now writes more code than any human can carefully review. Agents open PRs in minutes; the diffs are bigger, more frequent, and land faster than a reviewer can keep up with. So review quietly degrades into skimming: you approve, you merge, you hope. The tests are green, so it's probably fine. Right?
Your CI tells you the tests passed. It does not tell you which of today's twenty green PRs — half of them machine-written — is the one that quietly induces an incident three weeks from now. That call gets left to gut feeling, reviewer fatigue, and "looks fine to me."
ROOST gives that judgment a backbone: it puts a calibrated number on each change so you can spend your scarce review attention where the risk actually is. Scrutinize the 4% it flags network/destructive; let the 65% it scores low ride through with a lighter touch. Review smarter, not by reading every line a robot wrote.
Risk tools that do exist tend to fail in one of two ways:
- Uncalibrated rankers — they sort changes "risky → safe" but a "0.8" doesn't mean 80% of anything. You can't set a threshold you trust.
- LLM black boxes — non-deterministic, unauditable, and they'll happily hallucinate a rationale for a number they made up.
ROOST is the opposite of both. The score is calibrated (probabilities you can act on), deterministic (same input → same output, forever), and the core is LLM-free (a test literally asserts it never imports an LLM SDK). Then it remembers every prediction and checks it against what really happened — so the score sharpens on your own history instead of staying a one-shot guess.
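An "LLM-free core" guard of the kind mentioned above can be written as an ordinary unit test that parses every source file and rejects forbidden imports. The following is a minimal sketch of that idea. The deny-list contents and the helper names are assumptions for illustration, since the real test is not reproduced in this README.

```python
import ast
from pathlib import Path

# Hypothetical deny-list: the packages the real test checks are not shown here.
FORBIDDEN = {"openai", "anthropic"}


def imported_roots(source: str) -> set[str]:
    """Top-level package names imported anywhere in a module's source."""
    roots = set()
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Import):
            roots.update(alias.name.split(".")[0] for alias in node.names)
        elif isinstance(node, ast.ImportFrom) and node.module and node.level == 0:
            roots.add(node.module.split(".")[0])  # skip relative imports
    return roots


def find_llm_imports(package_dir: str) -> dict[str, set[str]]:
    """Map each offending file to the forbidden packages it imports."""
    hits = {}
    for path in sorted(Path(package_dir).rglob("*.py")):
        bad = imported_roots(path.read_text(encoding="utf-8")) & FORBIDDEN
        if bad:
            hits[str(path)] = bad
    return hits


# In a test suite this might read (the directory to scan is an assumption):
#     assert find_llm_imports("src/roost/model") == {}
```

Parsing with `ast` instead of importing the modules keeps the check static: it catches imports inside functions too, and it never executes project code.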
Read-only, always. ROOST reads diffs and posts advisory verdicts. It never writes code, never merges, never blocks a build, unless you explicitly opt in with --fail-at. Any LLM is an optional, swappable explainer that can only rephrase the verdict, never change the score. It's off by default.
Why you can trust the number
Most "AI for code" projects ask you to take their metrics on faith. We did the opposite — and this is the part we're proudest of.
We wrote down the pass/fail bar before we saw any results, committed it to git, and reported against it honestly. A clean FAIL would have been just as publishable as a PASS.
It passed — and these are the figures for the shipped cold-start model, trained and evaluated on ~200 public OSS repositories spanning 8 languages (Python, TypeScript, JavaScript, Java, Go, Ruby, C#, Rust; specific sources withheld — see the model card):
| What we measured | Result | Bar we set in advance |
|---|---|---|
| Top-20% riskiest changes vs. base rate | 2.9× more fix-inducing | ≥ 2.0× |
| Beats a "just count lines changed" baseline (PR-AUC) | +0.103 | ≥ 0.05 |
| Calibration error (Brier) | 0.099 | < 0.125 (base rate) |
| Ranking quality (ROC-AUC) | 0.839 | — |
| Generalizes to a repo it never trained on (leave-one-repo-out) | 2.8× mean (across 204 held-out repos) | cold-start sanity |
| Holds up under noisy labels | 2.5× | robustness |
The numbers are a touch lower than an earlier Python/JS-only build (lift 3.2×), which is exactly what you'd expect: a far more diverse, messier 8-language corpus is harder to predict, so this is the more honest, more general number, not a worse one. The point is that the signal transfers across languages, which is the situation a cold-start model faces in the wild.
And the caveats we don't hide: labels are an SZZ public-OSS proxy, not real production incidents; OSS ≠ your private code; the bespoke "blast-radius" feature honestly didn't earn its place, so we dropped it. We hold the same bar for new ideas: an experimental path-signal set (slim_paths) improves discrimination (PR-AUC +0.017) and calibration over the noise band, but its effort-aware lift gain stays within noise — a qualified result we report rather than dress up. The full warts-and-all findings log is in docs/DECISIONS.md; intended use and limits in the model card.
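For readers who want to sanity-check the headline metrics in the table above on their own data, here is a small stdlib-only sketch of the three quantities. It uses plain counts; ROOST's own evaluation (for example its effort-aware lift) may weight changes differently, so treat the function names and exact definitions as assumptions.

```python
def lift_at_top(scores, labels, fraction=0.2):
    """Fix-inducing rate among the top `fraction` by score, divided by the base rate."""
    ranked = sorted(zip(scores, labels), key=lambda sl: sl[0], reverse=True)
    k = max(1, round(len(ranked) * fraction))
    top_rate = sum(label for _, label in ranked[:k]) / k
    base_rate = sum(labels) / len(labels)
    return top_rate / base_rate


def brier(scores, labels):
    """Mean squared error between predicted probability and 0/1 outcome."""
    return sum((s - y) ** 2 for s, y in zip(scores, labels)) / len(labels)


def brier_of_base_rate(labels):
    """Brier score of always predicting the base rate p, which equals p * (1 - p)."""
    p = sum(labels) / len(labels)
    return p * (1 - p)


def observed_rate_in_band(scores, labels, lo, hi):
    """(n, observed positive rate) for scores in [lo, hi); rate is None if n == 0."""
    hits = [y for s, y in zip(scores, labels) if lo <= s < hi]
    return len(hits), (sum(hits) / len(hits) if hits else None)


# Toy data: 10 changes, 2 of them fix-inducing.
scores = [0.9, 0.8, 0.3, 0.2, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1]
labels = [1, 0, 1, 0, 0, 0, 0, 0, 0, 0]
print(lift_at_top(scores, labels))                      # top-2 rate 0.5 vs base rate 0.2
print(brier(scores, labels) < brier_of_base_rate(labels))
print(observed_rate_in_band(scores, labels, 0.0, 0.5))  # (8, 0.125)
```

The Brier bar in the table has this shape: a model only earns its keep if its Brier score beats the constant base-rate predictor. The band check is the same computation behind a line such as "changes scored 70%–80% were fix-inducing ~73% of the time".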
Calibration is a first-class output, not a footnote — the score comes from isotonic-calibrated LightGBM on a strict time-ordered split (never shuffled — temporal leakage is a pre-registered failure mode, not a thing we discovered later).
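To make "isotonic calibration on a time-ordered split" concrete, the sketch below splits rows strictly by timestamp and fits a monotone step function with the pool-adjacent-violators algorithm. It is a stdlib stand-in written for illustration, not the scikit-learn/LightGBM code ROOST ships. The helper names are hypothetical, and tied raw scores are not pooled the way a production implementation would.

```python
def time_ordered_split(rows, train_frac=0.8):
    """rows: iterable of (timestamp, raw_score, label). Oldest rows go to train."""
    ordered = sorted(rows, key=lambda r: r[0])
    cut = int(len(ordered) * train_frac)
    return ordered[:cut], ordered[cut:]  # never shuffled


def fit_isotonic(raw_scores, labels):
    """Pool-adjacent-violators. Returns [(upper_bound, calibrated_value), ...]."""
    blocks = []  # each block: [sum_of_labels, count, max_raw_score]
    for x, y in sorted(zip(raw_scores, labels)):
        blocks.append([float(y), 1, x])
        # Merge backwards while the previous block's mean exceeds this one's.
        while len(blocks) > 1 and blocks[-2][0] / blocks[-2][1] > blocks[-1][0] / blocks[-1][1]:
            s, n, hi = blocks.pop()
            blocks[-1][0] += s
            blocks[-1][1] += n
            blocks[-1][2] = hi
    return [(hi, s / n) for s, n, hi in blocks]


def calibrate(steps, raw):
    """Value of the first block whose upper bound covers `raw` (last block otherwise)."""
    for hi, value in steps:
        if raw <= hi:
            return value
    return steps[-1][1]


steps = fit_isotonic([0.1, 0.2, 0.3, 0.4], [0, 1, 0, 1])
print(steps)                   # [(0.1, 0.0), (0.3, 0.5), (0.4, 1.0)]
print(calibrate(steps, 0.25))  # 0.5
```

The split matters as much as the fit: fitting the calibrator only on older rows and checking it on newer ones mirrors how the score is used in practice, where the future is never available at prediction time.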
Quick start
Requires Python 3.11–3.14. No API key, no cloud, no LLM. The shipped cold-start model is baked into the package, so scoring works out of the box.
Just use it — install once, score anything:
pipx install roost-ai # or: pip install roost-ai / uvx --from roost-ai roost <cmd>
# Score an entire repo you've never seen (mines history, scores riskiest commits):
roost score-repo https://github.com/some/repo
# …or score one commit of a local checkout, e.g. in CI:
roost ci --commit HEAD --format md
That's it — nothing leaves your machine.
Drop it into GitHub Actions (advisory, ~10 lines)
name: Augur risk
on: [pull_request]
jobs:
  risk:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
        with: { fetch-depth: 0 }   # full history is required
      - uses: ninoxAI/roost@v1
        with:
          fail-at: ""              # advisory by default; set e.g. 0.9 to block
--format md drops straight into a PR comment or $GITHUB_STEP_SUMMARY; --format json pipes to jq. GitLab CI, a self-contained Docker image (model baked in, nothing leaves your infra), and a read-only GitHub App are in docs/ci.md, docs/github-app.md, and docs/deploy.md.
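If you would rather consume the JSON from a script than from jq, a thin wrapper is enough. This sketch uses only the flags named in this README (`--commit`, `--format`, `--fail-at`). It assumes the JSON is written to stdout, that `--fail-at` takes its threshold as the next argument, and it makes no assumption about the keys in the result. The helper names are hypothetical.

```python
import json
import subprocess


def roost_ci_argv(commit="HEAD", fmt="json", fail_at=None):
    """Build the command line for `roost ci` from flags documented above."""
    argv = ["roost", "ci", "--commit", commit, "--format", fmt]
    if fail_at is not None:
        argv += ["--fail-at", str(fail_at)]  # opt-in blocking threshold
    return argv


def score_commit(commit="HEAD", cwd="."):
    """Run roost inside a local checkout and parse whatever JSON it prints.

    Assumes the JSON document is written to stdout; the schema is not assumed.
    """
    proc = subprocess.run(
        roost_ci_argv(commit),
        cwd=cwd,
        capture_output=True,
        text=True,
        check=True,  # advisory mode is expected to exit 0
    )
    return json.loads(proc.stdout)


print(roost_ci_argv())
# ['roost', 'ci', '--commit', 'HEAD', '--format', 'json']
```

Keeping the argv builder separate from the subprocess call makes the wrapper easy to unit-test without the CLI installed.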
Reproduce the pipeline on your own corpus
From a clone of this repo (the pipeline targets need the source tree, not just the package).
Every step is deterministic from your repo list + seed; the LLM is off the whole way. Point
configs/repos.yaml at the repos you want (see configs/repos.example.yaml for the format) —
the shipped cold-start model is trained on a larger, withheld multi-language corpus, but
the method below is exactly what produced it.
make setup # uv: pinned py3.12 venv + locked deps
make init # create the local Pellet ledger
make ingest # mine the configured OSS repos → 'change'
make label # SZZ fix-inducing labels → 'outcome'
make features # 11 language-agnostic Kamei features → features.parquet
make train # calibrated LightGBM, strict time-ordered split → 'prediction'
make eval # full report + PASS/FAIL vs the pre-registered bar
make test # the test suite, LLM disabled
Beyond the pipeline targets, the CLI exposes roost robustness (multi-seed / rolling-origin CV / ablations), roost thresholds (data-driven tiers), and roost package (shippable cold-start model + model card). make ablation-paths runs the research track: the experimental slim_paths set vs the shipping slim baseline. See Command reference for the full surface.
How it works
mine → label (SZZ) → features (Kamei) → calibrate → score + tier → verdict
│
every score is kept and later checked against the real outcome
▼
Pellet ledger: change → prediction → action → outcome → recurrence
| Step | What happens |
|---|---|
| mine | Read-only PyDriller pass over a repo's git history → diff stats, sanitized messages, parents. |
| label | An SZZ-style blame trace marks each past change clean or fix_inducing — the training signal. |
| features | 11 language-agnostic Kamei change metrics: diffusion, size, purpose, history. No import graph, no leakage. |
| calibrate | LightGBM + isotonic CalibratedClassifierCV on a strict time-ordered split. |
| score | A calibrated probability + a risk tier on the read_only → write → execute → network → destructive scale. |
| record | The scored change + its prediction land in the Pellet ledger, ready to be graded later. |
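As an illustration of the features step, the sketch below computes six of the classic Kamei change metrics from a per-file diff summary: subsystems (NS), directories (ND), files (NF), change entropy, lines added (LA), and lines deleted (LD). Which 11 metrics ROOST actually uses is not listed here, so the selection, the function name, and the "first path component is the subsystem" rule are assumptions.

```python
import math
from pathlib import PurePosixPath


def diffusion_and_size(files):
    """files: dict mapping path -> (lines_added, lines_deleted) for one change."""
    paths = [PurePosixPath(p) for p in files]
    churn = [added + deleted for added, deleted in files.values()]
    total = sum(churn)
    # Entropy of how the modified lines are spread across files (0 = one file).
    entropy = 0.0
    if total > 0:
        entropy = sum(-(c / total) * math.log2(c / total) for c in churn if c > 0)
    return {
        # Subsystem = first path component; root-level files share one bucket.
        "ns": len({p.parts[0] if len(p.parts) > 1 else "" for p in paths}),
        "nd": len({str(p.parent) for p in paths}),
        "nf": len(paths),
        "entropy": entropy,
        "la": sum(added for added, _ in files.values()),
        "ld": sum(deleted for _, deleted in files.values()),
    }


change = {"core/auth/session.py": (10, 0), "core/api/rates.py": (10, 0)}
print(diffusion_and_size(change))
# {'ns': 1, 'nd': 2, 'nf': 2, 'entropy': 1.0, 'la': 20, 'ld': 0}
```

None of these need a parser or an import graph, which is what makes the feature set language-agnostic: they depend only on paths and line counts.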
Risk tiers
Each tier is a documented operating point you choose between — be conservative or aggressive on purpose, not by accident. The cut points below are the data-driven thresholds from the shipped model card; roost thresholds re-derives them for your own data, and tier_thresholds in configs/default.yaml sets the advisory defaults.
| tier | score ≥ | precision | recall | share of changes |
|---|---|---|---|---|
| write | 0.086 | 0.31 | 0.99 | 65% |
| execute | 0.200 | 0.51 | 0.77 | 31% |
| network | 0.750 | 0.89 | 0.17 | 4% |
| destructive | 1.000 | 1.00 | 0.09 | 2% |
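Mapping a score to a tier is a lookup against the cut points in the table. A minimal sketch follows, assuming the comparison is inclusive ("score ≥") and that anything below the first cut falls into read_only, the lowest rung of the scale. `TIER_CUTS` and `tier_for` are illustrative names, not part of the ROOST API.

```python
from bisect import bisect_right

# Cut points copied from the table above (shipped model card values).
TIER_CUTS = [
    (0.086, "write"),
    (0.200, "execute"),
    (0.750, "network"),
    (1.000, "destructive"),
]


def tier_for(score, cuts=TIER_CUTS):
    """Highest tier whose threshold the score meets; read_only below the first cut."""
    thresholds = [threshold for threshold, _ in cuts]
    i = bisect_right(thresholds, score)  # number of thresholds <= score
    return "read_only" if i == 0 else cuts[i - 1][1]


for s in (0.05, 0.086, 0.25, 0.80, 1.0):
    print(s, tier_for(s))
# 0.05 read_only
# 0.086 write
# 0.25 execute
# 0.8 network
# 1.0 destructive
```

Passing your own `cuts` list is how a repo-specific set of thresholds would slot in, for example values re-derived with roost thresholds.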
The Pellet ledger
A prediction nobody checks is a horoscope. Pellet is the local system-of-record that closes the loop: every score is stored and later compared to what actually happened, so you build a verifiable track record instead of a stream of unaccountable guesses.
change → prediction → action → outcome → recurrence
(what landed) (Augur's call) (who acted) (what really happened) (did it come back?)
- Built to grow up. action and recurrence already exist in the schema (empty for now), so wiring in real incident/rollback signals or autonomous-agent actions later needs no migration; the outcome label just upgrades from an OSS proxy to production truth.
- No secrets, no PII. author_id is a salted hash; raw names/emails never land in the ledger; commit messages are sanitized at ingest. Public data only.
- Zero infra. It's a local DuckDB file (data/ledger.duckdb): columnar, regenerable, with content-hash keys that make every re-run byte-identical.
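Two of the properties above, content-hash keys and salted author ids, are easy to demonstrate in isolation. The sketch below uses the standard library's sqlite3 as a stand-in for DuckDB, and the table layout, column names, hash truncation, and helper functions are all assumptions made for illustration. Only the table names change and prediction come from the ledger description.

```python
import hashlib
import sqlite3


def content_id(*fields: str) -> str:
    """Deterministic key: the same fields always hash to the same id."""
    joined = "\x1f".join(fields)  # separator avoids ambiguous concatenation
    return hashlib.sha256(joined.encode("utf-8")).hexdigest()[:16]


def author_id(email: str, salt: str) -> str:
    """Salted hash so the raw email never needs to be stored."""
    return hashlib.sha256(f"{salt}:{email.strip().lower()}".encode("utf-8")).hexdigest()[:16]


def record_prediction(conn, repo, sha, score):
    """Idempotent write: re-running with the same inputs adds no rows."""
    change = content_id(repo, sha)
    conn.execute(
        "INSERT OR IGNORE INTO change (change_id, repo, sha) VALUES (?, ?, ?)",
        (change, repo, sha),
    )
    conn.execute(
        "INSERT OR IGNORE INTO prediction (prediction_id, change_id, score) VALUES (?, ?, ?)",
        (content_id(change, f"{score:.6f}"), change, score),
    )
    return change


conn = sqlite3.connect(":memory:")  # stand-in for data/ledger.duckdb
conn.execute("CREATE TABLE change (change_id TEXT PRIMARY KEY, repo TEXT, sha TEXT)")
conn.execute("CREATE TABLE prediction (prediction_id TEXT PRIMARY KEY, change_id TEXT, score REAL)")
record_prediction(conn, "acme/api", "9f31c2ab", 0.73)
record_prediction(conn, "acme/api", "9f31c2ab", 0.73)  # re-run
print(conn.execute("SELECT COUNT(*) FROM prediction").fetchone()[0])  # 1
```

Deriving primary keys from content rather than from an auto-increment counter is what lets a full re-run reproduce the same ledger instead of appending duplicates.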
Command reference
The CLI is installed as roost (uv run roost <cmd>). Pipeline commands have matching make targets; the rest are run directly.
| Command | Make target | What it does |
|---|---|---|
| roost init | make init | Create + migrate the Pellet ledger. --reset recreates it. |
| roost info | make info | Show ledger row counts, seed, and explainer status. |
| roost ingest | make ingest | Mine the configured OSS repos into the change table. |
| roost label | make label | Write SZZ fix-inducing labels into outcome. |
| roost features | make features | Build the Kamei feature matrix → features.parquet. |
| roost train | make train | Train the calibrated LightGBM model → prediction. |
| roost eval | make eval | Honest eval + PASS/FAIL vs the pre-registered bar. |
| roost robustness | — | Multi-seed bands, rolling-origin CV, ablations, importance. |
| roost thresholds | — | Derive score→tier cut points from calibration-slice targets. |
| roost package | — | Build a shippable cold-start model bundle + model card. |
| roost ci | — | Score one commit of a local checkout for a CI pipeline. |
| roost score-repo <url> | — | Score a repo Augur has never seen with the cold-start model. |
| roost comment | make comment | Render the deterministic risk comment for a change. |
| roost serve | — | Local webhook simulator (needs the serve extra). |
| roost version | — | Print the version. |
roost ci is advisory by default (--warn-at 0.6); pass --fail-at with a threshold (e.g. 0.9) to make it return a non-zero exit code for changes that cross it. The optional LLM explainer is off everywhere unless you pass --explainer (or set explainer.enabled in config) and install the llm extra.
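The advisory-versus-blocking behaviour can be summarised as a small decision function. This is a sketch of the semantics as described above, not ROOST's code: the inclusive comparison, the status strings, and the use of exit code 1 are assumptions.

```python
def ci_outcome(score, warn_at=0.6, fail_at=None):
    """Return (status, exit_code) for one scored commit.

    With fail_at unset the result is purely advisory and always exits 0.
    """
    if fail_at is not None and score >= fail_at:
        return "fail", 1  # blocking only when the caller opted in
    if score >= warn_at:
        return "warn", 0
    return "ok", 0


print(ci_outcome(0.73))               # ('warn', 0)
print(ci_outcome(0.95, fail_at=0.9))  # ('fail', 1)
print(ci_outcome(0.30, fail_at=0.9))  # ('ok', 0)
```

The design choice worth noting is the default: without an explicit threshold, no score can fail a build.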
The bigger picture
ROOST is one corner of a deliberate design. Today's release builds and honestly evaluates the first two pieces; the rest are designed-for, not built.
| Module | Role | Status |
|---|---|---|
| AUGUR | score — calibrated risk over change features, before a change lands | here today |
| PELLET | record — the outcome ledger / system-of-record | here today |
| PARLIAMENT | grade — cross-vendor evaluation of other AI-ops agents | designed |
| TALON | gate — a permissioned write layer, earned only once Augur proves the bar on your own history | designed |
The thesis: autonomous agents are unreliable, so the layer that measures and bounds them must itself be deterministic and auditable. LLMs only ever show up as bounded, optional, swappable parts — never load-bearing decision logic.
Project layout
src/roost/
ledger/ Pellet schema, migrations, deterministic ids, DuckDB wrapper
ingest/ repo mining (PyDriller)
labeling/ SZZ fix-inducing labels
features/ Kamei change features
model/ calibrated LightGBM, feature sets, packaging, thresholds
evaluation/ PR-AUC, calibration, effort-aware lift, leave-one-repo-out, robustness
render/ deterministic risk comment
explain/ optional LLM explainer (no-op default)
serve/ cold-start scoring + local webhook simulator
models/ shipped cold-start model bundle
configs/ default.yaml, repos.yaml
docs/ spec, decisions, model card, CI / deploy / GitHub-App guides
Contributing
ROOST is young and contributions move it forward fast. Whether you fix a typo or add a whole language to the feature extractor, you're welcome here — see CONTRIBUTING.md for the full guide.
Good places to start:
- Score a new language. The feature extractor is intentionally language-agnostic — help us validate it on Go, Rust, TypeScript, Java.
- Add a repo to the evaluation set. More repos = a more honest, more general model. Mixed sizes/domains/languages especially.
- Try a calibration method. Beat isotonic on the reliability diagram without leaking time.
- Wire up a real outcome source. A connector that upgrades Pellet's outcome from the SZZ proxy to genuine incident/rollback signals.
- Reproduce a result and tell us if it doesn't hold. Honest negative findings are first-class here.
Every PR runs the test suite with the LLM off — the deterministic core must stay deterministic. The fastest way to get a change merged is a reproducible command and a test.
New contributors are welcome on Discord — say hi, ask anything, or bring a repo you want scored.
License
PolyForm Noncommercial 1.0.0: free for any noncommercial use, including personal projects, research, education, hobby OSS, and evaluation. Commercial use (your company's repos, your CI, or your product) requires a commercial license from ninoxai. Get in touch and we'll sort it out quickly.
Versions up to and including v1.0.1 were released under Apache 2.0 and stay that way;
everything from this point forward ships under PolyForm Noncommercial.
Predict honestly. Record everything. Touch nothing.