ROOST — deterministic trust layer for AI-written code: calibrated, read-only risk scores for every change + an honest outcome ledger

Maintainers

egorferber

Project description

ROOST

Know which commit is going to break production — before you merge it.

The deterministic trust layer for the age of AI-written code: a read-only, calibrated probability on every change — when ROOST says 73%, ~73% of changes like it really did get fixed later — plus an honest ledger that grades every prediction against what actually happened. LLM-free at the core. It never touches production.

LLM code reviewers read the diff and hunt bugs — useful, and complementary.
ROOST tells you which diff a human (or an expensive reviewer) must read at all.
ninoxai.com/roost

LLM-free core · Read-only · Python 3.11–3.14

Try it on your own repo — one command

No clone, no signup, no CI, no API key. With uv:

uvx --from roost-ai roost score-repo https://github.com/<you>/<your-repo>

uvx pulls the roost-ai package into a throwaway env, mines your repo's recent history, and scores its riskiest commits with the baked-in cold-start model — deterministic, LLM-free, and nothing leaves your machine. Prefer a permanent install? pipx install roost-ai (or pip install roost-ai), then roost score-repo <url> — the CLI is always just roost.

Want proof instead of a demo? Backtest it on your own history:

uvx --from roost-ai roost backtest https://github.com/<you>/<your-repo>

This scores your repo's past commits with the same frozen model — which has never seen your repo — and then checks them against the fixes and reverts that actually followed: "these N commits would have been flagged; X of them really were fix-inducing (vs the base rate)." Honest by construction: commits too recent to judge are excluded and counted, and the labels are a stated SZZ proxy, not dressed-up incidents.
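The backtest arithmetic described above can be sketched in a few lines. This is an illustration under stated assumptions, not ROOST's code: the `ScoredCommit` record, its `judged` flag, and the 0.75 flag threshold are hypothetical names and values chosen for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ScoredCommit:
    sha: str
    score: float        # frozen-model probability (hypothetical field)
    fix_inducing: bool  # SZZ-proxy label, only meaningful when judged
    judged: bool        # False when the commit is too recent to judge


def backtest_summary(commits, flag_at=0.75):
    """Compare flagged commits against the labels that followed them."""
    judged = [c for c in commits if c.judged]
    flagged = [c for c in judged if c.score >= flag_at]
    hits = sum(c.fix_inducing for c in flagged)
    return {
        "flagged": len(flagged),
        "hits": hits,
        "precision": hits / len(flagged) if flagged else 0.0,
        "base_rate": sum(c.fix_inducing for c in judged) / len(judged) if judged else 0.0,
        # Too-recent commits are excluded from the rates but still counted.
        "excluded_too_recent": len(commits) - len(judged),
    }


history = [
    ScoredCommit("a1", 0.91, True, True),
    ScoredCommit("b2", 0.80, False, True),
    ScoredCommit("c3", 0.12, False, True),
    ScoredCommit("d4", 0.05, False, True),
    ScoredCommit("e5", 0.88, False, False),  # too recent: excluded, but counted
]
summary = backtest_summary(history)
print(summary)
```

In this toy history two commits are flagged and one of them was fix-inducing, so the flagged set runs at 50% against a 25% base rate, with one commit reported as excluded.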

Try it on your own repo — one command
What it looks like
Why this exists
Why you can trust the number
Quick start
Reproduce the pipeline on your own corpus
How it works
Risk tiers
The Pellet ledger
Command reference
The bigger picture
Project layout
Design partners
Contributing
License

What it looks like

Open a pull request. Before a human even reads the diff, ROOST posts a verdict:

Augur risk: 73% — tier network (acme/api@9f31c2ab)

Top factors:

480 lines added (top 4% in this repo's history) → increases risk

5 subsystems touched → increases risk

12 prior changes to these files → increases risk

Review facts (from this repo's history):

⚠ 7 code file(s) changed, no test files touched

hottest touched file: core/auth/session.py — 6 prior fix commits

config/limits.yaml changed together with core/api/rates.py in 9/11 past changes — untouched here

Splitting into 2 focused changes would score ~31% per part (42pp lower).

Tier network: 89% of past changes at this tier were fix-inducing; the tier flags ~4% of all changes. Calibration: changes scored 70%–80% were fix-inducing ~73% of the time (n=55).

No LLM wrote that. Every line is a calibrated probability or a checkable fact, reproducible from a fixed seed, byte-for-byte:

the score comes from the trained model — when ROOST says 73%, ~73% of changes like it really did get fixed later;
the review facts answer what a senior reviewer asks before opening the diff: tests alongside the code? a fix-history hotspot touched? a usually-coupled file forgotten?
the split counterfactual is the same model re-scoring the change as if it were two focused parts — an actionable number, not an opinion;
each factor is anchored against this repo's own history, not a global average.
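The review facts in the list above are plain counts over a repo's history, which is why they are checkable. A minimal sketch of two of them follows; the test-file heuristic and the list-of-file-sets history format are assumptions made for the example, not ROOST's actual rules.

```python
from pathlib import PurePosixPath


def is_test_file(path):
    """Hypothetical heuristic: a tests/ directory or a test_ / _test name."""
    p = PurePosixPath(path)
    return "tests" in p.parts or p.name.startswith("test_") or p.stem.endswith("_test")


def touches_tests(changed_files):
    """True when at least one changed file looks like a test file."""
    return any(is_test_file(f) for f in changed_files)


def co_change(history, anchor, partner):
    """Return (together, total): past changes touching `anchor`, and how many also touched `partner`."""
    with_anchor = [files for files in history if anchor in files]
    together = sum(partner in files for files in with_anchor)
    return together, len(with_anchor)


# Toy history: each entry is the set of files one past change touched.
history = (
    [{"core/api/rates.py", "config/limits.yaml"}] * 9
    + [{"core/api/rates.py"}] * 2
    + [{"README.md"}]
)
together, total = co_change(history, "core/api/rates.py", "config/limits.yaml")
print(f"changed together in {together}/{total} past changes")
print("tests touched:", touches_tests(["core/auth/session.py", "core/api/rates.py"]))
```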

Why this exists

AI now writes more code than any human can carefully review. Agents open PRs in minutes; the diffs are bigger, more frequent, and land faster than a reviewer can keep up. So review quietly degrades into skimming — you approve, you merge, you hope. The tests are green, so it's probably fine. Right?

Your CI tells you the tests passed. It does not tell you which of today's twenty green PRs — half of them machine-written — is the one that quietly induces an incident three weeks from now. That call gets left to gut feeling, reviewer fatigue, and "looks fine to me."

ROOST gives that judgment a backbone: it puts a calibrated number on each change so you can spend your scarce review attention where the risk actually is. Scrutinize the 4% it flags network/destructive; let the 65% it scores low ride through with a lighter touch. Review smarter, not by reading every line a robot wrote.

Risk tools that do exist tend to fail in one of two ways:

Uncalibrated rankers — they sort changes "risky → safe" but a "0.8" doesn't mean 80% of anything. You can't set a threshold you trust.
LLM black boxes — non-deterministic, unauditable, and they'll happily hallucinate a rationale for a number they made up. (The LLM reviewers — CodeRabbit, Greptile & co. — solve a different problem: they hunt bugs inside a diff. Run them after ROOST tells you which diffs deserve the attention; their own published accuracy numbers disagree with each other by 3×, which is exactly why the layer that prioritizes must be deterministic and calibrated.)

ROOST is the opposite of both. The score is calibrated (probabilities you can act on), deterministic (same input → same output, forever), and the core is LLM-free (a test literally asserts it never imports an LLM SDK). Then it remembers every prediction and checks it against what really happened — so the score sharpens on your own history instead of staying a one-shot guess.

Read-only, always. ROOST reads diffs and posts advisory verdicts. It never writes code, never merges, never blocks a build — unless you explicitly opt in with --fail-at. Any LLM is an optional, swappable explainer that can only rephrase the verdict, never change the score. It's off by default.

Why you can trust the number

Most "AI for code" projects ask you to take their metrics on faith. We did the opposite — and this is the part we're proudest of.

We wrote down the pass/fail bar before we saw any results, committed it to git, and reported against it honestly. A clean FAIL would have been just as publishable as a PASS.

It passed — and these are the figures for the shipped cold-start model, trained and evaluated on ~200 public OSS repositories spanning 8 languages (Python, TypeScript, JavaScript, Java, Go, Ruby, C#, Rust; specific sources withheld — see the model card):

What we measured	Result	Bar we set in advance
Top-20% riskiest changes vs. base rate	2.9× more fix-inducing	≥ 2.0×
Beats a "just count lines changed" baseline (PR-AUC)	+0.103	≥ 0.05
Calibration error (Brier)	0.099	< 0.125 (base rate)
Ranking quality (ROC-AUC)	0.839	—
Generalizes to a repo it never trained on (leave-one-repo-out)	2.8× mean (across 204 held-out repos)	cold-start sanity
Holds up under noisy labels	2.5×	robustness
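For readers who want to check figures like these on their own data, here is a small sketch of two of the metrics in the table: lift in the top 20% and the Brier score with its base-rate baseline. The function names and toy data are ours for illustration, and the project's own evaluation code may differ in details such as tie handling.

```python
def lift_at_top(scores, labels, fraction=0.2):
    """Fix-inducing rate among the top `fraction` of scores, divided by the base rate."""
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    k = max(1, int(len(ranked) * fraction))
    top_rate = sum(label for _, label in ranked[:k]) / k
    base_rate = sum(labels) / len(labels)
    return top_rate / base_rate


def brier(scores, labels):
    """Mean squared error between predicted probability and the 0/1 outcome."""
    return sum((s - y) ** 2 for s, y in zip(scores, labels)) / len(labels)


def brier_base_rate(labels):
    """Brier score of always predicting the base rate p, which equals p * (1 - p)."""
    p = sum(labels) / len(labels)
    return p * (1 - p)


# Toy data: 10 changes, 2 of them fix-inducing.
scores = [0.9, 0.8, 0.3, 0.2, 0.1, 0.1, 0.05, 0.05, 0.05, 0.05]
labels = [1, 1, 0, 0, 0, 0, 0, 0, 0, 0]

print("lift@20%:", lift_at_top(scores, labels))
print("Brier:", brier(scores, labels), "vs base-rate baseline:", brier_base_rate(labels))
```

A model only earns its keep on the Brier row if its score is below the constant base-rate predictor's score, which is the comparison the second function makes explicit.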

The numbers are a touch lower than an earlier Python/JS-only build (lift 3.2×), which is exactly what you'd expect: a far more diverse, messier 8-language corpus is harder to predict, so this is the more honest, more general number, not a worse one. The point is that the signal transfers across languages, which is the situation a cold-start model faces in the wild.

We also audit ourselves: a bias & gaming audit (June 2026) found no newcomer bias (flagging tracks true risk share within ~1pp across author-experience buckets) and published a real gaming vector — mechanically splitting a flagged change can drop its parts below the flag tier — together with today's mitigations and the planned stacked-PR re-aggregation. An LLM-judge head-to-head benchmark is pre-registered and pending.

And the caveats we don't hide: labels are an SZZ public-OSS proxy, not real production incidents; OSS ≠ your private code; the bespoke "blast-radius" feature honestly didn't earn its place, so we dropped it. We hold the same bar for new ideas: an experimental path-signal set (slim_paths) improves discrimination (PR-AUC +0.017) and calibration over the noise band, but its effort-aware lift gain stays within noise — a qualified result we report rather than dress up. The full warts-and-all findings log is in docs/DECISIONS.md; intended use and limits in the model card.

Calibration is a first-class output, not a footnote — the score comes from isotonic-calibrated LightGBM on a strict time-ordered split (never shuffled — temporal leakage is a pre-registered failure mode, not a thing we discovered later).
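To make "isotonic-calibrated" concrete: isotonic calibration fits a non-decreasing step function from raw model scores to observed outcome rates. The sketch below is a hand-rolled pool-adjacent-violators pass in pure Python, written only to show the idea. It is not the scikit-learn implementation the pipeline uses, and the function names and toy data are assumptions.

```python
def isotonic_fit(raw_scores, labels):
    """Pool-adjacent-violators: return [(upper_raw_score, calibrated_value), ...]."""
    blocks = []  # each block: [sum_of_labels, count, highest_raw_score_in_block]
    for score, label in sorted(zip(raw_scores, labels)):
        blocks.append([float(label), 1, score])
        # Merge backwards while the previous block's mean exceeds the current one.
        while len(blocks) > 1 and blocks[-2][0] / blocks[-2][1] > blocks[-1][0] / blocks[-1][1]:
            total, count, upper = blocks.pop()
            blocks[-1][0] += total
            blocks[-1][1] += count
            blocks[-1][2] = upper
    return [(upper, total / count) for total, count, upper in blocks]


def isotonic_predict(steps, raw_score):
    """Look up the calibrated value of the first step whose upper bound covers raw_score."""
    for upper, value in steps:
        if raw_score <= upper:
            return value
    return steps[-1][1]  # above every fitted score: use the last step


# Toy calibration slice: raw scores with their 0/1 outcomes.
raw = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6]
outcomes = [0, 1, 0, 0, 1, 1]
steps = isotonic_fit(raw, outcomes)
print(steps)
print(isotonic_predict(steps, 0.35))
```

In a time-ordered setup the step function would be fitted on a slice that comes strictly after the training data, so the mapping never sees the future it is later graded on.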

Quick start

Requires Python 3.11–3.14. No API key, no cloud, no LLM. The shipped cold-start model is baked into the package, so scoring works out of the box.

Just use it — install once, score anything:

pipx install roost-ai               # or: pip install roost-ai  /  uvx --from roost-ai roost <cmd>

# Score an entire repo you've never seen (mines history, scores riskiest commits):
roost score-repo https://github.com/some/repo

# …or score one commit of a local checkout, e.g. in CI:
roost ci --commit HEAD --format md

That's it — nothing leaves your machine.
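If you script the CLI from Python, for example in a custom CI step, a thin wrapper that only assembles the documented flags keeps things explicit. The helper below is hypothetical. It assumes `--format` accepts just the two values shown in this README (`md`, `json`) and that `--fail-at` takes a numeric threshold. It builds the command without running it.

```python
import shlex


def roost_ci_command(commit="HEAD", fmt="md", fail_at=None):
    """Build the argv for `roost ci` from flags mentioned in this README."""
    if fmt not in {"md", "json"}:  # assumption: only the two formats shown here
        raise ValueError(f"unsupported format: {fmt!r}")
    argv = ["roost", "ci", "--commit", commit, "--format", fmt]
    if fail_at is not None:
        argv += ["--fail-at", str(fail_at)]  # opt-in blocking; advisory when omitted
    return argv


print(shlex.join(roost_ci_command()))
print(shlex.join(roost_ci_command(fmt="json", fail_at=0.9)))
```

Pass the returned list to `subprocess.run(argv, check=False)` from inside a checkout with full history. The output schema of `--format json` is not described here, so parse it defensively.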

Drop it into GitHub Actions (advisory, ~10 lines)

name: Augur risk
on: [pull_request]
jobs:
  risk:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
        with: { fetch-depth: 0 }      # full history is required
      - uses: ninoxAI/roost@v1
        with:
          fail-at: ""                 # advisory by default; set e.g. 0.9 to block

--format md drops straight into a PR comment or $GITHUB_STEP_SUMMARY; --format json pipes to jq. GitLab CI, a self-contained Docker image (model baked in, nothing leaves your infra), and a read-only GitHub App are in docs/ci.md, docs/github-app.md, and docs/deploy.md.

Give your coding agent a risk sense (MCP server)

AI agents write the code — let them check their own risk before they push. ROOST ships an MCP server exposing score_commit, score_repo, and backtest_repo as tools (read-only, advisory, deterministic — the agent is the only LLM involved):

pipx install "roost-ai[mcp]"        # or: pip install "roost-ai[mcp]"
claude mcp add roost -- roost mcp   # Claude Code; any MCP client works the same

Then an agent can ask for a verdict on the commit it just made (score_commit) and treat tier network/destructive as "a human should look first".

Prefer plain git hooks? The repo ships a pre-commit hook (pre-push stage, advisory):

repos:
  - repo: https://github.com/ninoxAI/roost
    rev: v1.1.0
    hooks:
      - id: roost-risk

Reproduce the pipeline on your own corpus

From a clone of this repo (the pipeline targets need the source tree, not just the package). Every step is deterministic from your repo list + seed; the LLM is off the whole way. Point configs/repos.yaml at the repos you want (see configs/repos.example.yaml for the format) — the shipped cold-start model is trained on a larger, withheld multi-language corpus, but the method below is exactly what produced it.

make setup       # uv: pinned py3.12 venv + locked deps
make init        # create the local Pellet ledger
make ingest      # mine the configured OSS repos → 'change'
make label       # SZZ fix-inducing labels → 'outcome'
make features    # 11 language-agnostic Kamei features → features.parquet
make train       # calibrated LightGBM, strict time-ordered split → 'prediction'
make eval        # full report + PASS/FAIL vs the pre-registered bar
make test        # the test suite, LLM disabled

Beyond the pipeline targets, the CLI exposes roost robustness (multi-seed / rolling-origin CV / ablations), roost thresholds (data-driven tiers), and roost package (shippable cold-start model + model card). make ablation-paths runs the research track: the experimental slim_paths set vs the shipping slim baseline. See Command reference for the full surface.

How it works

mine → label (SZZ) → features (Kamei) → calibrate → score + tier → verdict
                                                         │
        every score is kept and later checked against the real outcome
                                                         ▼
        Pellet ledger:  change → prediction → action → outcome → recurrence

Step	What happens
mine	Read-only PyDriller pass over a repo's git history → diff stats, sanitized messages, parents.
label	An SZZ-style blame trace marks each past change `clean` or `fix_inducing` — the training signal.
features	11 language-agnostic Kamei change metrics: diffusion, size, purpose, history. No import graph, no leakage.
calibrate	LightGBM + isotonic `CalibratedClassifierCV` on a strict time-ordered split.
score	A calibrated probability + a risk tier on the `read_only → write → execute → network → destructive` scale.
record	The scored change + its prediction land in the Pellet ledger, ready to be graded later.
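The "strict time-ordered split" in the calibrate step is worth spelling out, since shuffling is the usual source of temporal leakage. A minimal sketch, assuming each change is a dict with a `committed_at` timestamp (a hypothetical field name) and an 80/20 cut (an illustrative fraction, not the project's setting):

```python
from datetime import datetime


def time_ordered_split(changes, train_fraction=0.8):
    """Train on the oldest changes, evaluate on the newest. Never shuffle."""
    ordered = sorted(changes, key=lambda c: c["committed_at"])
    cut = int(len(ordered) * train_fraction)
    return ordered[:cut], ordered[cut:]


# Toy changes, deliberately out of order.
changes = [
    {"sha": "c3", "committed_at": datetime(2025, 3, 1)},
    {"sha": "a1", "committed_at": datetime(2025, 1, 1)},
    {"sha": "e5", "committed_at": datetime(2025, 5, 1)},
    {"sha": "b2", "committed_at": datetime(2025, 2, 1)},
    {"sha": "d4", "committed_at": datetime(2025, 4, 1)},
]
train, test = time_ordered_split(changes)
print([c["sha"] for c in train], [c["sha"] for c in test])
```

Every training change predates every evaluation change, so the model is always graded on a future it has not seen.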

Risk tiers

Each tier is a documented operating point you choose between — be conservative or aggressive on purpose, not by accident. The cut points below are the data-driven thresholds from the shipped model card; roost thresholds re-derives them for your own data, and tier_thresholds in configs/default.yaml sets the advisory defaults.

tier	score ≥	precision	recall	share of changes
`write`	0.086	0.31	0.99	65%
`execute`	0.200	0.51	0.77	31%
`network`	0.750	0.89	0.17	4%
`destructive`	1.000	1.00	0.09	2%
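Mapping a score to a tier with the cut points in the table is a sorted-threshold lookup. The sketch below hard-codes the table's values for illustration. It assumes scores below the first cut fall into `read_only`, and it is not the project's own lookup code.

```python
from bisect import bisect_right

# Tier scale from this README; cut points copied from the table above.
TIERS = ["read_only", "write", "execute", "network", "destructive"]
CUTS = [0.086, 0.200, 0.750, 1.000]  # minimum score for write..destructive


def tier_for(score):
    """Return the highest tier whose cut point is <= score."""
    if not 0.0 <= score <= 1.0:
        raise ValueError("score must be a probability in [0, 1]")
    return TIERS[bisect_right(CUTS, score)]


for s in (0.05, 0.086, 0.5, 0.8, 1.0):
    print(s, tier_for(s))
```

Because the cut points are data, re-derived thresholds for your own repo can replace `CUTS` without touching the lookup.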

The Pellet ledger

A prediction nobody checks is a horoscope. Pellet is the local system-of-record that closes the loop: every score is stored and later compared to what actually happened, so you build a verifiable track record instead of a stream of unaccountable guesses.

change (what landed) → prediction (Augur's call) → action (who acted) → outcome (what really happened) → recurrence (did it come back?)

You see the loop close. roost digest renders the week from the ledger: what was scored, which scored changes were since reverted, and whether ROOST had flagged them before the revert — hits and misses both, because a track record you can't lose is not a track record. The hosted App serves the same digest at /digest.
It knows who wrote the code. Every change is attributed — deterministically, no LLM — to a human or a named agent (Claude Code, Copilot, Cursor, Devin, …) via bot logins and commit trailers. The digest reports the track record per agent: "claude-code: 23 scored, 4.3% reverted · github-copilot: 8.9% · humans: 6.1%". Humans stay one aggregated bucket, never individuals.
Repeat offenders surface. Rollbacks record repo:file recurrence patterns; files that keep coming back show up in the digest with their count.
It tunes itself to your repo — when it's earned. roost tune re-derives the tier cut points from your repo's own graded outcomes, and refuses honestly below 50 graded changes. Verdicts carry a confidence field and say plainly when their grounding is thin.
Built to grow up. action and recurrence already exist in the schema (empty for now), so wiring in real incident/rollback signals or autonomous-agent actions later needs no migration — the outcome label just upgrades from an OSS proxy to production truth.
No secrets, no PII. author_id is a salted hash; raw names/emails never land in the ledger; commit messages are sanitized at ingest. Public data only.
Zero infra. It's a local DuckDB file (data/ledger.duckdb) — columnar, regenerable, with content-hash keys that make every re-run byte-identical.
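Two of the ledger properties above, content-hash keys and salted author hashes, can be illustrated with the standard library. This is a sketch under assumptions: SHA-256, canonical JSON, and the 16-character truncation are our choices for the example, and the ledger's real key scheme may differ.

```python
import hashlib
import json


def content_id(record):
    """Deterministic key: hash of the record's canonical JSON form."""
    canonical = json.dumps(record, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def author_id(email, salt):
    """Salted hash so the raw email never needs to be stored."""
    normalized = email.strip().lower().encode("utf-8")
    return hashlib.sha256(salt + normalized).hexdigest()[:16]


change = {"repo": "acme/api", "sha": "9f31c2ab", "lines_added": 480}
same_change = {"lines_added": 480, "sha": "9f31c2ab", "repo": "acme/api"}

print(content_id(change) == content_id(same_change))  # key order does not matter
print(author_id("Dev@Example.com", b"per-install-salt"))
```

Keys derived from content rather than insertion order or timestamps are what allow a re-run over the same inputs to regenerate identical rows.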

Command reference

The CLI is installed as roost (uv run roost <cmd>). Pipeline commands have matching make targets; the rest are run directly.

Command	Make target	What it does
`roost init`	`make init`	Create + migrate the Pellet ledger. `--reset` recreates it.
`roost info`	`make info`	Show ledger row counts, seed, and explainer status.
`roost ingest`	`make ingest`	Mine the configured OSS repos into the `change` table.
`roost label`	`make label`	Write SZZ fix-inducing labels into `outcome`.
`roost features`	`make features`	Build the Kamei feature matrix → `features.parquet`.
`roost train`	`make train`	Train the calibrated LightGBM model → `prediction`.
`roost eval`	`make eval`	Honest eval + PASS/FAIL vs the pre-registered bar.
`roost robustness`	—	Multi-seed bands, rolling-origin CV, ablations, importance.
`roost thresholds`	—	Derive score→tier cut points from calibration-slice targets.
`roost package`	—	Build a shippable cold-start model bundle + model card.
`roost ci`	—	Score one commit of a local checkout for a CI pipeline.
`roost score-repo <url>`	—	Score a repo Augur has never seen with the cold-start model.
`roost backtest <url>`	—	Retrospective proof: grade frozen-model scores against the repo's own later fixes/reverts.
`roost digest`	—	Loop-closure digest from the ledger: scored changes, reverts, hits and misses, per-agent track record, repeat offenders.
`roost tune`	—	Re-derive tier thresholds from a repo's own graded outcomes; refuses honestly on thin data.
`roost mcp`	—	MCP server (stdio) for coding agents: `score_commit` / `score_repo` / `backtest_repo`. Needs the `mcp` extra.
`roost comment`	`make comment`	Render the deterministic risk comment for a change.
`roost serve`	—	Local webhook simulator (needs the `serve` extra).
`roost version`	—	Print the version.

roost ci is advisory by default (--warn-at 0.6); pass --fail-at to set a non-zero exit code. The optional LLM explainer is off everywhere unless you pass --explainer (or set explainer.enabled in config) and install the llm extra.
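The advisory-versus-blocking behaviour described above amounts to a small decision rule. The sketch below states it explicitly. The `>=` comparison and the exit code of 1 are assumptions, since the text only says that `--fail-at` sets a non-zero exit code.

```python
def ci_status(score, warn_at=0.6, fail_at=None):
    """Classify a score as 'ok', 'warn', or 'fail'. 'fail' needs an explicit opt-in."""
    if fail_at is not None and score >= fail_at:
        return "fail"
    if score >= warn_at:
        return "warn"
    return "ok"


def ci_exit_code(score, warn_at=0.6, fail_at=None):
    """Only a 'fail' status produces a non-zero exit code; warnings stay advisory."""
    return 1 if ci_status(score, warn_at, fail_at) == "fail" else 0


print(ci_status(0.73), ci_exit_code(0.73))                            # advisory default
print(ci_status(0.95, fail_at=0.9), ci_exit_code(0.95, fail_at=0.9))  # opted in to blocking
```

Without `fail_at`, no score can fail the build, which is the read-only default.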

The bigger picture

ROOST is one corner of a deliberate design. Today's release builds and honestly evaluates the first two pieces; the rest are designed-for, not built.

Module	Role	Status
AUGUR	score — calibrated risk over change features, before a change lands	here today
PELLET	record — the outcome ledger / system-of-record	here today
PARLIAMENT	grade — cross-vendor evaluation of other AI-ops agents	seeded — per-agent track record ships today
TALON	gate — a permissioned write layer, earned only once Augur proves the bar on your own history	designed

The thesis: autonomous agents are unreliable, so the layer that measures and bounds them must itself be deterministic and auditable. LLMs only ever show up as bounded, optional, swappable parts — never load-bearing decision logic.

Project layout

src/roost/
  ledger/      Pellet schema, migrations, deterministic ids, DuckDB wrapper
  ingest/      repo mining (PyDriller)
  labeling/    SZZ fix-inducing labels
  features/    Kamei change features
  model/       calibrated LightGBM, feature sets, packaging, thresholds
  evaluation/  PR-AUC, calibration, effort-aware lift, leave-one-repo-out, robustness
  render/      deterministic risk comment
  explain/     optional LLM explainer (no-op default)
  serve/       cold-start scoring + local webhook simulator
  backtest.py  retrospective proof on a repo's own history
  digest.py    loop-closure digest over the Pellet ledger
  mcp_server.py  MCP tools for coding agents (optional `mcp` extra)
  models/      shipped cold-start model bundle
configs/       default.yaml, repos.yaml
docs/          spec, decisions, model card, CI / deploy / GitHub-App guides

Design partners — we're looking for 3–5 teams

ROOST's public numbers come from open-source history (and we say so plainly). The next bar we've set ourselves is proving the lift on private repos against real reverts and incidents — with a handful of teams whose AI tooling already outruns their review capacity.

You're a fit if your team merges 50+ PRs a month (a growing share AI-written), you run GitHub or any git host with CI, and you'd rather see honest hit/miss numbers than a demo. You get hands-on onboarding, direct influence on the roadmap, a monthly track-record report on your own repos (hits, misses, calibration — the same digest we ship, on your data), and locked-in early pricing when paid plans launch.

One 30-minute call; bring a repo and we'll backtest it live: 📧 ferbegor@gmail.com · ninoxai.com/roost

Contributing

ROOST is young and contributions move it forward fast. Whether you fix a typo or add a whole language to the feature extractor, you're welcome here — see CONTRIBUTING.md for the full guide.

Good places to start:

Score a new language. The feature extractor is intentionally language-agnostic — help us validate it on Go, Rust, TypeScript, Java.
Add a repo to the evaluation set. More repos = a more honest, more general model. Mixed sizes/domains/languages especially.
Try a calibration method. Beat isotonic on the reliability diagram without leaking time.
Wire up a real outcome source. A connector that upgrades Pellet's outcome from the SZZ proxy to genuine incident/rollback signals.
Reproduce a result and tell us if it doesn't hold. Honest negative findings are first-class here.

Every PR runs the test suite with the LLM off — the deterministic core must stay deterministic. The fastest way to get a change merged is a reproducible command and a test.

New contributors are welcome on Discord — say hi, ask anything, or bring a repo you want scored.

License

FSL-1.1-ALv2 — the Functional Source License: free for any use except building a competing product. Run it in your company, on your private repos, in your CI, commercially — no license needed, no strings. And every release automatically becomes Apache 2.0 two years after it ships, so nothing you adopt today can be locked away from you later.

The only thing that requires a commercial license from ninoxai is offering ROOST itself (or a substantially similar product or service) commercially. If that's you, get in touch and we'll sort it out quickly.

License history, plainly: versions up to and including v1.0.1 were Apache 2.0 and stay that way; v1.0.2–v1.0.6 shipped under PolyForm Noncommercial (that grant stands for those versions); everything from v1.1.0 on is FSL-1.1-ALv2 — strictly more permissive for you than PolyForm was.

Predict honestly. Record everything. Touch nothing.

Project details

Release history

1.2.0 (this version): Jun 10, 2026
1.1.1: Jun 10, 2026
1.1.0: Jun 10, 2026
1.0.6: Jun 10, 2026
1.0.5: Jun 9, 2026
1.0.4: Jun 9, 2026
1.0.3: Jun 9, 2026
1.0.2: Jun 9, 2026

Download files

Source Distribution

roost_ai-1.2.0.tar.gz (735.8 kB), uploaded Jun 10, 2026, Source

Built Distribution

roost_ai-1.2.0-py3-none-any.whl (552.7 kB), uploaded Jun 10, 2026, Python 3

File details

Details for the file roost_ai-1.2.0.tar.gz.

File metadata

Download URL: roost_ai-1.2.0.tar.gz
Upload date: Jun 10, 2026
Size: 735.8 kB
Tags: Source
Uploaded using Trusted Publishing? Yes
Uploaded via: twine/6.1.0 CPython/3.13.12

File hashes

Hashes for roost_ai-1.2.0.tar.gz
Algorithm	Hash digest
SHA256	`8947042818cd586ae277d54aa833d8fe01c5052383f3df3462904d6b8967523f`
MD5	`b87f6155621bf2fc2cd7de9cfc0e1ff2`
BLAKE2b-256	`49ab72ba31bddbfb04acc2a70af2661fcec41377a704171c779e9f86f61e29a9`

Provenance

The following attestation bundles were made for roost_ai-1.2.0.tar.gz:

Publisher: publish.yml on ninoxAI/roost

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

Statement:
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: roost_ai-1.2.0.tar.gz
- Subject digest: 8947042818cd586ae277d54aa833d8fe01c5052383f3df3462904d6b8967523f
- Sigstore transparency entry: 1778332687
- Sigstore integration time: Jun 10, 2026
Source repository:
- Permalink: ninoxAI/roost@7dfe6f52b01e2d28867b40d8576f22460eb05762
- Branch / Tag: refs/tags/v1.2.0
- Owner: https://github.com/ninoxAI
- Access: public
Publication detail:
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: publish.yml@7dfe6f52b01e2d28867b40d8576f22460eb05762
- Trigger Event: release

File details

Details for the file roost_ai-1.2.0-py3-none-any.whl.

File metadata

Download URL: roost_ai-1.2.0-py3-none-any.whl
Upload date: Jun 10, 2026
Size: 552.7 kB
Tags: Python 3
Uploaded using Trusted Publishing? Yes
Uploaded via: twine/6.1.0 CPython/3.13.12

File hashes

Hashes for roost_ai-1.2.0-py3-none-any.whl
Algorithm	Hash digest
SHA256	`1e937a03db5cbab5395eb43a0bce08d0c600ea9e467aefbe8c14497042cdeca3`
MD5	`b083fad74a21cc7b8d9a01d714348ed3`
BLAKE2b-256	`def356f0d1f78cf15551d6a42affc1576a926309008a5d3ab2be4c71bdd468e7`

Provenance

The following attestation bundles were made for roost_ai-1.2.0-py3-none-any.whl:

Publisher: publish.yml on ninoxAI/roost

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

Statement:
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: roost_ai-1.2.0-py3-none-any.whl
- Subject digest: 1e937a03db5cbab5395eb43a0bce08d0c600ea9e467aefbe8c14497042cdeca3
- Sigstore transparency entry: 1778332759
- Sigstore integration time: Jun 10, 2026
Source repository:
- Permalink: ninoxAI/roost@7dfe6f52b01e2d28867b40d8576f22460eb05762
- Branch / Tag: refs/tags/v1.2.0
- Owner: https://github.com/ninoxAI
- Access: public
Publication detail:
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: publish.yml@7dfe6f52b01e2d28867b40d8576f22460eb05762
- Trigger Event: release
