A lie detector for AI coding agents: audits diffs and traces what actually runs.
Verdict
A lie detector for AI coding agents.
Verdict audits an AI-generated diff — statically and by tracing what actually runs when the tests execute — and returns a single scorecard:
PASS · SUSPICIOUS · LIED
Built for the IBM Bob hackathon (May 2026).
Why Verdict exists
AI coding agents are getting good at looking right. They produce diffs that compile, pass type-checks, even pass the tests they wrote themselves. But when you actually dig in, you find:
- A "fix" that adds a function nothing ever calls.
- A new test that mocks out everything it was supposed to verify, and asserts on the mocks.
- A call to requests.get_json(), a method that does not exist.
- A claim that a file was created, but the file is not on disk.
- A try/except: pass that buries the very bug the agent was asked to fix.
- A new code path that pytest covers 0 % of, even though the test suite is "green".
These are not syntax errors. Linters miss them. Type checkers miss them. CI passes. The agent says "done." A human reviewer skims the diff and assumes the green check mark means something.
Verdict catches these by not trusting the agent's word. It re-reads the diff, walks the AST, and runs the tests under a tracer to see what code actually executed. If the story the agent told doesn't match what the code does, Verdict says so.
The seven things Verdict looks for
Verdict runs seven independent checks on every diff. Each one targets a distinct class of agent failure. Five are static (AST/grep-based, no execution needed) and two are dynamic (run the test suite under instrumentation).
| # | Check | Kind | What it catches |
|---|---|---|---|
| 1 | dead_function | static | A function was added in the diff but is never referenced anywhere in the repo (not called, not imported, not exported via __all__ or __init__.py). Classic agent symptom: writes helper functions to "look productive," never wires them up. |
| 2 | vacuous_tests | static | A test function that doesn't actually test anything. Four heuristics fire: empty body (pass/.../docstring only), no assert statements, mock-only assertions (only checks calls on Mock() objects the test itself created), or the test never reaches any newly-added code. Catches "I wrote a test for it" lies. |
| 3 | hallucinated_api | static | A call to a method or attribute that does not exist on the inferred type. requests.get_json(), my_list.add(x), dict.get_or_default(). Powered by Jedi static type inference, so it doesn't need to execute the code to know the call is bogus. |
| 4 | phantom_files | static | The agent's transcript or diff claims a file was created, but the file isn't on disk. Scans for patterns like "created file: path/to/x.py" and cross-checks against the filesystem. Catches hallucinated file creation. |
| 5 | suppressed_exception | static | A try/except block in newly-added code that silently swallows exceptions: bare except:, except Exception: pass, or except: logger.error(...) with no re-raise. Catches the "make the error go away" anti-fix. |
| 6 | trace | dynamic | Runs the test suite under sys.settrace and records every function that actually executes. Any newly-added function whose body never runs gets flagged. Catches agents who add code and tests, but the tests never reach the new code. |
| 7 | coverage_delta | dynamic | Line-level coverage on just the lines the diff added. If a newly-added line is below 50 % covered, it's flagged. Sharper than whole-file coverage because it ignores pre-existing untested code. |
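
To make the trace check (row 6) concrete, here is a minimal sketch of the idea using sys.settrace. It is not Verdict's actual tracer plugin: the helper run_with_call_trace and the sample functions are hypothetical names made up for this example, and a real check would also match on file paths rather than bare function names.

```python
import sys
from typing import Callable, Set


def run_with_call_trace(entry: Callable[[], object]) -> Set[str]:
    """Run `entry` and return the names of Python functions entered while it ran."""
    executed: Set[str] = set()

    def tracer(frame, event, arg):
        # "call" fires once each time a new Python frame is entered.
        if event == "call":
            executed.add(frame.f_code.co_name)
        # Returning None skips line-level tracing inside that frame.
        return None

    previous = sys.gettrace()
    sys.settrace(tracer)
    try:
        entry()
    finally:
        sys.settrace(previous)  # always restore whatever tracer was active before
    return executed


# Hypothetical "diff": two new functions, only one of which the test reaches.
def apply_discount(price: int) -> int:
    return price - 10


def apply_surcharge(price: int) -> int:
    return price + 5


def test_discount() -> None:
    assert apply_discount(100) == 90


executed = run_with_call_trace(test_discount)
newly_added = {"apply_discount", "apply_surcharge"}
never_ran = sorted(newly_added - executed)
print(never_ran)  # → ['apply_surcharge']
```

The test passes, yet apply_surcharge never executes, which is exactly the "green but unreached" situation the trace check is meant to flag.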
Each finding carries:
{
"kind": "vacuous_test",
"file": "tests/test_payment.py",
"line": 42,
"message": "test_refund only asserts on mocks created in the test",
"confidence": 0.85
}
The scorecard rule (current, intentionally simple): any finding with confidence > 0.8 → LIED; any findings at all → SUSPICIOUS; otherwise → PASS. (See verdict/report.py for the planned post-hackathon Bayesian-aggregation replacement.)
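
The rule above is small enough to write out. This is a sketch of the rule as stated, not the code in verdict/report.py; the function name score and the LIED_THRESHOLD constant are assumptions for illustration.

```python
from typing import Any, Iterable, Mapping

LIED_THRESHOLD = 0.8  # the rule says strictly greater than 0.8 means LIED


def score(findings: Iterable[Mapping[str, Any]]) -> str:
    """Collapse a list of findings into PASS, SUSPICIOUS, or LIED."""
    findings = list(findings)
    if any(float(f["confidence"]) > LIED_THRESHOLD for f in findings):
        return "LIED"
    if findings:
        return "SUSPICIOUS"
    return "PASS"


print(score([]))                      # → PASS
print(score([{"confidence": 0.8}]))   # → SUSPICIOUS (0.8 is not above the threshold)
print(score([{"confidence": 0.85}]))  # → LIED
```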
Architecture at a glance
┌────────────────────────────────────────────────────────────────┐
│                        You / Bob agent                         │
└──────────┬───────────────────────┬─────────────────────────────┘
           │                       │
           │ MCP tool call          │ CLI invocation
           │                       │
           ▼                       ▼
  ┌──────────────────┐    ┌──────────────────┐
  │   verdict-mcp    │    │   verdict run    │
  │   (MCP server)   │    │   (Click CLI)    │
  └────────┬─────────┘    └────────┬─────────┘
           │                       │
           └───────────┬───────────┘
                       ▼
        ┌──────────────────────────────┐
        │   Check discovery & runner   │
        │     (pkgutil → 7 checks)     │
        └──────────────┬───────────────┘
                       │
       ┌───────────────┼────────────────┐
       ▼               ▼                ▼
  ┌────────┐      ┌────────┐       ┌──────────┐
  │ static │      │ static │  ...  │ dynamic  │
  │ checks │      │ checks │       │ checks   │
  └────┬───┘      └────┬───┘       └────┬─────┘
       └───────────────┼────────────────┘
                       ▼
             ┌────────────────────┐
             │  Scorecard JSON    │ ← verdict-report.json
             │  PASS/SUSPICIOUS/  │
             │  LIED + findings   │
             └─────────┬──────────┘
                       │
             ┌─────────┴──────────┐
             ▼                    ▼
      ┌─────────────┐      ┌──────────────┐
      │ Verdict tab │      │  Dashboard   │
      │ (VS Code/   │      │ (git-history │
      │  Bob ext.)  │      │  analytics)  │
      └─────────────┘      └──────────────┘
Five distinct surfaces, one scorecard format:
- verdict CLI: verdict run on any git repo, prints a scorecard, writes verdict-report.json.
- verdict-mcp MCP server: exposes check_diff as an MCP tool that Bob (or any MCP client) can call mid-conversation.
- Bob Custom Mode + /verify slash command: turns Bob into a read-only auditor that runs Verdict and reports findings without making edits.
- Verdict VS Code tab: a bottom-panel webview inside Bob (or upstream VS Code) that renders the latest scorecard, persists dismiss/resolve status, and offers a "Fix with Bob" handoff.
- Verdict dashboard: backfills the entire git history of a repo, scores each commit, and serves a local web UI for trend analytics.
Use cases
Reviewing an AI's pull request. Before you read the diff, run verdict run on the branch. If the verdict is LIED, start by reading those findings — the agent is probably hiding something. If PASS, the diff is at least internally consistent.
Inside a Bob coding session. Install the MCP server and the Verifier mode. After Bob finishes a coding task, switch into Verifier mode (or type /verify). Bob will audit its own diff and report findings verbatim with file:line citations. Tight feedback loop, no human review needed for trivial diffs.
As a CI gate. verdict run --fail-on lied exits 1 if any high-confidence lies are detected. Drop it in GitHub Actions after pytest. Catches the "tests are green but the new code never runs" failure mode that test-runners can't see.
As a quality dashboard. verdict dashboard backfills history and surfaces trends. Useful for spotting when AI-generated commits started slipping past review, or which authors / which areas of the codebase get the most SUSPICIOUS verdicts.
As an editor surface. Install the Verdict VS Code tab. Hit "Run Audit" in the toolbar, click any finding to jump to the exact line, dismiss false positives, or hand the finding off to Bob with one click to fix.
Installation
Prerequisites
- Python 3.10+ (for the CLI and MCP server)
- Git (Verdict diffs against HEAD by default)
- Bob IDE or VS Code 1.74+ (for the Verdict tab; CLI works without an editor)
- Node.js 20+ (only if you want to rebuild the VS Code extension from source)
TL;DR (the 5-minute demo flow)
On a fresh machine, end-to-end:
# 1. install the CLI
pip install myverdict
verdict --help
# 2. hook into Bob (one-time, machine-wide)
verdict mcp-install --global
verdict bob-mode-install --global
# restart Bob
# 3. install the editor tab
# download verdict-vscode-0.2.0.vsix from this repo's
# verdict-vscode/ folder, drop it in any project folder,
# right-click → Install Extension VSIX, reload the window.
# 4. try it on a real repo
cd <some-git-repo>
verdict run --diff-range HEAD~1
# 5. add it to a repo's CI (one .github/workflows/verdict.yml file, see below)
# 6. browse history
verdict dashboard
Each step is explained in detail below.
1. Install the Verdict CLI
From anywhere:
pip install myverdict
Or, if you've cloned this repo and want an editable dev install:
pip install -e . # from the repo root
Either way, this installs two console scripts:
- verdict: the audit CLI
- verdict-mcp: the MCP server entry point
Verify:
verdict --help
verdict run --help
2. Run an audit
From inside any git repository:
verdict run # audit HEAD vs working tree
verdict run --diff-range HEAD~1 # audit the last commit
verdict run --static-only # skip dynamic checks (no pytest run)
verdict run --json # emit JSON to stdout
verdict run --fail-on lied # CI mode: exit 1 on LIED verdict
Output:
Verdict: LIED (3 findings, 12 new functions analyzed)
[dead_function] verdict/foo.py:42 helper_unused never referenced (0.90)
[vacuous_test] tests/test_foo.py:18 test_helper has no assertions (0.85)
[suppressed_exc] verdict/foo.py:67 bare except swallows exception (0.75)
Verdict also writes the full report to verdict-report.json in the repo root. That's what the VS Code tab reads.
3. Install the Bob MCP integration (optional)
If you're using Bob, hook Verdict into Bob's tool list so agents can call verdict.check_diff mid-conversation. Two scopes:
# Recommended: install once, available in every project on this machine
verdict mcp-install --global
verdict bob-mode-install --global
# Or, per-project install (writes into ./.bob/)
verdict mcp-install
verdict bob-mode-install
--global writes to ~/.bob/settings/mcp_settings.json, ~/.bob/settings/custom_modes.yaml, and ~/.bob/commands/verify.md. Drop --global to install only into the current project's .bob/ directory.
Restart Bob. You should now see:
- A Verifier mode in Bob's mode picker (read-only auditor).
- A /verify slash command available in any mode (one-shot audit, doesn't switch modes).
- A verdict tool group in Bob's tool list.
Both installers are non-destructive — they preserve existing config and merge.
4. Install the Verdict VS Code tab
A prebuilt .vsix ships with this repo. Three ways to install it — pick whichever is easiest:
A. Download → drop in your project → right-click (easiest, works on a fresh machine without cloning the repo).
- Download verdict-vscode-0.2.0.vsix from this repo's verdict-vscode/ folder. (Right-click the link → Save Link As, or grab it from the GitHub web UI.)
- Move the .vsix into any project folder you have open in Bob / VS Code.
- In the editor's file explorer (left sidebar), right-click the .vsix file → Install Extension VSIX.
- Reload the window when prompted.
B. Command Palette. Ctrl+Shift+P (or Cmd+Shift+P) → Extensions: Install from VSIX… → navigate to the .vsix → Install. Reload.
C. Terminal — one-liner. From wherever the .vsix lives (works for both Bob and VS Code — Bob ships the same code CLI):
code --install-extension verdict-vscode-0.2.0.vsix --force
The --force flag overwrites any previous install of the same version. If you get command not found: code, the CLI isn't on PATH — open the editor, hit Ctrl+Shift+P, run Shell Command: Install 'code' command in PATH, then try again. Or just use method A or B — they don't need the shell command.
After installing by any method, reload the editor window, then reveal the bottom panel (Ctrl+J) — there will be a new Verdict tab. Click Run Audit in the toolbar and the panel populates.
If you'd rather build from source:
cd verdict-vscode
npm install
npm run compile
npx @vscode/vsce package --no-dependencies --allow-missing-repository
5. Add Verdict to a repo's CI (optional)
Drop this file into the repo you want audited:
# .github/workflows/verdict.yml
name: Verdict
on:
  pull_request:
    types: [opened, synchronize, reopened]
permissions:
  contents: read
  pull-requests: write
jobs:
  audit:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
        with:
          fetch-depth: 0
      - uses: nshah271/verdict@main
        # Optional inputs (all have defaults):
        #   fail-on:       never | suspicious | lied   (default: never)
        #   comment-on-pr: true | false                (default: true)
        #   static-only:   true | false                (default: false)
        #   diff-range:    <any git range>             (default: origin/main...HEAD)
What happens after that:
- Every PR (open + every push) gets a bot comment listing findings, each one a clickable deep link to the exact line.
- Anyone watching the PR gets the comment in their inbox via GitHub's normal notification path — that's the "Verdict emails me my findings" experience, no SMTP setup needed.
- The check stays green by default (informational). Set fail-on: lied if you want it to actually block merges on LIED verdicts.
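
One way to read the fail-on thresholds is as a severity ordering. The sketch below is an assumption about the semantics, not the action's actual code: it treats suspicious as "fail on SUSPICIOUS or worse", and the names SEVERITY and exit_code are hypothetical.

```python
# Ordered from least to most severe, matching the scorecard labels.
SEVERITY = {"PASS": 0, "SUSPICIOUS": 1, "LIED": 2}


def exit_code(verdict: str, fail_on: str = "never") -> int:
    """Return 1 when `verdict` is at least as severe as the `fail_on` threshold."""
    if fail_on == "never":
        return 0  # informational mode: the check always stays green
    threshold = SEVERITY[fail_on.upper()]
    return 1 if SEVERITY[verdict] >= threshold else 0


print(exit_code("LIED", "lied"))        # → 1
print(exit_code("SUSPICIOUS", "lied"))  # → 0
print(exit_code("LIED"))                # → 0
```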
6. Open the analytics dashboard (optional)
From the repo root:
verdict dashboard
This walks the git history, scores each commit, writes one JSON per commit into dashboard/data/, then serves the dashboard at http://localhost:8765 and opens your browser. First run prompts for commit count and branch; subsequent runs reuse cached scores.
Verdict tab — what's in it
The Verdict tab is a two-pane webview inside the editor's bottom panel:
- Toolbar: Run Audit · Run Static Audit · Refresh · diff-range selector · group-by (type / file / severity) · sort-by · fuzzy search.
- Left pane: grouped, filterable list of findings.
- Right pane: detail view for the selected finding — full message, file:line link, dismiss/resolve buttons, Fix with Bob handoff (sends the finding into Bob as a fix prompt).
- Status bar: left-side indicator with the current verdict color.
Status (dismissed / resolved / open) persists across runs in .bob/verdict-state.json, keyed by a stable SHA-1 of (kind, file, line, message) so re-runs of Verdict don't lose your triage state — unless the underlying finding actually changed.
Settings (workspace or user settings.json):
| Key | Default | Purpose |
|---|---|---|
| verdict.pythonPath | "" | Python interpreter to run python -m verdict.cli. Empty → use verdict from PATH. |
| verdict.reportPath | "verdict-report.json" | Where to read the report from. |
| verdict.diffRange | "HEAD" | Default --diff-range for in-tab runs. |
| verdict.groupBy | "type" | type / file / severity. |
| verdict.sortBy | "severity" | severity / file / title. |
Repo layout
verdict/
├── verdict/                       # The Python package
│   ├── cli.py                     # `verdict run`, `verdict mcp-install`, `verdict dashboard`
│   ├── diff.py                    # git diff parsing → ChangedFile[]
│   ├── ast_utils.py               # AST walk → AddedFunction[]
│   ├── report.py                  # Scorecard construction, terminal/JSON formatting
│   ├── types.py                   # Shared TypedDicts (the contract between checks)
│   ├── mcp_server.py              # `verdict-mcp` entry point
│   ├── dashboard_cmd.py           # Backfill + local HTTP server
│   ├── _tracer_plugin.py          # pytest plugin: function-execution tracer (dynamic)
│   ├── _coverage_plugin.py        # pytest plugin: line-level coverage (dynamic)
│   ├── bob_integration/
│   │   ├── custom_mode.yaml       # The "Verifier" Custom Mode
│   │   └── slash_commands/
│   │       └── verify.md          # /verify slash command
│   └── checks/                    # The seven checks (auto-discovered via pkgutil)
│       ├── dead_functions.py
│       ├── vacuous_tests.py
│       ├── hallucinated_api.py
│       ├── phantom_files.py
│       ├── suppressed_exc.py
│       ├── trace.py
│       └── coverage_delta.py
│
├── verdict-vscode/                # The VS Code / Bob extension ("Verdict tab")
│   ├── package.json               # Manifest: tab, commands, settings
│   ├── src/
│   │   ├── extension.ts           # activate(), command wiring, file watcher
│   │   ├── findingsStore.ts       # Load + normalize the JSON report
│   │   ├── statusStore.ts         # .bob/verdict-state.json triage persistence
│   │   ├── statusBar.ts           # Left-side status-bar item
│   │   ├── verdictRunner.ts       # Spawns the CLI, streams to output channel
│   │   └── webviewProvider.ts     # Webview host (HTML/CSP, host↔webview messages)
│   ├── media/
│   │   ├── main.css               # Design tokens + components
│   │   ├── main.js                # Webview client (toolbar, list, detail)
│   │   └── verdict.svg            # Tab icon
│   └── verdict-vscode-0.2.0.vsix  # Prebuilt, install directly
│
├── dashboard/                     # Analytics dashboard (static HTML + per-commit JSON)
│   ├── index.html
│   ├── app.js
│   ├── style.css
│   ├── backfill.py
│   └── data/                      # One <sha>.json per scored commit
│
├── tests/                         # Pytest suite for the checks and CLI
├── pyproject.toml                 # Python packaging
├── action.yml                     # GitHub Action wrapper for CI
├── Dockerfile                     # Containerized verdict runner
└── README.md
Design choices worth calling out
Static + dynamic, not just one. Static-only catches dead code and obvious lies but can't tell you whether the tests actually exercise the new function. Dynamic-only requires a working test suite and is slow. Verdict runs both, so a project with no tests still gets meaningful static findings, and a project with tests gets the deeper dynamic signal.
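
For a feel of what the static half can do without running anything, here is a rough sketch in the spirit of the suppressed_exception check, built on the standard ast module. It is an illustration only: find_swallowed_exceptions is a hypothetical helper, and it covers just two patterns (bare except and a handler whose body is only pass), not everything the real check looks for.

```python
import ast
from typing import List, Tuple


def find_swallowed_exceptions(source: str) -> List[Tuple[int, str]]:
    """Return (line, reason) for handlers that are bare or whose body is only `pass`."""
    hits: List[Tuple[int, str]] = []
    for node in ast.walk(ast.parse(source)):
        if not isinstance(node, ast.ExceptHandler):
            continue
        if node.type is None:
            hits.append((node.lineno, "bare except"))
        elif all(isinstance(stmt, ast.Pass) for stmt in node.body):
            hits.append((node.lineno, "handler body is only pass"))
    return sorted(hits)


SAMPLE = '''
def load(path):
    try:
        return open(path).read()
    except Exception:
        pass

def parse(text):
    try:
        return int(text)
    except ValueError as exc:
        raise RuntimeError("bad input") from exc
'''

print(find_swallowed_exceptions(SAMPLE))  # → [(5, 'handler body is only pass')]
```

The second handler re-raises, so it is left alone; only the silent one is reported.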
Per-check timeout, never a stuck audit. Dynamic checks run pytest under a tracer. If a project's test suite hangs, the MCP server would hang too. Each check is wrapped in a ThreadPoolExecutor.result(timeout=…) so one slow check can't take down the whole audit — it produces a check_timed_out finding and Verdict moves on.
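
A minimal sketch of that timeout wrapper, assuming each check is a zero-argument callable that returns a list of findings (the names run_check_with_timeout, slow_check, and fast_check are hypothetical). One caveat worth knowing: result(timeout=...) only stops waiting; it does not kill the worker thread, so a truly hung check keeps running in the background.

```python
import time
from concurrent.futures import ThreadPoolExecutor
from concurrent.futures import TimeoutError as FutureTimeout
from typing import Callable, Dict, List

Finding = Dict[str, object]


def run_check_with_timeout(
    name: str, check: Callable[[], List[Finding]], timeout_s: float
) -> List[Finding]:
    """Run one check in a worker thread and give up waiting after `timeout_s`."""
    executor = ThreadPoolExecutor(max_workers=1)
    future = executor.submit(check)
    try:
        return future.result(timeout=timeout_s)
    except FutureTimeout:
        # The worker thread is not killed; we only stop waiting for it.
        return [{"kind": "check_timed_out", "message": f"{name} exceeded {timeout_s}s"}]
    finally:
        # wait=False so a stuck check cannot block the caller here.
        executor.shutdown(wait=False)


def slow_check() -> List[Finding]:
    time.sleep(0.3)  # stands in for a hanging test suite
    return []


def fast_check() -> List[Finding]:
    return [{"kind": "dead_function", "message": "helper_unused never referenced"}]


print(run_check_with_timeout("trace", slow_check, 0.05)[0]["kind"])         # → check_timed_out
print(run_check_with_timeout("dead_function", fast_check, 1.0)[0]["kind"])  # → dead_function
```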
Stable finding IDs. Triage state (dismissed / resolved) is keyed by sha1(kind \0 file \0 line \0 message).slice(0, 12). Re-running Verdict doesn't lose your triage decisions unless the underlying finding genuinely changed.
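
The same ID scheme, sketched in Python for illustration. Treat the details as assumptions: the formula above does not say how line is serialized or which text encoding is used, so this version picks the decimal string and UTF-8, and finding_id is a hypothetical name.

```python
import hashlib


def finding_id(kind: str, file: str, line: int, message: str) -> str:
    """First 12 hex chars of SHA-1 over the NUL-joined finding fields."""
    payload = "\0".join([kind, file, str(line), message]).encode("utf-8")
    return hashlib.sha1(payload).hexdigest()[:12]


a = finding_id("vacuous_test", "tests/test_payment.py", 42, "test_refund only asserts on mocks")
b = finding_id("vacuous_test", "tests/test_payment.py", 42, "test_refund only asserts on mocks")
c = finding_id("vacuous_test", "tests/test_payment.py", 43, "test_refund only asserts on mocks")
print(a == b, a == c)  # → True False
```

Joining on NUL keeps field boundaries unambiguous, and any change to a field (here, the line number) produces a different ID, which is why triage state resets when a finding moves.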
Checks are plug-ins, not hardcoded. discover_checks() walks verdict.checks via pkgutil.iter_modules and pulls each module's top-level check attribute. Adding an eighth check is one new file in verdict/checks/ — no registry, no wiring.
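
A sketch of what pkgutil-based discovery can look like. This is not the body of Verdict's discover_checks(): it assumes check is a callable, takes the package as an argument, and demonstrates against a throwaway package (the hypothetical demo_checks) written to a temp directory so the example is self-contained.

```python
import importlib
import pkgutil
import sys
import tempfile
from pathlib import Path
from types import ModuleType
from typing import Callable, Dict


def discover_checks(package: ModuleType) -> Dict[str, Callable]:
    """Map module name -> its top-level `check` attribute, for each module in `package`."""
    checks: Dict[str, Callable] = {}
    for info in pkgutil.iter_modules(package.__path__):
        module = importlib.import_module(f"{package.__name__}.{info.name}")
        check = getattr(module, "check", None)
        if callable(check):
            checks[info.name] = check
    return checks


# Demo: build a throwaway package on disk (hypothetical name `demo_checks`).
root = Path(tempfile.mkdtemp())
pkg_dir = root / "demo_checks"
pkg_dir.mkdir()
(pkg_dir / "__init__.py").write_text("")
(pkg_dir / "always_clean.py").write_text("def check(diff):\n    return []\n")
(pkg_dir / "helpers.py").write_text("VALUE = 1\n")  # no `check`, so it is skipped

sys.path.insert(0, str(root))
importlib.invalidate_caches()
demo_checks = importlib.import_module("demo_checks")
found = discover_checks(demo_checks)
print(sorted(found))  # → ['always_clean']
```

Dropping another file with a check function into the package directory is all it takes for it to show up in the result.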
Bob-first but not Bob-only. Every Bob-specific integration (MCP, Custom Mode, slash command, .vsix) is optional. The CLI is the primary surface. Anything that works in Bob works in plain VS Code, and the CLI works without any editor.
Roadmap / known gaps
- Scorecard is a hard threshold. A 0.80-confidence finding is SUSPICIOUS; 0.81 jumps to LIED. Replacement plan (Bayesian aggregation with per-check priors, corroboration bonus, soft bands) is sketched in verdict/report.py.
- Python only. All checks parse Python AST. Adding TypeScript/Go is a straightforward fork of ast_utils.py and the static checks, but it's not done.
- Confidences are uncalibrated guesses. Each check author picked their own. A small labeled fixture corpus + per-check precision/recall tuning is the path to real probabilities.
- VS Code extension engine target is high. Currently ^1.85.0; lowering it is a one-line fix to broaden Bob/VS Code compatibility.
Made for the IBM Bob hackathon (May 2026)
Verdict is the team's submission. The team:
- Neel
- Jacob
- Alexie
- Ben
Built end-to-end on Bob (a lot of Verdict was written by the thing it audits).