
Context-isolated verification harness for AI-generated code

Project description

CrossReview


English | 简体中文

Automated cross-review for AI coding — same model, clean session, independent second pass on your output.

What is Cross-Review?

In human code review, a change is typically inspected by someone who did not directly implement it, which reduces author bias. CrossReview applies the same principle to AI-generated code by separating generation and review into two isolated contexts.

An AI coding assistant (Claude, Copilot, Cursor, etc.) first produces the change in its original session. CrossReview then packages the diff, stated intent, focus areas, and optional context into a ReviewPack and hands it to a separate reviewer session for verification. That reviewer does not inherit the original conversation, reasoning trace, or tool history; it evaluates the change only from the minimum necessary inputs.

The key insight: you don't need a different model, just a different context. Same model, clean session, real findings.

Why It Works

The mechanism is not model diversity; it is input isolation.

The author session accumulates local assumptions, discarded alternatives, retries, and tool-side trial-and-error. If the review step reuses that context, the reviewer is likely to preserve the author's framing instead of independently re-deriving whether the change is correct.

CrossReview avoids that by constraining reviewer input to the review artifact itself:

Reviewer receives:

  • Diff / changed files
  • Stated intent
  • Focus areas
  • Optional context files

Reviewer does not receive:

  • Original conversation
  • Planning or reasoning trace
  • Tool call history
  • Retries, failed attempts, intermediate drafts

This separation has two practical effects:

  • It increases reviewer independence, because the second pass must justify findings from the artifact rather than from inherited session state.
  • It improves auditability, because reviewer claims can be checked against ReviewPack contents, emitted findings, and deterministic normalization rules.
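
The isolation boundary behaves like a filter over session state. A minimal sketch of that idea (the field names here are illustrative, not CrossReview's actual schema):

```python
# Illustrative only: field names are assumptions, not CrossReview's schema.
ALLOWED_FIELDS = {"diff", "intent", "focus_areas", "context_files"}

def build_review_input(session_state: dict) -> dict:
    """Keep only artifact fields; drop conversation, reasoning, tool history."""
    return {k: v for k, v in session_state.items() if k in ALLOWED_FIELDS}

author_session = {
    "diff": "--- a/src/auth.py\n+++ b/src/auth.py\n...",
    "intent": "fix auth token refresh",
    "focus_areas": ["auth"],
    "conversation": ["...many turns..."],      # never reaches the reviewer
    "tool_history": ["pytest run 1 failed"],   # never reaches the reviewer
}

reviewer_input = build_review_input(author_session)
```

Whatever the reviewer concludes, it must be derivable from `reviewer_input` alone, which is what makes its findings checkable after the fact.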

Eval Results

Full evaluation across 33 fixtures (claude-opus-4.6, external_only scope):

Metric Value Gate
Precision 0.885 ≥ 0.70 ✅
Recall 0.929 ≥ 0.80 ✅
Unclear rate 0.133 ≤ 0.150 ✅
Invalid findings / run 1 ≤ 2 ✅

All 9 release gate metrics pass (blocking_pass: true). See v0-scope.md §12 for the full gate definition.

Quick Start

pip install crossreview              # from PyPI (v0.1.0a2+)
pip install -e .                     # local dev (pack + verify commands)
pip install -e '.[anthropic]'        # + Anthropic standalone reviewer backend
pip install -e '.[dev]'              # dev dependencies (pytest + ruff)

# configure standalone verify via flags, crossreview.yaml, or env vars
# example:
#   export CROSSREVIEW_PROVIDER=anthropic
#   export CROSSREVIEW_MODEL=claude-sonnet-4-20250514
#   export CROSSREVIEW_API_KEY_ENV=ANTHROPIC_API_KEY
#   export ANTHROPIC_API_KEY=...

crossreview pack --diff HEAD~1 --intent "fix auth token refresh" > pack.json
crossreview pack --staged --intent "fix auth token refresh" > pack.json
crossreview verify --pack pack.json

Or in one step:

crossreview verify --diff HEAD~1 --intent "fix auth token refresh"
crossreview verify --staged --intent "fix auth token refresh"

crossreview verify --diff, --staged, and --unstaged output human-readable text by default. crossreview verify --pack outputs ReviewResult JSON (default), or human-readable text with --format human:

{
  "schema_version": "0.1-alpha",
  "artifact_fingerprint": "diff:abc123",
  "pack_fingerprint": "pack:def456",
  "review_status": "complete",
  "intent_coverage": "covered",
  "findings": [
    {
      "id": "f-001",
      "severity": "high",
      "summary": "Token refresh silently succeeds when refresh_token is expired",
      "detail": "The try/except on line 42 catches TokenExpiredError but returns the old token instead of raising.",
      "category": "logic_error",
      "locatability": "exact",
      "confidence": "plausible",
      "evidence_related_file": false,
      "actionable": true,
      "file": "src/auth.py",
      "line": 42
    }
  ],
  "advisory_verdict": {
    "verdict": "concerns",
    "rationale": "review found medium/high-severity issues"
  },
  "quality_metrics": {
    "pack_completeness": 0.85,
    "noise_count": 0,
    "raw_findings_count": 1,
    "emitted_findings_count": 1,
    "locatability_distribution": {
      "exact_pct": 1.0,
      "file_only_pct": 0.0,
      "none_pct": 0.0
    },
    "speculative_ratio": 0.0
  },
  "reviewer": {
    "type": "fresh_llm",
    "model": "claude-sonnet-4-20250514",
    "session_isolated": true,
    "failure_reason": null,
    "prompt_source": "product",
    "prompt_version": "v0.1"
  },
  "budget": {
    "status": "complete",
    "files_reviewed": 1,
    "files_total": 1,
    "chars_consumed": 842,
    "chars_limit": 12000
  }
}
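
Because a ReviewResult is plain JSON, downstream tooling can consume it directly. A small sketch that surfaces actionable medium/high findings, using field names from the example above:

```python
import json

# Minimal ReviewResult excerpt (fields as in the example above).
raw = """
{
  "review_status": "complete",
  "advisory_verdict": {"verdict": "concerns"},
  "findings": [
    {"id": "f-001", "severity": "high", "actionable": true,
     "file": "src/auth.py", "line": 42,
     "summary": "Token refresh silently succeeds when refresh_token is expired"}
  ]
}
"""

result = json.loads(raw)

# Keep only actionable findings at medium severity or above.
blocking = [
    f for f in result["findings"]
    if f["actionable"] and f["severity"] in ("medium", "high")
]
for f in blocking:
    print(f"{f['file']}:{f['line']} [{f['severity']}] {f['summary']}")
```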

Architecture

         git diff + intent + focus + context
                      │
                      ▼
              ┌────────────────┐
              │      Pack      │  Assemble ReviewPack
              └───────┬────────┘
                      │
                      ▼
              ┌────────────────┐
              │  Budget Gate   │  Focus-priority, size cap
              └───────┬────────┘
                      │
   ╔══════════════════╪═══════════════════════════╗
   ║                  ▼  Isolation Boundary       ║
   ║          ┌────────────────┐                  ║
   ║          │ Reviewer (LLM) │  Fresh session,  ║
   ║          │                │  zero shared ctx ║
   ║          └───────┬────────┘                  ║
   ╚══════════════════╪═══════════════════════════╝
                      │
                      ▼
              ┌────────────────┐
              │  Normalizer    │  Extract findings from text
              └───────┬────────┘
                      │
                      ▼
              ┌────────────────┐
              │  Adjudicator   │  Apply rules → verdict
              └───────┬────────┘
                      │
                      ▼
              ┌────────────────┐
              │ ReviewResult   │  Findings + verdict
              │ (JSON)         │  + quality metrics
              └────────────────┘

Only the Reviewer calls an LLM. Everything else is rule-based — no AI in the loop.
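
The rationale string in the sample output above ("review found medium/high-severity issues") hints at a severity-based rule. A hedged sketch of what a deterministic adjudicator of that shape might look like (the actual rule set is CrossReview's own and is not reproduced here):

```python
def adjudicate(findings: list[dict]) -> str:
    """Illustrative verdict rule: any medium/high finding -> 'concerns'.
    The real adjudicator applies CrossReview's full rule set."""
    if any(f.get("severity") in ("medium", "high") for f in findings):
        return "concerns"
    return "pass_candidate"
```

Because this step is pure rules over structured findings, the same inputs always yield the same verdict, which is what makes the pipeline auditable.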

Two reviewer backend modes:

  • Host-integrated (CLI implemented) — the host renders the reviewer prompt in an isolated context (fresh session / sub-agent), then feeds the raw analysis back to CrossReview's normalizer + adjudicator through the render-prompt and ingest CLI commands. Dependency: no extra SDK on the CrossReview side.
  • Standalone (implemented) — the CLI calls the LLM API directly. Dependency: crossreview[anthropic] + reviewer config + API key.

Host-integrated is the planned default product path. The host does NOT need to implement a Python ReviewerBackend; the integration path is render-prompt + ingest, with the host responsible for executing the canonical prompt in a fresh context and feeding raw analysis back.

Commands

crossreview pack

crossreview pack --diff HEAD~1 > pack.json
crossreview pack --diff main..feat --intent "add caching" --focus cache --context ./plan.md > pack.json

Flag Description
--diff REF Git ref (HEAD~1) or range (main..feat)
--intent TEXT Task intent (background claim, not ground truth)
--task FILE Full task description file
--focus TERM Focus review area (repeatable)
--context FILE Extra context file (repeatable)

crossreview verify

Two modes: --pack (verify a pre-built ReviewPack) or --diff (one-stop: pack + verify).

# one-stop: pack + verify, human output by default
crossreview verify --diff HEAD~1
crossreview verify --diff HEAD~1 --intent "fix auth" --focus auth

# verify a pre-built pack, JSON output by default
crossreview verify --pack pack.json
crossreview verify --pack pack.json --model claude-sonnet-4-20250514 --provider anthropic

crossreview verify requires reviewer configuration to resolve successfully:

  • --model / --provider / --api-key-env
  • or crossreview.yaml
  • or ~/.crossreview/config.yaml
  • or CROSSREVIEW_MODEL / CROSSREVIEW_PROVIDER / CROSSREVIEW_API_KEY_ENV

Flag Description
--diff REF Git ref for diff (e.g. HEAD~1, main..feat). Assembles ReviewPack inline. Mutually exclusive with --pack
--pack FILE Path to ReviewPack JSON. Mutually exclusive with --diff
--intent TEXT Task intent string (--diff mode)
--task FILE Task description file (--diff mode)
--focus TERM Focus area, repeatable (--diff mode)
--context FILE Extra context file, repeatable (--diff mode)
--format FORMAT Output format. Defaults to human with --diff, json with --pack
--model TEXT Override reviewer model
--provider TEXT Override provider (currently anthropic only)
--api-key-env VAR Override API key env variable name

crossreview render-prompt

crossreview render-prompt --pack pack.json > prompt.md
crossreview render-prompt --pack pack.json --template custom-template.md > prompt.md

Renders a ReviewPack into the full canonical reviewer prompt for the host to execute in an isolated context. No LLM call, no API key needed.

Flag Description
--pack FILE Path to ReviewPack JSON
--template FILE Custom prompt template (default: built-in product/v0.1)

crossreview ingest

crossreview ingest --raw-analysis raw.md --pack pack.json --model claude-sonnet-4-20250514
crossreview ingest --raw-analysis - --pack pack.json --model host_unknown --prompt-source product --prompt-version v0.1

Takes raw analysis text from a host-integrated review session and produces a standard ReviewResult via normalizer + adjudicator. No LLM call, no API key needed. Outputs JSON by default; use --format human for terminal-friendly output.

Flag Description
--raw-analysis FILE Raw analysis file path; - for stdin
--pack FILE Original ReviewPack JSON
--model TEXT Host model name (host_unknown if unknown)
--format FORMAT Output format: json (default) or human
--prompt-source TEXT Prompt source identifier (optional)
--prompt-version TEXT Prompt version identifier (optional)
--latency-sec FLOAT Host-measured LLM latency (optional)
--input-tokens INT Host-reported input token count (optional)
--output-tokens INT Host-reported output token count (optional)
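
The normalizer behind ingest can be illustrated with a small rule-based extraction sketch. This is not CrossReview's actual normalizer, and the `[severity] file:line summary` marker format is invented for illustration; it only shows the shape of deterministic finding extraction from free-form reviewer text:

```python
import re

# Invented marker format for illustration: "[severity] file:line summary".
FINDING_RE = re.compile(
    r"^\[(?P<severity>low|medium|high)\]\s+(?P<file>\S+):(?P<line>\d+)\s+(?P<summary>.+)$"
)

def extract_findings(raw_analysis: str) -> list[dict]:
    """Rule-based extraction: one finding per matching line, no LLM involved."""
    findings = []
    for line in raw_analysis.splitlines():
        m = FINDING_RE.match(line.strip())
        if m:
            d = m.groupdict()
            d["line"] = int(d["line"])
            findings.append(d)
    return findings

raw = """Overall the change looks reasonable.
[high] src/auth.py:42 Token refresh swallows TokenExpiredError
[low] src/auth.py:80 Minor naming nit
"""
findings = extract_findings(raw)
```

Prose that does not match a finding rule is simply dropped, which is where the `noise_count` and `raw_findings_count` vs `emitted_findings_count` metrics in the ReviewResult come from.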

Exit Codes

All commands return 0 when a ReviewResult is successfully produced, regardless of review_status or advisory_verdict. A non-zero exit code means the command failed to produce output (invalid input, missing API key, empty diff, etc.).

For automation, check review_status and advisory_verdict in the JSON output instead of relying on the exit code:

crossreview verify --diff HEAD~1 --format json | jq -e '.advisory_verdict.verdict == "pass_candidate"'
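
The same gate can be scripted without jq. A small Python sketch that maps a ReviewResult to a CI-style exit code (verdict and status values taken from this README):

```python
import json

def gate(review_result_json: str) -> int:
    """Return a CI-style exit code from ReviewResult JSON:
    0 only for a completed review with verdict pass_candidate."""
    result = json.loads(review_result_json)
    if result.get("review_status") != "complete":
        return 2  # review did not complete; treat as failure
    verdict = result.get("advisory_verdict", {}).get("verdict")
    return 0 if verdict == "pass_candidate" else 1

print(gate('{"review_status": "complete", '
           '"advisory_verdict": {"verdict": "concerns"}}'))
```

In CI, pipe the JSON output of crossreview verify into this function and use the return value as the step's exit code.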

Status

Component Status Notes
Schema ✅ Done ReviewPack / Finding / ReviewResult / Config
Pack CLI ✅ Done crossreview pack
Budget Gate ✅ Done Focus priority + soft/hard truncation
Reviewer ✅ Done ReviewerBackend protocol + Anthropic standalone
Normalizer ✅ Done Rule-based finding extraction
Adjudicator ✅ Done Rule-based advisory verdict
Verify CLI ✅ Done crossreview verify --pack
Render Prompt CLI ✅ Done crossreview render-prompt --pack (host-integrated front half)
Ingest CLI ✅ Done crossreview ingest --raw-analysis --pack --model (host-integrated back half)
Evidence Collector 🔜 Next ReviewPack.evidence path exists, empty evidence works
Eval Harness ✅ Done 33 fixtures, 9/9 gate metrics pass, blocking_pass: true
Human-readable Output ✅ Done --format human on verify/ingest
One-stop Verify ✅ Done crossreview verify --diff (pack + review in one step, default --format human)

v0 Scope

Supported: code_diff artifact only · advisory verdict · single fresh_llm reviewer · deterministic adjudicator and normalizer (no LLM fallback)

Out of scope (v0): Python SDK · MCP Server · CI/CD Action · Agent Skill runtime mode (advisory SKILL.md provided; runtime bridge deferred) · cross-model reviewer · verdict = block

Release gate: v0 must pass 9 blocking metrics (§12), including manual_recall ≥ 0.80, precision ≥ 0.70, fixture_count ≥ 20, invalid_findings_per_run ≤ 2, and 5 others. All 9 currently pass (blocking_pass: true).
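
The blocking gate is a per-metric threshold check. A sketch over the four thresholds named above (the other five gated metrics are not listed in this README and are omitted; the reported values assume the Recall figure in the eval table is manual_recall):

```python
# Thresholds copied from the gate description above; direction encodes >= vs <=.
GATES = {
    "manual_recall": (0.80, ">="),
    "precision": (0.70, ">="),
    "fixture_count": (20, ">="),
    "invalid_findings_per_run": (2, "<="),
}

def blocking_pass(metrics: dict) -> bool:
    """True only if every gated metric satisfies its threshold."""
    for name, (threshold, direction) in GATES.items():
        value = metrics[name]
        ok = value >= threshold if direction == ">=" else value <= threshold
        if not ok:
            return False
    return True

reported = {"manual_recall": 0.929, "precision": 0.885,
            "fixture_count": 33, "invalid_findings_per_run": 1}
```

A single failing metric flips the whole gate, which matches the all-or-nothing blocking_pass flag.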

License

MIT

Project details


Download files

Download the file for your platform.

Source Distribution

crossreview-0.1.0a2.tar.gz (66.7 kB)

Uploaded Source

Built Distribution


crossreview-0.1.0a2-py3-none-any.whl (40.4 kB)

Uploaded Python 3

File details

Details for the file crossreview-0.1.0a2.tar.gz.

File metadata

  • Download URL: crossreview-0.1.0a2.tar.gz
  • Upload date:
  • Size: 66.7 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? Yes
  • Uploaded via: twine/6.1.0 CPython/3.13.12

File hashes

Hashes for crossreview-0.1.0a2.tar.gz
Algorithm Hash digest
SHA256 33899bafc8bfbf950171fd1895af613279295eeee324b3acd1ace46ff4019593
MD5 85fa1105bd9f535d29c0be223b04252c
BLAKE2b-256 52557d32cfc7b500544cce782b98d781b214e3ae9818271ef48607d4edf98053


Provenance

The following attestation bundles were made for crossreview-0.1.0a2.tar.gz:

Publisher: publish.yml on evidentloop/cross-review

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

File details

Details for the file crossreview-0.1.0a2-py3-none-any.whl.

File metadata

  • Download URL: crossreview-0.1.0a2-py3-none-any.whl
  • Upload date:
  • Size: 40.4 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? Yes
  • Uploaded via: twine/6.1.0 CPython/3.13.12

File hashes

Hashes for crossreview-0.1.0a2-py3-none-any.whl
Algorithm Hash digest
SHA256 fd908d5170109545bc61060eb33b4c07070836afc8bd26dadda4e2a64600c772
MD5 3b726efea48355e157dd8f9d4bd64df4
BLAKE2b-256 0ff2f973fbed111686222d7d8990195a09120bfe5a7fab45d03c76869d0f8f0a


Provenance

The following attestation bundles were made for crossreview-0.1.0a2-py3-none-any.whl:

Publisher: publish.yml on evidentloop/cross-review

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.
