Anti-sycophancy multi-evaluator engine. Empanel independent reviewers with lens diversity and isolated weighing to get code review without anchoring bias.

empanel

Anti-sycophancy multi-evaluator engine for LLM agents.

Empanel a grand jury of independent reviewers over the same code. Each lens returns its own findings. Weighers score them in isolation. A deterministic synthesizer combines the results into a ranked report, and no reviewer ever sees another's opinion.

The thesis: LLM-as-judge workflows anchor catastrophically when one model sees another's output. Empanel enforces isolation at every stage so diverse lenses produce diverse findings, and weighers can't be flattered into consensus.

pip install empanel

Quickstart

# Review a file with four independent lenses, synthesize a ranked report
empanel review \
  --files src/contract.sol \
  --spec spec.md \
  --model claude-opus-4-7 \
  --output review.json \
  --markdown review.md

The same review from the Python API:

from pathlib import Path

from empanel import CodeReviewEngine
from empanel.lenses import SECURITY, SPEC, EDGE_CASES, ARCHITECTURE

# Seat four lenses; each one reviews the code independently.
engine = CodeReviewEngine(lenses=[SECURITY, SPEC, EDGE_CASES, ARCHITECTURE])
result = engine.run(code=Path("src/contract.sol").read_text(),
                    spec=Path("spec.md").read_text())

# Print the ranked findings with their confidence tier and location.
for finding in result.findings:
    print(f"[{finding.confidence}] {finding.title} ({finding.location})")

Why the grand jury metaphor

A grand jury is the closest real proceeding to what this tool does:

  • Multiple independent reviewers hearing the same evidence
  • No cross-examination between reviewers — each works in isolation
  • Output is a ranked list of indictments (issues worth pursuing), not a verdict
  • Evidence-enforced — every finding must cite the code

The architecture maps directly onto those properties. Adding a new lens is seating another juror; tightening weighing is tightening isolation rules.

How it works

Three phases, isolated by construction:

  1. Evaluate — each lens reviews the code independently. No cross-talk.
  2. Weigh — each weigher scores the raw findings without seeing other weighers' scores or lens identities. This is the anti-sycophancy wedge: a weigher can't be flattered into agreement with the majority.
  3. Synthesize — deterministic math combines the scores. Finding fingerprints deduplicate near-identical reports. Confidence tiers fall out of reviewer concurrence, not vibes.
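
The weigh and synthesize phases can be sketched in a few lines. This is an illustration only, not empanel's implementation: the Finding dataclass, the fingerprint, blind, and synthesize helpers, and the tier thresholds are all hypothetical names and values chosen for the example.

```python
import hashlib
import re
from collections import defaultdict
from dataclasses import dataclass
from statistics import mean


@dataclass(frozen=True)
class Finding:
    """Hypothetical stand-in for a lens finding; not empanel's real type."""
    lens: str
    title: str
    location: str
    evidence: str


def fingerprint(f: Finding) -> str:
    """Collapse near-identical reports: same location plus same normalized title."""
    normalized = re.sub(r"[^a-z0-9]+", " ", f.title.lower()).strip()
    return hashlib.sha256(f"{f.location}|{normalized}".encode()).hexdigest()[:12]


def blind(f: Finding) -> dict:
    """What a weigher is shown: the finding with the lens identity removed."""
    return {"title": f.title, "location": f.location, "evidence": f.evidence}


def synthesize(findings, scores):
    """Deterministically merge findings into a ranked report.

    scores maps fingerprint -> list of weigher scores, each produced in isolation.
    Assumed tiers: 3+ distinct lenses agree -> high, 2 -> medium, 1 -> low.
    """
    groups = defaultdict(list)
    for f in findings:
        groups[fingerprint(f)].append(f)

    report = []
    for fp, members in groups.items():
        lenses = {m.lens for m in members}
        tier = "high" if len(lenses) >= 3 else "medium" if len(lenses) == 2 else "low"
        report.append({
            "fingerprint": fp,
            "title": members[0].title,
            "location": members[0].location,
            "lenses": sorted(lenses),
            "confidence": tier,
            "score": mean(scores.get(fp, [0.0])),
        })
    # Sort by concurrence, then score, then fingerprint so the order is stable.
    report.sort(key=lambda r: (-len(r["lenses"]), -r["score"], r["fingerprint"]))
    return report


findings = [
    Finding("SECURITY", "Unchecked transfer return value", "contract.sol:42", "token.transfer(to, amt);"),
    Finding("EDGE_CASES", "unchecked transfer return value!", "contract.sol:42", "token.transfer(to, amt);"),
    Finding("READABILITY", "Ambiguous variable name", "contract.sol:7", "uint a;"),
]
scores = {fingerprint(findings[0]): [0.9, 0.8], fingerprint(findings[2]): [0.3]}
report = synthesize(findings, scores)
for row in report:
    print(f"[{row['confidence']}] {row['title']} ({row['location']})")
```

The two transfer reports share a fingerprint, so they merge into one finding backed by two lenses. The confidence tier comes only from how many distinct lenses concur, and the blind view is what keeps lens identity away from the weighers.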

Replay artifacts are stored as JSON so any review can be reproduced or disputed after the fact.
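
As a rough sketch of why a JSON artifact makes a review reproducible, the example below writes a record with sorted keys and checks a digest on reload. The record layout and the save_replay and load_replay helpers are assumptions made for illustration; empanel's actual artifact schema is not shown here.

```python
import hashlib
import json
import tempfile
from pathlib import Path


def save_replay(path: Path, record: dict) -> str:
    """Write the record as key-sorted JSON and return its SHA-256 digest."""
    payload = json.dumps(record, sort_keys=True, indent=2)
    path.write_text(payload, encoding="utf-8")
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


def load_replay(path: Path, expected_digest: str) -> dict:
    """Reload a record, refusing it if the content changed since it was saved."""
    payload = path.read_text(encoding="utf-8")
    actual = hashlib.sha256(payload.encode("utf-8")).hexdigest()
    if actual != expected_digest:
        raise ValueError("replay artifact was modified after the review")
    return json.loads(payload)


record = {  # hypothetical layout, for illustration only
    "model": "claude-opus-4-7",
    "lenses": ["SECURITY", "SPEC"],
    "findings": [
        {"title": "Unchecked transfer return value", "location": "contract.sol:42"}
    ],
}

with tempfile.TemporaryDirectory() as tmp:
    artifact = Path(tmp) / "replay.json"
    digest = save_replay(artifact, record)
    restored = load_replay(artifact, digest)

print(restored == record)
```

Sorting keys keeps the serialized form stable across runs, which makes two artifacts easy to diff when a finding is disputed.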

Built-in lenses

  • SECURITY — adversarial threat model framing
  • SPEC — compares implementation against an optional spec
  • EDGE_CASES — boundaries, error paths, null/empty, overflow
  • ARCHITECTURE — coupling, leaky abstractions, hidden state
  • PERFORMANCE — complexity, allocation, hot paths
  • READABILITY — naming, flow, cognitive load

Register custom lenses by subclassing Lens and passing them to the engine. The only contract is "return a list of Findings with evidence."
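
A custom lens might look roughly like the sketch below. The Lens and Finding classes here are local stand-ins so the example runs on its own; the real base class, its method name (review is a guess), and the Finding fields may differ, so check the package source before copying this.

```python
from abc import ABC, abstractmethod
from dataclasses import dataclass


@dataclass(frozen=True)
class Finding:  # stand-in; empanel's real Finding may have different fields
    title: str
    location: str
    evidence: str


class Lens(ABC):  # stand-in for empanel's Lens base class
    name: str

    @abstractmethod
    def review(self, code: str) -> list[Finding]:
        """Return findings, each citing the code it is based on."""


class TodoLens(Lens):
    """Flags leftover TODO/FIXME markers, citing the offending line as evidence."""
    name = "TODO"

    def review(self, code: str) -> list[Finding]:
        findings = []
        for lineno, line in enumerate(code.splitlines(), start=1):
            if "TODO" in line or "FIXME" in line:
                findings.append(Finding(
                    title="Unresolved TODO/FIXME marker",
                    location=f"line {lineno}",
                    evidence=line.strip(),
                ))
        return findings


sample = "def pay(amount):\n    # TODO: validate amount\n    return amount\n"
lens_findings = TodoLens().review(sample)
for f in lens_findings:
    print(f"{f.title} at {f.location}: {f.evidence}")
```

This lens is rule-based to keep the example self-contained. A model-backed lens would follow the same shape, as long as every finding carries the code it cites.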

Integration points

  • Claude Code slash command: qa-hard uses empanel as the review backend. Runs without an API key by dispatching each evaluator through the Claude Code Agent tool.
  • Standalone CLI: empanel review works with any Anthropic-API-keyed setup. Model is a flag, so anything that quacks like Claude works.
  • Fixtures: empanel.fixtures bundles the regression corpus of real bugs the engine has caught. Use tests/test_fixtures.py as a template for pinning your own.

Status

  • v0.3.0 — renamed from independent-eval. 141 tests pass.
  • Self-review converged after three rounds at v0.2.x — engine reviews its own source without regressions.
  • Roadmap: ROADMAP.md (pluggable lenses, cost budgeting, cross-session replay diffing).

License

MIT.
