Multi-agent adversarial review & deliberation for plans/specs on subscription CLIs (reduce rework before execution)
challenge-plans
Chinese documentation: README-zh.md
Adversarially review your plan or spec before you execute it — across the coding CLIs you already have logged in. No API keys.
challenge-plans orchestrates the subscription AI coding CLIs already on your machine (Claude Code, Codex, …) to cross-examine a plan/spec and surface the flaws that cause rework downstream — and to vote across options when you're unsure. It also reviews a raw git diff as a lightweight code review pass, and drops in as an agent skill. It runs on your existing subscriptions, so there are no per-token API charges. It slots into the superpowers plan lifecycle: writing-plans → challenge-plans → executing-plans.
$ challenge-plans run plan.md --type spec --profile standard --sink markdown
# challenge-plans · challenge · verdict: request_changes
- panel: expected 3 / collected 3 · complete ✓
- diversity: 2 families
- verified: 3 high/critical reviewed by Verifier (✓ verified, may hard-gate; ? unverified, advisory)
- surviving objections: 4
- [high✓] sensitive data sent to a third-party LLM with no privacy boundary @L42-43 (security_or_privacy_boundary, by claude:scope-boundary)
- [high✓] "schema-aligned" claimed but there's no contract test @L12-30 (integration_contract_gap, by gpt:correctness)
- [high✓] no measurable acceptance threshold @L1 (contract_violation, by preflight)
- [medium?] missing_fields vs null semantics left undefined @L32-34 (ambiguity_to_wrong_implementation)
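If you want to post-process the markdown sink, the objection lines above are regular enough to parse. The following is a minimal sketch: the pattern is inferred from the sample output only, not from the tool's source, and the `Objection` type and `parse_objection` helper are hypothetical names introduced for illustration.

```python
import re
from typing import NamedTuple, Optional

# Pattern inferred from the sample output in this README (an assumption,
# not a documented format guarantee).
OBJECTION_RE = re.compile(
    r"^- \[(?P<severity>\w+)(?P<mark>[✓?])\]\s+"
    r"(?P<title>.*?)\s+@L(?P<start>\d+)(?:-(?P<end>\d+))?\s+"
    r"\((?P<category>\w+)(?:, by (?P<reviewer>[^)]+))?\)$"
)


class Objection(NamedTuple):
    severity: str
    verified: bool          # True for "✓" (cross-family verified), False for "?"
    title: str
    start: int              # first anchored line
    end: int                # last anchored line (same as start for "@L1")
    category: str
    reviewer: Optional[str]


def parse_objection(line: str) -> Optional[Objection]:
    """Return an Objection for a matching line, or None for any other line."""
    m = OBJECTION_RE.match(line.strip())
    if m is None:
        return None
    start = int(m["start"])
    end = int(m["end"]) if m["end"] else start
    return Objection(
        m["severity"], m["mark"] == "✓", m["title"],
        start, end, m["category"], m["reviewer"],
    )
```

Because JSON keys and line anchors are described as machine-stable, a structured output is likely a sturdier integration point than scraping markdown, where one is available.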
Why challenge-plans
- 🔑 No API keys, no per-token charges: it drives the subscription CLIs you're already logged into (Claude Code, Codex). Bring at least one.
- 🧪 Evidence beats headcount: a minority objection with a reproduction can override a majority vote; correctness is not decided by voting.
- 🤝 Cross-family verification: an objection only earns hard-gate authority (✓) when an independent model family reproduces it with concrete, line-anchored evidence. Single-model claims stay advisory.
- 🛡️ Guards 7 known multi-agent failure modes: vote loss, option anchoring, premature hand-off, majority-over-minority, single-round complacency, false consensus, false convergence. Each was hit (and fixed) while building this tool with its own adversarial process.
- 🌍 Reads in your language: the codebase is English, but `--lang zh` (or `ja`, `de`, `fr`, …) makes every reviewer write its findings in your language while JSON keys and line anchors stay machine-stable. One flag, no separate build; see Output in your language.
Quickstart
Requires Python ≥ 3.10 (PyYAML installs automatically). Bring at least one logged-in coding CLI — Claude Code (claude) or OpenAI Codex (codex); two different vendors unlock cross-family verification.
git clone https://github.com/hiadrianchen/challenge-plans && cd challenge-plans
pip install -e . # exposes the `challenge-plans` command
challenge-plans doctor # which backend CLIs are logged in
challenge-plans run examples/spec-sample.md --type spec --sink markdown # see a verdict on the bundled sample
Alternatively, hand the repo to your coding agent with the prompt "Install and set up challenge-plans from this repo, then run challenge-plans doctor" and it will perform the steps above. To use it as an agent skill, drop SKILL.md where your agent discovers skills.
Use
challenge-plans doctor # which backends are ready
challenge-plans run path/to/spec.md --type spec --profile standard --sink markdown # harden a plan/spec
challenge-plans run change.diff --type diff --sink markdown # review a git diff
challenge-plans weigh path/to/options.yaml --profile standard --sink markdown # vote across options
challenge-plans run path/to/spec.md --enforce # CI gate: non-approve exits non-zero
challenge-plans run path/to/spec.md --type spec --sink markdown --lang zh # findings written in Chinese
# not pip-installed? replace `challenge-plans` with: PYTHONPATH=src python3 -m challenge_plans.cli ...
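For CI, the `--enforce` behaviour can be wrapped in a small script. This is a sketch under stated assumptions: it uses only the flags listed above, the `build_command` and `gate` helpers are hypothetical names, and it relies solely on the documented rule that a non-approve verdict exits non-zero when `--enforce` is set.

```python
import subprocess
from typing import List, Optional


def build_command(
    path: str,
    artifact_type: str = "spec",
    profile: Optional[str] = None,
    lang: Optional[str] = None,
    enforce: bool = True,
) -> List[str]:
    """Assemble a `challenge-plans run` invocation from flags shown in this README."""
    cmd = ["challenge-plans", "run", path, "--type", artifact_type, "--sink", "markdown"]
    if profile:
        cmd += ["--profile", profile]
    if lang:
        cmd += ["--lang", lang]
    if enforce:
        cmd.append("--enforce")
    return cmd


def gate(path: str, **kwargs) -> int:
    """Run the review and return its exit code (non-zero means non-approve)."""
    result = subprocess.run(build_command(path, **kwargs), capture_output=True, text=True)
    print(result.stdout)
    return result.returncode
```

A CI step could then call `sys.exit(gate("path/to/spec.md"))` so that the job fails whenever the verdict is not an approval.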
Ready-to-run samples live in examples/ (spec-sample.md, options.yaml). options.yaml:
question: Refactor auth with approach A or B?
options:
- id: A
text: One-shot rewrite — concentrated risk, clean result
- id: B
text: Incremental migration — slower, every step reversible
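When a calling agent generates the options file, it can help to check the structure before invoking `weigh`. The sketch below mirrors the `question` / `options` / `id` / `text` shape of the sample; the specific rules (at least two options, unique ids, non-empty text) are assumptions for illustration, not the tool's documented schema, and `validate_options` and `render_yaml` are hypothetical helpers.

```python
import json
from typing import Dict, List


def validate_options(doc: Dict) -> List[str]:
    """Return a list of problems; an empty list means the document looks usable."""
    problems = []
    if not str(doc.get("question", "")).strip():
        problems.append("missing question")
    options = doc.get("options") or []
    if len(options) < 2:
        problems.append("need at least two options to vote on")
    seen = set()
    for i, opt in enumerate(options):
        oid = opt.get("id")
        if oid is None or not str(oid).strip():
            problems.append(f"option {i}: missing id")
        elif oid in seen:
            problems.append(f"option {i}: duplicate id {oid!r}")
        else:
            seen.add(oid)
        if not str(opt.get("text", "")).strip():
            problems.append(f"option {i}: missing text")
    return problems


def render_yaml(doc: Dict) -> str:
    """Render the document as YAML text without third-party packages.

    Strings are written with json.dumps, relying on JSON-style double-quoted
    strings also being valid YAML scalars.
    """
    lines = [f"question: {json.dumps(doc['question'])}", "options:"]
    for opt in doc["options"]:
        lines.append(f"  - id: {json.dumps(str(opt['id']))}")
        lines.append(f"    text: {json.dumps(opt['text'])}")
    return "\n".join(lines) + "\n"
```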
- `--profile fast|standard|deep`, `--sink stdout|markdown`, `--enforce` (non-approve verdicts exit non-zero; advisory exit 0 by default).
- `--lang <code>` writes the human-readable output in your language (default `en`); see Output in your language below.
- `[sev✓]` = cross-family verified, may hard-gate; `[sev?]` = unverified, advisory only.
- Artifact types: `--type spec` and `--type diff` are supported; `plan`/`decision` are reserved (rubric pending).
The bundled SKILL.md routes review/QA of a plan/spec to run automatically; option-voting is the weigh subcommand.
Output in your language
challenge-plans ships an English codebase, but the reviewers can answer in any language — just add --lang:
challenge-plans run plan.md --type spec --lang zh # objections, evidence, reproductions in Chinese
challenge-plans weigh options.yaml --lang ja # deliberation reasons in Japanese
--lang only switches the human-readable prose (steelman, titles, evidence, reproductions, vote reasons). JSON keys, enum values, and L12-15 line anchors stay verbatim, so parsing, dedup, and CI gates are unaffected. It's equivalent to exporting CHALLENGE_PLANS_LANG once. There's no separate translated build to maintain — the same English source localizes on demand.
As an agent skill: your agent just passes --lang <your-language> and the whole cross-review comes back localized. The bundled SKILL.md documents the flag so the calling agent can set it from the user's language automatically.
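A calling agent that wants to pick the output language programmatically could resolve it as sketched below. The flag name, the `CHALLENGE_PLANS_LANG` variable, and the `en` default come from this README; the precedence (an explicit flag wins over the environment variable) is an assumption, and `resolve_lang` is a hypothetical helper rather than part of the tool.

```python
import os
from typing import Mapping, Optional


def resolve_lang(flag: Optional[str], env: Mapping[str, str] = os.environ) -> str:
    """Pick the output language for the human-readable prose.

    Assumed order: explicit --lang value, then CHALLENGE_PLANS_LANG, then "en".
    """
    if flag:
        return flag
    return env.get("CHALLENGE_PLANS_LANG") or "en"
```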
Two modes
challenge-plans isn't one feature — it's two modes on one engine. The calling agent routes by intent; the user never has to pick:
| | challenge (adversarial) | weigh-options (deliberation) |
|---|---|---|
| When | You have a drafted plan/spec to poke holes in / harden | You have several options / a pile of to-dos and aren't sure which |
| Routing signal | a single drafted artifact + "review / find flaws / can this execute" | multiple candidates + "which one / rank these / is it worth it" |
| Aggregation | Evidence survival — a minority can be right, no majority vote | Weighted majority + exposed dissent — only genuine trade-offs get voted on |
| Output | 6-state verdict + surviving objections + reproductions / counter-evidence | ranked options + vote tally + strongest dissent |
The agent picks the mode — it isn't dumped on the user: it reads the intent and routes "review a drafted artifact" to adversarial mode and "choose among options" to deliberation, with deterministic routing signals defining the boundary. During deliberation, if an option is flagged with a mechanically verifiable blocker, the recommendation is downgraded to discuss and you're asked to verify it in challenge mode rather than adopting it outright — so a vote can never outweigh a falsifiable minority objection.
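The downgrade rule can be illustrated with a short sketch. It assumes the rule applies to the top-ranked option and that blockers arrive as a mapping from option id to descriptions; the `Recommendation` type and the `adopt` status name are hypothetical, while `discuss` is the status named above.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Recommendation:
    option_id: str
    status: str   # "discuss" comes from the README; "adopt" is a hypothetical name
    reason: str


def recommend(ranking: List[str], blockers: Dict[str, List[str]]) -> Recommendation:
    """Downgrade the winning option to "discuss" if it carries a flagged blocker."""
    top = ranking[0]
    flagged = blockers.get(top, [])
    if flagged:
        # The vote alone never overrides a falsifiable objection.
        return Recommendation(top, "discuss", "verify in challenge mode: " + "; ".join(flagged))
    return Recommendation(top, "adopt", "no mechanically verifiable blocker flagged")
```

The point of the design is that the vote tally decides the ranking, but only evidence decides whether the winner can be adopted without a further check.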
How it works
Adversarial mode (reduce-rework loop):
drafted artifact + bounded context
→ multiple persona/CLI challengers each steelman → find flaws (bound to specific text, no hedging)
→ Verifier (cross-family) produces a minimal reproduction / contradicting source line
→ dedup by canonical key + evidence-survival
→ single verdict pipeline → 6-state verdict + panel-integrity check
→ (--deep: multi-round to two-condition convergence)
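The dedup and evidence-survival steps of this pipeline can be sketched as follows. This is an illustration, not the tool's implementation: the canonical key is assumed to be the category plus the exact line anchor, the `low` severity level is assumed (only medium, high, and critical appear in this README), and `Finding`, `dedup`, `survivors`, and `may_hard_gate` are hypothetical names.

```python
from dataclasses import dataclass
from typing import Dict, Iterable, List, Set, Tuple

SEVERITY_RANK = {"low": 0, "medium": 1, "high": 2, "critical": 3}
Key = Tuple[str, int, int]


@dataclass(frozen=True)
class Finding:
    category: str
    start: int
    end: int
    severity: str
    family: str   # model family of the challenger that raised it

    @property
    def key(self) -> Key:
        # Assumed canonical key: category plus exact line anchor.
        return (self.category, self.start, self.end)


def dedup(findings: Iterable[Finding]) -> Dict[Key, Finding]:
    """Collapse findings with the same key, keeping the most severe one."""
    merged: Dict[Key, Finding] = {}
    for f in findings:
        kept = merged.get(f.key)
        if kept is None or SEVERITY_RANK[f.severity] > SEVERITY_RANK[kept.severity]:
            merged[f.key] = f
    return merged


def survivors(merged: Dict[Key, Finding], refuted: Set[Key]) -> List[Finding]:
    """Drop findings for which the Verifier produced counter-evidence."""
    return [f for key, f in merged.items() if key not in refuted]


def may_hard_gate(f: Finding, reproduced_by: Dict[Key, Set[str]]) -> bool:
    """A finding may hard-gate only if a different model family reproduced it."""
    return any(fam != f.family for fam in reproduced_by.get(f.key, set()))
```

Note that nothing here counts votes: a finding survives because it was not refuted, and it gains gating authority because an independent family reproduced it.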
Deliberation mode — the methodology is a strict three-phase flow. The weigh CLI implements phase ③ (it votes on the options you hand it); phases ①② are the calling agent's responsibility before invoking it — no shortcuts:
① align (agent) share full background with every voter first — the question, constraints, known facts — don't pre-supply options
② collect (agent) each voter independently, unseen by the others and not fed the orchestrator's preferences, generates candidates → dedup/cluster into an option pool
③ vote `challenge-plans weigh` votes on that option pool (model_family-weighted to block false consensus) → ranking + tally + dissent
hands back to a human only on a tie / missing votes; otherwise closes the loop and returns a result
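Phase ③ can be sketched as a family-weighted tally. The weighting scheme here is an assumption chosen for illustration (each model family contributes at most `family_cap` total weight, split evenly across its voters); the tool's actual weights are not specified in this README, and `Vote`, `tally`, and `decide` are hypothetical names.

```python
from collections import Counter, defaultdict
from typing import Dict, List, NamedTuple


class Vote(NamedTuple):
    voter: str
    family: str   # model family, e.g. "claude" or "gpt"
    option: str


def tally(votes: List[Vote], family_cap: float = 1.0):
    """Return (raw, weighted) tallies; a family's combined weight is capped."""
    raw = Counter(v.option for v in votes)
    family_sizes = Counter(v.family for v in votes)
    weighted: Dict[str, float] = defaultdict(float)
    for v in votes:
        weighted[v.option] += min(1.0, family_cap / family_sizes[v.family])
    return dict(raw), dict(weighted)


def decide(votes: List[Vote], expected_voters: int, family_cap: float = 1.0) -> dict:
    """Pick a winner, or hand back to a human on a tie or missing votes."""
    raw, weighted = tally(votes, family_cap)
    notes = []
    if len({v.family for v in votes}) < 2:
        notes.append("single-family panel")
    result = {"winner": None, "raw": raw, "weighted": weighted, "notes": notes}
    if not votes or len(votes) < expected_voters:
        notes.append("missing votes: hand back to human")
        return result
    ranked = sorted(weighted.items(), key=lambda kv: kv[1], reverse=True)
    if len(ranked) > 1 and abs(ranked[0][1] - ranked[1][1]) < 1e-9:
        notes.append("tie: hand back to human")
        return result
    result["winner"] = ranked[0][0]
    return result
```

With three personas from one family voting A and a single voter from another family voting B, the raw tally is 3 to 1 but the weighted tally is level, so the sketch hands the decision back instead of reporting a cloned majority.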
What it guards against — 7 multi-agent failure modes
These traps are ones a naive multi-agent setup almost always falls into — and ones we hit ourselves while building this tool with its own adversarial process. Each guard is built into the design, and the design is dogfooded:
- Vote/finding loss: a challenger is truncated/timed-out/unparseable and the system silently aggregates a partial panel. Guard: machine-readable capture + per-voter integrity self-check; missing votes never approve or declare a majority.
- Option anchoring: the orchestrator only offers its own pre-picked options, so agents merely ratify the framing. Guard: deliberation always diverges (generate first, vote second); voters aren't fed the orchestrator's preferences.
- Premature hand-off: the orchestrator bounces the open decision back to the human mid-way instead of finishing the vote. Guard: close the loop and return a result; hand back only on a tie / missing votes.
- Majority over minority: out-voting a minority that has a reproducible blocker. Guard: two modes with split aggregation + the escape gate; adversarial mode bans voting and lets evidence beat headcount.
- Single-round complacency: one pass declared sufficient. Guard: `--deep` multi-round to convergence + adversarial review of the code itself before shipping.
- False consensus: same-model personas counted as independent votes, so one model's bias gets cloned into a "majority". Guard: per-`model_family` weight cap, raw/weighted both shown, single-family warning.
- False convergence: declaring "done" when no new objection appeared but an old blocker is still open. Guard: two-condition convergence (`new_surviving == 0` and `unresolved_required == 0`).
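Two of these guards reduce to small predicates. The sketch below encodes the two-condition convergence check and the rule that an incomplete panel never approves; the counter names `new_surviving` and `unresolved_required` come from the list above, while the function names and the way a panel is counted are assumptions.

```python
from typing import List, Optional, Tuple


def converged(new_surviving: int, unresolved_required: int) -> bool:
    """Both conditions must hold: nothing new survived AND nothing required is open."""
    return new_surviving == 0 and unresolved_required == 0


def first_converged_round(rounds: List[Tuple[int, int]]) -> Optional[int]:
    """Return the 1-based round at which convergence is first reached, if any.

    Each entry is (new_surviving, unresolved_required) for one review round.
    """
    for number, (new_surviving, unresolved_required) in enumerate(rounds, start=1):
        if converged(new_surviving, unresolved_required):
            return number
    return None


def panel_complete(expected: int, collected: int) -> bool:
    """A panel is complete only when every expected challenger reported."""
    return expected > 0 and collected >= expected


def may_approve(expected: int, collected: int,
                new_surviving: int, unresolved_required: int) -> bool:
    """Missing votes never approve, and neither does an unconverged review."""
    return panel_complete(expected, collected) and converged(new_surviving, unresolved_required)
```

A round with `(0, 1)` shows the false-convergence trap: no new objection appeared, yet an old required blocker is still open, so the review is not done.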
Backends
challenge-plans drives whatever subscription coding CLI you already have logged in — e.g. Claude Code (claude) or OpenAI Codex (codex). You don't need any specific one. With two different vendors it can cross-verify findings; with one, results stay advisory. No API keys, and no per-token API charges from this tool (doctor checks the CLIs are logged in, not your billing; usage still counts against your normal subscription limits).
Status
v1 — usable. Both modes work end-to-end, validated against a real spec and pinned by a pytest suite, hardened across multiple cross-agent adversarial-review rounds.
Known boundaries (also reflected in the run output): concern dedup is exact-anchor only; no idle-timeout (wall-clock only); deliberation blockers are flagged, not yet auto-verified by the Verifier; the open-decision divergence phase is the calling agent's job; manual_paste/Gemini adapters are follow-ups.
Testing
pip install -e ".[dev]" && pytest # pythonpath/testpaths preconfigured
The suite pins every invariant established across the adversarial-review rounds.
Contributing
Issues and PRs welcome — see CONTRIBUTING.md. The project is dogfooded: reviewing your own change with challenge-plans run <change>.diff --type diff before opening a PR is encouraged.
License