Multi-agent adversarial review & deliberation for plans/specs on subscription CLIs (reduce rework before execution)

challenge-plans

Chinese documentation: README-zh.md

Adversarially review your plan or spec before you execute it — across the coding CLIs you already have logged in. No API keys.

challenge-plans orchestrates the subscription AI coding CLIs already on your machine (Claude Code, Codex, …) to cross-examine a plan/spec and surface the flaws that cause rework downstream — and to vote across options when you're unsure. It also reviews a raw git diff as a lightweight code review pass, and drops in as an agent skill. It runs on your existing subscriptions, so there are no per-token API charges. It slots into the superpowers plan lifecycle: writing-plans → challenge-plans → executing-plans.

$ challenge-plans run plan.md --type spec --profile standard --sink markdown

# challenge-plans · challenge · verdict: request_changes
- panel: expected 3 / collected 3 · complete ✓
- diversity: 2 families
- verified: 3 high/critical reviewed by Verifier (✓ verified, may hard-gate; ? unverified, advisory)
- surviving objections: 4

- [high✓]   sensitive data sent to a third-party LLM with no privacy boundary  @L42-43  (security_or_privacy_boundary, by claude:scope-boundary)
- [high✓]   "schema-aligned" claimed but there's no contract test             @L12-30  (integration_contract_gap, by gpt:correctness)
- [high✓]   no measurable acceptance threshold                               @L1      (contract_violation, by preflight)
- [medium?] missing_fields vs null semantics left undefined                  @L32-34  (ambiguity_to_wrong_implementation)

Why challenge-plans

  • 🔑 No API keys, no per-token charges — it drives the subscription CLIs you're already logged into (Claude Code, Codex). Bring at least one.
  • 🧪 Evidence beats headcount — a minority objection with a reproduction can override a majority vote; correctness is not decided by voting.
  • 🤝 Cross-family verification: an objection only earns hard-gate authority (✓) when an independent model family reproduces it with concrete, line-anchored evidence. Single-model claims stay advisory.
  • 🛡️ Guards 7 known multi-agent failure modes — vote loss, option anchoring, premature hand-off, majority-over-minority, single-round complacency, false consensus, false convergence. Each was hit (and fixed) while building this tool with its own adversarial process.
  • 🌍 Reads in your language — the codebase is English, but --lang zh (or ja, de, fr, …) makes every reviewer write its findings in your language while JSON keys and line anchors stay machine-stable. One flag, no separate build — see Output in your language.

Quickstart

Requires Python ≥ 3.10 (PyYAML installs automatically). Bring at least one logged-in coding CLI — Claude Code (claude) or OpenAI Codex (codex); two different vendors unlock cross-family verification.

git clone https://github.com/hiadrianchen/challenge-plans && cd challenge-plans
pip install -e .                                                          # exposes the `challenge-plans` command
challenge-plans doctor                                                    # which backend CLIs are logged in
challenge-plans run examples/spec-sample.md --type spec --sink markdown   # see a verdict on the bundled sample

Hand the repo to your coding agent instead — "Install and set up challenge-plans from this repo, then run challenge-plans doctor" — and it'll do the above. To use it as an agent skill, drop SKILL.md where your agent discovers skills.

Use

challenge-plans doctor                                                                 # which backends are ready
challenge-plans run path/to/spec.md --type spec --profile standard --sink markdown     # harden a plan/spec
challenge-plans run change.diff --type diff --sink markdown                             # review a git diff
challenge-plans weigh path/to/options.yaml --profile standard --sink markdown           # vote across options
challenge-plans run path/to/spec.md --enforce                                           # CI gate: non-approve exits non-zero
challenge-plans run path/to/spec.md --type spec --sink markdown --lang zh                # findings written in Chinese
# not pip-installed? prefix with: PYTHONPATH=src python3 -m challenge_plans.cli ...
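
For CI pipelines driven from Python, the --enforce exit code can be consumed by a small wrapper. This is a minimal sketch, not part of the tool: the `gate` helper and its injectable `runner` parameter are hypothetical names. The only facts taken from this README are the command line itself and that a non-approve verdict exits non-zero under --enforce.

```python
import subprocess
from typing import Any, Callable, List


def gate(spec_path: str, runner: Callable[..., Any] = subprocess.run) -> bool:
    """Return True when the run exits 0, False when --enforce exits non-zero.

    `runner` is injectable (hypothetical design choice) so the wrapper can be
    exercised without the CLI installed.
    """
    cmd: List[str] = ["challenge-plans", "run", spec_path, "--enforce"]
    # check=False: a non-zero exit is an expected outcome here, not an error.
    proc = runner(cmd, check=False)
    return proc.returncode == 0
```

A CI step could call `gate("path/to/spec.md")` and fail the job when it returns False.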

Ready-to-run samples live in examples/ (spec-sample.md, options.yaml). options.yaml:

question: Refactor auth with approach A or B?
options:
  - id: A
    text: One-shot rewrite — concentrated risk, clean result
  - id: B
    text: Incremental migration — slower, every step reversible

Flags and output markers:

  • --profile fast|standard|deep, --sink stdout|markdown, --enforce (non-approve verdicts exit non-zero; without it, runs are advisory and exit 0).
  • --lang <code> writes the human-readable output in your language (default en) — see below.
  • [sev✓] = cross-family verified, may hard-gate; [sev?] = unverified, advisory only.
  • Artifact types: --type spec and --type diff are supported; plan / decision are reserved (rubric pending).
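
The finding lines in the markdown sink are regular enough to post-process. The sketch below parses one such line into a record and exposes the ✓ / ? marker as a boolean. It assumes the line layout matches the sample output shown near the top of this README; the regular expression, the `Finding` dataclass, and `parse_finding` are illustrative names, not an API the tool provides.

```python
import re
from dataclasses import dataclass
from typing import Optional

# Assumed layout, inferred from the sample output in this README:
#   - [high✓]   title text   @L42-43  (category, by source)
FINDING_RE = re.compile(
    r"^- \[(?P<severity>[a-z]+)(?P<mark>[✓?])\]\s+"
    r"(?P<title>.*?)\s+@L(?P<start>\d+)(?:-(?P<end>\d+))?\s+"
    r"\((?P<category>[a-z_]+)(?:, by (?P<source>[^)]+))?\)\s*$"
)


@dataclass
class Finding:
    severity: str
    verified: bool          # True for the ✓ marker (may hard-gate), False for ?
    title: str
    start: int
    end: int
    category: str
    source: Optional[str]


def parse_finding(line: str) -> Optional[Finding]:
    """Parse one finding line; return None for lines that are not findings."""
    m = FINDING_RE.match(line.strip())
    if m is None:
        return None
    start = int(m["start"])
    end = int(m["end"]) if m["end"] else start  # single-line anchor like @L1
    return Finding(m["severity"], m["mark"] == "✓", m["title"],
                   start, end, m["category"], m["source"])
```

Because severity, anchors, and category are machine-stable, a parser like this would keep working regardless of the language the title is written in.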

The bundled SKILL.md routes review/QA of a plan/spec to run automatically; option-voting is the weigh subcommand.

Output in your language

challenge-plans ships an English codebase, but the reviewers can answer in any language — just add --lang:

challenge-plans run plan.md --type spec --lang zh     # objections, evidence, reproductions in Chinese
challenge-plans weigh options.yaml --lang ja          # deliberation reasons in Japanese

--lang only switches the human-readable prose (steelman, titles, evidence, reproductions, vote reasons). JSON keys, enum values, and L12-15 line anchors stay verbatim, so parsing, dedup, and CI gates are unaffected. It's equivalent to exporting CHALLENGE_PLANS_LANG once. There's no separate translated build to maintain — the same English source localizes on demand.
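
One plausible way to picture the flag and the environment variable working together is sketched below. The precedence (explicit flag first, then CHALLENGE_PLANS_LANG, then the default en) is an assumption made for illustration; the README only states that the two are equivalent and that the default is en. `resolve_lang` is a hypothetical helper name.

```python
import os
from typing import Mapping, Optional


def resolve_lang(cli_lang: Optional[str],
                 env: Optional[Mapping[str, str]] = None) -> str:
    """Pick the output language for human-readable prose.

    Assumed precedence: --lang value, then CHALLENGE_PLANS_LANG, then "en".
    """
    if env is None:
        env = os.environ
    if cli_lang:
        return cli_lang
    return env.get("CHALLENGE_PLANS_LANG") or "en"
```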

As an agent skill: your agent just passes --lang <your-language> and the whole cross-review comes back localized. The bundled SKILL.md documents the flag so the calling agent can set it from the user's language automatically.

Two modes

challenge-plans isn't one feature — it's two modes on one engine. The calling agent routes by intent; the user never has to pick:

| | challenge (adversarial) | weigh-options (deliberation) |
| --- | --- | --- |
| When | You have a drafted plan/spec to poke holes in / harden | You have several options / a pile of to-dos and aren't sure which |
| Routing signal | a single drafted artifact + "review / find flaws / can this execute" | multiple candidates + "which one / rank these / is it worth it" |
| Aggregation | Evidence survival: a minority can be right, no majority vote | Weighted majority + exposed dissent: only genuine trade-offs get voted on |
| Output | 6-state verdict + surviving objections + reproductions / counter-evidence | ranked options + vote tally + strongest dissent |

The agent picks the mode so the user doesn't have to: it reads the intent and routes "review a drafted artifact" to adversarial mode and "choose among options" to deliberation, with deterministic routing signals defining the boundary. During deliberation, if an option is flagged with a mechanically verifiable blocker, the recommendation is downgraded to discuss and you're asked to verify it in challenge mode rather than adopting it outright, so a vote can never outweigh a falsifiable minority objection.
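
The downgrade rule can be sketched in a few lines. This is an illustration under stated assumptions, not the tool's code: `OptionResult`, `recommend`, the "adopt" label, and the choice to check the blocker only on the top-ranked option are all hypothetical. Only the "discuss" downgrade and the hand-off to challenge mode come from the description above.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class OptionResult:
    option_id: str
    weighted_votes: float
    has_verifiable_blocker: bool = False  # flagged during deliberation


def recommend(results: List[OptionResult]) -> Dict[str, str]:
    """Rank by weighted votes, but never adopt a leader with an open blocker."""
    if not results:
        raise ValueError("no options to rank")
    top = max(results, key=lambda r: r.weighted_votes)
    if top.has_verifiable_blocker:
        # The vote alone cannot override a falsifiable objection.
        return {"recommendation": "discuss",
                "option": top.option_id,
                "next_step": "verify the blocker in challenge mode"}
    return {"recommendation": "adopt", "option": top.option_id}
```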

How it works

Adversarial mode (reduce-rework loop):

drafted artifact + bounded context
  → multiple persona/CLI challengers each steelman → find flaws (bound to specific text, no hedging)
  → Verifier (cross-family) produces a minimal reproduction / contradicting source line
  → dedup by canonical key + evidence-survival
  → single verdict pipeline → 6-state verdict + panel-integrity check
  → (--profile deep: multi-round to two-condition convergence)
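
The dedup and evidence-survival steps of the pipeline above can be sketched as follows. Everything here is an assumption for illustration: the canonical key is taken to be (category, start line, end line), in line with the exact-anchor dedup noted under Status; "survives" is taken to mean "not refuted by counter-evidence"; and "verified" is taken to mean "reproduced by a different model family". The `Objection` fields and function names are hypothetical.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional, Tuple


@dataclass(frozen=True)
class Objection:
    category: str
    start: int
    end: int
    raised_by_family: str
    reproduced_by_family: Optional[str] = None  # set by the Verifier, if any
    refuted: bool = False                       # counter-evidence was found


def canonical_key(o: Objection) -> Tuple[str, int, int]:
    # Exact-anchor key: same category on the same line span is one concern.
    return (o.category, o.start, o.end)


def is_verified(o: Objection) -> bool:
    # Cross-family: the reproducing family must differ from the raising one.
    return (o.reproduced_by_family is not None
            and o.reproduced_by_family != o.raised_by_family)


def surviving(objections: List[Objection]) -> List[Objection]:
    """Drop refuted objections, then keep one per key, preferring verified."""
    merged: Dict[Tuple[str, int, int], Objection] = {}
    for o in objections:
        if o.refuted:
            continue
        key = canonical_key(o)
        kept = merged.get(key)
        if kept is None or (is_verified(o) and not is_verified(kept)):
            merged[key] = o
    return list(merged.values())
```

Note that headcount plays no role: a single verified objection survives no matter how many challengers did not raise it.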

Deliberation mode — the methodology is a strict three-phase flow. The weigh CLI implements phase ③ (it votes on the options you hand it); phases ①② are the calling agent's responsibility before invoking it — no shortcuts:

① align    (agent) share full background with every voter first — the question, constraints, known facts — don't pre-supply options
② collect  (agent) each voter independently, unseen by the others and not fed the orchestrator's preferences, generates candidates → dedup/cluster into an option pool
③ vote     `challenge-plans weigh` votes on that option pool (model_family-weighted to block false consensus) → ranking + tally + dissent
           hands back to a human only on a tie / missing votes; otherwise closes the loop and returns a result
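
Phase ③ can be illustrated with a small tally that caps each model family's total weight, so several personas of one model cannot form a majority by themselves. The capping scheme (each family's votes scaled so they sum to at most `family_cap`), the tuple layout, and the `tally` function are assumptions made for this sketch; the README states only that votes are model_family-weighted and that ties or missing votes go back to a human.

```python
from collections import Counter, defaultdict
from typing import Iterable, Tuple

Vote = Tuple[str, str, str]  # (voter_id, model_family, option_id)


def tally(votes: Iterable[Vote], expected_voters: int,
          family_cap: float = 1.0) -> dict:
    """Return raw and family-capped counts plus a hand-back flag."""
    votes = list(votes)
    raw = Counter(option for _, _, option in votes)

    per_family = defaultdict(Counter)
    for _, family, option in votes:
        per_family[family][option] += 1

    weighted = Counter()
    for counts in per_family.values():
        total = sum(counts.values())
        scale = min(1.0, family_cap / total)  # shrink over-represented families
        for option, n in counts.items():
            weighted[option] += n * scale

    ranking = sorted(weighted.items(), key=lambda kv: (-kv[1], kv[0]))
    tie = len(ranking) > 1 and abs(ranking[0][1] - ranking[1][1]) < 1e-9
    missing = max(0, expected_voters - len(votes))
    return {
        "raw": dict(raw),
        "weighted": dict(weighted),
        "ranking": [option for option, _ in ranking],
        "single_family_warning": len(per_family) == 1,
        "needs_human": tie or missing > 0 or not ranking,
    }
```

With three personas from one family voting A and one voter from another family voting B, the raw count is 3 to 1 but the capped weights are equal, so this sketch reports a tie and hands back to a human instead of declaring a majority.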

What it guards against — 7 multi-agent failure modes

These traps are ones a naive multi-agent setup almost always falls into — and ones we hit ourselves while building this tool with its own adversarial process. Each guard is built into the design, and the design is dogfooded:

  1. Vote/finding loss — a challenger is truncated/timed-out/unparseable and the system silently aggregates a partial panel. Guard: machine-readable capture + per-voter integrity self-check; missing votes never approve or declare a majority.
  2. Option anchoring — the orchestrator only offers its own pre-picked options, so agents merely ratify the framing. Guard: deliberation always diverges (generate first, vote second); voters aren't fed the orchestrator's preferences.
  3. Premature hand-off — the orchestrator bounces the open decision back to the human mid-way instead of finishing the vote. Guard: close the loop and return a result; hand back only on a tie / missing votes.
  4. Majority over minority — out-voting a minority that has a reproducible blocker. Guard: two modes with split aggregation + the escape gate; adversarial mode bans voting and lets evidence beat headcount.
  5. Single-round complacency: one pass declared sufficient. Guard: --profile deep multi-round to convergence + adversarial review of the code itself before shipping.
  6. False consensus — same-model personas counted as independent votes, so one model's bias gets cloned into a "majority". Guard: per-model_family weight cap, raw/weighted both shown, single-family warning.
  7. False convergence — declaring "done" when no new objection appeared but an old blocker is still open. Guard: two-condition convergence (new_surviving == 0 and unresolved_required == 0).
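
The two-condition convergence check from item 7 is small enough to write out. The predicate below uses the condition exactly as stated (new_surviving == 0 and unresolved_required == 0); the surrounding loop, the `max_rounds` limit, and the function names are hypothetical scaffolding for illustration.

```python
from typing import Optional, Sequence, Tuple


def converged(new_surviving: int, unresolved_required: int) -> bool:
    """Both conditions must hold; a quiet round with an open blocker is not done."""
    return new_surviving == 0 and unresolved_required == 0


def first_converged_round(history: Sequence[Tuple[int, int]],
                          max_rounds: int = 5) -> Optional[int]:
    """Return the 1-based round at which the review converged, else None.

    `history` holds (new_surviving, unresolved_required) per round.
    """
    for round_no, (new_surviving, unresolved_required) in enumerate(
            history[:max_rounds], start=1):
        if converged(new_surviving, unresolved_required):
            return round_no
    return None
```

For a history of (3, 2), (0, 1), (0, 0), round 2 raises no new objection but still has one required item open, so convergence is only reached in round 3.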

Backends

challenge-plans drives whatever subscription coding CLI you already have logged in — e.g. Claude Code (claude) or OpenAI Codex (codex). You don't need any specific one. With two different vendors it can cross-verify findings; with one, results stay advisory. No API keys, and no per-token API charges from this tool (doctor checks the CLIs are logged in, not your billing; usage still counts against your normal subscription limits).
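
A rough picture of backend discovery, assuming it starts by looking for the CLI executables on PATH. This sketch is much weaker than `challenge-plans doctor`: it only checks that `claude` or `codex` can be found, not that they are logged in. The executable-to-family mapping (family labels borrowed from the `claude:` and `gpt:` prefixes in the sample output) and the `detect_backends` helper are assumptions.

```python
import shutil
from typing import Callable, Dict, Optional

# Assumed mapping from executable name to model family label.
BACKENDS: Dict[str, str] = {"claude": "claude", "codex": "gpt"}


def detect_backends(
    which: Callable[[str], Optional[str]] = shutil.which,
) -> dict:
    """Report which backend executables are on PATH and whether two families exist."""
    found = {exe: family for exe, family in BACKENDS.items() if which(exe)}
    families = sorted(set(found.values()))
    return {
        "executables": sorted(found),
        "families": families,
        # Two distinct families are needed for cross-family verification.
        "cross_family": len(families) >= 2,
    }
```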

Status

v1 — usable. Both modes work end-to-end, validated against a real spec and pinned by a pytest suite, hardened across multiple cross-agent adversarial-review rounds.

Known boundaries (also reflected in the run output): concern dedup is exact-anchor only; no idle-timeout (wall-clock only); deliberation blockers are flagged, not yet auto-verified by the Verifier; the open-decision divergence phase is the calling agent's job; manual_paste/Gemini adapters are follow-ups.

Testing

pip install -e ".[dev]" && pytest      # pythonpath/testpaths preconfigured

The suite pins every invariant established across the adversarial-review rounds.

Contributing

Issues and PRs welcome — see CONTRIBUTING.md. The project is dogfooded: reviewing your own change with challenge-plans run <change>.diff --type diff before opening a PR is encouraged.

License

Apache-2.0.
