Skip to main content

Multi-model LLM deliberation with anonymized peer review

Project description

LLM Council

Python 3.10+ License: MIT CI Coverage

Multi-model LLM deliberation with anonymized peer review. Available as a Claude Code skill, an OpenClaw skill, or a standalone CLI.

What It Does

LLM Council queries multiple language models in parallel, has them anonymously rank each other's responses, and synthesizes a final answer from the top-ranked contributions. Each model evaluates shuffled, label-anonymized responses (Response A/B/C) so no model knows which peer produced which answer. A designated chairman then reads the aggregate rankings and writes the synthesis.

3-Stage Pipeline:

  1. Stage 1 — Independent responses: Every council model answers the question in parallel.
  2. Stage 2 — Anonymized peer ranking: Each model ranks the other responses using randomized anonymous labels and nonce-fenced XML, preventing favoritism.
  3. Stage 3 — Chairman synthesis: A chairman model receives the ranked results and produces the final consolidated answer.

Supported Providers:

Provider Access Streaming Token Usage Key Features
AWS Bedrock Any model in your Bedrock region (Anthropic Claude, Meta Llama, Mistral, etc.) Yes Real counts from API Extended thinking via budget_tokens, native AWS auth
Poe Any bot on Poe's API (OpenAI GPT, Google Gemini, xAI Grok, community bots, etc.) Yes Not provided by API Web search toggle, configurable reasoning effort
OpenRouter Hundreds of models via a single OpenAI-compatible API (OpenAI, Anthropic, Google, Meta, Mistral, and more) Yes (SSE) Real counts from API Standard temperature/max_tokens controls, model discovery via --list-models

Quickstart

Installation

# Clone the repository
git clone https://github.com/0ri/llm-council.git
cd llm-council

# Install with uv (recommended)
uv sync

# Or install with pip in editable mode
pip install -e .

Environment Variables

Create a .env file (or export directly) with the API keys for the providers you plan to use:

# Required for OpenRouter models (easiest way to get started — one key covers hundreds of models)
export OPENROUTER_API_KEY=your-openrouter-api-key

# Required for Poe models (GPT, Gemini, Grok, community bots)
export POE_API_KEY=your-poe-api-key

# Required for Bedrock models (uses standard AWS auth — configure via `aws configure` or env vars)
export AWS_ACCESS_KEY_ID=your-access-key
export AWS_SECRET_ACCESS_KEY=your-secret-key
export AWS_DEFAULT_REGION=us-east-1

You only need keys for the providers in your council config. The default config uses OpenRouter only, so OPENROUTER_API_KEY is enough to get started.

Path A: Skill Usage (Claude Code)

If you use Claude Code, the fastest path is the /council slash command:

# 1. Install the project
uv sync

# 2. Set your API key
export OPENROUTER_API_KEY=your-key

# 3. Ask the council
/council "What's the best approach for building a REST API?"

The skill files live in .claude/commands/council.md and .claude/skills/council/. See the Skill Usage section for full setup details.

Path B: Direct CLI Usage

# 1. Install the project
uv sync

# 2. Set your API key
export OPENROUTER_API_KEY=your-key

# 3. Run a full 3-stage council deliberation
llm-council "What are the tradeoffs between REST and GraphQL?"

# 4. (Optional) Preview what would run without making API calls
llm-council --dry-run "test question"

# 5. (Optional) Stream the chairman's synthesis in real time
llm-council --stream "Explain the CAP theorem"

The default config uses four OpenRouter models (Claude Opus 4.6, GPT-5.3-Codex, Gemini-3.1-Pro, Grok 4) with Gemini-3.1-Pro as chairman. To customize, create a .claude/council-config.json — see Configuration Reference.

Skill Usage (Claude Code & OpenClaw)

The council is designed as a skill first — most users interact with it through a /council slash command rather than the CLI directly. Two skill interfaces are supported: Claude Code and OpenClaw.

Claude Code Setup

The Claude Code skill lives in two locations within this repository:

  • Slash command: .claude/commands/council.md — defines the /council command, argument hints, and execution instructions
  • Skill package: .claude/skills/council/ — contains the plugin manifest, a mirrored command file, and the runner script

To install the skill into your own workspace:

# Option 1: Symlink (recommended — stays in sync with upstream)
ln -s /path/to/llm-council/.claude/commands/council.md  your-project/.claude/commands/council.md
ln -s /path/to/llm-council/.claude/skills/council/       your-project/.claude/skills/council

# Option 2: Copy
cp .claude/commands/council.md  your-project/.claude/commands/council.md
cp -r .claude/skills/council/   your-project/.claude/skills/council/

Set the required environment variables (only the providers you use):

export OPENROUTER_API_KEY=your-openrouter-key   # easiest — one key covers hundreds of models
export POE_API_KEY=your-poe-key                  # for Poe models (GPT, Gemini, Grok)
# Bedrock uses standard AWS auth (aws configure or AWS_* env vars)

Then invoke from any Claude Code conversation:

/council "What are the tradeoffs between microservices and monoliths?"

OpenClaw Setup

The OpenClaw-compatible skill package lives in skills/council/ with a SKILL.md manifest.

Install via ClawHub or manually:

# Option 1: ClawHub
clawhub install council

# Option 2: Manual copy
cp -r skills/council/ your-project/skills/council/

The SKILL.md front-matter declares everything OpenClaw needs:

Field Purpose
name Skill identifier (council)
description Human-readable summary
user-invocable true — appears in the user's skill list
metadata.openclaw.requires.bins Runtime dependencies (uv, python3)
metadata.openclaw.requires.env Required env vars (POE_API_KEY)
metadata.openclaw.primaryEnv The key OpenClaw prompts for during install
metadata.openclaw.install Auto-install steps (e.g., brew install uv)

Path references in SKILL.md use {baseDir}, which OpenClaw resolves to the skill's install directory at runtime. For example:

uv run {baseDir}/scripts/council.py --config {baseDir}/config/council-config.json "your question"

API keys are injected via OpenClaw's skill configuration — set POE_API_KEY (and optionally OPENROUTER_API_KEY or AWS credentials) in your OpenClaw skill config, and they'll be available to the council script automatically.

Interactive Configuration (--config)

Both skill interfaces support an interactive configuration mode:

/council --config

This walks you through:

  1. Selecting which models to include in the council (multi-select across Bedrock, Poe, and OpenRouter)
  2. Choosing a chairman model from the selected council members
  3. Setting enhanced parameters per model (e.g., budget_tokens for Bedrock, web_search/reasoning_effort for Poe)

The configuration is saved to the appropriate config file:

  • Claude Code: .claude/council-config.json
  • OpenClaw: {baseDir}/config/council-config.json (i.e., skills/council/config/council-config.json)

Shared Script Architecture

Both skill interfaces use the same underlying runner script — a thin wrapper that delegates to the installed llm_council package:

Path Used by
.claude/skills/council/scripts/council.py Claude Code
skills/council/scripts/council.py OpenClaw

These are identical files. The script uses PEP 723 inline metadata so uv run can resolve the llm-council dependency automatically:

#!/usr/bin/env -S uv run --script
# /// script
# requires-python = ">=3.10"
# dependencies = ["llm-council"]
# ///

Config files can be symlinked between the two skill directories if you want a single source of truth. The actual council logic lives in src/llm_council/ — the skill scripts just import and call llm_council.cli.main().

Examples

Basic question via /council:

> /council "Explain the CAP theorem and which tradeoff is best for a chat application"

## LLM Council Response

### Model Rankings (by peer review)

| Rank | Model          | Avg Position | 95% CI       | Borda Score |
|------|----------------|--------------|--------------|-------------|
| 1    | Claude Opus 4.6 | 1.33        | [1.0, 1.67]  | 2.67        |
| 2    | GPT-5.3-Codex  | 2.0          | [1.33, 2.67] | 2.0         |
| 3    | Gemini-3.1-Pro | 2.67         | [2.0, 3.33]  | 1.33        |
| 4    | Grok 4         | 3.67         | [3.33, 4.0]  | 0.33        |

*Rankings based on 3/3 valid ballots (anonymous peer evaluation)*

---

### Synthesized Answer

**Chairman:** Gemini-3.1-Pro

The CAP theorem states that a distributed system can guarantee at most two of
three properties: Consistency, Availability, and Partition tolerance...

Interactive configuration session via /council --config:

> /council --config

Which models should be in the council?
  [x] Claude Opus 4.6 (Bedrock)
  [x] GPT-5.3-Codex (Poe)
  [x] Gemini-3.1-Pro (Poe)
  [ ] Grok-4 (Poe)

Which model should be chairman?
  > Gemini-3.1-Pro

Configure enhanced parameters?
  Claude Opus 4.6 → budget_tokens: 10000
  GPT-5.3-Codex   → web_search: true, reasoning_effort: high
  Gemini-3.1-Pro   → web_search: true, reasoning_effort: high

✓ Configuration saved to .claude/council-config.json

For the full command reference and execution details, see the actual skill files:

Architecture

Pipeline Overview

graph LR
    Q["User Question"]

    subgraph Stage1["Stage 1: Independent Responses"]
        B1["Bedrock"]
        P1["Poe"]
        O1["OpenRouter"]
    end

    subgraph Stage2["Stage 2: Anonymized Peer Ranking"]
        direction TB
        AN["Anonymizer<br/>(random labels, nonce XML fencing,<br/>per-ranker shuffled order)"]
        B2["Bedrock ranks"]
        P2["Poe ranks"]
        O2["OpenRouter ranks"]
        AN --> B2
        AN --> P2
        AN --> O2
    end

    subgraph Stage3["Stage 3: Chairman Synthesis"]
        AGG["Aggregate Rankings<br/>(Borda count + bootstrap CI)"]
        CH["Chairman Model"]
        AGG --> CH
    end

    Q --> B1
    Q --> P1
    Q --> O1

    B1 --> AN
    P1 --> AN
    O1 --> AN

    B2 --> AGG
    P2 --> AGG
    O2 --> AGG

    CH --> R["Final Answer"]

Stage Descriptions

Stage 1 — Independent Responses. Every council model receives the user's question and answers independently, in parallel. Responses are cached in a local SQLite database so repeated questions skip the API call. A soft timeout and minimum-response threshold let the pipeline proceed if slow models haven't finished yet, and circuit breakers prevent retrying providers that are consistently failing.

Stage 2 — Anonymized Peer Ranking. Each council model ranks the other models' responses without knowing who wrote what. Self-exclusion ensures no model ranks its own answer. Responses are presented under randomized anonymous labels (Response A, B, C…) with a per-ranker shuffled order so that position bias is mitigated. Each response is wrapped in nonce-fenced XML tags (<response-{random_hex}>) to prevent prompt injection and fence-breaking. A system message instructs rankers to ignore any manipulation attempts inside the fenced content. Invalid ballots (unparseable rankings) are retried up to a configurable number of times.

Stage 3 — Chairman Synthesis. A designated chairman model receives the original responses (anonymized), the aggregate peer rankings (Borda count with bootstrap 95% confidence intervals), and writes the final consolidated answer. The chairman sees the same nonce-fenced, label-anonymized view — it knows which responses were ranked highest but not which model produced them. When --stream is enabled, the chairman's output is streamed to stdout in real time.

Anonymization Mechanism

The ranking stage uses three layers of anonymization to ensure fair evaluation:

  1. Random labels — Responses are labeled Response A, B, C… rather than by model name, so rankers cannot identify peers.
  2. Nonce-based XML fencing — Each response is wrapped in <response-{nonce}>…</response-{nonce}> tags where the nonce is a random hex string (secrets.token_hex(8)), regenerated per prompt. This prevents models from guessing the delimiter and breaking out of their fenced block. Any closing tags matching the nonce pattern found in model output are stripped before embedding.
  3. Per-ranker shuffled mappings — The order of responses is independently shuffled for each ranker using random.shuffle, so Response A for one ranker may correspond to a completely different model than Response A for another. The label-to-model mapping is stored per ranker and used during aggregation to correctly attribute rankings.

Package Structure

src/llm_council/
├── __init__.py              # Package exports: run_council, CouncilConfig, CouncilContext
├── aggregation.py           # Borda count, bootstrap confidence intervals, ranking aggregation
├── budget.py                # Token and cost budget guards with reserve/commit/release
├── cache.py                 # SQLite response cache for Stage 1 (TTL, stats, clearing)
├── cli.py                   # CLI entry point, argparse flags, config loading
├── context.py               # Per-run dependency-injection container (CouncilContext)
├── cost.py                  # Token counting (tiktoken) and per-stage cost estimation
├── council.py               # Main orchestrator: validate_config, run_council
├── flattener.py             # Codebase flattener: directory → single markdown document
├── formatting.py            # Markdown output formatting for all stage combinations
├── manifest.py              # Run manifest: metadata, timestamps, config hash
├── models.py                # Pydantic config models (CouncilConfig, provider configs, result types)
├── parsing.py               # Ranking parser: JSON/text extraction from model output
├── persistence.py           # JSONL run logger for session persistence
├── progress.py              # Real-time progress display (Rich TTY / plain non-TTY)
├── prompts.py               # Prompt templates for ranking and synthesis stages
├── security.py              # Input sanitization, injection detection, nonce fencing, output redaction
├── stages.py                # 3-stage pipeline logic: collect, rank, synthesize
└── providers/
    ├── __init__.py           # Provider/StreamingProvider protocols, timeout constants, registry
    ├── bedrock.py            # AWS Bedrock provider (Converse API, extended thinking, streaming)
    ├── openrouter.py         # OpenRouter provider (OpenAI-compatible API, SSE streaming)
    └── poe.py                # Poe provider (fastapi_poe, web search, reasoning effort)

CLI Reference

LLM Council provides two CLI entry points:

  • llm-council — the main deliberation pipeline
  • flatten-project — standalone codebase flattener

llm-council

llm-council [OPTIONS] [QUESTION]

Multi-model LLM deliberation with anonymized peer review. The question can be passed as a positional argument, read from a file with --question-file, or piped via --flatten.

Flags

Flag Description Default Example
question Positional argument: the question to ask the council (required unless --question-file, --list-models, --clear-cache, or --cache-stats is used) llm-council "What is the best sorting algorithm?"
--config PATH Path to a council-config.json file Auto-discovered (see Configuration Reference) llm-council --config ./my-config.json "question"
-v, --verbose Enable verbose (DEBUG-level) logging to stderr False llm-council -v "question"
--manifest Print the run manifest as JSON to stderr after completion False llm-council --manifest "question"
--log-dir DIR Write JSONL run logs to the specified directory (disabled) llm-council --log-dir ./logs "question"
--stage {1,2,3} Maximum pipeline stage to run (1 = responses only, 2 = responses + rankings, 3 = full run) 3 llm-council --stage 2 "question"
--dry-run Preview the configuration, model list, estimated API calls, and budget limits without making any API calls False llm-council --dry-run "question"
--list-models List available models from all configured providers (Bedrock, Poe, OpenRouter) and exit False llm-council --list-models
--flatten PATH Flatten a directory into a single markdown document and prepend it to the question as <project>…</project> context (disabled) llm-council --flatten ./src "Explain the architecture"
--codemap When used with --flatten, extract only structural skeletons (function/class signatures) instead of full file contents False llm-council --flatten ./src --codemap "Summarize the API"
--question-file FILE Read the question text from a file instead of the positional argument (disabled) llm-council --question-file prompt.txt
--seed INT Seed for reproducible bootstrap confidence intervals in ranking aggregation (random) llm-council --seed 42 "question"
--no-cache Disable the local SQLite response cache entirely for this run False llm-council --no-cache "question"
--cache-ttl SECONDS Override the cache TTL (time-to-live) in seconds. 0 bypasses cache reads but still writes new entries Config cache_ttl or 86400 (24 h) llm-council --cache-ttl 3600 "question"
--clear-cache Delete all entries from the response cache and exit False llm-council --clear-cache
--cache-stats Print cache statistics (total entries, expired entries, database size) and exit False llm-council --cache-stats
--stream Stream the Stage 3 chairman synthesis to stdout in real time instead of printing the full result at the end False llm-council --stream "question"

Usage Examples

Full 3-stage run — ask a question and get the complete deliberation (responses → rankings → synthesis):

llm-council "What are the trade-offs between microservices and monoliths?"

Stage-limited run — collect only Stage 1 responses (no ranking or synthesis):

llm-council --stage 1 "Explain the CAP theorem"

Run through Stage 2 (responses + rankings, no chairman synthesis):

llm-council --stage 2 "Compare REST vs GraphQL"

Dry run — preview the configuration and estimated API calls without spending tokens:

llm-council --dry-run "What is the best database for time-series data?"

Flatten a codebase + query — prepend a flattened project directory as context:

llm-council --flatten ./my-project "Review this codebase for security issues"

Use --codemap to send only signatures instead of full file contents (saves tokens):

llm-council --flatten ./my-project --codemap "Summarize the public API"

Streaming output — stream the chairman's synthesis to the terminal as it's generated:

llm-council --stream "Write a Python async HTTP client with retry logic"

Combined flags — verbose logging, custom config, session persistence, and streaming:

llm-council -v \
  --config ./council-config.json \
  --log-dir ./session-logs \
  --manifest \
  --stream \
  "Design a rate limiter for a REST API"

flatten-project

flatten-project [OPTIONS] PATH [PATH ...]

Standalone CLI for flattening one or more directories into a single markdown document suitable for LLM context windows. Output is written to stdout; metadata (file count, character count, estimated tokens) is printed to stderr.

Flags

Flag Description Default Example
PATH One or more directory paths to flatten (positional, required) (required) flatten-project ./src
--no-gitignore Ignore .gitignore rules and include all files False (.gitignore is respected) flatten-project --no-gitignore ./src
--codemap Extract only structural skeletons (function/class signatures, imports) instead of full file contents. Uses AST parsing for Python and heuristic pattern matching for other languages False flatten-project --codemap ./src
--max-file-size BYTES Skip files larger than this size in bytes 100000 (100 KB) flatten-project --max-file-size 500000 ./src

Usage Examples

Flatten a project and pipe to a file:

flatten-project ./my-project > context.md

Generate a codemap (signatures only) for a large codebase:

flatten-project --codemap ./src > codemap.md

Flatten multiple directories, including all files regardless of .gitignore:

flatten-project --no-gitignore ./src ./tests > full-dump.md

Configuration Reference

LLM Council is configured via a council-config.json file. The config controls which models participate in the council, who serves as chairman, budget limits, caching behavior, and resilience settings.

Config File Search Order

When no --config flag is provided, the CLI searches for a config file in this order:

  1. <CWD>/.claude/council-config.json — project-local config
  2. ~/.claude/council-config.json — user-global config
  3. Built-in default — an all-OpenRouter config ships with the package

The first file found wins. Use --config path/to/config.json to override the search entirely.

Top-Level Schema

Field Type Default Description
council_models list[ModelConfig] (required, min 1) List of models that participate in Stage 1 (responses) and Stage 2 (rankings). Each entry is a provider-specific model config object.
chairman ModelConfig (required) The model that performs Stage 3 synthesis. Can be one of the council models or a separate model.
budget object {} (no limits) Optional budget controls. See Budget Fields below.
cache_ttl int 86400 (24 hours) Time-to-live in seconds for cached Stage 1 responses. Overridden by --cache-ttl CLI flag.
soft_timeout float 300 (5 minutes) Seconds to wait for parallel Stage 1 queries before proceeding with partial results (if min_responses is satisfied).
min_responses int Number of council_models Minimum number of Stage 1 responses required before the soft timeout can trigger early completion. Defaults to all models.
stage2_retries int 1 Maximum retry rounds for invalid Stage 2 ballots. Set to 0 to disable retries.

Provider-Specific Model Fields

Every model config object requires name (a display label) and provider (one of bedrock, poe, openrouter). The remaining fields depend on the provider.

Bedrock

Field Type Required Default Description
name str yes Display name for this model
provider "bedrock" yes Must be "bedrock"
model_id str yes AWS Bedrock model identifier (e.g. "anthropic.claude-3-5-sonnet-20241022-v2:0")
budget_tokens int | null no null Max tokens for Bedrock's budget mode. Must be between 1024 and 128000 if set.

Poe

Field Type Required Default Description
name str yes Display name for this model
provider "poe" yes Must be "poe"
bot_name str yes Poe bot identifier (e.g. "Claude-3.5-Sonnet", "GPT-4o")
web_search bool no false Enable Poe's web search augmentation
reasoning_effort str | null no null Reasoning effort level. One of: "minimal", "low", "medium", "high", "Xhigh"

OpenRouter

Field Type Required Default Description
name str yes Display name for this model
provider "openrouter" yes Must be "openrouter"
model_id str yes OpenRouter model identifier (e.g. "anthropic/claude-sonnet-4", "openai/gpt-4o")
temperature float | null no null Sampling temperature. Provider default if unset.
max_tokens int | null no null Maximum output tokens. Provider default if unset.

Budget Fields

The optional budget object controls cost and token limits. If omitted or empty, no limits are enforced.

Field Type Default Description
max_tokens int (no limit) Maximum total tokens (input + output) across all stages. Must be a positive integer.
max_cost_usd float (no limit) Maximum total estimated cost in USD across all stages. Must be a positive number.
input_cost_per_1k float 0.01 Cost per 1,000 input tokens (used for budget estimation).
output_cost_per_1k float 0.03 Cost per 1,000 output tokens (used for budget estimation).

The budget system uses a reserve/commit/release mechanism: tokens are reserved before each query, committed on success (adjusted to actual usage), and released on failure.

Example Configurations

All-OpenRouter

A simple setup using only OpenRouter — requires a single OPENROUTER_API_KEY:

{
  "council_models": [
    {
      "name": "Claude Sonnet 4",
      "provider": "openrouter",
      "model_id": "anthropic/claude-sonnet-4",
      "temperature": 0.7
    },
    {
      "name": "GPT-4o",
      "provider": "openrouter",
      "model_id": "openai/gpt-4o",
      "temperature": 0.7,
      "max_tokens": 4096
    },
    {
      "name": "Gemini 2.5 Pro",
      "provider": "openrouter",
      "model_id": "google/gemini-2.5-pro-preview",
      "max_tokens": 8192
    }
  ],
  "chairman": {
    "name": "Claude Sonnet 4",
    "provider": "openrouter",
    "model_id": "anthropic/claude-sonnet-4"
  }
}

Mixed Bedrock + Poe

Combines AWS Bedrock and Poe providers — requires AWS credentials and POE_API_KEY:

{
  "council_models": [
    {
      "name": "Claude via Bedrock",
      "provider": "bedrock",
      "model_id": "anthropic.claude-3-5-sonnet-20241022-v2:0",
      "budget_tokens": 4096
    },
    {
      "name": "GPT-4o via Poe",
      "provider": "poe",
      "bot_name": "GPT-4o",
      "web_search": true
    },
    {
      "name": "Claude via Poe",
      "provider": "poe",
      "bot_name": "Claude-3.5-Sonnet",
      "reasoning_effort": "high"
    }
  ],
  "chairman": {
    "name": "Claude via Bedrock",
    "provider": "bedrock",
    "model_id": "anthropic.claude-3-5-sonnet-20241022-v2:0"
  },
  "cache_ttl": 3600,
  "soft_timeout": 120
}

Config with Budget Limits

An OpenRouter setup with strict budget controls and resilience tuning:

{
  "council_models": [
    {
      "name": "Claude Sonnet 4",
      "provider": "openrouter",
      "model_id": "anthropic/claude-sonnet-4",
      "max_tokens": 2048
    },
    {
      "name": "GPT-4o",
      "provider": "openrouter",
      "model_id": "openai/gpt-4o",
      "max_tokens": 2048
    },
    {
      "name": "Gemini 2.5 Flash",
      "provider": "openrouter",
      "model_id": "google/gemini-2.5-flash-preview",
      "max_tokens": 2048
    }
  ],
  "chairman": {
    "name": "Claude Sonnet 4",
    "provider": "openrouter",
    "model_id": "anthropic/claude-sonnet-4"
  },
  "budget": {
    "max_tokens": 50000,
    "max_cost_usd": 0.50,
    "input_cost_per_1k": 0.003,
    "output_cost_per_1k": 0.015
  },
  "cache_ttl": 43200,
  "soft_timeout": 60,
  "min_responses": 2,
  "stage2_retries": 2
}

Features

Caching

LLM Council caches Stage 1 responses in a local SQLite database to avoid redundant API calls. The cache is stored at ~/.llm-council/cache.db by default.

Each cache entry is keyed by a SHA-256 hash of the question, model name, and model ID. Entries expire after a configurable TTL (default: 24 hours / 86400 seconds). Expired entries are cleaned up automatically on startup.

Configuration:

Set the TTL in your council-config.json:

{
  "cache_ttl": 3600
}

CLI flags:

Flag Description
--no-cache Bypass the cache entirely for this run
--cache-ttl SECONDS Override the configured TTL for this run
--clear-cache Delete all cached responses and exit
--cache-stats Print cache statistics (total entries, expired entries) and exit

Example — check cache stats then run without cache:

# See how many responses are cached
llm-council --cache-stats

# Run a query bypassing the cache
llm-council --no-cache "What is the best sorting algorithm?"

# Clear all cached responses
llm-council --clear-cache

Streaming

The --stream flag enables real-time streaming of the Stage 3 chairman synthesis to stdout. Instead of waiting for the full synthesis to complete, text chunks are printed as they arrive from the provider.

llm-council --stream "Explain the CAP theorem"

Streaming uses the StreamingProvider protocol when available. Providers that implement astream() deliver chunks natively; non-streaming providers fall back to a single-chunk wrapper that buffers the full response and yields it at once.

If streaming encounters an error, it falls back to the standard query_model path automatically.

Stages 1 and 2 always run in non-streaming mode — only the final synthesis is streamed.

Budget Controls

Budget controls prevent runaway costs by enforcing token and dollar limits across a council run. Configure them in the budget section of your config:

{
  "budget": {
    "max_tokens": 50000,
    "max_cost_usd": 0.50,
    "input_cost_per_1k": 0.003,
    "output_cost_per_1k": 0.015
  }
}
Field Description Default
max_tokens Maximum total tokens (input + output) across all stages No limit
max_cost_usd Maximum estimated cost in USD No limit
input_cost_per_1k Cost per 1,000 input tokens for estimation 0.01
output_cost_per_1k Cost per 1,000 output tokens for estimation 0.03

Reserve / Commit / Release mechanism:

Budget enforcement uses an atomic reservation pattern to handle concurrent model queries safely:

  1. Reserve — Before each query, estimated tokens are deducted from the budget. If the projected total would exceed limits, a BudgetExceededError is raised and the query is skipped.
  2. Commit — After a successful query, the reservation is adjusted to reflect actual token usage reported by the provider.
  3. Release — If a query fails, the reservation is returned to the budget so other models can use it.

This ensures concurrent queries don't collectively overshoot the budget. When a model is skipped due to budget limits, the council continues with the remaining models.

Cost Tracking

Every council run tracks token usage per model and per stage. After the run completes, a summary is printed:

--- Token Usage ---
  Stage 1: ~3,200 in + ~4,800 out = ~8,000 tokens
    Claude Sonnet 4: ~1,100 in, ~1,600 out
    GPT-4o: ~1,050 in, ~1,500 out
    Gemini 2.5 Flash: ~1,050 in, ~1,700 out
  Stage 2: ~6,400 in + ~1,200 out = ~7,600 tokens
    Claude Sonnet 4: ~2,100 in, ~400 out
    GPT-4o: ~2,100 in, ~400 out
    Gemini 2.5 Flash: ~2,200 in, ~400 out
  Stage 3: ~2,800 in + ~2,000 out = ~4,800 tokens
    Claude Sonnet 4: ~2,800 in, ~2,000 out
  Total: ~12,400 in + ~8,000 out = ~20,400 tokens
  (~ indicates estimated tokens, actual counts used where available)
---

Token counts prefixed with ~ are estimates based on character count (using tiktoken's cl100k_base encoding when available, or a 4-chars-per-token heuristic). When a provider returns actual token counts in its API response, those are used instead and displayed without the ~ prefix.

The cost tracker records both estimated and actual counts for each model interaction, so you can compare projected vs real usage.

Session Persistence

Use --log-dir to save a complete record of a council run as a JSONL file:

llm-council --log-dir ./logs "What is the best programming language?"

Each run produces a file named {run_id}.jsonl in the specified directory. The file contains one JSON object per line, with the following record types:

Record Type Fields
config question, config (full council config)
stage1_response model, response, token_usage
stage2_ranking model, ranking_text, parsed_ranking, is_valid_ballot, label_mapping, token_usage
stage3_synthesis model, response, token_usage
aggregation rankings (model, average_rank, borda_score, rankings_count), valid_ballots, total_ballots
summary cost_summary, elapsed_seconds

Every record includes run_id and timestamp fields. Sensitive data (API keys, tokens) is automatically redacted before writing.

Combine with --manifest to append a run manifest comment block to the output, recording run metadata (run ID, timestamp, models used, chairman, stage counts, elapsed time, token estimates, config hash).

Flattener

The flattener serializes a project directory into a single markdown document suitable for LLM context windows. It's available as both a CLI flag and a standalone command.

As a CLI flag:

# Flatten the current directory and ask a question about it
llm-council --flatten . "How is error handling done in this project?"

# Use codemap mode for a structural overview (signatures only)
llm-council --flatten . --codemap "What are the main classes?"

As a standalone command:

# Flatten a directory to stdout
flatten-project ./src

# Codemap mode — extract only signatures, imports, and class/function definitions
flatten-project --codemap ./src

# Skip gitignore rules
flatten-project --no-gitignore ./src

# Set max file size (default: 100KB)
flatten-project --max-file-size 200000 ./src

Features:

  • Binary detection — Files with known binary extensions (images, archives, executables, fonts, databases, etc.) are automatically skipped. MIME type detection is used as a fallback for unknown extensions.
  • Gitignore support.gitignore patterns are respected by default (requires the pathspec package). Use --no-gitignore to include all files.
  • Directory filtering — Common non-source directories (.git, __pycache__, node_modules, .venv, dist, build, etc.) are always skipped.
  • File filtering — Lock files, minified assets, .env files, credentials, and council output files are skipped.
  • Python skeleton extraction — In --codemap mode, Python files are parsed via AST to extract imports, class definitions, function signatures, and docstrings. Non-Python files use heuristic pattern matching for structural extraction.
  • Token estimation — The output includes an estimated token count (using tiktoken when available, or a character-based heuristic). This is printed to stderr:
    # 42 files, 128,350 chars, ~32,088 tokens
    

Security

LLM Council includes multiple layers of security hardening to protect against prompt injection and data leakage.

Input sanitization (sanitize_user_input):

  • Strips control characters (preserving newlines and tabs)
  • Truncates input exceeding the maximum length (default: 500,000 characters)
  • Detects and logs potential prompt injection patterns (e.g., "ignore previous instructions", role markers, model delimiters) without blocking — legitimate use cases are preserved

Prompt injection defense (nonce-based XML fencing):

  • Model responses are wrapped in randomized XML delimiters: <response-{nonce}>...</response-{nonce}> where the nonce is a cryptographically random hex string
  • Each ranking stage uses a fresh nonce, making it infeasible for a model to guess and break out of its fence
  • A manipulation resistance system message instructs rankers to ignore any instructions embedded in responses

Output sanitization (sanitize_model_output):

  • Strips any closing XML tags matching the nonce pattern from model output, preventing fence-breaking attempts
  • Generic response-tag patterns are also removed as a defense-in-depth measure
  • Detected attempts are replaced with [FENCE_BREAK_ATTEMPT_REMOVED]

Sensitive data redaction (redact_sensitive):

  • API keys (OpenAI, Poe, AWS, Google) are redacted in log output
  • Bearer tokens, authorization headers, JWTs, and long hex strings are replaced with [REDACTED_*] placeholders
  • Applied automatically to all JSONL persistence records

Circuit Breaker and Retry

Each model gets its own circuit breaker, keyed by provider and model identifier (e.g., openrouter:anthropic/claude-sonnet-4 or poe:Claude-3.5-Sonnet).

Circuit breaker behavior:

Parameter Default Description
Failure threshold 3 Consecutive failures before the circuit opens
Cooldown 60 seconds Time before a half-open retry is allowed

The circuit breaker follows a standard closed → open → half-open pattern:

  • Closed (normal) — Requests pass through. Failures increment a counter.
  • Open (rejecting) — After 3 consecutive failures, the circuit opens. All requests to that model are immediately skipped with a warning.
  • Half-open (probing) — After the 60-second cooldown, one request is allowed through. Success closes the circuit; failure reopens it.

Retry and graceful degradation:

  • Individual model queries are wrapped with asyncio.wait_for using a configurable timeout (default: 360 seconds).
  • Timeouts and exceptions are caught per-model — a single model failure doesn't abort the run.
  • The min_responses config field (default: all models) sets the minimum number of successful Stage 1 responses needed to proceed. Combined with soft_timeout, the council can move forward with partial results if slow models haven't responded.
  • Budget reservations are released on failure, freeing capacity for remaining models.
  • Streaming queries fall back to the standard query path if the stream encounters an error.

Output Formats

LLM Council output varies depending on the --stage flag. A full 3-stage run produces rankings, ballot validity, and a chairman synthesis. Stage-limited runs produce subsets of this output. Full runs also include a run manifest comment block at the end.

Full 3-Stage Output (default)

A complete council run (--stage 3 or no --stage flag) produces a rankings table, ballot validity indicator, and chairman synthesis:

## LLM Council Response

### Model Rankings (by peer review)

| Rank | Model | Avg Position | 95% CI | Borda Score |
|------|-------|--------------|--------|-------------|
| 1 | Gemini-2.5-Pro | 1.3 | [1.0, 1.7] | 8 |
| 2 | Claude-Sonnet-4 | 1.8 | [1.2, 2.4] | 7 |
| 3 | GPT-4.1 | 2.5 | [2.0, 3.0] | 5 |
| 4 | Grok-3 | 3.4 | [2.8, 4.0] | 2 |

*Rankings based on 4/4 valid ballots (anonymous peer evaluation)*

---

### Synthesized Answer

**Chairman:** Gemini-2.5-Pro

The council reached broad agreement that [...chairman's synthesis of the top-ranked responses...]

The rankings table includes:

  • Rank — Position based on average rank across all ballots
  • Model — The model's display name from the config
  • Avg Position — Mean rank across all valid ballots (lower is better)
  • 95% CI — Confidence interval for the average position
  • Borda Score — Borda count score (higher is better), used as a tiebreaker

The ballot validity line shows how many ranking ballots were successfully parsed. When all ballots are valid, it reads "anonymous peer evaluation". When some fail to parse, it notes "some rankings could not be parsed reliably".

Stage 1 Only (--stage 1)

Running with --stage 1 collects individual model responses without ranking or synthesis:

## LLM Council Response (Stage 1 only)

### Gemini-2.5-Pro

[Gemini's full response to the question...]

### Claude-Sonnet-4

[Claude's full response to the question...]

### GPT-4.1

[GPT's full response to the question...]

### Grok-3

[Grok's full response to the question...]

Each model's response is printed under its own heading. No rankings or synthesis are performed.

Stage 1+2 Output (--stage 2)

Running with --stage 2 collects responses and performs anonymous peer ranking, but skips the chairman synthesis:

## LLM Council Response (Stages 1-2, no synthesis)

### Model Rankings (by peer review)

| Rank | Model | Avg Position | 95% CI | Borda Score |
|------|-------|--------------|--------|-------------|
| 1 | Gemini-2.5-Pro | 1.3 | [1.0, 1.7] | 8 |
| 2 | Claude-Sonnet-4 | 1.8 | [1.2, 2.4] | 7 |
| 3 | GPT-4.1 | 2.5 | [2.0, 3.0] | 5 |
| 4 | Grok-3 | 3.4 | [2.8, 4.0] | 2 |

*Rankings based on 4/4 valid ballots (anonymous peer evaluation)*

---

### Individual Responses

#### Gemini-2.5-Pro

[Gemini's full response...]

#### Claude-Sonnet-4

[Claude's full response...]

#### GPT-4.1

[GPT's full response...]

#### Grok-3

[Grok's full response...]

The rankings table appears first, followed by each model's individual response. This is useful for seeing how models were ranked without waiting for the chairman synthesis.

Run Manifest Comment Block

Full 3-stage runs append an HTML comment block at the end of the output containing execution metadata. This block is invisible when rendered as markdown but can be parsed programmatically:

<!-- Run Manifest
Run ID: a1b2c3d4-e5f6-7890-abcd-ef1234567890
Timestamp: 2025-01-15T14:30:00.123456+00:00
Models: Gemini-2.5-Pro, Claude-Sonnet-4, GPT-4.1, Grok-3
Chairman: Gemini-2.5-Pro
Stage 1 Results: 4/4
Stage 2 Ballots: 4/4 valid
Total Time: 12.3s
Est. Tokens: ~8,500
Config Hash: a1b2c3d4e5f67890...
-->

Manifest fields:

Field Description
Run ID UUID v4 uniquely identifying this run
Timestamp ISO 8601 UTC timestamp of when the run started
Models Comma-separated list of all council model names
Chairman The model designated to write the synthesis
Stage 1 Results Successful responses out of total models queried
Stage 2 Ballots Valid ranking ballots out of total ballots received
Total Time Wall-clock elapsed time for the entire run
Est. Tokens Estimated total token usage across all stages
Config Hash Truncated SHA-256 hash of the config JSON (for reproducibility tracking)

The manifest is also saved as JSON when using --manifest or --log-dir for programmatic access to run metadata.

Available Models

LLM Council supports three providers, each with different model discovery mechanisms.

Bedrock

Bedrock models are discovered dynamically via the AWS ListFoundationModels API. Any model available in your configured AWS region can be used. Common Anthropic models on Bedrock:

Model ID Description
us.anthropic.claude-sonnet-4-20250514-v1:0 Claude Sonnet 4 — fast, capable
us.anthropic.claude-opus-4-20250514-v1:0 Claude Opus 4 — highest capability
anthropic.claude-3-5-sonnet-20241022-v2:0 Claude 3.5 Sonnet v2
anthropic.claude-3-opus-20240229-v1:0 Claude 3 Opus

Run llm-council --list-models with valid AWS credentials to see all models available in your region.

Poe

Poe has no model discovery API. Bots are referenced by name in the config's bot_name field. Common bots:

Bot Name Model
GPT-5.3-Codex OpenAI GPT-5.3 Codex
GPT-5.2 OpenAI GPT-5.2
GPT-4o OpenAI GPT-4o
Gemini-3.1-Pro Google Gemini 3.1 Pro
Gemini-3-Flash Google Gemini 3 Flash
Grok-4 xAI Grok 4
Grok-3 xAI Grok 3
Claude-3.5-Sonnet Anthropic Claude 3.5 Sonnet
Claude-3-Opus Anthropic Claude 3 Opus
Llama-3.3-70B Meta Llama 3.3 70B
Mixtral-8x7B Mistral Mixtral 8x7B

Poe bot names are case-sensitive. New bots appear on Poe regularly — check poe.com for the latest list.

OpenRouter

OpenRouter provides a model discovery API with hundreds of models. Run llm-council --list-models with a valid OPENROUTER_API_KEY to see the full list. Models are referenced by their OpenRouter ID in the config's model_id field. Examples:

Model ID Description
anthropic/claude-opus-4.6 Claude Opus 4.6
anthropic/claude-sonnet-4 Claude Sonnet 4
openai/gpt-5.3-codex GPT-5.3 Codex
google/gemini-3.1-pro-preview Gemini 3.1 Pro
x-ai/grok-4 Grok 4
meta-llama/llama-4-maverick Llama 4 Maverick

OpenRouter supports temperature, max_tokens, and other OpenAI-compatible parameters. See the OpenRouter docs for the complete model catalog.

Background

LLM Council was born from a simple observation: no single LLM is consistently the best at everything. Different models have different strengths — one might excel at reasoning, another at creative writing, a third at code generation. Rather than picking a single model and hoping for the best, what if you could consult multiple models and let them evaluate each other's work?

That's the core idea. LLM Council runs your question through multiple models in parallel, then has each model anonymously rank the others' responses (without knowing who wrote what), and finally asks the top-ranked model to synthesize a final answer. The anonymized peer review step is key — it reduces bias and produces rankings that correlate well with human preferences.

The project started as a CLI tool but quickly evolved into a skill-first architecture. Most users interact with it through Claude Code's /council slash command or as an OpenClaw skill, where the council runs transparently behind a single command. The CLI remains available for scripting, automation, and direct terminal use.

Contributing

Contributions are welcome. See CONTRIBUTING.md for development setup, project structure, testing instructions, and guidelines for adding new providers.

License

MIT © 2025 Ori Neidich

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

llm_council_skill-0.2.0.tar.gz (150.9 kB view details)

Uploaded Source

Built Distribution

If you're not sure about the file name format, learn more about wheel file names.

llm_council_skill-0.2.0-py3-none-any.whl (77.9 kB view details)

Uploaded Python 3

File details

Details for the file llm_council_skill-0.2.0.tar.gz.

File metadata

  • Download URL: llm_council_skill-0.2.0.tar.gz
  • Upload date:
  • Size: 150.9 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? Yes
  • Uploaded via: twine/6.1.0 CPython/3.13.7

File hashes

Hashes for llm_council_skill-0.2.0.tar.gz
Algorithm Hash digest
SHA256 813670aa1e450c4dd08d67c971ea459c64ed7c04366f4ef1e14193d9bdb8fcbe
MD5 d00b3fb10188728dd9614dbcbaf6c22c
BLAKE2b-256 88ebcaec3ae30f23b34f7bc1bb542f16c1c53034e6f6eb2c392de157d39c2683

See more details on using hashes here.

Provenance

The following attestation bundles were made for llm_council_skill-0.2.0.tar.gz:

Publisher: publish.yml on 0ri/llm-council

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

File details

Details for the file llm_council_skill-0.2.0-py3-none-any.whl.

File metadata

File hashes

Hashes for llm_council_skill-0.2.0-py3-none-any.whl
Algorithm Hash digest
SHA256 938916b9f1e8264be2709d970a02d6683f219d49b67644629ea36bab4999894e
MD5 2fd57719af01e2a7df6a2f29f0a7e3f5
BLAKE2b-256 27b78b4211a39c0b1f88a552b9465ec9ef4a65c96f574c3dae20ed9f7dff2bc6

See more details on using hashes here.

Provenance

The following attestation bundles were made for llm_council_skill-0.2.0-py3-none-any.whl:

Publisher: publish.yml on 0ri/llm-council

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

Supported by

AWS Cloud computing and Security Sponsor Datadog Monitoring Depot Continuous Integration Fastly CDN Google Download Analytics Pingdom Monitoring Sentry Error logging StatusPage Status page