
CLI, MCP server, and JSON schemas for validating and auditing strategic-risk AI agent output


Agenda Intelligence MD

CI / MCP / evidence-audit layer for strategic intelligence agents: a protocol, JSON schemas, a CLI, and an MCP server for validating, scoring, and auditing the structure of strategic-risk agent output. It is the evidence-discipline surface for the markdown-first reasoning skills (Global Think Tank Analyst, Central Asia + Caspian, Gulf + Middle East). Open source.

Evidence & eval layer for strategic intelligence agents.


A protocol, JSON-schema set, CLI, and MCP-compatible toolkit that helps AI agents move from unsupported summaries to auditable strategic-risk briefs:

  • what changed
  • why it matters
  • what is evidence-backed
  • what is uncertain
  • who gains or loses leverage
  • what scenarios are plausible
  • what to watch next

It is built for engineers shipping policy, sanctions, regulation, geopolitical-risk, market-risk, and strategic-intelligence agents — where the output has to survive review by an analyst, not just sound plausible.

Bundled-example baseline (5 cases, reproduced with python3 evals/run_benchmark.py):

| metric | value |
| --- | --- |
| mean score | 87.0 / 100 |
| cases | 5 (EU AI Act, EU CBAM, Red Sea shipping, sanctions routing, BIS AI Diffusion) |
| schema-valid | 100% |
| with evidence pack | 100% |
| with claim-level audit | 100% |
| orphan evidence refs | 0 |

What this is

  • Markdown protocol (Agenda-Intelligence.md) — a structured reasoning workflow agents can follow.
  • JSON schemas — validate brief structure, evidence packs, memory cards, lens manifests.
  • CLI checks: validate-brief, validate-evidence, score, and doctor, for CI-style validation of agent output.
  • MCP server — a real stdio MCP server (agenda-intelligence-mcp) exposing the validation, read, and scoring tools.
  • Eval starter kit — rubric, LLM-judge prompt, human checklist, sample cases, benchmark seed.
  • Source / evidence policy — explicit rules for claim-level discipline, including per-claim provenance tags (Axis A: [primary] [secondary] [user-provided] [inference] [analyst-judgment]; Axis B: [verify] [stale-risk: YYYY-MM]). See skills/agenda-intelligence/references/evidence-discipline.md.
  • Signal lifecycle tracker — markdown + JSON schema for tracking signals across sessions (detected → developing → escalated → stable → resolved → archived). See skills/agenda-intelligence/references/signal-lifecycle.md and schemas/signal-tracker.schema.json.
  • Source normalization skill (skills/source-ingest/) — normalize documents (PDF, DOCX, URL) into structured source records for evidence packs.
  • Regional & sector lenses — compact reference packs inside the protocol (Central Asia & Caspian, Middle East, EU; sanctions, export controls). For deep regional analysis, use the dedicated vertical specialist skills: Central Asia + Caspian or Gulf + Middle East.
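The per-claim provenance tags listed above lend themselves to a mechanical check. The following is a minimal sketch, assuming tags appear as bracketed tokens on the same line as the claim; the function name `parse_tags`, the returned dictionary shape, and the example claim are hypothetical and not part of the package.

```python
import re

# Axis A tags as listed above. How the project expects tags to be combined
# on one claim is an assumption of this sketch.
AXIS_A = {"primary", "secondary", "user-provided", "inference", "analyst-judgment"}
TAG_RE = re.compile(r"\[([^\[\]]+)\]")
STALE_RE = re.compile(r"stale-risk:\s*(\d{4})-(0[1-9]|1[0-2])")


def parse_tags(claim_line: str) -> dict:
    """Split the bracketed tags on one claim line into Axis A and Axis B."""
    result = {"axis_a": [], "verify": False, "stale_risk": None, "unknown": []}
    for raw in TAG_RE.findall(claim_line):
        tag = raw.strip()
        stale = STALE_RE.fullmatch(tag)
        if tag in AXIS_A:
            result["axis_a"].append(tag)
        elif tag == "verify":
            result["verify"] = True
        elif stale:
            result["stale_risk"] = f"{stale.group(1)}-{stale.group(2)}"
        else:
            # Anything else is reported rather than silently dropped.
            result["unknown"].append(tag)
    return result


# Hypothetical claim line, for illustration only.
example = "Transit fees rose in Q1 [secondary] [verify] [stale-risk: 2025-03]"
print(parse_tags(example))
```

A claim with an empty `axis_a` list or a non-empty `unknown` list would be a candidate for review; the authoritative rules are in skills/agenda-intelligence/references/evidence-discipline.md.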

Where this sits in the production AI stack

Reasoning skills (markdown-first reasoning contracts for agents):

  • Global Think Tank Analyst
  • Central Asia + Caspian
  • Gulf + Middle East

Evidence & audit layer (CI / MCP / schemas):

  • → Agenda Intelligence MD (this repo) — validate, score and audit strategic-risk agent output structure

The skills define how agents reason. Agenda Intelligence MD defines how the output is audited. Together they let agents produce auditable strategic intelligence, not just plausible-sounding summaries.

What this is not

  • Not a factuality verifier. It does not check whether claims are true. It checks whether they are structurally sound, evidence-labeled, and decision-shaped.
  • Not an autonomous news agent. It does not crawl, retrieve, or rank sources by itself.
  • Not a source retriever. Live retrieval is not implemented.
  • Not a replacement for analyst judgment. Pass/fail signals tell you form, not substance.
  • Not a guarantee of correctness. It flags missing evidence and missing uncertainty statements; it does not guarantee that what is present is right.
  • Not a mature benchmark suite yet. The benchmark seed in evals/benchmark_set.json is a starting point, not validated results.

60-second quickstart

# From PyPI
pip install agenda-intelligence-md
# Or pinned wheel:
# pip install https://github.com/vassiliylakhonin/agenda-intelligence-md/releases/download/v0.7.4/agenda_intelligence_md-0.7.4-py3-none-any.whl

# 1. Get a source plan for a domain
agenda-intelligence start technology-ai

# 2. Validate an agent-produced brief against the schema
agenda-intelligence validate-brief examples/agenda-brief.json

# 3. Score the brief (heuristic 0-100 structural rubric)
agenda-intelligence score examples/agenda-brief.json

# 4. Score with evidence-linked feedback
agenda-intelligence score examples/agenda-brief.json --evidence examples/source/evidence-pack.json

# 5. Run the structural bench across all bundled examples
agenda-intelligence bench examples/source-backed --strict --min-score 80

# 6. Diagnose local install + MCP tool surface
agenda-intelligence doctor

# 7. Print local MCP client config
agenda-intelligence mcp-config --client cursor

Expected scoring output:

score: 90/100
note: Heuristic structural/evidence-discipline score; does not verify factual truthfulness.
evidence_support: ... claims supported: 1/1 supported ...

Flagship example: EU AI Act

A weak baseline summary vs. an Agenda-Intelligence-MD brief, plus the evidence pack used to back each claim.

The evidence URLs in flagship examples are illustrative placeholders. The point is the shape of evidence-backed reasoning, not live citations.

Run the full pipeline on this example:

agenda-intelligence validate-brief examples/source-backed/eu-ai-act.brief.json
agenda-intelligence validate-evidence examples/source-backed/eu-ai-act.evidence.json
agenda-intelligence audit-claims examples/source-backed/eu-ai-act.audit.json --strict
agenda-intelligence score examples/source-backed/eu-ai-act.brief.json --evidence examples/source-backed/eu-ai-act.evidence.json --min-score 80

Before / after (sketch)

| | Baseline LLM | Agenda-Intelligence-MD |
| --- | --- | --- |
| Output shape | Free-text summary | Schema-valid brief |
| Claims | Implicit | Explicit, classified |
| Evidence | Mixed in / absent | Separate evidence pack |
| Uncertainty | Often missing | Required field |
| Watch-next | Often missing | Required, ≥1 indicator |
| Schema validation | N/A | validate-brief pass/fail |
| Evidence audit | N/A | validate-evidence pass/fail |
| Heuristic score | N/A | score 0–100 |

CLI

agenda-intelligence start <category>            # source plan + brief template
agenda-intelligence validate-brief <brief.json>
agenda-intelligence validate-evidence <pack.json>
agenda-intelligence audit-claims <claims.json> [--format json] [--strict]
agenda-intelligence score <brief.json> [--evidence <pack.json>] [--format json] [--min-score N]
agenda-intelligence score <before-after.md>
agenda-intelligence bench <dir>                  # validate + audit + score across a case directory
agenda-intelligence verify-quotes <pack.json>
agenda-intelligence source-plan <category>
agenda-intelligence list-lenses [--type ...]
agenda-intelligence get-lens <type> <id>
agenda-intelligence get-protocol <name>
agenda-intelligence validate-manifest
agenda-intelligence memory-search <query>
agenda-intelligence mcp-config [--client cursor|codex|claude-desktop]
agenda-intelligence doctor [--json]
agenda-intelligence --version

MCP

MCP as distribution surface. MCP turns the validation, audit and scoring tools into agent-consumable functions, not just CLI commands. Any MCP-compatible host (Claude Desktop, Cursor, Codex, custom agents) can call them as tools inside the agent loop — no separate CI step, no copy-paste between systems. The markdown-first reasoning skills define how memos are reasoned; this layer is where their output gets validated and audited without leaving the agent.

The package ships a real stdio MCP server, agenda-intelligence-mcp, plus small Python tool functions in agenda_intelligence.mcp_server. See MCP.md and docs/integrations/mcp.md.

Implemented MCP tools (all verified by scripts/smoke_mcp.py):

  • validate_brief(brief_json) — schema check
  • validate_evidence(evidence_json) — schema check
  • audit_claims(audit_json) — claim-level evidence audit
  • get_protocol(name) — return packaged protocol markdown
  • list_lenses(lens_type=None) — read from manifest
  • get_lens(lens_type, lens_id) — return packaged lens markdown
  • source_plan(category) — return source requirements
  • score_output(before_text, after_text) — heuristic structure / decision-readiness score

MCP verification status: wire-protocol verified — scripts/smoke_mcp.py exercises the full JSON-RPC cycle (initialize → tools/list → tools/call) against the running stdio server. See MCP.md.
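The initialize → tools/list → tools/call cycle can be sketched as three JSON-RPC 2.0 requests. This is an illustration of the message shapes, not a copy of scripts/smoke_mcp.py: the protocol version string, the client name, and whether `brief_json` takes an object or a JSON string are all assumptions. A full MCP client also sends a `notifications/initialized` notification after the initialize response, which is omitted here.

```python
import json
from typing import List, Optional


def jsonrpc_request(req_id: int, method: str, params: Optional[dict] = None) -> str:
    """Serialize one JSON-RPC 2.0 request as a single line of JSON."""
    message = {"jsonrpc": "2.0", "id": req_id, "method": method}
    if params is not None:
        message["params"] = params
    return json.dumps(message)


def smoke_cycle(brief: dict) -> List[str]:
    """Build the initialize -> tools/list -> tools/call requests in order."""
    return [
        jsonrpc_request(1, "initialize", {
            "protocolVersion": "2024-11-05",  # assumed; use what your client supports
            "capabilities": {},
            # Hypothetical client identity.
            "clientInfo": {"name": "example-client", "version": "0.0.0"},
        }),
        jsonrpc_request(2, "tools/list"),
        jsonrpc_request(3, "tools/call", {
            "name": "validate_brief",
            # Argument name taken from the tool list above; passing an object
            # rather than a JSON string is an assumption.
            "arguments": {"brief_json": brief},
        }),
    ]


for line in smoke_cycle({"title": "placeholder brief"}):
    print(line)
```

Over the stdio transport, each line would be written to the stdin of agenda-intelligence-mcp and the matching response read from its stdout.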

Live source retrieval is not implemented.

Example agent flow

  1. Agent receives a policy/risk update.
  2. Agent calls source_plan for the relevant category.
  3. Agent drafts a brief in the protocol shape.
  4. Agent calls validate_brief and validate_evidence.
  5. Agent calls score_output for a decision-readiness signal.
  6. Agent returns the brief, with explicit uncertainty and watch-next.
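The flow above can be sketched as a small orchestration function. This is a minimal sketch under stated assumptions: `run_agent_flow`, `stub_tools`, and `stub_draft` are hypothetical names, the tools are passed in as plain callables keyed by their MCP tool names, each validator is assumed to return a dict with a boolean `valid` key, and the brief field names are placeholders rather than the real schema.

```python
import json
from typing import Any, Callable, Dict, Tuple

Tool = Callable[..., Dict[str, Any]]
Drafter = Callable[[str, Dict[str, Any]], Tuple[Dict[str, Any], Dict[str, Any]]]


def run_agent_flow(update: str, category: str,
                   tools: Dict[str, Tool], draft: Drafter) -> Dict[str, Any]:
    """Steps 2-6 of the flow. `tools` maps MCP tool names to callables."""
    plan = tools["source_plan"](category)                     # step 2
    brief, evidence = draft(update, plan)                     # step 3
    checks = {                                                # step 4
        "brief": tools["validate_brief"](brief),
        "evidence": tools["validate_evidence"](evidence),
    }
    # Assumption: each validator returns a dict with a boolean "valid" key.
    if not all(result.get("valid") for result in checks.values()):
        return {"status": "rejected", "checks": checks}
    score = tools["score_output"](update, json.dumps(brief))  # step 5
    return {"status": "ok", "brief": brief, "score": score}   # step 6


# Stub tools so the sketch runs without the package installed.
stub_tools: Dict[str, Tool] = {
    "source_plan": lambda category: {"category": category},
    "validate_brief": lambda brief: {"valid": "uncertainty" in brief},
    "validate_evidence": lambda pack: {"valid": True},
    "score_output": lambda before, after: {"score": 0},
}


def stub_draft(update: str, plan: Dict[str, Any]):
    """Stand-in for the model call that drafts a brief and an evidence pack."""
    brief = {"what_changed": update, "uncertainty": "unknown", "watch_next": ["indicator"]}
    return brief, {"sources": []}


print(run_agent_flow("example update", "technology-ai", stub_tools, stub_draft)["status"])
```

Rejecting before scoring keeps a structurally invalid brief from ever reaching the caller with a score attached; a real agent would more likely loop back to step 3 with the validation errors.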

CI / checking concept

validate-brief and validate-evidence behave like linters: zero exit on success, non-zero on failure, errors on stderr. Drop them into any CI pipeline that produces strategic briefs from agents:

agenda-intelligence validate-brief examples/agenda-brief.json
agenda-intelligence validate-evidence examples/source/evidence-pack.json
agenda-intelligence score examples/agenda-brief.json --evidence examples/source/evidence-pack.json --min-score 70
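Where a pipeline is driven from Python rather than shell, the same three commands can be wrapped in a small gate. This is a sketch, assuming `score --min-score` also exits non-zero when the score falls below the threshold (the text above only states the linter-style exit behaviour for the two validate commands). `run_gate` and its `runner` parameter are hypothetical names.

```python
import subprocess
import sys
from typing import Callable, List, Sequence

# The three commands from the CI snippet above.
CHECKS: List[List[str]] = [
    ["agenda-intelligence", "validate-brief", "examples/agenda-brief.json"],
    ["agenda-intelligence", "validate-evidence", "examples/source/evidence-pack.json"],
    ["agenda-intelligence", "score", "examples/agenda-brief.json",
     "--evidence", "examples/source/evidence-pack.json", "--min-score", "70"],
]


def _run(cmd: Sequence[str]) -> int:
    """Run one command and return its exit code."""
    return subprocess.run(list(cmd)).returncode


def run_gate(checks: Sequence[Sequence[str]] = CHECKS,
             runner: Callable[[Sequence[str]], int] = _run) -> List[List[str]]:
    """Run every check (no early exit) and return the commands that failed."""
    return [list(cmd) for cmd in checks if runner(cmd) != 0]


def main() -> int:
    failed = run_gate()
    for cmd in failed:
        print("FAILED:", " ".join(cmd), file=sys.stderr)
    return 1 if failed else 0
```

Calling `sys.exit(main())` from a script entry point turns any failed check into a failed CI job. Running all checks instead of stopping at the first failure gives one complete report per run.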

Architecture

flowchart LR
  Agent[Strategic-intelligence agent] -->|drafts| Brief[Agenda brief JSON]
  Agent -->|cites| Evidence[Evidence pack JSON]
  Brief --> Check[validate-brief]
  Evidence --> Audit[validate-evidence]
  Brief --> Score[score]
  Evidence --> Score
  P[Agenda-Intelligence.md] -.guides.-> Agent
  L[regional/sector lenses] -.guides.-> Agent
  S[source requirements] -.guides.-> Agent

Schemas

| Schema | Purpose |
| --- | --- |
| agenda-brief.schema.json | Brief structure |
| evidence-pack.schema.json | Evidence pack structure |
| signal-classification.schema.json | Signal taxonomy |
| memory-card.schema.json | AnalysisBank cards |
| lens-manifest.schema.json | Lens manifest |
| evidence-audit.schema.json | Claim-level evidence audit |
| signal-tracker.schema.json | Signal lifecycle tracker |
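The signal lifecycle covered by signal-tracker.schema.json (detected → developing → escalated → stable → resolved → archived) can be read as an ordered list of stages. The sketch below assumes a forward-only rule in which a signal may skip stages but never move back; that rule is an assumption for illustration and is not taken from the schema, which may well allow moves such as de-escalation. `can_transition` and `advance` are hypothetical helpers.

```python
from typing import List

STAGES = ["detected", "developing", "escalated", "stable", "resolved", "archived"]


def can_transition(current: str, new: str) -> bool:
    """Forward-only rule. This rule is assumed, not read from the schema."""
    return STAGES.index(new) > STAGES.index(current)


def advance(history: List[str], new: str) -> List[str]:
    """Return a new history with `new` appended, or raise on an illegal move."""
    if new not in STAGES:
        raise ValueError(f"unknown stage: {new}")
    if history and not can_transition(history[-1], new):
        raise ValueError(f"cannot move from {history[-1]} to {new}")
    return history + [new]


history: List[str] = []
for stage in ["detected", "developing", "escalated"]:
    history = advance(history, stage)
print(" -> ".join(history))  # detected -> developing -> escalated
```

Keeping the full history rather than only the current stage is what lets a tracker carry a signal across sessions.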

Evidence audit

Each important claim should be traceable:

{
  "claim_id": "c1",
  "claim": "EU AI Act tightens obligations on high-risk systems.",
  "claim_type": "regulatory_change",
  "evidence_ids": ["e1", "e2"],
  "support_level": "direct",
  "uncertainty": "Enforcement timeline per sector unclear.",
  "risk_if_wrong": "Compliance plans miss deadline."
}

support_level is one of direct | partial | weak | unsupported. This schema is not wired into validate-evidence by default; use audit-claims directly.
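The kind of check a claim-level audit performs can be sketched in a few lines. This is an illustration, not the logic of audit-claims: `audit_claim` is a hypothetical helper, the evidence pack is reduced to a set of known ids, and the rule that a supported claim needs at least one evidence id is an assumption.

```python
from typing import List, Set

SUPPORT_LEVELS = {"direct", "partial", "weak", "unsupported"}


def audit_claim(claim: dict, known_evidence_ids: Set[str]) -> List[str]:
    """Return a list of problems for one claim record; empty means clean."""
    problems = []
    level = claim.get("support_level")
    evidence_ids = claim.get("evidence_ids", [])
    if level not in SUPPORT_LEVELS:
        problems.append("support_level must be one of " + ", ".join(sorted(SUPPORT_LEVELS)))
    # An orphan ref points at an evidence id that the pack does not contain.
    orphans = [e for e in evidence_ids if e not in known_evidence_ids]
    if orphans:
        problems.append("orphan evidence refs: " + ", ".join(orphans))
    # Assumed rule: anything other than "unsupported" needs at least one ref.
    if level != "unsupported" and not evidence_ids:
        problems.append("supported claim has no evidence_ids")
    return problems


claim = {
    "claim_id": "c1",
    "claim": "EU AI Act tightens obligations on high-risk systems.",
    "claim_type": "regulatory_change",
    "evidence_ids": ["e1", "e2"],
    "support_level": "direct",
    "uncertainty": "Enforcement timeline per sector unclear.",
    "risk_if_wrong": "Compliance plans miss deadline.",
}
print(audit_claim(claim, {"e1"}))  # ['orphan evidence refs: e2']
```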


Evals

See docs/evaluation.md for the full layer breakdown.

Key honesty rule:

Current scoring does not verify factual truth. It evaluates structure, completeness, evidence labeling, and decision-readiness signals.

Bundled-example baseline: mean 87.0/100 across 5 cases, 100% schema-valid, 0 orphan evidence refs. Reproduce with python3 evals/run_benchmark.py. No human-judge benchmark has been run yet.


Status

| Component | Status |
| --- | --- |
| Markdown protocol | Stable |
| JSON schemas (brief, evidence, lens, memory, signal) | Stable |
| CLI: validate-*, score, start, source-plan, doctor, mcp-config | Stable |
| Lenses (Central Asia, Middle East, EU; sanctions, export controls) | Stable |
| MCP stdio server (agenda-intelligence-mcp) | Stable |
| MCP tool functions (validate / read / score / audit_claims) | Stable |
| Evidence-audit schema (claim-level) | Stable |
| Signal-tracker schema (lifecycle) | Stable |
| Live source retrieval | Not implemented |
| Heuristic benchmark baseline (5 bundled cases) | Produced (mean 87.0/100) |
| Human-judge benchmark results | Not produced yet |
| Factual-truth verification | Not in scope today |

Limitations

  • No factual verification. The toolkit checks form, not truth.
  • No live source retrieval. Evidence packs are user- or agent-supplied.
  • Scoring is heuristic. The rubric is documented and an LLM-judge prompt is provided, but scores have not been validated against human judges yet.
  • Lens coverage is intentionally narrow.

Contributing eval cases

The most valuable contribution is a case: a real public event with a baseline agent output, a target brief, and a human checklist. See CONTRIBUTING.md and evals/cases/.


Repository layout

agenda-intelligence-md/
├─ src/agenda_intelligence/   # Python package (CLI + MCP server + tools)
├─ schemas/                   # JSON schemas
├─ examples/                  # briefs, evidence packs, before/after
├─ analysis-bank/             # reusable reasoning patterns (memory cards)
├─ evals/                     # rubric, judge prompt, checklist, cases
├─ docs/                      # guides, integrations, use-cases
├─ skills/agenda-intelligence/ # OpenClaw skill wrapper
├─ skills/source-ingest/      # Source normalization skill (PDF/DOCX/URL → structured source record)
└─ tests/                     # pytest suite

Documentation

| Resource | Link |
| --- | --- |
| Quickstart | docs/quickstart.md |
| End-to-end tutorial | docs/tutorial.md |
| Evaluation | docs/evaluation.md |
| Evidence audit | docs/evidence-audit.md |
| Agent integration sketch | docs/integrations/agent-loop.md |
| Use-cases | docs/use-cases/ |
| Integrations | docs/integrations/ |
| Roadmap | ROADMAP.md |
| Changelog | CHANGELOG.md |

License

MIT.

