
agent-convergence-scorer

Measure how similar N agent outputs are. Score exact-match rate, Jaccard token overlap, divergence point, and a composite 0–1 convergence score over any list of agent runs.


If you run the same prompt through N agents and want a number for "are they producing N distinct outputs or have they collapsed to one idea?" — this is that number.

Pain

  • You just ran a fan-out of N agents and eyeballing whether they converged is slow and subjective.
  • Your eval harness reports accuracy but not reproducibility; same prompt, two runs, two answers, no metric.
  • Multi-agent hackathon or swarm setup; half the agents picked the same target. You want evidence, not vibes.
  • LLM temperature study where "temp=0.3 vs temp=0.7" needs a downstream consistency number.
  • You caught agents rephrasing each other but there is no column in your CSV for it.

Install

pip install agent-convergence-scorer

Python 3.9+. Zero runtime dependencies (stdlib only).

Quick start

echo '{"runs": ["The capital is Paris.", "The capital is Paris.", "The capital is Lyon."]}' \
  | agent-convergence-scorer -

Output:

{
  "num_runs": 3,
  "exact_match_rate": 0.667,
  "token_metrics": {
    "avg_overlap": 0.733,
    "jaccard": 1.0
  },
  "convergence_score": 0.703,
  "divergence_point": {
    "diverges_at_token": "paris.",
    "token_position": 3,
    "num_tokens_to_divergence": 3
  }
}

Interpret:

  • convergence_score = 0.703 — high but not perfect consistency.
  • exact_match_rate = 0.667 — 2 of 3 runs identical to run 0.
  • Divergence at token 3 — they agreed on the prefix "The capital is" then split.

Library usage

from agent_convergence_scorer import score_runs

runs = [
    "The answer is A",
    "The answer is B",
    "The answer is C",
]
print(score_runs(runs))
# {'num_runs': 3, 'exact_match_rate': 0.333,
#  'token_metrics': {'avg_overlap': 0.6, 'jaccard': 0.6},
#  'convergence_score': 0.497,
#  'divergence_point': {'diverges_at_token': 'a', 'token_position': 3, 'num_tokens_to_divergence': 3}}

Individual metrics are importable too: exact_match_rate, token_overlap, divergence_point, convergence_score, tokenize.

Metrics — what they mean

  • exact_match_rate, range [0, 1]: fraction of runs byte-identical to runs[0]; a crude reproducibility floor.
  • token_metrics.jaccard, range [0, 1]: token-set Jaccard of the first two runs (a quick eyeball check).
  • token_metrics.avg_overlap, range [0, 1]: mean Jaccard over all C(N, 2) pairs; robust to N.
  • divergence_point.num_tokens_to_divergence, range [0, min_len]: first token position at which the runs disagree; late divergence means a strong shared prefix.
  • convergence_score, range [0, 1]: composite score, 0.5 * exact_match + 0.3 * avg_overlap + 0.2 * div_distance_norm.
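These definitions are simple enough to reproduce with the stdlib. The sketch below recomputes the Quick-start numbers from the formulas stated above; the lowercasing and whitespace split are assumptions inferred from the example output, not the library's documented internals.

```python
from itertools import combinations

def tokenize(text):
    # Assumed: lowercase + whitespace split, matching the example output.
    return text.lower().split()

def jaccard(a, b):
    sa, sb = set(a), set(b)
    return len(sa & sb) / len(sa | sb) if sa | sb else 1.0

def score(runs):
    toks = [tokenize(r) for r in runs]
    # Fraction of runs byte-identical to runs[0].
    exact = sum(r == runs[0] for r in runs) / len(runs)
    # Mean Jaccard over all C(N, 2) pairs.
    pairs = list(combinations(toks, 2))
    avg_overlap = sum(jaccard(a, b) for a, b in pairs) / len(pairs)
    # First token position where any pair of runs disagrees.
    min_len = min(len(t) for t in toks)
    div = next((i for i in range(min_len) if len({t[i] for t in toks}) > 1), min_len)
    return 0.5 * exact + 0.3 * avg_overlap + 0.2 * (div / min_len)

runs = ["The capital is Paris.", "The capital is Paris.", "The capital is Lyon."]
print(round(score(runs), 3))  # -> 0.703, matching the Quick-start output
```

This also makes it easy to sanity-check the library example: the same function yields 0.497 for the three "The answer is ..." runs.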

When to use it

  • Quick single-number consistency check for multi-agent fan-outs.
  • CI gate: fail if N reruns of a prompt drop below a convergence threshold.
  • Measuring the effect of a temperature, prompt, or framing change on output stability.
  • Quantifying ideation collapse in multi-agent hackathons (N agents → how many distinct ideas?).
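The CI-gate bullet is a few lines of glue. This sketch assumes only the JSON shape shown in Quick start; the 0.8 threshold is illustrative, and how you obtain `result` (for example, `score_runs` over N reruns of the same prompt) is up to you.

```python
import sys

def gate(result, threshold=0.8):
    """CI gate: return exit code 0 if reruns converge enough, else 1.

    `result` is a dict shaped like the scorer's JSON output (see Quick start).
    """
    score = result["convergence_score"]
    if score < threshold:
        print(f"FAIL: convergence {score:.3f} < {threshold}", file=sys.stderr)
        return 1
    print(f"OK: convergence {score:.3f}")
    return 0

# In CI, something like: sys.exit(gate(score_runs(rerun_prompt_n_times())))
```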

When not to use it

  • Semantic similarity. Tokenization is whitespace-only; "Paris, France" and "paris, france," are different token sets. If you need meaning-level comparison, pair these metrics with a sentence-embedding similarity (or a reranker) externally.
  • Subword tokenization studies. This is not a BPE/WordPiece tokenizer.
  • Multilingual corpora where whitespace isn't the word boundary (Chinese, Japanese, Thai, etc.) — tokenize upstream, pass the tokenized-then-joined form.
  • Ranking quality (nDCG, MRR, etc.) — use ir-measures or ranx instead.
  • Concurrency-safe incremental scoring over streams — this is a batch tool.
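For the multilingual caveat, the "tokenize upstream, pass the tokenized-then-joined form" pattern looks like this; the segmented token lists are illustrative stand-ins for the output of a real segmenter for the language.

```python
# Segment upstream with a real tokenizer, then re-join with spaces so the
# scorer's whitespace split recovers your segmentation boundaries.
pre_tokenized = [
    ["東京", "は", "日本", "の", "首都", "です"],
    ["東京", "は", "日本", "の", "首都", "だ"],
]
runs = [" ".join(tokens) for tokens in pre_tokenized]
# runs can now be passed to score_runs or piped to the CLI as usual.
```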

The composite weights (50/30/20) are heuristic; override by calling the individual functions and combining yourself.

Example: measuring a hackathon collapse

from agent_convergence_scorer import score_runs

# 4 agents, same prompt, different (or identical) outputs
runs = [agent.run(prompt) for agent in agents]
result = score_runs(runs)

if result["convergence_score"] > 0.8:
    print(f"⚠️ collapse: {result['convergence_score']:.2f} — agents are rephrasing each other")
else:
    print(f"✓ diverse: {result['convergence_score']:.2f}")

Origin

Built during the Hermes Labs Cascade Hackathon on 2026-04-22, as part of a controlled experiment measuring whether prompt framing affects ideation diversity across N concurrent agents. In the prior-day baseline, 12 agents sharing context collapsed to 2 dominant idea clusters; in the cascade experiment, agents under distinct-persona or distinct-constraint framing produced 4 distinct clusters in each 4-agent arm. This scorer is the mechanism by which that collapse was measured.

Security and supply chain

  • Tamper evidence: the repository carries a staged hermes-seal v1 manifest at .hermes-seal.yaml. The signature is applied out-of-band with a root-owned key and verified with hermes-seal verify <path-to-repo>.
  • SBOM: sbom.cdx.json (CycloneDX 1.5) at repo root.
  • Security policy: see SECURITY.md.

Contributing

See CONTRIBUTING.md. Issues and PRs welcome. For agent-driven contributors, see AGENTS.md.

License

MIT — see LICENSE.

About

Built by Hermes Labs. Sealed with hermes-seal v1.

Related work: lintlang (static linter for AI agent code), cogito-ergo (agent memory with integer-pointer fidelity), claude-router (scaffold router).


If this saved you the five minutes of eyeballing a fan-out's outputs, ⭐ the repo — it helps others find it.

Project files

  • Source distribution: agent_convergence_scorer-0.1.0.tar.gz (10.2 kB); SHA256 2d3378f388cf36945e6c7832c4146e4982890aa0b2f8b2a7b23616ce03ea5975
  • Built distribution: agent_convergence_scorer-0.1.0-py3-none-any.whl (9.1 kB); SHA256 b30a03a12a307a486389a20e173c04c8f6861e50dcf1be6705421eb14e62fe35