Benchmark for measuring novelty in autonomous research-agent proposals, built on Prime Intellect's autonomous-speedrunning archive.

autoresearch-novelty-bench

A benchmark for testing whether an autonomous AI research agent proposes novel, mechanism-distinct hypotheses that anticipate breakthroughs later found by other researchers.

By Evo. Built on Prime Intellect's autonomous-speedrunning archive — two AI agents (Claude Code and Codex) competing on modded-nanogpt's optimization speedrun.


What is this

You give your proposer agent a snapshot in time: "here's where two agents got on the modded-nanogpt speedrun by 2026-05-12 11:17." The agent writes N candidate ideas for how to go faster. The benchmark then grades each candidate:

  • Rediscovery? It matches something already tried before this moment.
  • Anticipation? It matches something that other researchers proved out later (you predicted a winning thread).
  • Novel? Mechanistically new — no prior or future match.
  • Invalid? Malformed or missing required structure.

Plus a diversity bonus for keeping the N candidates mechanistically distinct, and a validity term for how many of the N parse cleanly.


How the dataset was built

The upstream artifact is Prime Intellect's raw archive: 10,380 training runs, 646 idea writeups, 337,648 validation checkpoints, across 8 scratchpads (2 agents × 4 waves). The build pipeline turns that into a clean evaluation dataset.

1. Parse everything. Walk each scratchpad, normalize per-wave schemas, parse training logs into per-checkpoint validation trajectories. Backfill timestamps from log headers (### START_ISO=) — recovers real launch times for 99.5% of runs.

2. Link runs to ideas. Five-tier matcher (heuristic prefix, abbreviation index, THREAD walk, LLM matcher, dedup) links 63% of runs to a formal idea writeup.

3. Recover variant.py for every run — four cascade passes:

| pass | trick | recovered |
| --- | --- | --- |
| extract from log | scripts dump source to stdout before a ==== delimiter | 1,120 |
| dynamo warnings | PyTorch's recompile logs include the variant.py path | 1,981 |
| sbatch stubs | each <run_id>.sh carries --variant ".../<file>.py" | 471 |
| name-stem fuzzy | <mechanism>-s0 → train_gpt_simple_<mechanism>.py | 1,630 |

Net: 54% → 95.6% of runs have variant.py linkage.

4. Build the experiments catalog. One row per training run (10,380), with timestamp, verdict (improved/neutral/regressed/inconclusive), best step count, linked variant + idea.

5. Write a description for every experiment. Three sources, in priority: parent idea writeup (~6,580 rows), variant docstring, or — for the ~3,800 runs without either — an LLM (gpt-4o-mini, ~$0.26 total) decodes the run_id shorthand + log preamble into a 1-2 sentence mechanism summary.

6. Embed every description. Two backends shipped: OpenAI text-embedding-3-large (3072-dim, $0.51 total) and BGE-large (1024-dim, free local fallback).

7. Build 40 snapshots. 5 time-anchored evaluation positions per scope (at the 10/30/50/70/90 percentile of the scope's run arc). Each snapshot freezes: priors visible by time T, future improvements still ahead, rejected ideas the agent had killed by T, cross-agent visible work, current best recipe.

8. Calibrate the judge. Generate 78 labeled cases (sampled from priors, futures, paraphrases, cross-scopes, malformed stubs) → sweep cosine threshold → tune to 0.75 (87% accuracy).
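
The threshold sweep in step 8 can be sketched roughly as follows. This is a minimal illustration, not the package's calibration code: the function name `sweep_threshold`, the `(cosine, is_match)` case format, and the toy values are all assumptions made for the example.

```python
from typing import Iterable, List, Tuple


def sweep_threshold(
    cases: List[Tuple[float, bool]],
    thresholds: Iterable[float],
) -> Tuple[float, float]:
    """Return (best_threshold, accuracy) over labeled (cosine, is_match) cases.

    A case is predicted "match" when its cosine similarity is >= threshold.
    On ties, the first threshold that reached the best accuracy is kept.
    """
    best_t, best_acc = 0.0, -1.0
    for t in thresholds:
        correct = sum((cos >= t) == is_match for cos, is_match in cases)
        acc = correct / len(cases)
        if acc > best_acc:
            best_t, best_acc = t, acc
    return best_t, best_acc


# Toy labeled cases (hypothetical values, not the benchmark's 78 cases).
toy_cases = [
    (0.91, True), (0.82, True), (0.77, True), (0.72, True),
    (0.80, False), (0.74, False), (0.69, False), (0.55, False),
]
grid = [round(0.50 + 0.05 * i, 2) for i in range(9)]  # 0.50, 0.55, ..., 0.90
best_t, best_acc = sweep_threshold(toy_cases, grid)
print(f"best threshold={best_t}, accuracy={best_acc:.2f}")
```

Picking the threshold by plain accuracy is one reasonable choice; a real calibration might weight false matches and missed matches differently.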

Reproduce from scratch:

git clone https://github.com/PrimeIntellect-ai/experiments-autonomous-speedrunning <PI>
python -m novelty_bench.indexer                       --pi-repo <PI>
python -m novelty_bench.indexer.extract_embedded_sources --pi-repo <PI>
python -m novelty_bench.indexer.link_via_dynamo_warnings --pi-repo <PI>
python -m novelty_bench.indexer.link_via_sbatch_stubs    --pi-repo <PI>
python -m novelty_bench.indexer.link_via_name_stem       --pi-repo <PI>
python -m novelty_bench.indexer.experiments_llm_augment  --pi-repo <PI>
python -m novelty_bench.embeddings --table experiments --backend openai
python -m novelty_bench.embeddings --table experiments --backend bge
python -m novelty_bench.calibration.generate
python -m novelty_bench.calibration.evaluate --sweep

Usage

pip install evo-autoresearch-novelty-bench

The parquet tables auto-download from evo-hq/autoresearch-novelty-bench on first use.

import novelty_bench as nb

# Browse the dataset
nb.load_experiments(agent="codex", wave="v3", verdict="improved")
nb.load_snapshots(split="dev")     # 16 dev snapshots for tuning
nb.load_snapshots(split="test")    # 24 held-out test snapshots
nb.load_ideas()                    # 646 formal writeups

# Score one of your proposer's outputs
snap = nb.load_snapshots()
snap = snap[snap["snapshot_id"] == "snap_codex_v3_k0922"].iloc[0]
result = nb.score(snap, ["candidate_1.md", "candidate_2.md", "candidate_3.md"])
print(result.explain())

Render a working directory for your proposer to operate in (via the scaffold repo):

git clone https://github.com/evo-hq/autoresearch-novelty-bench-scaffold
cd autoresearch-novelty-bench-scaffold
python build.py --snapshot snap_codex_v3_k0922 --out /tmp/workspace
# Your proposer agent reads /tmp/workspace/AGENTS.md and writes
# N candidates into /tmp/workspace/scratchpad/ideas/{slug}.md

The scaffold is near-blank-slate: the proposer sees the goal, the wave's gating rule, the field-wide best step count at the snapshot's wall-clock time, and the current-best variant.py it is building on. Nothing else: no peer ideas, no prior writeups, no future hints. The proposer uses its own tools (web search, paper retrieval) to gather context. The context is deliberately minimal because PI's archive doesn't preserve per-file creation dates, so mirroring the agent's broader scratchpad state at a past moment would necessarily leak future work.


Scoring

set_score = sum(per_candidate_scores)
          + 0.5 × diversity_bonus          # mean pairwise cosine distance between N candidates
          + 0.1 × validity_term            # fraction of N that passed the structural check
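
As a rough sketch, the formula above could be computed like this. The assumptions are that cosine distance means 1 − cosine similarity, that the bonus averages over all unordered pairs of the N candidates, and that a single candidate earns no bonus. The function names are illustrative, not the package's API.

```python
import itertools
import math
from typing import List, Sequence


def cosine_distance(a: Sequence[float], b: Sequence[float]) -> float:
    """1 - cosine similarity between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


def diversity_bonus(embeddings: List[Sequence[float]]) -> float:
    """Mean pairwise cosine distance over all unordered candidate pairs."""
    pairs = list(itertools.combinations(embeddings, 2))
    if not pairs:
        return 0.0  # assumption: fewer than two candidates earns no bonus
    return sum(cosine_distance(a, b) for a, b in pairs) / len(pairs)


def set_score(
    per_candidate_scores: List[float],
    embeddings: List[Sequence[float]],
    n_valid: int,
) -> float:
    """sum of candidate scores + 0.5 * diversity bonus + 0.1 * validity term."""
    n = len(per_candidate_scores)
    validity_term = n_valid / n if n else 0.0
    return (
        sum(per_candidate_scores)
        + 0.5 * diversity_bonus(embeddings)
        + 0.1 * validity_term
    )


# Three toy candidates: two share a direction, one is orthogonal to them.
toy_embeddings = [[1.0, 0.0], [0.0, 1.0], [1.0, 0.0]]
toy_total = set_score([0.3, -0.5, 0.6], toy_embeddings, n_valid=3)
```

In the toy case the pairwise distances are 1, 0, and 1, so the diversity bonus is 2/3 and contributes about 0.33 after the 0.5 weight.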

Per-candidate, the judge classifies in this priority order:

| order | outcome | score | when |
| --- | --- | --- | --- |
| 1 | invalid | −1.0 | missing ## Proposal section, no title, or body < 50 chars |
| 2 | novel_validated | +impact × confidence | matches a future-improved experiment. impact ∈ {1.0 frontier_idea, 0.6 improved_idea, 0.5 frontier_experiment, 0.4 improved_experiment} |
| 3 | rediscovery | −0.5 × rejection_mult × confidence | matches a prior. rejection_mult: none 1.0, failed 1.4, family_ruled_out 1.6, audit_noncompliant 1.6, existence_killed 2.0 |
| 4 | novel_unvalidated | +0.3 × confidence | no future match, no prior match |

Future-first priority: copying a prior that ended up on a winning thread counts as anticipation, not rediscovery.
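
The priority order and per-candidate scores can be expressed as a small function. This is a sketch under assumptions: `score_candidate` and its arguments are hypothetical names, and the matching itself (deciding whether a future or prior match exists) is taken as given input rather than implemented.

```python
from typing import Optional, Tuple

# Weights copied from the scoring rules above.
IMPACT = {
    "frontier_idea": 1.0,
    "improved_idea": 0.6,
    "frontier_experiment": 0.5,
    "improved_experiment": 0.4,
}
REJECTION_MULT = {
    "none": 1.0,
    "failed": 1.4,
    "family_ruled_out": 1.6,
    "audit_noncompliant": 1.6,
    "existence_killed": 2.0,
}


def score_candidate(
    is_valid: bool,
    future_kind: Optional[str],      # key of IMPACT, or None if no future match
    prior_rejection: Optional[str],  # key of REJECTION_MULT, or None if no prior match
    confidence: float,
) -> Tuple[str, float]:
    """Classify one candidate in priority order and return (outcome, score)."""
    if not is_valid:
        return "invalid", -1.0
    if future_kind is not None:
        # Future-first: a future match wins even when a prior also matches.
        return "novel_validated", IMPACT[future_kind] * confidence
    if prior_rejection is not None:
        return "rediscovery", -0.5 * REJECTION_MULT[prior_rejection] * confidence
    return "novel_unvalidated", 0.3 * confidence


# A candidate matching both a rejected prior and a later frontier idea
# is scored as an anticipation, not a rediscovery.
outcome, score = score_candidate(True, "frontier_idea", "failed", 0.9)
```

Note that a matched prior that was never rejected is passed as the string "none", while Python's `None` means no prior matched at all.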

Default judge backend: llm-hybrid. Cosine retrieval picks the top-20 nearest priors and top-20 nearest futures by text-embedding-3-large, then gpt-5-mini (reasoning_effort=medium) classifies the candidate with structured JSON output (classification + matched_id + confidence + reasoning). The reasoning step catches paraphrases that cosine misses (e.g. cosine 0.73 → LLM "novel_validated, confidence 0.95"). Final score is multiplied by the LLM's confidence so weak matches are softened.
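
The retrieval stage of this backend amounts to a top-k nearest-neighbour search by cosine similarity. The sketch below shows only that stage, in pure Python on toy vectors; `top_k_by_cosine` and the dictionary layout are assumptions for illustration, and the LLM classification step is not shown.

```python
import heapq
import math
from typing import Dict, List, Sequence, Tuple


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k_by_cosine(
    query: Sequence[float],
    corpus: Dict[str, Sequence[float]],
    k: int = 20,
) -> List[Tuple[str, float]]:
    """Return up to k (experiment_id, similarity) pairs, most similar first."""
    scored = ((cosine(query, vec), exp_id) for exp_id, vec in corpus.items())
    return [(exp_id, sim) for sim, exp_id in heapq.nlargest(k, scored)]


# Toy corpus with hypothetical experiment ids and 2-dim "embeddings".
toy_priors = {"exp_a": [1.0, 0.0], "exp_b": [0.0, 1.0], "exp_c": [1.0, 1.0]}
nearest = top_k_by_cosine([1.0, 0.0], toy_priors, k=2)
```

In the hybrid setup described above, this search would run once against priors and once against futures, and both shortlists would be handed to the LLM for the final classification.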

Cost: ~$0.005 per N=5 candidate set → ~$0.12 per 24-snapshot benchmark.

Opt in to the deterministic backend with --judge-backend cosine for zero-LLM, pinned-leaderboard runs (match threshold: cosine ≥ 0.75; ~$0.0003/set).


Limits

What we can't recreate from the upstream archive: the agent's conversation with their orchestrator, their accumulated context window, real-time search queries, decision rationale (we see what ran, not why), PI operator interventions, cluster state, and the agent's training cutoff. See the design notes for the full enumeration.

License

Apache 2.0. Raw ideas/*.md and variants/*.py files belong to Prime Intellect; this package ships only derived metadata.
