Benchmark for measuring novelty in autonomous research-agent proposals, built on Prime Intellect's autonomous-speedrunning archive.
autoresearch-novelty-bench
A benchmark for testing whether an autonomous AI research agent proposes novel, mechanism-distinct hypotheses that anticipate breakthroughs later found by other researchers.
By Evo. Built on Prime Intellect's autonomous-speedrunning archive: two AI agents (Claude Code and Codex) competing on modded-nanogpt's optimization speedrun.
What is this
You give a snapshot in time: "here's where two agents got on the modded-nanogpt speedrun by 2026-05-12 11:17." Your proposer agent writes N candidate ideas for how to go faster. The benchmark grades each candidate:
- Rediscovery? It matches something already tried before this moment.
- Anticipation? It matches something that other researchers proved out later (you predicted a winning thread).
- Novel? Mechanistically new — no prior or future match.
- Invalid? Malformed or missing required structure.
Plus a diversity bonus for keeping the N candidates mechanistically distinct, and a validity term for how many of the N parse cleanly.
How the dataset was built
The upstream artifact is Prime Intellect's raw archive: 10,380 training runs, 646 idea writeups, 337,648 validation checkpoints, across 8 scratchpads (2 agents × 4 waves). The build pipeline turns that into a clean evaluation dataset.
1. Parse everything. Walk each scratchpad, normalize per-wave schemas,
parse training logs into per-checkpoint validation trajectories. Backfill
timestamps from log headers (### START_ISO=) — recovers real launch times
for 99.5% of runs.
2. Link runs to ideas. Five-tier matcher (heuristic prefix, abbreviation index, THREAD walk, LLM matcher, dedup) links 63% of runs to a formal idea writeup.
3. Recover variant.py for every run — four cascade passes:
| pass | trick | recovered |
|---|---|---|
| extract from log | scripts dump source to stdout before a ==== delimiter | 1,120 |
| dynamo warnings | pytorch's recompile logs include the variant.py path | 1,981 |
| sbatch stubs | each <run_id>.sh carries --variant ".../<file>.py" | 471 |
| name-stem fuzzy | <mechanism>-s0 → train_gpt_simple_<mechanism>.py | 1,630 |
Net: 54% → 95.6% of runs have variant.py linkage.
4. Build the experiments catalog. One row per training run (10,380), with timestamp, verdict (improved/neutral/regressed/inconclusive), best step count, linked variant + idea.
5. Write a description for every experiment. Three sources, in priority order: the parent idea writeup (~6,580 rows), the variant docstring, or, for the ~3,800 runs without either, an LLM (gpt-4o-mini, ~$0.26 total) that decodes the run_id shorthand and log preamble into a 1-2 sentence mechanism summary.
6. Embed every description. Two backends shipped: OpenAI
text-embedding-3-large (3072-dim, $0.51 total) and BGE-large (1024-dim,
free local fallback).
7. Build 40 snapshots. 5 time-anchored evaluation positions per scope (at the 10/30/50/70/90 percentile of the scope's run arc). Each snapshot freezes: priors visible by time T, future improvements still ahead, rejected ideas the agent had killed by T, cross-agent visible work, current best recipe.
8. Calibrate the judge. Generate 78 labeled cases (sampled from priors, futures, paraphrases, cross-scopes, malformed stubs) → sweep cosine threshold → tune to 0.75 (87% accuracy).
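Step 7 can be pictured with a small sketch. It assumes (the source does not confirm this) that the percentile is taken over the run index within a scope, and that each run is a plain dict with hypothetical `ts` and `verdict` keys. The real snapshot builder also freezes rejected ideas, cross-agent work, and the current best recipe, which are left out here.

```python
from datetime import datetime, timedelta

PERCENTILES = (10, 30, 50, 70, 90)

def snapshot_cutoffs(runs, percentiles=PERCENTILES):
    """Return one cutoff timestamp per percentile of the run arc (by run index)."""
    ordered = sorted(runs, key=lambda r: r["ts"])
    cutoffs = []
    for p in percentiles:
        # Assumption: percentile position is computed over the sorted run index.
        idx = min(len(ordered) - 1, (len(ordered) * p) // 100)
        cutoffs.append(ordered[idx]["ts"])
    return cutoffs

def freeze(runs, cutoff):
    """Split runs into priors visible at the cutoff and improvements still ahead."""
    priors = [r for r in runs if r["ts"] <= cutoff]
    futures = [r for r in runs if r["ts"] > cutoff and r["verdict"] == "improved"]
    return priors, futures

# Toy scope (invented data): 10 runs one hour apart, every third run improved.
start = datetime(2026, 5, 1)
runs = [
    {"run_id": f"run{i}", "ts": start + timedelta(hours=i),
     "verdict": "improved" if i % 3 == 0 else "neutral"}
    for i in range(10)
]
cutoffs = snapshot_cutoffs(runs)
priors, futures = freeze(runs, cutoffs[2])  # the 50th-percentile snapshot
```

Splitting strictly on the cutoff timestamp is what keeps future work out of the prior pool; any run launched after T can only appear as a future.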
Reproduce from scratch:
git clone https://github.com/PrimeIntellect-ai/experiments-autonomous-speedrunning <PI>
python -m novelty_bench.indexer --pi-repo <PI>
python -m novelty_bench.indexer.extract_embedded_sources --pi-repo <PI>
python -m novelty_bench.indexer.link_via_dynamo_warnings --pi-repo <PI>
python -m novelty_bench.indexer.link_via_sbatch_stubs --pi-repo <PI>
python -m novelty_bench.indexer.link_via_name_stem --pi-repo <PI>
python -m novelty_bench.indexer.experiments_llm_augment --pi-repo <PI>
python -m novelty_bench.embeddings --table experiments --backend openai
python -m novelty_bench.embeddings --table experiments --backend bge
python -m novelty_bench.calibration.generate
python -m novelty_bench.calibration.evaluate --sweep
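The final `--sweep` command corresponds to the threshold calibration in step 8. The package's own implementation is not shown in this README, so the following is only a minimal sketch of what such a sweep might look like: each labeled case is reduced to a hypothetical `(similarity, should_match)` pair, and the threshold with the highest accuracy wins. The cases below are invented for illustration and are not the 78 calibration cases.

```python
def accuracy_at(threshold, cases):
    """Fraction of cases where (similarity >= threshold) agrees with the label."""
    correct = sum((sim >= threshold) == label for sim, label in cases)
    return correct / len(cases)

def sweep(cases, thresholds):
    """Return (best_threshold, best_accuracy); ties go to the lowest threshold."""
    return max(((t, accuracy_at(t, cases)) for t in thresholds),
               key=lambda pair: pair[1])

# Invented labeled cases: (cosine similarity, should_match)
cases = [(0.92, True), (0.81, True), (0.77, True), (0.74, False),
         (0.70, False), (0.66, True), (0.58, False), (0.41, False)]
thresholds = [round(0.50 + 0.05 * i, 2) for i in range(9)]  # 0.50 .. 0.90
best_threshold, best_accuracy = sweep(cases, thresholds)
```

On this toy data the sweep settles on 0.75 because lower thresholds admit the two near-miss negatives and higher ones start rejecting true matches.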
Usage
pip install evo-autoresearch-novelty-bench
The parquet tables auto-download from evo-hq/autoresearch-novelty-bench on first use.
import novelty_bench as nb
# Browse the dataset
nb.load_experiments(agent="codex", wave="v3", verdict="improved")
nb.load_snapshots(split="dev") # 16 dev snapshots for tuning
nb.load_snapshots(split="test") # 24 held-out test snapshots
nb.load_ideas() # 646 formal writeups
# Score one of your proposer's outputs
snap = nb.load_snapshots()
snap = snap[snap["snapshot_id"] == "snap_codex_v3_k0922"].iloc[0]
result = nb.score(snap, ["candidate_1.md", "candidate_2.md", "candidate_3.md"])
print(result.explain())
Render a working directory for your proposer to operate in (via the scaffold repo):
git clone https://github.com/evo-hq/autoresearch-novelty-bench-scaffold
cd autoresearch-novelty-bench-scaffold
python build.py --snapshot snap_codex_v3_k0922 --out /tmp/workspace
# Your proposer agent reads /tmp/workspace/AGENTS.md and writes
# N candidates into /tmp/workspace/scratchpad/ideas/{slug}.md
The scaffold is blank-slate: the proposer sees the goal and the wave constraint, nothing more, and uses its own tools (web search, paper retrieval) to gather context. The reason is that PI's archive doesn't preserve per-file creation dates, so mirroring the prior state at a past moment would necessarily leak future work.
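Before scoring, it can help to check that each candidate file would pass the structural check listed under Scoring (a `## Proposal` section, a title, and a body of at least 50 characters). The sketch below is not the package's validator. It assumes the title is a level-1 markdown heading and that "body" means the text under `## Proposal`; both are guesses, and `structural_problems` is a hypothetical helper name.

```python
import re

MIN_BODY_CHARS = 50  # from the Scoring table: body < 50 chars is invalid

def structural_problems(markdown_text):
    """Return a list of reasons the candidate would likely be marked invalid."""
    problems = []
    # Assumption: the title is a level-1 markdown heading.
    if not re.search(r"^# \S", markdown_text, flags=re.MULTILINE):
        problems.append("no title")
    # Assumption: the body is the text under '## Proposal' up to the next '## ' heading.
    match = re.search(r"^## Proposal[ \t]*$(.*?)(?=^## |\Z)", markdown_text,
                      flags=re.MULTILINE | re.DOTALL)
    if match is None:
        problems.append("missing ## Proposal section")
    elif len(match.group(1).strip()) < MIN_BODY_CHARS:
        problems.append("body shorter than 50 chars")
    return problems

good = ("# Warmup-free Muon schedule\n\n## Proposal\n"
        "Replace the linear warmup with a short cosine ramp and measure "
        "steps to target loss.\n")
bad = "## Proposal\nToo short.\n"
good_problems = structural_problems(good)
bad_problems = structural_problems(bad)
```

Since an invalid candidate scores −1.0 and lowers the validity term, a cheap local check like this is worth running on every file in `scratchpad/ideas/` before calling `nb.score`.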
Scoring
set_score = sum(per_candidate_scores)
+ 0.5 × diversity_bonus # mean pairwise cosine distance between N candidates
+ 0.1 × validity_term # fraction of N that passed the structural check
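As a worked example of the formula above, here is a minimal pure-Python sketch. It assumes every candidate, valid or not, contributes an embedding to the diversity term, which the README does not specify; the 2-d vectors stand in for real embeddings, and the function names are illustrative.

```python
import math
from itertools import combinations

def cosine_distance(a, b):
    """1 minus the cosine similarity of two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / norm

def diversity_bonus(embeddings):
    """Mean pairwise cosine distance; 0.0 when fewer than two candidates."""
    pairs = list(combinations(embeddings, 2))
    if not pairs:
        return 0.0
    return sum(cosine_distance(a, b) for a, b in pairs) / len(pairs)

def set_score(per_candidate_scores, embeddings, n_valid):
    """Sum of candidate scores plus weighted diversity and validity terms."""
    validity_term = n_valid / len(per_candidate_scores)
    return (sum(per_candidate_scores)
            + 0.5 * diversity_bonus(embeddings)
            + 0.1 * validity_term)

# Toy example: three candidates, one invalid, 2-d stand-in embeddings.
scores = [1.0, 0.3, -1.0]
embeddings = [(1.0, 0.0), (0.0, 1.0), (1.0, 0.0)]
total = set_score(scores, embeddings, n_valid=2)
```

Here the candidate scores sum to 0.3, the mean pairwise distance is 2/3, and two of three candidates are valid, so the total comes out to about 0.7. The per-candidate sum dominates; diversity and validity act as small corrections.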
Per-candidate, the judge classifies in this priority order:
| order | outcome | score | when |
|---|---|---|---|
| 1 | invalid | −1.0 | missing ## Proposal section, no title, or body < 50 chars |
| 2 | novel_validated | +0.4 to +1.0 | matches a future-improved experiment; multiplier depends on which pool (frontier_idea=1.0, improved_idea=0.6, frontier_experiment=0.5, improved_experiment=0.4) |
| 3 | rediscovery | −0.5 to −1.0 | matches a prior; harsher if that prior was already rejected (failed×1.4, family_ruled_out×1.6, audit_noncompliant×1.6, existence_killed×2.0) |
| 4 | novel_unvalidated | +0.3 | no future match, no prior match |
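The score column can be read as a small lookup. The sketch below assumes a base rediscovery penalty of −0.5 that the rejection multipliers scale (−0.5 × 2.0 gives the −1.0 end of the stated range); that base is inferred from the table, not stated outright. The function and constant names are illustrative.

```python
FUTURE_POOL_SCORE = {
    "frontier_idea": 1.0,
    "improved_idea": 0.6,
    "frontier_experiment": 0.5,
    "improved_experiment": 0.4,
}
REDISCOVERY_BASE = -0.5  # assumption: penalty for matching a prior that was not rejected
REJECTION_MULTIPLIER = {
    "failed": 1.4,
    "family_ruled_out": 1.6,
    "audit_noncompliant": 1.6,
    "existence_killed": 2.0,
}

def candidate_score(outcome, future_pool=None, rejection=None):
    """Map a judge outcome to its score, following the table's values."""
    if outcome == "invalid":
        return -1.0
    if outcome == "novel_validated":
        return FUTURE_POOL_SCORE[future_pool]
    if outcome == "rediscovery":
        # A prior with no rejection status keeps the base penalty.
        return REDISCOVERY_BASE * REJECTION_MULTIPLIER.get(rejection, 1.0)
    if outcome == "novel_unvalidated":
        return 0.3
    raise ValueError(f"unknown outcome: {outcome!r}")

worst_rediscovery = candidate_score("rediscovery", rejection="existence_killed")
plain_rediscovery = candidate_score("rediscovery")
```

Under this reading, re-proposing an idea that was already killed costs as much as submitting an invalid file.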
Matching = cosine similarity ≥ 0.75 over precomputed text embeddings of each experiment's description. Future-first priority: copying a prior that ended up on a winning thread counts as anticipation, not rediscovery.
The judge is deterministic and fast: no LLM call, just an embedding of your candidate (~50ms, ~$0.0003) and a matrix dot-product against the snapshot's prior/future pools. Per-call cost: a fraction of a cent.
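The matching rule and priority order can be sketched in a few lines. This is not the package's judge: it uses plain Python loops where the real one is described as a matrix dot-product, the 3-d vectors stand in for real embeddings, and `classify` and `best_match` are hypothetical names.

```python
import math

THRESHOLD = 0.75  # calibrated cosine threshold from the README

def cosine(a, b):
    """Cosine similarity of two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

def best_match(vec, pool):
    """Highest cosine similarity between vec and any pool vector (-1.0 if empty)."""
    return max((cosine(vec, p) for p in pool), default=-1.0)

def classify(vec, is_valid, future_pool, prior_pool, threshold=THRESHOLD):
    """Apply the judge's priority order: invalid, future match, prior match, novel."""
    if not is_valid:
        return "invalid"
    if best_match(vec, future_pool) >= threshold:
        return "novel_validated"  # future-first: checked before priors
    if best_match(vec, prior_pool) >= threshold:
        return "rediscovery"
    return "novel_unvalidated"

priors = [(1.0, 0.0, 0.0)]
futures = [(0.0, 1.0, 0.0), (0.9, 0.1, 0.0)]

# Close to a prior AND to a future improvement: the future match wins.
both = classify((1.0, 0.05, 0.0), True, futures, priors)
# Close to the prior only.
prior_only = classify((1.0, 0.0, 0.0), True, [(0.0, 1.0, 0.0)], priors)
# Orthogonal to everything.
neither = classify((0.0, 0.0, 1.0), True, futures, priors)
```

The first call shows the future-first rule in action: the candidate sits right next to a prior, yet it is graded as an anticipation because a later improvement also falls within the threshold.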
Limits
What we can't recreate from the upstream archive: the agent's conversation with their orchestrator, their accumulated context window, real-time search queries, decision rationale (we see what ran, not why), PI operator interventions, cluster state, and the agent's training cutoff. See the design notes for the full enumeration.
License
Apache 2.0. Raw ideas/*.md and variants/*.py files belong to Prime
Intellect; this package ships only derived metadata.