Benchmark for measuring novelty in autonomous research-agent proposals, built on Prime Intellect's autonomous-speedrunning archive.

autoresearch-novelty-bench

A benchmark for testing whether an autonomous AI research agent proposes novel, mechanism-distinct hypotheses that anticipate breakthroughs later found by other researchers.

By Evo. Built on Prime Intellect's autonomous-speedrunning archive, in which two AI agents (Claude Code and Codex) compete on modded-nanogpt's optimization speedrun.

What is this

You give your proposer agent a snapshot in time: "here's where two agents got on the modded-nanogpt speedrun by 2026-05-12 11:17." The proposer writes N candidate ideas for how to go faster, and the benchmark grades each candidate:

Rediscovery? It matches something already tried before this moment.
Anticipation? It matches something that other researchers proved out later (you predicted a winning thread). Scored as novel_validated.
Novel? Mechanistically new, with no prior or future match. Scored as novel_unvalidated.
Invalid? Malformed or missing required structure.

The set score also adds a diversity bonus for keeping the N candidates mechanistically distinct, and a validity term for the fraction of the N that parse cleanly.

How the dataset was built

The upstream artifact is Prime Intellect's raw archive: 10,380 training runs, 646 idea writeups, 337,648 validation checkpoints, across 8 scratchpads (2 agents × 4 waves). The build pipeline turns that into a clean evaluation dataset.

1. Parse everything. Walk each scratchpad, normalize per-wave schemas, and parse training logs into per-checkpoint validation trajectories. Backfill timestamps from log headers (### START_ISO=), which recovers real launch times for 99.5% of runs.
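A minimal sketch of the timestamp backfill, assuming the header line carries an ISO 8601 value directly after `### START_ISO=`. The function name `launch_time_from_log` and the exact header layout are illustrative assumptions, not the package's API.

```python
import re
from datetime import datetime
from typing import Optional

# Assumed header shape: "### START_ISO=2026-05-12T11:17:00"
START_ISO_RE = re.compile(r"^###\s*START_ISO=(\S+)", re.MULTILINE)


def launch_time_from_log(log_text: str) -> Optional[datetime]:
    """Return the launch time found in a log header, or None if absent or unparseable."""
    match = START_ISO_RE.search(log_text)
    if match is None:
        return None
    try:
        return datetime.fromisoformat(match.group(1))
    except ValueError:
        # Header present but the value is not a valid ISO timestamp.
        return None
```

Returning None instead of raising lets the caller count the runs where the backfill fails.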

2. Link runs to ideas. Five-tier matcher (heuristic prefix, abbreviation index, THREAD walk, LLM matcher, dedup) links 63% of runs to a formal idea writeup.

3. Recover variant.py for every run — four cascade passes:

pass	trick	recovered
extract from log	scripts dump source to stdout before a `====` delimiter	1,120
dynamo warnings	pytorch's recompile logs include the variant.py path	1,981
sbatch stubs	each `<run_id>.sh` carries `--variant ".../<file>.py"`	471
name-stem fuzzy	`<mechanism>-s0` → `train_gpt_simple_<mechanism>.py`	1,630

Net: 54% → 95.6% of runs have variant.py linkage.
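The cascade can be sketched as an ordered list of passes where the first hit wins. Only two of the four passes are shown, and everything here is an assumption for illustration: the function names, the regexes, the `-s<digits>` suffix pattern, and the omission of any fuzzy matching.

```python
import re
from typing import Callable, Optional


def from_sbatch_stub(stub_text: str) -> Optional[str]:
    """Pull the file name out of a `--variant ".../<file>.py"` argument."""
    match = re.search(r'--variant\s+"([^"]+\.py)"', stub_text)
    return match.group(1).rsplit("/", 1)[-1] if match else None


def from_name_stem(run_id: str) -> Optional[str]:
    """Map `<mechanism>-s0` to `train_gpt_simple_<mechanism>.py` (exact, not fuzzy)."""
    match = re.fullmatch(r"(.+)-s\d+", run_id)
    return f"train_gpt_simple_{match.group(1)}.py" if match else None


def recover_variant(run_id: str, stub_text: str) -> Optional[tuple[str, str]]:
    """Try each pass in order; return (pass_name, variant_file) for the first hit."""
    passes: list[tuple[str, Callable[[], Optional[str]]]] = [
        ("sbatch_stub", lambda: from_sbatch_stub(stub_text)),
        ("name_stem", lambda: from_name_stem(run_id)),
    ]
    for name, attempt in passes:
        found = attempt()
        if found is not None:
            return name, found
    return None
```

Recording which pass produced the link keeps the weaker heuristics auditable afterwards.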

4. Build the experiments catalog. One row per training run (10,380), with timestamp, verdict (improved/neutral/regressed/inconclusive), best step count, linked variant + idea.

5. Write a description for every experiment. Three sources, in priority order: the parent idea writeup (~6,580 rows), the variant docstring, or, for the ~3,800 runs without either, an LLM (gpt-4o-mini, ~$0.26 total) that decodes the run_id shorthand and log preamble into a 1-2 sentence mechanism summary.

6. Embed every description. Two backends shipped: OpenAI text-embedding-3-large (3072-dim, $0.51 total) and BGE-large (1024-dim, free local fallback).

7. Build 40 snapshots. 5 time-anchored evaluation positions per scope (at the 10th/30th/50th/70th/90th percentiles of the scope's run arc), across 8 scopes. Each snapshot freezes: priors visible by time T, future improvements still ahead, rejected ideas the agent had killed by T, cross-agent visible work, and the current best recipe.
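One way to picture the time anchoring, as a sketch only. It assumes percentiles are taken over the sorted run order (not wall-clock span) and that each run is a dict with `launched` and `verdict` keys; both are illustrative assumptions, and the toy data below is invented.

```python
from datetime import datetime, timedelta

PERCENTILES = (10, 30, 50, 70, 90)


def anchor_times(launch_times: list[datetime]) -> list[datetime]:
    """Pick the launch time sitting at each percentile of the sorted run arc."""
    ordered = sorted(launch_times)
    last = len(ordered) - 1
    return [ordered[round(last * p / 100)] for p in PERCENTILES]


def split_at(runs: list[dict], cutoff: datetime) -> tuple[list[dict], list[dict]]:
    """Priors launched at or before the cutoff; futures are later runs that improved."""
    priors = [r for r in runs if r["launched"] <= cutoff]
    futures = [r for r in runs if r["launched"] > cutoff and r["verdict"] == "improved"]
    return priors, futures


# Toy scope: 11 runs launched one hour apart; odd-numbered runs improved.
start = datetime(2026, 5, 1)
runs = [
    {"launched": start + timedelta(hours=i), "verdict": "improved" if i % 2 else "neutral"}
    for i in range(11)
]
anchors = anchor_times([r["launched"] for r in runs])
priors, futures = split_at(runs, anchors[2])  # the 50th-percentile snapshot
```

The strict `>` on the future side keeps a run from appearing in both pools of the same snapshot.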

8. Calibrate the judge. Generate 78 labeled cases (sampled from priors, futures, paraphrases, cross-scopes, malformed stubs) → sweep cosine threshold → tune to 0.75 (87% accuracy).
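The threshold sweep can be sketched as a plain accuracy maximization over labeled cases. The function name, the `(similarity, is_match)` case format, and the toy values are assumptions for illustration; the real calibration set has 78 cases.

```python
def sweep_threshold(
    cases: list[tuple[float, bool]], thresholds: list[float]
) -> tuple[float, float]:
    """Return (best_threshold, accuracy) over labeled (similarity, is_match) cases."""
    best_threshold, best_accuracy = thresholds[0], -1.0
    for threshold in thresholds:
        # A case is correct when "similarity >= threshold" agrees with its label.
        correct = sum((sim >= threshold) == is_match for sim, is_match in cases)
        accuracy = correct / len(cases)
        if accuracy > best_accuracy:
            best_threshold, best_accuracy = threshold, accuracy
    return best_threshold, best_accuracy


# Toy cases, not taken from the calibration data.
toy_cases = [(0.9, True), (0.8, True), (0.7, False), (0.6, False)]
best = sweep_threshold(toy_cases, [0.5, 0.75, 0.95])
```

On ties this keeps the lowest threshold tried first, which favors recall; a different tie-break would be equally defensible.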

Reproduce from scratch:

git clone https://github.com/PrimeIntellect-ai/experiments-autonomous-speedrunning <PI>
python -m novelty_bench.indexer                       --pi-repo <PI>
python -m novelty_bench.indexer.extract_embedded_sources --pi-repo <PI>
python -m novelty_bench.indexer.link_via_dynamo_warnings --pi-repo <PI>
python -m novelty_bench.indexer.link_via_sbatch_stubs    --pi-repo <PI>
python -m novelty_bench.indexer.link_via_name_stem       --pi-repo <PI>
python -m novelty_bench.indexer.experiments_llm_augment  --pi-repo <PI>
python -m novelty_bench.embeddings --table experiments --backend openai
python -m novelty_bench.embeddings --table experiments --backend bge
python -m novelty_bench.calibration.generate
python -m novelty_bench.calibration.evaluate --sweep

Usage

pip install evo-autoresearch-novelty-bench

The parquet tables auto-download from evo-hq/autoresearch-novelty-bench on first use.

import novelty_bench as nb

# Browse the dataset
nb.load_experiments(agent="codex", wave="v3", verdict="improved")
nb.load_snapshots(split="dev")     # 16 dev snapshots for tuning
nb.load_snapshots(split="test")    # 24 held-out test snapshots
nb.load_ideas()                    # 646 formal writeups

# Score one of your proposer's outputs
snaps = nb.load_snapshots()
snap = snaps[snaps["snapshot_id"] == "snap_codex_v3_k0922"].iloc[0]
result = nb.score(snap, ["candidate_1.md", "candidate_2.md", "candidate_3.md"])
print(result.explain())

Render a working directory for your proposer to operate in (via the scaffold repo):

git clone https://github.com/evo-hq/autoresearch-novelty-bench-scaffold
cd autoresearch-novelty-bench-scaffold
python build.py --snapshot snap_codex_v3_k0922 --out /tmp/workspace
# Your proposer agent reads /tmp/workspace/AGENTS.md and writes
# N candidates into /tmp/workspace/scratchpad/ideas/{slug}.md

The scaffold is blank-slate: the proposer sees the goal and the wave constraint, nothing more, and uses its own tools (web search, paper retrieval) to gather context. Why blank-slate: PI's archive doesn't preserve per-file creation dates, so mirroring the prior state at a past moment would necessarily leak future work.

Scoring

set_score = sum(per_candidate_scores)
          + 0.5 × diversity_bonus          # mean pairwise cosine distance between N candidates
          + 0.1 × validity_term            # fraction of N that passed the structural check
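A worked sketch of the formula above in plain Python, assuming "cosine distance" means 1 minus cosine similarity and that a single candidate earns no diversity bonus. The function names are illustrative, not the package's API.

```python
import math
from itertools import combinations


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def diversity_bonus(embeddings: list[list[float]]) -> float:
    """Mean pairwise cosine distance between candidate embeddings."""
    pairs = list(combinations(embeddings, 2))
    if not pairs:
        return 0.0
    return sum(1.0 - cosine(a, b) for a, b in pairs) / len(pairs)


def set_score(scores: list[float], embeddings: list[list[float]], n_valid: int) -> float:
    """sum(per-candidate scores) + 0.5 * diversity bonus + 0.1 * valid fraction."""
    validity_term = n_valid / len(scores)
    return sum(scores) + 0.5 * diversity_bonus(embeddings) + 0.1 * validity_term


# Two orthogonal candidates: one novel_unvalidated (+0.3), one rediscovery (-0.5).
example = set_score([0.3, -0.5], [[1.0, 0.0], [0.0, 1.0]], n_valid=2)
```

Here the per-candidate sum is -0.2, the diversity bonus contributes 0.5 × 1.0, and the validity term contributes 0.1 × 1.0, for a set score of about 0.4.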

For each candidate, the judge assigns the first outcome that applies, in this priority order:

order	outcome	score	when
1	invalid	−1.0	missing `## Proposal` section, no title, or body < 50 chars
2	novel_validated	+0.4 to +1.0	matches a future-improved experiment; multiplier depends on which pool (frontier_idea=1.0, improved_idea=0.6, frontier_experiment=0.5, improved_experiment=0.4)
3	rediscovery	−0.5 to −1.0	matches a prior; harsher if that prior was already rejected (failed×1.4, family_ruled_out×1.6, audit_noncompliant×1.6, existence_killed×2.0)
4	novel_unvalidated	+0.3	no future match, no prior match
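The priority order in the table can be sketched as a chain of early returns. This assumes a base of +1.0 scaled by the pool multiplier for validated matches and a base of −0.5 scaled by the rejection multiplier for rediscoveries, which is consistent with the stated ranges but not confirmed by the source. The `classify` signature is hypothetical.

```python
from typing import Optional

FUTURE_POOL_MULTIPLIER = {
    "frontier_idea": 1.0,
    "improved_idea": 0.6,
    "frontier_experiment": 0.5,
    "improved_experiment": 0.4,
}
REJECTION_MULTIPLIER = {
    "failed": 1.4,
    "family_ruled_out": 1.6,
    "audit_noncompliant": 1.6,
    "existence_killed": 2.0,
}


def classify(
    valid: bool,
    future_pool: Optional[str],
    prior_matched: bool,
    prior_rejection: Optional[str] = None,
) -> tuple[str, float]:
    """Return (outcome, score), checking outcomes in the table's priority order."""
    if not valid:
        return "invalid", -1.0
    if future_pool is not None:
        # Future-first: a future match wins even if a prior also matches.
        return "novel_validated", 1.0 * FUTURE_POOL_MULTIPLIER[future_pool]
    if prior_matched:
        # Assumed: a prior that was never rejected keeps the base penalty.
        return "rediscovery", -0.5 * REJECTION_MULTIPLIER.get(prior_rejection, 1.0)
    return "novel_unvalidated", 0.3
```

For example, `classify(True, None, True, "existence_killed")` yields the harshest rediscovery penalty, −1.0.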

Matching = cosine similarity ≥ 0.75 over precomputed text embeddings of each experiment's description. Future-first priority: copying a prior that ended up on a winning thread counts as anticipation, not rediscovery.
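A small sketch of threshold matching with future-first priority, using plain lists in place of the precomputed embedding matrices. The pool layout (a dict from experiment id to vector) and the function names are assumptions for illustration.

```python
import math
from typing import Optional

THRESHOLD = 0.75  # calibrated cosine threshold


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))


def best_match(
    candidate: list[float], pool: dict[str, list[float]]
) -> Optional[tuple[str, float]]:
    """Return (id, similarity) of the closest pool entry at or above THRESHOLD."""
    best_id, best_sim = None, -1.0
    for entry_id, vector in pool.items():
        sim = cosine(candidate, vector)
        if sim > best_sim:
            best_id, best_sim = entry_id, sim
    return (best_id, best_sim) if best_id is not None and best_sim >= THRESHOLD else None


def match_kind(candidate, future_pool, prior_pool) -> str:
    """Check the future pool first, so a winning thread counts as anticipation."""
    if best_match(candidate, future_pool) is not None:
        return "future"
    if best_match(candidate, prior_pool) is not None:
        return "prior"
    return "none"
```

With unit-normalized embeddings the loop reduces to a single matrix-vector product per pool.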

The judge is deterministic and fast: no LLM call, just one embedding of your candidate (~50ms, ~$0.0003) and a matrix dot-product against the snapshot's prior/future pools. Per-candidate cost: a fraction of a cent.

Limits

What we can't recreate from the upstream archive: the agent's conversation with its orchestrator, its accumulated context window, real-time search queries, decision rationale (we see what ran, not why), PI operator interventions, cluster state, and the agent's training cutoff. See the design notes for the full enumeration.

License

Apache 2.0. Raw ideas/*.md and variants/*.py files belong to Prime Intellect; this package ships only derived metadata.

Release history

0.1.1: May 17, 2026
0.1.0: May 17, 2026 (this version)

