Surprisal
MCTS with Bayesian surprise for open-ended scientific discovery.
Surprisal is inspired by AllenAI's AutoDiscovery and the Surprisal-Guided Selection paper cited below. It explores a research domain by generating literature-grounded hypotheses, running bounded experiments in a sandbox with real tools and network access, and ranking branches by how much the evidence changes the model's beliefs.
Quick start
curl -fsSL https://raw.githubusercontent.com/jbarnes850/surprisal/main/install.sh | bash
uv run surprisal init \
--domain "AI for scientific discovery" \
--seed "LLM self-evaluation accuracy drops as task compositional depth increases"
uv run surprisal explore --budget 10 --concurrency 1
uv run surprisal status --tree
uv run surprisal export --top 5 --format md
The default backend (auto) runs experiments directly on your host with no Docker dependency. Progress streams through generator, runner, review, and belief phases.
If you switch to backend = "docker" for sandboxed execution, Surprisal will build surprisal-cpu:latest on first run and prompt for a claude setup-token if your CLI auth is subscription-backed.
Codex-based analysis and review stages run from per-experiment workspaces under /tmp/.../experiments/node_*, so the CLI invocation explicitly skips git-repo enforcement there.
What it does
Each expansion runs a per-node FSM:
- experiment_generator: Claude searches recent literature and proposes one hypothesis plus one executable plan.
- experiment_runner: a sandbox backend executes the plan with Python, Bash, local files, public network access, HuggingFace resources, and optional W&B logging.
- experiment_analyst: Codex or Claude reviews the execution for fidelity and validity.
- experiment_reviewer: Codex or Claude decides whether the evidence is usable.
- experiment_reviser: if needed, the plan is revised and retried within configured bounds.
- hypothesis_generator: Claude formalizes the post-experiment hypothesis record.
- belief_elicitation: Claude samples prior and posterior binary judgments and Surprisal computes Bayesian surprise.
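The stage names above suggest a simple transition table. The sketch below is illustrative only: the stage names come from this README, but the transition logic and the `next_state` helper are assumptions, not the actual implementation in src/surprisal/fsm_runner.py.

```python
# Hypothetical transition table for the per-node FSM described above.
# The reviewer stage branches on its verdict; all other stages are linear.
TRANSITIONS = {
    "experiment_generator": "experiment_runner",
    "experiment_runner": "experiment_analyst",
    "experiment_analyst": "experiment_reviewer",
    "experiment_reviewer": {"accept": "hypothesis_generator",
                            "revise": "experiment_reviser"},
    "experiment_reviser": "experiment_runner",   # retry within configured bounds
    "hypothesis_generator": "belief_elicitation",
    "belief_elicitation": None,                  # terminal: compute surprise
}

def next_state(state, verdict=None):
    """Return the next FSM stage, branching on the reviewer's verdict."""
    nxt = TRANSITIONS[state]
    if isinstance(nxt, dict):
        return nxt[verdict]
    return nxt
```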
The deterministic MCTS layer never calls LLMs directly. It only consumes node state and reward signals.
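To make the deterministic layer concrete, here is a minimal sketch of UCT scoring with progressive widening and virtual loss, using the config knobs documented below (c_explore, k_progressive, alpha_progressive, virtual_loss). Function names and exact formulas are assumptions for illustration; the real logic lives in src/surprisal/mcts.py.

```python
import math

def uct_score(child_value, child_visits, parent_visits,
              c_explore=1.414, virtual_loss=0):
    """UCT score for a child node. Virtual loss inflates the visit count
    while a worker holds the branch, discouraging concurrent workers from
    selecting the same node (illustrative formula, not the shipped one)."""
    visits = child_visits + virtual_loss
    if visits == 0:
        return float("inf")  # unvisited children are explored first
    exploit = child_value / visits
    explore = c_explore * math.sqrt(math.log(parent_visits) / visits)
    return exploit + explore

def allowed_children(parent_visits, k=1.0, alpha=0.5):
    """Progressive widening: a node with n visits may expand at most
    ceil(k * n**alpha) children, so branching grows sublinearly."""
    return max(1, math.ceil(k * parent_visits ** alpha))
```

With the defaults, a node needs 100 visits before it may hold 10 children, which keeps expansion budget concentrated on promising branches.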
Runtime model
- Claude is required for research-facing roles: generator, hypothesis formalization, and belief elicitation.
- If Codex is available, it handles analysis, review, and revision roles.
- If Codex is not available, Claude handles all roles.
- Agent sessions persist per branch in sessions.json: Claude research sessions, code-analysis sessions, and runner sessions are tracked separately and resumed automatically across nodes on the same branch.
- Belief elicitation forks from the persisted research session instead of mutating it, so prior and posterior samples stay independent while still inheriting branch context.
- Experiment execution uses the configured sandbox backend:
  - auto (default): host-native runner, no Docker required, GPU autodetection
  - docker: Docker-based sandbox for isolated execution (requires Docker + claude setup-token)
  - hf_jobs: one-shot Hugging Face Jobs execution path for remote batch runs
Commands
| Command | Purpose | Machine-readable output |
|---|---|---|
| surprisal init | Create or reuse an exploration for a domain | --json |
| surprisal explore | Run exploration on the latest or a specific exploration | --json |
| surprisal status | Show exploration summary and optional tree | --json |
| surprisal export | Export results as markdown, CSV, JSON, or JSONL training data | --format json or --json |
| surprisal resume | Alias for explore against the latest or a specific exploration | --json |
| surprisal prune | Mark low-value branches as pruned | --json |
| surprisal config | Show, set, or reset config | --json |
resume resumes an exploration, not a per-agent conversational session.
Architecture
Three layers:
- src/surprisal/mcts.py: deterministic tree policy, UCT scoring, progressive widening, and backpropagation.
- src/surprisal/db.py, src/surprisal/exploration.py, src/surprisal/workspace.py: SQLite WAL persistence plus per-branch workspaces.
- src/surprisal/orchestrator.py, src/surprisal/fsm_runner.py: async worker orchestration and the multi-agent experiment FSM.
Key files:
- src/surprisal/fsm_runner.py: per-node live FSM
- src/surprisal/orchestrator.py: worker pool, selection, branching, and dedup scheduling
- src/surprisal/bayesian.py: Beta posterior updates and belief-shift scoring
- src/surprisal/prompts/: prompt contracts for generator, runner, analyst, reviewer, reviser, and belief stages
Configuration
Exploration state defaults to ~/.surprisal.
Config is loaded from:
- ${SURPRISAL_HOME}/config.toml when SURPRISAL_HOME is set
- ~/.surprisal/config.toml when that file exists
- otherwise ${XDG_CONFIG_HOME:-~/.config}/surprisal/config.toml
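The precedence above can be sketched as a small resolver. This is an illustrative reimplementation of the documented lookup order, not the project's actual loader:

```python
import os
from pathlib import Path

def config_path() -> Path:
    """Resolve config.toml using the documented precedence:
    SURPRISAL_HOME, then ~/.surprisal, then the XDG config dir."""
    home = os.environ.get("SURPRISAL_HOME")
    if home:
        return Path(home) / "config.toml"
    default = Path.home() / ".surprisal" / "config.toml"
    if default.exists():
        return default
    xdg = os.environ.get("XDG_CONFIG_HOME", str(Path.home() / ".config"))
    return Path(xdg) / "surprisal" / "config.toml"
```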
Show the active config:
uv run surprisal config --show
Live config knobs:
| Setting | Default | Description |
|---|---|---|
| general.default_budget | 100 | Default exploration budget |
| general.default_concurrency | 2 | Default worker count |
| mcts.c_explore | 1.414 | UCT exploration constant |
| mcts.k_progressive | 1.0 | Progressive widening coefficient |
| mcts.alpha_progressive | 0.5 | Progressive widening exponent |
| mcts.max_depth | 30 | Maximum tree depth |
| mcts.belief_samples | 10 | Samples per prior and posterior belief phase (set higher for publication-grade runs) |
| mcts.virtual_loss | 2 | Virtual loss applied during parallel selection |
| mcts.dedup_interval | 50 | Run deduplication every N completed expansions |
| agents.claude_model | opus | Claude model for research roles |
| agents.codex_model | gpt-5.4 | Codex model for analysis, review, and revision roles |
| agents.max_turns | 20 | Max Claude turns per invocation |
| agents.code_attempts | 6 | Total runner attempts before failure |
| agents.revision_attempts | 1 | Total plan revisions after rejection |
| agents.generator_timeout | 180 | Generator timeout in seconds |
| sandbox.backend | auto | auto (host-native, recommended), docker (sandboxed), or hf_jobs (remote) |
| sandbox.image | auto | Docker sandbox image tag (only used with backend = "docker") |
| sandbox.gpu | true | Enable GPU passthrough for the Docker sandbox |
| sandbox.memory_limit | 16g | Docker sandbox memory limit |
| sandbox.cpu_limit | 4 | Docker sandbox CPU limit |
| sandbox.timeout | 1800 | Sandbox timeout in seconds |
| sandbox.network | true | Allow public network access in the sandbox |
| sandbox.hf_flavor | a10g-small | HF Jobs hardware flavor |
| sandbox.hf_timeout | 2h | HF Jobs timeout |
| belief.provider | claude | Belief elicitation provider: claude (Likert sampling) or openrouter (logprob-based) |
| belief.model | "" | OpenRouter model ID for belief elicitation (e.g., minimax/minimax-m2.5) |
| belief.samples | 30 | Samples per prior and posterior belief phase |
| belief.kl_scale | 5.0 | KL divergence scaling factor for Bayesian surprise |
| belief.evidence_weight | 2.0 | Evidence weight for posterior Beta fitting |
| credentials.wandb_api_key | "" | Optional W&B API key |
| credentials.hf_token | "" | Optional HuggingFace token |
| credentials.claude_oauth_token | "" | Cached Claude OAuth token for Docker runner (auto-prompted on first run) |
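A config.toml overriding a few of these knobs might look like the following. The section and key names come from the table above; the chosen values are hypothetical:

```toml
# Hypothetical excerpt of ~/.surprisal/config.toml
[mcts]
c_explore = 1.0        # explore less aggressively than the 1.414 default
belief_samples = 30    # publication-grade belief sampling

[sandbox]
backend = "docker"     # isolated execution instead of host-native
memory_limit = "8g"

[belief]
provider = "openrouter"
model = "minimax/minimax-m2.5"
```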
Belief calibration
Surprisal computes Bayesian surprise by comparing prior and posterior belief distributions. Two providers are available:
- Claude (default): samples Likert-scale judgments (definitely_true through definitely_false) via concurrent Claude calls. Higher fidelity but more API calls.
- OpenRouter: single-call logprob-based estimation. Faster and cheaper. Requires an OpenRouter API key.
To use OpenRouter belief elicitation:
cp .env.example .env
# Add your OpenRouter API key to .env
uv run surprisal config --set belief.provider openrouter
uv run surprisal config --set belief.model minimax/minimax-m2.5
Prior beliefs are clamped to [0.1, 0.9] to prevent degenerate Beta distributions from overconfident models. A calibration warning is logged when clamping shifts the prior mean by more than 0.05.
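The clamping and scoring described above can be sketched as follows. Note the simplification: the real scorer fits Beta distributions (src/surprisal/bayesian.py), while this sketch uses Bernoulli KL between belief means, and the exponential squash into [0, 1] is an assumption, not the shipped formula:

```python
import math

def clamp_prior(p, lo=0.1, hi=0.9):
    """Clamp an elicited prior into [0.1, 0.9] so overconfident models
    cannot produce degenerate distributions."""
    return min(max(p, lo), hi)

def bernoulli_kl(post, prior):
    """KL(posterior || prior) between Bernoulli beliefs; both probabilities
    are assumed to lie strictly inside (0, 1)."""
    return (post * math.log(post / prior)
            + (1 - post) * math.log((1 - post) / (1 - prior)))

def surprise(prior_mean, posterior_mean, kl_scale=5.0):
    """Scaled Bayesian surprise: clamp the prior, measure the belief
    shift as KL, then squash into [0, 1] (hypothetical squash)."""
    prior = clamp_prior(prior_mean)
    kl = bernoulli_kl(posterior_mean, prior)
    return 1 - math.exp(-kl / kl_scale)
```

Evidence that fails to move the belief scores zero surprise, while a flip from a confident prior scores highest, which is exactly what drives branch ranking.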
Literature grounding
The generator prefers alphaxiv MCP when available and falls back to the HuggingFace Papers API otherwise.
One-time alphaxiv setup:
claude mcp add --transport http alphaxiv https://api.alphaxiv.org/mcp/v1
Each hypothesis stores the papers that motivated it.
Validation
Run the test suite:
uv run pytest tests/ -q --tb=short
References
- Agarwal et al., AutoDiscovery: Open-ended Scientific Discovery via Bayesian Surprise
- Shi and Evans, Surprising combinations of research contents and contexts are related to impact
- Barnes et al., Surprisal-Guided Selection
License
MIT
File details
Details for the file surprisal_search-0.1.0.tar.gz.
File metadata
- Download URL: surprisal_search-0.1.0.tar.gz
- Upload date:
- Size: 5.9 MB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.12.13
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | dc6542b18c63dc118a351162492b20adc770ddf1a4e8795c4dc7d2ac412400cc |
| MD5 | ee8ffaae2f59e7c980c646d1093b3010 |
| BLAKE2b-256 | d68ab658a9162eb783273f80469f4ee6d4fca7287ce24b0f33de73d4cccb8230 |
File details
Details for the file surprisal_search-0.1.0-py3-none-any.whl.
File metadata
- Download URL: surprisal_search-0.1.0-py3-none-any.whl
- Upload date:
- Size: 65.6 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.12.13
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | dde3b7e488695c32471950c3e01b08a8f46a26c6163e515ef9e3385406df3cf3 |
| MD5 | 647b9f3191eca499a18f3c8985cc344d |
| BLAKE2b-256 | b23edf2ff6e653b2795c1992390e2bfd475bda04ecd63be2791c2ebc51ce4812 |