Surprisal

MCTS with Bayesian surprise for open-ended scientific discovery.

Surprisal is inspired by AllenAI's AutoDiscovery and the Surprisal-Guided Selection paper cited below. It explores a research domain by generating literature-grounded hypotheses, running bounded experiments in a sandbox with real tools and network access, and ranking branches by how much the evidence changes the model's beliefs.

Quick start

curl -fsSL https://raw.githubusercontent.com/jbarnes850/surprisal/main/install.sh | bash

uv run surprisal init \
  --domain "AI for scientific discovery" \
  --seed "LLM self-evaluation accuracy drops as task compositional depth increases"

uv run surprisal explore --budget 10 --concurrency 1
uv run surprisal status --tree
uv run surprisal export --top 5 --format md

The default backend (auto) runs experiments directly on your host with no Docker dependency. Progress streams through generator, runner, review, and belief phases.

If you switch to backend = "docker" for sandboxed execution, Surprisal will build surprisal-cpu:latest on first run and prompt for a claude setup-token if your CLI auth is subscription-backed.

Codex-based analysis and review stages run from per-experiment workspaces under /tmp/.../experiments/node_*, so the CLI invocation explicitly skips git-repo enforcement there.

What it does

Each expansion runs a per-node FSM:

  1. experiment_generator: Claude searches recent literature and proposes one hypothesis plus one executable plan.
  2. experiment_runner: a sandbox backend executes the plan with Python, Bash, local files, public network access, HuggingFace resources, and optional W&B logging.
  3. experiment_analyst: Codex or Claude reviews the execution for fidelity and validity.
  4. experiment_reviewer: Codex or Claude decides whether the evidence is usable.
  5. experiment_reviser: if needed, the plan is revised and retried within configured bounds.
  6. hypothesis_generator: Claude formalizes the post-experiment hypothesis record.
  7. belief_elicitation: Claude samples prior and posterior binary judgments and Surprisal computes Bayesian surprise.
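
The seven stages above form a small state machine; a minimal sketch of one way to encode the transitions, including the reviewer-rejection loop through the reviser. The enum names track the stage list, but the function shape and bookkeeping are illustrative, not the shipped `fsm_runner` implementation:

```python
from enum import Enum, auto

class Stage(Enum):
    GENERATOR = auto()   # 1. propose hypothesis + executable plan
    RUNNER = auto()      # 2. execute the plan in the sandbox
    ANALYST = auto()     # 3. review execution fidelity and validity
    REVIEWER = auto()    # 4. decide whether the evidence is usable
    REVISER = auto()     # 5. revise the plan within configured bounds
    HYPOTHESIS = auto()  # 6. formalize the post-experiment hypothesis
    BELIEF = auto()      # 7. sample beliefs, compute Bayesian surprise
    DONE = auto()
    FAILED = auto()

def next_stage(stage: Stage, *, accepted: bool = True,
               revisions_left: int = 1) -> Stage:
    """Advance the per-node FSM one step (illustrative)."""
    if stage is Stage.REVIEWER and not accepted:
        # Rejected evidence: revise if the budget allows, else give up.
        return Stage.REVISER if revisions_left > 0 else Stage.FAILED
    if stage is Stage.REVISER:
        return Stage.RUNNER  # retry the revised plan
    order = [Stage.GENERATOR, Stage.RUNNER, Stage.ANALYST,
             Stage.REVIEWER, Stage.HYPOTHESIS, Stage.BELIEF, Stage.DONE]
    return order[order.index(stage) + 1]
```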

The deterministic MCTS layer never calls LLMs directly. It only consumes node state and reward signals.
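
A minimal sketch of what that deterministic layer computes, assuming standard UCT with virtual loss plus progressive widening (the constants mirror the `mcts.*` config defaults; function names and signatures are illustrative):

```python
import math

def uct_score(parent_visits: int, child_visits: int, child_total_reward: float,
              c_explore: float = 1.414, virtual_loss: int = 0) -> float:
    """UCT value for one child. Virtual loss inflates the visit count so
    concurrent workers avoid piling onto the same branch."""
    n = child_visits + virtual_loss
    if n == 0:
        return math.inf  # always try unvisited children first
    exploit = child_total_reward / n
    explore = c_explore * math.sqrt(math.log(parent_visits) / n)
    return exploit + explore

def can_widen(num_children: int, node_visits: int,
              k: float = 1.0, alpha: float = 0.5) -> bool:
    """Progressive widening: allow a new child while
    |children| < k * visits**alpha."""
    return num_children < k * node_visits ** alpha
```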

Runtime model

  • Claude is required for research-facing roles: generator, hypothesis formalization, and belief elicitation.
  • If Codex is available, it handles analysis, review, and revision roles.
  • If Codex is not available, Claude handles all roles.
  • Agent sessions persist per branch in sessions.json: Claude research sessions, code-analysis sessions, and runner sessions are tracked separately and resumed automatically across nodes on the same branch.
  • Belief elicitation forks from the persisted research session instead of mutating it, so prior and posterior samples stay independent while still inheriting branch context.
  • Experiment execution uses the configured sandbox backend:
    • auto (default): host-native runner, no Docker required, GPU autodetection
    • docker: Docker-based sandbox for isolated execution (requires Docker + claude setup-token)
    • hf_jobs: one-shot Hugging Face Jobs execution path for remote batch runs
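
Switching backends is a config edit; an illustrative config.toml fragment using the sandbox keys documented in the Configuration section (values here are examples, not recommendations):

```toml
# Illustrative fragment: opt into the Docker sandbox.
[sandbox]
backend = "docker"             # auto | docker | hf_jobs
image = "surprisal-cpu:latest"
gpu = false                    # disable GPU passthrough
memory_limit = "16g"
cpu_limit = 4
timeout = 1800                 # seconds
network = true
```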

Commands

Each command supports machine-readable output via the flag in parentheses.

  • surprisal init: create or reuse an exploration for a domain (--json)
  • surprisal explore: run exploration on the latest or a specific exploration (--json)
  • surprisal status: show exploration summary and optional tree (--json)
  • surprisal export: export results as Markdown, CSV, JSON, or JSONL training data (--format json or --json)
  • surprisal resume: alias for explore against the latest or a specific exploration (--json)
  • surprisal prune: mark low-value branches as pruned (--json)
  • surprisal config: show, set, or reset config (--json)

resume resumes an exploration, not a per-agent conversational session.

Architecture

Three layers:

  1. src/surprisal/mcts.py Deterministic tree policy, UCT scoring, progressive widening, and backpropagation.
  2. src/surprisal/db.py, src/surprisal/exploration.py, src/surprisal/workspace.py SQLite WAL persistence plus per-branch workspaces.
  3. src/surprisal/orchestrator.py, src/surprisal/fsm_runner.py Async worker orchestration and the multi-agent experiment FSM.

Key files:

  • src/surprisal/fsm_runner.py: per-node live FSM
  • src/surprisal/orchestrator.py: worker pool, selection, branching, and dedup scheduling
  • src/surprisal/bayesian.py: Beta posterior updates and belief-shift scoring
  • src/surprisal/prompts/: prompt contracts for generator, runner, analyst, reviewer, reviser, and belief stages
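
The belief-shift scoring in src/surprisal/bayesian.py is described as Beta posterior updates plus KL-based surprise; a minimal sketch of that computation, assuming surprise is the KL divergence from posterior to prior Beta scaled by belief.kl_scale. The function names and the finite-difference digamma are illustrative, not the shipped code:

```python
from math import lgamma

def _digamma(x: float, h: float = 1e-6) -> float:
    # Central finite difference on lgamma; accurate enough for a sketch.
    return (lgamma(x + h) - lgamma(x - h)) / (2 * h)

def _log_beta(a: float, b: float) -> float:
    return lgamma(a) + lgamma(b) - lgamma(a + b)

def kl_beta(a1: float, b1: float, a2: float, b2: float) -> float:
    """KL( Beta(a1, b1) || Beta(a2, b2) ), the standard closed form."""
    return (_log_beta(a2, b2) - _log_beta(a1, b1)
            + (a1 - a2) * _digamma(a1)
            + (b1 - b2) * _digamma(b1)
            + (a2 - a1 + b2 - b1) * _digamma(a1 + b1))

def bayesian_surprise(prior: tuple, posterior: tuple,
                      kl_scale: float = 5.0) -> float:
    """Scaled belief shift between prior and posterior Beta parameters."""
    a1, b1 = posterior
    a2, b2 = prior
    return kl_scale * kl_beta(a1, b1, a2, b2)
```

Bigger belief shifts between prior and posterior yield larger KL, hence larger reward for that branch.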

Configuration

Exploration state defaults to ~/.surprisal.

Config is loaded from:

  • ${SURPRISAL_HOME}/config.toml when SURPRISAL_HOME is set
  • ~/.surprisal/config.toml when that file exists
  • otherwise ${XDG_CONFIG_HOME:-~/.config}/surprisal/config.toml
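
The precedence above can be sketched as path logic (function name illustrative):

```python
import os
from pathlib import Path

def resolve_config_path() -> Path:
    """Return the config.toml path using the documented precedence."""
    surprisal_home = os.environ.get("SURPRISAL_HOME")
    if surprisal_home:
        return Path(surprisal_home) / "config.toml"
    legacy = Path.home() / ".surprisal" / "config.toml"
    if legacy.exists():
        return legacy
    xdg = os.environ.get("XDG_CONFIG_HOME") or str(Path.home() / ".config")
    return Path(xdg) / "surprisal" / "config.toml"
```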

Show the active config:

uv run surprisal config --show

Live config knobs:

  • general.default_budget (default: 100): default exploration budget
  • general.default_concurrency (default: 2): default worker count
  • mcts.c_explore (default: 1.414): UCT exploration constant
  • mcts.k_progressive (default: 1.0): progressive widening coefficient
  • mcts.alpha_progressive (default: 0.5): progressive widening exponent
  • mcts.max_depth (default: 30): maximum tree depth
  • mcts.belief_samples (default: 10): samples per prior and posterior belief phase (set higher for publication-grade runs)
  • mcts.virtual_loss (default: 2): virtual loss applied during parallel selection
  • mcts.dedup_interval (default: 50): run deduplication every N completed expansions
  • agents.claude_model (default: opus): Claude model for research roles
  • agents.codex_model (default: gpt-5.4): Codex model for analysis, review, and revision roles
  • agents.max_turns (default: 20): max Claude turns per invocation
  • agents.code_attempts (default: 6): total runner attempts before failure
  • agents.revision_attempts (default: 1): total plan revisions after rejection
  • agents.generator_timeout (default: 180): generator timeout in seconds
  • sandbox.backend (default: auto): auto (host-native, recommended), docker (sandboxed), or hf_jobs (remote)
  • sandbox.image (default: auto): Docker sandbox image tag (only used with backend = "docker")
  • sandbox.gpu (default: true): enable GPU passthrough for the Docker sandbox
  • sandbox.memory_limit (default: 16g): Docker sandbox memory limit
  • sandbox.cpu_limit (default: 4): Docker sandbox CPU limit
  • sandbox.timeout (default: 1800): sandbox timeout in seconds
  • sandbox.network (default: true): allow public network access in the sandbox
  • sandbox.hf_flavor (default: a10g-small): HF Jobs hardware flavor
  • sandbox.hf_timeout (default: 2h): HF Jobs timeout
  • belief.provider (default: claude): belief elicitation provider, claude (Likert sampling) or openrouter (logprob-based)
  • belief.model (default: ""): OpenRouter model ID for belief elicitation (e.g., minimax/minimax-m2.5)
  • belief.samples (default: 30): samples per prior and posterior belief phase
  • belief.kl_scale (default: 5.0): KL divergence scaling factor for Bayesian surprise
  • belief.evidence_weight (default: 2.0): evidence weight for posterior Beta fitting
  • credentials.wandb_api_key (default: ""): optional W&B API key
  • credentials.hf_token (default: ""): optional HuggingFace token
  • credentials.claude_oauth_token (default: ""): cached Claude OAuth token for the Docker runner (auto-prompted on first run)

Belief calibration

Surprisal computes Bayesian surprise by comparing prior and posterior belief distributions. Two providers are available:

  • Claude (default): Samples Likert-scale judgments (definitely_true through definitely_false) via concurrent Claude calls. Higher fidelity but more API calls.
  • OpenRouter: Single-call logprob-based estimation. Faster and cheaper. Requires an OpenRouter API key.
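
A minimal sketch of the Claude path, assuming a five-point Likert scale mapped to point probabilities and a Beta fit via evidence-weighted pseudo-counts. The mapping values and the use of evidence_weight here are assumptions for illustration, not the shipped implementation:

```python
# Hypothetical mapping from Likert categories to point probabilities.
LIKERT_P = {
    "definitely_true": 0.95,
    "probably_true": 0.75,
    "uncertain": 0.50,
    "probably_false": 0.25,
    "definitely_false": 0.05,
}

def fit_beta(samples: list[str], evidence_weight: float = 2.0) -> tuple[float, float]:
    """Turn Likert samples into Beta(alpha, beta) pseudo-counts."""
    probs = [LIKERT_P[s] for s in samples]
    mean = sum(probs) / len(probs)
    # Start from a uniform Beta(1, 1) and add weighted evidence.
    alpha = 1.0 + evidence_weight * len(probs) * mean
    beta = 1.0 + evidence_weight * len(probs) * (1.0 - mean)
    return alpha, beta
```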

To use OpenRouter belief elicitation:

cp .env.example .env
# Add your OpenRouter API key to .env

uv run surprisal config --set belief.provider openrouter
uv run surprisal config --set belief.model minimax/minimax-m2.5

Prior beliefs are clamped to [0.1, 0.9] to prevent degenerate Beta distributions from overconfident models. A calibration warning is logged when clamping shifts the prior mean by more than 0.05.
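
The clamping rule reads as the following sketch (the warning hook and function name are illustrative):

```python
import logging

logger = logging.getLogger("surprisal.belief")

def clamp_priors(priors: list[float], lo: float = 0.1, hi: float = 0.9,
                 warn_shift: float = 0.05) -> list[float]:
    """Clamp prior beliefs to [lo, hi] and warn when clamping moves
    the prior mean by more than warn_shift."""
    clamped = [min(hi, max(lo, p)) for p in priors]
    shift = abs(sum(clamped) / len(clamped) - sum(priors) / len(priors))
    if shift > warn_shift:
        logger.warning("prior clamping shifted mean by %.3f", shift)
    return clamped
```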

Literature grounding

The generator prefers alphaxiv MCP when available and falls back to the HuggingFace Papers API otherwise.

One-time alphaxiv setup:

claude mcp add --transport http alphaxiv https://api.alphaxiv.org/mcp/v1

Each hypothesis stores the papers that motivated it.

Validation

Run the test suite:

uv run pytest tests/ -q --tb=short

References

License

MIT
