Skip to main content

Python implementation of ToolTree (ICLR 2026): MCTS-based tool planning for LLM agents with dual-feedback evaluation and bidirectional pruning

Project description

ToolTree

This is an open-source Python implementation of the paper: ToolTree: Efficient LLM Agent Tool Planning via Dual-Feedback Monte Carlo Tree Search and Bidirectional Pruning Shuo Yang, Soyeon Caren Han, Yihao Ding, Shuhe Wang, Eduard Hoy — ICLR 2026

All credit for the algorithm, equations, and core ideas belongs to the original authors. This repo provides a pip-installable implementation for practical use.


MCTS-based tool planning for LLM agents. Instead of picking tools greedily, ToolTree runs a Monte Carlo Tree Search over possible tool sequences, scores each path with dual LLM feedback, and returns the optimal multi-step plan.


Why ToolTree

When an LLM agent needs to answer a question using tools, it typically tries one tool at a time and hopes for the best. This works for simple tasks but fails on multi-hop reasoning — where the output of tool A determines which tool to call next, and a wrong first choice wastes all subsequent calls.

ToolTree treats tool selection as a search problem:

  • Explores multiple tool sequences in parallel
  • Prunes bad branches before executing them (saves API calls)
  • Prunes branches after seeing bad outputs (stops wasting rollouts)
  • Caches duplicate tool calls across the tree
  • Stops early when the search converges

Install

# Core library (no LLM provider)
pip install tooltree

# With OpenAI (GPT-4o, GPT-5.1, etc.)
pip install tooltree[openai]

# With Google Gemini
pip install tooltree[gemini]

# With Anthropic Claude
pip install tooltree[anthropic]

# Everything
pip install tooltree[all]

Requires Python 3.10+.


Quick Start

from tooltree import ToolTreePlanner, Tool

# 1. Define your tools
def search_web(query: str) -> str:
    # your implementation
    return f"Results for {query}"

search_tool = Tool(
    name="search_web",
    description="Search the web for factual information.",
    parameters={
        "query": {"type": "string", "description": "Search query"},
    },
    execute=search_web,
)

# 2. Create the planner
planner = ToolTreePlanner(
    tools=[search_tool],
    llm="openai:gpt-4o-mini",   # or "gemini:gemini-2.0-flash", "anthropic:claude-haiku-4-5-20251001"
)

# 3. Solve
result = planner.solve_sync("What is the boiling point of ethanol in Celsius?")

print(result.answer)
print(result.tool_trace)   # list of {tool, arguments, output}

Set your API key before running:

# OpenAI
export OPENAI_API_KEY="sk-..."

# Gemini
export GEMINI_API_KEY="..."

# Anthropic
export ANTHROPIC_API_KEY="sk-ant-..."

Multi-hop Example

The real value of ToolTree shows on tasks that require chaining tools — where you can't know which tool to call second until you see the output of the first.

from tooltree import ToolTreePlanner, Tool, Config

tools = [
    Tool(
        name="search_in_file",
        description="Search for a regex pattern in a file. Returns matching lines with line numbers.",
        parameters={
            "path":    {"type": "string", "description": "File path"},
            "pattern": {"type": "string", "description": "Regex pattern"},
        },
        execute=search_in_file,
    ),
    Tool(
        name="read_file_lines",
        description="Read a range of lines from a file. Use after search_in_file.",
        parameters={
            "path":       {"type": "string", "description": "File path"},
            "start_line": {"type": "string", "description": "First line (1-indexed)"},
            "end_line":   {"type": "string", "description": "Last line (inclusive)"},
        },
        execute=read_file_lines,
    ),
]

config = Config(
    max_rollouts=20,
    max_depth=6,
    tau_pre=0.3,
    tau_post=0.4,
)

planner = ToolTreePlanner(tools=tools, llm="openai:gpt-4o-mini", config=config)

result = planner.solve_sync(
    "What is the default value of max_rollouts in the Config class?"
)
# MCTS discovers: search_in_file -> read_file_lines (2-hop chain)

See examples/integration_test.py for a full runnable version with 4 test queries.


Configuration

from tooltree import Config

config = Config(
    max_rollouts=60,           # Max MCTS iterations (default: 60)
    max_depth=15,              # Max tool chain length (default: 15)
    tau_pre=0.3,               # Pre-pruning threshold — candidates below this are dropped before execution
    tau_post=0.4,              # Post-pruning threshold — branches below this stop expanding
    top_k=20,                  # Keep top-K candidates after pre-pruning
    lambda_exploration=1.4,    # UCT exploration constant (paper Eq. 1)
    early_stop_patience=10,    # Stop after N rollouts with no Q improvement
    early_stop_delta=1e-3,     # Minimum Q improvement to count as progress
    tool_timeout=30.0,         # Max seconds per tool call
    max_concurrent_llm_calls=5 # Parallel LLM calls during expand phase
)
Parameter Default What it controls
max_rollouts 60 How many MCTS iterations to run. More = better quality, higher cost.
max_depth 15 Maximum number of tool calls in a single path.
tau_pre 0.3 Pre-execution pruning threshold. Raise to prune more aggressively.
tau_post 0.4 Post-execution pruning threshold. Raise to abandon bad branches faster.
top_k 20 Max candidates kept after pre-pruning.
lambda_exploration 1.4 UCT exploration constant. Higher = more exploration of untested paths.
early_stop_patience 10 Rollouts without improvement before stopping early.
tool_timeout 30.0 Seconds before a tool call is killed.
max_concurrent_llm_calls 5 Semaphore limit for parallel LLM calls (rate limit protection).

LLM Providers

Provider Spec prefix Extra API key env var
OpenAI openai: tooltree[openai] OPENAI_API_KEY
Google Gemini gemini: tooltree[gemini] GEMINI_API_KEY or GOOGLE_API_KEY
Anthropic Claude anthropic: tooltree[anthropic] ANTHROPIC_API_KEY

Pass as a string to ToolTreePlanner:

planner = ToolTreePlanner(tools=tools, llm="openai:gpt-4o-mini")
planner = ToolTreePlanner(tools=tools, llm="gemini:gemini-2.0-flash")
planner = ToolTreePlanner(tools=tools, llm="anthropic:claude-haiku-4-5-20251001")

Or pass a provider instance directly:

from tooltree.llm import create_llm

llm = create_llm("openai:gpt-4o-mini", temperature=0.2, max_tokens=512)
planner = ToolTreePlanner(tools=tools, llm=llm)

Use a separate, stronger model as the judge (evaluates tool outputs while a cheaper model drives search):

planner = ToolTreePlanner(
    tools=tools,
    llm="openai:gpt-4o-mini",      # drives search (more calls)
    judge_llm="openai:gpt-4o",     # evaluates quality (fewer calls)
)

Custom / Local LLMs

Any class that extends LLMProvider works:

from tooltree.llm.base import LLMProvider

class MyProvider(LLMProvider):
    async def complete(self, messages: list[dict], **kwargs) -> str:
        # call your local model / API
        ...

planner = ToolTreePlanner(tools=tools, llm=MyProvider())

API Reference

ToolTreePlanner

ToolTreePlanner(
    tools: Sequence[Tool],
    llm: str | LLMProvider = "gemini:gemini-2.0-flash",
    config: Config | None = None,
    judge_llm: str | LLMProvider | None = None,
)
Method Description
solve(query) -> Awaitable[SearchResult] Async search
solve_sync(query) -> SearchResult Sync wrapper — works in scripts, Jupyter, and nested event loops

SearchResult

result.answer          # str — final answer synthesized from best trajectory
result.tool_trace      # list[dict] — [{tool, arguments, output}, ...]
result.total_rollouts  # int — iterations run
result.best_q          # float — best Q-value (0.0 to 1.0)
result.tree_stats      # dict — {total_nodes, max_depth_reached, early_stopped}

Tool

Tool(
    name: str,                          # e.g. "search_web"
    description: str,                   # Shown to the LLM — be specific
    parameters: dict[str, Any],         # JSON Schema-style: {param: {type, description}}
    execute: Callable,                  # Sync or async function
)

How It Works

ToolTree runs MCTS over the space of possible tool call sequences:

  1. Select — Traverse the tree via UCT score until a leaf node
  2. Expand — Ask the LLM to generate candidate tool calls; pre-evaluate each (score before executing); drop candidates below tau_pre; keep top-K
  3. Execute — Run the chosen tool with timeout; cache the result
  4. Post-evaluate — Score the actual output; if below tau_post, mark the branch as non-expandable
  5. Backpropagate — Update Q-values up to root
  6. Repeat until max_rollouts or early stop

After search: extract the highest-Q path, ask the LLM to synthesize a final answer from the trajectory.

Dual pruning means bad tool calls are rejected at two checkpoints — before they waste API calls, and after they waste search budget. This keeps rollout cost low on multi-hop tasks where the naive approach would try many dead-end paths.


Running Tests

# Install dev dependencies
pip install -e ".[dev]"

# Run all unit tests (no API key needed — uses MockLLM)
pytest tests/ -v

# Run integration test (requires API key)
TOOLTREE_MODEL=openai:gpt-4o-mini python -u examples/integration_test.py

Project Structure

src/tooltree/
├── planner.py              # ToolTreePlanner — main entry point
├── config.py               # Config dataclass
├── core/
│   ├── tree.py             # Action, SearchState, MCTSNode
│   ├── mcts.py             # MCTSPlanner — full search loop
│   ├── pruning.py          # pre_prune(), post_prune()
│   └── cache.py            # ResultCache — deduplicates tool calls
├── evaluation/
│   ├── pre_evaluator.py    # Scores tool calls before execution
│   ├── post_evaluator.py   # Scores tool outputs after execution
│   └── prompts.py          # LLM judge prompts (paper Appendix B)
├── llm/
│   ├── base.py             # LLMProvider ABC, create_llm() factory
│   ├── openai_provider.py
│   ├── gemini_provider.py
│   └── anthropic_provider.py
└── tools/
    ├── schema.py           # Tool dataclass
    ├── registry.py         # ToolRegistry
    └── executor.py         # ToolExecutor with timeout

License

MIT

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

tooltree_mcts-0.1.0.tar.gz (70.8 kB view details)

Uploaded Source

Built Distribution

If you're not sure about the file name format, learn more about wheel file names.

tooltree_mcts-0.1.0-py3-none-any.whl (35.7 kB view details)

Uploaded Python 3

File details

Details for the file tooltree_mcts-0.1.0.tar.gz.

File metadata

  • Download URL: tooltree_mcts-0.1.0.tar.gz
  • Upload date:
  • Size: 70.8 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.2.0 CPython/3.12.10

File hashes

Hashes for tooltree_mcts-0.1.0.tar.gz
Algorithm Hash digest
SHA256 95bd8e6fc9c027ddf7fac9eb756cb6fac9d6878309b86cb28f457266f4894c12
MD5 93ca2eee3b10a22d22012eccfeb90026
BLAKE2b-256 850f6aec7cf012a404be7c27f8ee4e3802693a5d4cb003bd8005015b21ba2a0c

See more details on using hashes here.

File details

Details for the file tooltree_mcts-0.1.0-py3-none-any.whl.

File metadata

  • Download URL: tooltree_mcts-0.1.0-py3-none-any.whl
  • Upload date:
  • Size: 35.7 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.2.0 CPython/3.12.10

File hashes

Hashes for tooltree_mcts-0.1.0-py3-none-any.whl
Algorithm Hash digest
SHA256 3f05d4e5103ad9d0b6f2f8eae35209e9a376b12f7825b644660856bfc1a6f82d
MD5 a57c0896cd109ddd472f8e9e928d231b
BLAKE2b-256 a00a2341e96588227d34575dcf63d0f77d872eaf8ab2358234460331ddd993fa

See more details on using hashes here.

Supported by

AWS Cloud computing and Security Sponsor Datadog Monitoring Depot Continuous Integration Fastly CDN Google Download Analytics Pingdom Monitoring Sentry Error logging StatusPage Status page