graph-tool-call
Graph-based Tool Retrieval for LLM Agents
Ingest, Analyze, Organize, Retrieve.
The Problem
LLM agents are getting access to more and more tools. A commerce platform might expose 1,200+ API endpoints. A company's internal toolset might have 500+ functions across multiple services.
But there's a hard limit: you can't put them all in the context window.
The common solution? Vector search — embed tool descriptions, find the closest matches. It works, but it misses something important:
Tools don't exist in isolation. They have relationships.
When a user says "cancel my order and process a refund", vector search might find cancelOrder. But it won't know that you need to call listOrders first (to get the order ID), and that processRefund should follow. These aren't just similar tools — they form a workflow.
The Solution
graph-tool-call models tool relationships as a graph and retrieves tools using a multi-signal hybrid search pipeline:
```
OpenAPI/MCP/Code → [Ingest] → [Analyze] → [Organize] → [Retrieve] → Agent
                   (convert)  (relations)   (graph)   (wRRF hybrid)
```
4-source wRRF fusion: BM25 keyword matching + graph traversal + embedding similarity + MCP annotation scoring — combined via weighted Reciprocal Rank Fusion.
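For intuition, here is a minimal sketch of weighted Reciprocal Rank Fusion. The smoothing constant `k=60` is the common RRF default and the example rankings are invented for illustration; neither is a documented value of this project:

```python
def weighted_rrf(rankings: dict[str, list[str]], weights: dict[str, float], k: int = 60) -> list[str]:
    """Fuse per-source rankings: score(t) = sum over sources of w_s / (k + rank_s(t))."""
    scores: dict[str, float] = {}
    for source, ranked in rankings.items():
        w = weights.get(source, 0.0)
        for rank, tool in enumerate(ranked, start=1):
            scores[tool] = scores.get(tool, 0.0) + w / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

fused = weighted_rrf(
    rankings={
        "keyword": ["getOrder", "listOrders"],
        "graph": ["listOrders", "cancelOrder"],
        "embedding": ["cancelOrder", "getOrder"],
    },
    weights={"keyword": 0.2, "graph": 0.5, "embedding": 0.3},
)
```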
```
             ┌──────────┐
    PRECEDES │listOrders│ PRECEDES
   ┌─────────┤          ├──────────┐
   ▼         └──────────┘          ▼
┌──────────┐                 ┌───────────┐
│ getOrder │                 │cancelOrder│
└──────────┘                 └─────┬─────┘
                                   │ COMPLEMENTARY
                                   ▼
                          ┌──────────────┐
                          │processRefund │
                          └──────────────┘
```
Benchmark
The real test: put graph-tool-call between the LLM and its tools, then compare. We gave the same user requests to an LLM under four configurations:
- baseline: pass all tool definitions to the LLM (no retrieval)
- retrieve-k3/k5/k10: pass only the top-K retrieved tools
Model: qwen3:4b (4-bit quantized, Ollama). All specs are public and reproducible.
Does graph-tool-call help the LLM?
| API | Tools | baseline | retrieve-k5 | + embedding | + ontology | Token savings |
|---|---|---|---|---|---|---|
| Petstore | 19 | 100% | 95.0% | — | — | 64% |
| GitHub | 50 | 100% | 87.5% | — | — | 88% |
| MCP Servers | 38 | 96.7% | 90.0% | — | — | 83% |
| Kubernetes | 248 | 18.4% | 74.0% | 72.0% | 74.0% | 80% |
The story in two lines:
- < 50 tools: The LLM handles them fine. graph-tool-call's value = 64–88% token savings (faster, cheaper).
- 248 tools: The LLM collapses to 18%. graph-tool-call delivers 72–74% accuracy — it's not an optimization, it's a requirement.
Full pipeline comparison (4 configurations)
Petstore (19 tools, 20 queries)
| Pipeline | Accuracy | Recall@K | Avg Tokens | Token Savings |
|---|---|---|---|---|
| baseline | 100.0% | 100.0% | 1,239 | — |
| retrieve-k3 | 90.0% | 93.3% | 305 | 75.4% |
| retrieve-k5 | 95.0% | 98.3% | 440 | 64.4% |
| retrieve-k10 | 100.0% | 98.3% | 720 | 41.9% |
GitHub (50 tools, 40 queries)
| Pipeline | Accuracy | Recall@K | Avg Tokens | Token Savings |
|---|---|---|---|---|
| baseline | 100.0% | 100.0% | 3,302 | — |
| retrieve-k3 | 85.0% | 87.5% | 289 | 91.3% |
| retrieve-k5 | 87.5% | 87.5% | 398 | 87.9% |
| retrieve-k10 | 90.0% | 92.5% | 662 | 79.9% |
Mixed MCP (38 tools, 30 queries)
| Pipeline | Accuracy | Recall@K | Avg Tokens | Token Savings |
|---|---|---|---|---|
| baseline | 96.7% | 100.0% | 2,741 | — |
| retrieve-k3 | 86.7% | 93.3% | 328 | 88.0% |
| retrieve-k5 | 90.0% | 96.7% | 461 | 83.2% |
| retrieve-k10 | 96.7% | 100.0% | 826 | 69.9% |
Kubernetes core/v1 (248 tools, 50 queries)
| Pipeline | Accuracy | Recall@K | Avg Tokens | Token Savings |
|---|---|---|---|---|
| baseline | 18.4% | 100.0% | 8,192 | — |
| retrieve-k5 | 74.0% | 88.0% | 1,613 | 80.3% |
| + embedding | 72.0% | 92.0% | 1,728 | 78.9% |
| + ontology | 74.0% | 92.0% | 1,793 | 78.1% |
| + both | 70.0% | 94.0% | 1,726 | 78.9% |
How good is the retrieval?
Before the LLM sees anything, graph-tool-call must first find the right tools. We measure this with Recall@K: "Is the correct tool in the top-K results?"
| API | Tools | Recall@3 | Recall@5 | Recall@10 |
|---|---|---|---|---|
| Petstore | 19 | 93.3% | 98.3% | 98.3% |
| GitHub | 50 | 87.5% | 87.5% | 92.5% |
| MCP | 38 | 93.3% | 96.7% | 100.0% |
| Kubernetes | 248 | 82.0% | 88.0% | 90.0% |
These are BM25 + graph traversal only — no embedding model.
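Concretely, Recall@K here is the per-query hit rate. A minimal sketch of the metric (function and variable names are illustrative, not benchmark code):

```python
def recall_at_k(retrieved: list[list[str]], expected: list[str], k: int) -> float:
    """Fraction of queries whose expected tool appears among the top-K retrieved names."""
    hits = sum(exp in names[:k] for names, exp in zip(retrieved, expected))
    return hits / len(expected)

# recall_at_k([["addPet", "updatePet", "getPetById"]], ["addPet"], k=5) -> 1.0
```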
When does embedding + ontology help?
Adding OpenAI embedding (text-embedding-3-small) and LLM ontology (gpt-4o-mini) on top of BM25 + graph — tested on the hardest dataset (K8s, 248 tools):
| Pipeline | Accuracy | Recall@K | Features |
|---|---|---|---|
| retrieve-k5 | 74.0% | 88.0% | BM25 + graph only |
| + embedding | 72.0% | 92.0% | + OpenAI embedding |
| + ontology | 74.0% | 92.0% | + GPT-4o-mini knowledge graph |
| + both | 70.0% | 94.0% | embedding + ontology |
- Embedding: Recall@5 88% → 92% (+4pp) — catches semantic matches that BM25 misses.
- Ontology: Recall@5 88% → 92% (+4pp), Accuracy 74% — LLM-enriched keywords improve BM25 scoring.
- Both combined: Recall@5 94% (highest) — ontology keywords + embedding semantics complement each other.
Reproduce it
```bash
# Retrieval quality (fast, no LLM needed)
python -m benchmarks.run_benchmark
python -m benchmarks.run_benchmark -d k8s -v

# Pipeline benchmark (LLM comparison)
python -m benchmarks.run_benchmark --mode pipeline -m qwen3:4b
python -m benchmarks.run_benchmark --mode pipeline --pipelines baseline retrieve-k3 retrieve-k5 retrieve-k10

# Save baseline and compare
python -m benchmarks.run_benchmark --mode pipeline --save-baseline
python -m benchmarks.run_benchmark --mode pipeline --diff
```
Installation
```bash
pip install graph-tool-call             # core (BM25 + graph)
pip install graph-tool-call[embedding]  # + embedding, cross-encoder reranker
pip install graph-tool-call[openapi]    # + YAML support for OpenAPI specs
pip install graph-tool-call[all]        # everything
```
All extras
```bash
pip install graph-tool-call[lint]           # + ai-api-lint spec auto-fix
pip install graph-tool-call[similarity]     # + rapidfuzz for deduplication
pip install graph-tool-call[visualization]  # + pyvis for HTML graph export
pip install graph-tool-call[langchain]      # + LangChain tool adapter
```
Quick Start
30-Second Example
```python
from graph_tool_call import ToolGraph

# Build a tool graph from the official Petstore API
tg = ToolGraph.from_url(
    "https://petstore3.swagger.io/api/v3/openapi.json",
    cache="petstore.json",  # saves graph locally for instant reload
)

print(tg)
# → ToolGraph(tools=19, nodes=22, edges=100)

# Search for tools — 98.3% Recall@5 on this spec
tools = tg.retrieve("create a new pet", top_k=5)
for t in tools:
    print(f"  {t.name}: {t.description}")
# → addPet: Add a new pet to the store.
#   updatePet: Update an existing pet.
#   getPetById: Find pet by ID.
#   ...graph expansion brings the full CRUD workflow
```
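The retrieved tools are meant to replace the full catalog in the model call. A sketch of that handoff with the OpenAI client; the `parameters` attribute on retrieved tools and the model name are assumptions for illustration, not documented API:

```python
from openai import OpenAI

client = OpenAI()
tools = tg.retrieve("create a new pet", top_k=5)

# Convert retrieved tools into OpenAI function-calling format.
tool_defs = [
    {
        "type": "function",
        "function": {
            "name": t.name,
            "description": t.description,
            # assumed attribute; fall back to an empty schema
            "parameters": getattr(t, "parameters", {"type": "object", "properties": {}}),
        },
    }
    for t in tools
]

response = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[{"role": "user", "content": "create a new pet"}],
    tools=tool_defs,  # top-K tools only, instead of all 19
)
```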
From Your Own Swagger / OpenAPI
```python
from graph_tool_call import ToolGraph

# From file (JSON or YAML)
tg = ToolGraph()
tg.ingest_openapi("path/to/openapi.json")

# From URL — auto-discovers all spec groups from Swagger UI
tg = ToolGraph.from_url("https://api.example.com/swagger-ui/index.html")

# With caching — build once, reload instantly
tg = ToolGraph.from_url(
    "https://api.example.com/swagger-ui/index.html",
    cache="my_api.json",  # first call: fetch + build + save
)                         # next calls: load from file (no network)

# Supports: Swagger 2.0, OpenAPI 3.0, OpenAPI 3.1
```
From MCP Server Tools
```python
from graph_tool_call import ToolGraph

mcp_tools = [
    {
        "name": "read_file",
        "description": "Read a file",
        "inputSchema": {"type": "object", "properties": {"path": {"type": "string"}}},
        "annotations": {"readOnlyHint": True, "destructiveHint": False},
    },
    {
        "name": "delete_file",
        "description": "Delete a file permanently",
        "inputSchema": {"type": "object", "properties": {"path": {"type": "string"}}},
        "annotations": {"readOnlyHint": False, "destructiveHint": True},
    },
]

tg = ToolGraph()
tg.ingest_mcp_tools(mcp_tools, server_name="filesystem")

# Annotation-aware: "delete files" → destructive tools ranked higher
tools = tg.retrieve("delete temporary files", top_k=5)
```
MCP annotations (readOnlyHint, destructiveHint, idempotentHint, openWorldHint) are used as retrieval signals. Query intent is automatically classified and aligned with tool annotations.
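The classification happens inside the library; as a rough illustration of the idea only, with keyword lists and scoring invented for this sketch (not the library's actual logic):

```python
DESTRUCTIVE_VERBS = {"delete", "remove", "drop", "purge"}
READ_VERBS = {"read", "get", "list", "show", "find"}

def annotation_score(query: str, annotations: dict) -> float:
    """Toy intent-annotation alignment: reward tools whose hints match the query verb."""
    words = set(query.lower().split())
    if words & DESTRUCTIVE_VERBS and annotations.get("destructiveHint"):
        return 1.0
    if words & READ_VERBS and annotations.get("readOnlyHint"):
        return 1.0
    return 0.0

# "delete temporary files" → delete_file (destructiveHint=True) scores 1.0
```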
From Python Functions
```python
from graph_tool_call import ToolGraph

def read_file(path: str) -> str:
    """Read contents of a file."""

def write_file(path: str, content: str) -> None:
    """Write contents to a file."""

tg = ToolGraph()
tg.ingest_functions([read_file, write_file])
# Parameters extracted from type hints, description from docstring
```
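A simplified sketch of what such extraction typically looks like, using only the standard library; this mirrors the idea, not the package's internal code, and assumes plain class annotations:

```python
import inspect

def describe(fn) -> dict:
    """Derive a tool description from a callable's signature and docstring."""
    params = {}
    for name, p in inspect.signature(fn).parameters.items():
        ann = p.annotation
        params[name] = {"type": ann.__name__ if ann is not inspect.Parameter.empty else "any"}
    return {
        "name": fn.__name__,
        "description": inspect.getdoc(fn) or "",
        "parameters": params,
    }

# describe(read_file) -> {"name": "read_file",
#                         "description": "Read contents of a file.",
#                         "parameters": {"path": {"type": "str"}}}
```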
Manual Tool Registration
```python
from graph_tool_call import ToolGraph

tg = ToolGraph()

# OpenAI function-calling format — auto-detected
tg.add_tools([
    {
        "type": "function",
        "function": {
            "name": "get_weather",
            "description": "Get current weather for a city",
            "parameters": {
                "type": "object",
                "properties": {"city": {"type": "string"}},
            },
        },
    },
])

# Define relationships manually
tg.add_relation("get_weather", "get_forecast", "complementary")
```
Embedding (Hybrid Search)
Add embedding-based semantic search on top of BM25 + graph. Any OpenAI-compatible endpoint works.
```bash
pip install graph-tool-call[embedding]
```

```python
# Sentence-transformers (local, no API key needed)
tg.enable_embedding("sentence-transformers/all-MiniLM-L6-v2")

# OpenAI
tg.enable_embedding("openai/text-embedding-3-large")

# Ollama
tg.enable_embedding("ollama/nomic-embed-text")

# vLLM / llama.cpp / any OpenAI-compatible server
tg.enable_embedding("vllm/Qwen/Qwen3-Embedding-0.6B")
tg.enable_embedding("vllm/model@http://gpu-box:8000/v1")
tg.enable_embedding("llamacpp/model@http://192.168.1.10:8080/v1")
tg.enable_embedding("http://localhost:8000/v1@my-model")  # URL@model format

# Custom callable
tg.enable_embedding(lambda texts: my_embed_fn(texts))
```
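The custom-callable route only requires mapping a batch of strings to vectors. A minimal local example with sentence-transformers, assuming the contract implied by the lambda above is `list[str] -> list[list[float]]`:

```python
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")

def my_embed_fn(texts: list[str]) -> list[list[float]]:
    # encode() returns a numpy array; convert to plain lists of floats
    return model.encode(texts).tolist()

tg.enable_embedding(my_embed_fn)
```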
Weights are automatically rebalanced when embedding is enabled. You can fine-tune them:
```python
tg.set_weights(keyword=0.1, graph=0.4, embedding=0.5)
```
Save & Load
Build once, reuse everywhere. The full graph structure (nodes, edges, relation types, weights) is preserved.
```python
# Save
tg.save("my_graph.json")

# Load
tg = ToolGraph.load("my_graph.json")

# Or use cache= in from_url() for automatic save/load
tg = ToolGraph.from_url(url, cache="my_graph.json")
```
Advanced Features
Cross-Encoder Reranking
Second-stage reranking using a cross-encoder model. Jointly encodes (query, tool_description) pairs for more accurate scoring than independent embedding comparison.
```python
tg.enable_reranker()  # default: cross-encoder/ms-marco-MiniLM-L-6-v2
tools = tg.retrieve("cancel order", top_k=5)
# Results are first ranked by wRRF, then re-scored by cross-encoder
```
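For intuition, this is what cross-encoder scoring looks like with sentence-transformers directly; the library wraps the equivalent step, and the model name is taken from the default noted above:

```python
from sentence_transformers import CrossEncoder

model = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")

# Each (query, description) pair is encoded jointly, unlike bi-encoder similarity.
pairs = [("cancel order", t.description) for t in tools]
scores = model.predict(pairs)  # one relevance score per pair
reranked = [t for _, t in sorted(zip(scores, tools), key=lambda x: x[0], reverse=True)]
```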
MMR Diversity
Maximal Marginal Relevance reranking reduces redundant results.
```python
tg.enable_diversity(lambda_=0.7)  # 0.7 = mostly relevant, some diversity
```
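MMR greedily picks the next result that balances relevance against similarity to what is already selected. A standard sketch of the algorithm, with `relevance` and `similarity` left abstract:

```python
def mmr(candidates, relevance, similarity, lambda_=0.7, top_k=5):
    """Select items maximizing lambda_ * relevance - (1 - lambda_) * max similarity to selected."""
    selected = []
    remaining = list(candidates)
    while remaining and len(selected) < top_k:
        best = max(
            remaining,
            key=lambda c: lambda_ * relevance[c]
            - (1 - lambda_) * max((similarity(c, s) for s in selected), default=0.0),
        )
        selected.append(best)
        remaining.remove(best)
    return selected
```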
History-Aware Retrieval
Pass previously called tool names to improve context. Already-used tools are demoted, and their graph neighbors become seeds for expansion.
```python
# First call
tools = tg.retrieve("find my order")
# → [listOrders, getOrder, ...]

# Second call — history-aware
tools = tg.retrieve("now cancel it", history=["listOrders", "getOrder"])
# → [cancelOrder, processRefund, ...]
# listOrders/getOrder demoted, cancelOrder boosted via graph proximity
```
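One plausible mechanic behind this, sketched for illustration only; the penalty and boost factors are invented, not the library's actual values:

```python
def apply_history(scores: dict[str, float], history: list[str], neighbors) -> dict[str, float]:
    """Demote already-used tools; boost their graph neighbors as likely next steps."""
    adjusted = dict(scores)
    for used in history:
        if used in adjusted:
            adjusted[used] *= 0.3      # demote tools already called
        for nb in neighbors(used):     # e.g. cancelOrder adjacent to getOrder
            if nb in adjusted:
                adjusted[nb] *= 1.5    # boost likely next-step tools
    return adjusted
```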
wRRF Weight Tuning
Fine-tune the weighted Reciprocal Rank Fusion weights for each scoring source:
```python
tg.set_weights(
    keyword=0.2,     # BM25 text matching
    graph=0.5,       # graph traversal (relation-based)
    embedding=0.3,   # semantic similarity
    annotation=0.2,  # MCP annotation matching
)
```
LLM-Enhanced Ontology
Build richer tool ontologies using any LLM. The LLM infers categories, relations, and generates search keywords (especially useful for non-English tool descriptions).
```python
# Any of these work — auto-detected by wrap_llm()
tg.auto_organize(llm="ollama/qwen2.5:7b")                 # string shorthand
tg.auto_organize(llm=lambda p: my_llm(p))                 # callable
tg.auto_organize(llm=openai.OpenAI())                     # OpenAI client
tg.auto_organize(llm="litellm/claude-sonnet-4-20250514")  # via litellm
```
Supported LLM inputs
| Input | Wrapped as |
|---|---|
| `OntologyLLM` instance | Pass-through |
| `callable(str) -> str` | `CallableOntologyLLM` |
| OpenAI client (has `chat.completions`) | `OpenAIClientOntologyLLM` |
| `"ollama/model"` | `OllamaOntologyLLM` |
| `"openai/model"` | `OpenAICompatibleOntologyLLM` |
| `"litellm/model"` | `litellm.completion` wrapper |
Duplicate Detection
Find and merge duplicate tools across multiple API specs:
```python
duplicates = tg.find_duplicates(threshold=0.85)
merged = tg.merge_duplicates(duplicates)
# merged = {"getUser_1": "getUser", ...}
```
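The returned mapping can be used to normalize tool names downstream; a small helper following the mapping shape shown in the comment above (the helper itself is hypothetical):

```python
def canonical(name: str, merged: dict[str, str]) -> str:
    """Resolve a possibly-duplicated tool name to its canonical form."""
    return merged.get(name, name)

# canonical("getUser_1", merged) -> "getUser"
```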
Export & Visualization
```python
# Interactive HTML (vis.js)
tg.export_html("graph.html", progressive=True)

# GraphML (for Gephi, yEd)
tg.export_graphml("graph.graphml")

# Neo4j Cypher
tg.export_cypher("graph.cypher")
```
API Spec Lint Integration
Auto-fix poor OpenAPI specs before ingestion using ai-api-lint:
```bash
pip install graph-tool-call[lint]
```

```python
tg = ToolGraph.from_url(url, lint=True)  # auto-fix during ingest
```
Why Not Just Vector Search?
| Scenario | Vector-only | graph-tool-call |
|---|---|---|
| "cancel my order" | Returns `cancelOrder` | Returns `listOrders` → `getOrder` → `cancelOrder` → `processRefund` (full workflow) |
| "read and save file" | Returns `read_file` | Returns `read_file` + `write_file` (via COMPLEMENTARY) |
| "delete old records" | Returns any tool matching "delete" | Destructive tools ranked first (annotation-aware) |
| "now cancel it" (with history) | No context, same results | Demotes used tools, boosts next-step tools |
| Multiple Swagger specs with overlapping tools | Duplicate tools in results | Auto-deduplication across sources |
| 1,200 API endpoints | Slow, noisy results | Organized into categories, precise graph traversal |
Full API Reference
ToolGraph methods
| Method | Description |
|---|---|
| `add_tool(tool)` | Add a single tool (auto-detects format) |
| `add_tools(tools)` | Add multiple tools |
| `ingest_openapi(source)` | Ingest from OpenAPI/Swagger spec |
| `ingest_mcp_tools(tools)` | Ingest from MCP tool list |
| `ingest_functions(fns)` | Ingest from Python callables |
| `ingest_arazzo(source)` | Ingest Arazzo 1.0.0 workflow spec |
| `from_url(url, cache=...)` | Build from Swagger UI or spec URL |
| `add_relation(src, tgt, type)` | Add a manual relation |
| `auto_organize(llm=...)` | Auto-categorize tools |
| `build_ontology(llm=...)` | Build complete ontology |
| `retrieve(query, top_k=10)` | Search for tools |
| `enable_embedding(provider)` | Enable hybrid embedding search |
| `enable_reranker(model)` | Enable cross-encoder reranking |
| `enable_diversity(lambda_)` | Enable MMR diversity |
| `set_weights(...)` | Tune wRRF fusion weights |
| `find_duplicates(threshold)` | Find duplicate tools |
| `merge_duplicates(pairs)` | Merge detected duplicates |
| `apply_conflicts()` | Detect and add CONFLICTS_WITH edges |
| `save(path)` / `load(path)` | Serialize / deserialize graph |
| `export_html(path)` | Export interactive HTML visualization |
| `export_graphml(path)` | Export to GraphML format |
| `export_cypher(path)` | Export as Neo4j Cypher statements |
Feature Comparison
| Feature | Vector-only solutions | graph-tool-call |
|---|---|---|
| Tool source | Manual registration | Auto-ingest from Swagger/OpenAPI/MCP |
| Search method | Flat vector similarity | Multi-stage hybrid (wRRF + rerank + MMR) |
| Behavioral semantics | None | MCP annotation-aware retrieval |
| Tool relations | None | 6 relation types, auto-detected |
| Call ordering | None | State machine + CRUD + response→request data flow |
| Deduplication | None | Cross-source duplicate detection |
| Ontology | None | Auto / LLM-Auto modes (any LLM) |
| History awareness | None | Demotes used tools, boosts next-step |
| Spec quality | Assumes good specs | ai-api-lint auto-fix integration |
| LLM dependency | Required | Optional (better with, works without) |
Documentation
| Doc | Description |
|---|---|
| Architecture | System overview, pipeline layers, data model |
| WBS | Work Breakdown Structure — Phases 0-4 progress |
| Design | Algorithm design — spec normalization, dependency detection, search modes, call ordering, ontology modes |
| Research | Competitive analysis, API scale data, commerce patterns |
| OpenAPI Guide | How to write API specs that produce better tool graphs |
Contributing
Contributions are welcome!
```bash
# Development setup
git clone https://github.com/SonAIengine/graph-tool-call.git
cd graph-tool-call
pip install poetry
poetry install --with dev

# Run tests
poetry run pytest -v

# Lint
poetry run ruff check .
poetry run ruff format --check .

# Run benchmarks
python -m benchmarks.run_benchmark -v
```
License