agent-tool-router
Pick the right tools for an agent task. Boring baseline. Open dataset.
from agent_tool_router import Router
# First call downloads ~6 MB from huggingface.co/dalek-ai and caches it.
r = Router.from_pretrained("baseline-v1-desc")
r.route("cancel my pending order and refund the credit", k=3)
# ['refundOrder', 'modify_pending_order_items', 'cancel_pending_order']
Install: pip install agent-tool-router. No GPU, no torch, no API key.
For French speakers: this open-source router picks the tools to call for a task from a catalog of 18 000. The multilingual pretrained model reaches 54% top-3 on a panel of 50 tasks in French (baseline-v1-desc-hybrid-multilingual), with no measurable cost on the English side. Everything can be downloaded from huggingface.co/dalek-ai under the MIT license.
Try it without installing
A hosted instance runs the baseline-v1-desc-hybrid-multilingual-next-v1 model at
dalek-ai-router-api.hf.space. One curl, no signup, no API key:
curl -X POST https://dalek-ai-router-api.hf.space/route \
  -H 'Content-Type: application/json' \
  -d '{"task": "annule ma commande et rembourse-moi", "k": 3}'
Returns top-3 tools + scores + latency. Interactive Swagger UI at /docs.
Median latency ~200 ms on a shared free CPU (vs ~9 ms locally on CPU). Free tier,
rate-limited only by HF Spaces quotas. Code: router/api/.
Waitlist & feedback
If you build agents and want the API beyond the public demo (private models, higher rate limits, eval datasets), drop your handle in the Waitlist discussion. Bug reports and feature requests also go there.
What this is
Most agent stacks today wire up a fixed bag of tools and let the LLM figure out when to call what. That works until the bag has more than ~30 tools, at which point prompt-stuffed tool descriptions blow up the context, latency creeps up, and routing decisions start to look random.
agent-tool-router is a small library that takes a task description and
returns the top-k tools to use, ranked. The first model is a centroid retrieval
baseline trained on 14 000 traces from public agent benchmarks. It's
intentionally dumb and intentionally fast. You should be able to beat it.
What's in the box
- agent_tool_router/: the SDK (Router.from_pretrained, route(task, k)).
- router/index/: loaders that normalize public datasets (tau-bench, Hermes function-calling-v1, ToolACE, SWE-bench Verified, OSWorld) into a unified Trace schema.
- router/eval/: the evaluation scripts that produced the numbers below.
- scripts/make_dataset.sh: rebuild data/traces.jsonl from public sources.
data/ and models/ are gitignored. Generate them locally.
Numbers (baseline-v0)
Trained on 8 162 task→tool sequences (cross-corpus, after dedup). Tested on 2 041 held-out tasks. Tool vocabulary filtered to names appearing ≥ 3 times in the training set: 265 tools.
| metric | model | random | ratio |
|---|---|---|---|
| top-1 per-call accuracy | 33.0% | 0.38% | 87.5× |
| top-3 per-call accuracy | 63.8% | 1.13% | 56.4× |
| top-5 per-call accuracy | 83.0% | 1.89% | 44.0× |
| top-10 per-call accuracy | 91.5% | 3.77% | 24.3× |
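The random column is simply k / V for a vocabulary of V = 265 tools, and the ratio column is model accuracy divided by that chance level. A short sketch that recomputes both from the table (the published ratios were presumably computed from unrounded accuracies, so the last digit can differ slightly):

```python
# Chance level for top-k over V tools is k / V; the ratio is model / chance.
VOCAB_SIZE = 265  # tools kept after the ">= 3 occurrences" filter

# Model accuracies in percent, copied from the table above.
MODEL_TOP_K = {1: 33.0, 3: 63.8, 5: 83.0, 10: 91.5}


def random_top_k(k: int, vocab_size: int = VOCAB_SIZE) -> float:
    """Chance (in %) that k distinct uniformly random tools include the gold one."""
    return 100.0 * k / vocab_size


def lift(k: int) -> float:
    """How many times better than chance the model is at top-k."""
    return MODEL_TOP_K[k] / random_top_k(k)


for k in MODEL_TOP_K:
    print(f"top-{k}: random={random_top_k(k):.2f}%  ratio={lift(k):.1f}x")
```

The ratio shrinks as k grows because chance improves linearly with k while the model saturates.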
Per-source top-3 (same model, evaluated by source):
| source | n_test tasks | calls evaluated | top-3 acc | ratio |
|---|---|---|---|---|
| Hermes function-calling-v1 | 218 | 13 | 92.3% | 81.5× |
| ToolACE | 1 792 | 60 | 63.3% | 55.9× |
| tau-bench | 31 | 151 | 61.6% | 54.4× |
Caveats (read these before quoting the numbers)
- Hermes leaks the tool name into the task text 21.5% of the time. A row like "Get the camera live feed" gold-calls get_camera_live_feed. The model isn't really learning routing on those, it's doing fuzzy substring matching. We measured: tau-bench 0%, SWE-bench 0%, ToolACE 2.8%, Hermes 21.5%. The cross-corpus number above is the headline; the closest-to-clean number is tau-bench's top-3 = 5.0× random (vocab=23, separate baseline in router/eval/baseline_tfidf.py).
- The vocab is filtered. 95% of the union vocab (~10 000 tool names) only appears once or twice. The baseline can't learn anything about those, so it's evaluated on the 265 tools that actually have training signal. Cold-start tool routing is an open problem (see Roadmap).
- Centroid retrieval is the floor, not the ceiling. This is what a TF-IDF model and a bit of arithmetic can do. Anything you build should beat it; if it doesn't, the problem is your model, not the dataset.
- The pretrained model does not transfer across tool ecosystems via names. Each source brings its own private universe of tool names: cancel_pending_order (tau-bench retail) doesn't appear in ToolACE; Get Stock Price (ToolACE) doesn't appear in Hermes. Leave-one-source-out vocab overlap is 0.0%–0.1% across the three sources, so name-based routing trained on N-1 sources scores ~0% on the held-out source. For your own tools, see Router.from_examples().
- Even after stripping leaky rows, the cross-corpus baseline holds. Filtering out every row where the gold tool name appears verbatim or as in-order subtokens within a 4-token window of the task text drops the dataset from 14K to 10.4K rows but only moves the cross-corpus headline from 56.4× to 30.6× random top-3. See router/eval/baseline_cross_corpus_clean.py.
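To make the leak filter concrete, here is a minimal sketch of one plausible reading of the rule ("verbatim, or in-order subtokens within a 4-token window"). The function names, the tokenization, and the interpretation of the window (each subtoken must appear at most 4 tokens after the previous one) are assumptions for illustration; the actual filter lives in router/eval/baseline_cross_corpus_clean.py and may differ.

```python
import re


def subtokens(tool_name: str) -> list:
    """Split a tool name on underscores, spaces and camelCase boundaries."""
    spaced = re.sub(r"([a-z0-9])([A-Z])", r"\1 \2", tool_name)
    return [t.lower() for t in re.split(r"[^A-Za-z0-9]+", spaced) if t]


def words(text: str) -> list:
    """Lowercased alphanumeric tokens of the task text."""
    return re.findall(r"[a-z0-9]+", text.lower())


def is_leaky(task: str, tool_name: str, window: int = 4) -> bool:
    """True if the tool name is (nearly) spelled out in the task text."""
    if tool_name.lower() in task.lower():
        return True  # verbatim leak
    parts, toks = subtokens(tool_name), words(task)
    if not parts:
        return False
    # Positions where the first subtoken occurs, then walk forward: each next
    # subtoken must occur within `window` tokens after a reachable position.
    reachable = {i for i, tok in enumerate(toks) if tok == parts[0]}
    for part in parts[1:]:
        reachable = {
            j
            for i in reachable
            for j in range(i + 1, min(i + 1 + window, len(toks)))
            if toks[j] == part
        }
        if not reachable:
            return False
    return bool(reachable)


print(is_leaky("Get the camera live feed", "get_camera_live_feed"))
print(is_leaky("cancel my pending order and refund the credit", "refundOrder"))
```

The first call is flagged as a leak (every subtoken appears in order), the second is not ("order" never follows "refund" in the task).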
Cross-source generalization via tool descriptions
The roadmap line "can a model bridge ecosystems via tool descriptions rather than tool names?" is now answered. Yes, mostly.
We extracted the natural-language description of every tool we could find
(2.6K from Hermes, 16K from ToolACE, 29 from tau-bench, in
data/tool_descriptions.jsonl) and re-ran leave-one-source-out, but
scoring tools by cosine(task, description) instead of by training tool
centroids on tool names. Description text only, no name subtokens:
| held out | catalog size | top-1 | top-3 | top-3 vs random |
|---|---|---|---|---|
| Hermes function-calling-v1 | 1 911 | 41.6% | 73.5% | 468× random |
| ToolACE | 10 065 | 22.5% | 34.6% | 1 162× random |
| tau-bench | 23 | 8.7% | 19.8% | 1.5× random |
For comparison, the same setup with names scored 0% top-3 across all
three sources. So descriptions transfer, names don't. tau-bench is the
weak case because its 23 tools are domain-specific customer-service flows
that have no analog in the training corpus, but it still beats random.
Source: router/eval/baseline_loso_descriptions.py.
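The scoring rule is easy to sketch without any dependency. The snippet below is a toy, pure-Python TF-IDF with cosine scoring over tool descriptions; it is not the code in router/eval/, and the tokenizer, the smoothed IDF formula and the function names are assumptions made for the example.

```python
import math
import re
from collections import Counter


def tokenize(text):
    return re.findall(r"[a-z0-9]+", text.lower())


def fit_idf(docs):
    """Smoothed inverse document frequency over the description corpus."""
    n = len(docs)
    df = Counter(tok for doc in docs for tok in set(tokenize(doc)))
    return {tok: math.log((1 + n) / (1 + c)) + 1.0 for tok, c in df.items()}


def vectorize(text, idf):
    """L2-normalized sparse TF-IDF vector; out-of-vocabulary tokens are dropped."""
    tf = Counter(t for t in tokenize(text) if t in idf)
    vec = {t: c * idf[t] for t, c in tf.items()}
    norm = math.sqrt(sum(v * v for v in vec.values()))
    return {t: v / norm for t, v in vec.items()} if norm else {}


def cosine(a, b):
    # Both vectors are unit-length, so the dot product is the cosine.
    return sum(w * b.get(t, 0.0) for t, w in a.items())


def route_by_description(task, catalog, k=3):
    """Rank tool names by cosine(task, description)."""
    idf = fit_idf(list(catalog.values()))
    task_vec = vectorize(task, idf)
    scored = [(cosine(task_vec, vectorize(desc, idf)), name)
              for name, desc in catalog.items()]
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [name for _, name in scored[:k]]


catalog = {
    "web_search": "Search the web and return the top results.",
    "run_sql": "Execute a SQL query against the warehouse and return rows.",
    "internal_kb": "Look up a record in the internal customer knowledge base.",
}
print(route_by_description("run a SQL query on the warehouse", catalog, k=2))
```

Only the description text is indexed here; the tool name never enters the vocabulary, which is the point of the leave-one-source-out setup.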
A pre-trained sentence encoder (sentence-transformers/all-MiniLM-L6-v2) gives a different shape of result. Same protocol, descriptions only:
| held out | catalog | TF-IDF top-3 | bi-encoder top-3 | hybrid α=0.5 top-3 | hybrid (best α) top-3 |
|---|---|---|---|---|---|
| Hermes | 1 911 | 73.5% | 67.9% | 77.2% | 78.5% (α=0.7) |
| ToolACE | 10 065 | 34.6% | 58.5% | 61.8% | 62.1% (α=0.4) |
| tau-bench | 23 | 19.8% | 31.1% | 29.2% | 31.3% (α=0.1) |
The two backends are complementary: TF-IDF wins when task and description
share lexical surface (Hermes), the bi-encoder wins when the description
paraphrases the task without sharing words (ToolACE, tau-bench). A flat
hybrid 0.5·cos_tfidf + 0.5·cos_encoder Pareto-improves on Hermes and
ToolACE, with a small regression on tau-bench. Per-source-tuned α improves
all three. Source: router/eval/baseline_loso_descriptions_hybrid.py.
The bi-encoder pulls in sentence-transformers and torch (~250 MB of
deps), so we don't ship it in the default install. The TF-IDF path is the
SDK default; the encoder and hybrid backends are available behind an
optional extra: pip install agent-tool-router[encoder].
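The hybrid score is a plain linear mix of the two cosines. A minimal sketch follows; the scores are made-up illustration values, and it assumes α weights the TF-IDF term (consistent with the best α being high on Hermes, where TF-IDF is the stronger backend, but not stated explicitly above).

```python
def hybrid_scores(cos_tfidf, cos_encoder, alpha=0.5):
    """Per-tool score: alpha * cos_tfidf + (1 - alpha) * cos_encoder."""
    return {
        tool: alpha * cos_tfidf[tool] + (1.0 - alpha) * cos_encoder.get(tool, 0.0)
        for tool in cos_tfidf
    }


def rank(cos_tfidf, cos_encoder, alpha=0.5, k=3):
    scores = hybrid_scores(cos_tfidf, cos_encoder, alpha)
    return sorted(scores, key=lambda tool: (-scores[tool], tool))[:k]


# Illustrative numbers only: TF-IDF prefers the lexical match,
# the encoder prefers the paraphrase.
cos_tfidf = {"cancel_order": 0.10, "refund_order": 0.40, "track_order": 0.30}
cos_encoder = {"cancel_order": 0.80, "refund_order": 0.35, "track_order": 0.20}

print(rank(cos_tfidf, cos_encoder, alpha=0.5))  # balanced mix
print(rank(cos_tfidf, cos_encoder, alpha=1.0))  # TF-IDF only
print(rank(cos_tfidf, cos_encoder, alpha=0.0))  # encoder only
```

Sweeping alpha on a held-out set is how a per-source best α like the ones in the last table column would be found.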
Next-tool prediction (history-aware)
The retrieval router scores tools against the user query. That works when
the query describes the tool ("cancel my order" → cancelOrder). It
breaks when the agent is mid-trajectory and the next tool depends on
what was already called.
We mined 10 480 (query, history, next_tool) triplets from the 7 184
multi-turn traces in the dataset (ToolACE + Hermes + tau-bench, after
collapsing consecutive duplicates) and ran a Markov-1 rerank on top of
the shipped retrieval (top-K candidates, rerank by
α · retrieval + (1-α) · P(next | last_history_tool), 80/20 split by
trace_id, add-one smoothing, Markov-1 fit on train only).
| Setup | top-1 | top-3 | top-5 | top-10 |
|---|---|---|---|---|
| Retrieval-only | 13.8% | 32.7% | 38.8% | 45.4% |
| Markov-1 rerank top-50 (α=0.4) | 34.6% | 48.0% | 50.5% | 53.6% |
| Markov-1 rerank top-200 (α=0.1) ⬅ default | 39.0% | 54.9% | 57.7% | 60.6% |
That is +22.2pp top-3 over retrieval, and +6.9pp over the top-50 rerank baseline (48.0% → 54.9%), from widening the retrieval bucket without any new training. Stratified by position, the gap is largest deep in the trajectory: on Hermes at t≥3 the rerank pulls top-3 from ~31% to 100%; on tau-bench at t=1 it goes 7.0% → 57.7%.
Retrieval recall@K is the mechanical ceiling on any rerank that lives on top-K candidates: recall@50 = 58.9%, recall@200 = 69.6%. Markov-1 reaches ~99% of that ceiling at both bucket sizes — the rerank is essentially optimal, the bottleneck is whether the gold tool is in the candidate set at all.
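The rerank itself fits in a few lines. The sketch below is not the shipped implementation: function names and the toy traces are made up, and it only illustrates the stated recipe (first-order transition counts, add-one smoothing, score = α · retrieval + (1-α) · P(next | last_history_tool)).

```python
from collections import Counter, defaultdict


def fit_markov1(traces):
    """Count prev -> next transitions over tool-name sequences."""
    counts = defaultdict(Counter)
    vocab = set()
    for trace in traces:
        vocab.update(trace)
        for prev, nxt in zip(trace, trace[1:]):
            counts[prev][nxt] += 1
    return counts, vocab


def transition_prob(counts, vocab, prev, nxt):
    """P(next | prev) with add-one smoothing over the tool vocabulary."""
    row = counts.get(prev, Counter())
    return (row[nxt] + 1) / (sum(row.values()) + len(vocab))


def rerank(candidates, history, counts, vocab, alpha=0.1, k=3):
    """candidates: list of (tool, retrieval_score) from the top-K bucket."""
    if not history:  # no history: fall back to pure retrieval order
        ranked = sorted(candidates, key=lambda c: -c[1])
        return [tool for tool, _ in ranked[:k]]
    prev = history[-1]
    mixed = [
        (alpha * score + (1 - alpha) * transition_prob(counts, vocab, prev, tool), tool)
        for tool, score in candidates
    ]
    mixed.sort(key=lambda pair: -pair[0])
    return [tool for _, tool in mixed[:k]]


traces = [
    ["get_reservation", "update_reservation_flights", "update_reservation_baggages"],
    ["get_reservation", "update_reservation_flights", "update_reservation_baggages"],
    ["get_reservation", "cancel_reservation"],
]
counts, vocab = fit_markov1(traces)
candidates = [
    ("cancel_reservation", 0.6),
    ("update_reservation_baggages", 0.5),
    ("get_reservation", 0.2),
]
print(rerank(candidates, [], counts, vocab))
print(rerank(candidates, ["update_reservation_flights"], counts, vocab))
```

Because the rerank only reorders the candidates it is given, it can never recover a gold tool that retrieval left out of the bucket, which is why recall@K is the ceiling.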
Reproduce:
python router/eval/build_next_tool_dataset.py
python router/eval/build_next_tool_cache.py
python router/eval/eval_next_tool_markov.py # K=50 sweep
python router/eval/eval_next_tool_widen.py # K=50/100/150/200 sweep
A small learned MLP rerank (concatenation of query / prev-tool / candidate
MiniLM embeddings + retrieval score, one hidden layer of 128 units, trained
on 5K positives) was also tested in router/eval/train_next_tool_mlp.py and
eval_next_tool_mlp.py. On top-50, it lifts top-3 from 32.7% (retrieval-only)
to 40.3% (+7.6pp), but trails Markov-1 (48.0%) by 7.7pp: the counts-based
transition prior is hard to outscore with dense features on this much
training data. To actually move past the retrieval-recall ceiling, train
the retriever directly on the next-tool objective rather than stacking
more reranks on top-K.
That direct-fine-tune is what
baseline-v1-desc-hybrid-next-v1
does. Fine-tuning MiniLM-L6 on 8 386 (task, gold_description,
hard_negative) triples reshapes the retrieval space itself: recall@200
on the same held-out triplets jumps from 69.6% to 93.1%, and the
Markov-1 top-3 rerank K=200 jumps from 54.9% to 75.5% (+20.6pp).
The fine-tuned encoder also Pareto-dominates the default English-only
encoder on the full LOSO refit benchmark (Hermes +1.3pp, ToolACE +6.1pp,
tau-bench +27.7pp top-3) — supervised next-tool signal generalizes to
single-task routing too. On the parallel EN/FR n=50 panel, the
fine-tune does not cost any French (26% → 28% top-3, within noise)
and adds +4pp English top-3 (82% → 86%) over the default hybrid,
so it is a free upgrade for English-or-mixed catalogs; the multilingual
model still owns French (54% top-3). Reproduce:
python router/eval/finetune_retriever_next_tool.py (re-creates the
encoder) + python -m agent_tool_router.train_descriptions --backend hybrid --alpha 0.5 --encoder-model models/_finetune/minilm-next-v1 --out models/baseline-v1-desc-hybrid-next-v1.
The rerank ships with baseline-v1-desc-hybrid (≥ 0.3.0). Pass the
tool names already called in the trace as history= and the top-200
candidates are reranked with the Markov-1 prior at α=0.1 (the sweep-best
on the held-out test):
r = Router.from_pretrained("baseline-v1-desc-hybrid")
# Without history: pure retrieval.
r.route("I want to add a checked bag to my reservation", k=3)
# ['update_reservation_baggages', 'completeReservation', 'book_reservation']
# With history: the prior pulls in tools that usually follow the last one.
r.route(
    "I want to add a checked bag to my reservation",
    k=3,
    history=["update_reservation_flights"],
)
# ['update_reservation_baggages', 'update_reservation_passengers', 'cancel_reservation']
Override the mix with markov_alpha=0.0 (Markov-only) or 1.0
(retrieval-only). On baseline-v1-desc-hybrid the table adds ~21 KB to
the download; it's a no-op when history is omitted.
History bigram (Markov-2, ≥ 0.4.0) — passing two or more previous
tools triggers stupid backoff to a (prev2, prev1) → next bigram table
shipped in the same model dir (markov2_counts.npz + markov2_keys.npy,
~25 KB), falling back to Markov-1 when the bigram is unseen. On the
fine-tuned baseline-v1-desc-hybrid-next-v1, the bigram lifts held-out
next-tool top-1 from 52.9% to 57.3% (+4.4pp) and top-3 from 75.5%
to 77.4% (+1.9pp), with the biggest single-bucket gain on tau-bench
t≥3 (long-horizon agents): 84.8% → 87.9% top-3. On the multilingual
fine-tune the gain is larger: top-1 51.7% → 56.9% (+5.2pp), top-3
73.7% → 76.2% (+2.5pp), and tau-bench t=2 reaches 100% top-3. Reproduce
with python router/eval/eval_next_tool_markov2.py.
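For readers unfamiliar with stupid backoff, here is a toy sketch of the idea: use the (prev2, prev1) → next relative frequency when that continuation was seen, otherwise fall back to the first-order estimate scaled by a fixed factor. The factor 0.4 is the value commonly used with stupid backoff and is an assumption here, as are the function names; the sketch also skips smoothing and the mix with retrieval scores.

```python
from collections import Counter, defaultdict


def fit_counts(traces):
    """First-order (prev1 -> next) and second-order ((prev2, prev1) -> next) counts."""
    first = defaultdict(Counter)
    second = defaultdict(Counter)
    for trace in traces:
        for a, b in zip(trace, trace[1:]):
            first[a][b] += 1
        for a, b, c in zip(trace, trace[1:], trace[2:]):
            second[(a, b)][c] += 1
    return first, second


def next_tool_score(history, nxt, first, second, backoff=0.4):
    """Stupid-backoff score of `nxt` given the tools already called."""
    if len(history) >= 2:
        row = second.get((history[-2], history[-1]))
        if row and row[nxt] > 0:
            return row[nxt] / sum(row.values())  # bigram context seen
    if history:
        row = first.get(history[-1])
        if row and row[nxt] > 0:
            # Penalize only when we actually backed off from a bigram context.
            penalty = backoff if len(history) >= 2 else 1.0
            return penalty * row[nxt] / sum(row.values())
    return 0.0


traces = [
    ["find_user", "get_order", "cancel_order"],
    ["find_user", "get_order", "cancel_order"],
    ["list_orders", "get_order", "refund_order"],
]
first, second = fit_counts(traces)
print(next_tool_score(["list_orders", "get_order"], "refund_order", first, second))
print(next_tool_score(["list_orders", "get_order"], "cancel_order", first, second))
print(next_tool_score(["get_order"], "cancel_order", first, second))
```

The scores are not normalized probabilities, which is what makes the scheme "stupid" and cheap: only relative order among candidates matters for a rerank.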
Use it on your own tools
If your agent has 5 custom tools (web_search, internal_kb,
run_sql, ...) the pretrained model has never seen them. Build a router
in memory from a small seed list of your own tasks:
from agent_tool_router import Router
r = Router.from_examples([
    ("search the web for recent papers on X", ["web_search"]),
    ("look up the customer's order history", ["internal_kb"]),
    ("run a SQL query to count active users", ["run_sql"]),
    # ... ~10-30 examples per tool is enough to start
])
r.route("find the top sellers from last quarter", k=2)
# ['run_sql', 'internal_kb']
A full ~70-line example, with five mock tools and 68 seed examples, is in
examples/research_helper/. Run it:
python -m examples.research_helper.agent
If you already have OpenAI-style function specs (each tool comes with a
short natural-language description) and don't want to seed example tasks,
you can build the same kind of router directly from (name, description)
pairs. This is the same scoring rule that reached up to 73.5% top-3
cross-source (held-out Hermes) in our LOSO eval:
r = Router.from_descriptions([
    ("web_search", "Search the web and return the top results."),
    ("run_sql", "Execute a SQL query against the warehouse and return rows."),
    ("internal_kb", "Look up a record in the internal customer knowledge base."),
    # ... ideally 50+ tools; with <50 short descriptions TF-IDF is too thin
])
r.route("count how many customers churned last quarter", k=2)
For small tool sets, from_examples() works better because example tasks
fit the vectorizer on richer text.
If your tool descriptions paraphrase tasks more than they share vocabulary with them (a common case for OpenAPI / OpenAI specs), switch to the encoder or hybrid backend:
r = Router.from_descriptions(specs, backend="hybrid", alpha=0.5)
# requires: pip install agent-tool-router[encoder]
backend="tfidf" is the default and pulls no extra deps.
backend="encoder" runs sentence-transformers/all-MiniLM-L6-v2 and
scores by cosine on its embeddings. backend="hybrid" linearly combines
TF-IDF and encoder cosines: in our LOSO eval the hybrid Pareto-dominates
both solo backends with alpha=0.5 on 2/3 held-out sources (see the
table above).
On HuggingFace
Everything is mirrored on huggingface.co/dalek-ai:
| Surface | Link | What it's for |
|---|---|---|
| Space (live demo) | agent-tool-router-demo | One-click gradio app. Type a task in EN or FR, get the top-3 tools out of 18 000. No install. |
| Models (6) | dalek-ai | Pretrained routers downloadable via Router.from_pretrained(...). See models below. |
| Dataset | agent-tool-router-eval-fr | 50 parallel EN/FR evaluation queries used to measure the multilingual gap. MIT. |
Pretrained models
Router.from_pretrained("<name>") downloads from huggingface.co/dalek-ai on
first call and caches locally. Six pretrained models are published:
- baseline-v0: 265 tool names that appear ≥ 3 times in the training corpus, centroids built from task TF-IDF. The smallest model, no extra dependencies, no long-tail coverage. ~100 MB.
- baseline-v1-desc: 18 671-tool long-tail catalog, centroids built from each tool's description. TF-IDF only, ~6 MB, no extra dependencies.
- baseline-v1-desc-hybrid: same catalog, hybrid TF-IDF + bi-encoder centroids scored as 0.5 * cos_tfidf + 0.5 * cos_encoder. ~35 MB. Requires pip install agent-tool-router[encoder] at runtime (sentence-transformers + torch).
- baseline-v1-desc-hybrid-multilingual: same hybrid pipeline, but the encoder is paraphrase-multilingual-MiniLM-L12-v2 (50+ languages). On a parallel EN/FR probe (n=50 hand-written queries) routed through the shipped model, FR top-3 jumps from 26% (default hybrid) to 54% with EN top-3 flat at 82%. On the full English LOSO refit benchmark the multilingual encoder trails the default by ~3.9pp weighted overall, so use the default if all your queries are English. ~80 MB.
- baseline-v1-desc-hybrid-next-v1: same hybrid pipeline, but the encoder has been fine-tuned on 8 386 next-tool prediction triplets with a contrastive (task, gold_description, hard_negative) loss. On held-out next-tool prediction (n=2 094), Markov-1 top-3 jumps from 54.9% (default hybrid, top-200 rerank) to 75.5% (+20.6pp), and recall@200 from 69.6% to 93.1%. On the full English LOSO refit benchmark (n=30 425), the fine-tuned encoder Pareto-dominates: Hermes +1.3pp, ToolACE +6.1pp, tau-bench +27.7pp top-3. On the parallel EN/FR n=50 panel, no French degradation (26%→28% top-3, within noise) and +4pp English top-3 (82%→86%) over the default hybrid. ~30 MB. Encoder weights live separately at dalek-ai/minilm-next-v1.
- baseline-v1-desc-hybrid-multilingual-next-v1: same fine-tune recipe applied to the multilingual L12 encoder. Pareto-dominates the plain multilingual on every LOSO refit source: Hermes +7.3pp, ToolACE +4.3pp, tau-bench +33.9pp top-3. Versus the EN-only next-v1: gives up ~3pp on Hermes/ToolACE in exchange for +4.7pp tau-bench and multilingual coverage. FR/EN n=50: 54% FR top-3 preserved (no drift from the EN-only training triples), +2pp EN top-3 over plain multilingual. ~33 MB. Encoder weights at dalek-ai/multilingual-next-v1. Use this model for mixed FR/EN catalogs that also benefit from history-aware rerank.
from agent_tool_router import Router
r = Router.from_pretrained("baseline-v1-desc-hybrid")
r.route("cancel my order and refund the credit", k=3)
# -> ['refundOrder', 'cancel_order', 'cancel_pending_order']
The encoder model is lazy-loaded on the first route() call, so import
cost is paid only when actually used.
Rebuild from source
If you prefer to retrain locally instead of downloading:
git clone https://github.com/dalek-ai/agent-tool-router.git
cd agent-tool-router
pip install -e .
bash scripts/make_dataset.sh
python -m agent_tool_router.train --out models/baseline-v0
python -m agent_tool_router.train_descriptions --out models/baseline-v1-desc
python -m agent_tool_router.train_descriptions --out models/baseline-v1-desc-hybrid \
--backend hybrid --alpha 0.5 # requires the [encoder] extras
Local models/<name>/ directories take precedence over HuggingFace lookups.
Per-call top-3 accuracy of baseline-v1-desc against the full 18 671-tool
catalog, on 30 425 calls drawn from the corpus, by backend (random baseline
= 3/V = 0.016%):
| source | n calls | tfidf | encoder | hybrid α=0.5 |
|---|---|---|---|---|
| Hermes function-calling-v1 | 4 376 | 74.3% | 60.7% | 74.9% |
| ToolACE | 17 169 | 52.4% | 54.8% | 62.8% |
| tau-bench | 8 880 | 3.2% | 6.1% | 9.9% |
| overall | 30 425 | 41.2% | 41.4% | 49.1% |
The two backends look tied on overall top-3 (41.2% vs 41.4%) but they get
different things right: TF-IDF wins on Hermes (lexical surface overlap),
the bi-encoder wins on ToolACE and tau-bench (semantic paraphrase). Their
linear combination Pareto-dominates both on every source and every k, +7.9pp
overall. The encoder backend is opt-in behind pip install agent-tool-router[encoder]
(adds torch + sentence-transformers, ~250 MB).
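The overall row is the call-weighted average of the per-source rows, which is why ToolACE (56% of the calls) dominates it. A quick check that recomputes it from the table:

```python
# (n_calls, tfidf, encoder, hybrid) per source, copied from the table above.
ROWS = {
    "Hermes function-calling-v1": (4376, 74.3, 60.7, 74.9),
    "ToolACE": (17169, 52.4, 54.8, 62.8),
    "tau-bench": (8880, 3.2, 6.1, 9.9),
}
BACKENDS = ("tfidf", "encoder", "hybrid")

total_calls = sum(row[0] for row in ROWS.values())

# Weighted mean: sum(n_calls * accuracy) / sum(n_calls), per backend.
overall = {
    backend: sum(row[0] * row[i + 1] for row in ROWS.values()) / total_calls
    for i, backend in enumerate(BACKENDS)
}

print(total_calls)
for backend in BACKENDS:
    print(f"{backend}: {overall[backend]:.1f}%")
```

An unweighted mean over the three sources would give a very different picture (tau-bench would count as much as ToolACE), so it matters which one is quoted.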
Read the tau-bench row carefully: even the hybrid only reaches 9.9%. The same 23
customer-service tools score 19.8% top-3 against a restricted catalog (LOSO
eval, see above) and only 9.9% against the full v1-desc catalog because the
18 000 ToolACE and Hermes confounders win against domain-specific descriptions.
The takeaway: baseline-v1-desc is a discoverability layer for long-tail
public tools, not a substitute for routing on your own narrow catalog.
For domain-specific tool sets, use Router.from_descriptions(your_own).
Reproduce: python -m router.eval.eval_baseline_v1_desc (tfidf shipped
model), python -m router.eval.eval_v1_desc_encoder [--hybrid] (encoder /
hybrid, rebuilt from data/tool_descriptions.jsonl).
The next table repeats the same eval with the TF-IDF half refit per held-out source on N-1 sources only (the encoder is pretrained and source-agnostic). This is the realistic cross-source number: how much does the shipped pipeline lose on a brand-new source it has never seen at training time?
| held out | n calls | tfidf | encoder | hybrid α=0.5 |
|---|---|---|---|---|
| Hermes function-calling-v1 | 4 376 | 70.3% | 60.7% | 72.3% |
| ToolACE | 17 169 | 36.5% | 54.8% | 58.7% |
| tau-bench | 8 880 | 10.1% | 6.1% | 11.1% |
Hybrid Pareto-dominates both solo backends on all three held-out sources.
ToolACE TF-IDF drops sharply (52.4% → 36.5%, -15.9pp) when the vocabulary
has not seen ToolACE descriptions; the encoder catches most of the fall.
The shipped hybrid, by contrast, only loses 2.6pp on Hermes and 4.1pp on
ToolACE vs the in-distribution number, and does not regress on tau-bench
(9.9% → 11.1%). Reproduce:
python -m router.eval.eval_v1_desc_loso_hybrid.
A bigger encoder (BAAI/bge-small-en-v1.5, 33M params, ~38 MB centroids)
was tested as an alternative to MiniLM-L6 (22M params, ~29 MB). LOSO refit
top-3: Hermes 72.3% → 75.0% (+2.7pp), tau-bench 11.1% → 14.3% (+3.2pp),
ToolACE 58.7% → 54.2% (-4.5pp). Weighted overall by n_calls drops 1.3pp.
MiniLM-L6 stays the default; pass --encoder-model BAAI/bge-small-en-v1.5
to the train script if Hermes/tau-bench is your weight class.
A multilingual encoder (paraphrase-multilingual-MiniLM-L12-v2, 117M
params) was also tested. On a 50-query parallel EN/FR probe routed through
the shipped pretrained, top-3 accuracy goes from 82%/26% (default hybrid,
EN/FR) to 82%/54% — same EN coverage, +28pp on FR. On the full English
LOSO refit, the multilingual encoder costs Hermes -8.8pp, ToolACE -3.9pp
and tau-bench -1.5pp on top-3 (weighted overall -3.9pp). Shipped as
baseline-v1-desc-hybrid-multilingual for users whose queries are not all
in English. Reproduce: python -m router.eval.eval_fr_pretrained (shipped
models on the EN/FR probe) and
python -m router.eval.eval_v1_desc_loso_hybrid --encoder-model sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2.
Per-task min-max / z-score / rank normalization of cos_tfidf and
cos_enc before the linear combo was also tested (router/eval/eval_v1_desc_loso_calibration.py).
None of them Pareto-dominates the unnormalized baseline at α=0.5: minmax
and zscore each gain a couple of pp on Hermes and tau-bench but lose
7-11pp on ToolACE, because ToolACE is 86% of the catalog and per-task
rescaling amplifies same-source distractors when ToolACE is held out.
The shipped pipeline keeps the unnormalized linear combo.
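For reference, the three per-task normalizations can be sketched as follows. This is an illustration of what "per-task" rescaling means (each task's candidate scores are rescaled independently before mixing), not the code in eval_v1_desc_loso_calibration.py; function names and tie handling are assumptions.

```python
import statistics


def minmax(scores):
    """Rescale one task's candidate scores to [0, 1]."""
    lo, hi = min(scores), max(scores)
    span = hi - lo
    return [(s - lo) / span if span else 0.0 for s in scores]


def zscore(scores):
    """Center and scale one task's candidate scores to zero mean, unit variance."""
    mu = statistics.fmean(scores)
    sd = statistics.pstdev(scores)
    return [(s - mu) / sd if sd else 0.0 for s in scores]


def rank_norm(scores):
    """Replace each score by its rank, scaled so worst = 0.0 and best = 1.0."""
    n = len(scores)
    order = sorted(range(n), key=lambda i: scores[i])
    out = [0.0] * n
    for rank, i in enumerate(order):
        out[i] = rank / (n - 1) if n > 1 else 0.0
    return out


def combine(cos_tfidf, cos_enc, normalize, alpha=0.5):
    """Normalize each backend's scores for this task, then mix linearly."""
    a, b = normalize(cos_tfidf), normalize(cos_enc)
    return [alpha * x + (1 - alpha) * y for x, y in zip(a, b)]


cos_tfidf = [0.9, 0.1, 0.5]  # illustrative scores for three candidate tools
cos_enc = [0.30, 0.35, 0.32]
print(combine(cos_tfidf, cos_enc, minmax))
print(combine(cos_tfidf, cos_enc, rank_norm))
```

Note how rescaling stretches the encoder's narrow 0.30–0.35 range to the full [0, 1] interval: small differences between near-identical distractors get amplified, which is one way to read the ToolACE regression reported above.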
CLI:
python -m agent_tool_router "Book me a flight to Tokyo and a hotel near Shibuya" -k 5 --scores
The dataset (14K traces)
| source | rows | gold actions? | notes |
|---|---|---|---|
| ToolACE (Team-ACE) | 8 971 | yes | Synthetic function calls; vast vocab. |
| Hermes function-calling-v1 (NousResearch) | 2 180 | yes | OpenAI-format <tool_call> blocks. |
| tau-bench (sierra-research) | 1 980 | yes | Real GPT-4o + Sonnet-3.5 trajectories on retail/airline. |
| SWE-bench Verified | 500 | partial | Patch-only, used for completeness. |
| OSWorld | 369 | no | Task definitions only, no realized rollouts. |
All loaders normalize to a single Trace schema (see
router/index/trace_schema.py). Adding a new source means writing a loader
that yields Trace instances; nothing else changes.
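As a rough illustration of what such a loader could look like, here is a sketch with a stand-in Trace dataclass. The field names (other than trace_id, which the eval split above relies on), the input JSONL layout and the function name are all hypothetical; check router/index/trace_schema.py for the real schema before writing one.

```python
import json
from dataclasses import dataclass, field
from typing import Iterable, Iterator, List


@dataclass
class Trace:
    """Hypothetical stand-in for the real schema in router/index/trace_schema.py."""
    trace_id: str
    source: str
    task: str
    tool_calls: List[str] = field(default_factory=list)


def load_my_source(lines: Iterable[str], source: str = "my-source") -> Iterator[Trace]:
    """Yield one Trace per JSONL row of an assumed {"query": ..., "calls": [...]} format."""
    for i, line in enumerate(lines):
        line = line.strip()
        if not line:
            continue  # skip blank lines
        row = json.loads(line)
        yield Trace(
            trace_id=f"{source}-{i}",
            source=source,
            task=row["query"],
            tool_calls=[call["name"] for call in row.get("calls", [])],
        )


sample = [
    '{"query": "count active users", "calls": [{"name": "run_sql"}]}',
    '{"query": "what is our refund policy?"}',
]
for trace in load_my_source(sample):
    print(trace)
```

Keeping loaders as generators means a large source never has to fit in memory before it is written out to data/traces.jsonl.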
Roadmap
The current baseline answers "which tools, ranked" on a closed vocabulary. The interesting questions are downstream:
- Cold-start tool routing. Given a tool you've never seen, can you route to it from its description alone? This is the actual hard problem and where most of the dataset (95% singleton long-tail) is currently dead weight.
- Sequence routing. History-aware reranks shipped in baseline-v1-desc-hybrid (Markov-1 ≥ 0.3.0, Markov-2 stupid backoff ≥ 0.4.0): top-3 next-tool accuracy 32.7% → 54.9% (Markov-1) → 55.1% (Markov-2) on a held-out test set (n=2 094), ~99% of the recall@200 ceiling. Past 54.9% requires improving the retriever itself: training a retriever directly on the next-tool objective (baseline-v1-desc-hybrid-next-v1) lifts that ceiling to 75.5% top-3, and Markov-2 takes it to 77.4% (+1.9pp over Markov-1 on next-v1, +5.2pp top-1 on multilingual-next-v1). A learned MLP rerank was tested and archived (loses to Markov-1 by ~8pp top-3 on top-50; counts dominate dense features at this data scale).
- Cross-source generalization. Names don't transfer (LOSO ≈ 0%); descriptions do (LOSO ≈ 35–74% top-3 on the two held-out sources with broad catalogs; see Caveats). Next step: make Router first-class on tool descriptions, not just names. Today the SDK assumes you pass (task, [tool_names]); tomorrow it should accept (task, [(name, description)]) and pick the path automatically based on what you give it.
- Real traces. Public benchmarks are great for bootstrapping but skewed toward synthetic prompts. Opt-in trace contribution is the long-game moat.
License
MIT. Be kind.
Status
Phase 0, still figuring out the right shape of this thing. Issues / PRs welcome. Break things early.