Agent Vendor Verifier
Benchmark framework for comparing LLM agent tool-call quality across vendors.
Agent Vendor Verifier is a benchmarking framework for evaluating Agent tool-call effectiveness across correctness, schema compliance, request stability, and latency/throughput.
The benchmark aggregates these dimensions into a single, comparable fusion score (IRF), enabling fair cross-vendor comparison for agent-style tool usage.
Agent Vendor Verifier is built upon K2-Vendor-Verifier.
Why Agent Vendor Verifier?
- Multi-dimensional metrics: go beyond “was a tool called?” by measuring:
  - correctness,
  - schema compliance,
  - request success and stability,
  - latency and throughput.
- Comparable fusion score: combine heterogeneous metrics into a single score for ranking and model/vendor selection.
Metrics
For each sample, the benchmark records the finish_reason (e.g. tool_calls, stop, others) and optional tool-call validation results.
| Metric | What it Evaluates | Direction |
|---|---|---|
| F1 Score | Whether a model triggers tool calls on the right samples, compared against a designated baseline vendor | Higher is better |
| Success Rate | Whether requests successfully complete without API or runtime errors | Higher is better |
| Schema Accuracy | Whether generated tool-call arguments conform to the declared JSON Schema | Higher is better |
| Avg Token | Token usage efficiency per request (prompt + completion) | Lower is better |
| Avg TTFT | Responsiveness: time from request to first token (ms) | Lower is better |
| TPS | Generation performance during decoding (e.g. tokens/s) | Higher is better |
F1 Score
F1 score measures whether a model triggers tool calls on the correct samples, compared against a designated baseline vendor for the same model.
- A higher F1 indicates closer alignment with the baseline on when to issue tool calls.
| Model | Baseline Vendor |
|---|---|
| Claude | Anthropic |
| Gemini | |
| Deepseek | Deepseek |
| Minimax | Minimax |
| GLM | Bigmodel |
| Kimi | Moonshot |
Treat “ended with tool_calls” as positive, and “ended with stop or others” as negative.
| Case | Baseline result | Current vendor result | Meaning |
|---|---|---|---|
| TP | tool_calls | tool_calls | Both agree to trigger a tool call |
| FP | stop / others | tool_calls | False trigger |
| FN | tool_calls | stop / others | Missed trigger |
| TN | stop / others | stop / others | Both agree not to trigger |
The baseline is treated as ground truth for whether a tool call should occur.
F1 score computation:
$$ \text{Precision} = \frac{\text{TP}}{\text{TP} + \text{FP}} $$
$$ \text{Recall} = \frac{\text{TP}}{\text{TP} + \text{FN}} $$
$$ \text{F1} = \frac{2 \times \text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}} $$
Note: The baseline vendor’s F1 is fixed at 1.0.
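As a minimal sketch of this computation, each sample can be reduced to a boolean “ended with tool_calls” flag for both the baseline vendor and the vendor under test (the variable and function names below are illustrative, not the package's actual API):

```python
def f1_against_baseline(baseline_triggered, vendor_triggered):
    """Compute F1 of a vendor's tool-call decisions against the baseline vendor.

    Both arguments are lists of booleans, one entry per sample:
    True  -> the response ended with finish_reason == "tool_calls"
    False -> it ended with "stop" or any other finish_reason.
    """
    pairs = list(zip(baseline_triggered, vendor_triggered))
    tp = sum(b and v for b, v in pairs)          # both triggered
    fp = sum(not b and v for b, v in pairs)      # false trigger
    fn = sum(b and not v for b, v in pairs)      # missed trigger

    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# The baseline compared against itself produces only TP/TN, so its F1 is fixed at 1.0.
```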
Success Rate
Success rate measures the proportion of requests that complete successfully, without API errors, timeouts, or runtime failures.
Success rate computation:
$$ \text{Success Rate} = \frac{\text{successful requests}}{\text{successful requests} + \text{failed requests}} $$
Schema Accuracy
Schema accuracy measures whether the arguments generated in tool calls conform to the declared JSON Schema provided in the request.
- Schema accuracy directly reflects tool-call output quality.
Schema accuracy computation:
$$ \text{Schema Accuracy} = \frac{\text{valid tool calls}}{\text{total tool calls}} $$
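As an illustration of the idea, the generated arguments can be checked against the tool's declared `parameters` schema with the `jsonschema` package; this is only a sketch, not necessarily the validator the benchmark uses internally (the `validator` field in config.yaml selects the actual implementation):

```python
import json
from jsonschema import Draft202012Validator  # pip install jsonschema

def is_valid_tool_call(arguments_json: str, parameters_schema: dict) -> bool:
    """Return True if the tool-call arguments parse as JSON and satisfy the
    tool's declared JSON Schema (the `parameters` object of the tool spec)."""
    try:
        args = json.loads(arguments_json)
    except json.JSONDecodeError:
        return False  # malformed JSON counts as an invalid tool call
    return Draft202012Validator(parameters_schema).is_valid(args)

# Schema accuracy over a run is then:
#   sum(is_valid_tool_call(c.arguments, schema_for(c.name)) for c in calls) / len(calls)
```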
Average Tokens
Average tokens measures the average total token usage per request, including both prompt tokens and completion tokens.
- Lower values indicate better token efficiency.
Average TTFT
Average TTFT (Time to First Token) measures the average latency, in milliseconds, from request submission to the arrival of the first generated token.
- Lower values indicate faster initial response.
TPS
TPS measures the average token generation speed during the decoding phase, typically in tokens per second.
- Higher values indicate faster decoding performance.
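For clarity, the three cost/performance metrics above can be derived from per-request streaming records roughly as follows (the field names are assumptions, not the package's exact record schema):

```python
from statistics import mean

def latency_metrics(records):
    """records: per-request dicts with request_start, first_token_at, finished_at
    (seconds) plus prompt_tokens and completion_tokens."""
    records = list(records)
    avg_tokens = mean(r["prompt_tokens"] + r["completion_tokens"] for r in records)
    avg_ttft_ms = mean((r["first_token_at"] - r["request_start"]) * 1000 for r in records)
    # TPS: completion tokens divided by decoding time (first token -> end of stream).
    tps = mean(
        r["completion_tokens"] / (r["finished_at"] - r["first_token_at"])
        for r in records
        if r["finished_at"] > r["first_token_at"]
    )
    return avg_tokens, avg_ttft_ms, tps
```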
Together, these six metrics form the basis for the IRF fusion score used for final ranking.
Fusion Score: IRF (Inverse Rank Fusion)
To combine heterogeneous metrics into a single score, Agent Vendor Verifier uses Inverse Rank Fusion (IRF) across six metrics:
- F1 Score
- Success Rate
- Schema Accuracy
- Avg Token
- Avg TTFT
- TPS
Together, these metrics cover decision correctness, stability, argument correctness, cost, and performance, making IRF a balanced indicator of Agent tool-call effectiveness.
Step 1: Rank per metric
For each metric, rank all participating entities:
- Higher is better: F1, Success Rate, Schema Accuracy, TPS → descending order
- Lower is better: Avg Token, Avg TTFT → ascending order
Step 2: Assign ranks
Each entity receives a rank $r$ starting from 1 (ties handled by standard ranking rules).
Step 3: IRF contribution
For each metric where an entity has a value, compute:
$$ \text{contribution} = \frac{1}{r + k} $$
where $k = 5$, chosen so that the summed IRF score over the six metrics falls in $(0, 1]$ (an entity ranked first on every metric scores $6 \times \tfrac{1}{6} = 1$).
The final IRF score is the sum over all participating metrics:
$$ \text{IRF} = \sum_{\text{metrics}} \frac{1}{r_{\text{metric}} + 5} $$
- Better ranks across more metrics yield higher IRF scores.
- IRF scores are comparable across vendors of the same model, but not across different models.
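A compact sketch of the fusion, assuming each vendor's six metric values are collected into one dict per vendor (the metric keys are illustrative, and tie handling is simplified; the real implementation may use standard competition ranking):

```python
K = 5
HIGHER_IS_BETTER = {"f1", "success_rate", "schema_accuracy", "tps"}
LOWER_IS_BETTER = {"avg_token", "avg_ttft"}

def irf_scores(results: dict[str, dict[str, float]]) -> dict[str, float]:
    """results: {vendor_name: {metric_name: value}} for a single model.
    Returns {vendor_name: IRF score}."""
    scores = {vendor: 0.0 for vendor in results}
    for metric in HIGHER_IS_BETTER | LOWER_IS_BETTER:
        # Rank only the vendors that report this metric.
        entries = [(v, m[metric]) for v, m in results.items() if metric in m]
        reverse = metric in HIGHER_IS_BETTER   # descending for higher-is-better metrics
        entries.sort(key=lambda e: e[1], reverse=reverse)
        for rank, (vendor, _) in enumerate(entries, start=1):
            scores[vendor] += 1.0 / (rank + K)  # this metric's contribution
    return scores
```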
Performance
We evaluated the tool-call effectiveness of the latest models across multiple vendors and ranked them using the IRF score. The results are shown below:
claude-haiku-4.5
| Vendor | IRF Score | Success Rate | F1 Score | TPS | Schema Accuracy | TTFT (ms) | Avg Token |
|---|---|---|---|---|---|---|---|
| anthropic (openrouter) | 0.8635 | 1 | 1 | 99.36 | 0.9346 | 3897 | 2571 |
| foxcode | 0.8357 | 1 | 0.7513 | 94.21 | 0.9891 | 4161 | 1945 |
| packyapi | 0.8040 | 1 | 0.7590 | 103.7 | 0.8989 | 4500 | 2504 |
| yunwu | 0.7583 | 1 | 0.7356 | 22.3 | 0.9041 | 4969 | 1412 |
claude-opus-4.5
| Vendor | IRF Score | Success Rate | F1 Score | TPS | Schema Accuracy | TTFT (ms) | Avg Token |
|---|---|---|---|---|---|---|---|
| anthropic (openrouter) | 0.9217 | 1 | 1 | 50.29 | 0.9714 | 4696 | 2585 |
| packyapi | 0.8741 | 1 | 0.7959 | 53.28 | 0.9082 | 5062 | 2510 |
| yunwu | 0.8095 | 0.9967 | 0.8384 | 18.71 | 0.9024 | 6105 | 1435 |
claude-sonnet-4.5
| Vendor | IRF Score | Success Rate | F1 Score | TPS | Schema Accuracy | TTFT (ms) | Avg Token |
|---|---|---|---|---|---|---|---|
| foxcode | 0.8774 | 1 | 0.8454 | 51.94 | 0.8000 | 4985 | 1861 |
| anthropic (openrouter) | 0.8357 | 1 | 1 | 33.95 | 0.8121 | 5175 | 2398 |
| yunwu | 0.8278 | 1 | 0.8304 | 18.94 | 0.8593 | 5649 | 1399 |
| packyapi | 0.7206 | 1 | 0.8293 | 48.36 | 0.7692 | 5676 | 2458 |
deepseek-v3.2
| Vendor | IRF Score | Success Rate | F1 Score | TPS | Schema Accuracy | TTFT (ms) | Avg Token |
|---|---|---|---|---|---|---|---|
| deepseek (openrouter) | 0.8234 | 0.9870 | 1 | 62.77 | 0.9854 | 1268 | 1859 |
| google-vertex (openrouter) | 0.7885 | 0.9995 | 0.6216 | 101.9 | 0.8480 | 777.3 | 1968 |
| siliconflow (openrouter) | 0.7706 | 0.9980 | 0.7282 | 151.5 | 0.8808 | 2465 | 2927 |
| atlascloud (openrouter) | 0.7567 | 1 | 0.7257 | 81.03 | 0.8438 | 1211 | 2368 |
| siliconflow | 0.7345 | 0.9815 | 0.7988 | 15.15 | 0.8633 | 5724 | 1719 |
gemini-2.5-flash
| Vendor | IRF Score | Success Rate | F1 Score | TPS | Schema Accuracy | TTFT (ms) | Avg Token |
|---|---|---|---|---|---|---|---|
| google-vertex (openrouter) | 0.9524 | 1 | 1 | 249.3 | 0.6875 | 1651 | 1731 |
| gemini (openrouter) | 0.9048 | 0.9933 | 0.8810 | 216.1 | 0.6897 | 2626 | 1658 |
glm-4.7
| Vendor | IRF Score | Success Rate | F1 Score | TPS | Schema Accuracy | TTFT (ms) | Avg Token |
|---|---|---|---|---|---|---|---|
| bigmodel | 0.9107 | 0.9725 | 1 | 123.4 | 0.8409 | 2243 | 1789 |
| z.ai (openrouter) | 0.8512 | 0.9980 | 0.8575 | 246 | 0.8291 | 4986 | 2303 |
| atlascloud (openrouter) | 0.8452 | 0.9975 | 0.8624 | 103.8 | 0.8327 | 1914 | 2311 |
kimi-k2
| Vendor | IRF Score | Success Rate | F1 Score | TPS | Schema Accuracy | TTFT (ms) | Avg Token |
|---|---|---|---|---|---|---|---|
| siliconflow | 0.9158 | 1 | 0.8427 | 58.23 | 1 | 1313 | 1581 |
| siliconflow_OR | 0.8690 | 0.9975 | 0.8355 | 92.81 | 0.9985 | 1115 | 1816 |
| moonshot ai_OR | 0.8205 | 0.9905 | 1 | 46.39 | 1 | 2662 | 1857 |
minimax-m2
| Vendor | IRF Score | Success Rate | F1 Score | TPS | Schema Accuracy | TTFT (ms) | Avg Token |
|---|---|---|---|---|---|---|---|
| google-vertex (openrouter) | 0.9217 | 0.9990 | 0.7662 | 347.6 | 1 | 491.7 | 2556 |
| minimax (openrouter) | 0.8512 | 0.9980 | 1 | 147.2 | 0.8205 | 1483 | 2278 |
| atlascloud (openrouter) | 0.8324 | 0.9960 | 0.7622 | 152.6 | 1 | 745.9 | 2443 |
Dataset
We modified the open-source dataset from K2-Vendor-Verifier to fix errors that occurred when running it against other models and vendors. Dataset distribution details will be published separately.
Supported Vendors
Agent Vendor Verifier currently supports the following vendors:
- Openrouter
- Gemini
- Siliconflow
- 429.icu
- packyapi
- yunwu
- foxcode
New vendors can be added by inheriting from the Vendor base class; see Vendor-base and the sketch below.
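As a rough, hypothetical illustration only (the import path and method below are assumptions; the actual interface is defined by the project's Vendor base class), a custom vendor for an OpenAI-compatible endpoint might look like:

```python
# Hypothetical sketch -- the real Vendor base class and its method names may differ.
from agent_vendor_verifier.vendor import Vendor  # import path is an assumption

class MyVendor(Vendor):
    """Custom vendor that talks to an OpenAI-compatible chat-completions endpoint."""

    name = "my-vendor"                       # should match the key used in config.yaml / .env
    base_url = "https://api.example.com/v1"  # illustrative endpoint

    def build_headers(self, api_key: str) -> dict:
        # Most OpenAI-compatible vendors expect a Bearer token.
        return {"Authorization": f"Bearer {api_key}"}
```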
Supported Models
To address that different vendors may use different names for the same model, Agent Vendor Verifier provides a unified model naming scheme. The models currently supported are listed below.
| Model Series | Unified Model Name |
|---|---|
| Gemini | gemini-2.5-flash |
| Claude | claude-sonnet-4.5, claude-opus-4.5, claude-haiku-4.5 |
| DeepSeek | deepseek-v3.2 |
| GLM | glm-4.7 |
| MiniMax | minimax-m2.5 |
| Kimi | kimi-k2.5 |
When using Agent Vendor Verifier for evaluation, you only need to consider the model names listed in the table above.
Installation
pip install agent-vendor-verifier
The package requires Python 3.10 or newer and exposes the agent-vendor-verifier command-line entry point.
Agent Vendor Verifier Usage
Prepare:
- Dataset: See Dataset.
- config.yaml: Define model and vendor in benchmark.
- .env: Define API keys.
Then run:
```bash
agent-vendor-verifier \
  --test-file-path "tool-calls/samples.jsonl" \
  --config-file-path "config.yaml" \
  --vendor-concurrency 5 \
  --request-concurrency 30 \
  --retries 10 \
  --timeout 30
```
Configuration
Define model and vendor in benchmark
config.yaml maps each unified model name to the vendor configurations evaluated for that model.
Top-level keys are model names. Each model contains a vendors list with the request settings for every vendor endpoint.
We provide an example, see config_example:
```yaml
minimax-m2.5:
  vendors:
    - name: openrouter
      model_id: minimax/minimax-m2.5
      api_key: xxxxxx
      provider: minimax/
      url: https://openrouter.ai/api/v1
      validator: openai
      is_baseline: true
```
Model names must follow this project’s unified naming convention (e.g. gemini-2.5-flash, claude-sonnet-4.5, deepseek-v3.2).
See supported models for details.
Define API keys
API keys are provided via .env.
- Variable names must match vendor names in config.yaml
Example:
```
openrouter=sk-or-v1-xxxx
siliconflow=sk-xxxx
bigmodel=xxxx
gemini=xxxx
```
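For example, a key defined this way could be read at runtime as shown below (assuming python-dotenv or an equivalent loader; the snippet only illustrates the naming convention, not the package's internal loading code):

```python
import os
from dotenv import load_dotenv  # pip install python-dotenv

load_dotenv()  # reads .env from the working directory into the process environment
# The environment variable name matches the vendor name used in config.yaml.
openrouter_key = os.environ["openrouter"]
```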