
Agent Vendor Verifier

Agent Vendor Verifier is a benchmarking framework for comparing LLM agent tool-call quality across vendors. For each model it measures tool-call correctness, schema compliance, request success and stability, and latency and throughput, then aggregates these dimensions into a single, comparable fusion score (IRF), enabling fair cross-vendor comparison of agent-style tool usage.

Agent Vendor Verifier is built upon K2-Vendor-Verifier.

Why Agent Vendor Verifier?

  • Multi-dimensional metrics: Go beyond “was a tool called” by measuring:
    • correctness,
    • schema compliance,
    • request success and stability,
    • latency and throughput.
  • Comparable fusion score: Combine heterogeneous metrics into a single score for ranking and model/vendor selection.

Metrics

For each sample, the benchmark records the finish_reason (e.g. tool_calls, stop, others) and optional tool-call validation results.

| Metric          | What it Evaluates                                                                                        | Direction        |
| --------------- | -------------------------------------------------------------------------------------------------------- | ---------------- |
| F1 Score        | Whether a model triggers tool calls on the right samples, compared against a designated baseline vendor   | Higher is better |
| Success Rate    | Whether requests successfully complete without API or runtime errors                                      | Higher is better |
| Schema Accuracy | Whether generated tool-call arguments conform to the declared JSON Schema                                 | Higher is better |
| Avg Token       | Token usage efficiency per request (prompt + completion)                                                  | Lower is better  |
| Avg TTFT        | Responsiveness: time from request to first token (ms)                                                     | Lower is better  |
| TPS             | Generation performance during decoding (e.g. tokens/s)                                                    | Higher is better |

F1 Score

F1 score measures whether a model triggers tool calls on the correct samples, compared against a designated baseline vendor for the same model.

  • A higher F1 indicates closer alignment with the baseline on when to issue tool calls.

| Model    | Baseline Vendor |
| -------- | --------------- |
| Claude   | Anthropic       |
| Gemini   | Google          |
| DeepSeek | DeepSeek        |
| MiniMax  | MiniMax         |
| GLM      | Bigmodel        |
| Kimi     | Moonshot        |

A sample whose response ends with tool_calls is treated as positive; one that ends with stop or others is treated as negative.

| Case | Baseline result | Current vendor result | Meaning                           |
| ---- | --------------- | --------------------- | --------------------------------- |
| TP   | tool_calls      | tool_calls            | Both agree to trigger a tool call |
| FP   | stop / others   | tool_calls            | False trigger                     |
| FN   | tool_calls      | stop / others         | Missed trigger                    |
| TN   | stop / others   | stop / others         | Both agree not to trigger         |

The baseline is treated as ground truth for whether a tool call should occur.

F1 score computation:

$$ \text{Precision} = \frac{\text{TP}}{\text{TP} + \text{FP}} $$

$$ \text{Recall} = \frac{\text{TP}}{\text{TP} + \text{FN}} $$

$$ \text{F1} = \frac{2 \times \text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}} $$

Note: the baseline vendor’s F1 is 1.0 by construction, since it is compared against itself.
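
As a minimal sketch (the function and variable names here are ours for illustration, not the package’s API), precision, recall, and F1 can be computed directly from two aligned lists of finish_reason values:

```python
def f1_against_baseline(baseline_reasons, vendor_reasons):
    """Tool-call F1, treating the baseline vendor's behavior as ground truth.

    Both arguments are per-sample finish_reason strings; "tool_calls" is the
    positive class, anything else ("stop", "length", ...) is negative.
    """
    tp = fp = fn = 0
    for base, vend in zip(baseline_reasons, vendor_reasons):
        base_pos = base == "tool_calls"
        vend_pos = vend == "tool_calls"
        if base_pos and vend_pos:
            tp += 1  # both trigger (TP)
        elif vend_pos:
            fp += 1  # false trigger (FP)
        elif base_pos:
            fn += 1  # missed trigger (FN)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# The baseline triggers on samples 1 and 3; this vendor misses sample 3.
print(f1_against_baseline(
    ["tool_calls", "stop", "tool_calls"],
    ["tool_calls", "stop", "stop"],
))  # -> 0.667 (precision 1.0, recall 0.5)
```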

Success Rate

Success rate measures the proportion of requests that complete successfully, without API errors, timeouts, or runtime failures.

Success rate computation:

$$ \text{Success Rate} = \frac{\text{successful requests}}{\text{successful requests} + \text{failed requests}} $$
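
For example, if 597 of 600 requests complete successfully, the success rate is 597 / 600 = 0.995.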

Schema Accuracy

Schema accuracy measures whether the arguments generated in tool calls conform to the declared JSON Schema provided in the request.

  • Schema accuracy directly reflects tool-call output quality.

Schema accuracy computation:

$$ \text{Schema Accuracy} = \frac{\text{valid tool calls}}{\text{total tool calls}} $$
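
As an illustrative check (using the jsonschema package and a hypothetical get_weather tool; the validator actually configured in the framework may differ), a tool call counts as valid when its arguments parse as JSON and satisfy the declared schema:

```python
import json

from jsonschema import ValidationError, validate

# Declared parameter schema for a hypothetical get_weather tool.
weather_schema = {
    "type": "object",
    "properties": {
        "city": {"type": "string"},
        "unit": {"type": "string", "enum": ["celsius", "fahrenheit"]},
    },
    "required": ["city"],
}


def is_valid_tool_call(arguments_json: str, schema: dict) -> bool:
    """True if the generated arguments parse as JSON and match the schema."""
    try:
        validate(instance=json.loads(arguments_json), schema=schema)
        return True
    except (json.JSONDecodeError, ValidationError):
        return False


print(is_valid_tool_call('{"city": "Paris", "unit": "celsius"}', weather_schema))  # True
print(is_valid_tool_call('{"unit": "kelvin"}', weather_schema))                    # False
```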

Average Tokens

Average tokens measures the average total token usage per request, including both prompt tokens and completion tokens.

  • Lower values indicate better token efficiency.

Average TTFT

Average TTFT (Time to First Token) measures the average latency, in milliseconds, from request submission to the arrival of the first generated token.

  • Lower values indicate faster initial response.

TPS

TPS measures the average token generation speed during the decoding phase, typically in tokens per second.

  • Higher values indicate faster decoding performance.
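
For intuition, here is one way to measure TTFT and approximate TPS with a streaming request through the openai Python client against an OpenAI-compatible endpoint. This is an illustrative sketch, not the framework’s measurement code: it approximates one token per streamed chunk, whereas exact counts come from the response’s usage statistics.

```python
import time

from openai import OpenAI

client = OpenAI(base_url="https://openrouter.ai/api/v1", api_key="sk-or-v1-xxxx")

start = time.perf_counter()
first_token_at = None
chunks = 0

stream = client.chat.completions.create(
    model="minimax/minimax-m2.5",
    messages=[{"role": "user", "content": "What's the weather in Paris?"}],
    stream=True,
)
for chunk in stream:
    if first_token_at is None:
        first_token_at = time.perf_counter()  # first streamed chunk marks TTFT
    chunks += 1  # rough proxy: one chunk ~ one token

ttft_ms = (first_token_at - start) * 1000
decode_time = max(time.perf_counter() - first_token_at, 1e-9)
print(f"TTFT: {ttft_ms:.0f} ms, TPS: ~{chunks / decode_time:.1f} tok/s")
```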

Together, these six metrics form the basis for the IRF fusion score used for final ranking.

Fusion Score: IRF (Inverse Rank Fusion)

To combine heterogeneous metrics into a single score, Agent Vendor Verifier uses Inverse Rank Fusion (IRF) across six metrics:

  • F1 Score
  • Success Rate
  • Schema Accuracy
  • Avg Token
  • Avg TTFT
  • TPS

Together they cover trigger correctness, stability, argument validity, cost, and speed, making IRF a balanced indicator of agent tool-call effectiveness.

Step 1: Rank per metric

For each metric, rank all participating entities:

  • Higher is better: F1, Success Rate, Schema Accuracy, TPS → descending order
  • Lower is better: Avg Token, Avg TTFT → ascending order

Step 2: Assign ranks

Each entity receives a rank $r$ starting from 1 (ties handled by standard ranking rules).

Step 3: IRF contribution

For each metric where an entity has a value, compute:

$$ \text{contribution} = \frac{1}{r + k} $$

where $k = 5$. With six metrics, an entity ranked first on every one scores $6 \times \frac{1}{1+5} = 1$, so the total IRF score falls in $(0, 1]$.

The final IRF score is the sum over all participating metrics:

$$ \text{IRF} = \sum_{\text{metrics}} \frac{1}{r_{\text{metric}} + 5} $$

  • Better ranks across more metrics yield higher IRF scores.
  • IRF scores are comparable among vendors serving the same model; they are not comparable across models.
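
Putting the three steps together, the whole procedure fits in a few lines. The sketch below is illustrative, not the package’s implementation (it also ignores tie handling for brevity):

```python
K = 5  # a clean sweep of six first places gives 6 * 1/(1+5) = 1.0

# (metric name, higher_is_better) for the six fused metrics.
METRICS = [
    ("f1", True), ("success_rate", True), ("schema_acc", True),
    ("avg_token", False), ("avg_ttft", False), ("tps", True),
]


def irf_scores(vendors):
    """vendors: {vendor_name: {metric_name: value}} -> {vendor_name: IRF}."""
    scores = {name: 0.0 for name in vendors}
    for metric, higher_better in METRICS:
        ranked = sorted(vendors, key=lambda v: vendors[v][metric],
                        reverse=higher_better)
        for rank, name in enumerate(ranked, start=1):
            scores[name] += 1.0 / (rank + K)
    return scores


vendors = {
    "vendor_a": {"f1": 1.00, "success_rate": 1.0, "schema_acc": 0.93,
                 "avg_token": 2571, "avg_ttft": 3897, "tps": 99.4},
    "vendor_b": {"f1": 0.75, "success_rate": 1.0, "schema_acc": 0.99,
                 "avg_token": 1945, "avg_ttft": 4161, "tps": 94.2},
}
print(irf_scores(vendors))  # vendor_a wins more metrics, so it scores higher
```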

Performance

We evaluated the tool-call effectiveness of the latest models across multiple vendors and ranked them using the IRF score. The results are shown below:

claude-haiku-4.5

| Vendor                 | IRF Score | Success Rate | F1 Score | TPS   | Schema Accuracy | TTFT (ms) | Avg Token |
| ---------------------- | --------- | ------------ | -------- | ----- | --------------- | --------- | --------- |
| anthropic (openrouter) | 0.8635    | 1            | 1        | 99.36 | 0.9346          | 3897      | 2571      |
| foxcode                | 0.8357    | 1            | 0.7513   | 94.21 | 0.9891          | 4161      | 1945      |
| packyapi               | 0.8040    | 1            | 0.7590   | 103.7 | 0.8989          | 4500      | 2504      |
| yunwu                  | 0.7583    | 1            | 0.7356   | 22.3  | 0.9041          | 4969      | 1412      |

claude-opus-4.5

| Vendor                 | IRF Score | Success Rate | F1 Score | TPS   | Schema Accuracy | TTFT (ms) | Avg Token |
| ---------------------- | --------- | ------------ | -------- | ----- | --------------- | --------- | --------- |
| anthropic (openrouter) | 0.9217    | 1            | 1        | 50.29 | 0.9714          | 4696      | 2585      |
| packyapi               | 0.8741    | 1            | 0.7959   | 53.28 | 0.9082          | 5062      | 2510      |
| yunwu                  | 0.8095    | 0.9967       | 0.8384   | 18.71 | 0.9024          | 6105      | 1435      |

claude-sonnet-4.5

| Vendor                 | IRF Score | Success Rate | F1 Score | TPS   | Schema Accuracy | TTFT (ms) | Avg Token |
| ---------------------- | --------- | ------------ | -------- | ----- | --------------- | --------- | --------- |
| foxcode                | 0.8774    | 1            | 0.8454   | 51.94 | 0.8000          | 4985      | 1861      |
| anthropic (openrouter) | 0.8357    | 1            | 1        | 33.95 | 0.8121          | 5175      | 2398      |
| yunwu                  | 0.8278    | 1            | 0.8304   | 18.94 | 0.8593          | 5649      | 1399      |
| packyapi               | 0.7206    | 1            | 0.8293   | 48.36 | 0.7692          | 5676      | 2458      |

deepseek-v3.2

| Vendor                     | IRF Score | Success Rate | F1 Score | TPS   | Schema Accuracy | TTFT (ms) | Avg Token |
| -------------------------- | --------- | ------------ | -------- | ----- | --------------- | --------- | --------- |
| deepseek (openrouter)      | 0.8234    | 0.9870       | 1        | 62.77 | 0.9854          | 1268      | 1859      |
| google-vertex (openrouter) | 0.7885    | 0.9995       | 0.6216   | 101.9 | 0.8480          | 777.3     | 1968      |
| siliconflow (openrouter)   | 0.7706    | 0.9980       | 0.7282   | 151.5 | 0.8808          | 2465      | 2927      |
| atlascloud (openrouter)    | 0.7567    | 1            | 0.7257   | 81.03 | 0.8438          | 1211      | 2368      |
| siliconflow                | 0.7345    | 0.9815       | 0.7988   | 15.15 | 0.8633          | 5724      | 1719      |

gemini-2.5-flash

| Vendor                     | IRF Score | Success Rate | F1 Score | TPS   | Schema Accuracy | TTFT (ms) | Avg Token |
| -------------------------- | --------- | ------------ | -------- | ----- | --------------- | --------- | --------- |
| google-vertex (openrouter) | 0.9524    | 1            | 1        | 249.3 | 0.6875          | 1651      | 1731      |
| gemini (openrouter)        | 0.9048    | 0.9933       | 0.8810   | 216.1 | 0.6897          | 2626      | 1658      |

glm-4.7

| Vendor                  | IRF Score | Success Rate | F1 Score | TPS   | Schema Accuracy | TTFT (ms) | Avg Token |
| ----------------------- | --------- | ------------ | -------- | ----- | --------------- | --------- | --------- |
| bigmodel                | 0.9107    | 0.9725       | 1        | 123.4 | 0.8409          | 2243      | 1789      |
| z.ai (openrouter)       | 0.8512    | 0.9980       | 0.8575   | 246   | 0.8291          | 4986      | 2303      |
| atlascloud (openrouter) | 0.8452    | 0.9975       | 0.8624   | 103.8 | 0.8327          | 1914      | 2311      |

kimi-k2

| Vendor                   | IRF Score | Success Rate | F1 Score | TPS   | Schema Accuracy | TTFT (ms) | Avg Token |
| ------------------------ | --------- | ------------ | -------- | ----- | --------------- | --------- | --------- |
| siliconflow              | 0.9158    | 1            | 0.8427   | 58.23 | 1               | 1313      | 1581      |
| siliconflow (openrouter) | 0.8690    | 0.9975       | 0.8355   | 92.81 | 0.9985          | 1115      | 1816      |
| moonshot ai (openrouter) | 0.8205    | 0.9905       | 1        | 46.39 | 1               | 2662      | 1857      |

minimax-m2

| Vendor                     | IRF Score | Success Rate | F1 Score | TPS   | Schema Accuracy | TTFT (ms) | Avg Token |
| -------------------------- | --------- | ------------ | -------- | ----- | --------------- | --------- | --------- |
| google-vertex (openrouter) | 0.9217    | 0.9990       | 0.7662   | 347.6 | 1               | 491.7     | 2556      |
| minimax (openrouter)       | 0.8512    | 0.9980       | 1        | 147.2 | 0.8205          | 1483      | 2278      |
| atlascloud (openrouter)    | 0.8324    | 0.9960       | 0.7622   | 152.6 | 1               | 745.9     | 2443      |

Dataset

We modified the open-sourced dataset from K2-Vendor-Verifier to fix errors that occurred when running it against other models and vendors. The dataset distribution details will be published separately.

Supported Vendors

Agent Vendor Verifier currently supports the following vendors:

  • Openrouter
  • Gemini
  • Siliconflow
  • 429.icu
  • packyapi
  • yunwu
  • foxcode

New vendors can be added easily by inheriting from the Vendor base class; see Vendor-base and the sketch below.
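
As a rough, hypothetical sketch only (the import path, attributes, and method below are our assumptions for illustration; consult Vendor-base for the real interface):

```python
# Hypothetical sketch: names below are illustrative assumptions, not the
# package's actual Vendor interface.
from agent_vendor_verifier import Vendor  # assumed import path


class MyVendor(Vendor):
    name = "myvendor"  # should match the API-key variable name in .env
    url = "https://api.myvendor.example/v1"  # OpenAI-compatible endpoint

    def build_request(self, sample):
        """Translate a benchmark sample into this vendor's request payload."""
        return {
            "model": self.model_id,
            "messages": sample["messages"],
            "tools": sample.get("tools", []),
        }
```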

Supported Models

Because different vendors may use different names for the same model, Agent Vendor Verifier provides a unified model naming scheme. The currently supported models are listed below.

| Model Series | Unified Model Name |
| ------------ | ------------------ |
| Gemini       | gemini-2.5-flash   |
| Claude       | claude-sonnet-4.5  |
|              | claude-opus-4.5    |
|              | claude-haiku-4.5   |
| DeepSeek     | deepseek-v3.2      |
| GLM          | glm-4.7            |
| MiniMax      | minimax-m2.5       |
| Kimi         | kimi-k2.5          |

When using Agent Vendor Verifier for evaluation, you only need to consider the model names listed in the table above.

Installation

pip install agent-vendor-verifier

The package requires Python 3.10 or newer and exposes the agent-vendor-verifier command-line entry point.

Agent Vendor Verifier Usage

Prepare:

  • Dataset: See Dataset.
  • config.yaml: Define model and vendor in benchmark.
  • .env: Define API keys.

Then run:

agent-vendor-verifier \
  --test-file-path "tool-calls/samples.jsonl" \
  --config-file-path "config.yaml" \
  --vendor-concurrency 5 \
  --request-concurrency 30 \
  --retries 10 \
  --timeout 30

Configuration

Define model and vendor in benchmark

config.yaml maps each unified model name to the vendor configurations evaluated for that model.

Top-level keys are model names. Each model contains a vendors list with the request settings for every vendor endpoint.

We provide an example; see config_example:

minimax-m2.5:
  vendors:
    - name: openrouter
      model_id: minimax/minimax-m2.5
      api_key: xxxxxx
      provider: minimax/
      url: https://openrouter.ai/api/v1
      validator: openai
      is_baseline: true

Model names must follow this project’s unified naming convention (e.g. gemini-2.5-flash, claude-sonnet-4.5, deepseek-v3.2).

See supported models for details.

Define API keys

API keys are provided via .env.

  • Variable names must match vendor names in config.yaml

Example:

openrouter=sk-or-v1-xxxx
siliconflow=sk-xxxx
bigmodel=xxxx
gemini=xxxx
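
For reference, this is how such a file is typically consumed in Python with python-dotenv (a sketch of the convention, assuming keys are looked up by vendor name; not necessarily the package’s exact loading code):

```python
import os

from dotenv import load_dotenv  # pip install python-dotenv

load_dotenv()  # reads .env from the current working directory
api_key = os.environ["openrouter"]  # variable name matches the vendor name
```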
