Agent Vendor Verifier
Benchmark framework for comparing LLM agent tool-call quality across vendors.
Agent Vendor Verifier is a benchmarking framework for evaluating Agent tool-call effectiveness across correctness, schema compliance, request stability, and latency/throughput.
The benchmark aggregates these dimensions into a single, comparable fusion score (IRF), enabling fair cross-vendor comparison for agent-style tool usage.
Agent Vendor Verifier is built upon K2-Vendor-Verifier.
Why Agent Vendor Verifier?
- Multi-dimensional metrics: go beyond “was a tool called?” by measuring:
  - correctness,
  - schema compliance,
  - request success and stability,
  - latency and throughput.
- Comparable fusion score: combine heterogeneous metrics into a single score for ranking and model/vendor selection.
Metrics
For each sample, the benchmark records the finish_reason (e.g. tool_calls, stop, others) and optional tool-call validation results.
| Metric | What it Evaluates | Direction |
|---|---|---|
| F1 Score | Whether a model triggers tool calls on the right samples, compared against a designated baseline vendor | Higher is better |
| Success Rate | Whether requests successfully complete without API or runtime errors | Higher is better |
| Schema Accuracy | Whether generated tool-call arguments conform to the declared JSON Schema | Higher is better |
| Avg Token | Token usage efficiency per request (prompt + completion) | Lower is better |
| Avg TTFT | Responsiveness: time from request to first token (ms) | Lower is better |
| TPS | Generation performance during decoding (e.g. tokens/s) | Higher is better |
F1 Score
F1 score measures whether a model triggers tool calls on the correct samples, compared against a designated baseline vendor for the same model.
- A higher F1 indicates closer alignment with the baseline on when to issue tool calls.
| Model | Baseline Vendor |
|---|---|
| Claude | Anthropic |
| Gemini | |
| Deepseek | Deepseek |
| Minimax | Minimax |
| GLM | Bigmodel |
| Kimi | Moonshot |
Treat “ended with tool_calls” as positive, and “ended with stop or others” as negative.
| Case | Baseline result | Current vendor result | Meaning |
|---|---|---|---|
| TP | tool_calls | tool_calls | Both agree to trigger a tool call |
| FP | stop / others | tool_calls | False trigger |
| FN | tool_calls | stop / others | Missed trigger |
| TN | stop / others | stop / others | Both agree not to trigger |
The baseline is treated as ground truth for whether a tool call should occur.
F1 score computation:
$$ \text{Precision} = \frac{\text{TP}}{\text{TP} + \text{FP}} $$
$$ \text{Recall} = \frac{\text{TP}}{\text{TP} + \text{FN}} $$
$$ \text{F1} = \frac{2 \times \text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}} $$
Note: The baseline vendor’s F1 is fixed at 1.0.
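As a minimal sketch of this computation, each sample can be reduced to a boolean “ended with tool_calls” flag for both the baseline vendor and the vendor under test (the variable and function names below are illustrative, not the package's actual API):

```python
def f1_against_baseline(baseline_triggered, vendor_triggered):
    """Compute F1 of a vendor's tool-call decisions against the baseline vendor.

    Both arguments are lists of booleans, one entry per sample:
    True  -> the response ended with finish_reason == "tool_calls"
    False -> it ended with "stop" or any other finish_reason.
    """
    pairs = list(zip(baseline_triggered, vendor_triggered))
    tp = sum(b and v for b, v in pairs)          # both triggered
    fp = sum(not b and v for b, v in pairs)      # false trigger
    fn = sum(b and not v for b, v in pairs)      # missed trigger

    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# The baseline compared against itself produces only TP/TN, so its F1 is fixed at 1.0.
```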
Success Rate
Success rate measures the proportion of requests that complete successfully, without API errors, timeouts, or runtime failures.
Success rate computation:
$$ \text{Success Rate} = \frac{\text{successful requests}}{\text{successful requests} + \text{failed requests}} $$
Schema Accuracy
Schema accuracy measures whether the arguments generated in tool calls conform to the declared JSON Schema provided in the request.
- Schema accuracy directly reflects tool-call output quality.
Schema accuracy computation:
$$ \text{Schema Accuracy} = \frac{\text{valid tool calls}}{\text{total tool calls}} $$
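As an illustration of the idea, the generated arguments can be checked against the tool's declared `parameters` schema with the `jsonschema` package; this is only a sketch, not necessarily the validator the benchmark uses internally (the `validator` field in config.yaml selects the actual implementation):

```python
import json
from jsonschema import Draft202012Validator  # pip install jsonschema

def is_valid_tool_call(arguments_json: str, parameters_schema: dict) -> bool:
    """Return True if the tool-call arguments parse as JSON and satisfy the
    tool's declared JSON Schema (the `parameters` object of the tool spec)."""
    try:
        args = json.loads(arguments_json)
    except json.JSONDecodeError:
        return False  # malformed JSON counts as an invalid tool call
    return Draft202012Validator(parameters_schema).is_valid(args)

# Schema accuracy over a run is then:
#   sum(is_valid_tool_call(c.arguments, schema_for(c.name)) for c in calls) / len(calls)
```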
Average Tokens
Average tokens measures the average total token usage per request, including both prompt tokens and completion tokens.
- Lower values indicate better token efficiency.
Average TTFT
Average TTFT (Time to First Token) measures the average latency, in milliseconds, from request submission to the arrival of the first generated token.
- Lower values indicate faster initial response.
TPS
TPS measures the average token generation speed during the decoding phase, typically in tokens per second.
- Higher values indicate faster decoding performance.
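For clarity, the three cost/performance metrics above can be derived from per-request streaming records roughly as follows (the field names are assumptions, not the package's exact record schema):

```python
from statistics import mean

def latency_metrics(records):
    """records: per-request dicts with request_start, first_token_at, finished_at
    (seconds) plus prompt_tokens and completion_tokens."""
    records = list(records)
    avg_tokens = mean(r["prompt_tokens"] + r["completion_tokens"] for r in records)
    avg_ttft_ms = mean((r["first_token_at"] - r["request_start"]) * 1000 for r in records)
    # TPS: completion tokens divided by decoding time (first token -> end of stream).
    tps = mean(
        r["completion_tokens"] / (r["finished_at"] - r["first_token_at"])
        for r in records
        if r["finished_at"] > r["first_token_at"]
    )
    return avg_tokens, avg_ttft_ms, tps
```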
Together, these six metrics form the basis for the IRF fusion score used for final ranking.
Fusion Score: IRF (Inverse Rank Fusion)
To combine heterogeneous metrics into a single score, Agent Vendor Verifier uses Inverse Rank Fusion (IRF) across six metrics:
- F1 Score
- Success Rate
- Schema Accuracy
- Avg Token
- Avg TTFT
- TPS
Together, these metrics cover decision correctness, stability, argument correctness, cost, and performance, making IRF a balanced indicator of Agent tool-call effectiveness.
Step 1: Rank per metric
For each metric, rank all participating entities:
- Higher is better: F1, Success Rate, Schema Accuracy, TPS → descending order
- Lower is better: Avg Token, Avg TTFT → ascending order
Step 2: Assign ranks
Each entity receives a rank $r$ starting from 1 (ties handled by standard ranking rules).
Step 3: IRF contribution
For each metric where an entity has a value, compute:
$$ \text{contribution} = \frac{1}{r + k} $$
where $k = 5$, chosen so that the summed IRF score over the six metrics falls in $(0, 1]$ (an entity ranked first on every metric scores $6 \times \tfrac{1}{6} = 1$).
The final IRF score is the sum over all participating metrics:
$$ \text{IRF} = \sum_{\text{metrics}} \frac{1}{r_{\text{metric}} + 5} $$
- Better ranks across more metrics yield higher IRF scores.
- IRF scores are comparable across vendors of the same model, but not across different models.
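A compact sketch of the fusion, assuming each vendor's six metric values are collected into one dict per vendor (the metric keys are illustrative, and tie handling is simplified; the real implementation may use standard competition ranking):

```python
K = 5
HIGHER_IS_BETTER = {"f1", "success_rate", "schema_accuracy", "tps"}
LOWER_IS_BETTER = {"avg_token", "avg_ttft"}

def irf_scores(results: dict[str, dict[str, float]]) -> dict[str, float]:
    """results: {vendor_name: {metric_name: value}} for a single model.
    Returns {vendor_name: IRF score}."""
    scores = {vendor: 0.0 for vendor in results}
    for metric in HIGHER_IS_BETTER | LOWER_IS_BETTER:
        # Rank only the vendors that report this metric.
        entries = [(v, m[metric]) for v, m in results.items() if metric in m]
        reverse = metric in HIGHER_IS_BETTER   # descending for higher-is-better metrics
        entries.sort(key=lambda e: e[1], reverse=reverse)
        for rank, (vendor, _) in enumerate(entries, start=1):
            scores[vendor] += 1.0 / (rank + K)  # this metric's contribution
    return scores
```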
Performance
We evaluated the tool-call effectiveness of the latest models across multiple vendors and ranked them using the IRF score. The results are shown below:
claude-haiku-4.5
| Vendor | IRF Score | Success Rate | F1 Score | TPS | Schema Accuracy | TTFT (ms) | Avg Token |
|---|---|---|---|---|---|---|---|
| anthropic (openrouter) | 0.8635 | 1 | 1 | 99.36 | 0.9346 | 3897 | 2571 |
| foxcode | 0.8357 | 1 | 0.7513 | 94.21 | 0.9891 | 4161 | 1945 |
| packyapi | 0.8040 | 1 | 0.7590 | 103.7 | 0.8989 | 4500 | 2504 |
| yunwu | 0.7583 | 1 | 0.7356 | 22.3 | 0.9041 | 4969 | 1412 |
claude-opus-4.5
| Vendor | IRF Score | Success Rate | F1 Score | TPS | Schema Accuracy | TTFT (ms) | Avg Token |
|---|---|---|---|---|---|---|---|
| anthropic (openrouter) | 0.9217 | 1 | 1 | 50.29 | 0.9714 | 4696 | 2585 |
| packyapi | 0.8741 | 1 | 0.7959 | 53.28 | 0.9082 | 5062 | 2510 |
| yunwu | 0.8095 | 0.9967 | 0.8384 | 18.71 | 0.9024 | 6105 | 1435 |
claude-sonnet-4.5
| Vendor | IRF Score | Success Rate | F1 Score | TPS | Schema Accuracy | TTFT (ms) | Avg Token |
|---|---|---|---|---|---|---|---|
| foxcode | 0.8774 | 1 | 0.8454 | 51.94 | 0.8000 | 4985 | 1861 |
| anthropic (openrouter) | 0.8357 | 1 | 1 | 33.95 | 0.8121 | 5175 | 2398 |
| yunwu | 0.8278 | 1 | 0.8304 | 18.94 | 0.8593 | 5649 | 1399 |
| packyapi | 0.7206 | 1 | 0.8293 | 48.36 | 0.7692 | 5676 | 2458 |
deepseek-v3.2
| Vendor | IRF Score | Success Rate | F1 Score | TPS | Schema Accuracy | TTFT (ms) | Avg Token |
|---|---|---|---|---|---|---|---|
| deepseek (openrouter) | 0.8234 | 0.9870 | 1 | 62.77 | 0.9854 | 1268 | 1859 |
| google-vertex (openrouter) | 0.7885 | 0.9995 | 0.6216 | 101.9 | 0.8480 | 777.3 | 1968 |
| siliconflow (openrouter) | 0.7706 | 0.9980 | 0.7282 | 151.5 | 0.8808 | 2465 | 2927 |
| atlascloud (openrouter) | 0.7567 | 1 | 0.7257 | 81.03 | 0.8438 | 1211 | 2368 |
| siliconflow | 0.7345 | 0.9815 | 0.7988 | 15.15 | 0.8633 | 5724 | 1719 |
gemini-2.5-flash
| Vendor | IRF Score | Success Rate | F1 Score | TPS | Schema Accuracy | TTFT (ms) | Avg Token |
|---|---|---|---|---|---|---|---|
| google-vertex (openrouter) | 0.9524 | 1 | 1 | 249.3 | 0.6875 | 1651 | 1731 |
| gemini (openrouter) | 0.9048 | 0.9933 | 0.8810 | 216.1 | 0.6897 | 2626 | 1658 |
glm-4.7
| Vendor | IRF Score | Success Rate | F1 Score | TPS | Schema Accuracy | TTFT (ms) | Avg Token |
|---|---|---|---|---|---|---|---|
| bigmodel | 0.9107 | 0.9725 | 1 | 123.4 | 0.8409 | 2243 | 1789 |
| z.ai (openrouter) | 0.8512 | 0.9980 | 0.8575 | 246 | 0.8291 | 4986 | 2303 |
| atlascloud (openrouter) | 0.8452 | 0.9975 | 0.8624 | 103.8 | 0.8327 | 1914 | 2311 |
kimi-k2
| Vendor | IRF Score | Success Rate | F1 Score | TPS | Schema Accuracy | TTFT (ms) | Avg Token |
|---|---|---|---|---|---|---|---|
| siliconflow | 0.9158 | 1 | 0.8427 | 58.23 | 1 | 1313 | 1581 |
| siliconflow_OR | 0.8690 | 0.9975 | 0.8355 | 92.81 | 0.9985 | 1115 | 1816 |
| moonshot ai_OR | 0.8205 | 0.9905 | 1 | 46.39 | 1 | 2662 | 1857 |
minimax-m2
| Vendor | IRF Score | Success Rate | F1 Score | TPS | Schema Accuracy | TTFT (ms) | Avg Token |
|---|---|---|---|---|---|---|---|
| google-vertex (openrouter) | 0.9217 | 0.9990 | 0.7662 | 347.6 | 1 | 491.7 | 2556 |
| minimax (openrouter) | 0.8512 | 0.9980 | 1 | 147.2 | 0.8205 | 1483 | 2278 |
| atlascloud (openrouter) | 0.8324 | 0.9960 | 0.7622 | 152.6 | 1 | 745.9 | 2443 |
Dataset
We modified the open-source dataset from K2-Vendor-Verifier to fix errors that occurred when running it against other models and vendors. Dataset distribution details will be published separately.
Supported Vendors
Agent Vendor Verifier currently supports the following vendors:
- Openrouter
- Gemini
- Siliconflow
- 429.icu
- packyapi
- yunwu
- foxcode
New vendors can be added by inheriting from the Vendor base class; see Vendor-base and the sketch below.
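As a rough, hypothetical illustration only (the import path and method below are assumptions; the actual interface is defined by the project's Vendor base class), a custom vendor for an OpenAI-compatible endpoint might look like:

```python
# Hypothetical sketch -- the real Vendor base class and its method names may differ.
from agent_vendor_verifier.vendor import Vendor  # import path is an assumption

class MyVendor(Vendor):
    """Custom vendor that talks to an OpenAI-compatible chat-completions endpoint."""

    name = "my-vendor"                       # should match the key used in config.yaml / .env
    base_url = "https://api.example.com/v1"  # illustrative endpoint

    def build_headers(self, api_key: str) -> dict:
        # Most OpenAI-compatible vendors expect a Bearer token.
        return {"Authorization": f"Bearer {api_key}"}
```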
Supported Models
To address that different vendors may use different names for the same model, Agent Vendor Verifier provides a unified model naming scheme. The models currently supported are listed below.
| Model Series | Unified Model Name |
|---|---|
| Gemini | gemini-2.5-flash |
| Claude | claude-sonnet-4.5, claude-opus-4.5, claude-haiku-4.5 |
| DeepSeek | deepseek-v3.2 |
| GLM | glm-4.7 |
| MiniMax | minimax-m2.5 |
| Kimi | kimi-k2.5 |
When using Agent Vendor Verifier for evaluation, you only need to consider the model names listed in the table above.
Installation
pip install agent-vendor-verifier
The package requires Python 3.10 or newer and exposes the agent-vendor-verifier command-line entry point.
Agent Vendor Verifier Usage
Prepare:
- Dataset: See Dataset.
- config.yaml: Define model and vendor in benchmark.
- .env: Define API keys.
Then run:
```bash
agent-vendor-verifier \
  --test-file-path "tool-calls/samples.jsonl" \
  --config-file-path "config.yaml" \
  --vendor-concurrency 5 \
  --request-concurrency 30 \
  --retries 10 \
  --timeout 30
```
Configuration
Define model and vendor in benchmark
config.yaml maps each unified model name to the vendor configurations evaluated for that model.
Top-level keys are model names. Each model contains a vendors list with the request settings for every vendor endpoint.
We provide an example, see config_example:
```yaml
minimax-m2.5:
  vendors:
    - name: openrouter
      model_id: minimax/minimax-m2.5
      api_key: xxxxxx
      provider: minimax/
      url: https://openrouter.ai/api/v1
      validator: openai
      is_baseline: true
```
Model names must follow this project’s unified naming convention (e.g. gemini-2.5-flash, claude-sonnet-4.5, deepseek-v3.2).
See supported models for details.
Define API keys
API keys are provided via .env.
- Variable names must match vendor names in config.yaml
Example:
```
openrouter=sk-or-v1-xxxx
siliconflow=sk-xxxx
bigmodel=xxxx
gemini=xxxx
```
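For example, a key defined this way could be read at runtime as shown below (assuming python-dotenv or an equivalent loader; the snippet only illustrates the naming convention, not the package's internal loading code):

```python
import os
from dotenv import load_dotenv  # pip install python-dotenv

load_dotenv()  # reads .env from the working directory into the process environment
# The environment variable name matches the vendor name used in config.yaml.
openrouter_key = os.environ["openrouter"]
```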