Lightweight tool-call testing for LLM agents. Deterministic, local, zero API cost. Compare expected vs actual tool calls in 3 lines of Python. Supports OpenAI, Anthropic, Gemini.
Project description
Toolscore
Lightweight tool-call testing for LLM agents — deterministic, local, zero API cost
Why Toolscore?
You ship an LLM agent. It calls tools — search APIs, databases, file ops. But after a prompt tweak or model upgrade, how do you know it still calls the right tools with the right arguments in the right order?
Toolscore gives you a deterministic score for that — no API calls, no cloud, no cost.
- Prompt changed — did tool calls break?
- Switched from GPT-4o to Claude — same behavior?
- CI/CD — catch regressions before production
Quick Start
from toolscore import evaluate
result = evaluate(
expected=[
{"tool": "get_weather", "args": {"city": "NYC"}},
{"tool": "send_email", "args": {"to": "user@example.com"}},
],
actual=[
{"tool": "get_weather", "args": {"city": "New York"}},
{"tool": "send_email", "args": {"to": "user@example.com"}},
],
)
print(result.score) # 0.85 — overall quality (weighted composite)
print(result.selection_accuracy) # 1.0 — right tools picked
print(result.argument_f1) # 0.7 — 70% of arguments correct
No files, no config, no API keys. Just Python objects in, score out.
Installation
pip install tool-scorer
What You Get
| Feature | How |
|---|---|
| In-memory evaluation | evaluate(expected, actual) |
| Auto-detect provider responses | evaluate(expected, openai_response) — no manual extraction |
| End-to-end agent testing | test_agent(agent=fn, input=..., expected=..., min_score=0.9) |
| One-liner test assertion | assert_tools(expected, actual, min_score=0.9) |
| Data-driven pytest tests | @toolscore.cases([...]) parametrize decorator |
| OpenAI/Anthropic/Gemini extraction | from_openai(response), from_anthropic(), from_gemini() |
| 6 CLI commands | toolscore eval, compare, regression, init, generate, validate |
| Self-explaining failures | Shows MISSING / EXTRA / MISMATCH with actionable tips |
| Regression testing | Save baselines, catch degradation in CI |
| Pytest plugin | Fixtures, markers, assertion helpers |
| GitHub Action | One-click CI/CD setup |
| 4 report formats | HTML, JSON, CSV, Markdown |
| 6 trace formats | OpenAI, Anthropic, Gemini, LangChain, MCP, Custom (auto-detected) |
Python API
Basic evaluation
from toolscore import evaluate, assert_tools
# Get a detailed result
result = evaluate(
expected=[{"tool": "search", "args": {"q": "test"}}],
actual=[{"tool": "search", "args": {"q": "test"}}],
)
assert result.score == 1.0
With LLM provider responses (auto-detected)
Pass raw API responses directly — Toolscore auto-detects the format:
from openai import OpenAI
from toolscore import evaluate
client = OpenAI()
response = client.chat.completions.create(
model="gpt-4o",
messages=[...],
tools=[...],
)
# No from_openai() needed — auto-detected!
result = evaluate(expected=[...], actual=response)
Works with OpenAI, Anthropic, and Gemini responses. You can still use from_openai() / from_anthropic() / from_gemini() explicitly if you prefer.
End-to-end agent testing
from toolscore import test_agent
result = test_agent(
agent=my_agent_fn, # any callable that returns an LLM response
input="What's the weather?",
expected=[{"tool": "get_weather", "args": {"city": "NYC"}}],
min_score=0.9, # optional: raises if below
)
One-liner for tests
from toolscore import assert_tools
assert_tools(
expected=[{"tool": "search", "args": {"q": "test"}}],
actual=[{"tool": "search", "args": {"q": "test"}}],
min_score=0.9, # raises ToolScoreAssertionError if below
)
When Things Go Wrong
Toolscore doesn't just give you a number — it tells you what went wrong and how to fix it. Here's a failing evaluation:
result = evaluate(
expected=[
{"tool": "search_web", "args": {"query": "Python tutorials"}},
{"tool": "summarize", "args": {"text": "..."}},
],
actual=[
{"tool": "web_search", "args": {"query": "Python tutorials"}},
{"tool": "send_email", "args": {"to": "user@example.com"}},
],
)
print(result.score) # 0.35
Run the CLI with --verbose to see exactly what happened:
toolscore eval gold.json trace.json --verbose
Toolscore Evaluation Results
Expected calls: 2
Actual calls: 2
Metric Score Details
Selection Accuracy 0.0% 0 of 2 correct
Argument F1 50.0% P:100.0% R:50.0%
Sequence Accuracy 0.0% Edit distance: 2
What Went Wrong:
MISSING: Expected tool 'search_web' was never called
MISMATCH: Position 0 — expected 'search_web', got 'web_search' (similar name?)
EXTRA: Tool 'send_email' was called but not expected
Tips:
TIP: 'search_web' and 'web_search' look similar — use --llm-judge to check semantic equivalence
TIP: Review prompt instructions for tool naming conventions
Pytest Integration
The simplest approach — assert_tools works anywhere:
from toolscore import assert_tools
def test_my_agent():
actual = my_agent("What's the weather in NYC?")
assert_tools(
expected=[{"tool": "get_weather", "args": {"city": "NYC"}}],
actual=actual, # raw LLM response or list of dicts — both work
min_score=0.9,
)
Data-driven tests with @toolscore.cases():
import toolscore
@toolscore.cases([
{"input": "weather NYC", "expected": [{"tool": "get_weather", "args": {"city": "NYC"}}]},
{"input": "email bob", "expected": [{"tool": "send_email", "args": {"to": "bob"}}]},
])
def test_my_agent(input, expected):
response = my_agent(input)
toolscore.assert_tools(expected=expected, actual=response, min_score=0.9)
For file-based workflows, use the built-in fixtures:
def test_agent_accuracy(toolscore_eval, toolscore_assert):
"""Test that agent achieves high accuracy."""
result = toolscore_eval("gold_calls.json", "trace.json")
toolscore_assert.assert_selection_accuracy(result, min_accuracy=0.9)
toolscore_assert.assert_argument_f1(result, min_f1=0.8)
Configure directories via CLI options:
pytest --toolscore-gold-dir tests/gold_standards --toolscore-trace-dir tests/traces
CLI
Six commands cover the full workflow:
toolscore eval gold.json trace.json # Evaluate
toolscore eval gold.json trace.json --verbose # Full detail + failure analysis
toolscore eval gold.json trace.json --html report.html # HTML report
toolscore compare gold.json gpt4.json claude.json # Side-by-side model comparison
toolscore regression baseline.json trace.json -g gold.json # CI regression check
toolscore init # Scaffold a new project
toolscore generate --from-openai funcs.json # Synthetic test data from schemas
toolscore validate trace.json # Check trace format
Metrics Deep Dive
The composite result.score is a weighted average of four core metrics:
| Metric | Weight | Plain English |
|---|---|---|
| Selection Accuracy | 40% | Did it pick the right tools? |
| Argument F1 | 30% | Did it pass the right arguments? |
| Sequence Accuracy | 20% | Did it call them in the right order? |
| Redundancy (inverted) | 10% | Did it avoid unnecessary repeat calls? |
Custom weights are supported:
result = evaluate(
expected=[...],
actual=[...],
weights={
"selection_accuracy": 0.5,
"argument_f1": 0.5,
"sequence_accuracy": 0.0,
"redundant_rate": 0.0,
},
)
Additional metrics available in verbose mode: invocation accuracy, tool correctness, trajectory accuracy, cost tracking, latency.
CI/CD & Regression Testing
GitHub Action
- uses: yotambraun/toolscore@v1
with:
gold-file: tests/gold_standard.json
trace-file: tests/agent_trace.json
threshold: '0.90'
Regression testing
Save a baseline, then check for regressions on every run:
# Save a baseline
toolscore eval gold.json trace.json --save-baseline baseline.json
# Check for regressions (fails if accuracy drops >5%)
toolscore regression baseline.json new_trace.json --gold-file gold.json
Exit codes: 0 = PASS, 1 = FAIL (regression detected), 2 = ERROR — plug directly into CI.
Supported Formats
| Provider | Format | Auto-detected |
|---|---|---|
| OpenAI | tool_calls / function_call |
Yes |
| Anthropic | tool_use content blocks |
Yes |
| Google Gemini | functionCall parts |
Yes |
| MCP | JSON-RPC 2.0 | Yes |
| LangChain | tool / tool_input |
Yes |
| Custom | {"calls": [{"tool": ..., "args": ...}]} |
Yes |
Advanced Features
LLM-as-a-Judge
Semantic tool name matching when exact names don't line up (requires OpenAI API key):
toolscore eval gold.json trace.json --llm-judge
Cost Tracking
Token usage and pricing estimation for OpenAI, Anthropic, and Gemini models:
from toolscore.metrics.cost_estimator import calculate_llm_cost, estimate_trace_cost
cost = calculate_llm_cost("gpt-4o", input_tokens=1000, output_tokens=500)
trace_cost = estimate_trace_cost("gpt-4o", trace_calls)
Schema Validation
Validate argument types, ranges, and patterns against JSON schemas:
from toolscore.validators.schema import validate_argument_schema
valid, errors = validate_argument_schema(call, schema={
"query": {"type": "string", "minLength": 1},
"limit": {"type": "integer", "minimum": 1, "maximum": 100},
})
Side-Effect Validation
Verify HTTP responses, files created, and database rows after tool execution:
toolscore eval gold.json trace.json # side-effect validation is on by default
Trace Capture
Record production tool calls with the @capture_trace decorator:
from toolscore import capture_trace
@capture_trace(name="my-agent")
def run_agent(prompt):
# ... your agent code ...
return result
Synthetic Test Generation
Generate gold-standard test cases from OpenAI function schemas:
toolscore generate --from-openai functions.json -n 10 --output gold.json
Interactive Debug
Step through mismatches one by one:
toolscore eval gold.json trace.json --debug
Multi-Model Comparison
Compare two or more models side by side:
toolscore compare gold.json gpt4.json claude.json -n gpt-4 -n claude-3
When to Use Toolscore vs. Alternatives
| Use case | Recommendation |
|---|---|
| Fast, deterministic tool-call checks in CI without API costs | Toolscore |
| Comprehensive LLM evaluation across multiple dimensions (hallucination, toxicity, RAG, tool calls, etc.) | DeepEval |
| RAG pipeline evaluation (retrieval quality, answer faithfulness) | Ragas |
| Government/safety-focused AI evaluation | Inspect AI |
| Tracing and observability for LangChain apps | LangSmith |
Toolscore does one thing well: it checks whether your agent called the right tools with the right arguments, deterministically, with zero cost. If you need broader LLM evaluation, the tools above are excellent choices.
File-Based API
The original file-based API is still fully supported:
from toolscore import evaluate_trace
result = evaluate_trace(
gold_file="gold_calls.json",
trace_file="trace.json",
format="auto",
)
print(result.score)
print(result.selection_accuracy)
Development
pip install -e ".[dev]"
pytest
ruff check toolscore
mypy toolscore
License
Apache License 2.0 - see LICENSE for details.
Citation
@software{toolscore,
title = {Toolscore: Lightweight Tool-Call Testing for LLM Agents},
author = {Yotam Braun},
year = {2025},
url = {https://github.com/yotambraun/toolscore}
}
Project details
Release history Release notifications | RSS feed
Download files
Download the file for your platform. If you're not sure which to choose, learn more about installing packages.
Source Distribution
Built Distribution
Filter files by name, interpreter, ABI, and platform.
If you're not sure about the file name format, learn more about wheel file names.
Copy a direct link to the current filters
File details
Details for the file tool_scorer-1.6.0.tar.gz.
File metadata
- Download URL: tool_scorer-1.6.0.tar.gz
- Upload date:
- Size: 1.7 MB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.12.13
File hashes
| Algorithm | Hash digest | |
|---|---|---|
| SHA256 |
f7509907adcc04017c4d384e27c785df80148a6947f7532098ff87fefa1b38d5
|
|
| MD5 |
ebdbc1d0352e99b9d58570644f816707
|
|
| BLAKE2b-256 |
c7e422bca6b5855844addb5036413d8a1a0b67522f0cd916f60300c044a5189b
|
File details
Details for the file tool_scorer-1.6.0-py3-none-any.whl.
File metadata
- Download URL: tool_scorer-1.6.0-py3-none-any.whl
- Upload date:
- Size: 88.8 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.12.13
File hashes
| Algorithm | Hash digest | |
|---|---|---|
| SHA256 |
93b7e42f5d51abd04a9ecf119f5f91cab8a364eaa0dd287e350d4493b63e43de
|
|
| MD5 |
2acfc6d50a73b0ad4e7a5672ca5a1f79
|
|
| BLAKE2b-256 |
08b43448ac6e20ba9fc8cdb5aabc9178542412e62f108eef9b894ec9217e5f26
|