Lightweight tool-call testing for LLM agents. Deterministic, local, zero API cost. Compare expected vs actual tool calls in 3 lines of Python. Supports OpenAI, Anthropic, Gemini.

Toolscore

Lightweight tool-call testing for LLM agents — deterministic, local, zero API cost

Why Toolscore?

You ship an LLM agent. It calls tools — search APIs, databases, file ops. But after a prompt tweak or model upgrade, how do you know it still calls the right tools with the right arguments in the right order?

Toolscore gives you a deterministic score for that — no API calls, no cloud, no cost.

  • Prompt changed — did tool calls break?
  • Switched from GPT-4o to Claude — same behavior?
  • CI/CD — catch regressions before production

Quick Start

from toolscore import evaluate

result = evaluate(
    expected=[
        {"tool": "get_weather", "args": {"city": "NYC"}},
        {"tool": "send_email", "args": {"to": "user@example.com"}},
    ],
    actual=[
        {"tool": "get_weather", "args": {"city": "New York"}},
        {"tool": "send_email", "args": {"to": "user@example.com"}},
    ],
)

print(result.score)              # 0.85: overall quality (weighted composite)
print(result.selection_accuracy) # 1.0: right tools picked
print(result.argument_f1)        # 0.5: one of two arguments matches ("NYC" vs "New York")

No files, no config, no API keys. Just Python objects in, score out.

Installation

pip install tool-scorer

What You Get

| Feature | How |
| --- | --- |
| In-memory evaluation | evaluate(expected, actual) |
| Auto-detect provider responses | evaluate(expected, openai_response), no manual extraction |
| End-to-end agent testing | test_agent(agent=fn, input=..., expected=..., min_score=0.9) |
| One-liner test assertion | assert_tools(expected, actual, min_score=0.9) |
| Data-driven pytest tests | @toolscore.cases([...]) parametrize decorator |
| OpenAI/Anthropic/Gemini extraction | from_openai(response), from_anthropic(), from_gemini() |
| 6 CLI commands | toolscore eval, compare, regression, init, generate, validate |
| Self-explaining failures | Shows MISSING / EXTRA / MISMATCH with actionable tips |
| Regression testing | Save baselines, catch degradation in CI |
| Pytest plugin | Fixtures, markers, assertion helpers |
| GitHub Action | One-click CI/CD setup |
| 4 report formats | HTML, JSON, CSV, Markdown |
| 6 trace formats | OpenAI, Anthropic, Gemini, LangChain, MCP, Custom (auto-detected) |

Python API

Basic evaluation

from toolscore import evaluate

# Get a detailed result
result = evaluate(
    expected=[{"tool": "search", "args": {"q": "test"}}],
    actual=[{"tool": "search", "args": {"q": "test"}}],
)
assert result.score == 1.0

With LLM provider responses (auto-detected)

Pass raw API responses directly — Toolscore auto-detects the format:

from openai import OpenAI
from toolscore import evaluate

client = OpenAI()
response = client.chat.completions.create(
    model="gpt-4o",
    messages=[...],
    tools=[...],
)

# No from_openai() needed — auto-detected!
result = evaluate(expected=[...], actual=response)

Works with OpenAI, Anthropic, and Gemini responses. You can still use from_openai() / from_anthropic() / from_gemini() explicitly if you prefer.

End-to-end agent testing

from toolscore import test_agent

result = test_agent(
    agent=my_agent_fn,          # any callable that returns an LLM response
    input="What's the weather?",
    expected=[{"tool": "get_weather", "args": {"city": "NYC"}}],
    min_score=0.9,              # optional: raises if below
)

One-liner for tests

from toolscore import assert_tools

assert_tools(
    expected=[{"tool": "search", "args": {"q": "test"}}],
    actual=[{"tool": "search", "args": {"q": "test"}}],
    min_score=0.9,  # raises ToolScoreAssertionError if below
)

When Things Go Wrong

Toolscore doesn't just give you a number — it tells you what went wrong and how to fix it. Here's a failing evaluation:

result = evaluate(
    expected=[
        {"tool": "search_web", "args": {"query": "Python tutorials"}},
        {"tool": "summarize", "args": {"text": "..."}},
    ],
    actual=[
        {"tool": "web_search", "args": {"query": "Python tutorials"}},
        {"tool": "send_email", "args": {"to": "user@example.com"}},
    ],
)
print(result.score)  # 0.35

Run the CLI with --verbose to see exactly what happened:

toolscore eval gold.json trace.json --verbose

  Toolscore Evaluation Results

  Expected calls: 2
  Actual calls:   2

  Metric              Score     Details
  Selection Accuracy  0.0%      0 of 2 correct
  Argument F1         50.0%     P:100.0% R:50.0%
  Sequence Accuracy   0.0%      Edit distance: 2

  What Went Wrong:
    MISSING: Expected tool 'search_web' was never called
    MISMATCH: Position 0 — expected 'search_web', got 'web_search' (similar name?)
    EXTRA: Tool 'send_email' was called but not expected

  Tips:
    TIP: 'search_web' and 'web_search' look similar — use --llm-judge to check semantic equivalence
    TIP: Review prompt instructions for tool naming conventions
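
The "Sequence Accuracy 0.0%, Edit distance: 2" line in the report above can be reproduced with a plain Levenshtein distance over the tool names. The sketch below is only an illustration of that idea, not Toolscore's own code: the helper names `edit_distance` and `sequence_accuracy` are hypothetical, and normalizing by the longer sequence is an assumption.

```python
from typing import Sequence


def edit_distance(expected: Sequence[str], actual: Sequence[str]) -> int:
    """Levenshtein distance between two sequences of tool names."""
    # prev[j] holds the distance between the processed prefix of `expected`
    # and the first j items of `actual`.
    prev = list(range(len(actual) + 1))
    for i, exp_name in enumerate(expected, start=1):
        curr = [i]
        for j, act_name in enumerate(actual, start=1):
            cost = 0 if exp_name == act_name else 1
            curr.append(min(
                prev[j] + 1,         # drop an expected call
                curr[j - 1] + 1,     # insert an actual call
                prev[j - 1] + cost,  # keep or substitute
            ))
        prev = curr
    return prev[-1]


def sequence_accuracy(expected: Sequence[str], actual: Sequence[str]) -> float:
    """Assumed normalization: 1 - distance / length of the longer sequence."""
    longest = max(len(expected), len(actual))
    if longest == 0:
        return 1.0
    return 1.0 - edit_distance(expected, actual) / longest


expected_names = ["search_web", "summarize"]
actual_names = ["web_search", "send_email"]
distance = edit_distance(expected_names, actual_names)
accuracy = sequence_accuracy(expected_names, actual_names)
print(distance, accuracy)  # 2 0.0
```

Both positions differ here, so two substitutions are needed and the normalized accuracy drops to zero.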

Pytest Integration

The simplest approach — assert_tools works anywhere:

from toolscore import assert_tools

def test_my_agent():
    actual = my_agent("What's the weather in NYC?")
    assert_tools(
        expected=[{"tool": "get_weather", "args": {"city": "NYC"}}],
        actual=actual,  # raw LLM response or list of dicts — both work
        min_score=0.9,
    )

Data-driven tests with @toolscore.cases():

import toolscore

@toolscore.cases([
    {"input": "weather NYC", "expected": [{"tool": "get_weather", "args": {"city": "NYC"}}]},
    {"input": "email bob",   "expected": [{"tool": "send_email", "args": {"to": "bob"}}]},
])
def test_my_agent(input, expected):
    response = my_agent(input)
    toolscore.assert_tools(expected=expected, actual=response, min_score=0.9)

For file-based workflows, use the built-in fixtures:

def test_agent_accuracy(toolscore_eval, toolscore_assert):
    """Test that agent achieves high accuracy."""
    result = toolscore_eval("gold_calls.json", "trace.json")
    toolscore_assert.assert_selection_accuracy(result, min_accuracy=0.9)
    toolscore_assert.assert_argument_f1(result, min_f1=0.8)

Configure directories via CLI options:

pytest --toolscore-gold-dir tests/gold_standards --toolscore-trace-dir tests/traces

CLI

Six commands cover the full workflow:

toolscore eval gold.json trace.json              # Evaluate
toolscore eval gold.json trace.json --verbose     # Full detail + failure analysis
toolscore eval gold.json trace.json --html report.html  # HTML report
toolscore compare gold.json gpt4.json claude.json # Side-by-side model comparison
toolscore regression baseline.json trace.json -g gold.json  # CI regression check
toolscore init                                    # Scaffold a new project
toolscore generate --from-openai funcs.json       # Synthetic test data from schemas
toolscore validate trace.json                     # Check trace format

Metrics Deep Dive

The composite result.score is a weighted average of four core metrics:

| Metric | Weight | Plain English |
| --- | --- | --- |
| Selection Accuracy | 40% | Did it pick the right tools? |
| Argument F1 | 30% | Did it pass the right arguments? |
| Sequence Accuracy | 20% | Did it call them in the right order? |
| Redundancy (inverted) | 10% | Did it avoid unnecessary repeat calls? |
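
To make "Argument F1" concrete, here is one plausible way to compute precision, recall, and F1 over the key/value pairs of a single call. It is a sketch under assumptions: the function name `argument_prf` is hypothetical, matching is strict equality, and Toolscore's real scoring may differ (for example by giving partial credit).

```python
def argument_prf(expected_args: dict, actual_args: dict) -> tuple[float, float, float]:
    """Return (precision, recall, f1) over exact key/value matches."""
    if not expected_args and not actual_args:
        return 1.0, 1.0, 1.0  # nothing expected, nothing passed
    # An argument counts as correct only if both key and value match exactly.
    matches = sum(
        1
        for key, value in actual_args.items()
        if key in expected_args and expected_args[key] == value
    )
    precision = matches / len(actual_args) if actual_args else 0.0
    recall = matches / len(expected_args) if expected_args else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    return precision, recall, 2 * precision * recall / (precision + recall)


precision, recall, f1 = argument_prf(
    expected_args={"query": "Python tutorials", "limit": 10},
    actual_args={"query": "Python tutorials", "limit": 5, "lang": "en"},
)
print(round(precision, 3), round(recall, 3), round(f1, 3))  # 0.333 0.5 0.4
```

Only `query` matches: one of three actual arguments is right (precision) and one of two expected arguments is covered (recall).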

Custom weights are supported:

result = evaluate(
    expected=[...],
    actual=[...],
    weights={
        "selection_accuracy": 0.5,
        "argument_f1": 0.5,
        "sequence_accuracy": 0.0,
        "redundant_rate": 0.0,
    },
)

Additional metrics are available in verbose mode: invocation accuracy, tool correctness, trajectory accuracy, cost tracking, and latency.

CI/CD & Regression Testing

GitHub Action

- uses: yotambraun/toolscore@v1
  with:
    gold-file: tests/gold_standard.json
    trace-file: tests/agent_trace.json
    threshold: '0.90'

Regression testing

Save a baseline, then check for regressions on every run:

# Save a baseline
toolscore eval gold.json trace.json --save-baseline baseline.json

# Check for regressions (fails if accuracy drops >5%)
toolscore regression baseline.json new_trace.json --gold-file gold.json

Exit codes: 0 = PASS, 1 = FAIL (regression detected), 2 = ERROR — plug directly into CI.
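
The exit-code contract can be pictured as a small decision function. This is a sketch only: `regression_exit_code` is a hypothetical helper, and reading ">5%" as an absolute drop of 0.05 in the score is an assumption (the CLI may define the threshold differently).

```python
PASS, FAIL, ERROR = 0, 1, 2


def regression_exit_code(baseline_score, new_score, max_drop: float = 0.05) -> int:
    """Map a baseline/new score pair to the exit codes listed above."""
    try:
        drop = float(baseline_score) - float(new_score)
    except (TypeError, ValueError):
        return ERROR  # unreadable or missing score
    # Assumption: a regression is an absolute drop larger than max_drop.
    return FAIL if drop > max_drop else PASS


print(regression_exit_code(0.92, 0.90))  # 0
print(regression_exit_code(0.92, 0.80))  # 1
print(regression_exit_code(0.92, None))  # 2
```

In a CI script the returned value would typically be passed to `sys.exit()`, so a non-zero code fails the job.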

Supported Formats

| Provider | Format | Auto-detected |
| --- | --- | --- |
| OpenAI | tool_calls / function_call | Yes |
| Anthropic | tool_use content blocks | Yes |
| Google Gemini | functionCall parts | Yes |
| MCP | JSON-RPC 2.0 | Yes |
| LangChain | tool / tool_input | Yes |
| Custom | {"calls": [{"tool": ..., "args": ...}]} | Yes |
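
Auto-detection of this kind usually comes down to looking for each format's signature keys. The sketch below shows the idea on JSON-decoded dictionaries; `detect_format` is a hypothetical helper, not Toolscore's detector, and the checks are simplified (real SDK response objects would first need converting to dicts).

```python
def detect_format(payload: dict) -> str:
    """Guess the trace format of a JSON-decoded payload from its keys."""
    if payload.get("jsonrpc") == "2.0":
        return "mcp"
    if "calls" in payload:
        return "custom"
    if "tool" in payload and "tool_input" in payload:
        return "langchain"
    # OpenAI chat completions nest calls under choices[].message.
    for choice in payload.get("choices", []):
        message = choice.get("message", {})
        if message.get("tool_calls") or message.get("function_call"):
            return "openai"
    # Gemini nests calls under candidates[].content.parts[].functionCall.
    for candidate in payload.get("candidates", []):
        parts = candidate.get("content", {}).get("parts", [])
        if any("functionCall" in part for part in parts):
            return "gemini"
    # Anthropic messages carry a list of typed content blocks.
    content = payload.get("content")
    if isinstance(content, list) and any(
        isinstance(block, dict) and block.get("type") == "tool_use"
        for block in content
    ):
        return "anthropic"
    return "unknown"


anthropic_like = {
    "content": [{"type": "tool_use", "name": "get_weather", "input": {"city": "NYC"}}]
}
gemini_like = {
    "candidates": [{"content": {"parts": [
        {"functionCall": {"name": "get_weather", "args": {"city": "NYC"}}}
    ]}}]
}
print(detect_format(anthropic_like), detect_format(gemini_like))  # anthropic gemini
```

Checking the most specific markers first (such as the `jsonrpc` field) avoids misclassifying payloads that happen to share a generic key like `content`.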

Advanced Features

LLM-as-a-Judge

Semantic tool name matching when exact names don't line up (requires OpenAI API key):

toolscore eval gold.json trace.json --llm-judge

Cost Tracking

Token usage and pricing estimation for OpenAI, Anthropic, and Gemini models:

from toolscore.metrics.cost_estimator import calculate_llm_cost, estimate_trace_cost

cost = calculate_llm_cost("gpt-4o", input_tokens=1000, output_tokens=500)
trace_cost = estimate_trace_cost("gpt-4o", trace_calls)

Schema Validation

Validate argument types, ranges, and patterns against JSON schemas:

from toolscore.validators.schema import validate_argument_schema

valid, errors = validate_argument_schema(call, schema={
    "query": {"type": "string", "minLength": 1},
    "limit": {"type": "integer", "minimum": 1, "maximum": 100},
})
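
For readers curious what checks like these involve, here is a small stand-alone validator for the rule keys used in the example above (`type`, `minLength`, `minimum`, `maximum`). It is an illustrative sketch with a hypothetical name, `check_arguments`, and it takes the arguments as a plain dict; it is not the behavior or signature of `validate_argument_schema`.

```python
# Maps the type names used in the rules to Python types.
TYPE_MAP = {"string": str, "integer": int, "number": (int, float), "boolean": bool}


def check_arguments(args: dict, schema: dict) -> tuple[bool, list[str]]:
    """Return (valid, errors) for `args` against per-argument rules."""
    errors = []
    for name, rules in schema.items():
        if name not in args:
            continue  # assumption: absent arguments are simply not checked
        value = args[name]
        type_name = rules.get("type")
        expected_type = TYPE_MAP.get(type_name)
        # bool is a subclass of int in Python, so reject it for numeric types.
        stray_bool = isinstance(value, bool) and type_name != "boolean"
        if expected_type and (stray_bool or not isinstance(value, expected_type)):
            errors.append(f"{name}: expected {type_name}")
            continue
        if isinstance(value, str) and len(value) < rules.get("minLength", 0):
            errors.append(f"{name}: shorter than minLength {rules['minLength']}")
        if isinstance(value, (int, float)) and not isinstance(value, bool):
            if "minimum" in rules and value < rules["minimum"]:
                errors.append(f"{name}: below minimum {rules['minimum']}")
            if "maximum" in rules and value > rules["maximum"]:
                errors.append(f"{name}: above maximum {rules['maximum']}")
    return (not errors, errors)


rules = {
    "query": {"type": "string", "minLength": 1},
    "limit": {"type": "integer", "minimum": 1, "maximum": 100},
}
ok, problems = check_arguments({"query": "", "limit": 500}, rules)
print(ok, problems)
```

An empty `query` and a `limit` of 500 each break one rule, so the call is reported as invalid with two messages.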

Side-Effect Validation

Verify HTTP responses, files created, and database rows after tool execution:

toolscore eval gold.json trace.json  # side-effect validation is on by default

Trace Capture

Record production tool calls with the @capture_trace decorator:

from toolscore import capture_trace

@capture_trace(name="my-agent")
def run_agent(prompt):
    # ... your agent code ...
    return result
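
A decorator like this generally wraps the function and records each invocation. The sketch below shows that general pattern with a hypothetical `record_calls` decorator and an in-memory `TRACE_LOG` list; where and in what format `capture_trace` actually stores traces is not covered here.

```python
import functools
import time

TRACE_LOG = []  # hypothetical in-memory store, for illustration only


def record_calls(name: str):
    """Decorator factory that logs every call of the wrapped function."""
    def decorator(func):
        @functools.wraps(func)  # keep the wrapped function's name and docstring
        def wrapper(*args, **kwargs):
            started = time.perf_counter()
            result = func(*args, **kwargs)
            TRACE_LOG.append({
                "trace": name,
                "function": func.__name__,
                "args": args,
                "kwargs": kwargs,
                "result": result,
                "seconds": time.perf_counter() - started,
            })
            return result
        return wrapper
    return decorator


@record_calls(name="my-agent")
def run_agent(prompt):
    # Stand-in agent that always "calls" one tool.
    return [{"tool": "get_weather", "args": {"city": "NYC"}}]


run_agent("What's the weather?")
print(len(TRACE_LOG), TRACE_LOG[0]["trace"])  # 1 my-agent
```

The recorded `result` entries already have the tool/args shape used for expected calls, which is what makes captured traces usable as evaluation input.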

Synthetic Test Generation

Generate gold-standard test cases from OpenAI function schemas:

toolscore generate --from-openai functions.json -n 10 --output gold.json

Interactive Debug

Step through mismatches one by one:

toolscore eval gold.json trace.json --debug

Multi-Model Comparison

Compare two or more models side by side:

toolscore compare gold.json gpt4.json claude.json -n gpt-4 -n claude-3

When to Use Toolscore vs. Alternatives

| Use case | Recommendation |
| --- | --- |
| Fast, deterministic tool-call checks in CI without API costs | Toolscore |
| Comprehensive LLM evaluation across multiple dimensions (hallucination, toxicity, RAG, tool calls, etc.) | DeepEval |
| RAG pipeline evaluation (retrieval quality, answer faithfulness) | Ragas |
| Government/safety-focused AI evaluation | Inspect AI |
| Tracing and observability for LangChain apps | LangSmith |

Toolscore does one thing well: it checks whether your agent called the right tools with the right arguments, deterministically, with zero cost. If you need broader LLM evaluation, the tools above are excellent choices.

File-Based API

The original file-based API is still fully supported:

from toolscore import evaluate_trace

result = evaluate_trace(
    gold_file="gold_calls.json",
    trace_file="trace.json",
    format="auto",
)
print(result.score)
print(result.selection_accuracy)

Development

pip install -e ".[dev]"
pytest
ruff check toolscore
mypy toolscore

License

Apache License 2.0 - see LICENSE for details.

Citation

@software{toolscore,
  title = {Toolscore: Lightweight Tool-Call Testing for LLM Agents},
  author = {Yotam Braun},
  year = {2025},
  url = {https://github.com/yotambraun/toolscore}
}
