Lightweight tool-call testing for LLM agents. Deterministic, local, zero API cost. Compare expected vs actual tool calls in 3 lines of Python. Supports OpenAI, Anthropic, Gemini.

Toolscore

Lightweight tool-call testing for LLM agents — deterministic, local, zero API cost

Why Toolscore?

You ship an LLM agent. It calls tools — search APIs, databases, file ops. But after a prompt tweak or model upgrade, how do you know it still calls the right tools with the right arguments in the right order?

Toolscore gives you a deterministic score for that — no API calls, no cloud, no cost.

Prompt changed — did tool calls break?
Switched from GPT-4o to Claude — same behavior?
CI/CD — catch regressions before production

Quick Start

from toolscore import evaluate

result = evaluate(
    expected=[
        {"tool": "get_weather", "args": {"city": "NYC"}},
        {"tool": "send_email", "args": {"to": "user@example.com"}},
    ],
    actual=[
        {"tool": "get_weather", "args": {"city": "New York"}},
        {"tool": "send_email", "args": {"to": "user@example.com"}},
    ],
)

print(result.score)              # 0.85 — overall quality (weighted composite)
print(result.selection_accuracy) # 1.0  — right tools picked
print(result.argument_f1)        # 0.7  — 70% of arguments correct

No files, no config, no API keys. Just Python objects in, score out.

Installation

pip install tool-scorer

What You Get

Feature	How
In-memory evaluation	`evaluate(expected, actual)`
Auto-detect provider responses	`evaluate(expected, openai_response)` — no manual extraction
End-to-end agent testing	`test_agent(agent=fn, input=..., expected=..., min_score=0.9)`
One-liner test assertion	`assert_tools(expected, actual, min_score=0.9)`
Data-driven pytest tests	`@toolscore.cases([...])` parametrize decorator
OpenAI/Anthropic/Gemini extraction	`from_openai(response)`, `from_anthropic()`, `from_gemini()`
6 CLI commands	`toolscore eval`, `compare`, `regression`, `init`, `generate`, `validate`
Self-explaining failures	Shows MISSING / EXTRA / MISMATCH with actionable tips
Regression testing	Save baselines, catch degradation in CI
Pytest plugin	Fixtures, markers, assertion helpers
GitHub Action	One-click CI/CD setup
4 report formats	HTML, JSON, CSV, Markdown
6 trace formats	OpenAI, Anthropic, Gemini, LangChain, MCP, Custom (auto-detected)

Python API

Basic evaluation

from toolscore import evaluate

# Get a detailed result
result = evaluate(
    expected=[{"tool": "search", "args": {"q": "test"}}],
    actual=[{"tool": "search", "args": {"q": "test"}}],
)
assert result.score == 1.0

With LLM provider responses (auto-detected)

Pass raw API responses directly — Toolscore auto-detects the format:

from openai import OpenAI
from toolscore import evaluate

client = OpenAI()
response = client.chat.completions.create(
    model="gpt-4o",
    messages=[...],
    tools=[...],
)

# No from_openai() needed — auto-detected!
result = evaluate(expected=[...], actual=response)

Works with OpenAI, Anthropic, and Gemini responses. You can still use from_openai() / from_anthropic() / from_gemini() explicitly if you prefer.

End-to-end agent testing

from toolscore import test_agent

result = test_agent(
    agent=my_agent_fn,          # any callable that returns an LLM response
    input="What's the weather?",
    expected=[{"tool": "get_weather", "args": {"city": "NYC"}}],
    min_score=0.9,              # optional: raises if below
)

One-liner for tests

from toolscore import assert_tools

assert_tools(
    expected=[{"tool": "search", "args": {"q": "test"}}],
    actual=[{"tool": "search", "args": {"q": "test"}}],
    min_score=0.9,  # raises ToolScoreAssertionError if below
)

When Things Go Wrong

Toolscore doesn't just give you a number — it tells you what went wrong and how to fix it. Here's a failing evaluation:

from toolscore import evaluate

result = evaluate(
    expected=[
        {"tool": "search_web", "args": {"query": "Python tutorials"}},
        {"tool": "summarize", "args": {"text": "..."}},
    ],
    actual=[
        {"tool": "web_search", "args": {"query": "Python tutorials"}},
        {"tool": "send_email", "args": {"to": "user@example.com"}},
    ],
)
print(result.score)  # 0.35

Run the CLI with --verbose to see exactly what happened:

toolscore eval gold.json trace.json --verbose

  Toolscore Evaluation Results

  Expected calls: 2
  Actual calls:   2

  Metric              Score     Details
  Selection Accuracy  0.0%      0 of 2 correct
  Argument F1         50.0%     P:100.0% R:50.0%
  Sequence Accuracy   0.0%      Edit distance: 2

  What Went Wrong:
    MISSING: Expected tool 'search_web' was never called
    MISMATCH: Position 0 — expected 'search_web', got 'web_search' (similar name?)
    EXTRA: Tool 'send_email' was called but not expected

  Tips:
    TIP: 'search_web' and 'web_search' look similar — use --llm-judge to check semantic equivalence
    TIP: Review prompt instructions for tool naming conventions
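The MISSING / EXTRA / similar-name diagnosis above can be approximated with the standard library. The following is a minimal sketch and not Toolscore's implementation: the `diagnose` helper, its `cutoff` parameter, and the report keys are hypothetical names chosen for illustration, and the similarity check relies on `difflib`, which may differ from whatever Toolscore uses internally.

```python
from difflib import get_close_matches


def diagnose(expected_names, actual_names, cutoff=0.5):
    """Classify tool-name differences between two call lists.

    Hypothetical helper for illustration; not part of Toolscore's API.
    """
    # Expected tools that never appear in the actual calls.
    missing = [name for name in expected_names if name not in actual_names]
    # Actual tools that were never expected.
    extra = [name for name in actual_names if name not in expected_names]
    # For each missing tool, look for an unexpected tool with a similar name.
    similar = {}
    for name in missing:
        candidates = get_close_matches(name, extra, n=1, cutoff=cutoff)
        if candidates:
            similar[name] = candidates[0]
    return {"missing": missing, "extra": extra, "similar": similar}


report = diagnose(
    expected_names=["search_web", "summarize"],
    actual_names=["web_search", "send_email"],
)
for name in report["missing"]:
    print(f"MISSING: {name}")
for name in report["extra"]:
    print(f"EXTRA: {name}")
for wanted, got in report["similar"].items():
    print(f"SIMILAR: expected {wanted!r}, got {got!r}")
```

This sketch flags every unmatched name on both sides, so its output will not match the CLI report line for line; it only shows the general shape of the comparison.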

Pytest Integration

The simplest approach — assert_tools works anywhere:

from toolscore import assert_tools

def test_my_agent():
    actual = my_agent("What's the weather in NYC?")
    assert_tools(
        expected=[{"tool": "get_weather", "args": {"city": "NYC"}}],
        actual=actual,  # raw LLM response or list of dicts — both work
        min_score=0.9,
    )

Data-driven tests with @toolscore.cases():

import toolscore

@toolscore.cases([
    {"input": "weather NYC", "expected": [{"tool": "get_weather", "args": {"city": "NYC"}}]},
    {"input": "email bob",   "expected": [{"tool": "send_email", "args": {"to": "bob"}}]},
])
def test_my_agent(input, expected):
    response = my_agent(input)
    toolscore.assert_tools(expected=expected, actual=response, min_score=0.9)

For file-based workflows, use the built-in fixtures:

def test_agent_accuracy(toolscore_eval, toolscore_assert):
    """Test that agent achieves high accuracy."""
    result = toolscore_eval("gold_calls.json", "trace.json")
    toolscore_assert.assert_selection_accuracy(result, min_accuracy=0.9)
    toolscore_assert.assert_argument_f1(result, min_f1=0.8)

Configure directories via CLI options:

pytest --toolscore-gold-dir tests/gold_standards --toolscore-trace-dir tests/traces

CLI

Six commands cover the full workflow:

toolscore eval gold.json trace.json              # Evaluate
toolscore eval gold.json trace.json --verbose     # Full detail + failure analysis
toolscore eval gold.json trace.json --html report.html  # HTML report
toolscore compare gold.json gpt4.json claude.json # Side-by-side model comparison
toolscore regression baseline.json trace.json -g gold.json  # CI regression check
toolscore init                                    # Scaffold a new project
toolscore generate --from-openai funcs.json       # Synthetic test data from schemas
toolscore validate trace.json                     # Check trace format

Metrics Deep Dive

The composite result.score is a weighted average of four core metrics:

Metric	Weight	Plain English
Selection Accuracy	40%	Did it pick the right tools?
Argument F1	30%	Did it pass the right arguments?
Sequence Accuracy	20%	Did it call them in the right order?
Redundancy (inverted)	10%	Did it avoid unnecessary repeat calls?
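As a worked example of the weighting in the table above, here is a minimal sketch of the arithmetic. It assumes the composite is a plain weighted average and that "inverted" means `1 - redundant_rate`; the `composite_score` function and the sample metric values are hypothetical, and Toolscore's actual computation may differ.

```python
# Weights copied from the table above.
DEFAULT_WEIGHTS = {
    "selection_accuracy": 0.4,
    "argument_f1": 0.3,
    "sequence_accuracy": 0.2,
    "redundant_rate": 0.1,
}


def composite_score(metrics, weights=None):
    """Weighted average of the four metrics (illustrative, not Toolscore's code)."""
    weights = DEFAULT_WEIGHTS if weights is None else weights
    total_weight = sum(weights.values())
    if total_weight <= 0:
        raise ValueError("weights must sum to a positive number")
    weighted = 0.0
    for name, weight in weights.items():
        value = metrics[name]
        # Assumption: redundancy is a "lower is better" rate, so it is inverted.
        if name == "redundant_rate":
            value = 1.0 - value
        weighted += weight * value
    return weighted / total_weight


# Hypothetical metric values: right tools, right order, half the arguments right.
example = {
    "selection_accuracy": 1.0,
    "argument_f1": 0.5,
    "sequence_accuracy": 1.0,
    "redundant_rate": 0.0,
}
print(round(composite_score(example), 2))  # 0.4 + 0.15 + 0.2 + 0.1 = 0.85
```

Dividing by the weight total means the same function also works with weights that do not sum to exactly 1.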

Custom weights are supported:

result = evaluate(
    expected=[...],
    actual=[...],
    weights={
        "selection_accuracy": 0.5,
        "argument_f1": 0.5,
        "sequence_accuracy": 0.0,
        "redundant_rate": 0.0,
    },
)

Additional metrics available in verbose mode: invocation accuracy, tool correctness, trajectory accuracy, cost tracking, latency.
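To make the Argument F1 and Sequence Accuracy rows more concrete, here is one plausible way to compute them. Both definitions are assumptions made for illustration: F1 over exact `(tool, key, value)` matches, and sequence accuracy as one minus the normalized edit distance over tool names. The function names are hypothetical, and Toolscore may use partial credit or a different normalization.

```python
def argument_f1(expected, actual):
    """F1 over exact (tool, key, value) triples; an assumed definition."""
    def triples(calls):
        return {
            (call["tool"], key, repr(value))
            for call in calls
            for key, value in call.get("args", {}).items()
        }

    want, got = triples(expected), triples(actual)
    if not want and not got:
        return 1.0
    true_positives = len(want & got)
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(got)
    recall = true_positives / len(want)
    return 2 * precision * recall / (precision + recall)


def edit_distance(a, b):
    """Levenshtein distance between two sequences of tool names."""
    previous = list(range(len(b) + 1))
    for i, item_a in enumerate(a, start=1):
        current = [i]
        for j, item_b in enumerate(b, start=1):
            cost = 0 if item_a == item_b else 1
            current.append(
                min(previous[j] + 1, current[j - 1] + 1, previous[j - 1] + cost)
            )
        previous = current
    return previous[-1]


def sequence_accuracy(expected, actual):
    """1 - normalized edit distance over tool names; an assumed definition."""
    names_expected = [call["tool"] for call in expected]
    names_actual = [call["tool"] for call in actual]
    longest = max(len(names_expected), len(names_actual))
    if longest == 0:
        return 1.0
    return 1.0 - edit_distance(names_expected, names_actual) / longest


expected = [
    {"tool": "search", "args": {"q": "test", "limit": 5}},
    {"tool": "summarize", "args": {"style": "short"}},
]
actual = [
    {"tool": "summarize", "args": {"style": "short"}},
    {"tool": "search", "args": {"q": "test", "limit": 10}},
]
print(round(argument_f1(expected, actual), 3))  # 2 of 3 triples match on each side
print(sequence_accuracy(expected, actual))      # both positions differ
```

In this example the right tools are called with mostly right arguments but in the wrong order, which is why the two metrics are reported separately.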

CI/CD & Regression Testing

GitHub Action

- uses: yotambraun/toolscore@v1
  with:
    gold-file: tests/gold_standard.json
    trace-file: tests/agent_trace.json
    threshold: '0.90'

Regression testing

Save a baseline, then check for regressions on every run:

# Save a baseline
toolscore eval gold.json trace.json --save-baseline baseline.json

# Check for regressions (fails if accuracy drops >5%)
toolscore regression baseline.json new_trace.json --gold-file gold.json

Exit codes: 0 = PASS, 1 = FAIL (regression detected), 2 = ERROR — plug directly into CI.
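The exit-code contract above can be mirrored in a small gate function. This is a sketch under stated assumptions: "drops >5%" is read as an absolute drop of more than 0.05 in the score, and `regression_exit_code`, its `tolerance` parameter, and the sample scores are hypothetical rather than part of Toolscore.

```python
PASS, FAIL, ERROR = 0, 1, 2


def regression_exit_code(baseline_score, current_score, tolerance=0.05):
    """Map a score comparison onto the PASS / FAIL / ERROR exit codes.

    Assumption: a regression is an absolute drop of more than `tolerance`.
    """
    try:
        drop = float(baseline_score) - float(current_score)
    except (TypeError, ValueError):
        # Unusable input is reported as ERROR rather than as a regression.
        return ERROR
    return FAIL if drop > tolerance else PASS


# Hypothetical scores; in practice they would come from saved results.
print(regression_exit_code(0.92, 0.90))  # small drop, within tolerance
print(regression_exit_code(0.92, 0.84))  # drop larger than the tolerance
print(regression_exit_code(None, 0.84))  # missing baseline
```

Keeping ERROR distinct from FAIL lets a CI job tell a broken pipeline apart from a genuine drop in agent quality.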

Supported Formats

Provider	Format	Auto-detected
OpenAI	`tool_calls` / `function_call`	Yes
Anthropic	`tool_use` content blocks	Yes
Google Gemini	`functionCall` parts	Yes
MCP	JSON-RPC 2.0	Yes
LangChain	`tool` / `tool_input`	Yes
Custom	`{"calls": [{"tool": ..., "args": ...}]}`	Yes
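For readers unfamiliar with the provider formats in the table, the sketch below flattens three of them into the `{"tool": ..., "args": ...}` shape used throughout this README. It is illustrative only: `normalize_calls` is a hypothetical helper, it operates on plain dicts (for example JSON-decoded payloads) rather than SDK response objects, and Toolscore's own extraction may handle more cases.

```python
import json


def normalize_calls(response):
    """Flatten a provider response dict into [{"tool": ..., "args": ...}].

    Hypothetical helper; not Toolscore's extraction code.
    """
    calls = []
    # OpenAI chat completions: choices[].message.tool_calls[].function
    for choice in response.get("choices", []):
        message = choice.get("message", {})
        for call in message.get("tool_calls") or []:
            function = call["function"]
            # OpenAI encodes the arguments as a JSON string.
            calls.append(
                {"tool": function["name"], "args": json.loads(function["arguments"])}
            )
    # Anthropic messages: content[] blocks with type == "tool_use"
    for block in response.get("content", []):
        if block.get("type") == "tool_use":
            calls.append({"tool": block["name"], "args": block["input"]})
    # Gemini: candidates[].content.parts[].functionCall
    for candidate in response.get("candidates", []):
        for part in candidate.get("content", {}).get("parts", []):
            if "functionCall" in part:
                function_call = part["functionCall"]
                calls.append(
                    {"tool": function_call["name"], "args": function_call.get("args", {})}
                )
    return calls


openai_like = {
    "choices": [
        {
            "message": {
                "tool_calls": [
                    {
                        "type": "function",
                        "function": {"name": "get_weather", "arguments": '{"city": "NYC"}'},
                    }
                ]
            }
        }
    ]
}
anthropic_like = {
    "content": [
        {"type": "text", "text": "Checking the weather."},
        {"type": "tool_use", "id": "toolu_example", "name": "get_weather", "input": {"city": "NYC"}},
    ]
}
gemini_like = {
    "candidates": [
        {"content": {"parts": [{"functionCall": {"name": "get_weather", "args": {"city": "NYC"}}}]}}
    ]
}

for payload in (openai_like, anthropic_like, gemini_like):
    print(normalize_calls(payload))
```

All three payloads reduce to the same single call, which is what makes a provider-independent comparison possible.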

Advanced Features

LLM-as-a-Judge

Semantic tool name matching when exact names don't line up (requires OpenAI API key):

toolscore eval gold.json trace.json --llm-judge

Cost Tracking

Token usage and pricing estimation for OpenAI, Anthropic, and Gemini models:

from toolscore.metrics.cost_estimator import calculate_llm_cost, estimate_trace_cost

cost = calculate_llm_cost("gpt-4o", input_tokens=1000, output_tokens=500)
trace_cost = estimate_trace_cost("gpt-4o", trace_calls)

Schema Validation

Validate argument types, ranges, and patterns against JSON schemas:

from toolscore.validators.schema import validate_argument_schema

valid, errors = validate_argument_schema(call, schema={
    "query": {"type": "string", "minLength": 1},
    "limit": {"type": "integer", "minimum": 1, "maximum": 100},
})
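To show what the rules in that schema mean in practice, here is a minimal pure-Python check covering only the keywords used above (`type`, `minLength`, `minimum`, `maximum`). It is a sketch, not `validate_argument_schema` itself: `check_arguments` is a hypothetical function, it takes the arguments dict directly, and it assumes that arguments absent from the call are simply skipped.

```python
def check_arguments(args, schema):
    """Return (valid, errors) for a dict of arguments against per-argument rules.

    Hypothetical helper; handles only the keywords used in the example above.
    """
    type_map = {"string": str, "integer": int}
    errors = []
    for name, rules in schema.items():
        if name not in args:
            continue  # assumption: absent arguments are not validated here
        value = args[name]
        expected = type_map.get(rules.get("type"))
        if expected is not None:
            # bool is a subclass of int, so exclude it when an integer is required.
            wrong_type = not isinstance(value, expected) or (
                expected is int and isinstance(value, bool)
            )
            if wrong_type:
                errors.append(f"{name}: expected {rules['type']}")
                continue
        if "minLength" in rules and isinstance(value, str):
            if len(value) < rules["minLength"]:
                errors.append(f"{name}: shorter than {rules['minLength']}")
        if isinstance(value, int) and not isinstance(value, bool):
            if "minimum" in rules and value < rules["minimum"]:
                errors.append(f"{name}: below minimum {rules['minimum']}")
            if "maximum" in rules and value > rules["maximum"]:
                errors.append(f"{name}: above maximum {rules['maximum']}")
    return (not errors, errors)


schema = {
    "query": {"type": "string", "minLength": 1},
    "limit": {"type": "integer", "minimum": 1, "maximum": 100},
}
print(check_arguments({"query": "python", "limit": 10}, schema))
print(check_arguments({"query": "", "limit": 500}, schema))
```

Returning the full error list, rather than stopping at the first failure, matches the `valid, errors` shape shown in the example above.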

Side-Effect Validation

Verify HTTP responses, files created, and database rows after tool execution:

toolscore eval gold.json trace.json  # side-effect validation is on by default

Trace Capture

Record production tool calls with the @capture_trace decorator:

from toolscore import capture_trace

@capture_trace(name="my-agent")
def run_agent(prompt):
    # ... your agent code ...
    return result
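The general idea behind recording calls with a decorator can be sketched in a few lines. This is not how `@capture_trace` is implemented: `record_tool_call` and `RECORDED_CALLS` are hypothetical names, the sketch wraps an individual tool function rather than a whole agent, and it only handles keyword arguments.

```python
import functools

RECORDED_CALLS = []  # hypothetical in-memory sink for captured calls


def record_tool_call(func):
    """Append each invocation of a tool function to RECORDED_CALLS."""
    @functools.wraps(func)
    def wrapper(**kwargs):
        # Store the call in the same {"tool", "args"} shape used for evaluation.
        RECORDED_CALLS.append({"tool": func.__name__, "args": dict(kwargs)})
        return func(**kwargs)
    return wrapper


@record_tool_call
def get_weather(city):
    # Stand-in for a real tool implementation.
    return f"Sunny in {city}"


get_weather(city="NYC")
print(RECORDED_CALLS)
```

A list recorded this way has the same shape as the `actual` argument in the earlier examples, so it could be compared against an expected list directly.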

Synthetic Test Generation

Generate gold-standard test cases from OpenAI function schemas:

toolscore generate --from-openai functions.json -n 10 --output gold.json

Interactive Debug

Step through mismatches one by one:

toolscore eval gold.json trace.json --debug

Multi-Model Comparison

Compare two or more models side by side:

toolscore compare gold.json gpt4.json claude.json -n gpt-4 -n claude-3

When to Use Toolscore vs. Alternatives

Use case	Recommendation
Fast, deterministic tool-call checks in CI without API costs	Toolscore
Comprehensive LLM evaluation across multiple dimensions (hallucination, toxicity, RAG, tool calls, etc.)	DeepEval
RAG pipeline evaluation (retrieval quality, answer faithfulness)	Ragas
Government/safety-focused AI evaluation	Inspect AI
Tracing and observability for LangChain apps	LangSmith

Toolscore does one thing well: it checks whether your agent called the right tools with the right arguments, deterministically, with zero cost. If you need broader LLM evaluation, the tools above are excellent choices.

File-Based API

The original file-based API is still fully supported:

from toolscore import evaluate_trace

result = evaluate_trace(
    gold_file="gold_calls.json",
    trace_file="trace.json",
    format="auto",
)
print(result.score)
print(result.selection_accuracy)

Development

pip install -e ".[dev]"
pytest
ruff check toolscore
mypy toolscore

License

Apache License 2.0 - see LICENSE for details.

Citation

@software{toolscore,
  title = {Toolscore: Lightweight Tool-Call Testing for LLM Agents},
  author = {Yotam Braun},
  year = {2025},
  url = {https://github.com/yotambraun/toolscore}
}

Project details

Release history

Version	Release date
1.6.0 (this version)	Mar 20, 2026
1.5.0	Feb 6, 2026
1.4.2	Jan 9, 2026
1.4.1	Jan 9, 2026
1.4.0	Jan 9, 2026
1.3.3	Oct 28, 2025
1.3.2	Oct 28, 2025
1.3.1	Oct 28, 2025
1.3.0	Oct 28, 2025
1.2.0	Oct 18, 2025
1.1.1	Oct 18, 2025
1.1.0	Oct 18, 2025
1.0.4	Oct 13, 2025
0.1.0	Oct 10, 2025

Download files

Source Distribution

tool_scorer-1.6.0.tar.gz (1.7 MB), uploaded Mar 20, 2026, Source

Built Distribution

tool_scorer-1.6.0-py3-none-any.whl (88.8 kB), uploaded Mar 20, 2026, Python 3

File details

Details for the file tool_scorer-1.6.0.tar.gz.

File metadata

Download URL: tool_scorer-1.6.0.tar.gz
Upload date: Mar 20, 2026
Size: 1.7 MB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.2.0 CPython/3.12.13

File hashes

Hashes for tool_scorer-1.6.0.tar.gz
Algorithm	Hash digest
SHA256	`f7509907adcc04017c4d384e27c785df80148a6947f7532098ff87fefa1b38d5`
MD5	`ebdbc1d0352e99b9d58570644f816707`
BLAKE2b-256	`c7e422bca6b5855844addb5036413d8a1a0b67522f0cd916f60300c044a5189b`

File details

Details for the file tool_scorer-1.6.0-py3-none-any.whl.

File metadata

Download URL: tool_scorer-1.6.0-py3-none-any.whl
Upload date: Mar 20, 2026
Size: 88.8 kB
Tags: Python 3
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.2.0 CPython/3.12.13

File hashes

Hashes for tool_scorer-1.6.0-py3-none-any.whl
Algorithm	Hash digest
SHA256	`93b7e42f5d51abd04a9ecf119f5f91cab8a364eaa0dd287e350d4493b63e43de`
MD5	`2acfc6d50a73b0ad4e7a5672ca5a1f79`
BLAKE2b-256	`08b43448ac6e20ba9fc8cdb5aabc9178542412e62f108eef9b894ec9217e5f26`
