Pytest for LLM agents. Self-explaining metrics show exactly WHY your agent failed. Regression testing catches degradation before deployment. GitHub Action for one-click CI/CD. Supports OpenAI, Anthropic, Gemini.
Project description
Toolscore
pytest for LLM agents - catch regressions before deployment
Test tool-calling accuracy for OpenAI, Anthropic, and Gemini
Stop shipping broken LLM agents. Toolscore automatically tests tool-calling behavior by comparing actual agent traces against expected behavior, catching regressions before they reach production. Works with OpenAI, Anthropic, Gemini, LangChain, and custom agents.
๐ What is Toolscore?
Toolscore evaluates LLM tool usage - it doesn't call LLM APIs directly. Think of it as a testing framework for function-calling agents:
- โ Evaluates existing tool usage traces from OpenAI, Anthropic, or custom sources
- โ Compares actual behavior against expected gold standards
- โ Reports detailed metrics on accuracy, efficiency, and correctness
- โ Does NOT call LLM APIs or execute tools (you capture traces separately)
Use Toolscore to:
- Benchmark different LLM models on tool usage tasks
- Validate that your agent calls the right tools with the right arguments
- Track improvements in function calling accuracy over time
- Compare agent performance across different prompting strategies
Features
- Self-Explaining Metrics: Know exactly WHY your agent failed with detailed explanations, similar name detection, and actionable tips
- Regression Testing:
toolscore regressioncommand catches performance degradation with baseline comparison - GitHub Action: One-click CI/CD setup with
yotambraun/toolscore@v1 - Trace vs. Spec Comparison: Load agent tool-use traces (OpenAI, Anthropic, Gemini, MCP, LangChain, or custom) and compare against gold standard specifications
- Comprehensive Metrics Suite:
- Tool Invocation Accuracy
- Tool Selection Accuracy
- Tool Correctness (were all expected tools called?)
- Tool Call Sequence Edit Distance
- Trajectory Accuracy (did agent take the correct reasoning path?)
- Argument Match F1 Score
- Parameter Schema Validation (types, ranges, patterns)
- Redundant Call Rate
- Side-Effect Success Rate (with content validation)
- Cost Tracking & Estimation (token usage, pricing for OpenAI/Anthropic/Gemini)
- Integrated LLM-as-a-judge semantic evaluation
- Multiple Trace Adapters: Built-in support for OpenAI, Anthropic, Google Gemini, MCP (Anthropic), LangChain, and custom JSON formats
- Production Trace Capture: Decorator to capture real agent executions and convert them to test cases
- CLI and API: Command-line interface and Python API for programmatic use
- Beautiful Console Output: Color-coded metrics, tables, and progress indicators with Rich
- Rich Output Reports: Interactive HTML, JSON, CSV (Excel/Sheets), Markdown (GitHub/docs) formats
- Pytest Integration: Seamless test integration with pytest plugin and assertion helpers
- Interactive Tutorials: Jupyter notebooks for hands-on learning
- Example Datasets: 5 realistic gold standards for common agent types (weather, ecommerce, code, RAG, multi-tool)
- Enhanced Validators: Validate side-effects with content checking (file content, database rows, HTTP responses)
- CI/CD Ready: GitHub Actions workflow template included
- Automated Releases: Semantic versioning with conventional commits
๐ Why Toolscore?
| Feature | Toolscore | LangSmith | OpenAI Evals | Weights & Biases | Manual Testing |
|---|---|---|---|---|---|
| Self-Explaining Metrics | โ WHY it failed + tips | โ | โ | โ | โ |
| Regression Testing | โ Baseline comparison | โ ๏ธ Manual | โ | โ ๏ธ Custom | โ |
| GitHub Action | โ One-click CI | โ ๏ธ Custom setup | โ | โ ๏ธ Custom | โ |
| Multi-Provider Support | โ OpenAI, Anthropic, Gemini, MCP | โ ๏ธ LangChain-focused | โ ๏ธ OpenAI-focused | โ Yes | โ |
| Trajectory Evaluation | โ Multi-step path analysis | โ Yes | โ | โ ๏ธ Custom | โ |
| Production Trace Capture | โ Decorator + auto-save | โ Yes | โ | โ Yes | โ |
| Open Source & Free | โ Apache 2.0 | โ Paid (limited free tier) | โ MIT | โ Paid | โ Free |
| Pytest Integration | โ Native plugin | โ ๏ธ Custom | โ | โ ๏ธ Custom | โ ๏ธ Manual |
| Comprehensive Metrics | โ 12+ specialized metrics | โ ๏ธ General metrics | โ ๏ธ Basic scoring | โ General ML metrics | โ |
| Content Validation | โ File/DB content checks | โ | โ | โ | โ |
| Schema Validation | โ Types, ranges, patterns | โ | โ | โ | โ |
| Tool Correctness Check | โ Deterministic coverage | โ | โ | โ | โ |
| LLM-as-a-Judge | โ Built-in | โ Yes | โ ๏ธ External | โ Yes | โ |
| Example Datasets | โ 5 realistic templates | โ ๏ธ Few examples | โ ๏ธ Limited | โ | โ |
| Beautiful HTML Reports | โ Interactive | โ Dashboard | โ ๏ธ Basic | โ Advanced | โ |
| Side-effect Validation | โ HTTP, FS, DB | โ | โ | โ | โ |
| Zero-Config Setup | โ
toolscore init |
โ ๏ธ Requires setup | โ ๏ธ Requires setup | โ ๏ธ Complex setup | โ |
| CI/CD Templates | โ GitHub Actions ready | โ Yes | โ ๏ธ Manual | โ Yes | โ |
| Local-First | โ No cloud required | โ Cloud-based | โ Local | โ Cloud-based | โ |
| Type Safety | โ Fully typed | โ ๏ธ Partial | โ ๏ธ Partial | โ ๏ธ Partial | โ |
Perfect for: Teams that want open-source, multi-provider evaluation with pytest integration and no cloud dependencies.
๐ Integrations
Toolscore works seamlessly with your existing stack:
| Category | Supported |
|---|---|
| LLM Providers | OpenAI, Anthropic, Google Gemini, MCP (Model Context Protocol), Custom APIs |
| Frameworks | LangChain, AutoGPT, CrewAI, Semantic Kernel, raw API calls |
| Testing | Pytest (native plugin), unittest, CI/CD pipelines (GitHub Actions, GitLab CI) |
| Input Formats | JSON, OpenAI format, Anthropic format, Gemini format, MCP (JSON-RPC 2.0), LangChain format, custom adapters |
| Output Formats | HTML reports, JSON, CSV, Markdown, Terminal (Rich), Prometheus metrics |
| Development | VS Code, PyCharm, Jupyter notebooks, Google Colab |
Coming Soon: DataDog integration, Weights & Biases export, Slack notifications
GitHub Action
Add LLM agent evaluation to your CI in seconds:
- uses: yotambraun/toolscore@v1
with:
gold-file: tests/gold_standard.json
trace-file: tests/agent_trace.json
threshold: '0.90'
With regression testing:
- uses: yotambraun/toolscore@v1
with:
gold-file: tests/gold_standard.json
trace-file: tests/agent_trace.json
baseline-file: tests/baseline.json
regression-threshold: '0.05'
See action.yml for all options.
Regression Testing
Catch performance degradation automatically:
# Step 1: Create a baseline from your best evaluation
toolscore eval gold.json trace.json --save-baseline baseline.json
# Step 2: Run regression checks in CI (fails if accuracy drops >5%)
toolscore regression baseline.json new_trace.json --gold-file gold.json
# With custom threshold (10% allowed regression)
toolscore regression baseline.json trace.json -g gold.json -t 0.10
Exit codes:
0: PASS - No regression detected1: FAIL - Regression detected (accuracy dropped)2: ERROR - Invalid files or other errors
๐ฅ Who Uses Toolscore?
Toolscore is trusted by ML engineers and teams building production LLM applications:
- Startups building agent-first products
- Research teams benchmarking LLM capabilities
- Enterprise teams ensuring agent reliability in production
- Independent developers optimizing prompt engineering
"Toolscore cut our agent testing time by 80% and caught 3 critical regressions before deployment" - ML Engineer
Using Toolscore? Share your story โ
๐ฆ Installation
# Install from PyPI
pip install tool-scorer
# Or install from source
git clone https://github.com/yotambraun/toolscore.git
cd toolscore
pip install -e .
Optional Dependencies
# Install with HTTP validation support
pip install tool-scorer[http]
# Install with LLM-as-a-judge metrics (requires OpenAI API key)
pip install tool-scorer[llm]
# Install with LangChain support
pip install tool-scorer[langchain]
# Install all optional features
pip install tool-scorer[all]
Development Installation
# Install with development dependencies
pip install -e ".[dev]"
# Install with dev + docs dependencies
pip install -e ".[dev,docs]"
# Or using uv (faster)
uv pip install -e ".[dev]"
What's New in v1.4.0
Toolscore v1.4.0 introduces three high-impact features based on real user needs:
Self-Explaining Metrics
Know exactly WHY your agent failed - not just that it failed. Get detailed explanations after each evaluation.
toolscore eval gold.json trace.json --verbose
# Output:
# What Went Wrong:
# MISSING: Expected tool 'search_web' was never called
# MISMATCH: Position 2: Expected 'summarize' but got 'summary' (similar names detected)
# WRONG_ARGS: Argument 'limit' expected 10, got 100
#
# Tips:
# Use --llm-judge to catch semantic equivalence (search vs web_search)
Regression Testing (toolscore regression)
Catch performance degradation automatically in CI/CD. 58% of prompt+model combinations degrade over API updates - now you'll know immediately.
# Create baseline from your best run
toolscore eval gold.json trace.json --save-baseline baseline.json
# Run regression checks (fails if accuracy drops >5%)
toolscore regression baseline.json new_trace.json --gold-file gold.json
# Exit codes: 0=PASS, 1=FAIL (regression), 2=ERROR
GitHub Action
One-click CI setup. Add agent quality gates to any repository in 30 seconds:
- uses: yotambraun/toolscore@v1
with:
gold-file: tests/gold_standard.json
trace-file: tests/agent_trace.json
threshold: '0.90'
fail-on-regression: 'true'
See examples/github_actions/ for complete workflow examples.
Also in Toolscore
- Zero-Friction Onboarding:
toolscore init- interactive project setup in 30 seconds - Synthetic Test Generator:
toolscore generate- create test cases from OpenAI schemas - Quick Compare:
toolscore compare- compare multiple models side-by-side - Interactive Debug Mode:
--debugflag for step-by-step failure analysis - LLM-as-a-Judge:
--llm-judgeflag for semantic tool name matching - Schema Validation: Validate argument types, ranges, patterns
- Example Datasets: 5 realistic gold standards (weather, ecommerce, code, RAG, multi-tool)
๐ Quick Start
๐ 30-Second Start
The fastest way to start evaluating:
# Install
pip install tool-scorer
# Initialize project (interactive)
toolscore init
# Evaluate (included templates)
toolscore eval gold_calls.json example_trace.json
Done! You now have evaluation results with detailed metrics.
5-Minute Complete Workflow
-
Install Toolscore:
pip install tool-scorer
-
Initialize a project (choose from 5 agent types):
toolscore init # Select agent type โ Get templates + examples
-
Generate test cases (if you have OpenAI function schemas):
toolscore generate --from-openai functions.json --count 20
-
Run evaluation with your agent's trace:
# Basic evaluation toolscore eval gold_calls.json my_trace.json --html report.html # With semantic matching (catches similar tool names) toolscore eval gold_calls.json my_trace.json --llm-judge # With interactive debugging toolscore eval gold_calls.json my_trace.json --debug
-
Compare multiple models:
toolscore compare gold.json gpt4.json claude.json \ -n gpt-4 -n claude-3
-
View results:
- Console shows color-coded metrics
- Open
report.htmlfor interactive analysis - Check
toolscore.jsonfor machine-readable results
Want to test with your own LLM? See the Complete Tutorial for step-by-step instructions on capturing traces from OpenAI/Anthropic APIs.
Command Line Usage
# ===== GETTING STARTED =====
# Initialize new project (interactive)
toolscore init
# Generate test cases from OpenAI function schemas
toolscore generate --from-openai functions.json --count 20 -o gold.json
# Validate trace file format
toolscore validate trace.json
# ===== EVALUATION =====
# Basic evaluation
toolscore eval gold_calls.json trace.json
# With HTML report
toolscore eval gold_calls.json trace.json --html report.html
# With semantic matching (LLM-as-a-judge)
toolscore eval gold_calls.json trace.json --llm-judge
# With interactive debugging
toolscore eval gold_calls.json trace.json --debug
# Verbose output (shows missing/extra tools)
toolscore eval gold_calls.json trace.json --verbose
# Specify trace format explicitly
toolscore eval gold_calls.json trace.json --format openai
# Use realistic example dataset
toolscore eval examples/datasets/ecommerce_agent.json trace.json
# ===== MULTI-MODEL COMPARISON =====
# Compare multiple models side-by-side
toolscore compare gold.json gpt4.json claude.json gemini.json
# With custom model names
toolscore compare gold.json model1.json model2.json \
-n "GPT-4" -n "Claude-3-Opus"
# Save comparison report
toolscore compare gold.json *.json -o comparison.json
Python API
from toolscore import evaluate_trace
# Run evaluation
result = evaluate_trace(
gold_file="gold_calls.json",
trace_file="trace.json",
format="auto" # auto-detect format
)
# Access metrics
print(f"Invocation Accuracy: {result.metrics['invocation_accuracy']:.2%}")
print(f"Selection Accuracy: {result.metrics['selection_accuracy']:.2%}")
sequence = result.metrics['sequence_metrics']
print(f"Sequence Accuracy: {sequence['sequence_accuracy']:.2%}")
arguments = result.metrics['argument_metrics']
print(f"Argument F1: {arguments['f1']:.2%}")
Pytest Integration
Toolscore includes a pytest plugin for seamless test integration:
# test_my_agent.py
def test_agent_accuracy(toolscore_eval, toolscore_assertions):
"""Test that agent achieves high accuracy."""
result = toolscore_eval("gold_calls.json", "trace.json")
# Use built-in assertions
toolscore_assertions.assert_invocation_accuracy(result, min_accuracy=0.9)
toolscore_assertions.assert_selection_accuracy(result, min_accuracy=0.9)
toolscore_assertions.assert_argument_f1(result, min_f1=0.8)
The plugin is automatically loaded when you install Toolscore. See the examples for more patterns.
Interactive Tutorials
Try Toolscore in your browser with our Jupyter notebooks:
- Quickstart Tutorial - 5-minute introduction
- Custom Formats - Working with custom traces
- Advanced Metrics - Deep dive into metrics
Open them in Google Colab for instant experimentation.
๐ Gold Standard Format
Create a gold_calls.json file defining the expected tool calls:
[
{
"tool": "make_file",
"args": {
"filename": "poem.txt",
"lines_of_text": ["Roses are red,", "Violets are blue."]
},
"side_effects": {
"file_exists": "poem.txt"
},
"description": "Create a file with a poem"
}
]
๐ Trace Formats
Toolscore supports multiple trace formats:
OpenAI Format
[
{
"role": "assistant",
"function_call": {
"name": "get_weather",
"arguments": "{\"location\": \"Boston\"}"
}
}
]
Anthropic Format
[
{
"role": "assistant",
"content": [
{
"type": "tool_use",
"id": "toolu_123",
"name": "search",
"input": {"query": "Python"}
}
]
}
]
LangChain Format
[
{
"tool": "search",
"tool_input": {"query": "Python tutorials"},
"log": "Invoking search..."
}
]
Or modern format:
[
{
"name": "search",
"args": {"query": "Python"},
"id": "call_123"
}
]
Custom Format
{
"calls": [
{
"tool": "read_file",
"args": {"path": "data.txt"},
"result": "file contents"
}
]
}
๐ Metrics Explained
Tool Invocation Accuracy
Measures whether the agent invoked tools when needed and refrained when not needed.
Tool Selection Accuracy
Proportion of tool calls that match expected tool names.
Tool Correctness (NEW)
Checks if all expected tools were called at least once - complements selection accuracy by measuring coverage rather than per-call matching.
Sequence Edit Distance
Levenshtein distance between expected and actual tool call sequences.
Argument Match F1
Precision and recall of argument correctness across all tool calls.
Schema Validation (NEW)
Validates argument types, numeric ranges, string patterns, enums, and required fields. Define schemas in your gold standard:
{
"tool": "search",
"args": {"query": "test", "limit": 10},
"metadata": {
"schema": {
"query": {"type": "string", "minLength": 1},
"limit": {"type": "integer", "minimum": 1, "maximum": 100}
}
}
}
Redundant Call Rate
Percentage of unnecessary or duplicate tool calls.
Side-Effect Success Rate
Proportion of validated side-effects (HTTP, filesystem, database) that succeeded.
LLM-as-a-judge Semantic Evaluation (Integrated)
Now built into core evaluation! Use --llm-judge flag to evaluate semantic equivalence beyond exact string matching. Perfect for catching cases where tool names differ but intentions match (e.g., search_web vs web_search).
# CLI usage - easiest way
tool-scorer eval gold.json trace.json --llm-judge
# Python API
result = evaluate_trace("gold.json", "trace.json", use_llm_judge=True)
print(f"Semantic Score: {result.metrics['semantic_metrics']['semantic_score']:.2%}")
๐๏ธ Project Structure
toolscore/
โโโ adapters/ # Trace format adapters
โ โโโ openai.py
โ โโโ anthropic.py
โ โโโ custom.py
โโโ metrics/ # Metric calculators
โ โโโ accuracy.py
โ โโโ sequence.py
โ โโโ arguments.py
โ โโโ ...
โโโ validators/ # Side-effect validators
โ โโโ http.py
โ โโโ filesystem.py
โ โโโ database.py
โโโ reports/ # Report generators
โโโ cli.py # CLI interface
โโโ core.py # Core evaluation logic
Development
# Install dependencies
pip install -e ".[dev]"
# Run tests
pytest
# Run tests with coverage
pytest --cov=toolscore
# Type checking
mypy toolscore
# Linting and formatting
ruff check toolscore
ruff format toolscore
๐ฏ Real-World Use Cases
1. Model Evaluation & Selection
Compare GPT-4 vs Claude vs Gemini on your specific tool-calling tasks:
models = ["gpt-4", "claude-3-5-sonnet", "gemini-pro"]
results = {}
for model in models:
trace = capture_trace(model, task="customer_support")
result = evaluate_trace("gold_standard.json", trace)
results[model] = result.metrics['selection_accuracy']
best_model = max(results, key=results.get)
print(f"Best model: {best_model} ({results[best_model]:.1%} accuracy)")
2. CI/CD Integration
Catch regressions in agent behavior before deployment:
# test_agent_quality.py
def test_agent_meets_sla(toolscore_eval, toolscore_assertions):
"""Ensure agent meets 95% accuracy SLA."""
result = toolscore_eval("gold_standard.json", "production_trace.json")
toolscore_assertions.assert_selection_accuracy(result, min_accuracy=0.95)
toolscore_assertions.assert_redundancy_rate(result, max_rate=0.1)
3. Prompt Engineering Optimization
A/B test different prompts and measure impact:
prompts = ["prompt_v1.txt", "prompt_v2.txt", "prompt_v3.txt"]
for prompt_file in prompts:
trace = run_agent_with_prompt(prompt_file)
result = evaluate_trace("gold_standard.json", trace)
print(f"{prompt_file}:")
print(f" Selection: {result.metrics['selection_accuracy']:.1%}")
print(f" Arguments: {result.metrics['argument_metrics']['f1']:.1%}")
print(f" Efficiency: {result.metrics['efficiency_metrics']['redundant_rate']:.1%}")
4. Production Monitoring
Track agent performance over time in production:
# Run daily
today_traces = collect_production_traces(date=today)
result = evaluate_trace("gold_standard.json", today_traces)
# Alert if degradation
if result.metrics['selection_accuracy'] < 0.90:
send_alert("Agent performance degraded!")
# Log metrics to dashboard
log_to_datadog({
"accuracy": result.metrics['selection_accuracy'],
"redundancy": result.metrics['efficiency_metrics']['redundant_rate'],
})
๐ Documentation
- ReadTheDocs - Complete API documentation
- Complete Tutorial - In-depth guide with end-to-end workflow
- Example Datasets - 5 realistic gold standards (weather, ecommerce, code, RAG, multi-tool)
- Examples Directory - Sample traces and capture scripts
- Jupyter Notebooks - Interactive tutorials
- Contributing Guide - How to contribute to Toolscore
What's New
v1.4.0 (Latest - January 2026)
Self-Explaining Metrics:
- Know exactly WHY your agent failed with detailed explanations
- Automatic detection of tool name mismatches and similar names
- Actionable tips like "use --llm-judge to catch semantic equivalence"
- Per-metric breakdowns showing missing, extra, and mismatched items
Regression Testing:
- New
toolscore regressioncommand for CI/CD integration - Save baselines with
--save-baselineflag - Automatic PASS/FAIL with configurable thresholds
- Detailed delta reports showing improvements and regressions
GitHub Action:
- Official action on GitHub Marketplace
- One-click CI setup for any repository
- Supports both threshold and regression testing modes
- Automatic report artifacts and job summaries
v1.1.0 (October 2025)
Major Product Improvements:
- Integrated LLM-as-a-Judge with
--llm-judgeflag - Tool Correctness Metric for complete tool coverage
- Parameter Schema Validation for types, ranges, patterns
- Example Datasets: 5 realistic gold standards
- Enhanced Console Output with Rich tables
v1.0.x
- LLM-as-a-judge metrics: Semantic correctness evaluation using OpenAI API
- LangChain adapter: Support for LangChain agent traces (legacy and modern formats)
- Beautiful console output: Color-coded metrics with Rich library
- Pytest plugin: Seamless test integration with fixtures and assertions
- Interactive tutorials: Jupyter notebooks for hands-on learning
- Comprehensive documentation: Sphinx docs on ReadTheDocs
- Test coverage: Increased to 80%+ with 123 passing tests
- Automated releases: Semantic versioning with conventional commits
- Enhanced PyPI presence: 16 searchable keywords, Beta status, comprehensive classifiers
See CHANGELOG.md for full release history.
๐ค Contributing
Contributions are welcome! Please see CONTRIBUTING.md for guidelines.
๐ License
Apache License 2.0 - see LICENSE for details.
๐ Citation
If you use Toolscore in your research, please cite:
@software{toolscore,
title = {Toolscore: LLM Tool Usage Evaluation Package},
author = {Yotam Braun},
year = {2025},
url = {https://github.com/yotambraun/toolscore}
}
Project details
Release history Release notifications | RSS feed
Download files
Download the file for your platform. If you're not sure which to choose, learn more about installing packages.
Source Distribution
Built Distribution
Filter files by name, interpreter, ABI, and platform.
If you're not sure about the file name format, learn more about wheel file names.
Copy a direct link to the current filters
File details
Details for the file tool_scorer-1.4.2.tar.gz.
File metadata
- Download URL: tool_scorer-1.4.2.tar.gz
- Upload date:
- Size: 1.7 MB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.12.12
File hashes
| Algorithm | Hash digest | |
|---|---|---|
| SHA256 |
028346ea4afe82011feb271be150ad77d9812806d1f1f3885fbf67b47c6be421
|
|
| MD5 |
4861a716059c3c73ba1059c03d4733d4
|
|
| BLAKE2b-256 |
8250d3c0caecd6e6d5fae54ba084f7bf68dd0d61dde367ba0c1554ab82246e6a
|
File details
Details for the file tool_scorer-1.4.2-py3-none-any.whl.
File metadata
- Download URL: tool_scorer-1.4.2-py3-none-any.whl
- Upload date:
- Size: 87.0 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.12.12
File hashes
| Algorithm | Hash digest | |
|---|---|---|
| SHA256 |
3e15db51d245327b256adb1e1b6a0950b354df856f038f382caeeecdd0f19e68
|
|
| MD5 |
5dcc429279d7bac98ad1045cd837e6e9
|
|
| BLAKE2b-256 |
174e04cee4a6e10c9ca046ce433d6647d46138ab2a4a2653fce60609d91e459d
|