Native PydanticAI evaluation with automatic cost tracking
Why Arbiter?
Most evaluation frameworks tell you whether your outputs are good. Arbiter tells you that and exactly what each evaluation cost: every call automatically tracks tokens, latency, and real dollar costs, whichever provider you use.
from arbiter_ai import evaluate
result = await evaluate(
    output="Paris is the capital of France",
    reference="The capital of France is Paris",
    evaluators=["semantic"],
    model="gpt-4o-mini"
)
print(f"Score: {result.overall_score:.2f}")
print(f"Cost: ${await result.total_llm_cost():.6f}")
print(f"Calls: {len(result.interactions)}")
What makes Arbiter different:
- Automatic cost tracking using LiteLLM's bundled pricing database
- PydanticAI native with type-safe structured outputs
- Pure library with no platform signup or server to run
- Complete observability with every LLM interaction tracked
Installation
pip install arbiter-ai
Optional features:
pip install arbiter-ai[cli] # Command-line interface
pip install arbiter-ai[scale] # Fast FAISS semantic backend
pip install arbiter-ai[storage] # PostgreSQL + Redis persistence
pip install arbiter-ai[verifiers] # Web search fact verification
Quick Start
Set your API key:
export OPENAI_API_KEY=sk-...
Run an evaluation:
import asyncio

from arbiter_ai import evaluate

async def main():
    # evaluate() is awaited, so call it from inside an async function
    result = await evaluate(
        output="Paris is the capital of France",
        reference="The capital of France is Paris",
        evaluators=["semantic"],
        model="gpt-4o-mini"
    )
    print(f"Score: {result.overall_score:.2f}")
    print(f"Cost: ${await result.total_llm_cost():.6f}")

asyncio.run(main())
Evaluators
Semantic Similarity
result = await evaluate(
    output="Paris is the capital of France",
    reference="The capital of France is Paris",
    evaluators=["semantic"],
    model="gpt-4o-mini"
)
Custom Criteria (no reference needed)
result = await evaluate(
    output="Medical advice about diabetes management",
    criteria="Medical accuracy, HIPAA compliance, appropriate tone",
    evaluators=["custom_criteria"],
    model="gpt-4o-mini"
)
print(f"Criteria met: {result.scores[0].metadata['criteria_met']}")
Pairwise Comparison (A/B testing)
from arbiter_ai import compare
comparison = await compare(
    output_a="GPT-4 response",
    output_b="Claude response",
    criteria="accuracy, clarity, completeness",
    model="gpt-4o-mini"
)
print(f"Winner: {comparison.winner}") # output_a, output_b, or tie
print(f"Confidence: {comparison.confidence:.2f}")
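For A/B testing over more than one prompt, the per-comparison winners can be tallied into win rates. The sketch below is a minimal illustration, not part of Arbiter: `win_rates` is a hypothetical helper, and the list of winners is made-up sample data standing in for the `comparison.winner` values collected from several `compare()` calls.

```python
from collections import Counter

LABELS = ("output_a", "output_b", "tie")  # the three winner values named above

def win_rates(winners: list[str]) -> dict[str, float]:
    """Return the fraction of comparisons won by each side (hypothetical helper)."""
    if not winners:
        raise ValueError("need at least one comparison result")
    counts = Counter(winners)
    return {label: counts.get(label, 0) / len(winners) for label in LABELS}

# Made-up winners, as if gathered from five compare() calls
winners = ["output_a", "output_b", "output_a", "tie", "output_a"]
for label, rate in win_rates(winners).items():
    print(f"{label}: {rate:.0%}")
```

A single comparison can be noisy, so aggregating over a set of prompts gives a steadier picture of which output is preferred.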
Factuality, Groundedness, Relevance
# Hallucination detection
result = await evaluate(output=text, evaluators=["factuality"])
# RAG source attribution
result = await evaluate(output=rag_response, evaluators=["groundedness"])
# Query-output alignment
result = await evaluate(output=response, reference=query, evaluators=["relevance"])
Multiple Evaluators
result = await evaluate(
    output="Your LLM output",
    reference="Expected output",
    criteria="Accuracy, clarity, completeness",
    evaluators=["semantic", "custom_criteria", "factuality"],
    model="gpt-4o-mini"
)
for score in result.scores:
    print(f"{score.name}: {score.value:.2f}")
Batch Evaluation
from arbiter_ai import batch_evaluate
items = [
    {"output": "Paris is capital of France", "reference": "Paris is France's capital"},
    {"output": "Tokyo is capital of Japan", "reference": "Tokyo is Japan's capital"},
]
result = await batch_evaluate(
    items=items,
    evaluators=["semantic"],
    model="gpt-4o-mini",
    max_concurrency=5
)
print(f"Success: {result.successful_items}/{result.total_items}")
print(f"Total cost: ${await result.total_llm_cost():.4f}")
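To give a feel for what a `max_concurrency` limit does, here is a minimal sketch of bounded concurrency with `asyncio.Semaphore`. It is an illustration of the general technique, not Arbiter's actual implementation: `fake_evaluate` and `run_bounded` are hypothetical names, and no LLM is called.

```python
import asyncio

async def fake_evaluate(item: dict) -> float:
    """Hypothetical stand-in for one evaluation call; no LLM is contacted."""
    await asyncio.sleep(0.01)  # simulate network latency
    return 1.0 if item["output"] and item["reference"] else 0.0

async def run_bounded(items: list[dict], max_concurrency: int) -> tuple[list[float], int]:
    """Evaluate every item while keeping at most max_concurrency calls in flight."""
    semaphore = asyncio.Semaphore(max_concurrency)
    in_flight = 0
    peak = 0  # highest number of simultaneous calls observed

    async def worker(item: dict) -> float:
        nonlocal in_flight, peak
        async with semaphore:  # waits here once the limit is reached
            in_flight += 1
            peak = max(peak, in_flight)
            try:
                return await fake_evaluate(item)
            finally:
                in_flight -= 1

    # gather() preserves input order in its results
    scores = await asyncio.gather(*(worker(item) for item in items))
    return list(scores), peak

items = [{"output": f"answer {i}", "reference": f"answer {i}"} for i in range(20)]
scores, peak = asyncio.run(run_bounded(items, max_concurrency=5))
print(f"Evaluated {len(scores)} items, peak concurrency {peak}")
```

A lower limit reduces the chance of hitting provider rate limits at the price of a longer total run time.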
Command-Line Interface
pip install arbiter-ai[cli]
# Single evaluation
arbiter evaluate --output "Paris is the capital" --reference "Paris is France's capital"
# Batch evaluation from file
arbiter batch --file inputs.jsonl --evaluators semantic --output results.json
# Compare two outputs
arbiter compare --output-a "Response A" --output-b "Response B"
# List evaluators
arbiter list-evaluators
# Check costs
arbiter cost --model gpt-4o-mini --input-tokens 1000 --output-tokens 500
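The arithmetic behind a token-based cost estimate is simple enough to check by hand. The sketch below uses the same token counts as the `arbiter cost` command above; the function name is hypothetical and the per-million-token prices are illustrative assumptions, not values read from Arbiter or from any pricing database.

```python
def estimate_cost(
    input_tokens: int,
    output_tokens: int,
    input_price_per_million: float,
    output_price_per_million: float,
) -> float:
    """Return the estimated cost in USD for one call (hypothetical helper)."""
    total = (
        input_tokens * input_price_per_million
        + output_tokens * output_price_per_million
    )
    return total / 1_000_000

# Assumed prices in USD per million tokens; check current provider pricing.
cost = estimate_cost(
    input_tokens=1000,
    output_tokens=500,
    input_price_per_million=0.15,
    output_price_per_million=0.60,
)
print(f"Estimated cost: ${cost:.6f}")
```

With these assumed prices the estimate comes to $0.000450, which is why the examples in this README print costs with six decimal places.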
Provider Support
Arbiter works with any model that PydanticAI supports, including:
- OpenAI (GPT-4, GPT-4o, GPT-4o-mini)
- Anthropic (Claude)
- Google (Gemini)
- Groq
- Mistral
- Cohere
Set the appropriate API key as an environment variable.
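Before running an evaluation it can help to confirm which provider keys are present. The check below is a small sketch: only `OPENAI_API_KEY` appears in this README, and the other variable names are assumptions based on each provider's usual convention, so confirm them against the provider or PydanticAI documentation. `configured_providers` is a hypothetical helper.

```python
import os

# Only OPENAI_API_KEY comes from this README; the remaining names are assumed.
PROVIDER_KEYS = {
    "openai": "OPENAI_API_KEY",
    "anthropic": "ANTHROPIC_API_KEY",
    "google": "GEMINI_API_KEY",
    "groq": "GROQ_API_KEY",
    "mistral": "MISTRAL_API_KEY",
    "cohere": "CO_API_KEY",
}

def configured_providers(env=None) -> list[str]:
    """Return the providers whose API key variable is set and non-empty."""
    env = os.environ if env is None else env
    return [name for name, var in PROVIDER_KEYS.items() if env.get(var)]

print("Configured providers:", configured_providers())
```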
Examples
# Run examples
python examples/basic_evaluation.py
python examples/custom_criteria_example.py
python examples/pairwise_comparison_example.py
python examples/batch_evaluation_example.py
python examples/observability_example.py
Development
git clone https://github.com/ashita-ai/arbiter.git
cd arbiter
uv sync --all-extras
make test
License
MIT License
File details
Details for the file arbiter_ai-0.2.0.tar.gz.
File metadata
- Download URL: arbiter_ai-0.2.0.tar.gz
- Upload date:
- Size: 105.4 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.13.9
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 63a79e299579cbbee856c91f568b0c305d4e841855dcdadadb31278c54e8a64a |
| MD5 | 07f859f9a89df77c5cf13923bb42eac1 |
| BLAKE2b-256 | 2598357f9cac8eb50df17bab9409a67ad827a401303e2b6be3f183c3776105f1 |
File details
Details for the file arbiter_ai-0.2.0-py3-none-any.whl.
File metadata
- Download URL: arbiter_ai-0.2.0-py3-none-any.whl
- Upload date:
- Size: 127.4 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.13.9
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 905294e58f33254dd636e5bfd7efaf57c93287c2d73aafee66cc71dfdaf531ef |
| MD5 | 1fc88152b08849e831459045a520a2fc |
| BLAKE2b-256 | 30279ba8b76554228b0c57866d59ef122ee55c74cf8e210deb4110483d28b1a0 |