Python SDK for the ValueHead agent trace evaluation platform
ValueHead
LLM-as-judge observability for AI agents. Submit your agent's conversation traces and get back per-turn scores, tool call evaluations, and trajectory-level assessments — all powered by rubric-based LLM judging.
Website & Dashboard: valuehead.ai
What is ValueHead?
ValueHead is an observability platform that evaluates multi-turn AI agent conversations. It scores each turn as helpful (+1), neutral (0), or harmful (-1) with detailed reasoning, and provides an overall trajectory assessment of the conversation.
It works with:
- Tool-calling agents — evaluates tool selection, parameter accuracy, and response synthesis
- Conversational agents — evaluates response quality, hallucination, and consistency
- Voice AI agents — transcribes audio recordings and evaluates the conversation
- Raw text transcripts — auto-parses speaker roles and evaluates
Installation
```bash
pip install valuehead
```
Or install from source:
```bash
pip install git+https://github.com/Aradhye2002/valuehead-sdk.git
```
Quick start
- Sign up at valuehead.ai and get your API key from Settings
- Submit a trace:
```python
from valuehead import ValueHead

client = ValueHead(api_key="vh_your_api_key_here")

result = client.submit(
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "What's the weather in London?"},
        {"role": "assistant", "content": "Let me check that for you.", "tool_calls": [
            {"id": "call_1", "type": "function", "function": {
                "name": "get_weather", "arguments": "{\"city\": \"London\"}"
            }}
        ]},
        {"role": "tool", "content": "{\"temp\": 15, \"condition\": \"cloudy\"}", "tool_call_id": "call_1"},
        {"role": "assistant", "content": "London is currently 15°C and cloudy."},
    ],
    metadata={"agent": "weather-bot", "version": "1.0"},
)

# Wait for evaluation to complete
session = client.wait(result.session_id)

# Print results
print(f"Turns: {session.total_turns}, Net score: {session.summary.net_score}")
for turn_num, j in session.judgements.items():
    score = f"+{j.score}" if j.score > 0 else str(j.score)
    tools = f" [{len(j.tool_calls)} tools]" if j.tool_calls else ""
    print(f"  Turn {j.turn}: {score}{tools} — {j.turn_reasoning[:100]}")
```
Input formats
1. Structured conversation traces
Submit OpenAI-format messages with system, user, assistant, and tool roles. Each user message starts a new evaluation turn.
```python
result = client.submit(
    messages=[
        {"role": "system", "content": "You are a booking assistant."},
        {"role": "user", "content": "Book me a flight to Paris."},
        {"role": "assistant", "content": "Searching for flights.", "tool_calls": [
            {"id": "c1", "type": "function", "function": {
                "name": "search_flights", "arguments": "{\"destination\": \"Paris\"}"
            }}
        ]},
        {"role": "tool", "content": "{\"flights\": [{\"id\": \"FL123\", \"price\": 450}]}", "tool_call_id": "c1"},
        {"role": "assistant", "content": "I found a flight to Paris for $450. Shall I book it?"},
        {"role": "user", "content": "Yes, book it."},
        {"role": "assistant", "content": "Booking now.", "tool_calls": [
            {"id": "c2", "type": "function", "function": {
                "name": "book_flight", "arguments": "{\"flight_id\": \"FL123\"}"
            }}
        ]},
        {"role": "tool", "content": "{\"confirmation\": \"BK789\"}", "tool_call_id": "c2"},
        {"role": "assistant", "content": "Your flight is booked! Confirmation: BK789."},
    ],
    metadata={"agent": "travel-bot"},
)
```
Supports both OpenAI-style tool_calls arrays and XML-style <tool_call> tags in message content.
2. Voice recordings
Upload audio files (wav, mp3, ogg, m4a, etc.) for automatic transcription with speaker diarization, then evaluation.
```python
result = client.submit_voice(
    "recording.ogg",
    speakers_expected=2,
    first_speaker_role="user",
    metadata={"call_type": "support"},
)
session = client.wait(result.session_id)
```
Parameters:
- `file_path` — path to an audio file
- `speakers_expected` — hint for diarization (e.g. `2` for a two-person call)
- `human_speaker` — override which speaker ID maps to the human role
- `first_speaker_role` — `"user"` (default) or `"assistant"` for outbound/agent-initiated calls
- `metadata` — extra tags
Supports multilingual audio with automatic language detection. When diarization detects 2+ speakers, they are mapped to user/assistant automatically. Single-speaker recordings fall back to LLM-based speaker identification.
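The mapping happens server-side, but the effect described above can be sketched: the first speaker heard gets `first_speaker_role` and any other speaker gets the opposite role. `assign_roles` is a hypothetical illustration, not an SDK function:

```python
def assign_roles(segments, first_speaker_role="user"):
    """Hypothetical sketch of mapping diarized speaker IDs to chat roles.

    The first speaker heard gets `first_speaker_role`; any other
    speaker gets the opposite role. The real mapping runs server-side.
    """
    other = "assistant" if first_speaker_role == "user" else "user"
    role_by_speaker = {}
    messages = []
    for speaker_id, text in segments:
        if speaker_id not in role_by_speaker:
            role_by_speaker[speaker_id] = (
                first_speaker_role if not role_by_speaker else other
            )
        messages.append({"role": role_by_speaker[speaker_id], "content": text})
    return messages

segments = [
    ("S1", "Hi, I need help."),
    ("S2", "Sure, what's wrong?"),
    ("S1", "My order is late."),
]
print(assign_roles(segments))
```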
3. Raw text transcripts
Submit unstructured conversation text. ValueHead's LLM parser identifies speakers and splits into turns before evaluation.
```python
result = client.submit_text(
    text="""
    Agent: Thank you for calling Acme Support, how can I help?
    Customer: Hi, my internet has been down since yesterday.
    Agent: I'm sorry to hear that. Let me check your account.
    Customer: My account number is 12345.
    Agent: I can see there's an outage in your area. It should be resolved by 5 PM today.
    Customer: Okay, thanks for letting me know.
    """,
    context="ISP customer support call",
    metadata={"department": "technical"},
)
session = client.wait(result.session_id)
```
Parameters:
- `text` — the raw conversation as a string
- `context` — optional hint to help the parser (e.g. "recruitment call", "tech support chat")
- `first_speaker_role` — `"user"` (default) or `"assistant"`
- `metadata` — extra tags
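The parsing itself is LLM-based and happens server-side; for a transcript with explicit `Name:` prefixes the effect is roughly the following. This is a naive local sketch of what the parser does, not the SDK's implementation:

```python
def split_transcript(text, first_speaker_role="user"):
    """Naive local split of 'Name: utterance' lines into chat messages.

    Sketch only: the real parser is LLM-based and handles transcripts
    without clean speaker prefixes.
    """
    other = {"user": "assistant", "assistant": "user"}
    messages, role_by_name = [], {}
    for line in text.strip().splitlines():
        name, _, utterance = line.partition(":")
        if not utterance:
            continue  # skip blank or unprefixed lines
        name = name.strip()
        if name not in role_by_name:
            role_by_name[name] = (
                first_speaker_role if not role_by_name else other[first_speaker_role]
            )
        messages.append({"role": role_by_name[name], "content": utterance.strip()})
    return messages
```

For the support-call example above, the agent speaks first, so you would pass `first_speaker_role="assistant"`.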
4. Batch submission
Submit multiple traces at once via the API:
```bash
curl -X POST https://valuehead-production.up.railway.app/api/v1/traces/batch \
  -H "Content-Type: application/json" \
  -H "X-Api-Key: vh_your_key" \
  -d '{"traces": [{"messages": [...], "metadata": {...}}, ...]}'
```
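The same batch call can be prepared from Python with only the standard library. The endpoint and header names come from the curl example above; this sketch builds the request without sending it:

```python
import json
import urllib.request

BASE_URL = "https://valuehead-production.up.railway.app"

def build_batch_request(traces, api_key):
    """Build (but don't send) a batch submission request."""
    payload = json.dumps({"traces": traces}).encode()
    return urllib.request.Request(
        f"{BASE_URL}/api/v1/traces/batch",
        data=payload,
        headers={"Content-Type": "application/json", "X-Api-Key": api_key},
        method="POST",
    )

req = build_batch_request(
    [{"messages": [{"role": "user", "content": "Hello"}], "metadata": {"agent": "demo"}}],
    api_key="vh_your_key",
)
# urllib.request.urlopen(req) would actually submit it
```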
API reference
ValueHead(api_key, base_url, timeout)
Create a client. Points at the hosted ValueHead backend by default.
| Parameter | Type | Default | Description |
|---|---|---|---|
| `api_key` | `str` | required | Your API key (starts with `vh_`) |
| `base_url` | `str` | `https://valuehead-production.up.railway.app` | API base URL |
| `timeout` | `float` | `30.0` | HTTP request timeout in seconds |
Supports context manager:
```python
with ValueHead(api_key="vh_...") as client:
    result = client.submit(messages=[...])
```
client.submit(messages, metadata) -> SubmitResult
Submit a structured trace for async evaluation. Returns immediately.
client.submit_text(text, context, metadata, first_speaker_role) -> SubmitResult
Submit a raw text transcript. Returns immediately.
client.submit_voice(file_path, metadata, human_speaker, speakers_expected, first_speaker_role) -> SubmitResult
Submit an audio file. Returns immediately.
client.wait(session_id, poll_interval, timeout) -> SessionDetail
Poll until evaluation completes or fails. Default timeout is 300 seconds.
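`wait()` is a convenience over `get()`. The equivalent loop looks roughly like this; `fetch` is an injected stand-in for `lambda: client.get(session_id)` so the sketch runs without a live client:

```python
import time

def wait_for_completion(fetch, poll_interval=2.0, timeout=300.0):
    """Poll `fetch()` until the session reaches a terminal status.

    Sketch of what wait() does; `fetch` stands in for
    `lambda: client.get(session_id)`.
    """
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        session = fetch()
        if session["status"] in ("completed", "failed"):
            return session
        time.sleep(poll_interval)
    raise TimeoutError("evaluation did not finish in time")
```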
client.get(session_id) -> SessionDetail
Get full session state including all judgements.
client.list(limit, offset) -> SessionsListResponse
List your evaluation sessions, most recent first.
client.delete(session_id)
Delete a session and its results.
Response models
SubmitResult
| Field | Type | Description |
|---|---|---|
| `session_id` | `str` | Unique session identifier |
| `status` | `str` | `"pending"` — evaluation hasn't started yet |
SessionDetail
| Field | Type | Description |
|---|---|---|
| `id` | `str` | Session ID |
| `status` | `str` | `pending` / `running` / `completed` / `failed` |
| `total_turns` | `int` | Number of conversation turns |
| `judgements` | `dict[str, TurnJudgement]` | Per-turn judgements keyed by turn number |
| `trajectory` | `TrajectoryJudgement \| None` | Trajectory-level assessment, once available |
| `summary` | `ScoreSummary` | Aggregated score counts |
| `error` | `str \| None` | Error message if evaluation failed |
| `metadata` | `dict` | Your metadata tags |
TurnJudgement
| Field | Type | Description |
|---|---|---|
| `turn` | `int` | Turn number (1-indexed) |
| `score` | `int` | -1 (harmful), 0 (neutral), 1 (helpful) |
| `turn_reasoning` | `str` | Chain-of-thought reasoning for the score |
| `user_summary` | `str` | Summary of what the user said/asked |
| `assistant_summary` | `str` | Summary of the assistant's response |
| `tool_calls` | `list[ToolCallJudgement]` | Individual tool call evaluations |
ToolCallJudgement
| Field | Type | Description |
|---|---|---|
| `score` | `int` | -1, 0, or 1 |
| `summary` | `str` | What the tool call did |
| `reasoning` | `str` | Why this score was assigned |
| `call_content` | `str` | The tool call content |
TrajectoryJudgement
| Field | Type | Description |
|---|---|---|
| `score` | `int` | Overall conversation score (-1, 0, 1) |
| `reasoning` | `str` | Why this overall score was given |
| `completed` | `bool` | Did the conversation reach its goal? |
| `early_termination` | `bool` | Did the user cut the conversation short? |
| `failures` | `list[TrajectoryFailure]` | Specific failure points identified |
ScoreSummary
| Field | Type | Description |
|---|---|---|
| `total_turns` | `int` | Total turns in the conversation |
| `judged_turns` | `int` | Number of turns evaluated so far |
| `helpful` | `int` | Count of +1 scores |
| `neutral` | `int` | Count of 0 scores |
| `harmful` | `int` | Count of -1 scores |
| `net_score` | `int` | Trajectory score (or sum of turn scores) |
| `trajectory_score` | `int \| None` | Trajectory-level score, if available |
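When no trajectory score is present, the net score is the sum of per-turn scores, which reduces to helpful minus harmful since neutral turns contribute 0. A minimal sketch of that arithmetic (`net_score` here is an illustration, not the SDK model):

```python
def net_score(helpful, neutral, harmful, trajectory_score=None):
    """Trajectory score wins when present; otherwise sum the turn scores."""
    if trajectory_score is not None:
        return trajectory_score
    return helpful * 1 + neutral * 0 + harmful * -1

print(net_score(5, 2, 1))  # 5 - 1 = 4
```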
Evaluation rubric
Every turn is evaluated on these dimensions:
| Dimension | What it catches |
|---|---|
| Hallucination | Fabricated information not grounded in context or tool outputs |
| Wrong tool usage | Inappropriate tool selection, wrong parameters, or off-topic responses |
| Inconsistent reasoning | Contradicts prior steps, available evidence, or stated plan |
| Unnecessary actions | Redundant tool calls, verbose responses, or wasted steps |
| Safety | Harmful, biased, or inappropriate content |
Each dimension includes dedicated chain-of-thought reasoning. The final score synthesizes all dimensions.
The trajectory-level assessment evaluates the conversation as a whole: did the agent achieve the user's goal? Were there failure points? Did the user give up?
REST API
You can also use the API directly without the SDK.
Authentication
Every request requires either:
- `X-Api-Key: vh_your_key` header (recommended for programmatic use)
- `Authorization: Bearer <jwt_token>` header (from login)
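Both schemes as Python header dicts (a small helper sketch; `auth_headers` is not part of the SDK):

```python
def auth_headers(api_key=None, jwt_token=None):
    """Build auth headers for either scheme (API key recommended)."""
    if api_key:
        return {"X-Api-Key": api_key}
    if jwt_token:
        return {"Authorization": f"Bearer {jwt_token}"}
    raise ValueError("provide api_key or jwt_token")
```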
Endpoints
| Method | Endpoint | Description |
|---|---|---|
| `POST` | `/api/v1/traces` | Submit a conversation trace (async) |
| `POST` | `/api/v1/traces/voice` | Submit a voice recording |
| `POST` | `/api/v1/traces/text` | Submit a raw text transcript |
| `POST` | `/api/v1/traces/batch` | Submit multiple traces |
| `GET` | `/api/v1/traces/{id}` | Get session + judgements |
| `GET` | `/api/v1/traces/{id}/stream` | SSE stream of live judgements |
| `POST` | `/api/v1/traces/{id}/evaluate` | Synchronous evaluation (blocks) |
| `GET` | `/api/v1/traces` | List sessions (paginated) |
| `DELETE` | `/api/v1/traces/{id}` | Delete a session |
| `DELETE` | `/api/v1/traces` | Delete all sessions |
| `POST` | `/api/v1/auth/register` | Create account |
| `POST` | `/api/v1/auth/login` | Get JWT token |
| `GET` | `/api/v1/auth/me` | Get profile |
| `POST` | `/api/v1/auth/rotate-key` | Rotate API key |
Example: curl
```bash
# Submit a trace
curl -X POST https://valuehead-production.up.railway.app/api/v1/traces \
  -H "Content-Type: application/json" \
  -H "X-Api-Key: vh_your_key" \
  -d '{
    "messages": [
      {"role": "system", "content": "You are a helpful assistant."},
      {"role": "user", "content": "Hello"},
      {"role": "assistant", "content": "Hi! How can I help you today?"}
    ]
  }'

# Get results
curl https://valuehead-production.up.railway.app/api/v1/traces/{session_id} \
  -H "X-Api-Key: vh_your_key"
```
SSE streaming
Stream judgements as they are produced, one per turn:
```bash
curl -N https://valuehead-production.up.railway.app/api/v1/traces/{session_id}/stream \
  -H "X-Api-Key: vh_your_key"
```
Events:
- `judgement` — emitted after each turn is judged, contains the turn's judgement data
- `done` — evaluation is complete
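A minimal consumer for the stream (a sketch assuming standard `event:`/`data:` SSE framing with JSON payloads, fed here from a string instead of a live connection):

```python
import json

def parse_sse(raw):
    """Yield (event, data) pairs from raw SSE text.

    Sketch: assumes standard SSE framing where a blank line
    terminates each event and data payloads are JSON.
    """
    event, data_lines = None, []
    for line in raw.splitlines():
        if line.startswith("event:"):
            event = line[len("event:"):].strip()
        elif line.startswith("data:"):
            data_lines.append(line[len("data:"):].strip())
        elif not line:  # blank line terminates an event
            if event or data_lines:
                payload = "\n".join(data_lines)
                yield event, json.loads(payload) if payload else None
            event, data_lines = None, []

raw = 'event: judgement\ndata: {"turn": 1, "score": 1}\n\nevent: done\ndata: {}\n\n'
for event, data in parse_sse(raw):
    print(event, data)
```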
Dashboard
The web dashboard at valuehead.ai provides:
- Trace list — all your evaluated conversations with status, score bars, and metadata tags
- Trace detail — per-turn breakdown with expandable reasoning, tool call scores, and trajectory assessment
- Live streaming — watch judgements appear in real-time as turns are evaluated
- Score visualizations — donut charts and score bars showing helpful/neutral/harmful distribution
- Settings — manage your API key, view your profile
- Admin dashboard — (admin users) view all users, trace counts, system stats, and manage accounts
Limits
| Limit | Default |
|---|---|
| Traces per user | 200 |
| Tool calls per trace | 100 |
| Characters per trace | 500,000 |
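A client-side pre-flight check against these limits can avoid rejected submissions. The numbers mirror the defaults above; `check_trace` is a sketch, not an SDK function:

```python
MAX_TOOL_CALLS = 100   # default tool calls per trace
MAX_CHARS = 500_000    # default characters per trace

def check_trace(messages):
    """Raise ValueError if a trace would exceed the default limits."""
    chars = sum(len(m.get("content") or "") for m in messages)
    tool_calls = sum(len(m.get("tool_calls") or []) for m in messages)
    if chars > MAX_CHARS:
        raise ValueError(f"trace has {chars} chars (limit {MAX_CHARS})")
    if tool_calls > MAX_TOOL_CALLS:
        raise ValueError(f"trace has {tool_calls} tool calls (limit {MAX_TOOL_CALLS})")
```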
Errors
The SDK raises typed exceptions:
| Exception | When |
|---|---|
| `AuthenticationError` | Invalid or missing API key (401/403) |
| `NotFoundError` | Session not found (404) |
| `EvaluationTimeoutError` | `wait()` exceeded its timeout |
| `ValueHeadError` | Any other API error |
```python
from valuehead import ValueHead, AuthenticationError, EvaluationTimeoutError

try:
    session = client.wait(result.session_id, timeout=60)
except EvaluationTimeoutError:
    print("Still evaluating, check back later")
    session = client.get(result.session_id)
```
Self-hosting
ValueHead can be self-hosted. See the main repository for setup instructions.
```bash
git clone https://github.com/Aradhye2002/valuehead-sdk.git
cd valuehead-sdk
cp .env.example .env  # add your OpenAI API key
pip install -r requirements.txt
cd frontend && npm install && npm run build && cd ..
python run.py
```
Point the SDK to your instance:
```python
client = ValueHead(api_key="vh_...", base_url="http://localhost:8000")
```
Environment variables (self-hosted)
| Variable | Required | Default | Description |
|---|---|---|---|
| `TRACE_OBS_LLM_API_KEY` | Yes | — | OpenAI API key for the judge LLM |
| `TRACE_OBS_LLM_BASE_URL` | No | `https://api.openai.com/v1` | Any OpenAI-compatible API |
| `TRACE_OBS_LLM_MODEL` | No | `gpt-5.4-mini` | Judge model ID |
| `TRACE_OBS_JUDGE_TEMPERATURE` | No | `0.6` | Judge temperature |
| `TRACE_OBS_ELEVENLABS_API_KEY` | For voice | — | ElevenLabs API key (speech-to-text) |
| `TRACE_OBS_DATABASE_URL` | No | SQLite | Database URL (supports Postgres) |
| `TRACE_OBS_JWT_SECRET` | Production | `change-me-...` | JWT signing secret |
| `TRACE_OBS_ADMIN_EMAILS` | No | `[]` | Emails auto-granted admin role |
| `TRACE_OBS_MAX_TRACES_PER_USER` | No | `200` | Per-user trace limit |
| `TRACE_OBS_MAX_TOOL_CALLS_PER_TRACE` | No | `100` | Max tool calls per trace |
| `TRACE_OBS_MAX_TRACE_CHARS` | No | `500000` | Max characters per trace |
License
MIT