Evaluation judges for AI voice agents (hallucination, response accuracy, intent, tool correctness, etc.)
Project description
agent-observability-sdk
The Python SDK for shipping evals + telemetry to agent-observability. Three surfaces in one install:
- LiveKit helpers — bootstrap the tag bundle the v2 server expects
(
init_observability), run judges against a session report (run_judges_on_report), resolve the upload URL (ensure_observability_url). For workers that drive LiveKit Agents directly; agent-transport'sAudioStreamServerdoes this internally. - Judges — nine LiveKit-compatible judges ported from
cx-sqs-worker (Hallucination,
Response Accuracy, Tool Correctness, Loop Detection, …) plus a
default_judges()composition helper. Plug straight intolivekit.agents.evals.JudgeGroupalongside LiveKit's built-ins. - pytest plugin — auto-registered via
pytest11entry-point. Everypytestrun becomes oneeval_runin the dashboard; every test function becomes aneval_casewith events, judgments, and failure detail. Same plumbing the deprecated standalonepytest-agent-observabilitypackage used to ship.
Install
pip install agent-observability-sdk
livekit-agents>=1.5.2,<1.6, pytest>=7.0, and httpx>=0.24 are hard
deps and installed automatically. Python ≥ 3.10.
Quick start
1. Raw LiveKit worker (text or audio, your own AgentServer)
from agent_observability.livekit import init_observability, run_judges_on_report
from livekit.agents import AgentServer, JobContext
from livekit.agents.evals import accuracy_judge, safety_judge
server = AgentServer()
async def on_session_end(ctx: JobContext) -> None:
report = ctx.make_session_report()
await run_judges_on_report(
report,
judges=[accuracy_judge(), safety_judge()],
)
@server.rtc_session(agent_name="support-bot", on_session_end=on_session_end)
async def entrypoint(ctx: JobContext) -> None:
init_observability(
ctx.tagger,
agent_id="9c2f7e3d-…", # stable opaque UUID
agent_name="support-bot",
account_id="acct-7",
transport="text",
)
# …your usual AgentSession.start(...) setup
That's the whole observability surface for a raw-LiveKit worker. No
hand-rolled tagger.add(...) calls, no JudgeGroup boilerplate, no
llm.aclose() cleanup.
2. agent-transport worker (AudioStreamServer)
Don't use the helpers — agent-transport already emits tags and runs
judges via its own EvaluationConfig. You only consume the judges
catalogue from this package:
from agent_observability.livekit.judges import (
default_judges,
IntentAccuracyJudge,
rigid_response_accuracy_judge,
)
from agent_transport import AudioStreamServer
from agent_transport.evaluation import EvaluationConfig
from livekit.agents.evals import accuracy_judge
ctx.evaluation = EvaluationConfig(
judge_llm=judge_llm,
judges=[
accuracy_judge(),
IntentAccuracyJudge(expected_intent="book_flight", actual_intent=...),
rigid_response_accuracy_judge(expected_response="...", llm=judge_llm),
*default_judges(llm=judge_llm),
],
)
3. pytest suite
The plugin is auto-discovered — install the SDK, point at the dashboard,
done. Tests with AgentSession.run(...) and .judge(...) work as-is.
export AGENT_OBSERVABILITY_URL=https://obs.example.com
export AGENT_OBSERVABILITY_AGENT_ID=9c2f7e3d-4b8a-4d2e-9f1b-…
pytest
# In a test file
import pytest
from livekit.agents import AgentSession, inference
@pytest.mark.asyncio
async def test_greeting():
async with inference.LLM(model="openai/gpt-4.1-mini") as llm, \
AgentSession(llm=llm) as sess:
await sess.start(Assistant())
result = await sess.run(user_input="Hello")
result.expect.next_event().is_message(role="assistant")
await result.expect.next_event(type="message").judge(
llm, intent="greets politely",
)
Auto-capture is on by default. Every RunResult from
AgentSession.run(...) is collected automatically and .judge(...)
calls are intercepted as first-class Judgment events in the dashboard.
The capture(result) helper is exported for RunResults produced
outside the standard .run() path:
from agent_observability.livekit.pytest import capture
Configuration
| Env var | CLI flag | Purpose |
|---|---|---|
LIVEKIT_OBSERVABILITY_URL |
— | Dashboard base URL (LiveKit-canonical name). Required by init_observability (raises if unset). |
AGENT_OBSERVABILITY_URL |
--agent-observability-url |
Same purpose; init_observability accepts this as a fallback and mirrors it into LIVEKIT_OBSERVABILITY_URL so LiveKit's upload code picks it up. |
AGENT_OBSERVABILITY_AGENT_ID |
--agent-observability-agent-id |
Stable opaque agent identifier. Strongly recommended — without it the session lands unparented on the dashboard (the server accepts the upload but has nothing to backfill the FK with). UUIDs preferred over slugs. |
AGENT_OBSERVABILITY_ACCOUNT_ID |
--agent-observability-account-id |
Multi-tenant account id. Optional. |
AGENT_OBSERVABILITY_USER / _PASS |
— | Basic-auth credentials when the server enables auth. Optional. |
AGENT_OBSERVABILITY_TIMEOUT |
--agent-observability-timeout |
Upload request timeout in seconds (default 10). |
AGENT_OBSERVABILITY_MAX_RETRIES |
--agent-observability-max-retries |
Max upload attempts before falling back (default 3). |
AGENT_OBSERVABILITY_FALLBACK_DIR |
--agent-observability-fallback-dir |
Directory for failed-upload JSON (defaults to .pytest_cache/agent-observability). |
CI metadata (GitHub / GitLab / CircleCI / Buildkite) is auto-detected by the pytest plugin from standard env vars — no configuration needed.
Judge reference
LLM-based (7 factories)
Each returns a LiveKit _LLMJudge you pass straight to a JudgeGroup:
hallucination_judge(llm=...)— fabricated info?rigid_response_accuracy_judge(*, expected_response, llm=...)— semantic match against an expected text.freeflow_response_accuracy_judge(llm=...)— contextually appropriate in an open-ended conversation?hold_requested_intent_accuracy_judge(llm=...)— was a "hold" / "wait" reply justified?variable_extraction_judge(*, expected_variables, actual_variables, llm=...)— were the right values extracted, grounded in the transcript?loop_detection_judge(llm=...)— agent repeating itself?knowledge_base_correctness_judge(*, kb_context, llm=...)— KB lookup faithfully reflected?
Programmatic (2 classes, no LLM call)
IntentAccuracyJudge(*, expected_intent, actual_intent)— case-insensitive string match.ToolCorrectnessJudge(*, expected_tools, threshold=1.0)— auto-extracts function-call events fromchat_ctx; set-membership scoring.
Composition helper
default_judges(llm=None) -> list[Judge]— the four ground-truth-free judges (Hallucination, Freeflow Response Accuracy, Hold-Requested Intent Accuracy, Loop Detection). Spread next to your own ground-truth-bound judges.
Which judges need what data?
| Judge | Required at construction | Read from chat_ctx |
|---|---|---|
hallucination_judge |
— | full conversation |
rigid_response_accuracy_judge |
expected_response |
latest assistant message |
freeflow_response_accuracy_judge |
— | full conversation |
hold_requested_intent_accuracy_judge |
— | latest assistant message + prior user turn |
variable_extraction_judge |
expected_variables, actual_variables |
full conversation (for grounding) |
loop_detection_judge |
— | latest assistant message + prior 2-3 |
knowledge_base_correctness_judge |
kb_context |
full conversation |
IntentAccuracyJudge |
expected_intent, actual_intent |
— (ignored) |
ToolCorrectnessJudge |
expected_tools |
function_call items (auto-extracted) |
Migrating from pytest-agent-observability
The standalone pytest-agent-observability package is discontinued.
The last published release (0.2.1) still installs and runs, but
predates this SDK's helpers and judges. Migrate by switching the
dependency + import:
-pytest-agent-observability
+agent-observability-sdk
-from pytest_agent_observability import capture
+from agent_observability.livekit.pytest import capture
The plugin is auto-discovered via pytest11 entry-point — no extra
config needed. Auto-capture, .judge() interception, retry / fallback
behaviour, and CI metadata extraction are byte-for-byte identical.
Not ported from cx-sqs-worker
Two judges are intentionally absent because their prompts are tightly coupled to the cx-sqs-worker flow-graph runtime (global vs. node instructions, closed available-intents list):
semi_rigid_response_accuracyintent_detection
Use rigid_response_accuracy_judge or freeflow_response_accuracy_judge
for response evaluation; use IntentAccuracyJudge for closed-set
intent checks.
License
MIT
Project details
Release history Release notifications | RSS feed
Download files
Download the file for your platform. If you're not sure which to choose, learn more about installing packages.
Source Distribution
Built Distribution
Filter files by name, interpreter, ABI, and platform.
If you're not sure about the file name format, learn more about wheel file names.
Copy a direct link to the current filters
File details
Details for the file agent_observability_sdk-0.2.1.tar.gz.
File metadata
- Download URL: agent_observability_sdk-0.2.1.tar.gz
- Upload date:
- Size: 54.8 kB
- Tags: Source
- Uploaded using Trusted Publishing? Yes
- Uploaded via: twine/6.1.0 CPython/3.13.12
File hashes
| Algorithm | Hash digest | |
|---|---|---|
| SHA256 |
e62ad7866d863d90599443f14be5ebacfd507d8e509f479f46b53567f08f9e91
|
|
| MD5 |
ef8f4b922d67bd3cb8c1ac5aa3ea6079
|
|
| BLAKE2b-256 |
7b892339af34d3804559213c96738c1ad774b6b4093a341c711bd68b45b863ff
|
Provenance
The following attestation bundles were made for agent_observability_sdk-0.2.1.tar.gz:
Publisher:
publish-observability-sdk.yml on plivo-labs/agent-observability
-
Statement:
-
Statement type:
https://in-toto.io/Statement/v1 -
Predicate type:
https://docs.pypi.org/attestations/publish/v1 -
Subject name:
agent_observability_sdk-0.2.1.tar.gz -
Subject digest:
e62ad7866d863d90599443f14be5ebacfd507d8e509f479f46b53567f08f9e91 - Sigstore transparency entry: 1732268667
- Sigstore integration time:
-
Permalink:
plivo-labs/agent-observability@030901245907e8c88a22fded45beb82005fb097f -
Branch / Tag:
refs/heads/main - Owner: https://github.com/plivo-labs
-
Access:
public
-
Token Issuer:
https://token.actions.githubusercontent.com -
Runner Environment:
github-hosted -
Publication workflow:
publish-observability-sdk.yml@030901245907e8c88a22fded45beb82005fb097f -
Trigger Event:
workflow_run
-
Statement type:
File details
Details for the file agent_observability_sdk-0.2.1-py3-none-any.whl.
File metadata
- Download URL: agent_observability_sdk-0.2.1-py3-none-any.whl
- Upload date:
- Size: 48.8 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? Yes
- Uploaded via: twine/6.1.0 CPython/3.13.12
File hashes
| Algorithm | Hash digest | |
|---|---|---|
| SHA256 |
2d85f2060f97fb29707cdca800310b1227c4ecefb6cc95bcbc6d5c4d47b94d9e
|
|
| MD5 |
e1c1bb47390388be523bc0dcbaf16905
|
|
| BLAKE2b-256 |
037da7498b98dfbbcac90560a6647e8d164deba71f4e60e82a7ec03e79472313
|
Provenance
The following attestation bundles were made for agent_observability_sdk-0.2.1-py3-none-any.whl:
Publisher:
publish-observability-sdk.yml on plivo-labs/agent-observability
-
Statement:
-
Statement type:
https://in-toto.io/Statement/v1 -
Predicate type:
https://docs.pypi.org/attestations/publish/v1 -
Subject name:
agent_observability_sdk-0.2.1-py3-none-any.whl -
Subject digest:
2d85f2060f97fb29707cdca800310b1227c4ecefb6cc95bcbc6d5c4d47b94d9e - Sigstore transparency entry: 1732268743
- Sigstore integration time:
-
Permalink:
plivo-labs/agent-observability@030901245907e8c88a22fded45beb82005fb097f -
Branch / Tag:
refs/heads/main - Owner: https://github.com/plivo-labs
-
Access:
public
-
Token Issuer:
https://token.actions.githubusercontent.com -
Runner Environment:
github-hosted -
Publication workflow:
publish-observability-sdk.yml@030901245907e8c88a22fded45beb82005fb097f -
Trigger Event:
workflow_run
-
Statement type: