Evaluation judges for AI voice agents (hallucination, response accuracy, intent, tool correctness, etc.)

These details have not been verified by PyPI

Project description

agent-observability-sdk

The Python SDK for shipping evals + telemetry to agent-observability. Three surfaces in one install:

LiveKit helpers — bootstrap the tag bundle the v2 server expects (init_observability), run judges against a session report (run_judges_on_report), resolve the upload URL (ensure_observability_url). For workers that drive LiveKit Agents directly; agent-transport's AudioStreamServer does this internally.
Judges — nine LiveKit-compatible judges ported from cx-sqs-worker (Hallucination, Response Accuracy, Tool Correctness, Loop Detection, …) plus a default_judges() composition helper. Plug straight into livekit.agents.evals.JudgeGroup alongside LiveKit's built-ins.
pytest plugin — auto-registered via pytest11 entry-point. Every pytest run becomes one eval_run in the dashboard; every test function becomes an eval_case with events, judgments, and failure detail. Same plumbing the deprecated standalone pytest-agent-observability package used to ship.

Install

pip install agent-observability-sdk

livekit-agents>=1.5.2,<1.6, pytest>=7.0, and httpx>=0.24 are hard dependencies and are installed automatically. Requires Python ≥ 3.10.

Quick start

1. Raw LiveKit worker (text or audio, your own `AgentServer`)

from agent_observability.livekit import init_observability, run_judges_on_report
from livekit.agents import AgentServer, JobContext
from livekit.agents.evals import accuracy_judge, safety_judge

server = AgentServer()

async def on_session_end(ctx: JobContext) -> None:
    report = ctx.make_session_report()
    await run_judges_on_report(
        report,
        judges=[accuracy_judge(), safety_judge()],
    )

@server.rtc_session(agent_name="support-bot", on_session_end=on_session_end)
async def entrypoint(ctx: JobContext) -> None:
    init_observability(
        ctx.tagger,
        agent_id="9c2f7e3d-…",       # stable opaque UUID
        agent_name="support-bot",
        account_id="acct-7",
        transport="text",
    )
    # …your usual AgentSession.start(...) setup

That's the whole observability surface for a raw-LiveKit worker. No hand-rolled tagger.add(...) calls, no JudgeGroup boilerplate, no llm.aclose() cleanup.

2. agent-transport worker (`AudioStreamServer`)

Don't use the helpers — agent-transport already emits tags and runs judges via its own EvaluationConfig. You only consume the judges catalogue from this package:

from agent_observability.livekit.judges import (
    default_judges,
    IntentAccuracyJudge,
    rigid_response_accuracy_judge,
)
from agent_transport import AudioStreamServer
from agent_transport.evaluation import EvaluationConfig
from livekit.agents.evals import accuracy_judge

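# `ctx` and `judge_llm` come from your own worker setup; they are not defined in this snippet.
ctx.evaluation = EvaluationConfig(
    judge_llm=judge_llm,
    judges=[
        accuracy_judge(),
        # Pass the intent your agent actually detected as `actual_intent`.
        IntentAccuracyJudge(expected_intent="book_flight", actual_intent=...),
        rigid_response_accuracy_judge(expected_response="...", llm=judge_llm),
        *default_judges(llm=judge_llm),
    ],
)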

3. pytest suite

The plugin is auto-discovered — install the SDK, point at the dashboard, done. Tests with AgentSession.run(...) and .judge(...) work as-is.

export AGENT_OBSERVABILITY_URL=https://obs.example.com
export AGENT_OBSERVABILITY_AGENT_ID=9c2f7e3d-4b8a-4d2e-9f1b-…
pytest

# In a test file
import pytest
from livekit.agents import AgentSession, inference

# `Assistant` is your own Agent subclass; import it from your application code.

@pytest.mark.asyncio
async def test_greeting():
    async with inference.LLM(model="openai/gpt-4.1-mini") as llm, \
               AgentSession(llm=llm) as sess:
        await sess.start(Assistant())
        result = await sess.run(user_input="Hello")
        # Each next_event() call advances the cursor, so check the role and
        # judge the same message in one chain.
        await result.expect.next_event().is_message(role="assistant").judge(
            llm, intent="greets politely",
        )

Auto-capture is on by default. Every RunResult from AgentSession.run(...) is collected automatically and .judge(...) calls are intercepted as first-class Judgment events in the dashboard. The capture(result) helper is exported for RunResults produced outside the standard .run() path:

from agent_observability.livekit.pytest import capture
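To make the auto-capture idea concrete, here is a minimal sketch of how results from a `run` method can be collected by wrapping it. This is not the plugin's actual implementation: `FakeSession`, `install_auto_capture`, and the `captured` list are hypothetical names, and the sketch is synchronous, whereas the real `AgentSession.run(...)` is awaited.

```python
import functools


class FakeSession:
    """Hypothetical stand-in for AgentSession so the sketch runs on its own."""

    def run(self, user_input: str) -> dict:
        return {"user_input": user_input, "events": []}


# Hypothetical collection point; a real plugin would attach results to the
# currently running test instead of a module-level list.
captured: list[dict] = []


def install_auto_capture(cls) -> None:
    """Wrap cls.run so every result is recorded before being returned."""
    original = cls.run

    @functools.wraps(original)
    def wrapper(self, *args, **kwargs):
        result = original(self, *args, **kwargs)
        captured.append(result)
        return result

    cls.run = wrapper


install_auto_capture(FakeSession)
FakeSession().run(user_input="Hello")
```

Results that never pass through the wrapped method are not seen by this kind of hook, which is the situation the explicit `capture(result)` helper is described as covering.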

Configuration

Env var	CLI flag	Purpose
`LIVEKIT_OBSERVABILITY_URL`	—	Dashboard base URL (LiveKit-canonical name). Read by `init_observability`, which raises if neither this nor `AGENT_OBSERVABILITY_URL` is set.
`AGENT_OBSERVABILITY_URL`	`--agent-observability-url`	Same purpose; `init_observability` accepts this as a fallback and mirrors it into `LIVEKIT_OBSERVABILITY_URL` so LiveKit's upload code picks it up.
`AGENT_OBSERVABILITY_AGENT_ID`	`--agent-observability-agent-id`	Stable opaque agent identifier. Strongly recommended — without it the session lands unparented on the dashboard (the server accepts the upload but has nothing to backfill the FK with). UUIDs preferred over slugs.
`AGENT_OBSERVABILITY_ACCOUNT_ID`	`--agent-observability-account-id`	Multi-tenant account id. Optional.
`AGENT_OBSERVABILITY_USER` / `_PASS`	—	Basic-auth credentials when the server enables auth. Optional.
`AGENT_OBSERVABILITY_TIMEOUT`	`--agent-observability-timeout`	Upload request timeout in seconds (default `10`).
`AGENT_OBSERVABILITY_MAX_RETRIES`	`--agent-observability-max-retries`	Max upload attempts before falling back (default `3`).
`AGENT_OBSERVABILITY_FALLBACK_DIR`	`--agent-observability-fallback-dir`	Directory for failed-upload JSON (defaults to `.pytest_cache/agent-observability`).
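
The URL fallback described in the first two rows can be pictured with the sketch below. It is an illustration of the documented behaviour only: `resolve_observability_url` is a hypothetical name, the exception type is an assumption, and the SDK's own code may differ.

```python
import os
from typing import MutableMapping, Optional

LIVEKIT_VAR = "LIVEKIT_OBSERVABILITY_URL"
FALLBACK_VAR = "AGENT_OBSERVABILITY_URL"


def resolve_observability_url(
    env: Optional[MutableMapping[str, str]] = None,
) -> str:
    """Hypothetical helper: prefer the LiveKit name, fall back to the SDK name."""
    env = os.environ if env is None else env
    url = env.get(LIVEKIT_VAR) or env.get(FALLBACK_VAR)
    if not url:
        # Assumed exception type; the docs only say the call raises.
        raise RuntimeError(f"set {LIVEKIT_VAR} or {FALLBACK_VAR}")
    # Mirror the value so code that reads only the LiveKit name sees it too.
    env[LIVEKIT_VAR] = url
    return url
```

Passing a plain dict instead of `os.environ` keeps the helper easy to exercise in tests without touching the real environment.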

CI metadata (GitHub / GitLab / CircleCI / Buildkite) is auto-detected by the pytest plugin from standard env vars — no configuration needed.
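
As an illustration of what detection from standard env vars can look like, the sketch below checks the marker variable each of the four providers sets and reads the matching commit-SHA variable. The environment variable names are the providers' own; the function name `detect_ci` and the shape of the returned dict are assumptions, not the plugin's actual schema.

```python
import os
from typing import Mapping, Optional

# provider -> (marker variable set to "true" on that CI, commit SHA variable)
_CI_PROVIDERS = {
    "github": ("GITHUB_ACTIONS", "GITHUB_SHA"),
    "gitlab": ("GITLAB_CI", "CI_COMMIT_SHA"),
    "circleci": ("CIRCLECI", "CIRCLE_SHA1"),
    "buildkite": ("BUILDKITE", "BUILDKITE_COMMIT"),
}


def detect_ci(env: Optional[Mapping[str, str]] = None) -> dict:
    """Hypothetical helper: return provider and commit, or {} outside CI."""
    env = os.environ if env is None else env
    for provider, (marker, sha_var) in _CI_PROVIDERS.items():
        if env.get(marker, "").lower() == "true":
            return {"provider": provider, "commit": env.get(sha_var)}
    return {}
```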

Judge reference

LLM-based (7 factories)

Each returns a LiveKit _LLMJudge you pass straight to a JudgeGroup:

hallucination_judge(llm=...) — fabricated info?
rigid_response_accuracy_judge(*, expected_response, llm=...) — semantic match against an expected text.
freeflow_response_accuracy_judge(llm=...) — contextually appropriate in an open-ended conversation?
hold_requested_intent_accuracy_judge(llm=...) — was a "hold" / "wait" reply justified?
variable_extraction_judge(*, expected_variables, actual_variables, llm=...) — were the right values extracted, grounded in the transcript?
loop_detection_judge(llm=...) — agent repeating itself?
knowledge_base_correctness_judge(*, kb_context, llm=...) — KB lookup faithfully reflected?

Programmatic (2 classes, no LLM call)

IntentAccuracyJudge(*, expected_intent, actual_intent) — case-insensitive string match.
ToolCorrectnessJudge(*, expected_tools, threshold=1.0) — auto-extracts function-call events from chat_ctx; set-membership scoring.
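
The two programmatic checks can be sketched in plain Python as follows. This shows the general technique only: the function names are hypothetical, and the exact scoring formula (fraction of expected tools that were called) is an assumption about what "set-membership scoring" means here.

```python
from typing import Iterable


def intent_matches(expected_intent: str, actual_intent: str) -> bool:
    """Case-insensitive string comparison of two intent labels."""
    return expected_intent.casefold() == actual_intent.casefold()


def tool_score(expected_tools: Iterable[str], called_tools: Iterable[str]) -> float:
    """Assumed formula: share of expected tools that appear among the calls."""
    expected = set(expected_tools)
    if not expected:
        return 1.0  # nothing was expected, so nothing can be missing
    return len(expected & set(called_tools)) / len(expected)


def tools_pass(
    expected_tools: Iterable[str],
    called_tools: Iterable[str],
    threshold: float = 1.0,
) -> bool:
    """Pass when the score reaches the threshold (1.0 means every tool)."""
    return tool_score(expected_tools, called_tools) >= threshold
```

Under this reading, the default `threshold=1.0` requires every expected tool to have been called, and lowering it tolerates some misses.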

Composition helper

default_judges(llm=None) -> list[Judge] — the four ground-truth-free judges (Hallucination, Freeflow Response Accuracy, Hold-Requested Intent Accuracy, Loop Detection). Spread next to your own ground-truth-bound judges.

Which judges need what data?

Judge	Required at construction	Read from `chat_ctx`
`hallucination_judge`	—	full conversation
`rigid_response_accuracy_judge`	`expected_response`	latest assistant message
`freeflow_response_accuracy_judge`	—	full conversation
`hold_requested_intent_accuracy_judge`	—	latest assistant message + prior user turn
`variable_extraction_judge`	`expected_variables`, `actual_variables`	full conversation (for grounding)
`loop_detection_judge`	—	latest assistant message + prior 2-3
`knowledge_base_correctness_judge`	`kb_context`	full conversation
`IntentAccuracyJudge`	`expected_intent`, `actual_intent`	— (ignored)
`ToolCorrectnessJudge`	`expected_tools`	`function_call` items (auto-extracted)
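
The "Required at construction" column can also be kept as data, for example to fail fast when a test forgets a ground-truth argument. The mapping below copies the table; the helper name `missing_kwargs` is hypothetical and is not part of the SDK.

```python
# Required construction arguments per judge, copied from the table above.
REQUIRED_KWARGS = {
    "hallucination_judge": (),
    "rigid_response_accuracy_judge": ("expected_response",),
    "freeflow_response_accuracy_judge": (),
    "hold_requested_intent_accuracy_judge": (),
    "variable_extraction_judge": ("expected_variables", "actual_variables"),
    "loop_detection_judge": (),
    "knowledge_base_correctness_judge": ("kb_context",),
    "IntentAccuracyJudge": ("expected_intent", "actual_intent"),
    "ToolCorrectnessJudge": ("expected_tools",),
}


def missing_kwargs(judge_name: str, **kwargs) -> list[str]:
    """Hypothetical helper: list required arguments that were not supplied."""
    return [name for name in REQUIRED_KWARGS[judge_name] if name not in kwargs]
```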

Migrating from `pytest-agent-observability`

The standalone pytest-agent-observability package is discontinued. The last published release (0.2.1) still installs and runs, but predates this SDK's helpers and judges. Migrate by switching the dependency and the import:

-pytest-agent-observability
+agent-observability-sdk

-from pytest_agent_observability import capture
+from agent_observability.livekit.pytest import capture

The plugin is auto-discovered via the pytest11 entry-point, so no extra config is needed. Auto-capture, .judge() interception, retry / fallback behaviour, and CI metadata extraction behave exactly as they did in the standalone package.
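
For readers new to the retry / fallback behaviour, the sketch below shows the general pattern that the `AGENT_OBSERVABILITY_MAX_RETRIES` and `AGENT_OBSERVABILITY_FALLBACK_DIR` settings suggest: try the upload a fixed number of times, then write the payload to disk. It is an assumption-laden illustration, not the plugin's code; `upload_with_fallback`, the `send` callable, and the `failed-upload.json` file name are all hypothetical.

```python
import json
import time
from pathlib import Path
from typing import Callable


def upload_with_fallback(
    payload: dict,
    send: Callable[[dict], None],
    *,
    max_retries: int = 3,
    fallback_dir: str = ".pytest_cache/agent-observability",
    backoff_seconds: float = 0.0,
) -> bool:
    """Return True if an attempt succeeded, False if the payload was saved."""
    for attempt in range(1, max_retries + 1):
        try:
            send(payload)
            return True
        except Exception:
            # Wait between attempts, but not after the final one.
            if attempt < max_retries and backoff_seconds:
                time.sleep(backoff_seconds)
    # Every attempt failed: keep the payload on disk for later inspection.
    out_dir = Path(fallback_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    (out_dir / "failed-upload.json").write_text(json.dumps(payload))
    return False
```

Writing the payload to disk rather than raising keeps a dashboard outage from failing the test run itself.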

Not ported from cx-sqs-worker

Two judges are intentionally absent because their prompts are tightly coupled to the cx-sqs-worker flow-graph runtime (global vs. node instructions, closed available-intents list):

semi_rigid_response_accuracy
intent_detection

Use rigid_response_accuracy_judge or freeflow_response_accuracy_judge for response evaluation; use IntentAccuracyJudge for closed-set intent checks.

License

MIT

Project details


Release history

0.2.1 (this version): Jun 5, 2026
0.0.1: Jun 4, 2026

Download files

Source Distribution

agent_observability_sdk-0.2.1.tar.gz (54.8 kB), uploaded Jun 5, 2026, Source

Built Distribution

agent_observability_sdk-0.2.1-py3-none-any.whl (48.8 kB), uploaded Jun 5, 2026, Python 3

File details

Details for the file agent_observability_sdk-0.2.1.tar.gz.

File metadata

Download URL: agent_observability_sdk-0.2.1.tar.gz
Upload date: Jun 5, 2026
Size: 54.8 kB
Tags: Source
Uploaded using Trusted Publishing? Yes
Uploaded via: twine/6.1.0 CPython/3.13.12

File hashes

Hashes for agent_observability_sdk-0.2.1.tar.gz
Algorithm	Hash digest
SHA256	`e62ad7866d863d90599443f14be5ebacfd507d8e509f479f46b53567f08f9e91`
MD5	`ef8f4b922d67bd3cb8c1ac5aa3ea6079`
BLAKE2b-256	`7b892339af34d3804559213c96738c1ad774b6b4093a341c711bd68b45b863ff`
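
A downloaded archive can be checked against the SHA256 digest listed above with the standard library alone. This is a minimal sketch; the function names `sha256_of` and `matches_published` are illustrative, and the file path is whatever location you saved the download to.

```python
import hashlib


def sha256_of(path: str, chunk_size: int = 1 << 20) -> str:
    """Hash the file in chunks so large archives need not fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_published(path: str, expected_hex: str) -> bool:
    """Compare the local digest with the published hex string."""
    return sha256_of(path) == expected_hex.strip().lower()
```

Call `matches_published` with the path to the downloaded file and the SHA256 value from the table; a `False` result means the file differs from what was published.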


Provenance

The following attestation bundles were made for agent_observability_sdk-0.2.1.tar.gz:

Publisher: publish-observability-sdk.yml on plivo-labs/agent-observability

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

Statement:
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: agent_observability_sdk-0.2.1.tar.gz
- Subject digest: e62ad7866d863d90599443f14be5ebacfd507d8e509f479f46b53567f08f9e91
- Sigstore transparency entry: 1732268667
- Sigstore integration time: Jun 5, 2026
Source repository:
- Permalink: plivo-labs/agent-observability@030901245907e8c88a22fded45beb82005fb097f
- Branch / Tag: refs/heads/main
- Owner: https://github.com/plivo-labs
- Access: public
Publication detail:
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: publish-observability-sdk.yml@030901245907e8c88a22fded45beb82005fb097f
- Trigger Event: workflow_run

File details

Details for the file agent_observability_sdk-0.2.1-py3-none-any.whl.

File metadata

Download URL: agent_observability_sdk-0.2.1-py3-none-any.whl
Upload date: Jun 5, 2026
Size: 48.8 kB
Tags: Python 3
Uploaded using Trusted Publishing? Yes
Uploaded via: twine/6.1.0 CPython/3.13.12

File hashes

Hashes for agent_observability_sdk-0.2.1-py3-none-any.whl
Algorithm	Hash digest
SHA256	`2d85f2060f97fb29707cdca800310b1227c4ecefb6cc95bcbc6d5c4d47b94d9e`
MD5	`e1c1bb47390388be523bc0dcbaf16905`
BLAKE2b-256	`037da7498b98dfbbcac90560a6647e8d164deba71f4e60e82a7ec03e79472313`


Provenance

The following attestation bundles were made for agent_observability_sdk-0.2.1-py3-none-any.whl:

Publisher: publish-observability-sdk.yml on plivo-labs/agent-observability

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

Statement:
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: agent_observability_sdk-0.2.1-py3-none-any.whl
- Subject digest: 2d85f2060f97fb29707cdca800310b1227c4ecefb6cc95bcbc6d5c4d47b94d9e
- Sigstore transparency entry: 1732268743
- Sigstore integration time: Jun 5, 2026
Source repository:
- Permalink: plivo-labs/agent-observability@030901245907e8c88a22fded45beb82005fb097f
- Branch / Tag: refs/heads/main
- Owner: https://github.com/plivo-labs
- Access: public
Publication detail:
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: publish-observability-sdk.yml@030901245907e8c88a22fded45beb82005fb097f
- Trigger Event: workflow_run
