Lightweight evaluation toolkit for AI agents — test tool use, grounding, safety, and efficiency before production

agentdog

A lightweight evaluation toolkit for AI agents. Think of it as pytest for agent behavior: test tool use, grounding, safety, and efficiency before production.

pip install agentdog
pip install "agentdog[llm-judge]"  # for LLMJudge scorer

Quickstart

from agentdog import AgentTrace, ToolCall, TestCase, EvalRun, run
from agentdog import ContainsAnswer, UsedTools, AvoidedTools, UnderTokenLimit

trace = AgentTrace(
    input="Summarize the Q3 report.",
    output="Q3 revenue was $4.2M, up 12% YoY.",
    tool_calls=[ToolCall(name="file_search", arguments={"query": "Q3 report"})],
    retrieved_context=["Q3 revenue was $4.2M, growth 12% year over year."],
    total_tokens=620,
)

case = TestCase(
    name="q3-summary",
    tags=["rag"],
    scorers=[
        ContainsAnswer(["4.2M", "12%"]),
        UsedTools(["file_search"]),
        AvoidedTools(["send_email"]),
        UnderTokenLimit(max_tokens=1000),
    ],
)

report = run([EvalRun(case=case, trace=trace)])
report.print(verbose=True)
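
For intuition, the four checks used in the quickstart can be approximated in plain Python. This is a stand-alone sketch, not agentdog's implementation: the `check_*` helpers and the dict-based trace are hypothetical, and it assumes plain substring matching and set membership, which the real scorers may refine.

```python
# Stand-alone approximation of the quickstart checks; does not import agentdog.
# All helper names below are hypothetical.
from typing import Optional

trace = {
    "output": "Q3 revenue was $4.2M, up 12% YoY.",
    "tool_names": ["file_search"],
    "total_tokens": 620,
}


def check_contains(output: str, expected: list[str]) -> bool:
    """Every expected fragment appears somewhere in the output."""
    return all(fragment in output for fragment in expected)


def check_used(tool_names: list[str], required: list[str]) -> bool:
    """Every required tool was called at least once."""
    return set(required) <= set(tool_names)


def check_avoided(tool_names: list[str], forbidden: list[str]) -> bool:
    """None of the forbidden tools was called."""
    return set(forbidden).isdisjoint(tool_names)


def check_token_limit(total_tokens: Optional[int], max_tokens: int) -> bool:
    """Token count is known and within budget (assumption: unknown counts fail)."""
    return total_tokens is not None and total_tokens <= max_tokens


results = {
    "contains_answer": check_contains(trace["output"], ["4.2M", "12%"]),
    "used_tools": check_used(trace["tool_names"], ["file_search"]),
    "avoided_tools": check_avoided(trace["tool_names"], ["send_email"]),
    "under_token_limit": check_token_limit(trace["total_tokens"], 1000),
}
all_passed = all(results.values())
print(results, all_passed)
```

Treating the case as passed only when every check passes is also an assumption made for this sketch.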

CLI

Define an `evals()` function that returns `list[EvalRun]` in any Python file, then run:

agentdog run my_evals.py             # run all cases
agentdog run my_evals.py -v          # verbose: show scorer details for passing cases
agentdog run my_evals.py --tag rag   # filter by tag
agentdog run my_evals.py --json-out report.json  # machine-readable output
agentdog inspect trace.json          # pretty-print a trace file

The exit code is 0 when every case passes and 1 if any case fails, so the command can gate a CI job without extra configuration.
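
Because the result is carried by the exit code, a wrapper script only needs to inspect the return code. The sketch below uses only the standard library; `gate` is a hypothetical helper, and the two stand-in commands replace `agentdog run my_evals.py` so the snippet runs without agentdog installed.

```python
import subprocess
import sys


def gate(command: list[str]) -> bool:
    """Run a command and report whether it exited with code 0."""
    completed = subprocess.run(command, check=False)
    return completed.returncode == 0


# Stand-in commands that simply exit with a chosen code.
# In CI the command would be ["agentdog", "run", "my_evals.py"].
passing = [sys.executable, "-c", "raise SystemExit(0)"]
failing = [sys.executable, "-c", "raise SystemExit(1)"]

print(gate(passing))
print(gate(failing))
```

Most CI systems already fail a step on a non-zero exit code, so a wrapper like this is only useful when extra logic has to run around the evaluation.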

Scorers

| Category | Scorers |
|---|---|
| Answer | `ContainsAnswer`, `ExactAnswer`, `RegexAnswer`, `ForbiddenContent`, `AnswerNotEmpty` |
| Tools | `UsedTools`, `AvoidedTools`, `ToolCallOrder`, `MaxToolCalls`, `ToolArgContains`, `ToolArgEquals` |
| Grounding | `GroundedInContext`, `CitedSource`, `NoContextHallucination` |
| Safety | `NoSensitiveDataLeaked`, `NoRiskyActionTaken`, `PromptInjectionResisted` |
| Efficiency | `UnderTokenLimit`, `UnderCostLimit`, `UnderLatencyLimit`, `MaxRetries` |
| LLM Judge | `LLMJudge` (use only when deterministic checks aren't enough) |
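
As an illustration of what a deterministic tool check can look like, here is a stand-alone sketch of an ordering check. It assumes "order" means the expected tools appear as an in-order subsequence of the calls made, with other calls allowed in between. Whether `ToolCallOrder` uses this rule or a stricter one is not stated here, and `calls_in_order` is a hypothetical helper.

```python
def calls_in_order(called: list[str], expected: list[str]) -> bool:
    """Return True if `expected` is an in-order subsequence of `called`."""
    remaining = iter(called)
    # `name in remaining` consumes the iterator up to the first match,
    # so each expected tool must appear after the previous one.
    return all(name in remaining for name in expected)


calls = ["file_search", "read_file", "summarize"]
print(calls_in_order(calls, ["file_search", "summarize"]))
print(calls_in_order(calls, ["summarize", "file_search"]))
```

A check like this is cheap and repeatable, which is the usual argument for preferring deterministic scorers over an LLM judge where they suffice.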

Trace schema

AgentTrace(
    input: str,
    output: str,
    tool_calls: list[ToolCall],        # name, arguments, output, error, latency_ms
    retrieved_context: list[str],
    total_tokens: int | None,
    total_cost_usd: float | None,
    total_latency_ms: float | None,
    num_retries: int,
    metadata: dict,
)

Load/save:

trace = AgentTrace.from_json("trace.json")
trace.to_json("trace.json")
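
To show what a JSON round trip over these fields can look like, the sketch below mirrors the schema with standard-library dataclasses. `TraceRecord` and `ToolCallRecord` are hypothetical stand-ins, not agentdog classes. The default values and the on-disk layout are assumptions, so a file written this way is not guaranteed to load with `AgentTrace.from_json`.

```python
import json
import tempfile
from dataclasses import asdict, dataclass, field
from pathlib import Path
from typing import Any, Optional


@dataclass
class ToolCallRecord:  # hypothetical stand-in for ToolCall
    name: str
    arguments: dict[str, Any] = field(default_factory=dict)
    output: Optional[str] = None
    error: Optional[str] = None
    latency_ms: Optional[float] = None


@dataclass
class TraceRecord:  # hypothetical stand-in for AgentTrace
    input: str
    output: str
    tool_calls: list[ToolCallRecord] = field(default_factory=list)
    retrieved_context: list[str] = field(default_factory=list)
    total_tokens: Optional[int] = None
    total_cost_usd: Optional[float] = None
    total_latency_ms: Optional[float] = None
    num_retries: int = 0
    metadata: dict[str, Any] = field(default_factory=dict)

    def to_json(self, path: str) -> None:
        # asdict() recurses into nested dataclasses, so tool calls become dicts.
        Path(path).write_text(json.dumps(asdict(self), indent=2), encoding="utf-8")

    @classmethod
    def from_json(cls, path: str) -> "TraceRecord":
        data = json.loads(Path(path).read_text(encoding="utf-8"))
        # Rebuild the nested tool-call objects from their dict form.
        data["tool_calls"] = [ToolCallRecord(**call) for call in data["tool_calls"]]
        return cls(**data)


original = TraceRecord(
    input="Summarize the Q3 report.",
    output="Q3 revenue was $4.2M, up 12% YoY.",
    tool_calls=[ToolCallRecord(name="file_search", arguments={"query": "Q3 report"})],
    total_tokens=620,
)

with tempfile.TemporaryDirectory() as tmp:
    path = str(Path(tmp) / "trace.json")
    original.to_json(path)
    restored = TraceRecord.from_json(path)

print(restored == original)
```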

Custom scorer

from agentdog.scorers.base import Scorer, ScoreResult

class AnswerStartsWith(Scorer):
    def __init__(self, prefix: str):
        self.prefix = prefix

    def score(self, trace) -> ScoreResult:
        passed = trace.output.startswith(self.prefix)
        return ScoreResult(
            passed=passed,
            score=1.0 if passed else 0.0,
            reason=f"Expected output to start with {self.prefix!r}",
        )
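
The scorer above only reads `trace.output`, so its logic can be exercised without running an agent. The sketch below repeats that logic against local stand-ins so it runs without agentdog installed. The `ScoreResult` dataclass here is a hypothetical substitute carrying the three fields used above, and `SimpleNamespace` stands in for a trace.

```python
from dataclasses import dataclass
from types import SimpleNamespace


@dataclass
class ScoreResult:  # local stand-in with the fields used in the example above
    passed: bool
    score: float
    reason: str


class AnswerStartsWith:  # same logic as above, minus the agentdog base class
    def __init__(self, prefix: str):
        self.prefix = prefix

    def score(self, trace) -> ScoreResult:
        passed = trace.output.startswith(self.prefix)
        return ScoreResult(
            passed=passed,
            score=1.0 if passed else 0.0,
            reason=f"Expected output to start with {self.prefix!r}",
        )


scorer = AnswerStartsWith("Q3")
good = scorer.score(SimpleNamespace(output="Q3 revenue was $4.2M."))
bad = scorer.score(SimpleNamespace(output="Revenue was $4.2M in Q3."))

print(good.passed, good.score)
print(bad.passed, bad.score)
```

With agentdog installed, the same assertions could presumably be written against the real `Scorer` subclass and an `AgentTrace`.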

Example

See examples/sample_evals.py for a complete working example covering RAG, safety, and prompt injection.

Author

Sai Teja Erukude
GitHub · agentdog

Licensed under the MIT License.

Version 0.1.0, released May 23, 2026.

