Lightweight evaluation toolkit for AI agents — test tool use, grounding, safety, and efficiency before production

agentdog

Lightweight evaluation toolkit for AI agents: pytest for agent behavior. Test tool use, grounding, safety, and efficiency before production.

pip install agentdog
pip install "agentdog[llm-judge]"  # for LLMJudge scorer

Quickstart

from agentdog import AgentTrace, ToolCall, TestCase, EvalRun, run
from agentdog import ContainsAnswer, UsedTools, AvoidedTools, UnderTokenLimit

trace = AgentTrace(
    input="Summarize the Q3 report.",
    output="Q3 revenue was $4.2M, up 12% YoY.",
    tool_calls=[ToolCall(name="file_search", arguments={"query": "Q3 report"})],
    retrieved_context=["Q3 revenue was $4.2M, growth 12% year over year."],
    total_tokens=620,
)

case = TestCase(
    name="q3-summary",
    tags=["rag"],
    scorers=[
        ContainsAnswer(["4.2M", "12%"]),
        UsedTools(["file_search"]),
        AvoidedTools(["send_email"]),
        UnderTokenLimit(max_tokens=1000),
    ],
)

report = run([EvalRun(case=case, trace=trace)])
report.print(verbose=True)

CLI

In any Python file, define an evals() function that returns list[EvalRun], then run:

agentdog run my_evals.py             # run all cases
agentdog run my_evals.py -v          # verbose: show scorer details for passing cases
agentdog run my_evals.py --tag rag   # filter by tag
agentdog run my_evals.py --json-out report.json  # machine-readable output
agentdog inspect trace.json          # pretty-print a trace file

The exit code is 0 when every case passes and 1 on any failure, so the command can gate a CI job without extra configuration.
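Because the result is reported through the exit code, a calling script only needs to check for 0. The helper below is a minimal sketch of that check using the standard library; `evals_passed` is a hypothetical name, not part of agentdog, and it works for any command that follows the same exit-code convention.

```python
import subprocess


def evals_passed(command: list[str]) -> bool:
    """Run a command and report whether it exited with code 0.

    Hypothetical helper, not part of agentdog. It relies only on the
    documented convention: 0 on full pass, 1 on any failure.
    """
    completed = subprocess.run(command, check=False)
    return completed.returncode == 0


# Example call (requires agentdog to be installed and my_evals.py to exist):
# ok = evals_passed(["agentdog", "run", "my_evals.py", "--tag", "rag"])
```

In most CI systems this wrapper is unnecessary, since a non-zero exit code from the command already fails the step; the helper is only useful when the eval run is one stage of a larger Python script.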


Scorers

Category     Scorers
Answer       ContainsAnswer, ExactAnswer, RegexAnswer, ForbiddenContent, AnswerNotEmpty
Tools        UsedTools, AvoidedTools, ToolCallOrder, MaxToolCalls, ToolArgContains, ToolArgEquals
Grounding    GroundedInContext, CitedSource, NoContextHallucination
Safety       NoSensitiveDataLeaked, NoRiskyActionTaken, PromptInjectionResisted
Efficiency   UnderTokenLimit, UnderCostLimit, UnderLatencyLimit, MaxRetries
LLM Judge    LLMJudge (use only when deterministic checks aren't enough)

Trace schema

AgentTrace(
    input: str,
    output: str,
    tool_calls: list[ToolCall],        # name, arguments, output, error, latency_ms
    retrieved_context: list[str],
    total_tokens: int | None,
    total_cost_usd: float | None,
    total_latency_ms: float | None,
    num_retries: int,
    metadata: dict,
)

Load/save:

trace = AgentTrace.from_json("trace.json")
trace.to_json("trace.json")
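The on-disk format of trace.json is not documented here. The sketch below assumes the JSON keys simply mirror the AgentTrace fields listed above, and that each tool call carries the five fields named in the schema comment; treat that layout as a guess and compare it with a file written by trace.to_json before relying on it. It uses only the standard library, so it runs without agentdog installed.

```python
import json
import tempfile
from pathlib import Path

# Assumed layout: keys mirror the AgentTrace fields from the schema above.
# The real file produced by AgentTrace.to_json may differ.
trace_data = {
    "input": "Summarize the Q3 report.",
    "output": "Q3 revenue was $4.2M, up 12% YoY.",
    "tool_calls": [
        {
            "name": "file_search",
            "arguments": {"query": "Q3 report"},
            "output": None,
            "error": None,
            "latency_ms": None,
        }
    ],
    "retrieved_context": ["Q3 revenue was $4.2M, growth 12% year over year."],
    "total_tokens": 620,
    "total_cost_usd": None,
    "total_latency_ms": None,
    "num_retries": 0,
    "metadata": {},
}


def write_trace(path: Path, data: dict) -> None:
    """Write a trace dictionary as indented JSON."""
    path.write_text(json.dumps(data, indent=2), encoding="utf-8")


def read_trace(path: Path) -> dict:
    """Read a trace dictionary back from a JSON file."""
    return json.loads(path.read_text(encoding="utf-8"))


# Round-trip through a temporary file to show nothing is lost.
with tempfile.TemporaryDirectory() as tmp:
    trace_path = Path(tmp) / "trace.json"
    write_trace(trace_path, trace_data)
    loaded = read_trace(trace_path)
```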

Custom scorer

from agentdog.scorers.base import Scorer, ScoreResult

class AnswerStartsWith(Scorer):
    def __init__(self, prefix: str):
        self.prefix = prefix

    def score(self, trace) -> ScoreResult:
        passed = trace.output.startswith(self.prefix)
        return ScoreResult(
            passed=passed,
            score=1.0 if passed else 0.0,
            reason=f"Expected output to start with {self.prefix!r}",
        )
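The same pattern extends to checks on tool calls. The sketch below defines a hypothetical NoToolErrors scorer (not one of the built-in scorers) that fails when any tool call recorded an error, and gives a different reason for each outcome. So that it runs without agentdog installed, it uses stand-in Scorer and ScoreResult classes and SimpleNamespace objects in place of real traces; in a real project, replace the stand-ins with the import from agentdog.scorers.base. It assumes each tool call exposes name and error as attributes, as the trace schema comment suggests.

```python
from dataclasses import dataclass
from types import SimpleNamespace


@dataclass
class ScoreResult:
    """Stand-in with the three fields used in the example above."""
    passed: bool
    score: float
    reason: str


class Scorer:
    """Stand-in base class; the real one lives in agentdog.scorers.base."""

    def score(self, trace) -> ScoreResult:
        raise NotImplementedError


class NoToolErrors(Scorer):
    """Hypothetical scorer: passes when no tool call recorded an error."""

    def score(self, trace) -> ScoreResult:
        # Collect the names of tool calls whose error field is set.
        failed = [call.name for call in trace.tool_calls if call.error]
        passed = not failed
        if passed:
            reason = "No tool call reported an error"
        else:
            reason = f"Tool calls with errors: {failed}"
        return ScoreResult(passed=passed, score=1.0 if passed else 0.0, reason=reason)


# Minimal stand-in traces: only the attributes the scorer reads.
clean = SimpleNamespace(tool_calls=[SimpleNamespace(name="file_search", error=None)])
broken = SimpleNamespace(tool_calls=[SimpleNamespace(name="file_search", error="timeout")])

clean_result = NoToolErrors().score(clean)
broken_result = NoToolErrors().score(broken)
```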

Example

See examples/sample_evals.py for a complete working example covering RAG, safety, and prompt injection.


Author

Sai Teja Erukude
GitHub · agentdog

Licensed under the MIT License.
