Evidence-grounded evaluator for AI agent trajectories — judge by verifying claims against real tool outputs, not LLM-judge vibes.
Project description
attest
Evidence-grounded evaluation for AI agent trajectories. Judge an agent by checking its claims against the actual tool outputs — not by asking another LLM "did this look good?"
uv tool install agent-attest # distribution name; the CLI + import are `attest`
attest run your-trajectory.json
Why
Evaluating AI agents usually means LLM-as-judge — one model grading another. Two problems attest tackles directly:
- It grades the story, not the work. A holistic "is this good?" judge reads the agent's confident narrative and can wave through specific ungrounded claims buried in an otherwise-solid answer. (See Gaming the Judge, arXiv:2601.14691.)
- The scores have no error bars. Most tools report a bare pass rate, so teams chase differences that are pure noise.
attest's approach: never trust what the model says it did. Extract the answer's claims and verify each one against the recorded tool outputs, report with confidence intervals, and back every verdict with the exact evidence span. The same "verify against real state, not narrative" primitive underpins the strongest prompt-injection defenses (AgentDojo, CaMeL) — so it's also the foundation for security checks later.
What it does
attest evaluates a trajectory (an agent run: tool calls, their real outputs, the final answer) across dimensions and returns one combined report:
- Faithfulness — extracts atomic claims from the answer and verifies each against the
tool outputs (
supported/unsupported/unverifiable), with a quoted evidence span. The verifier never sees the agent's reasoning, so a reworded narrative can't move the verdict. - Tool-use correctness — were the right tools called, with no unhandled errors? Deterministic by default (no API key); an optional LLM check judges tool choice.
- Prompt-injection flag — scans untrusted tool outputs for injection payloads
(deterministic) and, with
--deep, an effect-based check for whether the agent took an action the principal never authorized — catching novel injections, not just known phrasings like "ignore previous instructions". - One report — an
overall_score, per-dimension scores, and Wilson 95% confidence intervals, all serializable to JSON. - Framework-agnostic — a LangChain/LangGraph adapter turns any agent run into a trajectory; bring your own.
- Read-only & safe — attest only reads a recorded trajectory. It never executes tools, calls the agent, or needs your tools' credentials.
How it works
final_answer ──extract claims──▶ [atomic claims]
each claim ──verify against──▶ supported · unsupported · unverifiable (evidence = tool outputs only)
evidence
tool calls ──allowed? error-handled? appropriate?──▶ tool-use score
tool outputs ──payload scan + authorization check────▶ injection findings (suspicious / compromised)
│
▼
one TrajectoryReport (overall + per-dimension + 95% CIs)
The key design choice: the verifier sees only the claim and the evidence — never the agent's reasoning. That's what keeps it grounded.
Usage
CLI
attest stats 41 50 # a pass rate with its Wilson 95% CI (no API key)
attest tools trajectory.json # tool-use correctness — deterministic, no API key
attest injection trajectory.json # prompt-injection scan — deterministic, no API key
attest run trajectory.json # full report: faithfulness + tool-use + overall
attest demo trajectory.json # naive LLM-judge vs attest, side by side
attest models openai # list a provider's models (live if its key is set)
attest run trajectory.json --provider openai --model gpt-4o-mini # any provider
Library
from attest import Attest
judge = Attest(key="sk-ant-...") # or Attest() to read ANTHROPIC_API_KEY from the env
report = judge.evaluate(traj) # traj: a Trajectory (e.g. from the LangGraph adapter)
print(report.overall_score)
print(report.model_dump_json(indent=2))
judge.tool_use(traj) # tool-use correctness
judge.injection(traj, deep=True) # prompt-injection scan
judge.stats(41, 50) # pass rate + Wilson 95% CI (no API call)
Configure the provider, key, and model once, then evaluate many trajectories. Prefer
dependency injection? The functional API is still there — from attest import evaluate, check_tool_use.
Providers
attest runs on Anthropic, OpenAI, or Gemini behind one interface (via instructor for reliable structured output):
Attest(provider="openai", model="gpt-4o-mini") # key from OPENAI_API_KEY
Attest(provider="gemini") # key from GEMINI_API_KEY / GOOGLE_API_KEY
Attest.providers() # ['anthropic', 'openai', 'gemini']
Attest.models("openai") # live list if OPENAI_API_KEY is set, else curated
The base install ships Anthropic. OpenAI and Gemini are optional extras:
pip install agent-attest # base (Anthropic), exposes `import attest`
pip install "agent-attest[openai]" # adds the OpenAI SDK
pip install "agent-attest[gemini]" # adds the Google GenAI SDK
pip install "agent-attest[all]" # both
Each provider reads its own key (ANTHROPIC_API_KEY, OPENAI_API_KEY, or
GEMINI_API_KEY) — a local .env is picked up automatically. Verification defaults to a
small/fast model per provider: cents, not dollars.
Develop
uv run pytest # 58 tests, no API key needed (the LLM is mocked/injected)
Running the CLI from source before install: prefix with uv run (e.g. uv run attest stats 41 50).
Layout
src/attest/
├── trajectory.py # core data model — the thought-vs-tool-output distinction
├── _llm.py # Anthropic wrapper: call(output=PydanticModel) -> validated
├── cli.py # attest stats / tools / run / demo
├── checks/ # the evaluation dimensions
│ ├── verify.py # faithfulness: extract_claims + grounded_verifier
│ ├── tool_use.py # tool-use correctness (deterministic + optional LLM)
│ ├── injection.py # prompt-injection: payload scan + authorization check
│ └── judge_baseline.py # the naive LLM-as-judge attest is built to beat
├── scoring/
│ ├── report.py # evaluate() -> combined TrajectoryReport + overall_score
│ └── stats.py # Wilson CI + two-proportion significance
└── adapters/
└── langgraph.py # LangChain/LangGraph run -> Trajectory
tests/ # all offline (the LLM is mocked/injected)
examples/ # sample trajectories (clean, gamed, injection)
Status
Early but working. Faithfulness, tool-use correctness, and a prompt-injection flag (deterministic scan + effect-based authorization check) are built, tested, and validated live against a real LangGraph agent. Next up: an answer-type-aware verifier and self-contradiction. Not yet on PyPI.
Project details
Download files
Download the file for your platform. If you're not sure which to choose, learn more about installing packages.
Source Distribution
Built Distribution
Filter files by name, interpreter, ABI, and platform.
If you're not sure about the file name format, learn more about wheel file names.
Copy a direct link to the current filters
File details
Details for the file agent_attest-0.2.1.tar.gz.
File metadata
- Download URL: agent_attest-0.2.1.tar.gz
- Upload date:
- Size: 127.3 kB
- Tags: Source
- Uploaded using Trusted Publishing? Yes
- Uploaded via: uv/0.11.22 {"installer":{"name":"uv","version":"0.11.22","subcommand":["publish"]},"python":null,"implementation":{"name":null,"version":null},"distro":{"name":"Ubuntu","version":"24.04","id":"noble","libc":null},"system":{"name":null,"release":null},"cpu":null,"openssl_version":null,"setuptools_version":null,"rustc_version":null,"ci":true}
File hashes
| Algorithm | Hash digest | |
|---|---|---|
| SHA256 |
216b4fd91a89e731141398c4837095273c0eb9d178a972435c5c2aced720130f
|
|
| MD5 |
090ecdcf3e8bc278231181665ef4f8a3
|
|
| BLAKE2b-256 |
d15aebfb70fd206728d1c1bb7ddd8a02512b22e39144433a5e6e8492236e1dd7
|
File details
Details for the file agent_attest-0.2.1-py3-none-any.whl.
File metadata
- Download URL: agent_attest-0.2.1-py3-none-any.whl
- Upload date:
- Size: 22.8 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? Yes
- Uploaded via: uv/0.11.22 {"installer":{"name":"uv","version":"0.11.22","subcommand":["publish"]},"python":null,"implementation":{"name":null,"version":null},"distro":{"name":"Ubuntu","version":"24.04","id":"noble","libc":null},"system":{"name":null,"release":null},"cpu":null,"openssl_version":null,"setuptools_version":null,"rustc_version":null,"ci":true}
File hashes
| Algorithm | Hash digest | |
|---|---|---|
| SHA256 |
6e3e4d20bc75567d1fa8e2372c335fc09731372d97f3cb9f11e6094bf97fadcd
|
|
| MD5 |
6705a759976276fb8718af64b8286388
|
|
| BLAKE2b-256 |
a91d737a48f21081153a11dfa4535e89deea5b7658ab3669d6768f91c2d65813
|