Production-grade agentic trajectory evaluation — score multi-step AI agent runs on goal completion, tool accuracy, step efficiency, reasoning coherence, loop detection, and faithfulness

These details have not been verified by PyPI

Project links

Homepage

Project description

agenteval

Production-grade agentic trajectory evaluation for multi-step AI agents.

Score any AI agent run on 6 built-in metrics, detect regressions, stream results, and integrate into CI/CD â€” with zero vendor lock-in.

pip install agenteval

Why agenteval?

In 2026, every team building agentic AI faces the same problem: you can't improve what you can't measure. Agents fail in subtle ways â€” they loop, misuse tools, hallucinate answers unsupported by observations, or take twice as many steps as needed. No single library evaluated full multi-step trajectories with structured, auditable metrics.

agenteval fixes this.

Quickstart

from agenteval import (
    Trajectory, TrajectoryStep, StepType,
    TrajectoryEvaluator,
)

trajectory = Trajectory(
    trajectory_id="run-001",
    task="What is the capital of France?",
    steps=[
        TrajectoryStep(step_index=0, step_type=StepType.THOUGHT,
                       content="I should look this up."),
        TrajectoryStep(step_index=1, step_type=StepType.TOOL_CALL,
                       content="search", tool_name="search",
                       tool_args={"query": "capital of France"}),
        TrajectoryStep(step_index=2, step_type=StepType.OBSERVATION,
                       content="Paris is the capital of France."),
        TrajectoryStep(step_index=3, step_type=StepType.FINAL_ANSWER,
                       content="The capital of France is Paris."),
    ],
    final_answer="The capital of France is Paris.",
    expected_tools=["search"],
)

evaluator = TrajectoryEvaluator()
score = evaluator.evaluate(trajectory)

print(f"Overall: {score.overall_score:.3f}  Passed: {score.passed}")
print(score.metric_scores)

Built-in Metrics

Metric	Description
`goal_completion`	Did the agent produce a relevant final answer?
`tool_accuracy`	Did it use the right tools? (F1 vs expected_tools)
`step_efficiency`	Did it reach the goal without unnecessary steps?
`reasoning_coherence`	Do thoughts lead logically to actions?
`loop_detection`	Did the agent repeat actions or thoughts?
`answer_faithfulness`	Is the final answer grounded in observations?

Batch & Async Evaluation

from agenteval import TrajectoryEvaluator

evaluator = TrajectoryEvaluator()

# Synchronous batch
result = evaluator.evaluate_batch(trajectories, max_workers=8)

# Async batch
import asyncio
result = asyncio.run(evaluator.aevaluate_batch(trajectories))

print(f"Pass rate: {result.pass_rate:.1%}")
print(f"Mean score: {result.mean_overall:.3f}")

Advanced Features

Caching (LRU + TTL + SHA-256)

from agenteval.advanced import TrajectoryCache

cache = TrajectoryCache(max_size=512, ttl=600)
memoized_eval = cache.memoize(evaluator.evaluate)
score = memoized_eval(trajectory)    # cached on second call
print(cache.stats())

Evaluation Pipeline

from agenteval.advanced import EvalPipeline

pipeline = (
    EvalPipeline()
    .filter("non_empty", lambda t: len(t.steps) > 0)
    .map("tag_metadata", lambda t: t)
    .with_retry("tag_metadata", retries=2)
)
cleaned = pipeline.run(trajectories)
print(pipeline.audit_log)

# Async
import asyncio
cleaned = asyncio.run(pipeline.arun(trajectories))

Declarative Validation

from agenteval.advanced import TrajectoryValidator, TrajectoryRule

validator = (
    TrajectoryValidator()
    .add_rule(TrajectoryRule("has_steps", lambda t: len(t.steps) > 0, "Need steps"))
    .add_rule(TrajectoryRule("has_task", lambda t: bool(t.task), "Need task"))
)
violations = validator.validate(trajectory)

Rate Limiter (sync + async)

from agenteval.advanced import RateLimiter

limiter = RateLimiter(rate=10, capacity=10)  # 10 evals/s
if limiter.acquire():
    score = evaluator.evaluate(trajectory)

Budget-Controlled Evaluation

from agenteval.advanced import evaluate_with_budget
scores = evaluate_with_budget(trajectories, evaluator.evaluate, budget_seconds=5.0)

Streaming Results

from agenteval.advanced import stream_scores, scores_to_ndjson

for score in stream_scores(trajectories, evaluator.evaluate):
    print(score.trajectory_id, score.overall_score)

# NDJSON stream
for line in scores_to_ndjson(trajectories, evaluator.evaluate):
    print(line)

Diff & Regression Tracking

from agenteval.advanced import diff_results, RegressionTracker

tracker = RegressionTracker(window=10)
tracker.record(result_v1)
tracker.record(result_v2)
print(tracker.trend())          # "improving" / "declining" / "stable"
diff = tracker.latest_regression()
print(diff.summary())
print(diff.to_json())

Observability

from agenteval.advanced import EvaluationProfiler, DriftDetector, EvaluationReport

profiler = EvaluationProfiler()
scored = profiler.profile(evaluator.evaluate)(trajectory)
print(profiler.report())

detector = DriftDetector(threshold=0.05)
detector.set_baseline(result_v1)
print(detector.detect(result_v2))

report = EvaluationReport(result)
print(report.to_json())
print(report.to_csv())
print(report.to_markdown())

Audit Log & Cost Ledger

from agenteval.advanced import AuditLog, CostLedger

log = AuditLog()
log.log("eval_start", {"run_id": "ci-42"})

ledger = CostLedger()
ledger.record("t1", tokens=1200, cost_usd=0.024)
print(ledger.summary())

Live Trajectory Watcher

from agenteval import TrajectoryWatcher, TrajectoryStep, StepType

watcher = TrajectoryWatcher(
    trajectory_id="live-001",
    task="Summarize the paper",
    on_step=lambda step, idx: print(f"Step {idx}: {step.step_type}"),
)

watcher.add_step(TrajectoryStep(step_index=0, step_type=StepType.THOUGHT, content="Reading..."))
trajectory = watcher.finish("Summary complete.")
score = evaluator.evaluate(trajectory)

Installation

pip install agenteval

Python 3.8+ Â· No external dependencies (stdlib + pydantic)

License

MIT

Project details

These details have not been verified by PyPI

Project links

Homepage

Release history Release notifications | RSS feed

1.1.6

May 17, 2026

1.1.5

May 16, 2026

1.1.3

Apr 10, 2026

1.1.2

Apr 10, 2026

1.1.1

Apr 10, 2026

1.1.0

Apr 10, 2026

This version

1.0.0

Apr 10, 2026

Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

trajscore-1.0.0.tar.gz (20.4 kB view details)

Uploaded Apr 10, 2026 Source

Built Distribution

If you're not sure about the file name format, learn more about wheel file names.

The dropdown lists show the available interpreters, ABIs, and platforms. Enable javascript to be able to filter the list of wheel files.

trajscore-1.0.0-py3-none-any.whl (21.4 kB view details)

Uploaded Apr 10, 2026 Python 3

File details

Details for the file trajscore-1.0.0.tar.gz.

File metadata

Download URL: trajscore-1.0.0.tar.gz
Upload date: Apr 10, 2026
Size: 20.4 kB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.2.0 CPython/3.11.9

File hashes

Hashes for trajscore-1.0.0.tar.gz
Algorithm	Hash digest
SHA256	`185fd8f69840dd8db18d4f5cb282a6804b97ed3a1d7834b60464f5b7f20a3c34`
MD5	`84392f59ca3dbcdc731c7189dd464589`
BLAKE2b-256	`1a3ea4d984f78a250b241b49b3d054892ab36f8509a5ca26314015a551a87a8f`

See more details on using hashes here.

File details

Details for the file trajscore-1.0.0-py3-none-any.whl.

File metadata

Download URL: trajscore-1.0.0-py3-none-any.whl
Upload date: Apr 10, 2026
Size: 21.4 kB
Tags: Python 3
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.2.0 CPython/3.11.9

File hashes

Hashes for trajscore-1.0.0-py3-none-any.whl
Algorithm	Hash digest
SHA256	`a0812dcfd96a47a1e9dea08c79f99b0540bb637ce117201c8b4bf18bfc7e3447`
MD5	`3d61c44da1c3a8e0fd700a7f07129565`
BLAKE2b-256	`1fea71eb761f5c3156e6fa2a70e5d30de27b337a9561d9093473989b198f1e12`

See more details on using hashes here.

trajscore 1.0.0

Navigation

Verified details

Maintainers

Unverified details

Project links

Meta

Classifiers

Project description

agenteval

Why agenteval?

Quickstart

Built-in Metrics

Batch & Async Evaluation

Advanced Features

Caching (LRU + TTL + SHA-256)

Evaluation Pipeline

Declarative Validation

Rate Limiter (sync + async)

Budget-Controlled Evaluation

Streaming Results

Diff & Regression Tracking

Observability

Audit Log & Cost Ledger

Live Trajectory Watcher

Installation

License

Project details

Verified details

Maintainers

Unverified details

Project links

Meta

Classifiers

Release history Release notifications | RSS feed

Download files

Source Distribution

Built Distribution

File details

File metadata

File hashes

File details

File metadata

File hashes