Production-grade agentic trajectory evaluation — score multi-step AI agent runs on goal completion, tool accuracy, step efficiency, reasoning coherence, loop detection, and faithfulness

trajscore

Production-grade agentic trajectory evaluation for multi-step AI agents.

Score any AI agent run on 6 built-in metrics, detect regressions, stream results, and integrate into CI/CD — with zero vendor lock-in.

pip install trajscore

Why trajscore?

In 2026, every team building agentic AI faces the same problem: you can't improve what you can't measure. Agents fail in subtle ways: they loop, misuse tools, hallucinate answers unsupported by observations, or take twice as many steps as needed. Until now, no single library evaluated full multi-step trajectories with structured, auditable metrics.

trajscore fixes this.


Quickstart

from trajscore import (
    Trajectory, TrajectoryStep, StepType,
    TrajectoryEvaluator,
)

trajectory = Trajectory(
    trajectory_id="run-001",
    task="What is the capital of France?",
    steps=[
        TrajectoryStep(step_index=0, step_type=StepType.THOUGHT,
                       content="I should look this up."),
        TrajectoryStep(step_index=1, step_type=StepType.TOOL_CALL,
                       content="search", tool_name="search",
                       tool_args={"query": "capital of France"}),
        TrajectoryStep(step_index=2, step_type=StepType.OBSERVATION,
                       content="Paris is the capital of France."),
        TrajectoryStep(step_index=3, step_type=StepType.FINAL_ANSWER,
                       content="The capital of France is Paris."),
    ],
    final_answer="The capital of France is Paris.",
    expected_tools=["search"],
)

evaluator = TrajectoryEvaluator()
score = evaluator.evaluate(trajectory)

print(f"Overall: {score.overall_score:.3f}  Passed: {score.passed}")
print(score.metric_scores)

Built-in Metrics

Metric               Description
goal_completion      Did the agent produce a relevant final answer?
tool_accuracy        Did it use the right tools? (F1 vs expected_tools)
step_efficiency      Did it reach the goal without unnecessary steps?
reasoning_coherence  Do thoughts lead logically to actions?
loop_detection       Did the agent repeat actions or thoughts?
answer_faithfulness  Is the final answer grounded in observations?
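
To make two of these metrics concrete, here is a minimal standalone sketch of how an F1 score against expected_tools and a repeated-action ratio could be computed. It is an illustration only, not trajscore's implementation: the function names `tool_f1` and `repeated_action_ratio`, the set-based matching, and the handling of empty inputs are all assumptions.

```python
from typing import Iterable, List


def tool_f1(used: Iterable[str], expected: Iterable[str]) -> float:
    """Set-based F1 between the tools an agent called and the expected tools."""
    used_set, expected_set = set(used), set(expected)
    if not used_set and not expected_set:
        return 1.0  # assumption: nothing expected and nothing called is a perfect match
    true_positives = len(used_set & expected_set)
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(used_set)
    recall = true_positives / len(expected_set)
    return 2 * precision * recall / (precision + recall)


def repeated_action_ratio(actions: List[str]) -> float:
    """Fraction of actions that exactly repeat an earlier one (0.0 means no repeats)."""
    if not actions:
        return 0.0
    return 1 - len(set(actions)) / len(actions)


print(tool_f1(["search"], ["search"]))                 # exact match
print(tool_f1(["search", "calculator"], ["search"]))   # one unnecessary tool
print(repeated_action_ratio(["search:a", "search:a", "search:b"]))
```

An extra, unexpected tool lowers precision but not recall, so the second call scores below 1.0 without dropping to zero.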

Batch & Async Evaluation

import asyncio

from trajscore import TrajectoryEvaluator

evaluator = TrajectoryEvaluator()

# `trajectories` is a list of Trajectory objects, built as in the Quickstart.

# Synchronous batch
result = evaluator.evaluate_batch(trajectories, max_workers=8)

# Async batch
result = asyncio.run(evaluator.aevaluate_batch(trajectories))

print(f"Pass rate: {result.pass_rate:.1%}")
print(f"Mean score: {result.mean_overall:.3f}")
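
For readers who want to see what a thread-based batch with a pass rate and mean score might look like, the following is a rough standalone sketch. It does not use trajscore: `evaluate_in_threads` and `pass_threshold` are hypothetical names, and the summary keys merely mirror the `pass_rate` and `mean_overall` attributes shown above.

```python
from concurrent.futures import ThreadPoolExecutor
from statistics import mean
from typing import Callable, Dict, List, Sequence


def evaluate_in_threads(items: Sequence[dict], evaluate: Callable[[dict], float],
                        pass_threshold: float = 0.7, max_workers: int = 8) -> Dict[str, float]:
    """Score items concurrently and summarize the batch."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # Executor.map returns results in input order, regardless of completion order.
        scores: List[float] = list(pool.map(evaluate, items))
    if not scores:
        return {"pass_rate": 0.0, "mean_overall": 0.0}
    passed = sum(score >= pass_threshold for score in scores)
    return {"pass_rate": passed / len(scores), "mean_overall": mean(scores)}


# Stand-in "trajectories" that already carry a score, for illustration only.
summary = evaluate_in_threads([{"score": 0.75}, {"score": 0.5}], lambda t: t["score"])
print(summary)
```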

Advanced Features

Caching (LRU + TTL + SHA-256)

from trajscore.advanced import TrajectoryCache

cache = TrajectoryCache(max_size=512, ttl=600)
memoized_eval = cache.memoize(evaluator.evaluate)
score = memoized_eval(trajectory)    # cached on second call
print(cache.stats())
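
The heading mentions LRU eviction, a TTL, and SHA-256 keys. As a sketch of how those three pieces can fit together (an assumption about the general technique, not a description of `TrajectoryCache` internals), the hypothetical `TinyTTLCache` below hashes a JSON-serialized payload, expires entries after `ttl` seconds, and evicts the least recently used entry when full.

```python
import hashlib
import json
import time
from collections import OrderedDict
from typing import Any, Callable, Dict


class TinyTTLCache:
    """Hypothetical LRU + TTL cache keyed by a SHA-256 digest of the input."""

    def __init__(self, max_size: int = 512, ttl: float = 600.0) -> None:
        self.max_size = max_size
        self.ttl = ttl
        self.hits = 0
        self.misses = 0
        self._entries = OrderedDict()  # digest -> (stored_at, value)

    @staticmethod
    def _key(payload: Dict[str, Any]) -> str:
        # sort_keys makes the digest independent of dict insertion order.
        blob = json.dumps(payload, sort_keys=True, default=str).encode("utf-8")
        return hashlib.sha256(blob).hexdigest()

    def memoize(self, func: Callable[[Dict[str, Any]], Any]) -> Callable[[Dict[str, Any]], Any]:
        def wrapper(payload: Dict[str, Any]) -> Any:
            key = self._key(payload)
            entry = self._entries.get(key)
            if entry is not None and time.monotonic() - entry[0] < self.ttl:
                self.hits += 1
                self._entries.move_to_end(key)  # mark as most recently used
                return entry[1]
            self.misses += 1
            value = func(payload)
            self._entries[key] = (time.monotonic(), value)
            self._entries.move_to_end(key)
            while len(self._entries) > self.max_size:
                self._entries.popitem(last=False)  # evict least recently used
            return value
        return wrapper


calls = []

def fake_evaluate(payload):
    calls.append(payload["trajectory_id"])
    return {"overall_score": 0.9}

cache = TinyTTLCache(max_size=2, ttl=60.0)
cached_evaluate = cache.memoize(fake_evaluate)
cached_evaluate({"trajectory_id": "run-001", "steps": []})
cached_evaluate({"steps": [], "trajectory_id": "run-001"})  # same content, served from cache
print(len(calls), cache.hits, cache.misses)  # 1 1 1
```

Hashing the serialized content rather than the object identity means two equal trajectories share one cache entry even if they are distinct objects.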

Evaluation Pipeline

from trajscore.advanced import EvalPipeline

pipeline = (
    EvalPipeline()
    .filter("non_empty", lambda t: len(t.steps) > 0)
    .map("tag_metadata", lambda t: t)
    .with_retry("tag_metadata", retries=2)
)
cleaned = pipeline.run(trajectories)
print(pipeline.audit_log)

# Async
import asyncio
cleaned = asyncio.run(pipeline.arun(trajectories))

Declarative Validation

from trajscore.advanced import TrajectoryValidator, TrajectoryRule

validator = (
    TrajectoryValidator()
    .add_rule(TrajectoryRule("has_steps", lambda t: len(t.steps) > 0, "Need steps"))
    .add_rule(TrajectoryRule("has_task", lambda t: bool(t.task), "Need task"))
)
violations = validator.validate(trajectory)
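
The rule-based pattern above can be sketched without the library. The hypothetical `Rule` and `collect_violations` below show one way to run every rule and collect all failures rather than stopping at the first; treating a rule that raises as a violation is an assumption of this sketch, not documented trajscore behavior.

```python
from dataclasses import dataclass
from typing import Any, Callable, List


@dataclass(frozen=True)
class Rule:
    name: str
    check: Callable[[Any], bool]
    message: str


def collect_violations(item: Any, rules: List[Rule]) -> List[str]:
    """Run every rule and report each failure, in rule order."""
    violations = []
    for rule in rules:
        try:
            passed = rule.check(item)
        except Exception as exc:
            # Assumption: a crashing rule is reported rather than propagated.
            violations.append(f"{rule.name}: rule raised {type(exc).__name__}")
            continue
        if not passed:
            violations.append(f"{rule.name}: {rule.message}")
    return violations


rules = [
    Rule("has_steps", lambda t: len(t["steps"]) > 0, "Need steps"),
    Rule("has_task", lambda t: bool(t["task"]), "Need task"),
]
print(collect_violations({"task": "", "steps": []}, rules))
```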

Rate Limiter (sync + async)

from trajscore.advanced import RateLimiter

limiter = RateLimiter(rate=10, capacity=10)  # 10 evals/s
if limiter.acquire():
    score = evaluator.evaluate(trajectory)
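
The `rate` and `capacity` parameters suggest a token-bucket design. Assuming that is the case, a minimal sketch of the technique looks like this; `TokenBucket` and its injectable `clock` argument are hypothetical and exist here only so the example runs deterministically.

```python
import time


class TokenBucket:
    """Hypothetical token bucket: refills `rate` tokens per second up to `capacity`."""

    def __init__(self, rate: float, capacity: float, clock=time.monotonic) -> None:
        self.rate = rate
        self.capacity = capacity
        self._clock = clock
        self._tokens = capacity  # start full so an initial burst is allowed
        self._last = clock()

    def acquire(self, tokens: float = 1.0) -> bool:
        now = self._clock()
        # Credit tokens for the time elapsed since the last call, capped at capacity.
        self._tokens = min(self.capacity, self._tokens + (now - self._last) * self.rate)
        self._last = now
        if self._tokens >= tokens:
            self._tokens -= tokens
            return True
        return False


fake_now = [0.0]
bucket = TokenBucket(rate=10, capacity=2, clock=lambda: fake_now[0])
results = [bucket.acquire() for _ in range(3)]  # burst of 3 against capacity 2
fake_now[0] += 0.5                               # half a second later the bucket is full again
results.append(bucket.acquire())
print(results)  # [True, True, False, True]
```

`acquire` here is non-blocking: it returns False instead of waiting, which matches the `if limiter.acquire():` usage above.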

Budget-Controlled Evaluation

from trajscore.advanced import evaluate_with_budget
scores = evaluate_with_budget(trajectories, evaluator.evaluate, budget_seconds=5.0)
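
One plausible reading of a time budget is "stop starting new evaluations once the deadline has passed". The sketch below implements that reading; `evaluate_until_deadline` and its `clock` parameter are hypothetical, and whether trajscore interrupts an evaluation already in progress is not something this example claims to know.

```python
import time
from typing import Callable, Iterable, List, TypeVar

T = TypeVar("T")
S = TypeVar("S")


def evaluate_until_deadline(items: Iterable[T], evaluate: Callable[[T], S],
                            budget_seconds: float,
                            clock: Callable[[], float] = time.monotonic) -> List[S]:
    """Evaluate items in order; stop starting new work once the budget is spent."""
    deadline = clock() + budget_seconds
    scores: List[S] = []
    for item in items:
        if clock() >= deadline:
            break  # budget exhausted: remaining items are left unevaluated
        scores.append(evaluate(item))
    return scores


fake_now = [0.0]

def slow_evaluate(x):
    fake_now[0] += 2.0  # pretend each evaluation takes two seconds
    return x * 10

budget_scores = evaluate_until_deadline([1, 2, 3, 4], slow_evaluate, 5.0,
                                        clock=lambda: fake_now[0])
print(budget_scores)  # [10, 20, 30]
```

The third item starts at the four-second mark and is allowed to finish, so the total runtime can overshoot the budget by up to one evaluation.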

Streaming Results

from trajscore.advanced import stream_scores, scores_to_ndjson

for score in stream_scores(trajectories, evaluator.evaluate):
    print(score.trajectory_id, score.overall_score)

# NDJSON stream
for line in scores_to_ndjson(trajectories, evaluator.evaluate):
    print(line)
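
NDJSON means one complete JSON document per line, which lets a consumer process results as they arrive. A generator-based sketch of that idea follows; `to_ndjson` and `fake_evaluate` are hypothetical stand-ins, and the field names simply reuse `trajectory_id` and `overall_score` from the examples above.

```python
import json
from typing import Any, Callable, Dict, Iterable, Iterator


def to_ndjson(items: Iterable[Any], evaluate: Callable[[Any], Dict[str, Any]]) -> Iterator[str]:
    """Lazily evaluate each item and yield one compact JSON document per line."""
    for item in items:
        # json.dumps escapes newlines inside strings, so each result stays on one line.
        yield json.dumps(evaluate(item), sort_keys=True, separators=(",", ":"))


def fake_evaluate(trajectory_id):
    return {"trajectory_id": trajectory_id, "overall_score": 0.5}

ndjson_lines = list(to_ndjson(["run-001", "run-002"], fake_evaluate))
print("\n".join(ndjson_lines))
```

Because `to_ndjson` is a generator, nothing is evaluated until the caller iterates, so memory use stays flat for large batches.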

Diff & Regression Tracking

from trajscore.advanced import diff_results, RegressionTracker

tracker = RegressionTracker(window=10)
tracker.record(result_v1)
tracker.record(result_v2)
print(tracker.trend())          # "improving" / "declining" / "stable"
diff = tracker.latest_regression()
print(diff.summary())
print(diff.to_json())
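
How `trend()` decides between "improving", "declining", and "stable" is not stated here. As one possible approach, the hypothetical `TrendWindow` below keeps the last `window` mean scores and compares the average of the newer half with the older half; the `tolerance` parameter and the half-split rule are assumptions of this sketch.

```python
from collections import deque
from statistics import mean


class TrendWindow:
    """Hypothetical sliding window over mean scores with a coarse trend label."""

    def __init__(self, window: int = 10, tolerance: float = 0.01) -> None:
        self._scores = deque(maxlen=window)  # oldest entries fall off automatically
        self.tolerance = tolerance

    def record(self, mean_score: float) -> None:
        self._scores.append(mean_score)

    def trend(self) -> str:
        if len(self._scores) < 2:
            return "stable"  # not enough history to compare
        scores = list(self._scores)
        half = len(scores) // 2
        delta = mean(scores[half:]) - mean(scores[:half])
        if delta > self.tolerance:
            return "improving"
        if delta < -self.tolerance:
            return "declining"
        return "stable"


history = TrendWindow(window=4)
for mean_score in [0.70, 0.72, 0.80, 0.85]:
    history.record(mean_score)
print(history.trend())  # improving
```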

Observability

from trajscore.advanced import EvaluationProfiler, DriftDetector, EvaluationReport

profiler = EvaluationProfiler()
scored = profiler.profile(evaluator.evaluate)(trajectory)
print(profiler.report())

detector = DriftDetector(threshold=0.05)
detector.set_baseline(result_v1)
print(detector.detect(result_v2))

report = EvaluationReport(result)
print(report.to_json())
print(report.to_csv())
print(report.to_markdown())
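
A threshold-based drift check can be as simple as comparing per-metric means between a baseline run and a current run. The sketch below illustrates that idea under stated assumptions: `detect_drift` is a hypothetical function, the inputs are plain dicts of metric name to score list, and the comparison is on the absolute change in the mean, which may differ from what `DriftDetector` actually does.

```python
from statistics import mean
from typing import Dict, List


def detect_drift(baseline: Dict[str, List[float]], current: Dict[str, List[float]],
                 threshold: float = 0.05) -> Dict[str, float]:
    """Return metrics whose mean moved by more than `threshold`, with the signed change."""
    drifted = {}
    for metric, baseline_values in baseline.items():
        if metric not in current or not baseline_values or not current[metric]:
            continue  # skip metrics that cannot be compared
        change = mean(current[metric]) - mean(baseline_values)
        if abs(change) > threshold:
            drifted[metric] = round(change, 4)
    return drifted


baseline_scores = {"tool_accuracy": [0.9, 0.8], "loop_detection": [1.0, 1.0]}
current_scores = {"tool_accuracy": [0.7, 0.7], "loop_detection": [1.0, 0.98]}
drift = detect_drift(baseline_scores, current_scores, threshold=0.05)
print(drift)  # {'tool_accuracy': -0.15}
```

Reporting the signed change makes it clear whether a metric got better or worse, which a bare boolean would hide.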

Audit Log & Cost Ledger

from trajscore.advanced import AuditLog, CostLedger

log = AuditLog()
log.log("eval_start", {"run_id": "ci-42"})

ledger = CostLedger()
ledger.record("t1", tokens=1200, cost_usd=0.024)
print(ledger.summary())

Live Trajectory Watcher

from trajscore import TrajectoryWatcher, TrajectoryStep, StepType

watcher = TrajectoryWatcher(
    trajectory_id="live-001",
    task="Summarize the paper",
    on_step=lambda step, idx: print(f"Step {idx}: {step.step_type}"),
)

watcher.add_step(TrajectoryStep(step_index=0, step_type=StepType.THOUGHT, content="Reading..."))
trajectory = watcher.finish("Summary complete.")
score = evaluator.evaluate(trajectory)

Installation

pip install trajscore

Python 3.8+ · Only external dependency: pydantic (everything else is stdlib)


License

MIT
