agentstax-eval

A lightweight Python library for evaluating AI agents and RAG pipelines. No magic, no vendor lock-in — just explicit, readable evaluation logic.

Key Features

  • Four objects, one job each — Dataset, Task, Evaluation, Pipeline.
  • Zero required dependencies — the core library installs with no third-party packages.
  • Bring your own LLM — pass any fn(prompt: str) -> str as a judge. No default provider.
  • Metrics are functions — fn(dataset_row: dict) -> float. No base classes, no decorators.
  • Built-in LLM-as-judge — correctness, relevance, faithfulness, completeness, rubric.
  • Framework auto-extraction — pass a LangGraph, Google ADK, OpenAI Agents, CrewAI, LlamaIndex, or MSAF agent to Pipeline and get topology, model, tools, and system prompt extracted automatically.
  • Agent fingerprinting — topology changes are hashed, so caches auto-invalidate and the monitoring dashboard detects architecture drift.
  • LLM response caching — DiskCache and MemoryCache keyed on judge + prompt + agent fingerprint.
  • Real-time monitoring — pair with agentstax-eval-monitor for a live dashboard with regression detection, metric trends, and agent network visualization.
  • CI-ready — assert_passing(), failures(), and JSON save/load for pytest regression tests.

Installation

pip install agentstax-eval

Requires: Python 3.9+. Zero core dependencies.

Optional extras for built-in provider functions:

Extra       Install command             Adds
openai      pip install openai          openai SDK
anthropic   pip install anthropic       anthropic SDK
google      pip install google-genai    google-genai SDK

Why agentstax-eval

All the eval frameworks I have used feel more complicated than they should be. They are either too heavy or require vendor lock-in.

I often find myself writing a simple script to test what I want. That works great in the beginning, but its weaknesses show later on.

I built agentstax-eval to fix that. It's simple when you need it, but flexible enough to grow with your agent architecture.

It works with whatever you're already using — you can pass your LangGraph, ADK, OpenAI Agents, CrewAI, LlamaIndex, or MSAF agent directly to Pipeline. It will walk the hierarchy, extract the topology, and fingerprint it. When your architecture changes, the cache invalidates and the monitor flags the drift.

Pair it with agentstax-eval-monitor and you get a live dashboard that shows regression detection, metric trends, and agent network visualization.


Quickstart

from openai import OpenAI
from agentstax_eval import Pipeline, Dataset, Task, Evaluation
from agentstax_eval.metrics import llm_correctness
from agentstax_eval.providers import openai_provider

client = OpenAI()

dataset = Dataset([
    {"question": "What is the capital of France?", "expected_answer": "Paris"},
    {"question": "Who wrote Hamlet?",              "expected_answer": "Shakespeare"},
    {"question": "What is 2 + 2?",                "expected_answer": "4"},
])

def get_answer(dataset_row: dict) -> dict:
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": dataset_row["question"]}],
    )
    return {"answer": response.choices[0].message.content}

pipeline = Pipeline(
    dataset=dataset,
    tasks=[Task(get_answer)],
    evaluation=Evaluation(metrics=[llm_correctness(llm=openai_provider())]),
)

results = pipeline.run()
results.save(directory="results", base_filename="eval")
print(results)

With agent auto-extraction

Pass your agent object directly to unlock automatic metadata extraction, topology mapping, and smart caching:

from agentstax_eval import Pipeline, Dataset, Task, Evaluation, DiskCache
from agentstax_eval.metrics import llm_correctness
from agentstax_eval.providers import openai_provider

pipeline = Pipeline(
    dataset=dataset,
    tasks=[Task(get_answer)],
    evaluation=Evaluation(metrics=[llm_correctness(llm=openai_provider())]),
    agent=my_langgraph_agent,           # auto-extracts topology, model, tools
    cache=DiskCache(".cache"),           # caches judge responses, invalidates on topology change
)

results = pipeline.run()
results.save(directory="results", base_filename="my_agent_eval")

The saved JSON now includes a topology field with the full agent graph, fingerprint, and per-node metadata — which the monitoring dashboard uses for network visualization and change detection.
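
As a rough sketch of how you might inspect that field after loading saved results (the exact keys inside topology, such as fingerprint, are assumptions based on the description above):

from agentstax_eval import EvaluationResults

results = EvaluationResults.load(directory="results", base_filename="my_agent_eval")

# Topology sits in result metadata alongside timestamp_utc and scoring
topology = results.metadata["topology"]
print(topology.get("fingerprint"))   # assumed key: the hash used for cache invalidation and drift detection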


Metrics

Deterministic

from agentstax_eval.metrics import exact_match, contains_answer, json_valid

Metric            What it checks                                               LLM required
exact_match       answer.strip().lower() == expected_answer.strip().lower()    No
contains_answer   expected_answer appears in answer                            No
json_valid        answer parses as valid JSON                                  No
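
A quick illustration of how these behave on a single row (values are made up; each metric simply reads the row dict and returns a float):

from agentstax_eval.metrics import exact_match, contains_answer

row = {"question": "Capital of France?", "expected_answer": "Paris", "answer": "The capital is Paris."}
print(exact_match(row))       # 0.0: the normalized strings are not identical
print(contains_answer(row))   # 1.0: "Paris" appears in the answer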

LLM-as-Judge

Factory functions that take an llm callable and return a metric. The llm can be synchronous (fn(prompt: str) -> str) or asynchronous (async def).

from agentstax_eval.metrics import llm_correctness, llm_relevance, llm_faithfulness, llm_completeness, llm_rubric

Metric                         Reads                                What it scores
llm_correctness(llm=)          question, expected_answer, answer    Factual correctness against the reference
llm_relevance(llm=)            question, answer                     Whether the answer addresses the question (reference-free)
llm_faithfulness(llm=)         answer, context                      Whether claims are supported by the context (RAG)
llm_completeness(llm=)         question, expected_answer, answer    Whether all key points are covered
llm_rubric(llm=, criteria=)    question, answer                     User-defined criteria in plain English

All use a six-point scale (1.0, 0.8, 0.6, 0.4, 0.2, 0.0) with chain-of-thought reasoning.
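
A typical setup reuses one judge across several of these; a minimal sketch (note that llm_faithfulness expects a context field on each row, for example one added by a retrieval task):

from agentstax_eval import Evaluation
from agentstax_eval.metrics import llm_correctness, llm_relevance, llm_faithfulness
from agentstax_eval.providers import openai_provider

judge = openai_provider()

evaluation = Evaluation(metrics=[
    llm_correctness(llm=judge),    # needs expected_answer on each row
    llm_relevance(llm=judge),      # reference-free
    llm_faithfulness(llm=judge),   # needs a "context" field on each row
])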

Custom Metrics

Any function with the signature fn(dataset_row: dict) -> float works as a metric. The function receives the full row dict — including question, expected_answer, answer, and any extra fields your tasks added (like context). Return a float between 0.0 and 1.0. The function's __name__ is used as the column name in results.

# Deterministic metric — no LLM needed
def answer_is_concise(dataset_row: dict) -> float:
    return 1.0 if len(dataset_row["answer"].split()) <= 20 else 0.0

# Custom LLM metric — you control the prompt and parsing
def my_judge_metric(dataset_row: dict) -> float:
    prompt = f"Is this answer polite? Answer 1 or 0.\nAnswer: {dataset_row['answer']}"
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": prompt}],
    )
    return float(response.choices[0].message.content.strip())

evaluation = Evaluation(metrics=[exact_match, answer_is_concise, my_judge_metric])

When using factory functions or lambdas, set __name__ to control the result column name:

concise = llm_rubric(llm=judge, criteria="Two sentences or fewer.")
concise.__name__ = "rubric_concise"

Providers

Built-in fn(prompt: str) -> str wrappers that handle client setup and set judge_model metadata automatically:

from agentstax_eval.providers import openai_provider, anthropic_provider, google_provider

judge = openai_provider()                            # default: gpt-4.1, reads OPENAI_API_KEY
judge = anthropic_provider(model="claude-opus-4-6")  # reads ANTHROPIC_API_KEY
judge = google_provider(model="gemini-2.5-pro")      # reads GEMINI_API_KEY

Custom Providers

A provider is any fn(prompt: str) -> str. To get automatic judge_model tracking in result metadata, set the attribute on the function:

from openai import OpenAI

client = OpenAI()

def my_judge(prompt: str) -> str:
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content

my_judge.judge_model = "openai/gpt-4o"

The judge_model string flows into metadata.scoring.<metric_name>.judge_model in saved results. If omitted, scoring metadata will simply not include it.
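
As a sketch, reading that value back from a run (assuming the metric column is named llm_correctness, matching the metadata.scoring.<metric_name>.judge_model path described above):

results = pipeline.run()

# Judge model recorded for one metric column
print(results.metadata["scoring"]["llm_correctness"]["judge_model"])   # e.g. "openai/gpt-4o"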


Monitoring Dashboard

agentstax-eval-monitor is a companion real-time dashboard that watches a directory of agentstax-eval result files and provides:

  • Regression detection — agents categorized as regressed, improved, or healthy based on metric deltas
  • Metric trend sparklines — per-metric performance history at a glance
  • Per-agent deep dives — zoomable line charts with threshold lines and metadata change markers
  • Agent network visualization — interactive DAG of your multi-agent topology, with added/removed agents highlighted
  • Architecture drift detection — visual timeline of when topology or metric config changed
  • Live updates — WebSocket-powered, updates as new result files land

Setup

# Install
cd agentstax-eval-monitor
bun install

# Point at your results directory
bun run index.ts ./results

Opens a dashboard at http://localhost:3000.


Core API

Dataset

Wraps a list of dicts. Each row must have a question. The expected_answer field is optional at the dataset level — metrics that need it raise MissingFieldError at scoring time.

dataset = Dataset([
    {"question": "What year did WWII end?", "expected_answer": "1945"},
])
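
Rows without expected_answer pair naturally with reference-free metrics; a sketch (a reference-based metric such as exact_match would raise MissingFieldError when scoring these rows):

from agentstax_eval import Dataset, Evaluation
from agentstax_eval.metrics import llm_relevance
from agentstax_eval.providers import openai_provider

# llm_relevance only reads question and answer, so no expected_answer is needed
dataset = Dataset([
    {"question": "Summarize the refund policy in one sentence."},
])
evaluation = Evaluation(metrics=[llm_relevance(llm=openai_provider())])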

Task

Wraps a callable fn(dataset_row: dict) -> dict. The returned dict is merged into each row.

task = Task(get_answer)
answered_dataset = task.run(dataset)  # returns new Dataset, original not mutated
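
Because each task's returned dict is merged into the row, tasks can be chained, for example a retrieval step that adds context before the answer step. A sketch (my_vector_store and my_llm are hypothetical stand-ins for your own retriever and model call):

# Hypothetical two-step RAG pipeline: retrieval adds "context", generation adds "answer"
def retrieve(dataset_row: dict) -> dict:
    docs = my_vector_store.search(dataset_row["question"])   # hypothetical retriever
    return {"context": "\n".join(docs)}

def generate(dataset_row: dict) -> dict:
    prompt = f"Context:\n{dataset_row['context']}\n\nQuestion: {dataset_row['question']}"
    return {"answer": my_llm(prompt)}                         # hypothetical model call

tasks = [Task(retrieve), Task(generate)]   # run in order; later tasks see earlier tasks' fields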

Evaluation

Holds a list of metrics and scores a dataset. Metrics are fn(dataset_row: dict) -> float.

evaluation = Evaluation(metrics=[exact_match, llm_correctness(llm=my_judge)])
results = evaluation.run(answered_dataset, metadata={"model": "gpt-4o"})

Pipeline

Convenience wrapper: runs tasks sequentially, then scores. Accepts agent for auto-extraction and cache for LLM judge caching.

pipeline = Pipeline(
    dataset=dataset,
    tasks=[Task(get_answer)],
    evaluation=Evaluation(metrics=[llm_correctness(llm=my_judge)]),
    agent=my_agent,                # optional: auto-extract topology
    cache=DiskCache(".cache"),     # optional: cache judge responses (requires agent)
    metadata={"experiment": "v2"}, # optional: merged into results (overrides extracted values)
)
results = pipeline.run()

Parameter     Type            Description
dataset       Dataset         Input rows to process
tasks         list[Task]      Ordered task list, run sequentially
evaluation    Evaluation      Scores completed rows
metadata      dict | None     Merged into result metadata; overrides extracted values
agent         object | None   Agent object for auto-extraction
cache         Cache | None    LLM judge cache; requires agent

EvaluationResults

results.metadata         # dict with timestamp_utc, topology, scoring, etc.
results.rows             # list[ResultRow] — each has .data (dict) and .scores (dict)
results.assert_passing() # True if all thresholded metrics pass on every row
results.failures()       # list[Failure] — each row/metric pair below threshold

# Save and load
path = results.save(directory="results", base_filename="eval")
latest = EvaluationResults.load(directory="results", base_filename="eval")
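
A small sketch of reading results back for reporting, using the attributes listed above:

results = EvaluationResults.load(directory="results", base_filename="eval")

for failure in results.failures():
    print(failure)                            # each row/metric pair below threshold

for row in results.rows:
    print(row.data["question"], row.scores)   # original fields plus per-metric scores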

Framework Auto-Extraction

Pass an agent object from a supported framework and Pipeline automatically extracts its metadata and topology — no manual metadata dict needed.

Supported frameworks:

Framework                    Install
LangGraph                    pip install langgraph
Google ADK                   pip install google-adk
OpenAI Agents SDK            pip install openai-agents
CrewAI                       pip install crewai
LlamaIndex                   pip install llama-index
Microsoft Agent Framework    pip install msaf

pipeline = Pipeline(
    dataset=dataset,
    tasks=[Task(get_answer)],
    evaluation=Evaluation(metrics=[llm_correctness(llm=judge)]),
    agent=my_agent,  # just pass your agent
)

Caching

Cache LLM judge responses to avoid redundant API calls. Cache keys include judge model, prompt, and agent fingerprint — the cache auto-invalidates when agent topology changes.

from agentstax_eval import DiskCache, MemoryCache

# Persistent (JSON file, atomic writes, thread-safe)
cache = DiskCache(directory=".cache")

# In-process (lost when process exits, thread-safe)
cache = MemoryCache()

Pass to Pipeline with agent for automatic fingerprint extraction:

pipeline = Pipeline(
    dataset=dataset,
    tasks=[Task(get_answer)],
    evaluation=Evaluation(metrics=[llm_correctness(llm=judge)]),
    agent=my_agent,
    cache=DiskCache(".cache"),
)

results = pipeline.run()
cache.stats()  # {"hits": 0, "misses": 3, "size": 3}

# Second run with same agent topology: all hits
results = pipeline.run()
cache.stats()  # {"hits": 3, "misses": 3, "size": 3}

Async Support

Task, Evaluation, and Pipeline all support run_async(concurrency=N) for concurrent row processing:

import asyncio
from agentstax_eval import Pipeline, Task, Evaluation
from agentstax_eval.metrics import exact_match

async def call_agent(dataset_row: dict) -> dict:
    response = await async_client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": dataset_row["question"]}],
    )
    return {"answer": response.choices[0].message.content}

pipeline = Pipeline(
    dataset=dataset,
    tasks=[Task(call_agent)],
    evaluation=Evaluation(metrics=[exact_match]),
    agent=my_agent,
)

results = asyncio.run(pipeline.run_async(concurrency=5))

Task.run_async() and Evaluation.run_async() are also available for step-by-step use. Default concurrency is 10.
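
Continuing the snippet above, a step-by-step sketch (argument names mirror the synchronous Task.run and Evaluation.run calls):

async def main():
    # Run the task and the scoring separately, each with its own concurrency limit
    answered = await Task(call_agent).run_async(dataset, concurrency=5)
    return await Evaluation(metrics=[exact_match]).run_async(answered, concurrency=5)

results = asyncio.run(main())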


CI / Regression Testing

# test_agent_quality.py
from agentstax_eval import Pipeline, Dataset, Task, Evaluation
from agentstax_eval.metrics import llm_correctness
from agentstax_eval.providers import openai_provider

def test_agent_passes_threshold():
    results = Pipeline(
        dataset=Dataset([
            {"question": "Capital of France?", "expected_answer": "Paris"},
            {"question": "What is 2 + 2?", "expected_answer": "4"},
        ]),
        tasks=[Task(call_my_agent)],
        evaluation=Evaluation(metrics=[llm_correctness(llm=openai_provider())]),
        agent=my_agent,
    ).run()

    assert results.assert_passing(), "\n".join(str(f) for f in results.failures())

Save results for trend tracking — the monitoring dashboard can watch the same directory for live regression visibility.


Contributing

Contributions are welcome. Please open an issue to discuss before submitting a PR.

git clone https://github.com/your-org/agentstax-eval.git && cd agentstax-eval
uv sync && uv run pytest --ignore=tests/e2e -q

License

MIT — see LICENSE for details.

Download files


Source Distribution

agentstax_eval-0.1.0.tar.gz (32.6 kB)

Uploaded Source

Built Distribution


agentstax_eval-0.1.0-py3-none-any.whl (50.2 kB)

Uploaded Python 3

File details

Details for the file agentstax_eval-0.1.0.tar.gz.

File metadata

  • Download URL: agentstax_eval-0.1.0.tar.gz
  • Size: 32.6 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? Yes
  • Uploaded via: twine/6.1.0 CPython/3.13.7

File hashes

Hashes for agentstax_eval-0.1.0.tar.gz
Algorithm Hash digest
SHA256 4e410ee16c85c25e7a78766e56b75775c25dbe327cf0cc168826d33d3f71517d
MD5 2cff8463c78482bf006d3c8ea0dd1bfb
BLAKE2b-256 4e5cd5e80baff7bcefdf86a8cbb7e6cc005531ca2551c25b560978db491d0144


Provenance

The following attestation bundles were made for agentstax_eval-0.1.0.tar.gz:

Publisher: publish.yml on agentstax/eval-python-sdk

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

File details

Details for the file agentstax_eval-0.1.0-py3-none-any.whl.

File metadata

  • Download URL: agentstax_eval-0.1.0-py3-none-any.whl
  • Size: 50.2 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? Yes
  • Uploaded via: twine/6.1.0 CPython/3.13.7

File hashes

Hashes for agentstax_eval-0.1.0-py3-none-any.whl
Algorithm Hash digest
SHA256 1d39ade539556917cf5181fbded5960f89234f8ed07647d4d54003d48c5e1749
MD5 cac060df18bd61e552062099a0469081
BLAKE2b-256 b85b976d79748c86766da4dbb4289703e476a68df51df1229dee29ba0460544e


Provenance

The following attestation bundles were made for agentstax_eval-0.1.0-py3-none-any.whl:

Publisher: publish.yml on agentstax/eval-python-sdk

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.
