Vigil

The testing framework for AI agents. Fast, framework-agnostic, CI-ready.

Quick Start · Features · Assertions · LLM-as-Judge · CI Integration · PyPI


Why Vigil?

40% of agentic AI projects risk cancellation due to reliability issues. Only 52% of teams run any form of evaluation. The rest ship and pray.

Existing tools are either framework-specific, complex to set up, or disconnected from CI/CD. There's no "pytest for AI."

Vigil changes that. Write AI tests in plain Python. Run them in CI. Catch regressions before production.

pip install vigil-eval

Quick Start

from vigil import test, FunctionAgent, assert_contains, assert_cost_under

agent = FunctionAgent(my_chatbot)

@test()
def test_greeting():
    result = agent.run("Hello!")
    assert_contains(result, "hello")
    assert_cost_under(result, 0.01)

vigil run

That's it. No config files, no setup, no boilerplate.

Features

11+ assertions: content, cost, latency, semantic, hallucination, LLM-as-judge
3 agent types: Python functions, HTTP APIs, CLI tools
LLM-as-judge: use GPT/Claude to grade your agent's output against criteria
Snapshot testing: save golden outputs, detect regressions automatically
Cost tracking: auto-calculate API costs for 20+ models across 4 providers
Plugin system: extend with custom assertions, agents, and reporters
CI-ready: exit codes, JSON/HTML reports, GitHub Action included
Parallel execution: run tests concurrently with --parallel
Async support: test async agents natively
Zero config: works out of the box, configure when you need to
Built on pytest: use everything you already know

Assertions

from vigil import (
    assert_contains,          # output contains expected text
    assert_not_contains,      # output does not contain text
    assert_json_valid,        # output is valid JSON
    assert_matches_regex,     # output matches regex pattern
    assert_cost_under,        # API cost below threshold
    assert_tokens_under,      # token usage below limit
    assert_latency_under,     # response time below threshold
    assert_semantic_match,    # semantically similar to reference
    assert_no_hallucination,  # output grounded in provided context
    assert_quality,           # LLM grades output against criteria
    assert_rubric,            # LLM grades against multiple criteria
)
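To illustrate the contract these assertions follow, here is a rough behavioral sketch, not Vigil's actual implementation: each assertion inspects an output and raises AssertionError on failure, so plain pytest-style reporting works unchanged. (These stand-ins operate on raw strings; Vigil's versions take a result object.)

```python
import json
import re

def assert_contains(output: str, expected: str) -> None:
    """Fail unless `expected` appears in the output (case-insensitive)."""
    if expected.lower() not in output.lower():
        raise AssertionError(f"expected {expected!r} in output")

def assert_json_valid(output: str) -> None:
    """Fail unless the output parses as JSON."""
    try:
        json.loads(output)
    except json.JSONDecodeError as exc:
        raise AssertionError(f"output is not valid JSON: {exc}") from exc

def assert_matches_regex(output: str, pattern: str) -> None:
    """Fail unless the output matches the regex pattern."""
    if re.search(pattern, output) is None:
        raise AssertionError(f"output does not match {pattern!r}")

assert_contains("Hello, world!", "hello")
assert_json_valid('{"ok": true}')
assert_matches_regex("order #1234 shipped", r"#\d{4}")
```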

LLM-as-Judge

Vigil's most powerful feature. Instead of brittle pattern matching, an LLM evaluates your agent's output against criteria you define:

from vigil import test, FunctionAgent, assert_quality, assert_rubric

agent = FunctionAgent(my_agent)

@test()
def test_explanation():
    result = agent.run("Explain quantum computing to a 5-year-old")
    assert_quality(
        result,
        criteria="age-appropriate, accurate, under 100 words, uses simple analogies",
        threshold=0.7,
    )

@test()
def test_article():
    result = agent.run("Write about climate change")
    assert_rubric(
        result,
        rubric={
            "accuracy": "All claims are factually correct",
            "clarity": "Easy to understand, no jargon",
            "completeness": "Covers causes, effects, and solutions",
        },
        threshold=0.7,
    )

Works with OpenAI and Anthropic. Auto-detects your API key.
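Conceptually, the judge-based assertions work by scoring each criterion and comparing an aggregate to the threshold. A hedged sketch of that pattern (the judge call is stubbed here; Vigil would call OpenAI or Anthropic, and its scoring details may differ):

```python
def judge_criterion(output: str, criterion: str) -> float:
    # Stand-in for a real LLM call; returns a fixed score for illustration.
    return 0.8

def rubric_score(output: str, rubric: dict) -> float:
    # Score each criterion in [0, 1] and average.
    scores = [judge_criterion(output, desc) for desc in rubric.values()]
    return sum(scores) / len(scores)

rubric = {
    "accuracy": "All claims are factually correct",
    "clarity": "Easy to understand, no jargon",
}
score = rubric_score("some article text", rubric)
assert score >= 0.7  # analogous to assert_rubric(..., threshold=0.7)
```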

Agent Types

from vigil import FunctionAgent, HTTPAgent, CLIAgent

# Test a Python function (sync or async)
agent = FunctionAgent(my_function)

# Test an HTTP endpoint
agent = HTTPAgent("http://localhost:8000/chat")

# Test a CLI tool
agent = CLIAgent("python my_agent.py")

Cost Tracking

Vigil auto-calculates API costs for 20+ models:

from vigil import FunctionAgent, assert_cost_under
from vigil.cost import enrich_result

def my_agent(msg):
    return {
        "output": "Hello!",
        "model": "gpt-4o-mini",
        "tokens_input": 50,
        "tokens_output": 100,
    }

result = FunctionAgent(my_agent).run("Hi")
enrich_result(result)
print(result.cost)  # Auto-calculated from pricing table

assert_cost_under(result, 0.01)

Supports OpenAI, Anthropic, Google, and DeepSeek models.
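The arithmetic behind enrich_result is straightforward: look up the model's per-token prices and multiply by usage. A sketch with illustrative placeholder prices (USD per 1M tokens; not Vigil's actual pricing table):

```python
PRICING = {
    # model: (input $/1M tokens, output $/1M tokens) -- example values only
    "gpt-4o-mini": (0.15, 0.60),
}

def calc_cost(model: str, tokens_input: int, tokens_output: int) -> float:
    """Compute per-call cost in USD from a pricing table."""
    price_in, price_out = PRICING[model]
    return (tokens_input * price_in + tokens_output * price_out) / 1_000_000

cost = calc_cost("gpt-4o-mini", tokens_input=50, tokens_output=100)
assert cost < 0.01  # so assert_cost_under(result, 0.01) would pass
```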

Snapshot Testing

Save golden outputs and detect when your agent's behavior drifts:

from vigil import test, FunctionAgent
from vigil.snapshots import snapshot

@test()
def test_output_stable():
    result = FunctionAgent(my_agent).run("Summarize this document")
    snapshot(result, name="summary_output")

# First run: saves snapshot. Next runs: compares against it.
vigil run

# Accept new outputs when they intentionally change
vigil snapshot update
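The snapshot semantics can be sketched in a few lines (not Vigil's implementation): the first run writes the output to disk, later runs fail if it drifts, and an update flag accepts the new output, as vigil snapshot update does.

```python
import json
import tempfile
from pathlib import Path

def snapshot(output: str, name: str, snap_dir: Path, update: bool = False) -> None:
    """Save output on first run; fail on later runs if it changed."""
    path = snap_dir / f"{name}.json"
    if update or not path.exists():
        path.write_text(json.dumps({"output": output}))
        return
    saved = json.loads(path.read_text())["output"]
    if saved != output:
        raise AssertionError(f"snapshot {name!r} changed: {saved!r} -> {output!r}")

snap_dir = Path(tempfile.mkdtemp())
snapshot("v1", "summary_output", snap_dir)  # first run: saves
snapshot("v1", "summary_output", snap_dir)  # same output: passes
```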

CI Integration

GitHub Actions

name: AI Tests
on: [push, pull_request]
jobs:
  test:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with:
          python-version: "3.12"
      - run: pip install vigil-eval
      - run: vigil run
        env:
          OPENAI_API_KEY: ${{ secrets.OPENAI_API_KEY }}

Reports

vigil run --report json > results.json   # For pipelines
vigil run --report html                  # Standalone HTML report

Parallel Execution

pip install "vigil-eval[parallel]"
vigil run --parallel 4

Configuration

Works with zero config. When you need it, create vigil.yaml:

defaults:
  cost_threshold: 0.05
  latency_threshold: 5.0
  semantic_threshold: 0.85

reporting:
  format: terminal
  verbose: true

Or add to your existing pyproject.toml:

[tool.vigil]
cost_threshold = 0.05
latency_threshold = 5.0

Plugins

Extend Vigil with custom assertions, agents, and reporters:

from vigil.plugins import register_assertion

@register_assertion("assert_polite")
def assert_polite(result, **kwargs):
    polite_words = ["please", "thank", "sorry", "appreciate"]
    if not any(word in result.output.lower() for word in polite_words):
        raise AssertionError("Output is not polite")

Plugins are auto-discovered via Python entry points.
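For a packaged plugin, entry-point discovery is typically declared in pyproject.toml along these lines. The group name "vigil.plugins" and module path below are assumptions for illustration; check the plugin docs for the exact group Vigil scans.

```toml
[project.entry-points."vigil.plugins"]
my_plugin = "my_package.vigil_plugin"
```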

Install

pip install vigil-eval                    # Core
pip install "vigil-eval[openai]"          # + OpenAI (LLM-as-judge, embeddings)
pip install "vigil-eval[anthropic]"       # + Anthropic
pip install "vigil-eval[all]"             # Everything

License

Apache 2.0

