Vigil

The testing framework for AI agents. Fast, framework-agnostic, CI-ready.

Quick Start · Features · Assertions · LLM-as-Judge · CI Integration · PyPI


Why Vigil?

40% of agentic AI projects risk cancellation due to reliability issues. Only 52% of teams run any form of evaluation. The rest ship and pray.

Existing tools are either framework-specific, complex to set up, or disconnected from CI/CD. There's no "pytest for AI."

Vigil changes that. Write AI tests in plain Python. Run them in CI. Catch regressions before production.

pip install vigil-eval

Quick Start

from vigil import test, FunctionAgent, assert_contains, assert_cost_under

agent = FunctionAgent(my_chatbot)

@test()
def test_greeting():
    result = agent.run("Hello!")
    assert_contains(result, "hello")
    assert_cost_under(result, 0.01)

vigil run

That's it. No config files, no setup, no boilerplate.

Features

11+ assertions: content, cost, latency, semantic, hallucination, LLM-as-judge
3 agent types: Python functions, HTTP APIs, CLI tools
LLM-as-judge: use GPT/Claude to grade your agent's output against criteria
Snapshot testing: save golden outputs, detect regressions automatically
Cost tracking: auto-calculate API costs for 20+ models across 4 providers
Plugin system: extend with custom assertions, agents, and reporters
CI-ready: exit codes, JSON/HTML reports, GitHub Action included
Parallel execution: run tests concurrently with --parallel
Async support: test async agents natively
Zero config: works out of the box, configure when you need to
Built on pytest: use everything you already know

Assertions

from vigil import (
    assert_contains,          # output contains expected text
    assert_not_contains,      # output does not contain text
    assert_json_valid,        # output is valid JSON
    assert_matches_regex,     # output matches regex pattern
    assert_cost_under,        # API cost below threshold
    assert_tokens_under,      # token usage below limit
    assert_latency_under,     # response time below threshold
    assert_semantic_match,    # semantically similar to reference
    assert_no_hallucination,  # output grounded in provided context
    assert_quality,           # LLM grades output against criteria
    assert_rubric,            # LLM grades against multiple criteria
)
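To illustrate the contract these assertions follow, here is a rough behavioral sketch, not Vigil's actual implementation: each assertion inspects an output and raises AssertionError on failure, so plain pytest-style reporting works unchanged. (These stand-ins operate on raw strings; Vigil's versions take a result object.)

```python
import json
import re

def assert_contains(output: str, expected: str) -> None:
    """Fail unless `expected` appears in the output (case-insensitive)."""
    if expected.lower() not in output.lower():
        raise AssertionError(f"expected {expected!r} in output")

def assert_json_valid(output: str) -> None:
    """Fail unless the output parses as JSON."""
    try:
        json.loads(output)
    except json.JSONDecodeError as exc:
        raise AssertionError(f"output is not valid JSON: {exc}") from exc

def assert_matches_regex(output: str, pattern: str) -> None:
    """Fail unless the output matches the regex pattern."""
    if re.search(pattern, output) is None:
        raise AssertionError(f"output does not match {pattern!r}")

assert_contains("Hello, world!", "hello")
assert_json_valid('{"ok": true}')
assert_matches_regex("order #1234 shipped", r"#\d{4}")
```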

LLM-as-Judge

Vigil's most powerful feature. Instead of brittle pattern matching, an LLM evaluates your agent's output against criteria you define:

from vigil import test, FunctionAgent, assert_quality, assert_rubric

agent = FunctionAgent(my_agent)

@test()
def test_explanation():
    result = agent.run("Explain quantum computing to a 5-year-old")
    assert_quality(
        result,
        criteria="age-appropriate, accurate, under 100 words, uses simple analogies",
        threshold=0.7,
    )

@test()
def test_article():
    result = agent.run("Write about climate change")
    assert_rubric(
        result,
        rubric={
            "accuracy": "All claims are factually correct",
            "clarity": "Easy to understand, no jargon",
            "completeness": "Covers causes, effects, and solutions",
        },
        threshold=0.7,
    )

Works with OpenAI and Anthropic. Auto-detects your API key.
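Conceptually, the judge-based assertions work by scoring each criterion and comparing an aggregate to the threshold. A hedged sketch of that pattern (the judge call is stubbed here; Vigil would call OpenAI or Anthropic, and its scoring details may differ):

```python
def judge_criterion(output: str, criterion: str) -> float:
    # Stand-in for a real LLM call; returns a fixed score for illustration.
    return 0.8

def rubric_score(output: str, rubric: dict) -> float:
    # Score each criterion in [0, 1] and average.
    scores = [judge_criterion(output, desc) for desc in rubric.values()]
    return sum(scores) / len(scores)

rubric = {
    "accuracy": "All claims are factually correct",
    "clarity": "Easy to understand, no jargon",
}
score = rubric_score("some article text", rubric)
assert score >= 0.7  # analogous to assert_rubric(..., threshold=0.7)
```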

Agent Types

from vigil import FunctionAgent, HTTPAgent, CLIAgent

# Test a Python function (sync or async)
agent = FunctionAgent(my_function)

# Test an HTTP endpoint
agent = HTTPAgent("http://localhost:8000/chat")

# Test a CLI tool
agent = CLIAgent("python my_agent.py")

Cost Tracking

Vigil auto-calculates API costs for 20+ models:

from vigil import FunctionAgent, assert_cost_under
from vigil.cost import enrich_result

def my_agent(msg):
    return {
        "output": "Hello!",
        "model": "gpt-4o-mini",
        "tokens_input": 50,
        "tokens_output": 100,
    }

result = FunctionAgent(my_agent).run("Hi")
enrich_result(result)
print(result.cost)  # Auto-calculated from pricing table

assert_cost_under(result, 0.01)

Supports OpenAI, Anthropic, Google, and DeepSeek models.
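The arithmetic behind enrich_result is straightforward: look up the model's per-token prices and multiply by usage. A sketch with illustrative placeholder prices (USD per 1M tokens; not Vigil's actual pricing table):

```python
PRICING = {
    # model: (input $/1M tokens, output $/1M tokens) -- example values only
    "gpt-4o-mini": (0.15, 0.60),
}

def calc_cost(model: str, tokens_input: int, tokens_output: int) -> float:
    """Compute per-call cost in USD from a pricing table."""
    price_in, price_out = PRICING[model]
    return (tokens_input * price_in + tokens_output * price_out) / 1_000_000

cost = calc_cost("gpt-4o-mini", tokens_input=50, tokens_output=100)
assert cost < 0.01  # so assert_cost_under(result, 0.01) would pass
```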

Snapshot Testing

Save golden outputs and detect when your agent's behavior drifts:

from vigil import test, FunctionAgent
from vigil.snapshots import snapshot

@test()
def test_output_stable():
    result = FunctionAgent(my_agent).run("Summarize this document")
    snapshot(result, name="summary_output")

# First run: saves snapshot. Next runs: compares against it.
vigil run

# Accept new outputs when they intentionally change
vigil snapshot update
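The snapshot semantics can be sketched in a few lines (not Vigil's implementation): the first run writes the output to disk, later runs fail if it drifts, and an update flag accepts the new output, as vigil snapshot update does.

```python
import json
import tempfile
from pathlib import Path

def snapshot(output: str, name: str, snap_dir: Path, update: bool = False) -> None:
    """Save output on first run; fail on later runs if it changed."""
    path = snap_dir / f"{name}.json"
    if update or not path.exists():
        path.write_text(json.dumps({"output": output}))
        return
    saved = json.loads(path.read_text())["output"]
    if saved != output:
        raise AssertionError(f"snapshot {name!r} changed: {saved!r} -> {output!r}")

snap_dir = Path(tempfile.mkdtemp())
snapshot("v1", "summary_output", snap_dir)  # first run: saves
snapshot("v1", "summary_output", snap_dir)  # same output: passes
```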

CI Integration

GitHub Actions

name: AI Tests
on: [push, pull_request]
jobs:
  test:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with:
          python-version: "3.12"
      - run: pip install vigil-eval
      - run: vigil run
        env:
          OPENAI_API_KEY: ${{ secrets.OPENAI_API_KEY }}

Reports

vigil run --report json > results.json   # For pipelines
vigil run --report html                  # Standalone HTML report

Parallel Execution

pip install "vigil-eval[parallel]"
vigil run --parallel 4

Configuration

Works with zero config. When you need it, create vigil.yaml:

defaults:
  cost_threshold: 0.05
  latency_threshold: 5.0
  semantic_threshold: 0.85

reporting:
  format: terminal
  verbose: true

Or add to your existing pyproject.toml:

[tool.vigil]
cost_threshold = 0.05
latency_threshold = 5.0

Plugins

Extend Vigil with custom assertions, agents, and reporters:

from vigil.plugins import register_assertion

@register_assertion("assert_polite")
def assert_polite(result, **kwargs):
    polite_words = ["please", "thank", "sorry", "appreciate"]
    if not any(word in result.output.lower() for word in polite_words):
        raise AssertionError("Output is not polite")

Plugins are auto-discovered via Python entry points.
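For a packaged plugin, entry-point discovery is typically declared in pyproject.toml along these lines. The group name "vigil.plugins" and module path below are assumptions for illustration; check the plugin docs for the exact group Vigil scans.

```toml
[project.entry-points."vigil.plugins"]
my_plugin = "my_package.vigil_plugin"
```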

Install

pip install vigil-eval                    # Core
pip install "vigil-eval[openai]"          # + OpenAI (LLM-as-judge, embeddings)
pip install "vigil-eval[anthropic]"       # + Anthropic
pip install "vigil-eval[all]"             # Everything

License

Apache 2.0

