
Vigil

The testing framework for AI agents. Fast, framework-agnostic, CI-ready.


Quick Start · Features · Assertions · LLM-as-Judge · CI Integration · PyPI


Why Vigil?

40% of agentic AI projects risk cancellation due to reliability issues. Only 52% of teams run any form of evaluation. The rest ship and pray.

Existing tools are either framework-specific, complex to set up, or disconnected from CI/CD. There's no "pytest for AI."

Vigil changes that. Write AI tests in plain Python. Run them in CI. Catch regressions before production.

pip install vigil-eval

Quick Start

from vigil import test, FunctionAgent, assert_contains, assert_cost_under

agent = FunctionAgent(my_chatbot)

@test()
def test_greeting():
    result = agent.run("Hello!")
    assert_contains(result, "hello")
    assert_cost_under(result, 0.01)
Then run your tests:

vigil run

That's it. No config files, no setup, no boilerplate.

Features

| Feature | Description |
| --- | --- |
| 11+ assertions | Content, cost, latency, semantic, hallucination, LLM-as-judge |
| 3 agent types | Python functions, HTTP APIs, CLI tools |
| LLM-as-judge | Use GPT/Claude to grade your agent's output against criteria |
| Snapshot testing | Save golden outputs, detect regressions automatically |
| Cost tracking | Auto-calculate API costs for 20+ models across 4 providers |
| Plugin system | Extend with custom assertions, agents, and reporters |
| CI-ready | Exit codes, JSON/HTML reports, GitHub Action included |
| Parallel execution | Run tests concurrently with --parallel |
| Async support | Test async agents natively |
| Zero config | Works out of the box, configure when you need to |
| Built on pytest | Use everything you already know |

Assertions

from vigil import (
    assert_contains,          # output contains expected text
    assert_not_contains,      # output does not contain text
    assert_json_valid,        # output is valid JSON
    assert_matches_regex,     # output matches regex pattern
    assert_cost_under,        # API cost below threshold
    assert_tokens_under,      # token usage below limit
    assert_latency_under,     # response time below threshold
    assert_semantic_match,    # semantically similar to reference
    assert_no_hallucination,  # output grounded in provided context
    assert_quality,           # LLM grades output against criteria
    assert_rubric,            # LLM grades against multiple criteria
)

LLM-as-Judge

The most powerful feature in Vigil. Instead of pattern matching, an LLM evaluates your agent's output:

from vigil import test, FunctionAgent, assert_quality, assert_rubric

agent = FunctionAgent(my_agent)

@test()
def test_explanation():
    result = agent.run("Explain quantum computing to a 5-year-old")
    assert_quality(
        result,
        criteria="age-appropriate, accurate, under 100 words, uses simple analogies",
        threshold=0.7,
    )

@test()
def test_article():
    result = agent.run("Write about climate change")
    assert_rubric(
        result,
        rubric={
            "accuracy": "All claims are factually correct",
            "clarity": "Easy to understand, no jargon",
            "completeness": "Covers causes, effects, and solutions",
        },
        threshold=0.7,
    )

Works with OpenAI and Anthropic. Auto-detects your API key.
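"Auto-detects your API key" presumably means the judge provider is chosen from your environment. As a hypothetical sketch (the actual detection order and logic may differ), the idea is:

```python
import os

# Hypothetical sketch of judge-provider auto-detection from environment
# variables; Vigil's actual detection logic may differ.
def detect_judge_provider():
    if os.environ.get("OPENAI_API_KEY"):
        return "openai"
    if os.environ.get("ANTHROPIC_API_KEY"):
        return "anthropic"
    return None
```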

Agent Types

from vigil import FunctionAgent, HTTPAgent, CLIAgent

# Test a Python function (sync or async)
agent = FunctionAgent(my_function)

# Test an HTTP endpoint
agent = HTTPAgent("http://localhost:8000/chat")

# Test a CLI tool
agent = CLIAgent("python my_agent.py")

Cost Tracking

Vigil auto-calculates API costs for 20+ models:

from vigil import FunctionAgent, assert_cost_under
from vigil.cost import enrich_result

def my_agent(msg):
    return {"output": "Hello!", "model": "gpt-4o-mini", "tokens_input": 50, "tokens_output": 100}

result = FunctionAgent(my_agent).run("Hi")
enrich_result(result)
print(result.cost)  # Auto-calculated from pricing table

assert_cost_under(result, 0.01)

Supports OpenAI, Anthropic, Google, and DeepSeek models.
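The arithmetic behind auto-calculated costs is straightforward: token counts times per-token prices. The rates below are hypothetical placeholders, not Vigil's actual pricing table:

```python
# Hypothetical (input, output) USD rates per 1M tokens -- illustration only.
PRICES = {"gpt-4o-mini": (0.15, 0.60)}

def estimate_cost(model: str, tokens_input: int, tokens_output: int) -> float:
    """Cost = input tokens * input rate + output tokens * output rate."""
    price_in, price_out = PRICES[model]
    return (tokens_input * price_in + tokens_output * price_out) / 1_000_000
```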

Snapshot Testing

Save golden outputs and detect when your agent's behavior drifts:

from vigil import test, FunctionAgent
from vigil.snapshots import snapshot

@test()
def test_output_stable():
    result = FunctionAgent(my_agent).run("Summarize this document")
    snapshot(result, name="summary_output")
# First run: saves snapshot. Next runs: compare against it.
vigil run

# Accept new outputs when they intentionally change
vigil snapshot update

CI Integration

GitHub Actions

name: AI Tests
on: [push, pull_request]
jobs:
  test:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with:
          python-version: "3.12"
      - run: pip install vigil-eval
      - run: vigil run
        env:
          OPENAI_API_KEY: ${{ secrets.OPENAI_API_KEY }}

Reports

vigil run --report json > results.json   # For pipelines
vigil run --report html                  # Standalone HTML report

Parallel Execution

pip install "vigil-eval[parallel]"
vigil run --parallel 4

Configuration

Works with zero config. When you need it, create vigil.yaml:

defaults:
  cost_threshold: 0.05
  latency_threshold: 5.0
  semantic_threshold: 0.85

reporting:
  format: terminal
  verbose: true

Or add to your existing pyproject.toml:

[tool.vigil]
cost_threshold = 0.05
latency_threshold = 5.0

Plugins

Extend Vigil with custom assertions, agents, and reporters:

from vigil.plugins import register_assertion

@register_assertion("assert_polite")
def assert_polite(result, **kwargs):
    polite_words = ["please", "thank", "sorry", "appreciate"]
    if not any(word in result.output.lower() for word in polite_words):
        raise AssertionError("Output is not polite")

Plugins are auto-discovered via Python entry points.
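The register_assertion decorator suggests a name-to-function registry. As an illustrative sketch of that pattern (not Vigil's internal implementation):

```python
# Illustrative decorator-registry pattern; Vigil's internals may differ.
REGISTRY = {}

def register(name: str):
    """Register a callable under `name` and return it unchanged."""
    def wrap(fn):
        REGISTRY[name] = fn
        return fn
    return wrap

@register("assert_polite")
def assert_polite(output: str) -> None:
    polite_words = ("please", "thank", "sorry", "appreciate")
    if not any(word in output.lower() for word in polite_words):
        raise AssertionError("Output is not polite")
```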

How Vigil Compares

| | Vigil | DeepEval | Promptfoo | RAGAS |
| --- | --- | --- | --- | --- |
| Setup | pip install and go | Requires account signup | Node.js + YAML config | pip install + config |
| Write tests in | Python (pytest) | Python (pytest) | YAML | Python |
| Zero config | Yes | No | No | No |
| Agent testing | Functions, HTTP, CLI | Functions only | HTTP, CLI | RAG only |
| Async support | Built-in | Limited | N/A | No |
| Cost tracking | Auto (20+ models) | Manual | No | No |
| Snapshot testing | Built-in | No | No | No |
| LLM-as-judge | Yes | Yes | Yes | Yes |
| Plugin system | Yes | No | Yes | No |
| CI/GitHub Action | Included | Separate | Separate | No |
| Cloud required | No | Free tier limited | Optional | No |
| Lines to first test | 5 | 15+ | 20+ (YAML) | 10+ |

Vigil is built for developers who want pytest-like simplicity, not a platform.

Install

pip install vigil-eval                    # Core
pip install "vigil-eval[openai]"          # + OpenAI (LLM-as-judge, embeddings)
pip install "vigil-eval[anthropic]"       # + Anthropic
pip install "vigil-eval[all]"             # Everything

License

Apache 2.0
