The testing framework for AI agents. Fast, framework-agnostic, CI-ready.
Quick Start · Features · Assertions · LLM-as-Judge · CI Integration · PyPI
Why Vigil?
40% of agentic AI projects risk cancellation due to reliability issues. Only 52% of teams run any form of evaluation. The rest ship and pray.
Existing tools are either framework-specific, complex to set up, or disconnected from CI/CD. There's no "pytest for AI."
Vigil changes that. Write AI tests in plain Python. Run them in CI. Catch regressions before production.
```bash
pip install vigil-eval
```
Quick Start
```python
from vigil import test, FunctionAgent, assert_contains, assert_cost_under

agent = FunctionAgent(my_chatbot)

@test()
def test_greeting():
    result = agent.run("Hello!")
    assert_contains(result, "hello")
    assert_cost_under(result, 0.01)
```
```bash
vigil run
```
That's it. No config files, no setup, no boilerplate.
Features
| Feature | Description |
|---|---|
| 11+ assertions | Content, cost, latency, semantic, hallucination, LLM-as-judge |
| 3 agent types | Python functions, HTTP APIs, CLI tools |
| LLM-as-judge | Use GPT/Claude to grade your agent's output against criteria |
| Snapshot testing | Save golden outputs, detect regressions automatically |
| Cost tracking | Auto-calculate API costs for 20+ models across 4 providers |
| Plugin system | Extend with custom assertions, agents, and reporters |
| CI-ready | Exit codes, JSON/HTML reports, GitHub Action included |
| Parallel execution | Run tests concurrently with --parallel |
| Async support | Test async agents natively |
| Zero config | Works out of the box, configure when you need to |
| Built on pytest | Use everything you already know |
Assertions
```python
from vigil import (
    assert_contains,          # output contains expected text
    assert_not_contains,      # output does not contain text
    assert_json_valid,        # output is valid JSON
    assert_matches_regex,     # output matches regex pattern
    assert_cost_under,        # API cost below threshold
    assert_tokens_under,      # token usage below limit
    assert_latency_under,     # response time below threshold
    assert_semantic_match,    # semantically similar to reference
    assert_no_hallucination,  # output grounded in provided context
    assert_quality,           # LLM grades output against criteria
    assert_rubric,            # LLM grades against multiple criteria
)
```
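The deterministic assertions reduce to ordinary Python checks on the agent's output. A minimal standalone sketch of the first few, illustrative only and not Vigil's actual implementation (e.g. whether `assert_contains` is case-insensitive is an assumption here):

```python
import json
import re

def assert_contains(output: str, expected: str) -> None:
    # Case-insensitive substring check (assumption for this sketch).
    if expected.lower() not in output.lower():
        raise AssertionError(f"expected {expected!r} in output")

def assert_json_valid(output: str) -> None:
    # Output must parse as JSON.
    try:
        json.loads(output)
    except json.JSONDecodeError as e:
        raise AssertionError(f"output is not valid JSON: {e}") from e

def assert_matches_regex(output: str, pattern: str) -> None:
    # Output must match the given regular expression.
    if re.search(pattern, output) is None:
        raise AssertionError(f"output does not match {pattern!r}")

assert_contains("Hello there!", "hello")
assert_json_valid('{"ok": true}')
assert_matches_regex("Order #1234 confirmed", r"#\d{4}")
```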
LLM-as-Judge
The most powerful feature in Vigil. Instead of pattern matching, an LLM evaluates your agent's output:
```python
from vigil import test, FunctionAgent, assert_quality, assert_rubric

agent = FunctionAgent(my_agent)

@test()
def test_explanation():
    result = agent.run("Explain quantum computing to a 5-year-old")
    assert_quality(
        result,
        criteria="age-appropriate, accurate, under 100 words, uses simple analogies",
        threshold=0.7,
    )

@test()
def test_article():
    result = agent.run("Write about climate change")
    assert_rubric(
        result,
        rubric={
            "accuracy": "All claims are factually correct",
            "clarity": "Easy to understand, no jargon",
            "completeness": "Covers causes, effects, and solutions",
        },
        threshold=0.7,
    )
```
Works with OpenAI and Anthropic. Auto-detects your API key.
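Conceptually, a rubric check scores each criterion via the judge LLM and compares an aggregate to the threshold. A standalone sketch of the aggregation step, with hard-coded scores standing in for judge calls (mean aggregation is an assumption of this sketch, not a documented Vigil behavior):

```python
def rubric_passes(scores: dict, threshold: float) -> bool:
    # Average the per-criterion scores (each in [0, 1]) and
    # pass when the mean clears the threshold.
    mean = sum(scores.values()) / len(scores)
    return mean >= threshold

# Hypothetical judge scores for the climate-change article example.
scores = {"accuracy": 0.9, "clarity": 0.8, "completeness": 0.6}
print(rubric_passes(scores, threshold=0.7))  # mean ~0.767 -> True
```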
Agent Types
```python
from vigil import FunctionAgent, HTTPAgent, CLIAgent

# Test a Python function (sync or async)
agent = FunctionAgent(my_function)

# Test an HTTP endpoint
agent = HTTPAgent("http://localhost:8000/chat")

# Test a CLI tool
agent = CLIAgent("python my_agent.py")
```
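`FunctionAgent` accepts both sync and async callables. In plain Python, such a dispatcher can be sketched like this (illustrative only, not Vigil's internals):

```python
import asyncio
import inspect

def run_callable(fn, message: str):
    # Await coroutine functions; call plain functions directly.
    if inspect.iscoroutinefunction(fn):
        return asyncio.run(fn(message))
    return fn(message)

def sync_agent(msg):
    return f"sync: {msg}"

async def async_agent(msg):
    return f"async: {msg}"

print(run_callable(sync_agent, "hi"))   # sync: hi
print(run_callable(async_agent, "hi"))  # async: hi
```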
Cost Tracking
Vigil auto-calculates API costs for 20+ models:
```python
from vigil import FunctionAgent, assert_cost_under
from vigil.cost import enrich_result

def my_agent(msg):
    return {"output": "Hello!", "model": "gpt-4o-mini", "tokens_input": 50, "tokens_output": 100}

result = FunctionAgent(my_agent).run("Hi")
enrich_result(result)
print(result.cost)  # Auto-calculated from pricing table
assert_cost_under(result, 0.01)
```
Supports OpenAI, Anthropic, Google, and DeepSeek models.
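The cost calculation itself is a lookup into a per-million-token pricing table. A standalone sketch with illustrative prices (the real prices live in Vigil's pricing table and vendor pricing changes over time):

```python
# Illustrative prices in USD per 1M tokens -- NOT guaranteed-current vendor pricing.
PRICING = {
    "gpt-4o-mini": {"input": 0.15, "output": 0.60},
}

def calc_cost(model: str, tokens_input: int, tokens_output: int) -> float:
    # cost = input tokens * input price + output tokens * output price,
    # with prices quoted per 1M tokens.
    p = PRICING[model]
    return (tokens_input * p["input"] + tokens_output * p["output"]) / 1_000_000

cost = calc_cost("gpt-4o-mini", tokens_input=50, tokens_output=100)
print(f"{cost:.8f}")  # 0.00006750
```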
Snapshot Testing
Save golden outputs and detect when your agent's behavior drifts:
```python
from vigil import test, FunctionAgent
from vigil.snapshots import snapshot

@test()
def test_output_stable():
    result = FunctionAgent(my_agent).run("Summarize this document")
    snapshot(result, name="summary_output")
```

```bash
# First run: saves snapshot. Next runs: compares against it.
vigil run

# Accept new outputs when they intentionally change
vigil snapshot update
```
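Snapshot testing in general boils down to: write the output to disk on the first run, compare against that golden copy on later runs. A minimal standalone sketch of the pattern (not Vigil's storage format):

```python
import json
import tempfile
from pathlib import Path

def check_snapshot(name: str, output: str, snapshot_dir: Path) -> None:
    path = snapshot_dir / f"{name}.json"
    if not path.exists():
        # First run: record the golden output.
        path.write_text(json.dumps({"output": output}))
        return
    # Later runs: fail if the output drifts from the saved golden copy.
    golden = json.loads(path.read_text())["output"]
    if golden != output:
        raise AssertionError(f"snapshot {name!r} drifted: {golden!r} != {output!r}")

with tempfile.TemporaryDirectory() as d:
    check_snapshot("summary_output", "A short summary.", Path(d))  # first run: saves
    check_snapshot("summary_output", "A short summary.", Path(d))  # second run: passes
```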
CI Integration
GitHub Actions
```yaml
name: AI Tests
on: [push, pull_request]
jobs:
  test:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with:
          python-version: "3.12"
      - run: pip install vigil-eval
      - run: vigil run
        env:
          OPENAI_API_KEY: ${{ secrets.OPENAI_API_KEY }}
```
Reports
```bash
vigil run --report json > results.json  # For pipelines
vigil run --report html                 # Standalone HTML report
```
Parallel Execution
```bash
pip install "vigil-eval[parallel]"
vigil run --parallel 4
```
Configuration
Works with zero config. When you need it, create `vigil.yaml`:

```yaml
defaults:
  cost_threshold: 0.05
  latency_threshold: 5.0
  semantic_threshold: 0.85
reporting:
  format: terminal
  verbose: true
```
Or add to your existing `pyproject.toml`:

```toml
[tool.vigil]
cost_threshold = 0.05
latency_threshold = 5.0
```
Plugins
Extend Vigil with custom assertions, agents, and reporters:
```python
from vigil.plugins import register_assertion

@register_assertion("assert_polite")
def assert_polite(result, **kwargs):
    polite_words = ["please", "thank", "sorry", "appreciate"]
    if not any(word in result.output.lower() for word in polite_words):
        raise AssertionError("Output is not polite")
```
Plugins are auto-discovered via Python entry points.
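Entry-point discovery follows the standard Python packaging mechanism. A hedged sketch of what a plugin package's `pyproject.toml` might declare (the group name `vigil.plugins` and module path are assumptions; check Vigil's plugin docs for the actual group):

```toml
[project.entry-points."vigil.plugins"]
polite = "my_plugin_pkg:assert_polite"
```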
How Vigil Compares
| | Vigil | DeepEval | Promptfoo | RAGAS |
|---|---|---|---|---|
| Setup | pip install and go | Requires account signup | Node.js + YAML config | pip install + config |
| Write tests in | Python (pytest) | Python (pytest) | YAML | Python |
| Zero config | Yes | No | No | No |
| Agent testing | Functions, HTTP, CLI | Functions only | HTTP, CLI | RAG only |
| Async support | Built-in | Limited | N/A | No |
| Cost tracking | Auto (20+ models) | Manual | No | No |
| Snapshot testing | Built-in | No | No | No |
| LLM-as-judge | Yes | Yes | Yes | Yes |
| Plugin system | Yes | No | Yes | No |
| CI/GitHub Action | Included | Separate | Separate | No |
| Cloud required | No | Free tier limited | Optional | No |
| Lines to first test | 5 | 15+ | 20+ (YAML) | 10+ |
Vigil is built for developers who want pytest-like simplicity, not a platform.
Install
```bash
pip install vigil-eval               # Core
pip install "vigil-eval[openai]"     # + OpenAI (LLM-as-judge, embeddings)
pip install "vigil-eval[anthropic]"  # + Anthropic
pip install "vigil-eval[all]"        # Everything
```
License
Apache 2.0