CLI-first LLM evaluation framework — like Pytest for AI agents
OpenEval
DeepEval × Braintrust — but CLI-first, self-hosted, and free forever
Why OpenEval?
LLM outputs are non-deterministic. You can't just assertEqual. You need specialized scorers that understand semantics, faithfulness, and tool usage.
OpenEval gives you:
- 7 built-in scorers: from exact match to LLM-as-a-Judge
- CLI-first: openeval run eval.py with beautiful terminal output
- CI/CD native: --fail-under 0.8 breaks your build on quality drops
- Self-contained HTML reports: share results without a server
- Cost tracking: know exactly how much each eval costs
- 100% self-hosted: works with Ollama for $0 local evals
- Zero vendor lock-in: your data stays on your machine
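To make the motivation concrete, here is a minimal plain-Python sketch of why strict equality is too brittle for free-form answers while a keyword check tolerates rephrasing. It is not OpenEval's implementation. The functions `exact_match` and `contains_any` are hypothetical stand-ins for the idea behind `ExactMatchScorer` and `ContainsAnyScorer`, and the case-insensitive matching is an assumption.

```python
def exact_match(actual: str, expected: str) -> float:
    """Score 1.0 only when the output equals the expected string (ignoring outer whitespace)."""
    return 1.0 if actual.strip() == expected.strip() else 0.0


def contains_any(actual: str, keywords: list[str]) -> float:
    """Score 1.0 when at least one keyword occurs in the output (case-insensitive, an assumption)."""
    lowered = actual.lower()
    return 1.0 if any(k.lower() in lowered for k in keywords) else 0.0


# Two plausible model answers to "What is 2+2?"
answers = ["4", "2 + 2 equals four."]

exact_scores = [exact_match(a, "4") for a in answers]
keyword_scores = [contains_any(a, ["4", "four"]) for a in answers]

print(exact_scores)    # [1.0, 0.0]
print(keyword_scores)  # [1.0, 1.0]
```

The second answer is correct but fails the equality check, which is the gap that semantic and keyword scorers are meant to close.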
Quick Start
pip install openeval-cli
Create eval.py:
from openai import OpenAI
from openeval import Eval
from openeval.scorers import ContainsAnyScorer, FaithfulnessScorer

client = OpenAI()

def my_agent(question: str) -> str:
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": question}]
    )
    return response.choices[0].message.content

result = Eval(
    name="my-eval",
    data=[
        {"input": "What is 2+2?", "expected_output": "4"},
        {"input": "Return policy?", "expected_output": "30 days", "context": ["30-day refund policy"]},
    ],
    task=my_agent,
    scorers=[
        ContainsAnyScorer(keywords=["4", "four"]),
        FaithfulnessScorer(client=client),
    ],
)
Run it:
openeval run eval.py
Output:
┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┓
┃ Experiment: my-eval ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┩
│ Scorer │ Mean │ Pass Rate │
├──────────────┼─────────┼─────────────┤
│ ContainsAny │ 1.0000 │ 100% │
│ Faithfulness │ 0.9500 │ 100% │
├──────────────┴─────────┴─────────────┤
│ Duration: 2.3s │
│ Cost: $0.00045 │
└────────────────────────────────────────┘
Why NOT DeepEval / AgentOps / Braintrust?
| | OpenEval | DeepEval | AgentOps | Braintrust |
|---|---|---|---|---|
| Price | ✅ Free forever | Freemium | Freemium | $249/mo |
| CLI-first | ✅ Native | ❌ Library-only | ❌ Dashboard-first | ❌ Web-only |
| Self-contained HTML | ✅ No server needed | ❌ Requires platform | ❌ Requires app | ❌ Web-only |
| CI/CD native | ✅ Exit codes | ⚠️ Manual | ⚠️ Manual | ❌ No |
| Local LLM support | ✅ Ollama | ❌ OpenAI only | ⚠️ Partial | ❌ No |
| Philosophy | Tool you own | Framework | Platform | SaaS |
| Best for | CI/CD quality gates | Research evals | Production monitoring | Teams |
OpenEval is a tool, not a platform. You own your data, you run it where you want.
CLI Usage
# Basic run
openeval run eval.py
# Generate HTML report
openeval run eval.py --report results.html
# Fail CI if scores below threshold
openeval run eval.py --fail-under 0.8
# Run with Ollama (free, local): point the OpenAI client at the local endpoint
export OPENAI_BASE_URL=http://localhost:11434/v1
openeval run eval.py
Scorers
| Scorer | Type | What it checks |
|---|---|---|
| ExactMatchScorer | Deterministic | Output matches expected exactly |
| ContainsAnyScorer | Deterministic | Output contains at least one keyword |
| ContainsAllScorer | Deterministic | Output contains all keywords |
| SimilarityScorer | Embedding | Cosine similarity via embeddings |
| LLMJudgeScorer | LLM-as-a-Judge | Custom criteria evaluated by LLM |
| FaithfulnessScorer | LLM-as-a-Judge | Is output grounded in context? (hallucination detection) |
| ToolCorrectnessScorer | Deterministic | Did the agent call the right tools? |
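For readers unfamiliar with the metric behind the embedding row, the following is a small sketch of cosine similarity in plain Python. It only illustrates the formula. How `SimilarityScorer` obtains embeddings or maps the similarity to a score is not documented here, and the helper name `cosine_similarity` and the toy vectors are assumptions for illustration.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Return the cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        # Similarity is undefined for a zero vector; treat it as no similarity.
        return 0.0
    return dot / (norm_a * norm_b)


# Toy "embeddings": real ones would come from an embedding model.
same_direction = cosine_similarity([1.0, 2.0], [2.0, 4.0])  # close to 1.0
orthogonal = cosine_similarity([1.0, 0.0], [0.0, 1.0])      # 0.0

print(round(same_direction, 6), orthogonal)
```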
Custom scorers:
from openeval.scorers.base import FunctionScorer

length_scorer = FunctionScorer(
    name="OutputLength",
    fn=lambda tc: min(len(tc.actual_output) / 100, 1.0),
)
Datasets
from openeval.dataset import Dataset
# Load from file
ds = Dataset.from_csv("test_cases.csv")
ds = Dataset.from_json("test_cases.json")
# Filter and sample
ds_easy = ds.filter(tags=["easy"])
ds_sample = ds.sample(50)
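The file format expected by `Dataset.from_json` is not spelled out above. As a hedged sketch, the script below writes a `test_cases.json` that reuses the keys from the Quick Start (`input`, `expected_output`, `context`) plus a `tags` list, which is an assumption inferred from `ds.filter(tags=[...])`. It then mimics tag filtering with a plain list comprehension, so it runs without OpenEval installed.

```python
import json
from pathlib import Path

# Assumed schema: a JSON array of objects; "tags" is a guess based on ds.filter(tags=...).
test_cases = [
    {"input": "What is 2+2?", "expected_output": "4", "tags": ["easy"]},
    {
        "input": "Return policy?",
        "expected_output": "30 days",
        "context": ["30-day refund policy"],
        "tags": ["policy"],
    },
]

path = Path("test_cases.json")
path.write_text(json.dumps(test_cases, indent=2), encoding="utf-8")

# Plain-Python equivalent of filtering by tag, for illustration only.
loaded = json.loads(path.read_text(encoding="utf-8"))
easy = [tc for tc in loaded if "easy" in tc.get("tags", [])]

print(len(loaded), len(easy))  # 2 1
```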
CI/CD Integration
# .github/workflows/llm-eval.yml
name: LLM Quality Gate
on: [pull_request]
jobs:
  eval:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - run: pip install openeval-cli
      - run: openeval run tests/eval_chatbot.py --fail-under 0.8
Exit code 1 when quality drops → PR blocked.
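The gating idea can be sketched in a few lines of Python. This is an illustration, not the CLI's actual code. The function name `gate_exit_code` is hypothetical, and applying the threshold to each scorer's mean (rather than to one overall average) is an assumption.

```python
def gate_exit_code(mean_scores: dict[str, float], fail_under: float) -> int:
    """Return 1 if any scorer's mean falls below the threshold, else 0."""
    failing = {name: score for name, score in mean_scores.items() if score < fail_under}
    for name, score in failing.items():
        print(f"FAIL {name}: {score:.4f} < {fail_under}")
    return 1 if failing else 0


# Means from the Quick Start output pass a 0.8 threshold.
passing = gate_exit_code({"ContainsAny": 1.0, "Faithfulness": 0.95}, fail_under=0.8)

# A hypothetical regression in faithfulness would block the PR.
blocked = gate_exit_code({"ContainsAny": 1.0, "Faithfulness": 0.72}, fail_under=0.8)

print(passing, blocked)  # 0 1
```

A CI runner treats any non-zero exit code from a step as a failure, which is why a plain exit code is enough to block a merge.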
Cost Tracking
# Costs tracked automatically
print(f"Total cost: ${result.total_cost_usd:.6f}")
print(f"Total tokens: {result.summary['total_tokens']}")

# Breakdown by scorer
for scorer_name, stats in result.summary.items():
    # Skip scalar entries such as total_tokens; only per-scorer dicts carry cost_usd
    if isinstance(stats, dict):
        print(f"{scorer_name}: ${stats.get('cost_usd', 0):.6f}")
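The arithmetic behind token-based cost tracking is simple. The sketch below shows it under stated assumptions: the `Pricing` class, the `call_cost_usd` helper, and the per-million-token prices are all hypothetical placeholders, not OpenEval's internals or any provider's real rates.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Pricing:
    """Hypothetical price table entry, in USD per one million tokens."""
    input_per_million: float
    output_per_million: float


def call_cost_usd(prompt_tokens: int, completion_tokens: int, pricing: Pricing) -> float:
    """Cost of one call: tokens times the per-token rate, summed over input and output."""
    total = (
        prompt_tokens * pricing.input_per_million
        + completion_tokens * pricing.output_per_million
    )
    return total / 1_000_000


# Placeholder prices chosen for easy arithmetic, not real rates.
example_pricing = Pricing(input_per_million=1.0, output_per_million=2.0)

# (1000 * 1.0 + 500 * 2.0) / 1_000_000 = 0.002
cost = call_cost_usd(prompt_tokens=1_000, completion_tokens=500, pricing=example_pricing)
print(f"${cost:.6f}")
```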
Project Structure
openeval/
├── eval.py              # Eval() orchestrator
├── test_case.py         # TestCase data model
├── types.py             # ScoreResult, ExperimentResult
├── dataset.py           # Dataset loading and filtering
├── tracing.py           # @trace decorator
├── cost.py              # Token and cost tracking
├── report.py            # HTML report generator
├── cli.py               # CLI interface
└── scorers/
    ├── base.py          # BaseScorer interface
    ├── exact_match.py
    ├── contains.py
    ├── similarity.py    # Embedding-based
    ├── llm_judge.py     # LLM-as-a-Judge
    ├── faithfulness.py  # Hallucination detection
    └── tool_correctness.py
Development
git clone https://github.com/edmontecristo/openeval.git
cd openeval
pip install -e ".[dev]"
pytest tests/ -v
License
MIT © OpenEval Contributors
Built for developers who ship AI products.