CLI-first LLM evaluation framework — like Pytest for AI agents

 ██████╗ ██████╗ ███████╗███╗   ██╗███████╗██╗   ██╗ █████╗ ██╗     
██╔═══██╗██╔══██╗██╔════╝████╗  ██║██╔════╝██║   ██║██╔══██╗██║     
██║   ██║██████╔╝█████╗  ██╔██╗ ██║█████╗  ██║   ██║███████║██║     
██║   ██║██╔═══╝ ██╔══╝  ██║╚██╗██║██╔══╝  ╚██╗ ██╔╝██╔══██║██║     
╚██████╔╝██║     ███████╗██║ ╚████║███████╗ ╚████╔╝ ██║  ██║███████╗
 ╚═════╝ ╚═╝     ╚══════╝╚═╝  ╚═══╝╚══════╝  ╚═══╝  ╚═╝  ╚═╝╚══════╝

CLI-first LLM evaluation — like Pytest for AI agents

DeepEval × Braintrust — but CLI-first, self-hosted, and free forever


Why OpenEval?

LLM outputs are non-deterministic. You can't just assertEqual. You need specialized scorers that understand semantics, faithfulness, and tool usage.
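
To make the point concrete, here is a minimal sketch in plain Python with no OpenEval imports. The helper `contains_any` is hypothetical and only mirrors the idea behind a keyword scorer; it is not the library's implementation.

```python
# Two valid answers to "What is 2+2?" that a model might give on different runs.
answers = ["4", "The answer is four."]
expected = "4"


def contains_any(output: str, keywords: list[str]) -> bool:
    """Hypothetical helper: True if any keyword occurs in the output (case-insensitive)."""
    lowered = output.lower()
    return any(keyword.lower() in lowered for keyword in keywords)


# Strict equality rejects the second answer even though it is correct.
exact = [answer == expected for answer in answers]
# A keyword check accepts both phrasings.
keyword = [contains_any(answer, ["4", "four"]) for answer in answers]

print(exact)    # [True, False]
print(keyword)  # [True, True]
```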

OpenEval gives you:

  • 7 built-in scorers — from exact match to LLM-as-a-Judge
  • CLI-first: openeval run eval.py with beautiful terminal output
  • CI/CD native: --fail-under 0.8 breaks your build on quality drops
  • Self-contained HTML reports — share results without a server
  • Cost tracking — know exactly how much each eval costs
  • 100% self-hosted — works with Ollama for $0 local evals
  • Zero vendor lock-in — your data stays on your machine

Quick Start

pip install openeval-cli

Create eval.py:

from openai import OpenAI
from openeval import Eval
from openeval.scorers import ContainsAnyScorer, FaithfulnessScorer

client = OpenAI()

def my_agent(question: str) -> str:
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": question}]
    )
    return response.choices[0].message.content

result = Eval(
    name="my-eval",
    data=[
        {"input": "What is 2+2?", "expected_output": "4"},
        {"input": "Return policy?", "expected_output": "30 days", "context": ["30-day refund policy"]},
    ],
    task=my_agent,
    scorers=[
        ContainsAnyScorer(keywords=["4", "four"]),
        FaithfulnessScorer(client=client),
    ],
)

Run it:

openeval run eval.py

Output:

┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┓
┃  Experiment: my-eval                 ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┩
│ Scorer       │ Mean    │ Pass Rate   │
├──────────────┼─────────┼─────────────┤
│ ContainsAny  │ 1.0000  │ 100%        │
│ Faithfulness │ 0.9500  │ 100%        │
├──────────────┴─────────┴─────────────┤
│ Duration: 2.3s                       │
│ Cost: $0.00045                       │
└──────────────────────────────────────┘

Why NOT DeepEval / AgentOps / Braintrust?

| | OpenEval | DeepEval | AgentOps | Braintrust |
|---|---|---|---|---|
| Price | ✅ Free forever | Freemium | Freemium | $249/mo |
| CLI-first | ✅ Native | ❌ Library-only | ❌ Dashboard-first | ❌ Web-only |
| Self-contained HTML | ✅ No server needed | ❌ Requires platform | ❌ Requires app | ❌ Web-only |
| CI/CD native | ✅ Exit codes | ⚠️ Manual | ⚠️ Manual | ❌ No |
| Local LLM support | ✅ Ollama | ❌ OpenAI only | ⚠️ Partial | ❌ No |
| Philosophy | Tool you own | Framework | Platform | SaaS |
| Best for | CI/CD quality gates | Research evals | Production monitoring | Teams |

OpenEval is a tool, not a platform. You own your data, you run it where you want.


CLI Usage

# Basic run
openeval run eval.py

# Generate HTML report
openeval run eval.py --report results.html

# Fail CI if scores below threshold
openeval run eval.py --fail-under 0.8

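# Run with Ollama (free, local): point the OpenAI client at Ollama's
# OpenAI-compatible endpoint. The client still requires a key (any placeholder
# works), and the model name in eval.py must be one you have pulled locally.
OPENAI_BASE_URL=http://localhost:11434/v1 OPENAI_API_KEY=ollama openeval run eval.py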

Scorers

| Scorer | Type | What it checks |
|---|---|---|
| ExactMatchScorer | Deterministic | Output matches expected exactly |
| ContainsAnyScorer | Deterministic | Output contains at least one keyword |
| ContainsAllScorer | Deterministic | Output contains all keywords |
| SimilarityScorer | Embedding | Cosine similarity via embeddings |
| LLMJudgeScorer | LLM-as-a-Judge | Custom criteria evaluated by LLM |
| FaithfulnessScorer | LLM-as-a-Judge | Is output grounded in context? (hallucination detection) |
| ToolCorrectnessScorer | Deterministic | Did the agent call the right tools? |
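
For readers unfamiliar with the metric behind the embedding-based scorer, the following is a minimal sketch of cosine similarity in plain Python. It illustrates the formula only; how SimilarityScorer obtains embeddings or handles edge cases is not documented here, and the function name and zero-vector convention below are assumptions.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # convention chosen for this sketch; zero vectors have no direction
    return dot / (norm_a * norm_b)


# Toy 3-dimensional vectors; real embeddings come from an embedding model.
same_direction = cosine_similarity([1.0, 2.0, 3.0], [2.0, 4.0, 6.0])
orthogonal = cosine_similarity([1.0, 0.0, 0.0], [0.0, 1.0, 0.0])
```

Vectors pointing the same way score close to 1.0 regardless of length, and unrelated (orthogonal) vectors score 0.0, which is why the metric suits comparing paraphrased outputs.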

Custom scorers:

from openeval.scorers.base import FunctionScorer

length_scorer = FunctionScorer(
    name="OutputLength",
    fn=lambda tc: min(len(tc.actual_output) / 100, 1.0),
)
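
A scoring function like the one above can be checked in isolation before wiring it into a scorer. The sketch below uses a hypothetical stand-in class, `FakeCase`, that exposes only the `actual_output` attribute the function reads; it is not OpenEval's TestCase model.

```python
from dataclasses import dataclass


@dataclass
class FakeCase:
    """Hypothetical stand-in with only the attribute the scoring function reads."""
    actual_output: str


def length_score(tc) -> float:
    """Same logic as the lambda above: 1 point per 100 characters, capped at 1.0."""
    return min(len(tc.actual_output) / 100, 1.0)


short = length_score(FakeCase(actual_output="x" * 25))
long = length_score(FakeCase(actual_output="x" * 400))
print(short, long)  # 0.25 1.0
```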

Datasets

from openeval.dataset import Dataset

# Load from file
ds = Dataset.from_csv("test_cases.csv")
ds = Dataset.from_json("test_cases.json")

# Filter and sample
ds_easy = ds.filter(tags=["easy"])
ds_sample = ds.sample(50)
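
The file layout that Dataset.from_json expects is not shown here. As an assumption, the sketch below writes a JSON list of objects using the field names from the Quick Start (`input`, `expected_output`, `context`) plus a `tags` field inferred from `ds.filter(tags=...)`, then reproduces the tag filter in plain Python. Verify the actual schema against the project before relying on it.

```python
import json
from pathlib import Path

# Assumed schema: one object per test case, field names taken from the Quick Start.
cases = [
    {"input": "What is 2+2?", "expected_output": "4", "tags": ["easy"]},
    {
        "input": "Return policy?",
        "expected_output": "30 days",
        "context": ["30-day refund policy"],
        "tags": ["policy"],
    },
]

path = Path("test_cases.json")
path.write_text(json.dumps(cases, indent=2), encoding="utf-8")

# Plain-Python equivalent of filtering by tag, to show the intended semantics.
loaded = json.loads(path.read_text(encoding="utf-8"))
easy = [case for case in loaded if "easy" in case.get("tags", [])]
print(len(loaded), len(easy))  # 2 1
```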

CI/CD Integration

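# .github/workflows/llm-eval.yml
name: LLM Quality Gate
on: [pull_request]
jobs:
  eval:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - run: pip install openeval-cli
      # Assumes the eval calls the OpenAI API as in the Quick Start;
      # store the key as a repository secret.
      - run: openeval run tests/eval_chatbot.py --fail-under 0.8
        env:
          OPENAI_API_KEY: ${{ secrets.OPENAI_API_KEY }}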

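openeval exits with code 1 when scores fall below the --fail-under threshold, which fails the job. If the check is marked as required in branch protection, the PR is blocked.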
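
The gating logic can be sketched in a few lines. How OpenEval aggregates scores for --fail-under is not stated here, so the rule below (fail if any scorer's mean is under the threshold) and the function name `gate_exit_code` are assumptions for illustration only.

```python
def gate_exit_code(scorer_means: dict[str, float], fail_under: float) -> int:
    """Return 1 if any scorer's mean is below the threshold, else 0 (assumed rule)."""
    failing = {name: mean for name, mean in scorer_means.items() if mean < fail_under}
    for name, mean in sorted(failing.items()):
        print(f"FAIL {name}: {mean:.4f} < {fail_under}")
    return 1 if failing else 0


passing = gate_exit_code({"ContainsAny": 1.0, "Faithfulness": 0.95}, fail_under=0.8)
blocked = gate_exit_code({"ContainsAny": 1.0, "Faithfulness": 0.72}, fail_under=0.8)
print(passing, blocked)  # 0 1
```

A wrapper script would pass the returned value to `sys.exit`, and CI treats any non-zero exit code as a failed step.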


Cost Tracking

# Costs tracked automatically
print(f"Total cost: ${result.total_cost_usd:.6f}")
print(f"Total tokens: {result.summary['total_tokens']}")

# Breakdown by scorer; skip scalar entries such as total_tokens
for scorer_name, stats in result.summary.items():
    if isinstance(stats, dict):
        print(f"{scorer_name}: ${stats.get('cost_usd', 0):.6f}")
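
For intuition about where such figures come from, token-based cost is typically the token count multiplied by a per-token price. The sketch below shows that arithmetic only; the function name and the prices are placeholders, not real provider rates or OpenEval's internal pricing table.

```python
def cost_usd(
    prompt_tokens: int,
    completion_tokens: int,
    input_price_per_million: float,
    output_price_per_million: float,
) -> float:
    """Cost of one call given token counts and USD prices per one million tokens."""
    total = (
        prompt_tokens * input_price_per_million
        + completion_tokens * output_price_per_million
    )
    return total / 1_000_000


# Placeholder prices chosen for easy arithmetic, not actual rates.
example = cost_usd(1_000, 500, input_price_per_million=1.0, output_price_per_million=2.0)
print(f"${example:.6f}")  # $0.002000
```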

Project Structure

openeval/
├── eval.py              # Eval() orchestrator
├── test_case.py         # TestCase data model
├── types.py             # ScoreResult, ExperimentResult
├── dataset.py           # Dataset loading and filtering
├── tracing.py           # @trace decorator
├── cost.py              # Token and cost tracking
├── report.py            # HTML report generator
├── cli.py               # CLI interface
└── scorers/
    ├── base.py          # BaseScorer interface
    ├── exact_match.py
    ├── contains.py
    ├── similarity.py    # Embedding-based
    ├── llm_judge.py     # LLM-as-a-Judge
    ├── faithfulness.py  # Hallucination detection
    └── tool_correctness.py

Development

git clone https://github.com/edmontecristo/openeval.git
cd openeval
pip install -e ".[dev]"
pytest tests/ -v

License

MIT © OpenEval Contributors


Built for developers who ship AI products.
