openeval-cli

CLI-first LLM evaluation framework — like Pytest for AI agents


 ██████╗ ██████╗ ███████╗███╗   ██╗███████╗██╗   ██╗ █████╗ ██╗     
██╔═══██╗██╔══██╗██╔════╝████╗  ██║██╔════╝██║   ██║██╔══██╗██║     
██║   ██║██████╔╝█████╗  ██╔██╗ ██║█████╗  ██║   ██║███████║██║     
██║   ██║██╔═══╝ ██╔══╝  ██║╚██╗██║██╔══╝  ╚██╗ ██╔╝██╔══██║██║     
╚██████╔╝██║     ███████╗██║ ╚████║███████╗ ╚████╔╝ ██║  ██║███████╗
 ╚═════╝ ╚═╝     ╚══════╝╚═╝  ╚═══╝╚══════╝  ╚═══╝  ╚═╝  ╚═╝╚══════╝

CLI-first LLM evaluation — like Pytest for AI agents

DeepEval × Braintrust, but CLI-first, self-hosted, and free forever

Why OpenEval?

LLM outputs are non-deterministic. You can't just assertEqual. You need specialized scorers that understand semantics, faithfulness, and tool usage.
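To make the point concrete, here is a minimal plain-Python sketch (not OpenEval code; the `contains_any` helper and the sample answers are made up for illustration) of how exact equality rejects answers that a keyword check accepts.

```python
# Three phrasings a model might return for "What is 2+2?" (made-up samples).
answers = ["4", "The answer is 4.", "2 + 2 equals four"]
expected = "4"


def contains_any(output: str, keywords: list[str]) -> bool:
    """Return True if any keyword appears in the output (case-insensitive)."""
    lowered = output.lower()
    return any(keyword.lower() in lowered for keyword in keywords)


# Exact equality only accepts the first phrasing.
exact_results = [answer == expected for answer in answers]

# A keyword check accepts all three.
keyword_results = [contains_any(answer, ["4", "four"]) for answer in answers]

print(exact_results)
print(keyword_results)
```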

OpenEval gives you:

7 built-in scorers — from exact match to LLM-as-a-Judge
CLI-first — openeval run eval.py with beautiful terminal output
CI/CD native — --fail-under 0.8 breaks your build on quality drops
Self-contained HTML reports — share results without a server
Cost tracking — know exactly how much each eval costs
100% self-hosted — works with Ollama for $0 local evals
Zero vendor lock-in — your data stays on your machine

Quick Start

pip install openeval-cli

Create eval.py:

from openai import OpenAI
from openeval import Eval
from openeval.scorers import ContainsAnyScorer, FaithfulnessScorer

# Reads OPENAI_API_KEY (and OPENAI_BASE_URL, if set) from the environment
client = OpenAI()

def my_agent(question: str) -> str:
    """The system under test: send the question to the model and return its reply."""
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": question}]
    )
    # message.content can be None, so fall back to an empty string
    return response.choices[0].message.content or ""

result = Eval(
    name="my-eval",
    data=[
        {"input": "What is 2+2?", "expected_output": "4"},
        {"input": "Return policy?", "expected_output": "30 days", "context": ["30-day refund policy"]},
    ],
    task=my_agent,
    scorers=[
        ContainsAnyScorer(keywords=["4", "four"]),
        FaithfulnessScorer(client=client),
    ],
)

Run it:

openeval run eval.py

Output:

┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┓
┃  Experiment: my-eval                  ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┩
│ Scorer       │ Mean    │ Pass Rate   │
├──────────────┼─────────┼─────────────┤
│ ContainsAny  │ 1.0000  │ 100%        │
│ Faithfulness │ 0.9500  │ 100%        │
├──────────────┴─────────┴─────────────┤
│ Duration: 2.3s                         │
│ Cost: $0.00045                         │
└────────────────────────────────────────┘
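The Mean and Pass Rate columns can be read as simple aggregates over per-case scores. The sketch below shows one plausible way to compute them; the `summarize` helper, the 0.5 pass threshold, and the sample scores are assumptions for illustration, not values taken from OpenEval.

```python
from statistics import mean


def summarize(scores: list[float], pass_threshold: float = 0.5) -> dict[str, float]:
    """Aggregate per-case scores into a mean and a pass rate.

    pass_threshold is a hypothetical cutoff; OpenEval's actual rule may differ.
    """
    passed = sum(1 for score in scores if score >= pass_threshold)
    return {"mean": mean(scores), "pass_rate": passed / len(scores)}


# Hypothetical per-case Faithfulness scores for a two-case run.
summary = summarize([1.0, 0.9])
print(f"Mean: {summary['mean']:.4f}  Pass Rate: {summary['pass_rate']:.0%}")
```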

Why NOT DeepEval / AgentOps / Braintrust?

Feature	OpenEval	DeepEval	AgentOps	Braintrust
Price	✅ Free forever	Freemium	Freemium	$249/mo
CLI-first	✅ Native	❌ Library-only	❌ Dashboard-first	❌ Web-only
Self-contained HTML	✅ No server needed	❌ Requires platform	❌ Requires app	❌ Web-only
CI/CD native	✅ Exit codes	⚠️ Manual	⚠️ Manual	❌ No
Local LLM support	✅ Ollama	❌ OpenAI only	⚠️ Partial	❌ No
Philosophy	Tool you own	Framework	Platform	SaaS
Best for	CI/CD quality gates	Research evals	Production monitoring	Teams

OpenEval is a tool, not a platform. You own your data, you run it where you want.

CLI Usage

# Basic run
openeval run eval.py

# Generate HTML report
openeval run eval.py --report results.html

# Fail CI if scores below threshold
openeval run eval.py --fail-under 0.8

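# Run with Ollama (free, local): point the OpenAI client at Ollama's OpenAI-compatible endpoint.
# The OpenAI client still requires an API key to be set; Ollama ignores its value.
# eval.py must also request a model you have pulled locally instead of gpt-4o-mini.
OPENAI_BASE_URL=http://localhost:11434/v1 OPENAI_API_KEY=ollama openeval run eval.py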

Scorers

Scorer	Type	What it checks
`ExactMatchScorer`	Deterministic	Output matches expected exactly
`ContainsAnyScorer`	Deterministic	Output contains at least one keyword
`ContainsAllScorer`	Deterministic	Output contains all keywords
`SimilarityScorer`	Embedding	Cosine similarity via embeddings
`LLMJudgeScorer`	LLM-as-a-Judge	Custom criteria evaluated by LLM
`FaithfulnessScorer`	LLM-as-a-Judge	Is output grounded in context? (hallucination detection)
`ToolCorrectnessScorer`	Deterministic	Did the agent call the right tools?
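To give a feel for what two of these scorers measure, here is a rough stdlib-only sketch of cosine similarity (the idea behind an embedding scorer) and a tool-call comparison. The function names, the toy vectors, and the order-insensitive matching rule are assumptions for illustration; they are not OpenEval's implementation.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors; 1.0 means same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # Treat a zero vector as having no similarity.
    return dot / (norm_a * norm_b)


def tool_match_score(called: list[str], expected: list[str]) -> float:
    """Fraction of expected tools that the agent actually called (order ignored)."""
    if not expected:
        return 1.0
    called_set = set(called)
    return sum(1 for tool in expected if tool in called_set) / len(expected)


# Toy 3-dimensional "embeddings"; real embeddings have hundreds of dimensions.
print(cosine_similarity([1.0, 0.0, 0.0], [1.0, 0.0, 0.0]))
print(cosine_similarity([1.0, 0.0, 0.0], [0.0, 1.0, 0.0]))

# Hypothetical tool names: one of the two expected tools was called.
print(tool_match_score(["search", "calculator"], ["search", "lookup_order"]))
```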

Custom scorers:

from openeval.scorers.base import FunctionScorer

length_scorer = FunctionScorer(
    name="OutputLength",
    fn=lambda tc: min(len(tc.actual_output) / 100, 1.0),
)
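The scoring function above maps output length onto the range 0 to 1. The sketch below exercises the same rule without OpenEval installed, using `SimpleNamespace` as a stand-in for the test case object; the only attribute assumed is `actual_output`, as used in the example above.

```python
from types import SimpleNamespace


def score_length(tc) -> float:
    """Same rule as the FunctionScorer example: 100 or more characters scores 1.0."""
    return min(len(tc.actual_output) / 100, 1.0)


# SimpleNamespace stands in for OpenEval's test case object (an assumption).
short_case = SimpleNamespace(actual_output="x" * 25)
long_case = SimpleNamespace(actual_output="x" * 250)

print(score_length(short_case))
print(score_length(long_case))
```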

Datasets

from openeval.dataset import Dataset

# Load from file
ds = Dataset.from_csv("test_cases.csv")
ds = Dataset.from_json("test_cases.json")

# Filter and sample
ds_easy = ds.filter(tags=["easy"])
ds_sample = ds.sample(50)
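Conceptually, filtering and sampling reduce to list operations. The following plain-Python sketch shows the idea; the `tags` field layout, the any-tag matching rule, and the fixed seed are assumptions, and OpenEval's `Dataset` may behave differently.

```python
import random

# Hypothetical test cases; the "tags" field layout is assumed for illustration.
cases = [
    {"input": "What is 2+2?", "expected_output": "4", "tags": ["easy", "math"]},
    {"input": "Return policy?", "expected_output": "30 days", "tags": ["policy"]},
    {"input": "What is 3+3?", "expected_output": "6", "tags": ["easy", "math"]},
]


def filter_by_tags(rows: list[dict], tags: list[str]) -> list[dict]:
    """Keep rows that carry at least one of the requested tags."""
    wanted = set(tags)
    return [row for row in rows if wanted & set(row.get("tags", []))]


def sample_rows(rows: list[dict], n: int, seed: int = 0) -> list[dict]:
    """Draw up to n rows without replacement, reproducibly via a seed."""
    rng = random.Random(seed)
    return rng.sample(rows, min(n, len(rows)))


easy = filter_by_tags(cases, ["easy"])
subset = sample_rows(cases, 2)
print(len(easy), len(subset))
```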

CI/CD Integration

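# .github/workflows/llm-eval.yml
name: LLM Quality Gate
on: [pull_request]
jobs:
  eval:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - run: pip install openeval-cli
      - run: openeval run tests/eval_chatbot.py --fail-under 0.8
        env:
          # Needed only if the eval calls the OpenAI API; store the key as a repository secret.
          OPENAI_API_KEY: ${{ secrets.OPENAI_API_KEY }}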

When scores fall below the threshold, the command exits with code 1 and the job fails, which blocks the PR if the check is required.
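The gate boils down to comparing scores with a threshold and turning the result into a process exit code. Here is a hedged sketch of that logic; `gate_exit_code`, the per-scorer comparison, and the sample means are hypothetical and only illustrate the idea behind `--fail-under`.

```python
def gate_exit_code(mean_scores: dict[str, float], fail_under: float) -> int:
    """Return 0 if every scorer's mean meets the threshold, else 1."""
    failing = {name: score for name, score in mean_scores.items() if score < fail_under}
    for name, score in failing.items():
        print(f"FAIL {name}: {score:.4f} < {fail_under}")
    return 1 if failing else 0


# Hypothetical per-scorer means from two runs.
passing_run = {"ContainsAny": 1.0, "Faithfulness": 0.95}
regressed_run = {"ContainsAny": 1.0, "Faithfulness": 0.72}

print(gate_exit_code(passing_run, fail_under=0.8))
print(gate_exit_code(regressed_run, fail_under=0.8))
```

In a real script the return value would be passed to `sys.exit`, and CI marks the step as failed on any non-zero code.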

Cost Tracking

# Costs are tracked automatically
print(f"Total cost: ${result.total_cost_usd:.6f}")
print(f"Total tokens: {result.summary['total_tokens']}")

# Breakdown by scorer; skip non-dict entries such as 'total_tokens'
for scorer_name, stats in result.summary.items():
    if isinstance(stats, dict):
        print(f"{scorer_name}: ${stats.get('cost_usd', 0):.6f}")
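Cost tracking of this kind typically multiplies token counts by a per-token price. A minimal sketch follows; the `estimate_cost_usd` helper, the price table, and the token counts are placeholder values, not real provider pricing and not OpenEval's internals.

```python
# Placeholder prices in USD per 1M tokens; check your provider's current pricing.
PRICES_PER_MILLION = {
    "example-model": {"input": 0.15, "output": 0.60},
}


def estimate_cost_usd(model: str, input_tokens: int, output_tokens: int) -> float:
    """Estimate the cost of one call from its token counts."""
    price = PRICES_PER_MILLION[model]
    total = input_tokens * price["input"] + output_tokens * price["output"]
    return total / 1_000_000


cost = estimate_cost_usd("example-model", input_tokens=1_000, output_tokens=500)
print(f"${cost:.6f}")
```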

Project Structure

openeval/
├── eval.py              # Eval() orchestrator
├── test_case.py         # TestCase data model
├── types.py             # ScoreResult, ExperimentResult
├── dataset.py           # Dataset loading and filtering
├── tracing.py           # @trace decorator
├── cost.py              # Token and cost tracking
├── report.py            # HTML report generator
├── cli.py               # CLI interface
└── scorers/
    ├── base.py          # BaseScorer interface
    ├── exact_match.py
    ├── contains.py
    ├── similarity.py    # Embedding-based
    ├── llm_judge.py     # LLM-as-a-Judge
    ├── faithfulness.py  # Hallucination detection
    └── tool_correctness.py

Development

git clone https://github.com/edmontecristo/openeval.git
cd openeval
pip install -e ".[dev]"
pytest tests/ -v

License

MIT © OpenEval Contributors

Built for developers who ship AI products.


Release history

0.1.1 (Feb 26, 2026)
0.1.0 (Feb 26, 2026), this version

Download files


Source Distribution

openeval_cli-0.1.0.tar.gz (43.3 kB view details)

Uploaded Feb 26, 2026 Source

Built Distribution


openeval_cli-0.1.0-py3-none-any.whl (33.1 kB view details)

Uploaded Feb 26, 2026 Python 3

File details

Details for the file openeval_cli-0.1.0.tar.gz.

File metadata

Download URL: openeval_cli-0.1.0.tar.gz
Upload date: Feb 26, 2026
Size: 43.3 kB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.2.0 CPython/3.12.12

File hashes

Hashes for openeval_cli-0.1.0.tar.gz
Algorithm	Hash digest
SHA256	`f1f10f760a3262da9cf58e9708ab6421cec6eb5c752fafa54e88438389039c5e`
MD5	`3a65fad5fab6c362d67c847e2fb0d9e6`
BLAKE2b-256	`d5c50f6264e66abe9f1125b1b0b13514332531fea410746a629c7685dda22971`


File details

Details for the file openeval_cli-0.1.0-py3-none-any.whl.

File metadata

Download URL: openeval_cli-0.1.0-py3-none-any.whl
Upload date: Feb 26, 2026
Size: 33.1 kB
Tags: Python 3
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.2.0 CPython/3.12.12

File hashes

Hashes for openeval_cli-0.1.0-py3-none-any.whl
Algorithm	Hash digest
SHA256	`1dd2b27c2bd4bd673bd581ca81acfedc99f50ea6fc73cee0c5c3db1b3f41f239`
MD5	`d27eb1a332c7ea6f3ebd1c7a545c6fa0`
BLAKE2b-256	`0dc866ff7658d377bd84c8f94cf0b496236222acaa3bb6690b1e319d5d9f7755`

