AI Evaluator CLI — evaluate your LLM agents from the command line
AI Evaluator CLI — Python
Evaluate your LLM agents from the terminal. No browser. No dashboard.
pip install aievaluator
🧭 Tutorial — From Zero to CI/CD
Every step builds on the previous one. Start wherever makes sense for you.
Level 0 — Try it without installing anything
curl -s -X POST https://api.aievaluator.dev/api/v1/playground/evaluate \
-H "Content-Type: application/json" \
-d '{"queries":["What is 2+2?"],"metrics":["faithfulness"]}' | jq .
5 free evaluations per day. No key. No install. Good enough to decide if it's useful.
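For scripts that would rather not shell out to curl, the same request can be sketched with Python's standard library. The URL and payload fields are copied from the curl command above; the helper names `build_playground_request` and `evaluate` are hypothetical, and the response is returned undecoded beyond JSON parsing because its schema is not described here.

```python
import json
import urllib.request

# Endpoint and payload fields are copied from the curl command above.
PLAYGROUND_URL = "https://api.aievaluator.dev/api/v1/playground/evaluate"


def build_playground_request(queries, metrics):
    """Build (but do not send) the POST request that the curl command makes."""
    body = json.dumps({"queries": list(queries), "metrics": list(metrics)})
    return urllib.request.Request(
        PLAYGROUND_URL,
        data=body.encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def evaluate(queries, metrics, timeout=30):
    """Send the request and return the parsed JSON response unchanged.

    Nothing is assumed about the response fields.
    """
    request = build_playground_request(queries, metrics)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))
```

Calling `evaluate(["What is 2+2?"], ["faithfulness"])` would consume one of the daily playground evaluations.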
Level 1 — Install and evaluate a single prompt
pip install aievaluator
# Ask a question, tell it what you expect
aievaluator quick "What is the capital of France?" --expected "Paris"
You'll see a table with the score. The --expected flag is optional; without it, the judge evaluates
the response on its own merits.
⚠️ Playground mode — 4/5 remaining
AI Evaluator — Results
Overall Score: 95.0% ✅ above threshold (0%)
Total rows: 1
Failed: 0
┌────┬────────────────────────────────────┬──────────┬──────┐
│ # │ Query │ Score │ Pass │
├────┼────────────────────────────────────┼──────────┼──────┤
│ 1 │ What is the capital of France? │ 95% │ ✅ │
└────┴────────────────────────────────────┴──────────┴──────┘
Level 2 — Sign up and scaffold a project
The playground is fine for a first look, but you'll soon want more than 5 evals/day.
# Get your API key at https://aievaluator.dev/settings
aievaluator login
# Check your account
aievaluator whoami
Now scaffold your project:
aievaluator init
This creates:
- aievaluator.config.json: project-local config
- evals/smoke-test.json: sample dataset with 3 queries
- Updates .gitignore
Open evals/smoke-test.json and replace the sample queries with your own:
[
{"input": "What are your business hours?", "expected_output": "Mon-Fri 9am-6pm"},
{"input": "How do I cancel my order?", "expected_output": "Go to My Orders → Cancel"},
{"input": "Do you ship internationally?", "expected_output": "Yes, via DHL Express"}
]
Test it against the built-in agent:
aievaluator quick --dataset ./evals/smoke-test.json
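Before spending evaluations on a hand-edited dataset, a quick local check can catch malformed rows. The sketch below assumes each row needs a non-empty `input` string and that `expected_output` is optional, which matches the sample file but is not a documented schema; `validate_rows` and `validate_file` are hypothetical helpers, not part of the CLI.

```python
import json


def validate_rows(rows):
    """Return a list of problems found in a dataset; an empty list means it looks fine."""
    if not isinstance(rows, list):
        return ["dataset must be a JSON array of row objects"]
    problems = []
    for index, row in enumerate(rows, start=1):
        if not isinstance(row, dict):
            problems.append(f"row {index}: not a JSON object")
            continue
        value = row.get("input")
        if not isinstance(value, str) or not value.strip():
            problems.append(f"row {index}: missing or empty 'input'")
        expected = row.get("expected_output")
        # Assumption: expected_output may be omitted, but must be a string when present.
        if expected is not None and not isinstance(expected, str):
            problems.append(f"row {index}: 'expected_output' should be a string")
    return problems


def validate_file(path):
    """Load a JSON dataset from disk and validate its rows."""
    with open(path, encoding="utf-8") as handle:
        return validate_rows(json.load(handle))
```

For example, `validate_file("./evals/smoke-test.json")` returns `[]` when every row has a usable `input`.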
Level 3 — Evaluate your own agent
Point the CLI at your agent's endpoint:
aievaluator eval \
--agent https://chatbot-staging.acme.com/api/chat \
--dataset ./evals/smoke-test.json \
--metrics faithfulness,g_eval
The CLI calls your agent with each query, then an LLM judge scores the responses.
Level 4 — Add quality gates
Not all metrics are equally important. Set different thresholds per metric:
aievaluator eval \
--agent https://chatbot-staging.acme.com/api/chat \
--dataset ./evals/smoke-test.json \
--thresholds faithfulness:0.90,g_eval:0.75
- faithfulness must be ≥ 90% (hallucination = instant fail)
- g_eval must be ≥ 75% (general quality)
If any metric fails to meet its threshold, that row is marked ❌.
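The gating rule above can be illustrated with a few lines of Python. This is a sketch of the logic as described, not the engine's implementation: it assumes scores are reported on the same 0 to 1 scale as the thresholds and that a metric with no score counts as a failure. `parse_thresholds` and `row_passes` are hypothetical names.

```python
def parse_thresholds(spec):
    """Parse a spec such as 'faithfulness:0.90,g_eval:0.75' into a dict of floats."""
    thresholds = {}
    for item in spec.split(","):
        name, _, value = item.strip().partition(":")
        if not name or not value:
            raise ValueError(f"expected metric:threshold, got {item!r}")
        thresholds[name] = float(value)
    return thresholds


def row_passes(scores, thresholds):
    """A row passes only if every thresholded metric has a score at or above its bar."""
    return all(
        name in scores and scores[name] >= bar
        for name, bar in thresholds.items()
    )
```

With the thresholds from the command above, a row scoring 0.85 on faithfulness fails even if its g_eval score is 0.99.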
Or set one bar for everything:
aievaluator eval \
--agent https://chatbot-staging.acme.com/api/chat \
--dataset ./evals/smoke-test.json \
--min-score 0.80
This works on quick too:
aievaluator quick "test prompt" --min-score 0.80
# Exit code 1 if any metric drops below 0.80
Level 5 — Create your own evaluation criteria
Sometimes the built-in metrics aren't enough. Define a custom evaluator inline:
aievaluator eval \
--agent https://chatbot-staging.acme.com/api/chat \
--dataset ./evals/smoke-test.json \
--metrics politeness,g_eval \
--custom '{"name":"politeness","prompt":"Is the response polite and professional? Answer YES or NO and explain.","threshold":0.85}'
The custom evaluator politeness is defined in the request, referenced in --metrics by name,
and evaluated alongside g_eval. No dashboard needed.
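Hand-writing JSON inside shell quotes gets fragile once the prompt contains apostrophes or quotation marks. One way around that, sketched below, is to build the value with `json.dumps` and pass it as a single argument. The `name`, `prompt`, and `threshold` fields and the flags come from the command above; `custom_flag` and `eval_command` are hypothetical helpers.

```python
import json
import shlex


def custom_flag(name, prompt, threshold):
    """Return the JSON string for --custom, with quotes in the prompt escaped."""
    return json.dumps({"name": name, "prompt": prompt, "threshold": threshold})


def eval_command(agent_url, dataset, metrics, custom_json):
    """Assemble the argument list for the eval command shown above."""
    return [
        "aievaluator", "eval",
        "--agent", agent_url,
        "--dataset", dataset,
        "--metrics", ",".join(metrics),
        "--custom", custom_json,
    ]


def as_shell_line(command):
    """Render the argument list as one safely quoted shell line."""
    return shlex.join(command)
```

Passing the list to `subprocess.run` avoids shell quoting entirely; `as_shell_line` is only for printing a command that can be pasted into a terminal.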
Custom evaluator with per-metric threshold override:
aievaluator eval \
--agent $URL --dataset ./tests.json \
--metrics politeness,g_eval \
--custom '{"name":"politeness","prompt":"Is the tone friendly?","threshold":0.7}' \
--thresholds politeness:0.90,g_eval:0.80
The --thresholds flag overrides whatever threshold was set in --custom, so politeness is held to 0.90 here, not 0.7. The engine uses the
per-evaluation value.
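The precedence rule can be pictured as a dictionary merge in which the per-evaluation values are applied last. This is only an illustration of the stated behavior, not the engine's code; `effective_thresholds` is a hypothetical name.

```python
def effective_thresholds(custom_evaluators, overrides):
    """Merge thresholds so that --thresholds values win over --custom values."""
    merged = {
        evaluator["name"]: evaluator["threshold"]
        for evaluator in custom_evaluators
        if "threshold" in evaluator
    }
    merged.update(overrides)  # per-evaluation values take precedence
    return merged
```

Given the command above, the custom evaluator contributes `politeness: 0.7`, and the override replaces it with `0.90` while adding `g_eval: 0.80`.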
Level 6 — CI/CD pipeline
Add this to your GitHub Actions, GitLab CI, or Jenkins:
aievaluator eval \
--agent $STAGING_AGENT \
--dataset ./evals/regression.json \
--thresholds faithfulness:0.90,g_eval:0.75 \
--min-score 0.80 \
--ci \
--format junit > report.xml
| Flag | What it does |
|---|---|
| --ci | No colors, no prompts; clean output for logs |
| --format junit | JUnit XML that CI systems understand natively |
| --min-score 0.80 | Overall score must be ≥ 80% |
| --thresholds | Per-metric quality bars |
Exit code 1 = pipeline fails = deploy blocked.
Environment variables for CI:
export AIEVALUATOR_API_KEY="sk-..." # No hardcoded keys in YAML
export AIEVALUATOR_ENGINE_URL="https://api.aievaluator.dev"
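Pipelines that already drive their steps from Python can wrap the same invocation in a small script. The sketch below uses only the flags and the STAGING_AGENT variable shown above; `build_ci_command`, `run_gate`, and `main` are hypothetical helpers, and the API key is assumed to reach the CLI through the environment variables just listed.

```python
import os
import subprocess


def build_ci_command(agent_url, dataset, thresholds, min_score):
    """Assemble the argument list for the CI invocation shown above."""
    threshold_spec = ",".join(f"{name}:{bar}" for name, bar in thresholds.items())
    return [
        "aievaluator", "eval",
        "--agent", agent_url,
        "--dataset", dataset,
        "--thresholds", threshold_spec,
        "--min-score", str(min_score),
        "--ci",
        "--format", "junit",
    ]


def run_gate(command, report_path="report.xml"):
    """Run the CLI, write its stdout to report_path, and return its exit code."""
    with open(report_path, "w", encoding="utf-8") as report:
        completed = subprocess.run(command, stdout=report, check=False)
    return completed.returncode


def main():
    # STAGING_AGENT must be set by the pipeline, as in the command above.
    command = build_ci_command(
        os.environ["STAGING_AGENT"],
        "./evals/regression.json",
        {"faithfulness": 0.90, "g_eval": 0.75},
        0.80,
    )
    return run_gate(command)
```

Ending the script with `raise SystemExit(main())` passes the CLI's exit code through, so a failed gate still fails the job.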
📋 Complete Command Reference
aievaluator login
aievaluator login # Interactive prompt
aievaluator login --api-key sk-xxx # Non-interactive (CI)
aievaluator login --engine-url https://custom.engine.com
aievaluator whoami
aievaluator whoami
# Tenant: acme-corp
# Tier: pro
# Evals: 42/5000 this cycle
# Tokens: ↓124,800 · ↑89,200 this cycle
aievaluator quick
# Single query
aievaluator quick "What is 2+2?" --expected "4"
# Per-metric thresholds
aievaluator quick "test" --metrics faithfulness:0.90,g_eval:0.75
# General threshold
aievaluator quick "test" --min-score 0.80
# From dataset (JSON or JSONL)
aievaluator quick --dataset ./tests.json
aievaluator quick --dataset ./tests.jsonl
# Custom judge model
aievaluator quick "test" --judge deepseek
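Since quick accepts both JSON and JSONL datasets, a converter between the two can be handy for large files. The sketch assumes a JSONL dataset holds one row object per line with the same `input` and `expected_output` fields as the JSON array form; that layout is an assumption, and both helper names are hypothetical.

```python
import json


def json_to_jsonl(rows):
    """Serialize a list of row dicts as JSONL: one compact JSON object per line."""
    return "".join(json.dumps(row, ensure_ascii=False) + "\n" for row in rows)


def jsonl_to_rows(text):
    """Parse JSONL text back into a list of row dicts, skipping blank lines."""
    return [json.loads(line) for line in text.splitlines() if line.strip()]
```

`ensure_ascii=False` keeps characters such as the arrow in "My Orders → Cancel" readable in the output file.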
aievaluator eval
# Basic
aievaluator eval --agent $URL --dataset ./tests.json
# With quality gates
aievaluator eval --agent $URL --dataset ./tests.json \
--thresholds faithfulness:0.90,g_eval:0.75 --min-score 0.80
# Inline rows
aievaluator eval --agent $URL \
--rows '[{"input":"Hi","expected_output":"Hello"}]'
# Custom evaluator inline
aievaluator eval --agent $URL --dataset ./tests.json \
--metrics my-eval --custom '{"name":"my-eval","prompt":"...","threshold":0.8}'
# CI mode
aievaluator eval --agent $URL --dataset ./tests.json --ci --format junit
# Different agent format
aievaluator eval --agent $URL --dataset ./tests.json --agent-format claude
aievaluator config
aievaluator config show
aievaluator config set default-metrics "faithfulness,g_eval"
aievaluator config set default-min-score 0.80
aievaluator config unset default-min-score
aievaluator init
aievaluator init
# Creates aievaluator.config.json + evals/smoke-test.json + updates .gitignore
📊 Output Formats
Table (default)
Human-readable table with scores, pass/fail icons, and token counts.
JSON (--format json)
aievaluator eval ... --format json | jq '.overall_score'
Clean JSON on stdout. All logs/warnings go to stderr.
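Because stdout carries only JSON, the report can also be consumed from Python without jq. The sketch relies solely on the `overall_score` field used in the jq example; other fields and the score's scale (fraction or percentage) are not assumed, and `overall_score` and `meets_bar` are hypothetical helper names.

```python
import json


def overall_score(report_text):
    """Extract overall_score from the JSON report printed on stdout."""
    report = json.loads(report_text)
    return float(report["overall_score"])


def meets_bar(report_text, minimum):
    """Compare the overall score with a minimum expressed on the same scale."""
    return overall_score(report_text) >= minimum
```

The report text would typically come from `subprocess.run(..., stdout=subprocess.PIPE, text=True).stdout`, leaving stderr attached to the terminal for logs.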
JUnit XML (--format junit)
aievaluator eval ... --format junit > report.xml
Native CI integration. One <testcase> per query, with a <failure> element for each query below threshold.
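For a quick summary outside a CI dashboard, the report can be read with the standard library's XML parser. The sketch depends only on the `<testcase>` and `<failure>` elements mentioned above; the `name` attribute and the sample document are assumptions based on common JUnit XML conventions, not captured CLI output.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample in the usual JUnit shape, not real CLI output.
SAMPLE_REPORT = (
    '<testsuite name="example">'
    '<testcase name="query 1"/>'
    '<testcase name="query 2"><failure message="below threshold"/></testcase>'
    "</testsuite>"
)


def summarize_junit(xml_text):
    """Count test cases and list the names of those containing a <failure>."""
    root = ET.fromstring(xml_text)
    cases = list(root.iter("testcase"))
    failed = [
        case.get("name", "")
        for case in cases
        if case.find("failure") is not None
    ]
    return {"total": len(cases), "failed": failed}
```

Using `root.iter` means the function works whether the root element is `<testsuite>` or a wrapping `<testsuites>`.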
🤖 VS Code Extension
Prefer staying in your editor? Install the VS Code extension.
- Select text → right-click → Evaluate
- Per-metric threshold editor with preset buttons
- Custom evaluator support via Command Palette
- Sidebar with evaluation history
- Dataset file evaluation (JSON + JSONL)
Requirements
- Python 3.10+