
AI Evaluator CLI — evaluate your LLM agents from the command line


AI Evaluator CLI — Python


Evaluate your LLM agents from the terminal. No browser. No dashboard.

pip install aievaluator

🧭 Tutorial — From Zero to CI/CD

Every step builds on the previous one. Start wherever makes sense for you.


Level 0 — Try it without installing anything

curl -s -X POST https://api.aievaluator.dev/api/v1/playground/evaluate \
  -H "Content-Type: application/json" \
  -d '{"queries":["What is 2+2?"],"metrics":["faithfulness"]}' | jq .

5 free evaluations per day. No API key. No install. Good enough to decide if it's useful.
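The same playground request can be sent from Python using only the standard library. This is a minimal sketch: the URL and JSON body are copied from the curl command above, the helper names `build_playground_request` and `evaluate` are illustrative, and the response is returned as-is because its schema is not documented here.

```python
import json
import urllib.request

# URL and payload shape are taken from the curl example above.
PLAYGROUND_URL = "https://api.aievaluator.dev/api/v1/playground/evaluate"


def build_playground_request(queries, metrics):
    """Build (but do not send) the POST request that the curl command issues."""
    body = json.dumps({"queries": list(queries), "metrics": list(metrics)})
    return urllib.request.Request(
        PLAYGROUND_URL,
        data=body.encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def evaluate(queries, metrics, timeout=30):
    """Send the request and return the decoded JSON response unchanged."""
    request = build_playground_request(queries, metrics)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))
```

Calling `evaluate(["What is 2+2?"], ["faithfulness"])` counts against the daily playground allowance, just like the curl call.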


Level 1 — Install and evaluate a single prompt

pip install aievaluator

# Ask a question, tell it what you expect
aievaluator quick "What is the capital of France?" --expected "Paris"

You'll see a table with the score. The --expected flag is optional; without it, the judge evaluates the response on its own merits.

⚠️  Playground mode — 4/5 remaining

  AI Evaluator — Results
  Overall Score:  95.0%  ✅ above threshold (0%)
  Total rows:     1
  Failed:         0

┌────┬────────────────────────────────────┬──────────┬──────┐
│  # │ Query                              │ Score    │ Pass │
├────┼────────────────────────────────────┼──────────┼──────┤
│  1 │ What is the capital of France?     │  95%     │ ✅   │
└────┴────────────────────────────────────┴──────────┴──────┘

Level 2 — Sign up and scaffold a project

Playground is great for trying, but you'll want more than 5 evals/day.

# Get your API key at https://aievaluator.dev/settings
aievaluator login

# Check your account
aievaluator whoami

Now scaffold your project:

aievaluator init

This creates:

  • aievaluator.config.json — project-local config
  • evals/smoke-test.json — sample dataset with 3 queries
  • Updates .gitignore

Open evals/smoke-test.json and replace the sample queries with your own:

[
  {"input": "What are your business hours?", "expected_output": "Mon-Fri 9am-6pm"},
  {"input": "How do I cancel my order?", "expected_output": "Go to My Orders → Cancel"},
  {"input": "Do you ship internationally?", "expected_output": "Yes, via DHL Express"}
]

Test it against the built-in agent:

aievaluator quick --dataset ./evals/smoke-test.json
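Before spending evaluations on a dataset, its shape can be sanity-checked locally. The sketch below assumes the row format shown in the sample file: an object with a non-empty string `input` and a string `expected_output`. Treating `expected_output` as optional is an assumption, and `validate_rows` and `validate_file` are illustrative helpers, not part of the CLI.

```python
import json
from pathlib import Path


def validate_rows(rows):
    """Return a list of human-readable problems found in dataset rows."""
    if not isinstance(rows, list):
        return ["top-level JSON value must be a list of rows"]
    problems = []
    for index, row in enumerate(rows, start=1):
        if not isinstance(row, dict):
            problems.append(f"row {index}: not a JSON object")
            continue
        value = row.get("input")
        if not isinstance(value, str) or not value.strip():
            problems.append(f"row {index}: missing or empty 'input'")
        # Assumption: expected_output may be omitted, but must be text if present.
        expected = row.get("expected_output")
        if expected is not None and not isinstance(expected, str):
            problems.append(f"row {index}: 'expected_output' is not a string")
    return problems


def validate_file(path):
    """Load a JSON dataset file and validate its rows."""
    return validate_rows(json.loads(Path(path).read_text(encoding="utf-8")))
```

An empty list from `validate_file("evals/smoke-test.json")` means no problems were found under these assumptions.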

Level 3 — Evaluate your own agent

Point the CLI at your agent's endpoint:

aievaluator eval \
  --agent https://chatbot-staging.acme.com/api/chat \
  --dataset ./evals/smoke-test.json \
  --metrics faithfulness,g_eval

The CLI calls your agent with each query, then an LLM judge scores the responses.
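Conceptually, that is a two-stage loop over the dataset. The sketch below illustrates the flow only and is not the CLI's implementation: `call_agent` and `judge` are hypothetical callables standing in for the HTTP call to your endpoint and for the LLM judge, and the toy stand-ins exist purely so the example runs offline.

```python
def run_eval(rows, call_agent, judge):
    """Call the agent for each row, then score each response.

    call_agent(query) -> response text          (hypothetical stand-in)
    judge(query, response, expected) -> float   (hypothetical stand-in, 0..1)
    """
    results = []
    for row in rows:
        response = call_agent(row["input"])
        score = judge(row["input"], response, row.get("expected_output"))
        results.append({"input": row["input"], "response": response, "score": score})
    return results


# Toy stand-ins, for illustration only.
def toy_agent(query):
    return "Paris" if "France" in query else "I don't know"


def exact_match_judge(query, response, expected):
    return 1.0 if response == expected else 0.0
```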


Level 4 — Add quality gates

Not all metrics are equally important. Set different thresholds per metric:

aievaluator eval \
  --agent https://chatbot-staging.acme.com/api/chat \
  --dataset ./evals/smoke-test.json \
  --thresholds faithfulness:0.90,g_eval:0.75

  • faithfulness must be ≥ 90% (hallucination = instant fail)
  • g_eval must be ≥ 75% (general quality)

If any metric fails to meet its threshold, that row is marked ❌.
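The gating rule can be sketched in a few lines. This is an illustration of the behavior described above, not the engine's code: `parse_thresholds` and `row_passes` are hypothetical helpers, scores are assumed to be on a 0 to 1 scale, and counting a gated metric with no score as a failure is an assumption.

```python
def parse_thresholds(spec):
    """Parse 'faithfulness:0.90,g_eval:0.75' into a {metric: threshold} dict."""
    thresholds = {}
    for item in spec.split(","):
        name, sep, value = item.strip().partition(":")
        if not sep or not name:
            raise ValueError(f"expected name:value, got {item!r}")
        thresholds[name] = float(value)
    return thresholds


def row_passes(scores, thresholds):
    """A row passes only if every gated metric meets its threshold.

    Assumption: a gated metric that has no score is treated as 0.0.
    """
    return all(scores.get(name, 0.0) >= bar for name, bar in thresholds.items())
```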

Or set one bar for everything:

aievaluator eval \
  --agent https://chatbot-staging.acme.com/api/chat \
  --dataset ./evals/smoke-test.json \
  --min-score 0.80

This works on quick too:

aievaluator quick "test prompt" --min-score 0.80
# Exit code 1 if any metric drops below 0.80

Level 5 — Create your own evaluation criteria

Sometimes the built-in metrics aren't enough. Define a custom evaluator inline:

aievaluator eval \
  --agent https://chatbot-staging.acme.com/api/chat \
  --dataset ./evals/smoke-test.json \
  --metrics politeness,g_eval \
  --custom '{"name":"politeness","prompt":"Is the response polite and professional? Answer YES or NO and explain.","threshold":0.85}'

The custom evaluator politeness is defined in the request, referenced in --metrics by name, and evaluated alongside g_eval. No dashboard needed.
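Hand-writing JSON inside shell quotes gets fragile once the prompt contains quotes or apostrophes. One way around that, sketched below, is to generate the argument with Python's `json` and `shlex` modules. The keys `name`, `prompt`, and `threshold` come from the example above; `custom_evaluator_arg` is an illustrative helper, not part of the CLI.

```python
import json
import shlex


def custom_evaluator_arg(name, prompt, threshold):
    """Return a shell-quoted JSON string suitable as the value of --custom."""
    payload = json.dumps({"name": name, "prompt": prompt, "threshold": threshold})
    # shlex.quote makes the JSON safe to paste into a POSIX shell command line.
    return shlex.quote(payload)
```

When invoking the CLI through `subprocess.run` with an argument list, the quoting step is unnecessary and the plain `json.dumps` output can be passed directly.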

Custom evaluator with per-metric threshold override:

aievaluator eval \
  --agent $URL --dataset ./tests.json \
  --metrics politeness,g_eval \
  --custom '{"name":"politeness","prompt":"Is the tone friendly?","threshold":0.7}' \
  --thresholds politeness:0.90,g_eval:0.80

The --thresholds flag overrides whatever was set in --custom, so politeness is gated at 0.90 here rather than 0.7. The engine uses the per-evaluation value.
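The precedence amounts to a dictionary merge in which the later source wins. A minimal sketch of that rule, with `effective_thresholds` as a hypothetical helper rather than the engine's actual code:

```python
def effective_thresholds(custom_evaluators, overrides):
    """Merge thresholds so that --thresholds values win over --custom values."""
    merged = {
        evaluator["name"]: evaluator["threshold"]
        for evaluator in custom_evaluators
        if "threshold" in evaluator
    }
    merged.update(overrides)  # per-evaluation values take precedence
    return merged
```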


Level 6 — CI/CD pipeline

Add this to your GitHub Actions, GitLab CI, or Jenkins pipeline:

aievaluator eval \
  --agent $STAGING_AGENT \
  --dataset ./evals/regression.json \
  --thresholds faithfulness:0.90,g_eval:0.75 \
  --min-score 0.80 \
  --ci \
  --format junit > report.xml

Flag               What it does
--ci               No colors, no prompts: clean output for logs
--format junit     JUnit XML that CI systems understand natively
--min-score 0.80   Overall score must be ≥ 80%
--thresholds       Per-metric quality bars

Exit code 1 = pipeline fails = deploy blocked.

Environment variables for CI:

export AIEVALUATOR_API_KEY="sk-..."       # No hardcoded keys in YAML
export AIEVALUATOR_ENGINE_URL="https://api.aievaluator.dev"
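In a Python-driven CI step, the command above can be assembled as an argument list after checking that the required variables are set. This is a minimal sketch: `build_ci_command` is an illustrative helper, not part of the package, and it assumes the CLI picks up `AIEVALUATOR_API_KEY` from the environment as the exports above suggest.

```python
import os


def build_ci_command(dataset, env=None):
    """Build the argv list for the CI invocation shown above.

    Raises RuntimeError when a required variable is missing, so the job
    fails with a clear message before any evaluation is attempted.
    """
    env = os.environ if env is None else env
    required = ("AIEVALUATOR_API_KEY", "STAGING_AGENT")
    missing = [name for name in required if not env.get(name)]
    if missing:
        raise RuntimeError("missing environment variables: " + ", ".join(missing))
    return [
        "aievaluator", "eval",
        "--agent", env["STAGING_AGENT"],
        "--dataset", dataset,
        "--thresholds", "faithfulness:0.90,g_eval:0.75",
        "--min-score", "0.80",
        "--ci",
        "--format", "junit",
    ]
```

Passing the list to `subprocess.run` with stdout redirected to report.xml, then exiting with the returned code, keeps the exit-code gate intact.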

📋 Complete Command Reference

aievaluator login

aievaluator login                        # Interactive prompt
aievaluator login --api-key sk-xxx       # Non-interactive (CI)
aievaluator login --engine-url https://custom.engine.com

aievaluator whoami

aievaluator whoami
# Tenant:  acme-corp
# Tier:    pro
# Evals:   42/5000 this cycle
# Tokens:  ↓124,800 · ↑89,200 this cycle

aievaluator quick

# Single query
aievaluator quick "What is 2+2?" --expected "4"

# Per-metric thresholds
aievaluator quick "test" --metrics faithfulness:0.90,g_eval:0.75

# General threshold
aievaluator quick "test" --min-score 0.80

# From dataset (JSON or JSONL)
aievaluator quick --dataset ./tests.json
aievaluator quick --dataset ./tests.jsonl

# Custom judge model
aievaluator quick "test" --judge deepseek
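For the JSON and JSONL dataset options above, a small converter can be handy. The sketch assumes a JSONL dataset is simply one row object per line with the same fields as the JSON array form; that layout is the usual JSON Lines convention and is not confirmed by the examples here. Both function names are illustrative.

```python
import json


def rows_to_jsonl(rows):
    """Serialize rows as JSON Lines: one JSON object per line."""
    return "".join(json.dumps(row, ensure_ascii=False) + "\n" for row in rows)


def jsonl_to_rows(text):
    """Parse JSON Lines text into a list of rows, skipping blank lines."""
    return [json.loads(line) for line in text.split("\n") if line.strip()]
```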

aievaluator eval

# Basic
aievaluator eval --agent $URL --dataset ./tests.json

# With quality gates
aievaluator eval --agent $URL --dataset ./tests.json \
  --thresholds faithfulness:0.90,g_eval:0.75 --min-score 0.80

# Inline rows
aievaluator eval --agent $URL \
  --rows '[{"input":"Hi","expected_output":"Hello"}]'

# Custom evaluator inline
aievaluator eval --agent $URL --dataset ./tests.json \
  --metrics my-eval --custom '{"name":"my-eval","prompt":"...","threshold":0.8}'

# CI mode
aievaluator eval --agent $URL --dataset ./tests.json --ci --format junit

# Different agent format
aievaluator eval --agent $URL --dataset ./tests.json --agent-format claude

aievaluator config

aievaluator config show
aievaluator config set default-metrics "faithfulness,g_eval"
aievaluator config set default-min-score 0.80
aievaluator config unset default-min-score

aievaluator init

aievaluator init
# Creates aievaluator.config.json + evals/smoke-test.json + updates .gitignore

📊 Output Formats

Table (default)

Human-readable table with scores, pass/fail icons, and token counts.

JSON (--format json)

aievaluator eval ... --format json | jq '.overall_score'

Clean JSON on stdout. All logs/warnings go to stderr.
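Because the report goes to stdout and logs go to stderr, a script can capture the two streams separately and parse only stdout. A minimal sketch with an illustrative `run_json` helper; the only report key it relies on downstream is `overall_score`, taken from the jq example above, and the scale of that value is not documented here.

```python
import json
import subprocess


def run_json(argv):
    """Run a command and return (exit_code, parsed_stdout_json, stderr_text)."""
    completed = subprocess.run(argv, capture_output=True, text=True)
    # Only stdout is parsed; logs and warnings stay in stderr.
    report = json.loads(completed.stdout) if completed.stdout.strip() else None
    return completed.returncode, report, completed.stderr
```

For example, `run_json(["aievaluator", "eval", "--agent", url, "--dataset", "./tests.json", "--format", "json"])` would return the parsed report along with the exit code used for gating.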

JUnit XML (--format junit)

aievaluator eval ... --format junit > report.xml

Native CI integration. The report contains one <testcase> per query, with a <failure> element for each query below threshold.
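A report with that structure can be summarized with the standard library's XML parser. The sketch below relies only on the `<testcase>` and `<failure>` elements mentioned above; reading the `name` and `message` attributes follows common JUnit convention and is an assumption about this CLI's output.

```python
import xml.etree.ElementTree as ET


def failed_testcases(junit_xml):
    """Return (name, message) for each <testcase> that contains a <failure>."""
    root = ET.fromstring(junit_xml)
    failures = []
    for case in root.iter("testcase"):
        failure = case.find("failure")
        if failure is not None:
            # Attribute names are assumed; missing ones fall back to "".
            failures.append((case.get("name", ""), failure.get("message", "")))
    return failures
```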


🤖 VS Code Extension

Prefer staying in your editor? Install the VS Code extension.

  • Select text → right-click → Evaluate
  • Per-metric threshold editor with preset buttons
  • Custom evaluator support via Command Palette
  • Sidebar with evaluation history
  • Dataset file evaluation (JSON + JSONL)



Requirements

  • Python 3.10+

