AI Evaluator CLI — evaluate your LLM agents from the command line

AI Evaluator CLI — Python

Evaluate your LLM agents from the terminal. No browser. No dashboard.

pip install aievaluator

🧭 Tutorial — From Zero to CI/CD

Every step builds on the previous one. Start wherever makes sense for you.


Level 0 — Try it without installing anything

curl -s -X POST https://api.aievaluator.dev/api/v1/playground/evaluate \
  -H "Content-Type: application/json" \
  -d '{"queries":["What is 2+2?"],"metrics":["faithfulness"]}' | jq .

5 free evaluations per day. No key. No install. Good enough to decide if it's useful.
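
The same playground request can also be sent from Python. This is a minimal sketch using only the standard library: the endpoint and payload are copied from the curl call above, while the helper names (build_playground_request, evaluate) are hypothetical and the response shape is not assumed, so the result is returned as plain decoded JSON.

```python
import json
import urllib.request

# Endpoint taken from the curl example above.
PLAYGROUND_URL = "https://api.aievaluator.dev/api/v1/playground/evaluate"


def build_playground_request(queries, metrics):
    """Build (but do not send) the POST request used by the curl example."""
    body = json.dumps({"queries": queries, "metrics": metrics}).encode("utf-8")
    return urllib.request.Request(
        PLAYGROUND_URL,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def evaluate(queries, metrics, timeout=30):
    """Send the request and return the decoded JSON response as-is."""
    request = build_playground_request(queries, metrics)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.load(response)
```

Calling evaluate(["What is 2+2?"], ["faithfulness"]) sends one request, which presumably counts against the daily playground quota.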


Level 1 — Install and evaluate a single prompt

pip install aievaluator

# Ask a question, tell it what you expect
aievaluator quick "What is the capital of France?" --expected "Paris"

You'll see a table with the score. The --expected flag is optional: without it, the judge evaluates the response on its own merits.

⚠️  Playground mode — 4/5 remaining

  AI Evaluator — Results
  Overall Score:  95.0%  ✅ above threshold (0%)
  Total rows:     1
  Failed:         0

┌────┬────────────────────────────────────┬──────────┬──────┐
│  # │ Query                              │ Score    │ Pass │
├────┼────────────────────────────────────┼──────────┼──────┤
│  1 │ What is the capital of France?     │  95%     │ ✅   │
└────┴────────────────────────────────────┴──────────┴──────┘

Level 2 — Sign up and scaffold a project

Playground is great for trying, but you'll want more than 5 evals/day.

# Get your API key at https://aievaluator.dev/settings
aievaluator login

# Check your account
aievaluator whoami

Now scaffold your project:

aievaluator init

This creates:

  • aievaluator.config.json — project-local config
  • evals/smoke-test.json — sample dataset with 3 queries
  • Updates .gitignore

Open evals/smoke-test.json and replace the sample queries with your own:

[
  {"input": "What are your business hours?", "expected_output": "Mon-Fri 9am-6pm"},
  {"input": "How do I cancel my order?", "expected_output": "Go to My Orders → Cancel"},
  {"input": "Do you ship internationally?", "expected_output": "Yes, via DHL Express"}
]

Test it against the built-in agent:

aievaluator quick --dataset ./evals/smoke-test.json
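
Before spending evaluations on a hand-edited dataset, it can help to check its shape locally. The sketch below is not part of the CLI; it assumes, based only on the sample file above, that a dataset is a JSON array of objects with a string "input" and an optional string "expected_output". The function names are hypothetical.

```python
import json


def validate_rows(rows):
    """Return a list of problems found in a dataset; an empty list means none."""
    if not isinstance(rows, list):
        return ["top level must be a JSON array"]
    problems = []
    for index, row in enumerate(rows, start=1):
        if not isinstance(row, dict):
            problems.append(f"row {index}: not an object")
            continue
        query = row.get("input")
        if not isinstance(query, str) or not query.strip():
            problems.append(f"row {index}: 'input' must be a non-empty string")
        # Assumption: expected_output may be omitted, but must be a string if present.
        if "expected_output" in row and not isinstance(row["expected_output"], str):
            problems.append(f"row {index}: 'expected_output' must be a string")
    return problems


def validate_file(path):
    """Load a JSON dataset from disk and validate its rows."""
    with open(path, encoding="utf-8") as handle:
        return validate_rows(json.load(handle))
```

For example, validate_file("./evals/smoke-test.json") returns an empty list when every row has the expected shape.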

Level 3 — Evaluate your own agent

Point the CLI at your agent's endpoint:

aievaluator eval \
  --agent https://chatbot-staging.acme.com/api/chat \
  --dataset ./evals/smoke-test.json \
  --metrics faithfulness,g_eval

The CLI calls your agent with each query, then an LLM judge scores the responses.


Level 4 — Add quality gates

Not all metrics are equally important. Set different thresholds per metric:

aievaluator eval \
  --agent https://chatbot-staging.acme.com/api/chat \
  --dataset ./evals/smoke-test.json \
  --thresholds faithfulness:0.90,g_eval:0.75

  • faithfulness must be ≥ 90% (hallucination = instant fail)
  • g_eval must be ≥ 75% (general quality)

If any metric fails to meet its threshold, that row is marked ❌.
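
The per-row rule described above can be sketched in a few lines of Python. This illustrates the semantics only and is not the engine's implementation; it assumes scores are reported on a 0 to 1 scale and that a metric with no score counts as a failure. Both function names are hypothetical.

```python
def parse_thresholds(spec):
    """Parse 'faithfulness:0.90,g_eval:0.75' into a {metric: threshold} dict."""
    thresholds = {}
    for item in spec.split(","):
        name, _, value = item.partition(":")
        if not name.strip() or not value.strip():
            raise ValueError(f"expected metric:threshold, got {item!r}")
        thresholds[name.strip()] = float(value)
    return thresholds


def row_passes(scores, thresholds):
    """A row passes only if every thresholded metric meets its own bar."""
    # Assumption: a missing score is treated as 0.0, so it fails any positive bar.
    return all(scores.get(metric, 0.0) >= bar for metric, bar in thresholds.items())
```

With the thresholds from the command above, a row scoring 0.85 on faithfulness fails even if its g_eval score is 0.99.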

Or set one bar for everything:

aievaluator eval \
  --agent https://chatbot-staging.acme.com/api/chat \
  --dataset ./evals/smoke-test.json \
  --min-score 0.80

This works on quick too:

aievaluator quick "test prompt" --min-score 0.80
# Exit code 1 if any metric drops below 0.80

Level 5 — Create your own evaluation criteria

Sometimes the built-in metrics aren't enough. Define a custom evaluator inline:

aievaluator eval \
  --agent https://chatbot-staging.acme.com/api/chat \
  --dataset ./evals/smoke-test.json \
  --metrics politeness,g_eval \
  --custom '{"name":"politeness","prompt":"Is the response polite and professional? Answer YES or NO and explain.","threshold":0.85}'

The custom evaluator politeness is defined in the request, referenced in --metrics by name, and evaluated alongside g_eval. No dashboard needed.
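
Hand-writing JSON inside shell quotes is error-prone once the prompt contains apostrophes or double quotes. One way around this is to generate the argument from Python. The sketch below uses only the three fields shown above (name, prompt, threshold); the helper name custom_flag is hypothetical.

```python
import json
import shlex


def custom_flag(name, prompt, threshold):
    """Return a shell-safe '--custom <json>' fragment for an inline evaluator."""
    definition = {"name": name, "prompt": prompt, "threshold": threshold}
    # shlex.quote escapes quotes and spaces so the JSON survives the shell intact.
    return "--custom " + shlex.quote(json.dumps(definition, ensure_ascii=False))
```

The returned string can be pasted after the other flags of an aievaluator eval command.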

Custom evaluator with per-metric threshold override:

aievaluator eval \
  --agent $URL --dataset ./tests.json \
  --metrics politeness,g_eval \
  --custom '{"name":"politeness","prompt":"Is the tone friendly?","threshold":0.7}' \
  --thresholds politeness:0.90,g_eval:0.80

The --thresholds flag overrides the threshold set in --custom, so politeness must reach 0.90 here, not 0.7. The engine uses the per-evaluation value.
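
The precedence rule can be pictured as a dictionary merge in which the --thresholds values are applied last. This is an illustrative sketch of the rule as stated, not the engine's code, and effective_thresholds is a hypothetical name.

```python
def effective_thresholds(custom_evaluators, overrides):
    """Merge thresholds so that --thresholds values win over --custom values."""
    merged = {
        evaluator["name"]: evaluator["threshold"]
        for evaluator in custom_evaluators
        if "threshold" in evaluator
    }
    merged.update(overrides)  # values applied later replace earlier ones
    return merged
```

For the command above, the merged result holds politeness at 0.90 and g_eval at 0.80.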


Level 6 — CI/CD pipeline

Add this to your GitHub Actions, GitLab CI, or Jenkins pipeline:

aievaluator eval \
  --agent $STAGING_AGENT \
  --dataset ./evals/regression.json \
  --thresholds faithfulness:0.90,g_eval:0.75 \
  --min-score 0.80 \
  --ci \
  --format junit > report.xml

Flag               What it does
--ci               No colors, no prompts; clean output for logs
--format junit     JUnit XML that CI systems understand natively
--min-score 0.80   Overall score must be ≥ 80%
--thresholds       Per-metric quality bars

Exit code 1 = pipeline fails = deploy blocked.
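
Most CI systems stop on a non-zero exit code automatically. If the evaluation is driven from a Python deploy script instead, the same gate can be expressed explicitly. This sketch relies only on the exit-code convention described above; the function names are hypothetical, and the commands are passed in as argument lists.

```python
import subprocess
import sys


def run_gate(command):
    """Run a command and report whether it exited with code 0."""
    completed = subprocess.run(command, check=False)
    return completed.returncode == 0


def deploy_if_green(eval_command, deploy_command):
    """Run the deploy step only when the evaluation gate passes."""
    if not run_gate(eval_command):
        print("Evaluation gate failed; skipping deploy.", file=sys.stderr)
        return False
    return run_gate(deploy_command)
```

Here eval_command would be the aievaluator eval invocation shown above, split into a list such as ["aievaluator", "eval", "--agent", ...].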

Environment variables for CI:

export AIEVALUATOR_API_KEY="sk-..."       # No hardcoded keys in YAML
export AIEVALUATOR_ENGINE_URL="https://api.aievaluator.dev"
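
A wrapper script can check these variables up front so a missing secret fails fast with a clear message. The variable names come from the lines above; falling back to the hosted URL when AIEVALUATOR_ENGINE_URL is unset is an assumption of this sketch, as is the function name.

```python
import os


def load_ci_settings(environ=os.environ):
    """Read the two CI variables, failing early if the API key is missing."""
    api_key = environ.get("AIEVALUATOR_API_KEY", "").strip()
    if not api_key:
        raise RuntimeError("AIEVALUATOR_API_KEY is not set")
    # Assumption: use the hosted engine URL shown above when no override is given.
    engine_url = environ.get("AIEVALUATOR_ENGINE_URL", "https://api.aievaluator.dev")
    return {"api_key": api_key, "engine_url": engine_url.rstrip("/")}
```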

📋 Complete Command Reference

aievaluator login

aievaluator login                        # Interactive prompt
aievaluator login --api-key sk-xxx       # Non-interactive (CI)
aievaluator login --engine-url https://custom.engine.com

aievaluator whoami

aievaluator whoami
# Tenant:  acme-corp
# Tier:    pro
# Evals:   42/5000 this cycle
# Tokens:  ↓124,800 · ↑89,200 this cycle

aievaluator quick

# Single query
aievaluator quick "What is 2+2?" --expected "4"

# Per-metric thresholds
aievaluator quick "test" --metrics faithfulness:0.90,g_eval:0.75

# General threshold
aievaluator quick "test" --min-score 0.80

# From dataset (JSON or JSONL)
aievaluator quick --dataset ./tests.json
aievaluator quick --dataset ./tests.jsonl

# Custom judge model
aievaluator quick "test" --judge deepseek
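
Since quick accepts datasets as JSON or JSONL, an existing JSON array can be converted when a line-oriented file is easier to diff or append to. This sketch assumes the JSONL form is simply one row object per line with the same keys as the JSON form; the function names are hypothetical.

```python
import json


def rows_to_jsonl(rows):
    """Serialize dataset rows as JSON Lines: one compact object per line."""
    return "".join(json.dumps(row, ensure_ascii=False) + "\n" for row in rows)


def jsonl_to_rows(text):
    """Parse JSON Lines back into a list of rows, skipping blank lines."""
    # Split on "\n" only: json.dumps always escapes newlines inside strings.
    return [json.loads(line) for line in text.split("\n") if line.strip()]
```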

aievaluator eval

# Basic
aievaluator eval --agent $URL --dataset ./tests.json

# With quality gates
aievaluator eval --agent $URL --dataset ./tests.json \
  --thresholds faithfulness:0.90,g_eval:0.75 --min-score 0.80

# Inline rows
aievaluator eval --agent $URL \
  --rows '[{"input":"Hi","expected_output":"Hello"}]'

# Custom evaluator inline
aievaluator eval --agent $URL --dataset ./tests.json \
  --metrics my-eval --custom '{"name":"my-eval","prompt":"...","threshold":0.8}'

# CI mode
aievaluator eval --agent $URL --dataset ./tests.json --ci --format junit

# Different agent format
aievaluator eval --agent $URL --dataset ./tests.json --agent-format claude

aievaluator config

aievaluator config show
aievaluator config set default-metrics "faithfulness,g_eval"
aievaluator config set default-min-score 0.80
aievaluator config unset default-min-score

aievaluator init

aievaluator init
# Creates aievaluator.config.json + evals/smoke-test.json + updates .gitignore

aievaluator generate-ci

Generates a CI/CD workflow file for GitHub Actions or GitLab CI.

aievaluator generate-ci --platform github

Options:

Flag                        Default                    Description
--platform github|gitlab    github                     CI/CD platform
--dataset                   ./evals/regression.json    Dataset path
--output                    stdout                     Save to file

# Print GitHub Actions workflow
aievaluator generate-ci --platform github

# Save GitLab CI workflow to file
aievaluator generate-ci --platform gitlab --output .gitlab-ci.yml

📊 Output Formats

Table (default)

Human-readable table with scores, pass/fail icons, and token counts.

JSON (--format json)

aievaluator eval ... --format json | jq '.overall_score'

Clean JSON on stdout. All logs/warnings go to stderr.
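
Because stdout carries only JSON, the report can also be consumed from Python rather than jq. The only field this sketch relies on is overall_score, which appears in the jq example above; whether it is reported on a 0 to 1 or a 0 to 100 scale is not stated here, so the value is returned unchanged. The function name is hypothetical.

```python
import json


def overall_score(report_text):
    """Extract overall_score from the JSON report printed on stdout."""
    report = json.loads(report_text)
    score = report.get("overall_score") if isinstance(report, dict) else None
    # bool is a subclass of int in Python, so reject it explicitly.
    if isinstance(score, bool) or not isinstance(score, (int, float)):
        raise ValueError("report has no numeric 'overall_score'")
    return float(score)
```

The input would typically be the captured stdout of an aievaluator eval run with --format json.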

JUnit XML (--format junit)

aievaluator eval ... --format junit > report.xml

Native CI integration: one <testcase> per query, with a <failure> element for each query below threshold.
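
When a CI system's built-in JUnit viewer is not available, the report can be summarized with Python's standard library. This sketch depends only on the structure described above (a testcase element per query, with a failure child when the query is below threshold); the name attribute and the function name are assumptions.

```python
import xml.etree.ElementTree as ET


def summarize_junit(xml_text):
    """Count test cases and collect the names of those that contain a failure."""
    root = ET.fromstring(xml_text)
    cases = list(root.iter("testcase"))
    failed = [
        case.get("name", "")  # assumption: each testcase carries a name attribute
        for case in cases
        if case.find("failure") is not None
    ]
    return {"total": len(cases), "failed": failed}
```

Reading report.xml and passing its contents to summarize_junit gives a quick count of failing queries without opening the CI dashboard.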


Requirements

  • Python 3.10+
