AI Evaluator CLI — evaluate your LLM agents from the command line

AI Evaluator CLI — Python

Evaluate your LLM agents from the terminal. No browser. No dashboard.

pip install aievaluator

🧭 Tutorial — From Zero to CI/CD

Every step builds on the previous one. Start wherever makes sense for you.

Level 0 — Try it without installing anything

curl -s -X POST https://api.aievaluator.dev/api/v1/playground/evaluate \
  -H "Content-Type: application/json" \
  -d '{"queries":["What is 2+2?"],"metrics":["faithfulness"]}' | jq .

The playground allows 5 free evaluations per day, with no API key and no install. That is enough to decide whether the tool is useful.
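If you would rather call the playground from Python than from curl, the request can be sketched with the standard library alone. The URL and payload fields are copied from the curl command above; the function names and the timeout are illustrative assumptions, and the response schema is not documented here, so the body is returned without interpreting its fields.

```python
import json
import urllib.request

PLAYGROUND_URL = "https://api.aievaluator.dev/api/v1/playground/evaluate"


def build_playground_request(queries, metrics):
    """Build (but do not send) the same POST request as the curl command."""
    payload = json.dumps({"queries": list(queries), "metrics": list(metrics)})
    return urllib.request.Request(
        PLAYGROUND_URL,
        data=payload.encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def evaluate(queries, metrics, timeout=30):
    """Send the request and return the decoded JSON body.

    Hypothetical helper: the response fields are not interpreted here.
    """
    request = build_playground_request(queries, metrics)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))
```

Calling `evaluate(["What is 2+2?"], ["faithfulness"])` counts against the same daily playground quota as the curl version.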

Level 1 — Install and evaluate a single prompt

pip install aievaluator

# Ask a question, tell it what you expect
aievaluator quick "What is the capital of France?" --expected "Paris"

You'll see a table with the score. The --expected flag is optional; without it, the judge evaluates the response on its own merits.

⚠️  Playground mode — 4/5 remaining

  AI Evaluator — Results
  Overall Score:  95.0%  ✅ above threshold (0%)
  Total rows:     1
  Failed:         0

┌────┬────────────────────────────────────┬──────────┬──────┐
│  # │ Query                              │ Score    │ Pass │
├────┼────────────────────────────────────┼──────────┼──────┤
│  1 │ What is the capital of France?     │  95%     │ ✅   │
└────┴────────────────────────────────────┴──────────┴──────┘

Level 2 — Sign up and scaffold a project

The playground is fine for a first try, but you'll soon want more than 5 evaluations per day.

# Get your API key at https://aievaluator.dev/settings
aievaluator login

# Check your account
aievaluator whoami

Now scaffold your project:

aievaluator init

This creates:

aievaluator.config.json — project-local config
evals/smoke-test.json — sample dataset with 3 queries
Updates .gitignore

Open evals/smoke-test.json and replace the sample queries with your own:

[
  {"input": "What are your business hours?", "expected_output": "Mon-Fri 9am-6pm"},
  {"input": "How do I cancel my order?", "expected_output": "Go to My Orders → Cancel"},
  {"input": "Do you ship internationally?", "expected_output": "Yes, via DHL Express"}
]

Test it against the built-in agent:

aievaluator quick --dataset ./evals/smoke-test.json
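Before spending evaluations on a hand-edited dataset, it can help to sanity-check the file locally. The sketch below assumes the row shape shown in the sample above (an "input" string plus an "expected_output" string) and treats "expected_output" as optional, which is an assumption carried over from the optional --expected flag. The function names are illustrative and not part of the CLI.

```python
import json


def validate_rows(rows):
    """Return a list of problems found in a dataset; an empty list means OK.

    Assumes a list of objects with a non-empty "input" string and an
    optional "expected_output" string.
    """
    if not isinstance(rows, list):
        return ["dataset must be a JSON array of rows"]
    problems = []
    for index, row in enumerate(rows, start=1):
        if not isinstance(row, dict):
            problems.append(f"row {index}: not an object")
            continue
        query = row.get("input")
        if not isinstance(query, str) or not query.strip():
            problems.append(f"row {index}: missing or empty 'input'")
        expected = row.get("expected_output")
        if expected is not None and not isinstance(expected, str):
            problems.append(f"row {index}: 'expected_output' must be a string")
    return problems


def validate_file(path):
    """Load a JSON dataset file and validate its rows."""
    with open(path, encoding="utf-8") as handle:
        return validate_rows(json.load(handle))
```

For example, `validate_file("evals/smoke-test.json")` returns an empty list when every row looks well-formed.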

Level 3 — Evaluate your own agent

Point the CLI at your agent's endpoint:

aievaluator eval \
  --agent https://chatbot-staging.acme.com/api/chat \
  --dataset ./evals/smoke-test.json \
  --metrics faithfulness,g_eval

The CLI calls your agent with each query, then an LLM judge scores the responses.
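Conceptually, that is a two-stage loop: ask the agent, then score the answer. The sketch below shows only the shape of the flow. The agent and the judge are stand-in functions invented for illustration, and the real engine's request format and scoring are not shown in this README.

```python
def evaluate_rows(rows, call_agent, judge):
    """Stage 1: ask the agent. Stage 2: have a judge score each answer."""
    results = []
    for row in rows:
        answer = call_agent(row["input"])
        score = judge(row["input"], answer, row.get("expected_output"))
        results.append({"input": row["input"], "answer": answer, "score": score})
    return results


# Stand-ins for illustration only: a canned agent and an exact-match "judge".
def fake_agent(query):
    return "Paris" if "France" in query else "I don't know"


def exact_match_judge(query, answer, expected):
    return 1.0 if expected is not None and answer == expected else 0.0
```

A real LLM judge returns graded scores rather than exact-match 0 or 1, which is why thresholds matter in the next level.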

Level 4 — Add quality gates

Not all metrics are equally important. Set different thresholds per metric:

aievaluator eval \
  --agent https://chatbot-staging.acme.com/api/chat \
  --dataset ./evals/smoke-test.json \
  --thresholds faithfulness:0.90,g_eval:0.75

faithfulness must be ≥ 90% (hallucination = instant fail)
g_eval must be ≥ 75% (general quality)

If any metric fails to meet its threshold, that row is marked ❌.
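The gating rule can be sketched in a few lines of Python. This is an illustration of the logic, not the CLI's own code: the function names are made up, and treating a missing score as 0.0 is an assumption.

```python
def parse_thresholds(spec):
    """Parse 'faithfulness:0.90,g_eval:0.75' into {'faithfulness': 0.9, ...}."""
    thresholds = {}
    for item in spec.split(","):
        name, _, value = item.partition(":")
        if not name.strip() or not value.strip():
            raise ValueError(f"expected name:value, got {item!r}")
        thresholds[name.strip()] = float(value)
    return thresholds


def failing_metrics(scores, thresholds):
    """Return the metrics whose score is below its threshold, sorted by name."""
    return sorted(
        name for name, minimum in thresholds.items()
        if scores.get(name, 0.0) < minimum  # assumption: missing score fails
    )
```

A row passes when `failing_metrics(...)` is empty. A score exactly equal to the threshold passes, matching the ≥ in the list above.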

Or set one bar for everything:

aievaluator eval \
  --agent https://chatbot-staging.acme.com/api/chat \
  --dataset ./evals/smoke-test.json \
  --min-score 0.80

This works on quick too:

aievaluator quick "test prompt" --min-score 0.80
# Exit code 1 if any metric drops below 0.80

Level 5 — Create your own evaluation criteria

Sometimes the built-in metrics aren't enough. Define a custom evaluator inline:

aievaluator eval \
  --agent https://chatbot-staging.acme.com/api/chat \
  --dataset ./evals/smoke-test.json \
  --metrics politeness,g_eval \
  --custom '{"name":"politeness","prompt":"Is the response polite and professional? Answer YES or NO and explain.","threshold":0.85}'

The custom evaluator politeness is defined in the request, referenced in --metrics by name, and evaluated alongside g_eval. No dashboard needed.

Custom evaluator with per-metric threshold override:

aievaluator eval \
  --agent $URL --dataset ./tests.json \
  --metrics politeness,g_eval \
  --custom '{"name":"politeness","prompt":"Is the tone friendly?","threshold":0.7}' \
  --thresholds politeness:0.90,g_eval:0.80

The --thresholds flag overrides whatever was set in --custom, so in this example politeness is gated at 0.90 rather than 0.7. The engine uses the per-evaluation value.
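The precedence rule can be modelled as a simple merge in which --thresholds values win. This is a sketch of the described behavior, with a hypothetical function name; it does not show how the engine implements it.

```python
import json


def effective_thresholds(custom_json, thresholds_spec=""):
    """Merge a --custom definition with a --thresholds string.

    Values from thresholds_spec win, mirroring the override described above.
    """
    custom = json.loads(custom_json)
    merged = {}
    if "threshold" in custom:
        merged[custom["name"]] = float(custom["threshold"])
    # An empty spec yields no overrides.
    for item in filter(None, thresholds_spec.split(",")):
        name, _, value = item.partition(":")
        merged[name.strip()] = float(value)
    return merged
```

With the example above, the custom definition alone gives politeness 0.7, and adding `politeness:0.90,g_eval:0.80` raises it to 0.9 and adds g_eval at 0.8.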

Level 6 — CI/CD pipeline

Add this to your GitHub Actions, GitLab CI, or Jenkins pipeline:

aievaluator eval \
  --agent $STAGING_AGENT \
  --dataset ./evals/regression.json \
  --thresholds faithfulness:0.90,g_eval:0.75 \
  --min-score 0.80 \
  --ci \
  --format junit > report.xml

Flag	What it does
`--ci`	No colors, no prompts — clean output for logs
`--format junit`	JUnit XML that CI systems understand natively
`--min-score 0.80`	Overall score must be ≥ 80%
`--thresholds`	Per-metric quality bars

Exit code 1 = pipeline fails = deploy blocked.

Environment variables for CI:

export AIEVALUATOR_API_KEY="sk-..."       # No hardcoded keys in YAML
export AIEVALUATOR_ENGINE_URL="https://api.aievaluator.dev"
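A small preflight step can fail the job early when the key is missing, instead of letting the evaluation fail later. The sketch below reads the two variables named above. Falling back to https://api.aievaluator.dev when AIEVALUATOR_ENGINE_URL is unset is an assumption, and `ci_settings` is a hypothetical helper rather than part of the package.

```python
import os

# Assumption: the hosted engine URL shown above is a sensible fallback.
DEFAULT_ENGINE_URL = "https://api.aievaluator.dev"


def ci_settings(environ=None):
    """Read the two CI variables; fail early if the API key is missing."""
    environ = os.environ if environ is None else environ
    api_key = environ.get("AIEVALUATOR_API_KEY", "").strip()
    if not api_key:
        raise RuntimeError("AIEVALUATOR_API_KEY is not set")
    engine_url = environ.get("AIEVALUATOR_ENGINE_URL", DEFAULT_ENGINE_URL)
    return {"api_key": api_key, "engine_url": engine_url.rstrip("/")}
```

Passing a dictionary instead of relying on `os.environ` keeps the helper easy to test.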

📋 Complete Command Reference

`aievaluator login`

aievaluator login                        # Interactive prompt
aievaluator login --api-key sk-xxx       # Non-interactive (CI)
aievaluator login --engine-url https://custom.engine.com

`aievaluator whoami`

aievaluator whoami
# Tenant:  acme-corp
# Tier:    pro
# Evals:   42/5000 this cycle
# Tokens:  ↓124,800 · ↑89,200 this cycle

`aievaluator quick`

# Single query
aievaluator quick "What is 2+2?" --expected "4"

# Per-metric thresholds
aievaluator quick "test" --metrics faithfulness:0.90,g_eval:0.75

# General threshold
aievaluator quick "test" --min-score 0.80

# From dataset (JSON or JSONL)
aievaluator quick --dataset ./tests.json
aievaluator quick --dataset ./tests.jsonl

# Custom judge model
aievaluator quick "test" --judge deepseek
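If you want to inspect a dataset yourself before passing it to --dataset, both file types can be read with the standard library. This sketch assumes a .json file holds a single array of rows and a .jsonl file holds one JSON object per line; `load_dataset` is an illustrative name, not a function exported by the package.

```python
import json
from pathlib import Path


def load_dataset(path):
    """Load rows from a .json array or a .jsonl file (one object per line)."""
    path = Path(path)
    text = path.read_text(encoding="utf-8")
    if path.suffix == ".jsonl":
        # Skip blank lines so a trailing newline does not break parsing.
        return [json.loads(line) for line in text.splitlines() if line.strip()]
    return json.loads(text)
```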

`aievaluator eval`

# Basic
aievaluator eval --agent $URL --dataset ./tests.json

# With quality gates
aievaluator eval --agent $URL --dataset ./tests.json \
  --thresholds faithfulness:0.90,g_eval:0.75 --min-score 0.80

# Inline rows
aievaluator eval --agent $URL \
  --rows '[{"input":"Hi","expected_output":"Hello"}]'

# Custom evaluator inline
aievaluator eval --agent $URL --dataset ./tests.json \
  --metrics my-eval --custom '{"name":"my-eval","prompt":"...","threshold":0.8}'

# CI mode
aievaluator eval --agent $URL --dataset ./tests.json --ci --format junit

# Different agent format
aievaluator eval --agent $URL --dataset ./tests.json --agent-format claude
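When rows are generated by a script, quoting the inline JSON by hand gets fragile. One option is to build the argument list in Python and let `json.dumps` and `shlex` handle the escaping. The sketch uses only flags shown above (--agent, --rows, --metrics); combining --rows with --metrics in one call is an assumption, and `build_eval_command` is a hypothetical helper.

```python
import json
import shlex


def build_eval_command(agent_url, rows, metrics=None):
    """Return an argv list for `aievaluator eval` with inline rows."""
    argv = ["aievaluator", "eval", "--agent", agent_url,
            "--rows", json.dumps(rows, ensure_ascii=False)]
    if metrics:
        argv += ["--metrics", ",".join(metrics)]
    return argv


command = build_eval_command(
    "https://chatbot-staging.acme.com/api/chat",
    [{"input": "Hi", "expected_output": "Hello"}],
    metrics=["faithfulness", "g_eval"],
)
# shlex.join renders a copy-pasteable, safely quoted shell line.
print(shlex.join(command))
```

The same list can be passed to `subprocess.run(command)`, whose `returncode` then carries the CLI's exit status.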

`aievaluator config`

aievaluator config show
aievaluator config set default-metrics "faithfulness,g_eval"
aievaluator config set default-min-score 0.80
aievaluator config unset default-min-score

`aievaluator init`

aievaluator init
# Creates aievaluator.config.json + evals/smoke-test.json + updates .gitignore

📊 Output Formats

Table (default)

Human-readable table with scores, pass/fail icons, and token counts.

JSON (`--format json`)

aievaluator eval ... --format json | jq '.overall_score'

Clean JSON on stdout. All logs/warnings go to stderr.
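Because stdout carries only JSON, the report can also be consumed from a script. The sketch below relies on the `overall_score` field used in the jq example and nothing else. The scale of that field is not stated here, so the caller is assumed to pass a minimum on the same scale.

```python
import json


def gate_on_overall_score(report_text, minimum):
    """Return a process exit code: 0 if overall_score >= minimum, else 1."""
    report = json.loads(report_text)
    # Assumption: minimum is expressed on the same scale as overall_score.
    return 0 if float(report["overall_score"]) >= minimum else 1
```

In a pipeline this would typically end with `sys.exit(gate_on_overall_score(sys.stdin.read(), 0.80))`.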

JUnit XML (`--format junit`)

aievaluator eval ... --format junit > report.xml

Native CI integration: one <testcase> per query, with a <failure> element for each query below threshold.
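If your CI system does not render JUnit reports, a short script can still summarize report.xml. The sketch below assumes the generic JUnit layout, where <failure> is a child of <testcase> and the query is stored in the `name` attribute. The sample XML is hand-written for illustration and is not real CLI output.

```python
import xml.etree.ElementTree as ET


def summarize_junit(xml_text):
    """Count test cases and collect the names of those with a <failure>."""
    root = ET.fromstring(xml_text)
    cases = list(root.iter("testcase"))
    failed = [case.get("name") for case in cases if case.find("failure") is not None]
    return {"total": len(cases), "failed": failed}


# Hand-written sample in the generic JUnit layout, not real CLI output.
SAMPLE = """<testsuite name="sample" tests="2" failures="1">
  <testcase name="What are your business hours?"/>
  <testcase name="How do I cancel my order?">
    <failure message="below threshold"/>
  </testcase>
</testsuite>"""
```

Using `root.iter("testcase")` works whether the root element is <testsuite> or a wrapping <testsuites>.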

🤖 VS Code Extension

Prefer staying in your editor? Install the VS Code extension.

Select text → right-click → Evaluate
Per-metric threshold editor with preset buttons
Custom evaluator support via Command Palette
Sidebar with evaluation history
Dataset file evaluation (JSON + JSONL)

Full VS Code tutorial →

Requirements

Python 3.10+

Project details

This version

1.0.1

Jun 25, 2026

Download files

Source Distribution

aievaluator-1.0.1.tar.gz (28.8 kB)

Uploaded Jun 25, 2026 Source

Built Distribution

aievaluator-1.0.1-py3-none-any.whl (16.0 kB)

Uploaded Jun 25, 2026 Python 3

File details

Details for the file aievaluator-1.0.1.tar.gz.

File metadata

Download URL: aievaluator-1.0.1.tar.gz
Upload date: Jun 25, 2026
Size: 28.8 kB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.2.0 CPython/3.12.13

File hashes

Hashes for aievaluator-1.0.1.tar.gz
Algorithm	Hash digest
SHA256	`61d3e423673b01e7e1136ad8f3de0afe56276d3f1b7b404523d597113f9d5faa`
MD5	`952ab4dba1e8536f85c854aca4ea220f`
BLAKE2b-256	`721d42c5e0086eec258bde6089dd7aa9b5d49665417a937692c80cf269a041bf`

File details

Details for the file aievaluator-1.0.1-py3-none-any.whl.

File metadata

Download URL: aievaluator-1.0.1-py3-none-any.whl
Upload date: Jun 25, 2026
Size: 16.0 kB
Tags: Python 3
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.2.0 CPython/3.12.13

File hashes

Hashes for aievaluator-1.0.1-py3-none-any.whl
Algorithm	Hash digest
SHA256	`bbab7a322f6856941099dbfa2dc28d2e2823d1b3a470e36964a61860988d7e84`
MD5	`95b6d7035687066836c0061d3d352461`
BLAKE2b-256	`598a7d58621b4e798528d6d22b96e0601fcbec083f46496dba765f5bc6c5f098`
