
AgentEval

Open-source multi-turn AI agent simulation and evaluation.

Define a test scenario in YAML, run it against your agent's HTTP endpoint, and get a structured pass/fail report — from your terminal, for free, with no mandatory cloud dependency.

pip install agent-eval-cli
agenteval run test_cases/ --mode scripted

The gap this fills

Every existing tool evaluates what your agent already said. You bring the conversation; they score it.

Nobody simulates the conversation for you — free — against any endpoint.

Tool         Multi-turn simulation   Free   Open source
LangSmith    No                      No     No
Braintrust   No                      No     No
Promptfoo    No                      Yes    Yes
DeepEval     No                      Yes    Yes
Coval        Yes                     No     No
AgentEval    Yes                     Yes    Yes

Install

# Core — scripted mode only, no ML scorers
pip install agent-eval-cli

# + Groq simulation mode (free API key, no credit card)
pip install "agent-eval-cli[groq]"

# + ML scorers (sentence-transformers, ~500MB)
pip install "agent-eval-cli[ml]"

# Everything
pip install "agent-eval-cli[all]"

Requires Python 3.9+


Quick start

1. Write a test case:

# test_cases/booking_happy_path.yaml
meta:
  scenario_id: "booking-happy-path-001"
  name: "Patient successfully books a doctor's appointment"
  tags: ["healthcare", "booking"]

agent:
  endpoint: "http://localhost:8080/chat"
  method: "POST"
  headers:
    Content-Type: "application/json"
  request_template: |
    {
      "session_id": "${SESSION_ID}",
      "message": "${USER_MESSAGE}"
    }
  response_path: "response"
  timeout_seconds: 30

user_persona:
  name: "Sarah Chen"
  tone: "polite but slightly anxious"
  background: "First-time patient, not tech-savvy"
  opening_message: "Hi, I need to see a doctor about my knee."

outcome_type: success   # success | refusal | escalation

conversation:
  max_turns: 12
  min_turns: 3
  expected_turns: 8

evaluation:
  success_intent: "User receives a confirmed appointment with specific date, time, and doctor name"
  success_keywords:
    - "appointment confirmed"
    - "booked for"
    - "confirmation number"
  forbidden_phrases:
    - "I don't know"
    - "I cannot help"
  context_facts:
    - "Appointments available Monday through Friday"
    - "Dr. Smith specialises in orthopedics"
    - "Clinic hours are 8am to 6pm"
  thresholds:
    task_completion:       0.65
    instruction_following: 0.80
    turn_efficiency:       0.60
    aggregate:             0.70

scripted_turns:
  - "Hi, I need to see a doctor about my knee. It's been hurting for two weeks."
  - "Is Dr. Smith available this week?"
  - "Thursday at 2pm works for me."
  - "My name is Sarah Chen, date of birth March 12 1990."
  - "Yes, that email is correct."
  - "Great, thank you so much."

2. Run:

agenteval run test_cases/ --mode scripted

3. View results:

agenteval dashboard ./reports/
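
The agent block in step 1 assumes a plain JSON-in/JSON-out endpoint. If you want to try the quick start without a real agent, here is a minimal stub server compatible with that contract (a sketch using FastAPI; the canned reply is illustrative, not part of AgentEval):

# minimal_agent.py — stub endpoint matching the test case's request_template
# and response_path above. Not part of AgentEval; just a compatible target.
from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI()

class ChatRequest(BaseModel):
    session_id: str
    message: str

@app.post("/chat")
def chat(req: ChatRequest) -> dict:
    # A real agent would generate a reply here; the "response" key matches
    # response_path: "response" in the YAML above.
    return {"response": f"(session {req.session_id}) You said: {req.message}"}

# Run with: uvicorn minimal_agent:app --port 8080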

Simulation modes

--mode scripted — deterministic CI mode

Sends the fixed scripted_turns from your YAML in order. Zero API calls. Zero variance. After the last scripted turn, the runner automatically signals goal completion if the agent responded without error.

Use this for CI/CD gates — results are reproducible across every run.

agenteval run test_cases/ --mode scripted --concurrency 4 --fail-on-threshold 0.70

--mode groq — realistic simulation

Llama 3.3 70B (via Groq's free API) plays the user side, guided by the persona and outcome type you defined. Its output has natural variance, which makes it useful for exploratory evaluation and for finding edge cases scripted mode misses.

Requires a free Groq API key — no credit card.

export GROQ_API_KEY=your_key_here
agenteval run test_cases/ --mode groq

Privacy notice: when using --mode groq, your context_facts are sent to Groq's API. Use --mode scripted for sensitive or proprietary data.


Scorers

All scoring runs locally after the conversation completes. No API calls.

These are heuristic signals designed to catch obvious failures — not authoritative ground truth. Calibrate thresholds against real transcripts before using them as hard CI gates.

Scorer                  What it measures                                                      Requires [ml]
Task Completion         Semantic similarity between agent responses and the outcome intent    Yes
Instruction Following   Did the agent obey forbidden/required phrase rules?                   No
Response Coherence      Are agent responses contextually relevant to what was asked?          Yes
Turn Efficiency         Did the agent resolve the goal in a reasonable number of turns?       No
Hallucination Risk      Did the agent state facts not grounded in context_facts?              Yes
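
The non-ML checks reduce to keyword scans over the transcript. A minimal sketch of an Instruction Following-style check (a hypothetical function illustrating the idea, not AgentEval's internals):

# Hypothetical sketch of a forbidden/required phrase check, showing the kind
# of non-ML scoring the Instruction Following scorer performs.
def keyword_score(agent_turns: list[str],
                  success_keywords: list[str],
                  forbidden_phrases: list[str]) -> float:
    text = " ".join(agent_turns).lower()
    if any(p.lower() in text for p in forbidden_phrases):
        return 0.0  # any forbidden phrase is an outright failure
    if not success_keywords:
        return 1.0
    hits = sum(1 for kw in success_keywords if kw.lower() in text)
    return hits / len(success_keywords)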

Task Completion, Response Coherence, and Hallucination Risk return None when [ml] is not installed; they are excluded from the aggregate and the remaining weights are re-normalised automatically.

Default aggregate weights:

Scorer                  Weight
Task Completion         30%
Instruction Following   25%
Response Coherence      20%
Turn Efficiency         15%
Hallucination Risk      10%
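
When a scorer returns None, its weight is dropped and the surviving weights are rescaled to sum to 1. A sketch of that arithmetic (the weight keys are hypothetical; this is not AgentEval's internal code):

from typing import Optional

# Default aggregate weights from the table above.
WEIGHTS = {
    "task_completion": 0.30,
    "instruction_following": 0.25,
    "response_coherence": 0.20,
    "turn_efficiency": 0.15,
    "hallucination_risk": 0.10,
}

def aggregate(scores: dict[str, Optional[float]]) -> float:
    # Drop scorers that returned None (e.g. ML scorers without [ml] installed)
    # and re-normalise the surviving weights so they sum to 1.
    available = {name: s for name, s in scores.items() if s is not None}
    total = sum(WEIGHTS[name] for name in available)
    return sum(WEIGHTS[name] / total * s for name, s in available.items())

# Without [ml], only the two non-ML scorers contribute:
# weights 0.25 and 0.15 rescale to 0.625 and 0.375.
print(aggregate({"instruction_following": 0.9, "turn_efficiency": 0.6}))  # 0.7875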

CLI reference

# Run test cases (scripted mode — CI-safe, no API key)
agenteval run test_cases/ --mode scripted

# Run with Groq simulation
agenteval run test_cases/ --mode groq

# Filter scenarios by tag
agenteval run test_cases/ --mode scripted --tag customer-support

# Parallel execution (default: 4)
agenteval run test_cases/ --mode scripted --concurrency 8

# Fail with non-zero exit code when aggregate score is below threshold
agenteval run test_cases/ --mode scripted --fail-on-threshold 0.70

# Save reports to a directory (auto-timestamped filenames)
agenteval run test_cases/ --mode scripted --output-dir ./reports/

# Print full conversation transcript on failure
agenteval run test_cases/ --mode scripted --verbose

# Validate YAML test cases without running
agenteval validate test_cases/booking.yaml
agenteval validate test_cases/

# Open dashboard for all reports in a directory
agenteval dashboard ./reports/

# Version
agenteval --version

YAML test case schema

outcome_type

Three outcome types are supported:

  • success — the agent should complete the user's request
  • refusal — the agent should correctly decline (e.g. no availability, policy block)
  • escalation — the agent should route the user to a human

Refusal example:

outcome_type: refusal
policy_reason: "No Sunday appointments available per clinic policy"

evaluation:
  refusal_intent: "Agent clearly communicates no availability without offering to book an unavailable slot"
  refusal_keywords:
    - "unfortunately"
    - "no availability"
    - "fully booked"
  forbidden_phrases:
    - "appointment confirmed"
    - "I can book that"

scripted_turns:
  - "I need to book an appointment for next Sunday."
  - "What about Monday at 9pm?"
  - "Is there anything at all this week?"

Environment variable substitution

The following variables are substituted into request_template at runtime:

Variable           Value
${SESSION_ID}      UUID4 generated per scenario; prevents state leakage on stateful agents
${USER_MESSAGE}    The current user turn content
${AGENT_API_KEY}   Resolved from the environment (set in .env or CI secrets)

Any ${VAR} pattern in the template is resolved from the environment.
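
The resolution behaves like Python's string.Template applied to the built-in values merged with the process environment. A sketch of the semantics (not AgentEval's actual implementation):

import os
import uuid
from string import Template

# Built-in variables take precedence over the process environment.
template = '{"session_id": "${SESSION_ID}", "message": "${USER_MESSAGE}"}'
values = {
    **os.environ,                      # e.g. AGENT_API_KEY from .env or CI secrets
    "SESSION_ID": str(uuid.uuid4()),   # fresh UUID4 per scenario
    "USER_MESSAGE": "Hi, I need to see a doctor about my knee.",
}
body = Template(template).safe_substitute(values)  # unknown ${VAR}s left as-is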

response_path

Uses a JMESPath expression to extract the agent's reply from the JSON response body.

response_path: "response.text"       # {"response": {"text": "..."}}
response_path: "choices[0].message.content"  # OpenAI-style
response_path: "reply"               # {"reply": "..."}

CI/CD — GitHub Action

# .github/workflows/agenteval.yml
name: AgentEval — AI Agent Tests

on:
  pull_request:
    branches: [main]
  push:
    branches: [main]

jobs:
  agent-eval:
    runs-on: ubuntu-latest
    timeout-minutes: 15

    steps:
      - uses: actions/checkout@v4

      - uses: actions/setup-python@v5
        with:
          python-version: "3.11"

      - name: Cache pip
        uses: actions/cache@v4
        with:
          path: ~/.cache/pip
          key: ${{ runner.os }}-pip-${{ hashFiles('**/pyproject.toml') }}

      - name: Install AgentEval
        run: pip install "agent-eval-cli==0.1.0"

      - name: Start agent under test
        run: docker compose up -d agent   # replace 'agent' with your service name

      - name: Wait for agent readiness
        run: |
          for i in {1..12}; do
            if curl -sf http://localhost:8080/health > /dev/null 2>&1; then
              echo "Agent ready."; exit 0
            fi
            echo "Attempt $i/12 — retrying in 5s..."; sleep 5
          done
          echo "Agent did not become ready in 60s."; exit 1

      - name: Run AgentEval
        env:
          AGENT_API_KEY: ${{ secrets.AGENT_API_KEY }}
        run: |
          agenteval run test_cases/ \
            --mode scripted \
            --concurrency 4 \
            --output-dir ./reports/ \
            --fail-on-threshold 0.70

      - name: Upload reports
        uses: actions/upload-artifact@v4
        if: always()
        with:
          name: agenteval-reports-${{ github.sha }}
          path: ./reports/
          retention-days: 30

Reports are uploaded as a build artifact on every run. Use actions/download-artifact in a follow-up job to post results as a PR comment.


Demo agent (try it in 2 minutes)

A ready-made customer support agent for TableEase (a restaurant booking service) is included so you can run AgentEval end-to-end without building your own agent first.

# Install dependencies
cd demo_agent
pip install -r requirements.txt

# Set your Groq API key (free at console.groq.com)
export GROQ_API_KEY=your_key_here

# Start the agent
uvicorn main:app --reload
# Agent running at http://localhost:8000/chat

Then in a second terminal, run the included test cases:

agenteval run test_cases/ --mode scripted --output-dir ./reports/
agenteval dashboard ./reports/

The demo agent handles table bookings, cancellation policy questions, competitor refusals, and escalations to a human manager — covering all three outcome_type values.
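
Before running the full suite, you can sanity-check the demo agent with a one-off request (a sketch; it assumes the demo accepts the same JSON shape as the quick-start request_template):

import uuid
import requests

# Hypothetical smoke test against the demo agent's /chat endpoint.
resp = requests.post(
    "http://localhost:8000/chat",
    json={"session_id": str(uuid.uuid4()),
          "message": "Can I book a table for two tonight?"},
    timeout=30,
)
resp.raise_for_status()
print(resp.json())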


Dashboard

agenteval dashboard ./reports/
# Opens http://localhost:8080 in your browser

Features:

  • Run summary with pass/fail counts and aggregate score
  • Radar chart showing all 5 scorer dimensions per scenario
  • Per-scenario score breakdown with pass/fail per scorer
  • Full conversation transcript replay (chat bubble view)
  • Evidence items flagged by instruction-following and hallucination scorers
  • Multi-report dropdown to switch between runs

v1 scope

v1 supports synchronous JSON over HTTP POST only.

Out of scope for v1 (planned for v2):

  • WebSocket / SSE / streaming responses
  • OAuth and cookie-based authentication
  • Multi-message response payloads
  • Non-JSON payloads

License

MIT
