Persona-based testing framework for AI agents

Anek

Persona-based testing framework for AI agents.

Agents are not APIs. A structured input/output test can pass while your agent still fails real users: it may handle the neutral English baseline yet fail Maria Garcia when she writes "hola can you help me i need change my password". Anek finds those failures before your users do.

You describe conversations in Gherkin .feature files. Anek replays them against your agent as realistic user personas and uses Claude as a judge to evaluate the outcomes. Writing tests requires no code.

Feature: Password Reset Flow

  Background:
    Given the agent is available at "http://localhost:8080/health"

  @all_personas
  Scenario: User initiates password reset
    When the user says "I can't log into my account"
    Then the response should contain "password"
    And  the response should not contain "SSO"

    When the user says "I forgot my password and need to reset it"
    Then the response should contain "email"
    And  the response time should be under 5000ms

    When the user says "I got the reset email, what do I do now?"
    Then the goal should be achieved with "Agent guided the user through password reset without jargon"

Run it:

$ anek run features/password_reset.feature --agent http://localhost:8080/chat

Feature: Password Reset Flow
  ──────────────────────────────────────────────────────────────

  Scenario: User initiates password reset  |  persona: maria_garcia  |  PASS
    Background
    Given  the agent is available at "http://localhost:8080/health"  ✓
    When   the user says "I can't log into my account"
           → persona: "hola no puedo entrar a mi cuenta ayuda"
           ← agent:   "Hi! I can help you get back into your account..."
    Then   the response should contain "password"              ✓
    And    the response should not contain "SSO"               ✓
    ...
    Then   the goal should be achieved with "..."
           ✓ goal_achieved (confidence=92%): Agent clearly guided...

  Scenario: User initiates password reset  |  persona: elderly_user  |  FAIL
    ...
    Then   the response should not contain "SSO"               ✗
           ✗ "SSO" unexpectedly present in response

  ─────────────────────────────────────────────────────────────
  Persona              Status    Turns  Failed Steps
  ─────────────────────────────────────────────────
  maria_garcia         ✓ PASS    3      —
  elderly_user         ✗ FAIL    3      "SSO" unexpectedly present
  gen_z_user           ✓ PASS    3      —
  indian_english       ✓ PASS    3      —
  neutral_baseline     ✓ PASS    3      —
  ─────────────────────────────────────────────────
  Passed: 4/5  Failed: 1/5

Installation

pip install anek

Requires Python 3.11+ and an Anthropic API key (for persona simulation and LLM-as-judge evaluation).


Quick Start

1. Set your API key

export ANTHROPIC_API_KEY=sk-ant-...

2. Start the demo testbed agent

The repo ships a free demo agent (TechCorp Support Bot) that uses Groq's free API — no credit card needed.

pip install "anek[testbed]"
export GROQ_API_KEY=gsk_...
python testbed/agent.py
# → TechCorp Support Bot running at http://localhost:8080

3. Run a feature file

anek run features/password_reset.feature --agent http://localhost:8080/chat

Results are saved to ./anek-results/ as JSON.


Writing Tests

Tests are standard Gherkin .feature files. No step definitions needed — Anek has built-in handlers for all step patterns.

Feature file structure

Feature: Name of the feature being tested

  Background:
    Given the agent is available at "http://localhost:8080/health"

  @all_personas
  Scenario: Descriptive scenario name
    When the user says "base message in plain English"
    Then the response should contain "expected keyword"
    And  the response should not contain "forbidden word"
    And  the response time should be under 3000ms

    When the user says "follow-up message"
    Then the goal should be achieved with "description of success"
    And  the sentiment should be positive

Persona tags

Control which personas run a scenario using @ tags on the Scenario line:

Tag                                  Behaviour
@all_personas                        Run against every persona in the personas directory
@persona:maria_garcia                Run against one specific persona
@personas:elderly_user,gen_z_user    Run against a named subset

If a scenario carries several @persona:X tags, it runs once for each tagged persona.
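The tag rules above can be sketched in a few lines of Python. This is an illustration only, not Anek's actual parser: the function name resolve_personas and its behaviour on unknown names are assumptions.

```python
def resolve_personas(tags: list[str], available: list[str]) -> list[str]:
    """Return the personas a scenario runs against, in first-seen order.

    Hypothetical helper: it mirrors the three tag forms from the table,
    not Anek's internal implementation.
    """
    selected: list[str] = []
    for tag in tags:
        if tag == "@all_personas":
            names = list(available)
        elif tag.startswith("@persona:"):
            names = [tag.removeprefix("@persona:")]
        elif tag.startswith("@personas:"):
            names = tag.removeprefix("@personas:").split(",")
        else:
            continue  # not a persona tag, ignore it
        for name in (n.strip() for n in names):
            if name not in available:
                raise ValueError(f"unknown persona: {name}")
            if name not in selected:
                selected.append(name)  # each persona runs once
    return selected


available = ["neutral_baseline", "maria_garcia", "elderly_user"]
print(resolve_personas(["@persona:maria_garcia", "@persona:elderly_user"], available))
```

Each name in the returned list corresponds to one separate run of the scenario.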

The When step — how personas work

The message in a When step is a base intent in plain English. Anek calls Claude to rewrite it as the persona would naturally type it before sending to your agent. The feature file stays readable; the persona transformation happens at runtime.

When the user says "I can't log in"
# maria_garcia sends: "hola no puedo entrar ayuda!!"
# elderly_user sends: "Good afternoon. I am unable to access my account, I'm afraid."
# gen_z_user   sends: "cant log in?? pls 😭"
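To make the runtime transformation concrete, here is a minimal sketch of how a rewrite prompt could be assembled from a persona and a base intent. The function name build_rewrite_prompt and the prompt wording are assumptions; the prompt Anek actually sends to Claude is not shown in this document.

```python
def build_rewrite_prompt(persona: dict, base_message: str) -> str:
    """Build an instruction asking an LLM to restyle a message for a persona.

    Hypothetical prompt layout; only the persona fields come from the
    persona YAML format described in this README.
    """
    traits = "\n".join(f"- {t}" for t in persona.get("traits", []))
    samples = "\n".join(f'- "{s}"' for s in persona.get("sample_phrases", []))
    return (
        f"You are role-playing this user: {persona['description']}.\n"
        f"English proficiency: {persona['english_proficiency']}.\n"
        f"Writing traits:\n{traits}\n"
        f"Example phrases:\n{samples}\n\n"
        "Rewrite the message below as this user would type it. "
        "Keep the intent unchanged and reply with the rewritten message only.\n\n"
        f"Message: {base_message}"
    )


maria = {
    "description": "Native Spanish speaker, mid-30s, uses Spanglish occasionally",
    "english_proficiency": "intermediate",
    "traits": ["mixes Spanish words naturally mid-sentence"],
    "sample_phrases": ["the app no work for me since yesterday"],
}
print(build_rewrite_prompt(maria, "I can't log in"))
```

The feature file only ever contains the base message; the persona-specific text exists only in the prompt and in the model's reply.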

Built-in verifiers (Then steps)

Step pattern                                         What it checks
the response should contain "text"                   Case-insensitive substring match
the response should not contain "text"               Absence check
the response time should be under Nms                Latency threshold
the sentiment should be positive|neutral|negative    Claude judges sentiment
the goal should be achieved with "description"       Claude judges full transcript against goal
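The three deterministic verifiers can be sketched as plain functions. The Turn type and function names below are hypothetical, and treating the absence check as case-insensitive is an assumption; the table only states case-insensitivity for the contain step.

```python
from dataclasses import dataclass


@dataclass
class Turn:
    """One agent reply plus how long it took (hypothetical record type)."""
    response: str
    elapsed_ms: float


def contains(turn: Turn, text: str) -> bool:
    # Case-insensitive substring match, as described for "should contain".
    return text.lower() in turn.response.lower()


def not_contains(turn: Turn, text: str) -> bool:
    # Assumption: the absence check uses the same case-insensitive match.
    return not contains(turn, text)


def under_ms(turn: Turn, limit_ms: float) -> bool:
    # Latency threshold: strictly under the limit.
    return turn.elapsed_ms < limit_ms


turn = Turn("Please check your email for a Password reset link.", 1200.0)
print(contains(turn, "password"), not_contains(turn, "SSO"), under_ms(turn, 5000))
```

The sentiment and goal steps are different in kind: they need an LLM judgement over the reply or transcript, so they cannot be reduced to a pure function like these.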

Personas

Personas live as YAML files in the personas/ directory. Five starter personas are included:

Name                Description
neutral_baseline    Standard American English, the control group
maria_garcia        Native Spanish speaker, intermediate English, Spanglish
elderly_user        Formal, verbose, confused by technical terms
gen_z_user          Lowercase, terse, slang, emojis
indian_english      Formal Indian English, "kindly revert", "do the needful"

Persona YAML format

name: maria_garcia
description: Native Spanish speaker, mid-30s, uses Spanglish occasionally
language_background: spanish_native
english_proficiency: intermediate  # native | fluent | advanced | intermediate | basic
traits:
  - omits articles occasionally ("I need help with account")
  - mixes Spanish words naturally mid-sentence
  - minimal punctuation, mostly lowercase
sample_phrases:
  - "hola can you help me i need change my password"
  - "the app no work for me since yesterday"

Validate a persona file:

anek personas validate personas/maria_garcia.yaml
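As a rough idea of what validation involves, the sketch below checks a persona that has already been parsed into a dict (for example with a YAML loader). The Persona dataclass, the validate_persona function, and the exact rules are assumptions based on the format above, not the checks performed by anek personas validate.

```python
from dataclasses import dataclass, field, fields

# Allowed values taken from the comment in the YAML example above.
PROFICIENCY_LEVELS = {"native", "fluent", "advanced", "intermediate", "basic"}


@dataclass
class Persona:
    name: str
    description: str
    language_background: str
    english_proficiency: str
    traits: list[str] = field(default_factory=list)
    sample_phrases: list[str] = field(default_factory=list)


def validate_persona(data: dict) -> Persona:
    """Check required keys, unknown keys, and the proficiency level."""
    known = {f.name for f in fields(Persona)}
    required = {"name", "description", "language_background", "english_proficiency"}
    missing = sorted(required - data.keys())
    if missing:
        raise ValueError(f"missing fields: {', '.join(missing)}")
    unknown = sorted(data.keys() - known)
    if unknown:
        raise ValueError(f"unknown fields: {', '.join(unknown)}")
    if data["english_proficiency"] not in PROFICIENCY_LEVELS:
        raise ValueError(f"invalid english_proficiency: {data['english_proficiency']}")
    return Persona(**data)


persona = validate_persona({
    "name": "maria_garcia",
    "description": "Native Spanish speaker, mid-30s, uses Spanglish occasionally",
    "language_background": "spanish_native",
    "english_proficiency": "intermediate",
    "traits": ["minimal punctuation, mostly lowercase"],
})
print(persona.name, persona.english_proficiency)
```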

CLI Reference

# Run a feature file against all tagged personas
anek run features/password_reset.feature --agent http://localhost:8080/chat

# Run with specific personas only
anek run features/password_reset.feature \
  --agent http://localhost:8080/chat \
  --personas elderly_user,maria_garcia

# Custom response field extraction (if your agent returns {"data": {"text": "..."}})
anek run features/test.feature \
  --agent http://myagent.com/chat \
  --response-path data.text \
  --message-field query

# List available personas
anek personas list

# Validate a persona file
anek personas validate personas/my_persona.yaml

# Skip saving JSON results
anek run features/test.feature --agent http://... --no-save
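The --message-field and --response-path options describe how the request body is built and where the reply is read from. A minimal sketch of that mapping follows; build_payload and extract_reply are hypothetical names, the path is assumed to be a dot-separated key path, and the defaults are taken from the example configuration file.

```python
from typing import Any


def build_payload(message: str, message_field: str = "message") -> dict[str, str]:
    """Wrap the outgoing user message under the configured JSON field."""
    return {message_field: message}


def extract_reply(body: dict[str, Any], response_path: str = "reply") -> str:
    """Follow a dot-separated path such as "data.text" into the response body."""
    current: Any = body
    for key in response_path.split("."):
        if not isinstance(current, dict) or key not in current:
            raise KeyError(f"response path {response_path!r} not found at {key!r}")
        current = current[key]
    return str(current)


print(build_payload("I can't log in", "query"))
print(extract_reply({"data": {"text": "Hi! I can help."}}, "data.text"))
```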

Exit codes

  • 0 — all scenarios passed
  • 1 — one or more scenarios failed
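The exit-code rule is simple enough to state as code. The sketch below derives the summary line and exit code from per-persona results; the summarize function and its input shape are assumptions for illustration.

```python
def summarize(results: dict[str, bool]) -> tuple[str, int]:
    """Return a summary line and exit code from persona -> passed results."""
    total = len(results)
    passed = sum(results.values())
    failed = total - passed
    line = f"Passed: {passed}/{total}  Failed: {failed}/{total}"
    return line, 0 if failed == 0 else 1  # 1 as soon as anything failed


line, code = summarize({
    "maria_garcia": True,
    "elderly_user": False,
    "gen_z_user": True,
    "indian_english": True,
    "neutral_baseline": True,
})
print(line, code)
```

Because any failure yields a non-zero exit code, a CI job that runs anek run fails the build without extra scripting.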

Configuration

Copy anek.config.yaml.example to anek.config.yaml:

anthropic_api_key: ${ANTHROPIC_API_KEY}
default_agent_endpoint: http://localhost:8080/chat
response_path: reply          # JSONPath to extract agent reply
message_field: message        # JSON field for outgoing user message
personas_dir: ./personas
results_dir: ./anek-results

Environment variables in ${VAR} syntax are expanded automatically. The CLI --agent flag overrides default_agent_endpoint.


Testbed Agent

testbed/agent.py is a self-contained FastAPI agent for testing Anek itself. It plays the role of TechCorp Support Bot — a deliberately scoped agent that handles password reset, account inquiries, and billing, and refuses anything outside that scope.

pip install "anek[testbed]"
export GROQ_API_KEY=gsk_...   # free at console.groq.com
python testbed/agent.py

It uses Groq's free tier (llama-3.1-8b-instant) by default. Any OpenAI-compatible API works via LLM_BASE_URL:

LLM_BASE_URL=http://localhost:11434/v1 LLM_MODEL=llama3.1 python testbed/agent.py

Project Structure

anek/
├── anek/
│   ├── cli.py              # Click CLI entry point
│   ├── feature_parser.py   # Gherkin .feature file parser
│   ├── persona.py          # Persona model + loader
│   ├── llm.py              # Claude API wrapper (simulation + judge)
│   ├── simulator.py        # Test orchestration engine
│   ├── verifiers.py        # Built-in step verifiers
│   ├── reporter.py         # Rich CLI + JSON output
│   └── drivers/
│       ├── base.py         # AgentDriver protocol
│       └── http.py         # HTTP REST driver
├── personas/               # Starter persona YAML files
├── features/               # Example .feature files
├── testbed/                # Demo agent (TechCorp Support Bot)
├── tests/                  # Anek's own test suite
└── pyproject.toml
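The drivers/ package separates the transport from the test engine. The contents of drivers/base.py are not shown here, so the sketch below is only a guess at the general shape: the AgentDriverSketch protocol, its send method, and the ScriptedDriver class are all hypothetical names.

```python
from typing import Protocol


class AgentDriverSketch(Protocol):
    """Hypothetical driver interface: take a user message, return the reply."""

    def send(self, message: str) -> str: ...


class ScriptedDriver:
    """In-memory stand-in for an agent, useful for trying out the flow offline."""

    def __init__(self, replies: dict[str, str], default: str = "Sorry, I can't help with that."):
        self._replies = replies
        self._default = default

    def send(self, message: str) -> str:
        # Return the first scripted reply whose keyword appears in the message.
        lowered = message.lower()
        for keyword, reply in self._replies.items():
            if keyword in lowered:
                return reply
        return self._default


def run_turns(driver: AgentDriverSketch, messages: list[str]) -> list[str]:
    """Send each message in order and collect the replies."""
    return [driver.send(m) for m in messages]


driver = ScriptedDriver({"log in": "I can help you reset your password."})
print(run_turns(driver, ["hola no puedo log in ayuda", "what is the weather"]))
```

A structural protocol like this lets an HTTP driver and a test double be swapped without changing the code that runs scenarios.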

Contributing

git clone https://github.com/souratendu/anek
cd anek
python -m venv .venv && source .venv/bin/activate
pip install -e ".[dev]"
pytest tests/

Pull requests welcome. If you build a new persona, verifier type, or driver, open a PR.


License

MIT — see LICENSE.
