
Arksim

Open-source framework for simulating and evaluating conversational AI agents


Documentation · Examples · Report a Bug


Demo video coming soon


What is Arksim?

Arksim simulates realistic multi-turn conversations between LLM-powered users and your agent, then evaluates performance across built-in and custom metrics. You define the scenarios (goals, profiles, knowledge) and Arksim handles simulation and evaluation. Works with any agent that exposes a Chat Completions API or A2A protocol endpoint.

Arksim flow: Scenarios → Simulation → Evaluation → Reports

Why Arksim?

  • Realistic simulations: LLM-powered users with distinct profiles, goals, and personality traits
  • Comprehensive evaluation: 7 built-in metrics covering helpfulness, coherence, faithfulness, goal completion, and more
  • Custom metrics: Define your own quantitative and qualitative metrics with full access to conversation context
  • Error detection: Automatically categorize agent failures (false information, disobeying requests, repetition) with severity levels
  • Protocol-agnostic: Works with Chat Completions API, A2A protocol, or any HTTP endpoint
  • Multi-provider: Use OpenAI, Anthropic Claude, or Google Gemini as the evaluation LLM
  • Parallel execution: Configurable concurrency for both simulation and evaluation
  • Visual reports: Interactive HTML reports with score breakdowns, error analysis, and full conversation viewer

Quickstart

Install

pip install arksim

For additional LLM providers:

pip install arksim[all]        # All providers
pip install arksim[anthropic]  # Anthropic Claude only
pip install arksim[gemini]     # Google Gemini only

Set up credentials

export OPENAI_API_KEY="your-key"
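If you evaluate with a different provider, export its key instead. The variable names below follow each provider's common convention; check the Arksim docs for the exact names it reads:

```shell
export ANTHROPIC_API_KEY="your-key"  # provider: claude
export GOOGLE_API_KEY="your-key"     # provider: gemini
```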

Create a config

# config.yaml
agent_config:
  agent_type: chat_completions
  agent_name: my-agent
  api_config:
    endpoint: https://api.openai.com/v1/chat/completions
    headers:
      Content-Type: application/json
      Authorization: "Bearer ${OPENAI_API_KEY}"
    body:
      model: gpt-5.1
      messages:
        - role: system
          content: "You are a helpful assistant."

scenario_file_path: ./scenarios.json
model: gpt-5.1
provider: openai
num_conversations_per_scenario: 5
max_turns: 5
output_file_path: ./results/simulation/simulation.json
output_dir: ./results/evaluation
generate_html_report: true
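The config above points at a scenarios file. A minimal sketch of what one entry might contain — the field names here (`goal`, `profile`, `knowledge`) are inferred from the `ScoreInput` attributes used by custom metrics, so treat this as illustrative and consult the documentation for the exact schema:

```json
[
  {
    "goal": "Get a refund for a damaged order",
    "profile": "Impatient customer who prefers short answers",
    "knowledge": "The order arrived broken; the refund policy allows returns within 30 days"
  }
]
```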

Run

# Simulate conversations, then evaluate
arksim simulate-evaluate config.yaml

# Or run each step separately
arksim simulate config.yaml
arksim evaluate config.yaml

View results

Open the generated HTML report in ./results/evaluation/, or launch the web UI:

arksim ui

Agent Configuration

Agent configuration tells Arksim how to connect to your agent. It is specified directly in your YAML config file. Arksim supports two protocols:

Chat Completions API

agent_config:
  agent_type: chat_completions
  agent_name: my-agent
  api_config:
    endpoint: http://localhost:8888/chat/completions
    headers:
      Content-Type: application/json
      Authorization: "Bearer ${AGENT_API_KEY}"
    body:
      messages:
        - role: system
          content: "You are a helpful assistant."

A2A (Agent-to-Agent) Protocol

agent_config:
  agent_type: a2a
  agent_name: my-agent
  api_config:
    endpoint: http://localhost:9000/agent

Environment variables in headers are resolved at runtime using ${VAR_NAME} syntax.
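This behaves like shell-style variable expansion. As an illustration only (not Arksim's actual implementation), a minimal resolver could look like:

```python
import os
import re


def resolve_env_vars(value: str) -> str:
    """Replace each ${VAR_NAME} with its environment value, leaving unknown vars intact."""
    return re.sub(
        r"\$\{(\w+)\}",
        lambda m: os.environ.get(m.group(1), m.group(0)),
        value,
    )


os.environ["AGENT_API_KEY"] = "sk-demo"
print(resolve_env_vars("Bearer ${AGENT_API_KEY}"))  # prints "Bearer sk-demo"
```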

Evaluation Metrics

Built-in metrics

| Metric | Type | Scale | What it measures |
|---|---|---|---|
| Helpfulness | Quantitative | 1-5 | How effectively the agent addresses user needs |
| Coherence | Quantitative | 1-5 | Logical flow and consistency of responses |
| Relevance | Quantitative | 1-5 | How on-topic the agent's responses are |
| Faithfulness | Quantitative | 1-5 | Accuracy against provided knowledge (penalizes contradictions only) |
| Verbosity | Quantitative | 1-5 | Whether response length is appropriate |
| Goal Completion | Quantitative | 0/1 | Whether the user's stated goal was achieved |
| Agent Behavior Failure | Qualitative | Category | Classifies errors: false information, disobeying requests, repetition, lack of specificity, failure to clarify |

Custom metrics

Define quantitative metrics (numeric scores) by subclassing QuantitativeMetric:

from arksim.evaluator import QuantitativeMetric, QuantResult, ScoreInput

class ToneMetric(QuantitativeMetric):
    def __init__(self):
        super().__init__(
            name="tone_appropriateness",
            score_range=(0, 5),
            description="Evaluates whether the agent uses an appropriate tone",
        )

    def score(self, score_input: ScoreInput) -> QuantResult:
        # Access: score_input.chat_history, score_input.knowledge,
        #         score_input.user_goal, score_input.profile
        return QuantResult(
            name=self.name,
            value=4.0,
            reason="Agent maintained professional tone throughout",
        )

Define qualitative metrics (categorical labels) by subclassing QualitativeMetric:

from arksim.evaluator import QualitativeMetric, QualResult, ScoreInput

class SafetyCheckMetric(QualitativeMetric):
    def __init__(self):
        super().__init__(
            name="safety_check",
            description="Flags whether the agent produced unsafe content",
        )

    def evaluate(self, score_input: ScoreInput) -> QualResult:
        # Access: score_input.chat_history, score_input.knowledge,
        #         score_input.user_goal, score_input.profile
        return QualResult(
            name=self.name,
            value="safe",  # categorical label
            reason="No unsafe content detected",
        )

Add to your config:

custom_metrics_file_paths:
  - ./my_metrics.py

See the bank-insurance example for a full implementation with LLM-as-judge custom metrics.
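An LLM-as-judge metric boils down to prompting an evaluator model inside `score()` and converting its free-text reply into a `QuantResult` value. The judge call itself depends on your provider; the reply-parsing step can be sketched framework-free (illustrative only, not code from the bank-insurance example):

```python
import re


def parse_judge_score(reply: str, low: float = 0.0, high: float = 5.0) -> float:
    """Pull the first number out of a judge LLM's reply and clamp it to the score range."""
    match = re.search(r"-?\d+(?:\.\d+)?", reply)
    if match is None:
        raise ValueError(f"no numeric score in judge reply: {reply!r}")
    return min(max(float(match.group()), low), high)


print(parse_judge_score("Score: 4. The tone was professional throughout."))  # 4.0
```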

Configuration Reference

All settings can be specified in YAML and overridden via CLI flags (--key value).

Simulation settings

| Setting | Type | Default | Description |
|---|---|---|---|
| agent_config | object | required | Inline agent config (agent_type, agent_name, api_config) |
| scenario_file_path | string | required | Path to scenarios JSON |
| model | string | gpt-5.1 | LLM model for simulated users |
| provider | string | openai | LLM provider: openai, claude, gemini |
| num_conversations_per_scenario | int | 5 | Conversations to generate per scenario |
| max_turns | int | 5 | Maximum turns per conversation |
| num_workers | int/string | auto | Parallel workers |
| output_file_path | string | ./simulation.json | Where to save simulation results |
| simulated_user_prompt_template | string | null | Custom Jinja2 template for simulated user prompt |
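A hypothetical sketch of a `simulated_user_prompt_template` — the template variables shown are assumptions based on the scenario fields, not names confirmed by the docs:

```yaml
simulated_user_prompt_template: |
  You are role-playing a user talking to a support agent.
  Profile: {{ profile }}
  Goal: {{ goal }}
  Stay in character, and end the conversation once your goal is met.
```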

Evaluation settings

| Setting | Type | Default | Description |
|---|---|---|---|
| simulation_file_path | string | required | Path to simulation output |
| output_dir | string | required | Directory for evaluation results |
| model | string | gpt-5.1 | LLM model for evaluation |
| provider | string | openai | LLM provider |
| metrics_to_run | list | all metrics | Which metrics to run |
| custom_metrics_file_paths | list | [] | Paths to custom metric files |
| generate_html_report | bool | true | Generate an HTML report |
| score_threshold | float | null | Fail (exit 1) if any conversation scores below this |
| num_workers | int/string | auto | Parallel workers |

CLI Reference

arksim simulate <config.yaml>           Run agent simulations
arksim evaluate <config.yaml>           Evaluate simulation results
arksim simulate-evaluate <config.yaml>  Simulate then evaluate
arksim show-prompts [--category NAME]   Display evaluation prompts
arksim ui [--port PORT]                 Launch web UI (default: 8080)

Any config setting can be passed as a CLI flag:

arksim simulate config.yaml --max-turns 10 --num-workers 4 --verbose
arksim evaluate config.yaml --score-threshold 0.7
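Because `--score-threshold` makes `arksim evaluate` exit non-zero on regressions, it slots naturally into CI. A hypothetical GitHub Actions step (step name and secret name are illustrative):

```yaml
- name: Evaluate agent quality
  run: arksim simulate-evaluate config.yaml --score-threshold 0.7
  env:
    OPENAI_API_KEY: ${{ secrets.OPENAI_API_KEY }}
```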

Web UI

arksim ui

Opens a local web app at http://localhost:8080 where you can browse config files, run simulations with live log streaming, launch evaluations, and view interactive HTML reports.

Note: Provider credentials (e.g. OPENAI_API_KEY) must be set as environment variables before launching.

Examples

| Example | Description |
|---|---|
| bank-insurance | Financial services agent with custom compliance metrics, adversarial scenarios, and a Chat Completions server |
| e-commerce | E-commerce product recommendation agent with custom metrics |
| openclaw | Integration with the OpenClaw agent framework |

Development

git clone https://github.com/arklexai/arksim.git
cd arksim
pip install -e ".[dev]"
pytest tests/

Linting and formatting:

ruff check .
ruff format .

See CONTRIBUTING.md for guidelines.

License

Apache-2.0. See LICENSE.
