# Arksim

Open-source framework for simulating and evaluating conversational AI agents

Documentation · Examples · Report a Bug

Demo video coming soon

## What is Arksim?
Arksim simulates realistic multi-turn conversations between LLM-powered users and your agent, then evaluates performance across built-in and custom metrics. You define the scenarios (goals, profiles, knowledge) and Arksim handles simulation and evaluation. Works with any agent that exposes a Chat Completions API or A2A protocol endpoint.
## Why Arksim?
- Realistic simulations: LLM-powered users with distinct profiles, goals, and personality traits
- Comprehensive evaluation: 7 built-in metrics covering helpfulness, coherence, faithfulness, goal completion, and more
- Custom metrics: Define your own quantitative and qualitative metrics with full access to conversation context
- Error detection: Automatically categorize agent failures (false information, disobeying requests, repetition) with severity levels
- Protocol-agnostic: Works with Chat Completions API, A2A protocol, or any HTTP endpoint
- Multi-provider: Use OpenAI, Anthropic Claude, or Google Gemini as the evaluation LLM
- Parallel execution: Configurable concurrency for both simulation and evaluation
- Visual reports: Interactive HTML reports with score breakdowns, error analysis, and full conversation viewer
## Quickstart

### Install

```shell
pip install arksim
```

For additional LLM providers:

```shell
pip install arksim[all]        # All providers
pip install arksim[anthropic]  # Anthropic Claude only
pip install arksim[gemini]     # Google Gemini only
```
### Set up credentials

```shell
export OPENAI_API_KEY="your-key"
```
### Create a config

```yaml
# config.yaml
agent_config:
  agent_type: chat_completions
  agent_name: my-agent
  api_config:
    endpoint: https://api.openai.com/v1/chat/completions
    headers:
      Content-Type: application/json
      Authorization: "Bearer ${OPENAI_API_KEY}"
    body:
      model: gpt-5.1
      messages:
        - role: system
          content: "You are a helpful assistant."

scenario_file_path: ./scenarios.json
model: gpt-5.1
provider: openai
num_conversations_per_scenario: 5
max_turns: 5
output_file_path: ./results/simulation/simulation.json

output_dir: ./results/evaluation
generate_html_report: true
```
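The config above points `scenario_file_path` at a scenarios JSON file. The snippet below is a hypothetical sketch only: the field names (`user_goal`, `profile`, `knowledge`) mirror the fields the evaluator later exposes via `ScoreInput`, but the authoritative schema is the one used in the bundled examples.

```json
[
  {
    "name": "refund_request",
    "user_goal": "Get a refund for a damaged order",
    "profile": "Polite but impatient first-time customer",
    "knowledge": "Order #1234 arrived damaged on March 3"
  }
]
```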
### Run

```shell
# Simulate conversations, then evaluate
arksim simulate-evaluate config.yaml

# Or run each step separately
arksim simulate config.yaml
arksim evaluate config.yaml
```
### View results

Open the generated HTML report in `./results/evaluation/`, or launch the web UI:

```shell
arksim ui
```
## Agent Configuration
Agent configuration tells Arksim how to connect to your agent. It is specified directly in your YAML config file. Arksim supports two protocols:
### Chat Completions API

```yaml
agent_config:
  agent_type: chat_completions
  agent_name: my-agent
  api_config:
    endpoint: http://localhost:8888/chat/completions
    headers:
      Content-Type: application/json
      Authorization: "Bearer ${AGENT_API_KEY}"
    body:
      messages:
        - role: system
          content: "You are a helpful assistant."
```
### A2A (Agent-to-Agent) Protocol

```yaml
agent_config:
  agent_type: a2a
  agent_name: my-agent
  api_config:
    endpoint: http://localhost:9000/agent
```
Environment variables in headers are resolved at runtime using `${VAR_NAME}` syntax.
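The substitution behavior can be sketched in Python. The function below is illustrative, not Arksim's actual resolver; in particular, it assumes an unset variable raises `KeyError`, which may differ from Arksim's real handling.

```python
import os
import re


def resolve_env_vars(value: str) -> str:
    """Replace ${VAR_NAME} placeholders with values from the environment.

    Illustrative sketch of the runtime substitution described above;
    Arksim's real resolver may treat missing variables differently
    (here an unset variable raises KeyError).
    """
    return re.sub(r"\$\{(\w+)\}", lambda m: os.environ[m.group(1)], value)


os.environ["AGENT_API_KEY"] = "sk-test"
print(resolve_env_vars("Bearer ${AGENT_API_KEY}"))  # Bearer sk-test
```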
## Evaluation Metrics

### Built-in metrics
| Metric | Type | Scale | What it measures |
|---|---|---|---|
| Helpfulness | Quantitative | 1-5 | How effectively the agent addresses user needs |
| Coherence | Quantitative | 1-5 | Logical flow and consistency of responses |
| Relevance | Quantitative | 1-5 | How on-topic the agent's responses are |
| Faithfulness | Quantitative | 1-5 | Accuracy against provided knowledge (penalizes contradictions only) |
| Verbosity | Quantitative | 1-5 | Whether response length is appropriate |
| Goal Completion | Quantitative | 0/1 | Whether the user's stated goal was achieved |
| Agent Behavior Failure | Qualitative | Category | Classifies errors: false information, disobeying requests, repetition, lack of specificity, failure to clarify |
### Custom metrics

Define quantitative metrics (numeric scores) by subclassing `QuantitativeMetric`:
```python
from arksim.evaluator import QuantitativeMetric, QuantResult, ScoreInput


class ToneMetric(QuantitativeMetric):
    def __init__(self):
        super().__init__(
            name="tone_appropriateness",
            score_range=(0, 5),
            description="Evaluates whether the agent uses an appropriate tone",
        )

    def score(self, score_input: ScoreInput) -> QuantResult:
        # Access: score_input.chat_history, score_input.knowledge,
        # score_input.user_goal, score_input.profile
        return QuantResult(
            name=self.name,
            value=4.0,
            reason="Agent maintained professional tone throughout",
        )
```
Define qualitative metrics (categorical labels) by subclassing `QualitativeMetric`:
```python
from arksim.evaluator import QualitativeMetric, QualResult, ScoreInput


class SafetyCheckMetric(QualitativeMetric):
    def __init__(self):
        super().__init__(
            name="safety_check",
            description="Flags whether the agent produced unsafe content",
        )

    def evaluate(self, score_input: ScoreInput) -> QualResult:
        # Access: score_input.chat_history, score_input.knowledge,
        # score_input.user_goal, score_input.profile
        return QualResult(
            name=self.name,
            value="safe",  # categorical label
            reason="No unsafe content detected",
        )
```
Add to your config:

```yaml
custom_metrics_file_paths:
  - ./my_metrics.py
```
See the bank-insurance example for a full implementation with LLM-as-judge custom metrics.
## Configuration Reference

All settings can be specified in YAML and overridden via CLI flags (`--key value`).
### Simulation settings

| Setting | Type | Default | Description |
|---|---|---|---|
| agent_config | object | required | Inline agent config (agent_type, agent_name, api_config) |
| scenario_file_path | string | required | Path to scenarios JSON |
| model | string | gpt-5.1 | LLM model for simulated users |
| provider | string | openai | LLM provider: openai, claude, gemini |
| num_conversations_per_scenario | int | 5 | Conversations to generate per scenario |
| max_turns | int | 5 | Maximum turns per conversation |
| num_workers | int/string | auto | Parallel workers |
| output_file_path | string | ./simulation.json | Where to save simulation results |
| simulated_user_prompt_template | string | null | Custom Jinja2 template for simulated user prompt |
### Evaluation settings

| Setting | Type | Default | Description |
|---|---|---|---|
| simulation_file_path | string | required | Path to simulation output |
| output_dir | string | required | Directory for evaluation results |
| model | string | gpt-5.1 | LLM model for evaluation |
| provider | string | openai | LLM provider |
| metrics_to_run | list | all metrics | Which metrics to run |
| custom_metrics_file_paths | list | [] | Paths to custom metric files |
| generate_html_report | bool | true | Generate an HTML report |
| score_threshold | float | null | Fail (exit 1) if any conversation scores below this |
| num_workers | int/string | auto | Parallel workers |
## CLI Reference

```shell
arksim simulate <config.yaml>           # Run agent simulations
arksim evaluate <config.yaml>           # Evaluate simulation results
arksim simulate-evaluate <config.yaml>  # Simulate then evaluate
arksim show-prompts [--category NAME]   # Display evaluation prompts
arksim ui [--port PORT]                 # Launch web UI (default: 8080)
```
Any config setting can be passed as a CLI flag:

```shell
arksim simulate config.yaml --max-turns 10 --num-workers 4 --verbose
arksim evaluate config.yaml --score-threshold 0.7
```
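Because `--score-threshold` makes `arksim evaluate` exit with status 1 when any conversation scores below the threshold, the command works as a CI quality gate. The sketch below illustrates the implied gating logic; it is not Arksim's actual aggregation code.

```python
def gate(conversation_scores: list[float], threshold: float) -> int:
    """Return the exit code implied by --score-threshold.

    Illustrative only: exit 1 if any conversation falls below the
    threshold, exit 0 otherwise.
    """
    return 1 if any(score < threshold for score in conversation_scores) else 0


print(gate([0.9, 0.8], 0.7))  # 0 -> CI passes
print(gate([0.9, 0.5], 0.7))  # 1 -> CI fails
```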
## Web UI

```shell
arksim ui
```

Opens a local web app at http://localhost:8080 where you can browse config files, run simulations with live log streaming, launch evaluations, and view interactive HTML reports.

Note: Provider credentials (e.g. `OPENAI_API_KEY`) must be set as environment variables before launching.
## Examples
| Example | Description |
|---|---|
| bank-insurance | Financial services agent with custom compliance metrics, adversarial scenarios, and a Chat Completions server |
| e-commerce | E-commerce product recommendation agent with custom metrics |
| openclaw | Integration with the OpenClaw agent framework |
## Development

```shell
git clone https://github.com/arklexai/arksim.git
cd arksim
pip install -e ".[dev]"
pytest tests/
```

Linting and formatting:

```shell
ruff check .
ruff format .
```
See CONTRIBUTING.md for guidelines.
## License
Apache-2.0. See LICENSE.