⛵️ Know how your agent performs before it goes live.
⛵️ ArkSim
Simulate multi-turn conversations with your AI agent. Find failures before production.
Documentation · Examples · Report a Bug
Demo video: https://github.com/user-attachments/assets/78706f27-cf49-41c1-8019-9dcbb8abc625
What is ArkSim?
Agents fail in ways that only show up mid-conversation. They misinterpret intent three turns in, call the wrong tool, or hallucinate a policy that does not exist. Single-turn testing misses all of this.
ArkSim generates LLM-powered synthetic users that hold realistic multi-turn conversations with your agent. Each user has a distinct profile, goal, and knowledge level. They push back, ask follow-ups, and behave like real users would.
You define scenarios, ArkSim simulates conversations, then evaluates every turn across metrics like helpfulness, faithfulness, and goal completion. The output is an interactive report showing exactly where your agent broke and why.
Quickstart
Have an agent? Test it in a few commands:
```shell
pip install arksim
export OPENAI_API_KEY="your-key"

arksim init
# Edit my_agent.py with your agent logic, then run:
arksim simulate-evaluate config.yaml
```
This generates config.yaml, scenarios.json, and a starter my_agent.py.
For HTTP or A2A agents: arksim init --agent-type chat_completions or arksim init --agent-type a2a.
For Anthropic or Google as the evaluation LLM: pip install "arksim[anthropic]" or pip install "arksim[google]".
Just exploring? Try an example:
```shell
pip install arksim
export OPENAI_API_KEY="your-key"

arksim examples
cd examples/e-commerce
arksim simulate-evaluate config.yaml
```
What you'll see
The report tells you where your agent is strong and where it breaks. You get per-metric scores, categorized failures, and full conversation transcripts so you can read the exact turns where things went wrong.
Test Your Own Agent
Python class (default)
arksim init generates a my_agent.py with a BaseAgent subclass. Replace the execute() body with your agent logic:
```python
from arksim.simulation_engine.agent.base import BaseAgent
from arksim.simulation_engine.tool_types import AgentResponse


class MyAgent(BaseAgent):
    async def get_chat_id(self) -> str:
        return "unique-id"

    async def execute(self, user_query: str, **kwargs: object) -> str | AgentResponse:
        # Replace with your agent logic
        return "agent response"
```
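To make the shape concrete, here is a sketch of a filled-in `execute()`. The keyword-matching logic is a hypothetical stand-in for real agent logic, and `BaseAgent` is stubbed locally so the snippet runs on its own; in a real project you would import it from `arksim.simulation_engine.agent.base` as shown above.

```python
import asyncio
import uuid


# Local stub standing in for arksim's BaseAgent so this sketch is
# self-contained; use the real import in an actual project.
class BaseAgent:
    pass


class MyAgent(BaseAgent):
    """Toy agent with canned replies; swap in real LLM or tool calls."""

    def __init__(self) -> None:
        # One stable conversation id per agent instance.
        self._chat_id = str(uuid.uuid4())

    async def get_chat_id(self) -> str:
        return self._chat_id

    async def execute(self, user_query: str, **kwargs: object) -> str:
        # Hypothetical routing logic standing in for a real agent.
        if "refund" in user_query.lower():
            return "I can help with refunds. What is your order number?"
        return "Could you tell me more about what you need?"


if __name__ == "__main__":
    agent = MyAgent()
    print(asyncio.run(agent.execute("I want a refund")))
```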
Chat Completions endpoint
```yaml
agent_config:
  agent_type: chat_completions
  agent_name: my-agent
  api_config:
    endpoint: http://localhost:8000/v1/chat/completions
```
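If you do not yet have a server to point this config at, the sketch below shows the rough request/response shape of a Chat Completions-compatible endpoint using only the standard library. It is not part of arksim; the echo logic is a placeholder, and it assumes an OpenAI-style `messages`/`choices` payload. A real deployment would use a proper web framework.

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer


class ChatHandler(BaseHTTPRequestHandler):
    def do_POST(self):
        if self.path != "/v1/chat/completions":
            self.send_error(404)
            return
        length = int(self.headers.get("Content-Length", 0))
        body = json.loads(self.rfile.read(length))
        # Placeholder "agent": echo the last user message back.
        last_user = next(
            (m["content"] for m in reversed(body.get("messages", []))
             if m.get("role") == "user"),
            "",
        )
        reply = {
            "object": "chat.completion",
            "choices": [{
                "index": 0,
                "message": {"role": "assistant",
                            "content": f"You said: {last_user}"},
                "finish_reason": "stop",
            }],
        }
        data = json.dumps(reply).encode()
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(data)))
        self.end_headers()
        self.wfile.write(data)

    def log_message(self, *args):
        # Silence per-request logging.
        pass


# To serve: HTTPServer(("localhost", 8000), ChatHandler).serve_forever()
```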
A2A protocol
```yaml
agent_config:
  agent_type: a2a
  agent_name: my-agent
  api_config:
    endpoint: http://localhost:9999/agent
```
A2A agents can also surface tool calls for evaluation via the arksim tool call capture extension. See examples/customer-service/a2a_server/ for a runnable reference server.
Write scenarios that match your agent's domain. See the Scenarios documentation for how to define goals, user profiles, and knowledge.
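As an illustration only: a scenario pairs a user goal with a profile and a knowledge level. The JSON below is hypothetical (the exact schema and field names are defined in the Scenarios documentation), but it shows the kind of detail each synthetic user needs.

```json
[
  {
    "name": "late-delivery-refund",
    "goal": "Get a refund for an order that arrived two weeks late",
    "user_profile": "Frustrated first-time customer who gives terse replies",
    "knowledge": "Knows the order number but not the refund policy"
  }
]
```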
Why ArkSim?
- Simulation, not just evaluation. Most tools score conversations you already have. ArkSim generates them with synthetic users who push back, ask follow-ups, and behave unpredictably.
- Multi-turn by default. Every test is a full conversation, not a single prompt. Context loss, tool misuse, and contradictions only show up across turns.
- Any agent, any framework. Works with 14+ frameworks through Chat Completions, A2A, or direct Python import.
- Runs in CI. Add it as a quality gate on every PR. Exits non-zero when your agent drops below threshold.
- Fully open source. Runs on your infrastructure. Your data never leaves your environment.
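Because the CLI exits non-zero when the agent drops below threshold, the CI quality gate above is a single workflow step. A hypothetical GitHub Actions sketch (job names, versions, and secret names are placeholders; see the CI setup guide for the supported pattern):

```yaml
name: agent-quality-gate
on: [pull_request]
jobs:
  simulate:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with:
          python-version: "3.12"
      - run: pip install arksim
      # Fails the PR if any metric drops below threshold.
      - run: arksim simulate-evaluate config.yaml
        env:
          OPENAI_API_KEY: ${{ secrets.OPENAI_API_KEY }}
```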
Integrations
| Framework | Provider |
|---|---|
| Claude Agent SDK | Anthropic |
| OpenAI Agents SDK | OpenAI |
| Google ADK | Google |
| LangChain | LangChain |
| LangGraph | LangChain |
| CrewAI | CrewAI |
| Dify | Dify |
| AutoGen | Microsoft |
| LlamaIndex | LlamaIndex |
| Pydantic AI | Pydantic |
| Rasa | Rasa |
| Smolagents | Hugging Face |
| Mastra | Mastra (TypeScript) |
| Vercel AI SDK | Vercel (TypeScript) |
See examples for end-to-end projects with custom metrics and scenarios.
Learn More
| Topic | Guide |
|---|---|
| Evaluation metrics (built-in and custom) | Metrics guide |
| CI integration (pytest and GitHub Actions) | CI setup guide |
| Configuration reference (all YAML settings) | Schema reference |
| Simulation and CLI usage | Simulation guide |
| Web UI for browsing results | Overview |
Development
```shell
git clone https://github.com/arklexai/arksim.git
cd arksim
pip install -e ".[dev]"
pytest tests/
```
Linting and formatting:
```shell
ruff check .
ruff format .
```
See CONTRIBUTING.md for guidelines.
License
Apache-2.0. See LICENSE.
Citation
```bibtex
@misc{shea2026sage,
  title={SAGE: A Top-Down Bottom-Up Knowledge-Grounded User Simulator for Multi-turn AGent Evaluation},
  author={Ryan Shea and Yunan Lu and Liang Qiu and Zhou Yu},
  year={2026},
  eprint={2510.11997},
  archivePrefix={arXiv},
  primaryClass={cs.CL},
  url={https://arxiv.org/abs/2510.11997},
}
```