Ashr Labs Python SDK
A Python client library for evaluating AI agents against Ashr Labs test datasets.
Documentation
- Testing Your Agent — start here (includes debugging failures with transcripts and classification)
- Quick Start Guide
- Installation
- Authentication
- API Reference
- Error Handling
- Examples
Installation
pip install ashr-labs
Quick Start
from ashr_labs import AshrLabsClient, EvalRunner
# Only need your API key — base_url and tenant_id are automatic
client = AshrLabsClient(api_key="tp_your_api_key_here")
# Fetch a dataset and run your agent against it
runner = EvalRunner.from_dataset(client, dataset_id=42)
run = runner.run(my_agent)
# Submit results — grading happens server-side
created = run.deploy(client, dataset_id=42)
# Wait for grading to complete (typically 1-3 minutes)
graded = client.poll_run(created["id"])
metrics = graded["result"]["aggregate_metrics"]
print(f"Passed: {metrics['tests_passed']}/{metrics['total_tests']}")
Your agent just needs two methods:
class MyAgent:
    def respond(self, message: str) -> dict:
        # Call your LLM, return {"text": "...", "tool_calls": [...]}
        return {"text": "response", "tool_calls": []}

    def reset(self) -> None:
        # Clear conversation history between scenarios
        pass
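For example, a minimal in-memory agent that satisfies this two-method protocol. `EchoAgent` is an illustrative name, not part of the SDK:

```python
class EchoAgent:
    def __init__(self):
        self.history = []

    def respond(self, message: str) -> dict:
        # Record the message and echo it back, calling no tools.
        self.history.append(message)
        return {"text": f"You said: {message}", "tool_calls": []}

    def reset(self) -> None:
        # Clear conversation history between scenarios.
        self.history = []
```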
See Testing Your Agent for a full end-to-end guide.
Agents
Agents group your datasets and define how they should be generated and graded. Create an agent once, then generate consistent datasets for it.
# Create an agent with tool definitions and grading config
agent = client.create_agent(
    name="Support Bot",
    description="Spanish-language healthcare scheduling agent",
    config={
        "tool_definitions": [
            {"name": "fetch_kareo_data", "required": True, "description": "Fetch appointment availability"},
            {"name": "save_data", "required": True, "description": "Persist caller info"},
            {"name": "end_session", "required": False, "description": "Close the conversation"},
        ],
        "behavior_rules": [
            {"rule": "Always fetch before quoting availability", "strictness": "required"},
            {"rule": "Save caller name via save_data", "strictness": "required"},
        ],
        "grading_config": {
            "tool_strictness": {
                "fetch_kareo_data": "required",
                "end_session": "optional",
                "await_user_response": "optional",
            },
        },
    },
)
# Link a dataset to the agent
client.set_dataset_agent(dataset_id=42, agent_id=agent["id"])
# Submit a run and auto-link to agent
run.deploy(client, dataset_id=42, agent_id=agent["id"])
Grading behavior
The grading system uses agent config to make smarter decisions:
- `required` tools: Must be called. If the agent skips a required tool, it's a failure.
- `optional` tools: If the agent achieves the same intent via text (e.g. ends the conversation naturally instead of calling `end_session`), the grader recovers it as a partial match instead of a failure.
- `expected` tools: Should be called, but a miss is a warning, not a failure.
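The grader itself runs server-side; as a rough mental model, the strictness rules can be summarized as a small decision table. The status labels below (`pass`, `fail`, `warning`, `partial`, `miss`) are illustrative assumptions, not the SDK's actual return values:

```python
def grade_tool_call(strictness: str, called: bool, intent_in_text: bool = False) -> str:
    # Illustrative sketch of the documented strictness rules;
    # the real grader is server-side and its labels may differ.
    if called:
        return "pass"
    if strictness == "required":
        return "fail"        # skipping a required tool is a failure
    if strictness == "expected":
        return "warning"     # a miss is a warning, not a failure
    if strictness == "optional" and intent_in_text:
        return "partial"     # intent achieved via text is recovered
    return "miss"
```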
Observability — Production Tracing
Trace your agent in production. Captures LLM calls, tool invocations, and events. Tracing never crashes your agent: if the backend is unreachable, errors are caught and logged instead of raised.
# Context managers (recommended) — auto-end on exit, auto-capture errors
with client.trace("handle-ticket", user_id="user_42") as trace:
    with trace.generation(
        "classify",
        model="claude-sonnet-4-6",
        input=[{"role": "user", "content": "help"}],
    ) as gen:
        result = call_llm(...)
        gen.end(output=result, usage={"input_tokens": 50, "output_tokens": 12})

    with trace.span("tool:search", input={"q": "..."}) as tool:
        data = search(...)
        tool.end(output=data)
# Analytics
analytics = client.get_observability_analytics(days=7)
print(f"Traces: {analytics['overview']['total_traces']}")
print(f"Tool calls: {analytics['overview']['total_tool_calls']}")
See API Reference for full Trace/Span/Generation docs.
VM Stream Logs
Attach virtual machine session logs to test results for browser-based or desktop-based agents:
test = run.add_test("checkout_flow")
test.start()
# ... run agent, add tool calls and responses ...
# Kernel browser session (first-class support)
test.set_kernel_vm(
    session_id="kern_sess_abc123",
    duration_ms=15000,
    logs=[
        {"ts": 0, "type": "navigation", "data": {"url": "https://app.example.com"}},
        {"ts": 1200, "type": "action", "data": {"action": "click", "selector": "#login"}},
    ],
    replay_id="replay_abc123",
    replay_view_url="https://www.kernel.sh/replays/replay_abc123",
    stealth=True,
    viewport={"width": 1920, "height": 1080},
)

# Or use the generic set_vm_stream() for any provider
test.set_vm_stream(
    provider="browserbase",
    session_id="sess_abc123",
    duration_ms=45000,
    logs=[
        {"ts": 0, "type": "navigation", "data": {"url": "https://app.example.com"}},
        {"ts": 1200, "type": "action", "data": {"action": "click", "selector": "#login"}},
    ],
)
test.complete()
Available Methods
All methods that accept tenant_id auto-resolve it from your API key if omitted.
Agents
| Method | Description |
|---|---|
| `list_agents()` | List all agents with dataset counts |
| `create_agent(name, description, config)` | Create a new agent |
| `update_agent(agent_id, name, description, config)` | Update an agent |
| `delete_agent(agent_id)` | Soft-delete an agent |
| `get_agent_datasets(agent_id)` | Get datasets linked to an agent |
| `set_dataset_agent(dataset_id, agent_id)` | Link or unlink a dataset and an agent |
Datasets
| Method | Description |
|---|---|
| `get_dataset(dataset_id, ...)` | Get a dataset by ID |
| `list_datasets(limit, cursor, ...)` | List datasets (cursor-based pagination) |
Runs
| Method | Description |
|---|---|
| `create_run(dataset_id, result, ...)` | Create a new test run |
| `get_run(run_id)` | Get a run by ID |
| `list_runs(dataset_id, limit)` | List runs |
| `delete_run(run_id)` | Delete a run |
| `poll_run(run_id, timeout, poll_interval)` | Wait for server-side grading to complete |
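`poll_run` presumably loops on `get_run` until grading reaches a terminal state. A self-contained sketch of that pattern, with a stand-in `fetch_run` callable in place of the SDK and assumed terminal statuses (`graded`, `failed`):

```python
import time

def poll_until_graded(fetch_run, run_id, timeout=180.0, poll_interval=2.0):
    # Fetch the run repeatedly until its status is terminal or the
    # timeout elapses. Status names here are assumptions.
    deadline = time.monotonic() + timeout
    while True:
        run = fetch_run(run_id)
        if run.get("status") in ("graded", "failed"):
            return run
        if time.monotonic() >= deadline:
            raise TimeoutError(f"run {run_id} not graded within {timeout}s")
        time.sleep(poll_interval)
```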
EvalRunner
| Method | Description |
|---|---|
| `EvalRunner.from_dataset(client, dataset_id)` | Create a runner from a dataset |
| `runner.run(agent, max_workers=1, on_environment=...)` | Run the agent against all scenarios; returns a RunBuilder |
| `runner.run_and_deploy(agent, client, dataset_id, max_workers=1)` | Run and submit in one call |
RunBuilder
| Method | Description |
|---|---|
| `RunBuilder()` | Create a new run builder |
| `run.start()` | Mark the run as started |
| `run.add_test(test_id)` | Add a test and get a TestBuilder |
| `run.complete(status)` | Mark the run as completed |
| `run.build()` | Serialize to a result dict |
| `run.deploy(client, dataset_id, agent_id)` | Build and submit via the API |
TestBuilder
| Method | Description |
|---|---|
| `test.start()` | Mark the test as started |
| `test.add_user_file(file_path, description)` | Record a user file upload |
| `test.add_user_text(text, description)` | Record a user text input |
| `test.add_tool_call(expected, actual, match_status)` | Record an agent tool call |
| `test.add_agent_response(expected_response, actual_response, match_status)` | Record an agent response |
| `test.set_vm_stream(provider, session_id, logs, ...)` | Attach VM session logs |
| `test.set_kernel_vm(session_id, ...)` | Attach a Kernel VM session (convenience) |
| `test.complete(status)` | Mark the test as completed |
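The builders ultimately serialize to a plain result dict. The exact schema is defined by the SDK, so every field name below is an assumption; this sketch only illustrates the shape of the run → test → build flow using bare dicts:

```python
def build_result(tests):
    # Hypothetical result shape: per-test records plus aggregate counts.
    return {
        "status": "completed",
        "tests": tests,
        "aggregate": {
            "total_tests": len(tests),
            "tests_passed": sum(1 for t in tests if t["status"] == "passed"),
        },
    }

# One test record, mirroring what add_test/add_tool_call would collect.
test = {
    "test_id": "checkout_flow",
    "status": "passed",
    "tool_calls": [
        {"expected": "save_data", "actual": "save_data", "match_status": "match"},
    ],
}
result = build_result([test])
```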
Requests
| Method | Description |
|---|---|
| `create_request(request_name, request, ...)` | Create a new request |
| `get_request(request_id)` | Get a request by ID |
| `list_requests(status, limit, cursor)` | List requests |
Observability
| Method | Description |
|---|---|
| `client.trace(name, ...)` | Start a production trace (returns a Trace) |
| `trace.span(name, ...)` / `trace.generation(name, ...)` | Add spans or LLM calls |
| `trace.end(output=...)` | Flush the trace to the backend (never raises) |
| `list_observability_traces(user_id, session_id, ...)` | List traces |
| `get_observability_trace(trace_id)` | Get a trace with its full observation tree |
| `get_observability_analytics(days)` | Analytics: tokens, latency, errors, tool performance |
| `get_observability_errors(days, limit, page)` | Traces with errors |
| `get_observability_tool_errors(days, limit, page)` | Traces with tool failures |
API Keys & Session
| Method | Description |
|---|---|
| `init()` | Validate credentials and get user/tenant info |
| `list_api_keys(include_inactive)` | List API keys for your tenant |
| `revoke_api_key(api_key_id)` | Revoke an API key |
| `health_check()` | Check if the API is reachable |
Error Handling
from ashr_labs import AshrLabsClient, NotFoundError, AuthenticationError
client = AshrLabsClient(api_key="tp_...")
try:
    dataset = client.get_dataset(dataset_id=999)
except AuthenticationError:
    print("Invalid API key")
except NotFoundError:
    print("Dataset not found")
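For transient network failures, a generic retry wrapper can be layered on top of any client call. Nothing below is part of the SDK, and which exceptions the client raises on network errors is an assumption (`ConnectionError`/`TimeoutError` are used here as stand-ins):

```python
import time

def with_retries(call, attempts=3, backoff=0.5, retry_on=(ConnectionError, TimeoutError)):
    # Retry a zero-argument callable with exponential backoff on the
    # given transient exception types; re-raise after the last attempt.
    for attempt in range(attempts):
        try:
            return call()
        except retry_on:
            if attempt == attempts - 1:
                raise
            time.sleep(backoff * (2 ** attempt))
```

Usage would look like `dataset = with_retries(lambda: client.get_dataset(dataset_id=42))`.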
Configuration
# All defaults — just pass API key
client = AshrLabsClient(api_key="tp_...")
# From environment (reads ASHR_LABS_API_KEY)
client = AshrLabsClient.from_env()
# Custom timeout
client = AshrLabsClient(api_key="tp_...", timeout=60)
# Custom base URL (for self-hosted)
client = AshrLabsClient(api_key="tp_...", base_url="https://your-api.example.com")
Requirements
- Python 3.10+
- No external dependencies (uses only standard library)
License
MIT