AI agent reliability testing platform
AI agent reliability testing infrastructure. Stress test your agents against 8 attack categories—prompt injection, tool abuse, jailbreaks, data exfiltration, and more—before they hit production.
Observability, evaluation, and adversarial testing platform for LLM agents.
Ship reliable, secure, and debuggable agentic systems.
Installation
Install the official Python SDK via pip:
pip install watchllm
Authentication
WatchLLM requires an API key to securely log your simulations to your dashboard.
Option A: Interactive CLI Login (Recommended for Local Dev)
watchllm auth login
This will prompt you for your API key and securely save it to ~/.watchllm/config.
Option B: Environment Variables (Recommended for CI/CD)
Set the following environment variable in your pipeline or .env file:
export WATCHLLM_API_KEY="your_api_key_here"
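In a CI job it can help to fail fast when the key is missing, before any simulation starts. The sketch below reads only the WATCHLLM_API_KEY variable named above; the helper name require_api_key is hypothetical and is not part of the WatchLLM SDK.

```python
import os


def require_api_key(env=None) -> str:
    """Return the WatchLLM API key from the environment or raise a clear error.

    NOTE: require_api_key is a hypothetical helper, not part of the watchllm SDK.
    """
    # Allow a plain dict to be passed in so the check is easy to unit test
    env = os.environ if env is None else env
    key = env.get("WATCHLLM_API_KEY", "").strip()
    if not key:
        raise RuntimeError(
            "WATCHLLM_API_KEY is not set; export it before running simulations."
        )
    return key
```

Calling require_api_key() at the top of a test session surfaces a missing secret as one readable error rather than a failure deeper in the run.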
Running Simulations via CLI
The fastest way to test an agent is using the WatchLLM CLI.
# Run a specific attack category
watchllm simulate --agent my_module.my_agent --categories prompt_injection
# Run multiple categories simultaneously
watchllm simulate --agent my_module.my_agent --categories prompt_injection,tool_abuse,hallucination
# Run all 8 attack categories
watchllm simulate --agent my_module.my_agent --categories all
Note: Replace my_module.my_agent with the Python import path to your agent function. Your agent function must accept a string and return a string.
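As a minimal sketch of the string-in, string-out contract described in the note, a file named my_module.py could contain the function below. The module name, function name, and echo logic are placeholders chosen to match the CLI examples; a real agent would call a model or tool chain instead.

```python
# my_module.py (hypothetical module matching the --agent path in the examples)


def my_agent(user_input: str) -> str:
    """Toy agent: accepts a string and returns a string."""
    # A real agent would call an LLM here; this placeholder only echoes the input
    cleaned = user_input.strip()
    return f"You said: {cleaned}"
```

With this file on the import path, my_module.my_agent is the value passed to --agent.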
Native Python Integration (CI/CD)
For automated testing inside your test suite or CI/CD pipelines, wrap your agent with the @watchllm.test decorator. This exposes your agent's signature to the WatchLLM runner.
import watchllm

@watchllm.test(
    categories=["prompt_injection", "data_exfiltration"],
    threshold=0.3,  # Raises WatchLLMThresholdError if severity >= 0.3
)
def my_agent(user_input: str) -> str:
    # Your agent logic here; the echo below is only a placeholder
    response = f"Echo: {user_input}"
    return response
If the measured severity reaches the threshold during the test, the SDK raises WatchLLMThresholdError, which fails the pipeline with a non-zero exit code.
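The threshold comment in the decorator example implies an inclusive comparison: a severity equal to the threshold already counts as a failure. The sketch below illustrates that rule with a local stand-in exception. ThresholdExceeded and check_severity are hypothetical names for illustration only, not the SDK's WatchLLMThresholdError or its internal logic.

```python
class ThresholdExceeded(Exception):
    """Local stand-in for the SDK's WatchLLMThresholdError (not the real class)."""


def check_severity(severity: float, threshold: float) -> None:
    """Raise when severity reaches the threshold (inclusive, i.e. >=)."""
    if severity >= threshold:
        raise ThresholdExceeded(
            f"severity {severity:.2f} >= threshold {threshold:.2f}"
        )


check_severity(0.10, 0.3)  # below the threshold: passes silently

try:
    check_severity(0.90, 0.3)  # at or above the threshold: raises
except ThresholdExceeded as exc:
    print(f"FAILED: {exc}")
```

Under this reading, a lower threshold makes the gate stricter, since more severity scores reach it.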
Analyzing Results
Once a simulation completes, the CLI prints the final state of the run:
Simulation ID: sim_abc123
Status: completed
Severity: 0.90
Verdict: Critical vulnerability detected
To view the complete execution graph, scrub through the node lineage, and replay specific forks, visit the dashboard: https://dashboard.watchllm.dev
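If a script needs the summary fields, the printed lines can be parsed as plain "Key: value" pairs. This is a sketch that assumes the output keeps exactly the shape shown above; parse_summary is a hypothetical helper, and the CLI may offer a better machine-readable option that this README does not document.

```python
def parse_summary(text: str) -> dict:
    """Parse 'Key: value' lines from the CLI summary into a dict."""
    summary = {}
    for line in text.splitlines():
        key, sep, value = line.partition(":")
        if not sep:
            continue  # skip lines that are not 'Key: value' pairs
        # Normalize keys, e.g. 'Simulation ID' -> 'simulation_id'
        summary[key.strip().lower().replace(" ", "_")] = value.strip()
    if "severity" in summary:
        summary["severity"] = float(summary["severity"])
    return summary


sample = """Simulation ID: sim_abc123
Status: completed
Severity: 0.90
Verdict: Critical vulnerability detected"""

result = parse_summary(sample)
print(result["simulation_id"], result["severity"])
```

The severity value is converted to a float so it can be compared against a numeric threshold in later checks.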
Command Reference
| Command | Description |
|---|---|
| watchllm doctor | Verifies your setup, API key validity, and agent reachability. |
| watchllm simulate | Launches a new attack simulation. |
| watchllm replay | Prints a text-based tree of the execution graph directly in your terminal. |
| watchllm status | Returns the current completion percentage and live severity scores of a running simulation. |
Requirements
- Python 3.9 or higher
- The requests library (installed automatically)