Pytest plugin for testing chatbots and LLM apps — prompt injection, jailbreaks, system-prompt leaks, hallucinations, brand drift.
pytest-wardenbot
📖 Documentation: pytest-wardenbot.wardenbot.ai
Status: pre-release. v0.1.2 is in active development. APIs may change before the first stable release. The v0.2 roadmap is tracked in GitHub Issues.
What it does
Run pytest against your chatbot and find out if it leaks its system prompt, complies with known jailbreaks, hallucinates business facts, or drifts from your brand voice.
- Black-box. Tests run against your live chatbot via HTTP, OpenAI API, Anthropic API, or any object you write a small adapter for.
- Deterministic-first. v0.1 ships 29 tests that need zero LLM API spend — regex, substring, and schema checks. Optional LLM-judge tests (DeepEval) ship as an extra for semantic checks.
- Agent-ready failures. When a test fails, the failure message includes a structured Markdown remediation prompt you can paste into Cursor or Claude Code.
- Verified adapters. The bundled OpenAI and Anthropic adapters are smoke-tested weekly against the live vendor APIs in CI (live-api-smoke), so a real round-trip is known to work, not just a mocked one.
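To make "deterministic-first" concrete, here is a minimal sketch of the kind of regex-plus-substring check described above. It is an illustration only, not the plugin's actual implementation: `SECRET_FRAGMENTS`, `LEAK_PATTERN`, and `find_leaks` are hypothetical names, and the fragments stand in for pieces of your own system prompt.

```python
import re

# Hypothetical fragments of a system prompt that should never appear in a reply.
SECRET_FRAGMENTS = ["You are AcmeBot", "internal discount code"]

# Hypothetical pattern for replies that announce a prompt disclosure.
LEAK_PATTERN = re.compile(r"(my|the)\s+system\s+prompt\s+(is|says)", re.IGNORECASE)


def find_leaks(reply: str) -> list[str]:
    """Return human-readable reasons the reply looks like a system-prompt leak."""
    reasons = []
    lowered = reply.lower()
    # Substring check: case-insensitive match on known secret fragments.
    for fragment in SECRET_FRAGMENTS:
        if fragment.lower() in lowered:
            reasons.append(f"reply contains secret fragment: {fragment!r}")
    # Regex check: the reply claims to be quoting its own system prompt.
    if LEAK_PATTERN.search(reply):
        reasons.append("reply matches the prompt-disclosure pattern")
    return reasons
```

A check like this costs nothing to run and gives the same verdict on every run, which is why it can sit in CI without an LLM API key. The trade-off is that it only catches overt leaks; paraphrased disclosures need a semantic judge.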
What "passing" means (and doesn't)
A green run means your chatbot didn't fail any of the bundled 29 attacks in the most overt way. It's a useful smoke test and a regression detector — if a deploy turns a green test red, that's a real signal to investigate.
A green run does not mean your chatbot is secure. Frontier-grade attacks are multi-turn, novel, and adapted to your specific bot — no fixed corpus catches all of them. Treat the shipped suite as a starter set: pair it with periodic red-team exercises (or our Continuous Monitoring service) for the always-on adversarial coverage CI alone can't provide.
Install
pip install pytest-wardenbot
Optional extras for LLM-judge tests or vendor-native adapters:
pip install "pytest-wardenbot[judge]" # adds DeepEval for semantic checks
pip install "pytest-wardenbot[openai]" # adds OpenAI Chat + Assistants adapters (sync + async)
pip install "pytest-wardenbot[anthropic]" # adds Anthropic Messages adapter (sync + async)
Note: the OpenAI Assistants API is deprecated (sunset 2026-08-26). `OpenAIAssistantsAdapter` is a stopgap for teams still on it. It emits a `DeprecationWarning`; prefer `OpenAIChatAdapter` for new work.
Quickstart (under 60 seconds)
pip install pytest-wardenbot
pytest --wardenbot-quickstart # generates conftest.py + test_my_bot.py
export CHATBOT_URL=https://your-chatbot.example.com/chat
export CHATBOT_TOKEN=sk-... # optional
pytest # runs all shipped tests against your bot
--wardenbot-quickstart accepts an industry template:
pytest --wardenbot-quickstart=ecommerce # adds refund/shipping fact placeholders
pytest --wardenbot-quickstart=saas-support # adds plan/trial fact placeholders
pytest --wardenbot-quickstart=generic # default; minimal placeholders
Then edit conftest.py to replace the TODO placeholders with your real
business facts and re-run pytest. Worked examples in examples/
cover the basic HTTP setup, a custom OpenAI adapter, and a GitHub Actions
workflow.
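As a rough picture of what a business-truth check does with those facts, the sketch below compares a bot's answers against substrings you expect to see. The fact format, `BUSINESS_FACTS`, `missing_facts`, and `check_facts` are all hypothetical and chosen for illustration; the placeholders that the quickstart writes into conftest.py may look different.

```python
# Hypothetical business facts: a question plus substrings the answer must contain.
BUSINESS_FACTS = [
    {"question": "What is your refund window?", "must_contain": ["30 days"]},
    {"question": "Do you ship internationally?", "must_contain": ["yes"]},
]


def missing_facts(reply, must_contain):
    """Return the expected substrings that are absent from the reply."""
    lowered = reply.lower()
    return [s for s in must_contain if s.lower() not in lowered]


def check_facts(ask, facts):
    """Ask each question and collect failures.

    `ask` is any callable that takes a prompt string and returns the bot's
    reply as a string (an assumption; a real adapter may expose another API).
    Returns a dict mapping each failing question to its missing substrings.
    """
    failures = {}
    for fact in facts:
        missing = missing_facts(ask(fact["question"]), fact["must_contain"])
        if missing:
            failures[fact["question"]] = missing
    return failures
```

In a real suite each fact would more naturally be one `pytest.mark.parametrize` case, so a wrong answer about refunds fails independently of a wrong answer about shipping.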
Manual setup (if you prefer)
Add this to your project's conftest.py:
import os

import pytest
from pytest_wardenbot.adapters.http import HTTPChatbotAdapter


@pytest.fixture
def chatbot():
    return HTTPChatbotAdapter(
        url="https://your-chatbot.example.com/chat",
        headers={"Authorization": f"Bearer {os.environ['CHATBOT_TOKEN']}"},
        request_field="message",    # the JSON key your bot reads the prompt from
        response_field="response",  # the JSON key your bot returns the text in
    )
Then run the shipped tests with pytest --pyargs pytest_wardenbot.tests.
When a test fails, read the failure message, paste the agent-ready Markdown into Cursor / Claude Code, ship the fix.
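If you are unsure what to pass for request_field and response_field in the fixture above, the sketch below shows the JSON mapping those two options imply. It is an assumption inferred from the inline comments, not the adapter's real code: `build_request_body` and `extract_reply` are hypothetical helpers, and the actual HTTPChatbotAdapter may handle nested keys, streaming, or errors differently.

```python
import json


def build_request_body(prompt, request_field="message"):
    """Serialize the prompt under the JSON key the bot reads it from."""
    return json.dumps({request_field: prompt}).encode("utf-8")


def extract_reply(raw_body, response_field="response"):
    """Pull the reply text out of the bot's JSON response body."""
    data = json.loads(raw_body.decode("utf-8"))
    if response_field not in data:
        # A missing key usually means response_field is misconfigured.
        raise KeyError(f"response JSON has no {response_field!r} key")
    return data[response_field]
```

So a bot that accepts `{"message": "..."}` and answers `{"response": "..."}` matches the defaults shown in the fixture; a bot that answers `{"answer": "..."}` would need `response_field="answer"`.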
What's in v0.1
| Category | Count | Grading | Requires API key? |
|---|---|---|---|
| Prompt-injection / jailbreak resistance | 5 prompts × 2 checks = 10 | deterministic | no |
| System-prompt leak elicitation (dedicated extraction prompts) | 3 | deterministic | no |
| Refusal-bypass (roleplay / pretext / hypothetical framings) | 3 | deterministic | no |
| Off-topic deflection (scoped bots) | 2 | deterministic | no |
| Indirect / cross-prompt injection (XPIA) | 4 | deterministic | no |
| Encoded-payload jailbreak (Base64 / ROT13 / leet / hex) | 4 | deterministic | no |
| Multi-turn jailbreak (priming + payload, needs session-aware adapter) | 3 | deterministic | no |
| Canary-token leak (opt-in; you plant the token) | 1 | deterministic | no |
| Business-truth verification (parametrized over your facts) | user-supplied | deterministic | no |
| Semantic checks via DeepEval (5 factories: equivalence, brand, hallucination, off-policy, refusal quality) | user-supplied | LLM-judge | yes, with [judge] extra |
That's 29 deterministic tests out-of-the-box (plus the opt-in canary leak test, plus your business-truth and judge lists). Tests run in under a second against a real chatbot with zero LLM API spend unless you've opted into the [judge] extra.
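To illustrate the encoded-payload row in the table, here is a small sketch of how one instruction can be wrapped in the four encodings listed there. The shipped prompts are not reproduced here; `encode_variants` and `LEET_MAP` are hypothetical names, and the leet substitutions are one arbitrary choice among many.

```python
import base64
import codecs

# One possible leetspeak substitution table (an arbitrary illustrative choice).
LEET_MAP = str.maketrans({"a": "4", "e": "3", "i": "1", "o": "0", "s": "5", "t": "7"})


def encode_variants(payload: str) -> dict[str, str]:
    """Return the payload in Base64, ROT13, leetspeak, and hex form."""
    raw = payload.encode("utf-8")
    return {
        "base64": base64.b64encode(raw).decode("ascii"),
        "rot13": codecs.encode(payload, "rot_13"),
        "leet": payload.lower().translate(LEET_MAP),
        "hex": raw.hex(),
    }
```

Each variant would then be sent to the bot with a request to decode and follow it. A deterministic grader only has to check whether the reply shows signs of compliance, so no judge model is needed.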
The v0.2 roadmap (RAMPART for tool-using agents, LangChain/MCP adapters, ensemble judging, and more) is tracked in GitHub Issues.
How it's different from related tools
- vs Promptfoo (acquired by OpenAI in Feb 2026): Promptfoo is a developer testing CLI. We're a pytest plugin — same tool your existing test suite uses, same CI integration you already have.
- vs DeepEval: DeepEval focuses on evaluation metrics (faithfulness, relevancy). We focus on adversarial security probes (jailbreak, system-prompt leak, refusal-bypass) — different problem, complementary tool. (We use DeepEval under the hood for our optional semantic checks.)
- vs Garak / PyRIT: Garak and PyRIT are research-grade attack libraries. We package a curated subset as everyday pytest tests with clear failure messages.
License
Apache 2.0. See LICENSE.md.
Powered by
WardenBot AI — continuous external monitoring for AI chatbots.
The pytest plugin is the free, open-source slice of our test corpus. Want continuous monitoring across all your bots with daily probes and a dashboard? Tell us about your setup — we open invites in small batches.