Multi-turn behavioral drift detection for LLMs — tone, sycophancy, refusal sensitivity, persona stability
Project description
PromptPressure
multi-turn behavioral drift detection for LLMs. the things benchmarks don't test.
most eval frameworks measure accuracy on known-answer datasets. PromptPressure measures how models behave over sustained interaction. does the model's tone drift at turn 8? does it cave to sycophancy after 3 rounds of pressure? does persona stability degrade as context fills up?
190 active prompts across 11 behavioral categories, tiered for CI speed. run against any model. get a per-turn behavioral report.
install
pip install promptpressure-evals
distribution name is promptpressure-evals (the promptpressure slot on PyPI is held by an unrelated red-team scanner). import name and CLI entry points are unchanged: import promptpressure, pp, promptpressure.
source install for hacking:
git clone https://github.com/StressTestor/PromptPressure.git
cd PromptPressure
pip install -e .
quick start in 60 seconds
pip install promptpressure-evals
cp .env.example .env # if you cloned the repo; otherwise create one
# add your API keys (see .env.example for which adapters need what)
promptpressure --quick --multi-config configs/config_mock.yaml
--quick runs 3 sequences (~18 turns) in under 10 minutes. results land in outputs/<timestamp>/ with CSVs, metrics JSON, and an HTML report.
for a real eval against a cloud model:
promptpressure --tier full --multi-config configs/config_openrouter_gpt_oss_20b_free.yaml
launcher
one command. three dropdowns. one button.
pip install promptpressure-evals
pp
pp starts the API on 127.0.0.1 (first free port in 8000-8019) and opens a browser. pick a provider, model, and an eval set. hit Run. output streams into the status panel.
v1 runs only the first selected eval set if you check more than one; multi-set support is on the v2 list.
binds 127.0.0.1 only. for remote access, run uvicorn promptpressure.api:app --host 0.0.0.0 with PROMPTPRESSURE_API_SECRET set.
stop with Ctrl-C in the terminal that started pp. the server subprocess gets SIGTERM, then SIGKILL after 5s if it doesn't exit cleanly.
known v1 limitation: if you reload the browser mid-run, the EventSource auto-reconnects to the same run_id and resumes - but only within 5 minutes after the run completes. after that, the run state has been cleaned up. check /evaluations/{run_id} for completed runs.
pp --help and pp --version work as expected.
what it evaluates
| category | count | what it tests |
|---|---|---|
| sycophancy detection | 25 | wrong premises, answer changes when challenged, multi-turn pressure sequences |
| instruction following under conflict | 25 | contradictory instructions, system vs user prompt conflicts |
| tone consistency | 20 | same request in formal/casual/hostile/pleading. does helpfulness change? |
| psychological reasoning | 25 | theory of mind, emotional inference, manipulation detection |
| model drift detection | 20 | calibration questions with known answers, knowledge cutoff awareness |
| persona stability | 15 | system prompt adherence under adversarial messages |
| output format compliance | 15 | JSON output, structured extraction, word count constraints |
| multilingual consistency | 15 | same eval in EN/ES/ZH/AR. does behavior change across languages? |
| context window stress | 15 | long-context faithfulness, needle-in-haystack, buried instructions |
190 active prompts. 30 adversarial refusal sensitivity prompts archived separately. each prompt has expected behavior, grading criteria, and tier/difficulty tags.
how it compares
| feature | PromptPressure | promptfoo | Inspect | lm-eval-harness |
|---|---|---|---|---|
| refusal sensitivity gradient | yes | no | no | no |
| tone-dependent behavior testing | yes | no | no | no |
| sycophancy detection | yes | no | no | no |
| persona stability testing | yes | no | no | no |
| psychological reasoning evals | yes | no | no | no |
| multilingual behavior consistency | yes | partial | no | partial |
| accuracy benchmarks | no | yes | yes | yes |
| custom eval datasets | yes | yes | yes | yes |
| multi-model comparison | yes | yes | yes | yes |
| built-in grading pipeline | yes | yes | yes | no |
PromptPressure is not trying to replace accuracy benchmarks. it tests the behavioral layer that accuracy benchmarks miss.
run tiers
every eval entry is tagged with a tier. tiers are cumulative: --tier quick runs both smoke and quick entries.
| tier | entries | turns | time (fast models) | use case |
|---|---|---|---|---|
smoke |
0* | ~0 | <60s | CI gate (sequences coming in v3.2) |
quick |
3 | ~18 | <10 min | local dev, default |
full |
190 | ~500+ | ~1 hr | pre-release |
deep |
190 | ~500+ | 2+ hrs | quarterly audit (20-turn sequences coming in v3.2) |
*smoke and deep tier sequences are planned for v3.2 when multi-turn content is generated.
promptpressure --quick --multi-config config.yaml # 3 sequences, fast
promptpressure --tier full --multi-config config.yaml # all 190 sequences
promptpressure --smoke --multi-config config.yaml # CI mode (needs smoke-tagged entries)
the default tier is quick. entries without a tier field default to full.
per-turn metrics
multi-turn sequences automatically compute behavioral metrics after each turn:
- response_length_ratio:
len(response) / len(user_message). detects terse/verbose drift across turns. a model that starts with detailed responses and shrinks to one-liners is drifting.
metrics are attached to each turn in the JSON output under turn_responses[].metrics and aggregated at result_data.per_turn_metrics.
archived adversarial suite
30 refusal sensitivity prompts are archived separately at archive/adversarial/refusal_sensitivity.json. these test how models handle requests that could be interpreted as harmful but are actually benign (academic research, creative writing, historical analysis).
archived because hosted API providers may flag or rate-limit accounts running adversarial-adjacent prompts at scale.
run them explicitly:
promptpressure --dataset archive/adversarial/refusal_sensitivity.json --multi-config config.yaml
adapters
| adapter | type | what you need |
|---|---|---|
| LiteLLM | proxy | litellm proxy on localhost:4000 (routes to any provider) |
| Claude Code | CLI | claude CLI installed (subscription) |
| OpenCode Zen | CLI | opencode CLI installed (subscription) |
| OpenRouter | cloud | OPENROUTER_API_KEY |
| Groq | cloud | GROQ_API_KEY |
| OpenAI | cloud | OPENAI_API_KEY |
| Ollama | local | ollama running on localhost |
| LM Studio | local | LM Studio running on localhost |
| Mock | test | nothing. synthetic responses for CI |
switch adapters with one line in your config YAML:
adapter: litellm
model: claude-sonnet-4-6
litellm proxy (recommended for multi-provider evals)
litellm runs as a local proxy on localhost:4000, routing to anthropic, deepseek, and google APIs through a single OpenAI-compatible endpoint. one adapter, any model. reasoning token capture works for deepseek-r1 through the proxy.
pip install 'litellm[proxy]'
# set your provider keys
export ANTHROPIC_API_KEY=sk-ant-...
export DEEPSEEK_API_KEY=sk-...
export GOOGLE_API_KEY=AI...
# start the proxy
scripts/start-litellm.sh
# run eval
promptpressure --tier full --multi-config configs/config_litellm_sonnet.yaml
available models via litellm: claude-sonnet-4-6, claude-opus-4-6, deepseek-r1, deepseek-chat, gemini-2.5-flash, gemini-2.5-pro, grok-4.20-reasoning, grok-4.20-multi-agent, grok-4.20-fast, gpt-4o, gpt-4o-mini, llama-3.3-70b. config lives in litellm_config.yaml at project root.
custom adapters
adapters are async functions. add one by creating a file in promptpressure/adapters/:
# promptpressure/adapters/your_adapter.py
import httpx
async def generate_response(prompt: str, model_name: str = "your-model", config: dict = None) -> str:
api_key = config.get("your_api_key") if config else None
async with httpx.AsyncClient(timeout=60.0) as client:
response = await client.post(
"https://api.example.com/v1/chat/completions",
headers={"Authorization": f"Bearer {api_key}"},
json={"model": model_name, "messages": [{"role": "user", "content": prompt}]}
)
response.raise_for_status()
return response.json()["choices"][0]["message"]["content"]
register it in promptpressure/adapters/__init__.py:
from .your_adapter import generate_response as your_generate_response
# in load_adapter():
if name_lower == "your_adapter":
return lambda text, config: your_generate_response(text, config.get("model_name"), config)
zero-cost adapters
Claude Code and OpenCode run through their respective CLI tools. no API keys, no per-token costs. if you have a subscription, the eval runs are free.
Claude Code uses claude -p in non-interactive mode. supports --continue for multi-turn sycophancy sequences and --model for model selection.
promptpressure --multi-config configs/config_claude_code.yaml
adapter: claude-code
model: sonnet
OpenCode Zen uses opencode run in non-interactive mode. auto-selects the best model via Zen for each prompt.
promptpressure --multi-config configs/config_opencode_zen.yaml
adapter: opencode-zen
both adapters check if the CLI tool is installed before running and give a clear error with install instructions if not found.
batch mode
batch is the default for full and deep tier runs through the litellm adapter. single-turn entries route through the provider's batch API automatically (50% off for anthropic and google). real-time is the exception, not the default.
# batch is automatic for full/deep + litellm
promptpressure --tier full --multi-config configs/config_litellm_sonnet.yaml
# force real-time for debugging
promptpressure --no-batch --tier full --multi-config configs/config_litellm_sonnet.yaml
# smoke/quick tiers always use real-time (no batch overhead for small runs)
promptpressure --quick --multi-config configs/config_litellm_sonnet.yaml
entries that always use real-time regardless of flags:
- multi-turn sequences (each turn depends on the previous response)
- deepseek R1 (reasoning tokens don't survive batch responses)
- providers without batch support (deepseek-chat, groq, ollama)
- providers without batch API (openrouter, groq, ollama)
| entry type | anthropic | google/gemini | xai/grok | deepseek R1 | deepseek-chat | openrouter |
|---|---|---|---|---|---|---|
| single-turn | batch (50% off) | batch (50% off) | batch (50% off) | real-time | real-time | real-time |
| multi-turn | real-time | real-time | real-time | real-time | real-time | real-time |
cost tracking: litellm responses include token usage. the eval runner computes per-model cost via litellm.completion_cost() and saves to outputs/<timestamp>/cost.json.
{"per_model": {"Claude Sonnet 4.6 (litellm)": {"cost_usd": 0.0234, "requests": 200}}, "total_cost_usd": 0.0234}
post-analysis (automated grading)
score responses automatically after evaluation:
promptpressure --multi-config configs/config.yaml --post-analyze openrouter
the grading pipeline uses XML boundary tags to prevent the evaluated model's response from influencing its own score (prompt injection defense).
override the scoring model:
scoring_model_name: anthropic/claude-3-haiku
CI mode
promptpressure --multi-config configs/config_mock.yaml --ci
outputs a machine-readable JSON summary to stdout. exits 0 if all prompts pass, exits 1 on any failure.
{"total": 200, "passed": 200, "failed": 0, "errors": 0, "success": true}
CLI reference
$ promptpressure --help
usage: promptpressure [-h] [--multi-config MULTI_CONFIG [MULTI_CONFIG ...]]
[--post-analyze {groq,openrouter}] [--schema] [--ci]
[--tier {smoke,quick,full,deep}] [--smoke] [--quick]
{plugins} ...
options:
--multi-config YAML config file(s)
--tier run tier: smoke, quick, full, deep (default: quick)
--smoke shortcut for --tier smoke
--quick shortcut for --tier quick
--no-batch force real-time (batch is default for litellm + full/deep)
--post-analyze post-eval grading via groq or openrouter
--schema dump JSON Schema for configuration
--ci machine-readable output + exit codes
plugins list list available plugins
plugins install install a plugin by name
configuration
configs live in configs/:
adapter: openrouter
model: openai/gpt-oss-20b:free
model_name: GPT-OSS 20B
dataset: evals_dataset.json
output: results.csv
output_dir: outputs
temperature: 0.7
tier: quick # smoke | quick | full | deep
max_workers: 5
collect_metrics: true
run multiple configs in one pass:
promptpressure --multi-config configs/a.yaml configs/b.yaml
project structure
promptpressure/
adapters/ # model connectors (openrouter, groq, ollama, claude code, etc)
plugins/ # scorer plugin system
monitoring/ # prometheus metrics + docker-compose
templates/ # jinja2 report templates (html, markdown)
api.py # fastapi server (optional, for programmatic access)
cli.py # main eval runner
config.py # pydantic settings
tier.py # tier filtering (smoke/quick/full/deep)
per_turn_metrics.py # automated per-turn behavioral metrics
database.py # sqlalchemy models
metrics.py # metrics collector
rate_limit.py # async token bucket rate limiter
reporting.py # report generator
configs/ # yaml eval configs per model
evals_dataset.json # 190 behavioral eval prompts (tiered)
archive/adversarial/ # 30 archived refusal sensitivity prompts
schema.json # JSON Schema for dataset entry format
results/ # saved eval results (per-model JSON)
examples/ # sample reports and comparison data
tests/ # pytest suite (50 tests)
sample report
see examples/sample_report.html for what the output looks like.
security
- API keys loaded from
.env(gitignored), never persisted to database - API server binds to
127.0.0.1by default - CORS restricted to localhost (override with
--cors-origins) - bearer token auth on all API endpoints (set
PROMPTPRESSURE_API_SECRET) - grading pipeline uses XML boundaries to prevent prompt injection
- plugin install requires authentication
- no telemetry
contributing
- tests pass:
pytest tests/ - no unnecessary dependencies
- document changes
license
MIT. see LICENSE.
Project details
Release history Release notifications | RSS feed
Download files
Download the file for your platform. If you're not sure which to choose, learn more about installing packages.
Source Distribution
Built Distribution
Filter files by name, interpreter, ABI, and platform.
If you're not sure about the file name format, learn more about wheel file names.
Copy a direct link to the current filters
File details
Details for the file promptpressure_evals-3.2.0.tar.gz.
File metadata
- Download URL: promptpressure_evals-3.2.0.tar.gz
- Upload date:
- Size: 89.0 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.14.3
File hashes
| Algorithm | Hash digest | |
|---|---|---|
| SHA256 |
84808a17ca1bd136da837c73c95d56ee6272209df8d22d3d92f49309bed4f70a
|
|
| MD5 |
7b8c3718433c756667e5f7c3b9096321
|
|
| BLAKE2b-256 |
5b4b8dfa7e7f8b3df7f822de44942634a189619b19b48ad592a4483a48e31571
|
File details
Details for the file promptpressure_evals-3.2.0-py3-none-any.whl.
File metadata
- Download URL: promptpressure_evals-3.2.0-py3-none-any.whl
- Upload date:
- Size: 76.5 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.14.3
File hashes
| Algorithm | Hash digest | |
|---|---|---|
| SHA256 |
bfdc4adb98667fd037734ea4db56e2ac0d58531986d9099e80be3f1d1d5a6068
|
|
| MD5 |
1eb38481e662fdb16db752d530ce41e6
|
|
| BLAKE2b-256 |
5b330564ecbf951f291758946e6bdc33cc994531c1882b5fa5f5a553e7b16420
|