Harness for measuring LLM agent resistance to indirect prompt injection and comparing defense effectiveness.
Project description
AgentProbe: Defense Evaluation Harness for LLM Agents
What This Is
A testing framework for measuring your LLM agent's resistance to indirect prompt injection and comparing defense effectiveness. Tests your own systems or those you have permission to test.
NOT an attack generator or bypass toolkit. NOT for probing other people's systems.
Key Findings from Our Research
Our testing on gpt-4o-mini and claude-haiku-4-5 reveals three things:
-
Surface-level linguistic transforms don't work on modern models
- Pragmatic implicature, register shifts, code-switching: ~0% success rate
- Modern LLMs aren't fooled by just changing speech act or tone
-
Indirect injection through data IS a real vulnerability
- Information hidden in tool outputs (emails, documents, web pages) bypasses prompt-level defenses
- Separation at prompt level is not enough
-
Asymmetry: Models leak data more readily than execute unauthorized actions
- Defending against information leakage != defending against tool abuse
- Different threat models need different defenses
Results: Defense Effectiveness
gpt-4o-mini
Defense names below match the defense column in the CSV outputs (data/) and JSON reports.
| Defense (code name) | Leak Rate | N |
|---|---|---|
none (baseline) |
29.8% | 84 |
delimited (delimiter wrap) |
25.0% | 84 |
instr_hierarchy (privilege-level instruction) |
31.0% | 84 |
sandwich (recency reinforcement) |
15.5% | 84 |
spotlight (datamarking) |
6.0% | 84 |
llm_filter (separate screening pass) |
0% | 84 |
For reference, the same battery on gpt-4o leaks much less (baseline 10.7%, delimited/llm_filter 0%), and claude-haiku-4-5 holds 0% across every defense — so absolute numbers are model-specific; treat them as relative defense rankings, not universal constants.
claude-haiku-4-5 holds baseline at 0% leak rate across all test conditions; defense differentiation is not measurable on this model.
Key Finding: Screening (and datamarking) beat prompt-level defenses
The separate verification pass (llm_filter) achieved 0 successful leaks in 84 test runs on gpt-4o-mini. The next best is spotlight (datamarking) at 6.0%. By contrast, prompt-level instruction (instr_hierarchy, 31.0%) was no better than baseline (29.8%).
This suggests: prompt-level instructions and delimiters are incomplete; either token-level datamarking or a separate, independent judgment pass is required to reliably catch injection.
How To Use
Test Your Own Agent
Note: The PyPI package is named
agentprobe-injection(the plainagentprobename was already taken). The import package and CLI command are stillagentprobe.
# Install from PyPI
pip install agentprobe-injection
# Or install the latest from GitHub
pip install git+https://github.com/Samgar-kz/agentprobe.git
# Or clone for development
git clone https://github.com/Samgar-kz/agentprobe.git
cd agentprobe && pip install -e .
export OPENAI_API_KEY="..."
agentprobe scan \
--target dummy \
--oracle semantic \
--json-report results.json
# Check results
cat results.json | jq '.statistics'
Available Defenses to Test
The harness measures effectiveness of these defenses:
none— baseline (no defense applied)delimited— wrap data in<<<UNTRUSTED_DATA_BEGIN>>>...<<<UNTRUSTED_DATA_END>>>markersspotlight— datamarking: mark every data token so the model separates data from instructionssandwich— repeat the do-not-obey rule after the data (recency effect)instr_hierarchy— tag data with an explicit low privilege level; assert system instructions outrank tool/data contentllm_filter— separate LLM verification pass to detect/strip injection before execution
Test each against YOUR agent. See which work, which break utility.
How It Works
- Injection Generator: Creates test payloads (carriers: email, document, web page) with hidden canary instructions
- Defense Applicator: Wraps the data with each defense mechanism
- Target Adapter: Sends to your agent, captures response
- Semantic Oracle: Uses gpt-4o-mini to judge: did agent leak data or follow the hidden instruction?
- Utility Harness: Runs benign legitimate tasks to ensure defenses don't break normal functionality
- Report: Table showing defense effectiveness + utility cost
Defense vs Utility Trade-off
Result: All 5 defenses preserve utility on legitimate tasks (120/120 runs, 0% false-positive rate).
Tested on 8 benign tasks (extract dates, risks, budget, sentiment, action items, meeting notes, legitimately forward to internal address) with 3 repeats each:
| Defense | False-Positive Rate | Status |
|---|---|---|
none |
0% | baseline |
delimited |
0% | safe to use |
spotlight |
0% | safe to use |
sandwich |
0% | safe to use |
instr_hierarchy |
0% | safe to use |
llm_filter |
0% | safe to use |
Conclusion: Defenses do not break legitimate agent functionality (in current test suite). Task success rate remains 100% across all defenses, making the injection effectiveness/defense trade-off directly comparable (both measured under same utility constraints).
Run your own: python run_utility_harness.py --repeats=3 --temp=0.7 --out=utility_results.csv
Responsible Use
- Only test systems you own or have written permission to test
- Destination: understanding YOUR defenses, not generating portable bypasses
- Disclose findings responsibly (if testing third-party systems with permission)
- The framework measures vulnerability, it's not a jailbreak toolkit
Architecture
agentprobe/
├── oracle_semantic.py # LLM-as-judge using gpt-4o-mini
├── oracle_legacy.py # Fallback: substring matching
├── oracle.py # Oracle interface
├── adapters/
│ ├── dummy.py # Built-in intentionally-vulnerable agent simulator
│ ├── http.py # Test any HTTP-accessible agent (sync)
│ └── http_async.py # Async HTTP adapter for concurrent scans
├── injection/
│ ├── carriers.py # Email, document, web page wrappers
│ ├── defenses.py # Defense mechanisms to evaluate
│ ├── benign_tasks.py # Utility harness tasks
│ └── screening.py # Screening defense (separate LLM pass)
├── engine.py # Synchronous scan
├── engine_async.py # Async scan
├── metrics.py # Statistical analysis (Wilson CI, effect sizes)
├── report.py # Report generation
├── logging_config.py # Structured logging, cost tracking
└── cli.py # Command-line interface
Command-Line Usage
Basic scan
# Test dummy agent
agentprobe scan --target dummy
# Test HTTP agent
agentprobe scan --target http \
--endpoint http://localhost:8000/chat \
--input-field message \
--output-field reply
Control oracle
# Use semantic oracle (default, requires OPENAI_API_KEY)
agentprobe scan --target dummy --oracle semantic
# Use legacy oracle (offline, pattern matching)
agentprobe scan --target dummy --oracle legacy
# Set confidence threshold
agentprobe scan --target dummy --oracle semantic --min-confidence 0.85
Reports
# JSON report with statistics
agentprobe scan --target dummy --json-report results.json
# Verbose logging
agentprobe scan --target dummy --verbose 2
Measurement Infrastructure
- Oracle: gpt-4o-mini with Structured Outputs (semantic judgment)
- Test Harness: Carriers simulate real data flows (email, document, web page)
- Utility Harness: Measures task success rate per defense on benign tasks (see Defense vs Utility Trade-off above)
- Benchmarking: Latency / throughput available via
--async --concurrency Non HTTP targets
All numbers above are from actual test runs (CSV in /data/).
Testing Your Own Code
# Run all tests
pytest tests/ -v
# Test a specific component
pytest tests/test_oracle_semantic.py -v
# Run with coverage
pytest tests/ --cov=agentprobe
# Benchmark async performance
agentprobe scan --target dummy --async --concurrency 15
What's NOT Included
- Evasion techniques or obfuscation tooling (intentionally)
- Zero-day exploits or novel vulnerabilities
- Portable bypass payloads designed to be transferable across different systems
Note on linguistic transforms: The harness does include pragmatic, register, discourse and code-switching (ru-en) categories — but as measurement probes, not as attack tooling. Our data shows surface-level linguistic transforms have ~0% success on modern frontier models, which is itself a useful finding for defenders deciding where to invest.
This is a defensive measurement tool, not an offensive toolkit.
Citation
If you use this in research, cite as:
@misc{agentprobe2026,
title={AgentProbe: Evaluating LLM Agent Defenses Against Indirect Injection},
author={Samgar},
year={2026},
url={https://github.com/Samgar-kz/agentprobe}
}
License
MIT
Project details
Release history Release notifications | RSS feed
Download files
Download the file for your platform. If you're not sure which to choose, learn more about installing packages.
Source Distribution
Built Distribution
Filter files by name, interpreter, ABI, and platform.
If you're not sure about the file name format, learn more about wheel file names.
Copy a direct link to the current filters
File details
Details for the file agentprobe_injection-0.2.0a1.tar.gz.
File metadata
- Download URL: agentprobe_injection-0.2.0a1.tar.gz
- Upload date:
- Size: 414.1 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.12.3
File hashes
| Algorithm | Hash digest | |
|---|---|---|
| SHA256 |
4865d0c078ae65e8e0d7df54f4ed89ed2e5bf7f733c674021e055a05d818ab91
|
|
| MD5 |
b4274614201c013b235a22ac5f31246e
|
|
| BLAKE2b-256 |
f03e709625e7b8b3283679f90d580b070e61f2b9e20176c20628d54bbd300170
|
File details
Details for the file agentprobe_injection-0.2.0a1-py3-none-any.whl.
File metadata
- Download URL: agentprobe_injection-0.2.0a1-py3-none-any.whl
- Upload date:
- Size: 52.6 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.12.3
File hashes
| Algorithm | Hash digest | |
|---|---|---|
| SHA256 |
81487012ad5a59b00420d5216d8c96929541e3499c99d7478def6ed9ea67166b
|
|
| MD5 |
408a32f5cb9b9c3a5706f62f2f9656ec
|
|
| BLAKE2b-256 |
07fece6c8d2f7fbc3c219e4a6a496988c505ae19293ba49deee9c0ca91aff56a
|