Harness for measuring LLM agent resistance to indirect prompt injection and comparing defense effectiveness.

These details have not been verified by PyPI

Project links

Project description

AgentProbe: Defense Evaluation Harness for LLM Agents

What This Is

A testing framework for measuring your LLM agent's resistance to indirect prompt injection and comparing defense effectiveness. Tests your own systems or those you have permission to test.

NOT an attack generator or bypass toolkit. NOT for probing other people's systems.

Key Findings from Our Research

Our testing on gpt-4o-mini and claude-haiku-4-5 reveals three things:

Surface-level linguistic transforms don't work on modern models
- Pragmatic implicature, register shifts, code-switching: ~0% success rate
- Modern LLMs aren't fooled by just changing speech act or tone
Indirect injection through data IS a real vulnerability
- Information hidden in tool outputs (emails, documents, web pages) bypasses prompt-level defenses
- Separation at prompt level is not enough
Asymmetry: Models leak data more readily than execute unauthorized actions
- Defending against information leakage != defending against tool abuse
- Different threat models need different defenses

Results: Defense Effectiveness

gpt-4o-mini

Defense names below match the defense column in the CSV outputs (data/) and JSON reports.

Defense (code name)	Leak Rate	N
`none` (baseline)	29.8%	84
`delimited` (delimiter wrap)	25.0%	84
`instr_hierarchy` (privilege-level instruction)	31.0%	84
`sandwich` (recency reinforcement)	15.5%	84
`spotlight` (datamarking)	6.0%	84
`llm_filter` (separate screening pass)	0%	84

For reference, the same battery on gpt-4o leaks much less (baseline 10.7%, delimited/llm_filter 0%), and claude-haiku-4-5 holds 0% across every defense — so absolute numbers are model-specific; treat them as relative defense rankings, not universal constants.

claude-haiku-4-5 holds baseline at 0% leak rate across all test conditions; defense differentiation is not measurable on this model.

Key Finding: Screening (and datamarking) beat prompt-level defenses

The separate verification pass (llm_filter) achieved 0 successful leaks in 84 test runs on gpt-4o-mini. The next best is spotlight (datamarking) at 6.0%. By contrast, prompt-level instruction (instr_hierarchy, 31.0%) was no better than baseline (29.8%).

This suggests: prompt-level instructions and delimiters are incomplete; either token-level datamarking or a separate, independent judgment pass is required to reliably catch injection.

How To Use

Test Your Own Agent

Note: The PyPI package is named agentprobe-injection (the plain agentprobe name was already taken). The import package and CLI command are still agentprobe.

# Install from PyPI
pip install agentprobe-injection

# Or install the latest from GitHub
pip install git+https://github.com/Samgar-kz/agentprobe.git

# Or clone for development
git clone https://github.com/Samgar-kz/agentprobe.git
cd agentprobe && pip install -e .

export OPENAI_API_KEY="..."

agentprobe scan \
  --target dummy \
  --oracle semantic \
  --json-report results.json

# Check results
cat results.json | jq '.statistics'

Available Defenses to Test

The harness measures effectiveness of these defenses:

none — baseline (no defense applied)
delimited — wrap data in <<<UNTRUSTED_DATA_BEGIN>>>...<<<UNTRUSTED_DATA_END>>> markers
spotlight — datamarking: mark every data token so the model separates data from instructions
sandwich — repeat the do-not-obey rule after the data (recency effect)
instr_hierarchy — tag data with an explicit low privilege level; assert system instructions outrank tool/data content
llm_filter — separate LLM verification pass to detect/strip injection before execution

Test each against YOUR agent. See which work, which break utility.

How It Works

Injection Generator: Creates test payloads (carriers: email, document, web page) with hidden canary instructions
Defense Applicator: Wraps the data with each defense mechanism
Target Adapter: Sends to your agent, captures response
Semantic Oracle: Uses gpt-4o-mini to judge: did agent leak data or follow the hidden instruction?
Utility Harness: Runs benign legitimate tasks to ensure defenses don't break normal functionality
Report: Table showing defense effectiveness + utility cost

Defense vs Utility Trade-off

Result: All 5 defenses preserve utility on legitimate tasks (120/120 runs, 0% false-positive rate).

Tested on 8 benign tasks (extract dates, risks, budget, sentiment, action items, meeting notes, legitimately forward to internal address) with 3 repeats each:

Defense	False-Positive Rate	Status
`none`	0%	baseline
`delimited`	0%	safe to use
`spotlight`	0%	safe to use
`sandwich`	0%	safe to use
`instr_hierarchy`	0%	safe to use
`llm_filter`	0%	safe to use

Conclusion: Defenses do not break legitimate agent functionality (in current test suite). Task success rate remains 100% across all defenses, making the injection effectiveness/defense trade-off directly comparable (both measured under same utility constraints).

Run your own: python run_utility_harness.py --repeats=3 --temp=0.7 --out=utility_results.csv

Responsible Use

Only test systems you own or have written permission to test
Destination: understanding YOUR defenses, not generating portable bypasses
Disclose findings responsibly (if testing third-party systems with permission)
The framework measures vulnerability, it's not a jailbreak toolkit

Architecture

agentprobe/
├── oracle_semantic.py          # LLM-as-judge using gpt-4o-mini
├── oracle_legacy.py            # Fallback: substring matching
├── oracle.py                   # Oracle interface
├── adapters/
│   ├── dummy.py               # Built-in intentionally-vulnerable agent simulator
│   ├── http.py                # Test any HTTP-accessible agent (sync)
│   └── http_async.py          # Async HTTP adapter for concurrent scans
├── injection/
│   ├── carriers.py            # Email, document, web page wrappers
│   ├── defenses.py            # Defense mechanisms to evaluate
│   ├── benign_tasks.py        # Utility harness tasks
│   └── screening.py           # Screening defense (separate LLM pass)
├── engine.py                  # Synchronous scan
├── engine_async.py            # Async scan
├── metrics.py                 # Statistical analysis (Wilson CI, effect sizes)
├── report.py                  # Report generation
├── logging_config.py          # Structured logging, cost tracking
└── cli.py                     # Command-line interface

Command-Line Usage

Basic scan

# Test dummy agent
agentprobe scan --target dummy

# Test HTTP agent
agentprobe scan --target http \
  --endpoint http://localhost:8000/chat \
  --input-field message \
  --output-field reply

Control oracle

# Use semantic oracle (default, requires OPENAI_API_KEY)
agentprobe scan --target dummy --oracle semantic

# Use legacy oracle (offline, pattern matching)
agentprobe scan --target dummy --oracle legacy

# Set confidence threshold
agentprobe scan --target dummy --oracle semantic --min-confidence 0.85

Reports

# JSON report with statistics
agentprobe scan --target dummy --json-report results.json

# Verbose logging
agentprobe scan --target dummy --verbose 2

Measurement Infrastructure

Oracle: gpt-4o-mini with Structured Outputs (semantic judgment)
Test Harness: Carriers simulate real data flows (email, document, web page)
Utility Harness: Measures task success rate per defense on benign tasks (see Defense vs Utility Trade-off above)
Benchmarking: Latency / throughput available via --async --concurrency N on HTTP targets

All numbers above are from actual test runs (CSV in /data/).

Testing Your Own Code

# Run all tests
pytest tests/ -v

# Test a specific component
pytest tests/test_oracle_semantic.py -v

# Run with coverage
pytest tests/ --cov=agentprobe

# Benchmark async performance
agentprobe scan --target dummy --async --concurrency 15

What's NOT Included

Evasion techniques or obfuscation tooling (intentionally)
Zero-day exploits or novel vulnerabilities
Portable bypass payloads designed to be transferable across different systems

Note on linguistic transforms: The harness does include pragmatic, register, discourse and code-switching (ru-en) categories — but as measurement probes, not as attack tooling. Our data shows surface-level linguistic transforms have ~0% success on modern frontier models, which is itself a useful finding for defenders deciding where to invest.

This is a defensive measurement tool, not an offensive toolkit.

Citation

If you use this in research, cite as:

@misc{agentprobe2026,
  title={AgentProbe: Evaluating LLM Agent Defenses Against Indirect Injection},
  author={Samgar},
  year={2026},
  url={https://github.com/Samgar-kz/agentprobe}
}

License

MIT

Project details

These details have not been verified by PyPI

Project links

Release history Release notifications | RSS feed

This version

0.2.0a1 pre-release

Jun 11, 2026

Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

agentprobe_injection-0.2.0a1.tar.gz (414.1 kB view details)

Uploaded Jun 11, 2026 Source

Built Distribution

If you're not sure about the file name format, learn more about wheel file names.

The dropdown lists show the available interpreters, ABIs, and platforms. Enable javascript to be able to filter the list of wheel files.

agentprobe_injection-0.2.0a1-py3-none-any.whl (52.6 kB view details)

Uploaded Jun 11, 2026 Python 3

File details

Details for the file agentprobe_injection-0.2.0a1.tar.gz.

File metadata

Download URL: agentprobe_injection-0.2.0a1.tar.gz
Upload date: Jun 11, 2026
Size: 414.1 kB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.2.0 CPython/3.12.3

File hashes

Hashes for agentprobe_injection-0.2.0a1.tar.gz
Algorithm	Hash digest
SHA256	`4865d0c078ae65e8e0d7df54f4ed89ed2e5bf7f733c674021e055a05d818ab91`
MD5	`b4274614201c013b235a22ac5f31246e`
BLAKE2b-256	`f03e709625e7b8b3283679f90d580b070e61f2b9e20176c20628d54bbd300170`

See more details on using hashes here.

File details

Details for the file agentprobe_injection-0.2.0a1-py3-none-any.whl.

File metadata

Download URL: agentprobe_injection-0.2.0a1-py3-none-any.whl
Upload date: Jun 11, 2026
Size: 52.6 kB
Tags: Python 3
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.2.0 CPython/3.12.3

File hashes

Hashes for agentprobe_injection-0.2.0a1-py3-none-any.whl
Algorithm	Hash digest
SHA256	`81487012ad5a59b00420d5216d8c96929541e3499c99d7478def6ed9ea67166b`
MD5	`408a32f5cb9b9c3a5706f62f2f9656ec`
BLAKE2b-256	`07fece6c8d2f7fbc3c219e4a6a496988c505ae19293ba49deee9c0ca91aff56a`

See more details on using hashes here.

agentprobe-injection 0.2.0a1

Navigation

Verified details

Maintainers

Unverified details

Project links

Meta

Classifiers

Project description

AgentProbe: Defense Evaluation Harness for LLM Agents

What This Is

Key Findings from Our Research

Results: Defense Effectiveness

Key Finding: Screening (and datamarking) beat prompt-level defenses

How To Use

Test Your Own Agent

Available Defenses to Test

How It Works

Defense vs Utility Trade-off

Responsible Use

Architecture

Command-Line Usage

Basic scan

Control oracle

Reports

Measurement Infrastructure

Testing Your Own Code

What's NOT Included

Citation

License

Project details

Verified details

Maintainers

Unverified details

Project links

Meta

Classifiers

Release history Release notifications | RSS feed

Download files

Source Distribution

Built Distribution

File details

File metadata

File hashes

File details

File metadata

File hashes