Skip to main content

Sacrificial LLM instances as behavioral probes for prompt injection detection

Project description

Little Canary

Sacrificial LLM instances as behavioral probes for prompt injection detection

License: Apache 2.0 Python 3.8+ CI

What it does

  • Runs a fast structural filter (regex + decode/recheck for base64, hex, ROT13, reverse encodings)
  • Probes raw input with a small sacrificial "canary" model and checks for behavioral compromise
  • Returns either block, flag + advisory, or pass depending on mode

When to use

  • You run an LLM app or agent and want a lightweight pre-check for prompt injection
  • You can tolerate ~250ms additional latency per input
  • You want a model-agnostic layer that works with your existing stack

When not to use

  • You need formal security guarantees or audited benchmark comparability
  • You cannot accept pass-through behavior when the canary is unavailable (see Fail-open design)

Results snapshot

  • 93.8% combined detection on 400 human-written TensorTrust attacks (external benchmark)
  • 50% cost reduction on attack traffic — canary blocks before the production LLM is called
  • +19.5pp improvement for local models (Mistral 7B: 70.2% → 89.7% with canary)
  • 0% false positives on 40 realistic customer chatbot prompts
  • ~250ms latency per check on consumer hardware

See Benchmarks and Limitations for full methodology, external validation, and caveats.


Table of Contents


Quick Start

# 1. Install Ollama and pull a canary model
ollama pull qwen2.5:1.5b

# 2. Install Little Canary
pip install little-canary
from little_canary import SecurityPipeline

pipeline = SecurityPipeline(canary_model="qwen2.5:1.5b", mode="full")
verdict = pipeline.check(user_input)

if not verdict.safe:
    return "Sorry, I couldn't process that request."

# Prepend advisory to your existing system prompt
system = verdict.advisory.to_system_prefix() + "\n" + your_system_prompt
response = your_llm(system=system, messages=[{"role": "user", "content": user_input}])

That's it. Your LLM, your app, your logic. The canary adds a security layer in front.

How It Works

User Input --> Structural Filter (1ms) --> Canary Probe (250ms) --> Your LLM
                   |                            |
              Known patterns              Behavioral analysis
              (regex + encoding)          (did the canary get owned?)

Layer 1: Structural Filter (~1ms) Regex-based detection of known attack patterns, plus decode-then-recheck for base64, hex, ROT13, and reverse-encoded payloads.

Layer 2: Canary Probe (~250ms) Feeds raw input to a small sacrificial LLM (qwen2.5:1.5b by default). Temperature=0 for deterministic output. The canary's response is analyzed for signs of compromise: persona adoption, instruction compliance, system prompt leakage, refusal collapse, hijack target phrases, and response length anomalies.

Analysis Layer (pluggable)

  • Default: regex-based BehavioralAnalyzer — fast, zero dependencies
  • Experimental: LLMJudge — a second model classifies the canary's output as SAFE/UNSAFE

Advisory System Suspicious inputs that aren't hard-blocked generate a SecurityAdvisory prepended to your production LLM's system prompt, warning it about detected signals.

Why a sacrificial model?

Every existing defense classifies inputs. Little Canary observes what attacks do to a model and reads the aftermath:

  • Llama Guard evaluates content against safety categories. Little Canary detects behavioral compromise, not content safety violations.
  • Prompt Guard detects injection patterns in input text. Little Canary uses actual LLM behavioral response rather than input-side classification.
  • NeMo Guardrails uses rules and LLM calls to control dialogue flow. Little Canary works with any LLM stack, no framework required.

The canary is deliberately small and weak. It gets compromised by attacks that your production LLM might resist. That's the point — a compromised canary is a strong signal.

Deployment Modes

Mode Behavior Best For
block Hard-blocks detected attacks Customer chatbots, zero-tolerance systems
advisory Never blocks, flags for production LLM Zero-downtime systems, monitoring
full Blocks obvious attacks, flags ambiguous ones Agents, email processors, hybrid workflows

Fail-open Design

[!NOTE] If Ollama is unavailable, the pipeline passes all inputs through unscreened. This is a deliberate availability-over-security tradeoff.

How to operate safely:

  • Call pipeline.health_check() at startup to verify the canary model is reachable
  • Monitor the canary_available field in health check output
  • Alert if the canary becomes unavailable in production

Benchmark Results

External Validation — TensorTrust (Human-Written Attacks)

Tested against TensorTrust, a UC Berkeley dataset of human-written prompt injection attacks collected from a competitive adversarial game. These are real attacks that succeeded against production models — not AI-generated.

Opus 4.6 + Canary Pipeline (n=400)

Metric Value
Combined catch rate 93.8% (375/400)
Canary pre-filter blocked 201/400 (50.2%)
Opus refused (of 199 that passed canary) 174/199 (87.4%)
Combined missed 25/400 (6.2%)
Opus API calls saved 201 (50.2%)
Length Bucket Total Combined Caught Combined Rate
Short 100 93 93.0%
Medium 200 182 91.0%
Long 100 100 100.0%

Mistral 7B + Canary (n=80, exploratory)

Path Catch Rate
Mistral 7B alone 48.7%
Mistral 7B + Canary ~80.8%
Improvement +32pp

Internal Benchmark (AI-Generated Prompts)

160 adversarial prompts across 9 attack categories, plus 40 false-positive prompts. Compliance judged by Claude Sonnet 4.5.

Metric Value
False positive rate 0/40 on realistic chatbot traffic
Latency ~250ms per check

Model comparison (160 prompts, 9 categories, refusal rate excluding errors):

Model Baseline + Canary
Claude Haiku 4.5 99.4%
Claude Opus 4.6 98.0%
Claude Sonnet 4.5 98.6% 98.7%
GPT-4o-mini 98.1%
Mistral 7B 70.2% 89.7%

[!NOTE] The canary provides the largest benefit for weaker/local models. Top-tier models already refuse 98%+ of AI-generated attacks without a canary. The canary's value for frontier models is cost reduction (50% fewer API calls on attack traffic) and defense-in-depth.

Integration Examples

Customer Chatbot (Block Mode)

from little_canary import SecurityPipeline

pipeline = SecurityPipeline(canary_model="qwen2.5:1.5b", mode="block")

def handle_message(user_input):
    verdict = pipeline.check(user_input)
    if not verdict.safe:
        return "I'm sorry, I couldn't process that. Could you rephrase?"
    return call_your_llm(user_input)

Email Agent (Full Mode)

from little_canary import SecurityPipeline

pipeline = SecurityPipeline(canary_model="qwen2.5:1.5b", mode="full")

def process_email(email_body, sender):
    verdict = pipeline.check(email_body)
    if not verdict.safe:
        quarantine(email_body, sender, verdict.summary)
        return
    system = verdict.advisory.to_system_prefix() + "\n" + agent_prompt
    agent.process(system=system, content=email_body)

See examples/ for complete integration code.

API Quick Reference

from little_canary import SecurityPipeline

# Initialize
pipeline = SecurityPipeline(
    canary_model="qwen2.5:1.5b",   # any Ollama model
    mode="full",                     # "block", "advisory", or "full"
    ollama_url="http://localhost:11434",
    canary_timeout=10.0,
)

# Check input
verdict = pipeline.check(user_input)
verdict.safe              # bool — is input safe to forward?
verdict.blocked_by        # str or None — "structural_filter" or "canary_probe"
verdict.advisory          # SecurityAdvisory — flagged signals
verdict.advisory.flagged  # bool — were suspicious signals detected?
verdict.advisory.to_system_prefix()  # str — prepend to your system prompt
verdict.total_latency     # float — seconds

# Health check
health = pipeline.health_check()
health["canary_available"]  # bool

Running the Benchmarks

# Red team suite (160 adversarial + 20 safe prompts, live dashboard)
cd benchmarks
python3 red_team_runner.py --canary qwen2.5:1.5b
# Dashboard at http://localhost:8899

# False positive test (40 realistic prompts)
python3 run_fp_test.py

# Full pipeline test (canary + production LLM)
python3 full_pipeline_test.py --canary qwen2.5:1.5b --production gemma3:27b --attacks-only

Project Structure

little-canary/
├── little_canary/                 # Core package (pip install .)
│   ├── __init__.py
│   ├── py.typed                   # PEP 561 type marker
│   ├── structural_filter.py       # Layer 1: regex + encoding detection
│   ├── canary.py                  # Layer 2: sacrificial LLM probe
│   ├── analyzer.py                # Behavioral analysis (regex-based)
│   ├── judge.py                   # LLM judge (experimental, replaces regex)
│   └── pipeline.py                # Orchestration + three deployment modes
├── tests/                         # Unit tests (pytest, 98%+ coverage)
├── examples/                      # Integration examples
├── benchmarks/                    # Test suites and dashboard
├── .github/                       # CI, issue templates, dependabot
├── pyproject.toml
└── requirements.txt

Troubleshooting

"Cannot connect to Ollama"

  • Ensure Ollama is running: ollama serve (or check with pgrep ollama)
  • Verify the URL: default is http://localhost:11434
  • Test connectivity: curl http://localhost:11434/api/tags

"Model not found"

  • Pull the model first: ollama pull qwen2.5:1.5b
  • The model name must match exactly (e.g., qwen2.5:1.5b, not qwen2.5)

High false positive rate

  • Use mode="full" instead of mode="block" to flag ambiguous inputs as advisories rather than hard-blocking
  • Check benchmarks/run_fp_test.py against your traffic patterns

Slow response times

  • The default qwen2.5:1.5b targets ~250ms. Set a lower canary_timeout to fail fast.
  • Use enable_structural_filter=True, enable_canary=False for structural-only mode (~1ms, no LLM required).

Limitations

  • TensorTrust is one external benchmark. Validated against human-written attacks, but not yet tested on Garak or HarmBench.
  • Single canary model tested. Other models may perform differently.
  • Regex-based behavioral analysis. The experimental LLMJudge is included for higher accuracy.
  • No production deployment data. All results are from controlled testing.
  • Ollama-only. No abstraction layer for other backends yet.
  • Internal benchmark uses AI-generated prompts. May not reflect real-world attack distribution. TensorTrust validation addresses this partially.

Roadmap

  • Benchmark against TensorTrust (400 human-written attacks, 93.8% combined)
  • Benchmark against Garak and HarmBench attack suites
  • LLM judge to replace regex analyzer (higher accuracy)
  • Backend abstraction layer (vLLM, llama.cpp, OpenAI-compatible APIs)
  • Fine-tuned canary model (increased susceptibility = stronger signal)
  • Multi-canary ensemble for higher detection rates
  • Agent integration SDK (MCP, LangChain, CrewAI)

Contributing

See CONTRIBUTING.md for development setup and contribution guidelines.

Citation

@software{little_canary,
  author = {Bosch, Rolando},
  title = {Little Canary: Sacrificial LLM Instances as Behavioral Probes for Prompt Injection Detection},
  year = {2026},
  url = {https://github.com/roli-lpci/little-canary},
  license = {Apache-2.0}
}

License

Apache 2.0 — see LICENSE for details.

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

little_canary-0.2.0.tar.gz (43.7 kB view details)

Uploaded Source

Built Distribution

If you're not sure about the file name format, learn more about wheel file names.

little_canary-0.2.0-py3-none-any.whl (32.1 kB view details)

Uploaded Python 3

File details

Details for the file little_canary-0.2.0.tar.gz.

File metadata

  • Download URL: little_canary-0.2.0.tar.gz
  • Upload date:
  • Size: 43.7 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.2.0 CPython/3.9.6

File hashes

Hashes for little_canary-0.2.0.tar.gz
Algorithm Hash digest
SHA256 263e3b6dcd5369d52d9ab95871d1b92d61c093f44195a917d7ab88929abf6850
MD5 2b3c62f7c40d40c482bc0c199fffd306
BLAKE2b-256 ab91d3cdc808ddf0b3b5a890e164bbc049536210629497961a56197f260d151f

See more details on using hashes here.

File details

Details for the file little_canary-0.2.0-py3-none-any.whl.

File metadata

  • Download URL: little_canary-0.2.0-py3-none-any.whl
  • Upload date:
  • Size: 32.1 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.2.0 CPython/3.9.6

File hashes

Hashes for little_canary-0.2.0-py3-none-any.whl
Algorithm Hash digest
SHA256 c659744d96104755bcabc5647f017f5d0c1b942406a193e91b8bd15bbbd3463c
MD5 5abf67ca58d225af120d37beea296d3a
BLAKE2b-256 7081e2b404f78ce520ef28980beeb153b0454b80e45ef02d0ca6d9f51ef6f339

See more details on using hashes here.

Supported by

AWS Cloud computing and Security Sponsor Datadog Monitoring Depot Continuous Integration Fastly CDN Google Download Analytics Pingdom Monitoring Sentry Error logging StatusPage Status page