Sacrificial LLM instances as behavioral probes for prompt injection detection

These details have not been verified by PyPI

Project links

Project description

Little Canary

Sacrificial LLM instances as behavioral probes for prompt injection detection

What it does

Runs a fast structural filter (regex + decode/recheck for base64, hex, ROT13, reverse encodings)
Probes raw input with a small sacrificial "canary" model and checks for behavioral compromise
Returns either block, flag + advisory, or pass depending on mode

When to use

You run an LLM app or agent and want a lightweight pre-check for prompt injection
You can tolerate ~250ms additional latency per input
You want a model-agnostic layer that works with your existing stack

When not to use

You need formal security guarantees or audited benchmark comparability
You cannot accept pass-through behavior when the canary is unavailable (see Fail-open design)

Results snapshot

98% effective detection on our internal red-team suite (220 adversarial prompts). Not yet validated on Garak/HarmBench.
0% false positives on 40 realistic customer chatbot prompts
~250ms latency per check on consumer hardware

Internal test suite — see Benchmarks and Limitations for methodology and caveats.

Quick Start
How It Works
Deployment Modes
Fail-open Design
Benchmark Results
Integration Examples
API Quick Reference
Running the Benchmarks
Project Structure
Troubleshooting
Limitations
Roadmap
Contributing
Citation
License

Quick Start

# 1. Install Ollama and pull a canary model
ollama pull qwen2.5:1.5b

# 2. Install Little Canary (not on PyPI yet — install from source)
git clone https://github.com/roli-lpci/little-canary.git
cd little-canary
pip install .

from little_canary import SecurityPipeline

pipeline = SecurityPipeline(canary_model="qwen2.5:1.5b", mode="full")
verdict = pipeline.check(user_input)

if not verdict.safe:
    return "Sorry, I couldn't process that request."

# Prepend advisory to your existing system prompt
system = verdict.advisory.to_system_prefix() + "\n" + your_system_prompt
response = your_llm(system=system, messages=[{"role": "user", "content": user_input}])

That's it. Your LLM, your app, your logic. The canary adds a security layer in front.

How It Works

User Input --> Structural Filter (1ms) --> Canary Probe (250ms) --> Your LLM
                   |                            |
              Known patterns              Behavioral analysis
              (regex + encoding)          (did the canary get owned?)

Layer 1: Structural Filter (~1ms) Regex-based detection of known attack patterns, plus decode-then-recheck for base64, hex, ROT13, and reverse-encoded payloads.

Layer 2: Canary Probe (~250ms) Feeds raw input to a small sacrificial LLM (qwen2.5:1.5b by default). Temperature=0 for deterministic output. The canary's response is analyzed for signs of compromise: persona adoption, instruction compliance, system prompt leakage, refusal collapse.

Analysis Layer (pluggable)

Default: regex-based BehavioralAnalyzer — fast, zero dependencies
Experimental: LLMJudge — a second model classifies the canary's output as SAFE/UNSAFE

Advisory System Suspicious inputs that aren't hard-blocked generate a SecurityAdvisory prepended to your production LLM's system prompt, warning it about detected signals.

Why a sacrificial model?

Every existing defense classifies inputs. Little Canary observes what attacks do to a model and reads the aftermath:

Llama Guard evaluates content against safety categories. Little Canary detects behavioral compromise, not content safety violations.
Prompt Guard detects injection patterns in input text. Little Canary uses actual LLM behavioral response rather than input-side classification.
NeMo Guardrails uses rules and LLM calls to control dialogue flow. Little Canary works with any LLM stack, no framework required.

The canary is deliberately small and weak. It gets compromised by attacks that your production LLM might resist. That's the point — a compromised canary is a strong signal.

Deployment Modes

Mode	Behavior	Best For
`block`	Hard-blocks detected attacks	Customer chatbots, zero-tolerance systems
`advisory`	Never blocks, flags for production LLM	Zero-downtime systems, monitoring
`full`	Blocks obvious attacks, flags ambiguous ones	Agents, email processors, hybrid workflows

Fail-open Design

[!NOTE] If Ollama is unavailable, the pipeline passes all inputs through unscreened. This is a deliberate availability-over-security tradeoff.

How to operate safely:

Call pipeline.health_check() at startup to verify the canary model is reachable
Monitor the canary_available field in health check output
Alert if the canary becomes unavailable in production

Benchmark Results

Tested against an internal red-team suite of 220 adversarial prompts across 12 attack categories, plus a separate false-positive test of 40 realistic chatbot prompts.

Metric	Value
Effective detection rate	98% (full pipeline with production LLM)
Canary standalone block rate	37% (canary + structural filter alone)
False positive rate	0/40 on realistic chatbot traffic
Latency	~250ms per check

Detection by category:

Category	Effective Rate	Attacks
Role escalation	90%	20
Canary mismatch	80%	20
Benign wrapper	70%	20
Multi-step trap	70%	20
Canary divergence	70%	20
Classic injection	65%	20
Tool trigger	65%	20
Context stuffing	50%	20
Encoding/obfuscation	40%	20
Canary outage	40%	20
Paired obvious	—	10
Paired stealthy	—	10

[!WARNING] Self-generated test suite. These prompts were created for this project, not drawn from established adversarial benchmarks. Validate against TensorTrust, Garak, or HarmBench before comparing to other tools.

Integration Examples

Customer Chatbot (Block Mode)

from little_canary import SecurityPipeline

pipeline = SecurityPipeline(canary_model="qwen2.5:1.5b", mode="block")

def handle_message(user_input):
    verdict = pipeline.check(user_input)
    if not verdict.safe:
        return "I'm sorry, I couldn't process that. Could you rephrase?"
    return call_your_llm(user_input)

Email Agent (Full Mode)

from little_canary import SecurityPipeline

pipeline = SecurityPipeline(canary_model="qwen2.5:1.5b", mode="full")

def process_email(email_body, sender):
    verdict = pipeline.check(email_body)
    if not verdict.safe:
        quarantine(email_body, sender, verdict.summary)
        return
    system = verdict.advisory.to_system_prefix() + "\n" + agent_prompt
    agent.process(system=system, content=email_body)

See examples/ for complete integration code.

API Quick Reference

from little_canary import SecurityPipeline

# Initialize
pipeline = SecurityPipeline(
    canary_model="qwen2.5:1.5b",   # any Ollama model
    mode="full",                     # "block", "advisory", or "full"
    ollama_url="http://localhost:11434",
    canary_timeout=10.0,
)

# Check input
verdict = pipeline.check(user_input)
verdict.safe              # bool — is input safe to forward?
verdict.blocked_by        # str or None — "structural_filter" or "canary_probe"
verdict.advisory          # SecurityAdvisory — flagged signals
verdict.advisory.flagged  # bool — were suspicious signals detected?
verdict.advisory.to_system_prefix()  # str — prepend to your system prompt
verdict.total_latency     # float — seconds

# Health check
health = pipeline.health_check()
health["canary_available"]  # bool

Running the Benchmarks

# Red team suite (220 adversarial + 20 safe prompts, live dashboard)
cd benchmarks
python3 red_team_runner.py --canary qwen2.5:1.5b
# Dashboard at http://localhost:8899

# False positive test (40 realistic prompts)
python3 run_fp_test.py

# Full pipeline test (canary + production LLM)
python3 full_pipeline_test.py --canary qwen2.5:1.5b --production gemma3:27b --attacks-only

Project Structure

little-canary/
├── little_canary/                 # Core package (pip install .)
│   ├── __init__.py
│   ├── py.typed                   # PEP 561 type marker
│   ├── structural_filter.py       # Layer 1: regex + encoding detection
│   ├── canary.py                  # Layer 2: sacrificial LLM probe
│   ├── analyzer.py                # Behavioral analysis (regex-based)
│   ├── judge.py                   # LLM judge (experimental, replaces regex)
│   └── pipeline.py                # Orchestration + three deployment modes
├── tests/                         # Unit tests (pytest, 98%+ coverage)
├── examples/                      # Integration examples
├── benchmarks/                    # Test suites and dashboard
├── .github/                       # CI, issue templates, dependabot
├── pyproject.toml
└── requirements.txt

Troubleshooting

"Cannot connect to Ollama"

Ensure Ollama is running: ollama serve (or check with pgrep ollama)
Verify the URL: default is http://localhost:11434
Test connectivity: curl http://localhost:11434/api/tags

"Model not found"

Pull the model first: ollama pull qwen2.5:1.5b
The model name must match exactly (e.g., qwen2.5:1.5b, not qwen2.5)

High false positive rate

Use mode="full" instead of mode="block" to flag ambiguous inputs as advisories rather than hard-blocking
Check benchmarks/run_fp_test.py against your traffic patterns

Slow response times

The default qwen2.5:1.5b targets ~250ms. Set a lower canary_timeout to fail fast.
Use enable_structural_filter=True, enable_canary=False for structural-only mode (~1ms, no LLM required).

Limitations

Self-generated test suite. Results should be validated against standard benchmarks.
Single canary model tested. Other models may perform differently.
Regex-based behavioral analysis. The experimental LLMJudge is included for higher accuracy.
No production deployment data. All results are from controlled testing.
Ollama-only. No abstraction layer for other backends yet.

Roadmap

Benchmark against TensorTrust, Garak, and HarmBench attack suites
LLM judge to replace regex analyzer (higher accuracy)
Backend abstraction layer (vLLM, llama.cpp, OpenAI-compatible APIs)
Fine-tuned canary model (increased susceptibility = stronger signal)
Multi-canary ensemble for higher detection rates
Agent integration SDK (MCP, LangChain, CrewAI)

Contributing

See CONTRIBUTING.md for development setup and contribution guidelines.

Citation

@software{little_canary,
  author = {Bosch, Rolando},
  title = {Little Canary: Sacrificial LLM Instances as Behavioral Probes for Prompt Injection Detection},
  year = {2026},
  url = {https://github.com/roli-lpci/little-canary},
  license = {Apache-2.0}
}

License

Apache 2.0 — see LICENSE for details.

Project details

These details have not been verified by PyPI

Project links

Release history Release notifications | RSS feed

0.3.0

Mar 23, 2026

0.2.2

Mar 2, 2026

0.2.1

Mar 2, 2026

0.2.0

Feb 25, 2026

This version

0.1.0

Feb 24, 2026

Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

little_canary-0.1.0.tar.gz (40.0 kB view details)

Uploaded Feb 24, 2026 Source

Built Distribution

If you're not sure about the file name format, learn more about wheel file names.

The dropdown lists show the available interpreters, ABIs, and platforms. Enable javascript to be able to filter the list of wheel files.

little_canary-0.1.0-py3-none-any.whl (30.3 kB view details)

Uploaded Feb 24, 2026 Python 3

File details

Details for the file little_canary-0.1.0.tar.gz.

File metadata

Download URL: little_canary-0.1.0.tar.gz
Upload date: Feb 24, 2026
Size: 40.0 kB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.2.0 CPython/3.9.6

File hashes

Hashes for little_canary-0.1.0.tar.gz
Algorithm	Hash digest
SHA256	`e637d6e638bf61578bf6f86d7bbfd80a21b49a375f543c09b0306b143bef29d3`
MD5	`431757cf46e2d37956e61df09a4a4118`
BLAKE2b-256	`2cabdb38cf976f9d9f0705edda62fa340ec45dac479e2d483a89164d8fdcf9a7`

See more details on using hashes here.

File details

Details for the file little_canary-0.1.0-py3-none-any.whl.

File metadata

Download URL: little_canary-0.1.0-py3-none-any.whl
Upload date: Feb 24, 2026
Size: 30.3 kB
Tags: Python 3
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.2.0 CPython/3.9.6

File hashes

Hashes for little_canary-0.1.0-py3-none-any.whl
Algorithm	Hash digest
SHA256	`910a7982e9576ee2186dc18d850ab8600f6b91b4beee89d7b492fb9a20a4f657`
MD5	`c1f8a05d68b07f32f464bb9156109af5`
BLAKE2b-256	`85431da8016efe956e5c33e231e5938dc662b2be69c7c11015003f315f57ce18`

See more details on using hashes here.

little-canary 0.1.0

Navigation

Verified details

Maintainers

Unverified details

Project links

Meta

Classifiers

Project description

Little Canary

What it does

When to use

When not to use

Results snapshot

Table of Contents

Quick Start

How It Works

Why a sacrificial model?

Deployment Modes

Fail-open Design

Benchmark Results

Integration Examples

Customer Chatbot (Block Mode)

Email Agent (Full Mode)

API Quick Reference

Running the Benchmarks

Project Structure

Troubleshooting

Limitations

Roadmap

Contributing

Citation

License

Project details

Verified details

Maintainers

Unverified details

Project links

Meta

Classifiers

Release history Release notifications | RSS feed

Download files

Source Distribution

Built Distribution

File details

File metadata

File hashes

File details

File metadata

File hashes