Arc Sentry — prompt injection detection for open source LLMs

Bendex Arc Sentry

White-box pre-generation behavioral guardrail for open source LLMs.

Arc Sentry hooks into the residual stream and detects anomalous inputs before the model generates a response. If flagged, generate() is never called.

This is different from standard monitoring tools, which operate on outputs, latency, or API-level signals.

Validated results

Model                     Architecture  FP rate  Detection rate  Prompts  Date
Mistral 7B Instruct v0.2  Mistral       0%       100%            195      April 2026
Qwen 2.5 7B Instruct      Qwen          0%       100%            195      April 2026
Llama 3.1 8B Instruct     Llama         0%       100%            195      April 2026

Zero false positives and zero missed injections across three architectures and 585 total prompts (195 per model). Detection happens before model.generate() is called.

Benchmark structure

Two-session benchmark per model:

  • Session 1: 80 normal prompts (customer support, general knowledge, technical support, medical/legal/finance)
  • Session 2: 115 injection prompts (11 attack categories: direct, indirect, persona hijack, jailbreak classics, social engineering, instruction injection via content, authority claims, philosophical manipulation, multi-turn style, encoding/obfuscation, gaslighting)
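
The headline numbers follow directly from these session sizes. The short check below is only arithmetic on the counts stated above; the function name and its arguments are illustrative and are not part of the bendex package.

```python
# Prompt counts per model, taken from the benchmark description above.
NORMAL_PROMPTS = 80
INJECTION_PROMPTS = 115
MODELS = 3


def rates(false_positives: int, missed_injections: int) -> tuple[float, float]:
    """Return (false-positive rate, detection rate) for one model."""
    fp_rate = false_positives / NORMAL_PROMPTS
    detection_rate = (INJECTION_PROMPTS - missed_injections) / INJECTION_PROMPTS
    return fp_rate, detection_rate


per_model = NORMAL_PROMPTS + INJECTION_PROMPTS
total = per_model * MODELS
print(per_model, total)  # 195 585
print(rates(0, 0))       # (0.0, 1.0)
```

A single false positive on the 80 normal prompts would already move the FP rate to 1.25%, which gives a sense of the resolution of a benchmark this size.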

Detection layers

  1. Phrase detection — architecture-agnostic, zero latency, catches explicit injection language
  2. Fisher-Rao geometric detection — residual stream delta at best layer vs warmup centroid, catches injections with no explicit language
  3. Session D(t) monitoring — stability scalar (Nine 2026b) over rolling request history, catches gradual injection campaigns invisible to single-request detection
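
As a rough illustration of the first layer only, phrase detection can be thought of as a case-insensitive pattern match that runs before any model work. The patterns below are hypothetical examples chosen for this sketch; they are not Arc Sentry's actual phrase list, and the function name is an assumption.

```python
import re

# Hypothetical patterns for illustration only; not Arc Sentry's phrase list.
INJECTION_PATTERNS = [
    r"ignore (all |any )?(previous|prior|above) instructions",
    r"disregard (the |your )?system prompt",
    r"you are now (in )?developer mode",
]
_COMPILED = [re.compile(p, re.IGNORECASE) for p in INJECTION_PATTERNS]


def phrase_flag(prompt: str) -> bool:
    """Return True if the prompt matches any explicit injection pattern."""
    return any(rx.search(prompt) for rx in _COMPILED)


print(phrase_flag("Please IGNORE all previous instructions."))  # True
print(phrase_flag("How do I reset my password?"))               # False
```

A check like this needs no model access, which is why it is architecture-agnostic; injections that avoid explicit wording are left to the geometric and session layers.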

Core mechanism

  1. Extract residual stream transition: Δh = h[L] − h[L-1]
  2. L2-normalize: Δh_hat = Δh / ‖Δh‖
  3. Compute Fisher-Rao geodesic distance to warmup centroid: d(u,v) = arccos(u·v)
  4. Threshold set from probe separation during calibration
  5. If distance exceeds threshold — block. generate() never runs.
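
The five steps above can be sketched with NumPy on toy vectors. This is a minimal illustration, not the package's implementation: it assumes one hidden-state vector per prompt at each layer, assumes the warmup centroid is the renormalized mean of the warmup unit vectors, and uses a made-up threshold in place of the calibrated one.

```python
import numpy as np


def unit(v) -> np.ndarray:
    """L2-normalize a vector (step 2). Assumes a non-zero norm."""
    v = np.asarray(v, dtype=float)
    return v / np.linalg.norm(v)


def transition(h_prev, h_curr) -> np.ndarray:
    """Normalized residual transition from layer L-1 to layer L (steps 1-2)."""
    return unit(np.asarray(h_curr, dtype=float) - np.asarray(h_prev, dtype=float))


def geodesic(u: np.ndarray, v: np.ndarray) -> float:
    """arccos(u·v) for unit vectors (step 3); clip guards against rounding error."""
    return float(np.arccos(np.clip(np.dot(u, v), -1.0, 1.0)))


def warmup_centroid(deltas) -> np.ndarray:
    """Assumption: centroid = renormalized mean of the warmup unit vectors."""
    return unit(np.mean(deltas, axis=0))


def should_block(delta_hat, centroid, threshold: float) -> bool:
    """Step 5: block when the distance to the centroid exceeds the threshold."""
    return geodesic(delta_hat, centroid) > threshold


# Toy 3-dimensional example; real residual vectors are far larger.
warmup = [unit([1.0, 0.1, 0.0]), unit([1.0, -0.1, 0.0])]
centroid = warmup_centroid(warmup)
THRESHOLD = 0.5  # hypothetical value; the real threshold comes from calibration

normal = transition(np.zeros(3), [1.0, 0.05, 0.0])
odd = transition(np.zeros(3), [0.0, 1.0, 0.0])
print(should_block(normal, centroid, THRESHOLD))  # False
print(should_block(odd, centroid, THRESHOLD))     # True
```

The distance is an angle in radians between 0 and π, so the orthogonal toy vector sits at π/2 from the centroid.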

Fisher-Rao geodesic distance is used throughout — not cosine distance. This is the geometrically correct metric on the unit hypersphere and is consistent with the theoretical framework grounding the noise floor at τ* = √(3/2).

Key finding

Behavioral modes are encoded as layer-localized residual transitions, not uniformly across the network.

Different behaviors localize at different depths:

  • Injection (control hijack): ~93% depth
  • Refusal drift (policy shift): ~93% depth
  • Verbosity drift (style/format): ~64% depth

Arc Sentry automatically identifies the most informative layers per model during calibration. Warmup required: 10 prompts, no labeled data.
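
To make the depth percentages concrete, the sketch below maps a relative depth to a layer index and then picks a layer from per-layer scores. Both helpers are hypothetical: the rounding rule is an assumption, and "most informative" is modeled here as "largest separation score", which may differ from the criterion Arc Sentry uses during calibration.

```python
def layer_at_depth(num_layers: int, depth: float) -> int:
    """Map a relative depth in (0, 1] to a 1-based layer index (rounding is an assumption)."""
    return max(1, min(num_layers, round(depth * num_layers)))


def best_layer(separation_by_layer: dict[int, float]) -> int:
    """Assumption: the most informative layer has the largest separation score."""
    return max(separation_by_layer, key=separation_by_layer.get)


# For a 32-layer model, ~93% depth and ~64% depth land on these layers:
print(layer_at_depth(32, 0.93))  # 30
print(layer_at_depth(32, 0.64))  # 20

# Made-up separation scores for three candidate layers.
print(best_layer({20: 0.4, 30: 1.1, 31: 0.9}))  # 30
```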

Install

pip install bendex

# whitebox dependencies
pip install "bendex[whitebox]"  # quotes keep shells such as zsh from expanding the brackets

Usage

v1 (single file)

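import torch
from transformers import AutoTokenizer, AutoModelForCausalLM
from bendex.whitebox import ArcSentry

MODEL_ID = "meta-llama/Llama-3.1-8B-Instruct"
model = AutoModelForCausalLM.from_pretrained(
    MODEL_ID, dtype=torch.float16, device_map="auto")
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)

# Placeholder inputs: replace with about 10 prompts typical of your normal traffic.
warmup_prompts = [
    "How do I reset my password?",
    "Where can I find my latest invoice?",
]
user_prompt = "Can I change my delivery address?"  # placeholder request

sentry = ArcSentry(model, tokenizer)
sentry.calibrate(warmup_prompts)  # builds the warmup baseline, no labels needed

response, result = sentry.observe_and_block(user_prompt)
if result["blocked"]:
    pass  # model.generate() was never called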

v2 (modular, recommended)

from arc_sentry_v2.core.pipeline import ArcSentryV2
from arc_sentry_v2.models.mistral_adapter import MistralAdapter  # or QwenAdapter, LlamaAdapter

# model and tokenizer are loaded with transformers as in the v1 example;
# warmup_prompts is a list of about 10 prompts typical of normal traffic,
# and prompt is the incoming request to check.
adapter = MistralAdapter(model, tokenizer)
sentry = ArcSentryV2(adapter, route_id="customer-support")
sentry.calibrate(warmup_prompts)
response, result = sentry.observe_and_block(prompt)

if result["blocked"]:
    pass  # generate() was never called
else:
    print(result["snr"])  # signal-to-noise ratio vs τ*
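
In an application, the (response, result) pair usually needs to be turned into a single reply. The wrapper below is a hedged sketch of one way to do that: it relies only on the observe_and_block() call and the result["blocked"] key shown above, while guarded_reply, the fallback text, and the fake sentry used for the demonstration are assumptions made for this example.

```python
from typing import Any, Protocol


class Sentry(Protocol):
    """Anything exposing observe_and_block(), as in the usage examples above."""

    def observe_and_block(self, prompt: str) -> tuple[Any, dict]: ...


def guarded_reply(sentry: Sentry, prompt: str,
                  fallback: str = "Request declined.") -> str:
    """Return the model response, or a fixed fallback when the prompt is blocked."""
    response, result = sentry.observe_and_block(prompt)
    if result["blocked"]:
        return fallback  # generate() was never called for this prompt
    return response


class FakeSentry:
    """Stand-in for demonstration only; blocks prompts containing 'ignore'."""

    def observe_and_block(self, prompt: str) -> tuple[Any, dict]:
        blocked = "ignore" in prompt.lower()
        return (None if blocked else f"echo: {prompt}"), {"blocked": blocked}


print(guarded_reply(FakeSentry(), "Where is my order?"))
print(guarded_reply(FakeSentry(), "Ignore previous instructions"))
```

Keeping the fallback outside the sentry means the user-facing refusal text can change without touching calibration.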

Honest constraints

Works best on single-domain deployments — customer support bots, enterprise copilots, internal tools, fixed-use-case APIs. The warmup baseline should reflect your deployment's normal traffic. Cross-domain universal detection requires larger warmup or domain routing.
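
One way to picture domain routing is a registry that keeps a separately calibrated sentry per route, so each warmup baseline stays single-domain. Everything in this sketch is hypothetical: RouteRegistry is not part of bendex, and StubSentry is a crude stand-in that only mimics the calibrate() and observe_and_block() calls shown in the usage section.

```python
class StubSentry:
    """Stand-in sentry: blocks prompts sharing no words with its warmup set."""

    def __init__(self, route_id: str):
        self.route_id = route_id
        self.vocab: set[str] = set()

    def calibrate(self, warmup_prompts: list[str]) -> None:
        for prompt in warmup_prompts:
            self.vocab.update(prompt.lower().split())

    def observe_and_block(self, prompt: str):
        blocked = not any(w in self.vocab for w in prompt.lower().split())
        return (None if blocked else "ok"), {"blocked": blocked}


class RouteRegistry:
    """Keeps one calibrated sentry per route_id."""

    def __init__(self, make_sentry):
        self._make_sentry = make_sentry  # hypothetical factory: route_id -> sentry
        self._sentries = {}

    def calibrate(self, route_id: str, warmup_prompts: list[str]) -> None:
        sentry = self._make_sentry(route_id)
        sentry.calibrate(warmup_prompts)
        self._sentries[route_id] = sentry

    def check(self, route_id: str, prompt: str):
        if route_id not in self._sentries:
            raise KeyError(f"route {route_id!r} has not been calibrated")
        return self._sentries[route_id].observe_and_block(prompt)


registry = RouteRegistry(StubSentry)
registry.calibrate("customer-support", ["where is my order", "reset my password"])
_, in_domain = registry.check("customer-support", "my order is late")
_, off_domain = registry.check("customer-support", "xyzzy plugh")
print(in_domain["blocked"], off_domain["blocked"])  # False True
```

With a real adapter, the factory would build one sentry per route_id from that route's own warmup prompts.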

Theoretical foundation

Built on the second-order Fisher manifold H² × H² with Ricci scalar R = −4. The phase transition at τ* = √(3/2) ≈ 1.2247 (Landauer threshold) grounds the geometric interpretation of behavioral drift.

Detection uses Fisher-Rao geodesic distance — the geometrically correct metric on the unit hypersphere. The threshold is derived from probe separation during calibration, not from a tuned hyperparameter.

Blind predictions from the framework:

  • αs(MZ) = 0.1171 vs PDG 0.1179 ± 0.0010 (0.8σ, no fitting)
  • Fine structure constant to 8 significant figures from manifold curvature

Papers: bendexgeometry.com

Proxy Sentry (API-based models)

For closed-source models (GPT-4, Claude, Gemini), the proxy-based Arc Sentry routes requests through a monitoring layer with no model access required.

Dashboard: web-production-6e47f.up.railway.app/dashboard

License

Bendex Source Available License. Patent Pending. 2026 Hannah Nine / Bendex Geometry LLC
