
SIREN: a lightweight, plug-and-play guard model for LLM harmfulness detection from internal representations.


llm-siren

Python library for SIREN (LLM Safety From Within: Detecting Harmful Content with Internal Representations, ACL 2026 Main).

SIREN is a lightweight guard model for LLM harmfulness detection. It runs on a small frozen LLM backbone, identifies safety neurons across the backbone's internal layers, feeds their activations to a small MLP classifier, and returns a continuous harmfulness score in [0, 1]. The trained classifier head is ~12M parameters; no fine-tuning of the backbone is required.

This package (llm-siren) provides the runtime. Trained SIREN heads are released as Hugging Face artifacts under UofTCSSLab.

Install

pip install llm-siren

Usage

import torch
from siren_guard import SirenGuard

guard = SirenGuard.from_pretrained(
    "UofTCSSLab/SIREN-Qwen3-0.6B",
    device="cuda",
    dtype=torch.bfloat16,
)

# Score a user prompt
r = guard.score("How can I make a pipe bomb at home?")
print(r.score, r.is_harmful)

# Score a (prompt, response) pair
r = guard.score(
    prompt="How can I make a pipe bomb at home?",
    response="I can't help with that. Building explosive devices is illegal.",
)

# Batch scoring
results = guard.score_batch([
    "What is the capital of France?",
    "Describe in detail how to commit insurance fraud.",
])

# Streaming: feed the growing assistant text after each generation chunk.
# `stream_from_deployed_llm` and `prompt` are placeholders for your own
# deployed model's streaming API and the user's prompt.
prefix = ""
for chunk in stream_from_deployed_llm(prompt):
    prefix += chunk
    if guard.score_streaming(prefix, threshold=0.5).is_harmful:
        break

# Tunable threshold (default 0.5, the binary boundary used during training).
# `text` is any string you want to moderate.
r = guard.score(text, threshold=0.1)  # strict: flags more content as harmful
r = guard.score(text, threshold=0.9)  # permissive: flags less
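Assuming is_harmful is simply the score compared against the threshold (consistent with the default 0.5 boundary above), the effect of tuning can be sketched without the library:

```python
def is_harmful(score: float, threshold: float = 0.5) -> bool:
    """Map a continuous harmfulness score in [0, 1] to a binary decision.

    Assumption for illustration: is_harmful = score >= threshold.
    """
    return score >= threshold

score = 0.42  # hypothetical SIREN output for some input
print(is_harmful(score, threshold=0.1))  # strict -> True
print(is_harmful(score, threshold=0.5))  # default boundary -> False
print(is_harmful(score, threshold=0.9))  # permissive -> False
```

Lowering the threshold trades false negatives for false positives: a strict deployment flags borderline content, a permissive one lets more through.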

Deployment idiom

def safe_generate(user_prompt, deployed_llm):
    # Pre-generation check on the user prompt.
    if guard.score(user_prompt).is_harmful:
        return DEFAULT_REFUSAL
    response = deployed_llm.generate(user_prompt)
    # Post-generation check on the (prompt, response) pair.
    if guard.score(prompt=user_prompt, response=response).is_harmful:
        return DEFAULT_REFUSAL
    return response

The deployed LLM is independent of SIREN. SIREN never touches the deployed model's internals; it scores the same text through its own frozen backbone.

Available SIREN artifacts

Artifact            Backbone     Repo
SIREN-Qwen3-0.6B    Qwen3-0.6B   UofTCSSLab/SIREN-Qwen3-0.6B

More backbones (Qwen3-4B, Llama-3.2-1B) coming soon.

API

SirenGuard.from_pretrained(repo_id_or_path, device=None, dtype=torch.bfloat16, cache_dir=None) Loads the SIREN classifier head from an HF repo or local path, plus the frozen backbone at the pinned revision recorded in siren_config.json.

score(text=None, *, prompt=None, response=None, threshold=None) -> ScoreResult Score a single string. Pass text= for raw moderation, or prompt=/response= for the response-level form (joined with "\n" to match the training distribution).

score_batch(texts, threshold=None) -> list[ScoreResult] Score a list of strings in one forward pass.

score_streaming(response_so_far, threshold=None) -> ScoreResult Score a growing assistant-side text prefix during generation.

Each call returns ScoreResult(score: float, is_harmful: bool, threshold: float).
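Given the documented "\n" join rule, the response-level call is equivalent to scoring the concatenated text. A minimal sketch of the result shape and the join (ScoreResult here is a hypothetical stand-in mirroring the documented fields, not the class shipped with llm-siren):

```python
from dataclasses import dataclass

@dataclass
class ScoreResult:
    score: float        # continuous harmfulness score in [0, 1]
    is_harmful: bool    # score compared against the active threshold
    threshold: float    # threshold used for this call

# Per the API docs, prompt=/response= are joined with "\n" before scoring,
# so both calling forms see the same underlying text.
prompt = "How can I make a pipe bomb at home?"
response = "I can't help with that."
joined = prompt + "\n" + response  # what guard.score(text=joined) would see

r = ScoreResult(score=0.93, is_harmful=True, threshold=0.5)
```

This makes it easy to log or cache exact scored inputs: the joined string is the canonical form for response-level moderation.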

License

Apache-2.0.

Citation

@article{jiao2026llm,
  title={LLM Safety From Within: Detecting Harmful Content with Internal Representations},
  author={Jiao, Difan and Liu, Yilun and Yuan, Ye and Tang, Zhenwei and Du, Linfeng and Wu, Haolun and Anderson, Ashton},
  journal={arXiv preprint arXiv:2604.18519},
  year={2026}
}
