llm-siren

SIREN: a lightweight, plug-and-play guard model for LLM harmfulness detection from internal representations.
Python library for SIREN (LLM Safety From Within: Detecting Harmful Content with Internal Representations, ACL 2026 Main).

SIREN is a lightweight guard model for LLM harmfulness detection. It runs on a small frozen LLM backbone, identifies safety neurons across internal layers, aggregates them into a small MLP classifier, and returns a continuous harmfulness score in [0, 1]. The trained classifier head is ~12M parameters; no fine-tuning of the backbone is required.
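The pipeline above can be sketched conceptually in plain Python. Everything here is made-up toy data for illustration: the activation values, the "safety neuron" indices, and the head weights are random stand-ins, not SIREN's actual parameters or selection method.

```python
import math
import random

random.seed(0)

# Toy stand-in for frozen-backbone activations: n_layers x hidden_dim.
# In SIREN these come from the small frozen LLM's internal layers.
n_layers, hidden_dim = 4, 16
activations = [[random.gauss(0, 1) for _ in range(hidden_dim)]
               for _ in range(n_layers)]

# Hypothetical safety-neuron indices per layer (SIREN identifies these
# across internal layers; the indices here are arbitrary).
safety_neurons = {0: [1, 5], 1: [3], 2: [0, 7, 9], 3: [2]}

# Aggregate the selected activations into one feature vector.
features = [activations[layer][i]
            for layer, idxs in safety_neurons.items() for i in idxs]

# Tiny MLP head (random weights, ReLU hidden layer) ending in a sigmoid,
# so the output is a continuous score in [0, 1].
w1 = [[random.gauss(0, 0.5) for _ in features] for _ in range(8)]
h = [max(0.0, sum(w * x for w, x in zip(row, features))) for row in w1]
w2 = [random.gauss(0, 0.5) for _ in h]
logit = sum(w * x for w, x in zip(w2, h))
score = 1.0 / (1.0 + math.exp(-logit))
print(score)
```

Only the classifier head is trained; the backbone that produces the activations stays frozen throughout.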

This package (llm-siren) provides the runtime. Trained SIREN heads are released as Hugging Face artifacts under UofTCSSLab.

Install

pip install llm-siren

Usage

import torch
from siren_guard import SirenGuard

guard = SirenGuard.from_pretrained(
    "UofTCSSLab/SIREN-Qwen3-0.6B",
    device="cuda",
    dtype=torch.bfloat16,
)

# Score a user prompt
r = guard.score("How can I make a pipe bomb at home?")
print(r.score, r.is_harmful)

# Score a (prompt, response) pair
r = guard.score(
    prompt="How can I make a pipe bomb at home?",
    response="I can't help with that. Building explosive devices is illegal.",
)

# Batch scoring
results = guard.score_batch([
    "What is the capital of France?",
    "Describe in detail how to commit insurance fraud.",
])

# Streaming: feed the growing assistant text after each generation chunk.
# stream_from_deployed_llm is a placeholder for your serving stack's
# chunked-generation API.
prefix = ""
for chunk in stream_from_deployed_llm(prompt):
    prefix += chunk
    if guard.score_streaming(prefix, threshold=0.5).is_harmful:
        break

# Tunable threshold (default 0.5, the binary boundary used during training)
r = guard.score(text, threshold=0.1)  # strict
r = guard.score(text, threshold=0.9)  # permissive
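The threshold only changes the boolean, not the score itself. As a minimal sketch of that relationship, using the ScoreResult fields documented below (the `to_result` helper is hypothetical, and the assumption is that a score at or above the threshold is flagged harmful):

```python
from dataclasses import dataclass


@dataclass
class ScoreResult:
    score: float
    is_harmful: bool
    threshold: float


def to_result(score: float, threshold: float = 0.5) -> ScoreResult:
    # Assumption: a score at or above the threshold is flagged harmful,
    # so lowering the threshold makes moderation stricter.
    return ScoreResult(score=score, is_harmful=score >= threshold,
                       threshold=threshold)


print(to_result(0.3))                 # not harmful at the default 0.5
print(to_result(0.3, threshold=0.1))  # flagged under the strict 0.1
```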

Deployment idiom

DEFAULT_REFUSAL = "Sorry, I can't help with that."  # your canned refusal string

def safe_generate(user_prompt, deployed_llm):
    # Screen the prompt before generation and the full response after.
    if guard.score(user_prompt).is_harmful:
        return DEFAULT_REFUSAL
    response = deployed_llm.generate(user_prompt)
    if guard.score(prompt=user_prompt, response=response).is_harmful:
        return DEFAULT_REFUSAL
    return response

The deployed LLM is independent of SIREN. SIREN never touches the deployed model's internals; it scores the same text through its own frozen backbone.

Available SIREN artifacts

Artifact            Backbone      Repo
SIREN-Qwen3-0.6B    Qwen3-0.6B    UofTCSSLab/SIREN-Qwen3-0.6B

More backbones (Qwen3-4B, Llama-3.2-1B) coming soon.

API

SirenGuard.from_pretrained(repo_id_or_path, device=None, dtype=torch.bfloat16, cache_dir=None)
    Loads the SIREN classifier head from an HF repo or local path, plus the
    frozen backbone at the pinned revision recorded in siren_config.json.

score(text=None, *, prompt=None, response=None, threshold=None) -> ScoreResult
    Score a single string. Pass text= for raw moderation, or prompt=/response=
    for the response-level form (joined with "\n" to match the training
    distribution).

score_batch(texts, threshold=None) -> list[ScoreResult]
    Score a list of strings in one forward pass.

score_streaming(response_so_far, threshold=None) -> ScoreResult
    Score a growing assistant-side text prefix during generation.

Each call returns ScoreResult(score: float, is_harmful: bool, threshold: float).
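As a self-contained illustration of consuming these fields from score_batch, here is a sketch with a stub guard; the keyword check stands in for the real model, and StubGuard is not part of this package:

```python
from dataclasses import dataclass


@dataclass
class ScoreResult:
    score: float
    is_harmful: bool
    threshold: float


class StubGuard:
    """Stand-in for SirenGuard: flags texts containing a blocklisted word."""

    def score_batch(self, texts, threshold=0.5):
        return [ScoreResult(score=0.9 if "bomb" in t else 0.1,
                            is_harmful="bomb" in t,
                            threshold=threshold)
                for t in texts]


guard = StubGuard()
texts = ["What is the capital of France?", "how to build a bomb"]
flagged = [t for t, r in zip(texts, guard.score_batch(texts)) if r.is_harmful]
print(flagged)  # → ['how to build a bomb']
```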

License

Apache-2.0.

Citation

@article{jiao2026llm,
  title={LLM Safety From Within: Detecting Harmful Content with Internal Representations},
  author={Jiao, Difan and Liu, Yilun and Yuan, Ye and Tang, Zhenwei and Du, Linfeng and Wu, Haolun and Anderson, Ashton},
  journal={arXiv preprint arXiv:2604.18519},
  year={2026}
}
