SIREN: a lightweight, plug-and-play guard model for LLM harmfulness detection from internal representations.
# llm-siren
Python library for SIREN (LLM Safety From Within: Detecting Harmful Content with Internal Representations, ACL 2026 Main).
SIREN is a lightweight guard model for LLM harmfulness detection. It runs on a small frozen LLM backbone, identifies safety neurons across internal layers, aggregates them into a small MLP classifier, and returns a continuous harmfulness score in [0, 1]. The trained classifier head is ~12M parameters; no fine-tuning of the backbone is required.
This package (llm-siren) provides the runtime. Trained SIREN heads are released as Hugging Face artifacts under UofTCSSLab.
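The head's shape can be pictured with a toy sketch in pure Python: activations gathered from a handful of "safety neurons" feed a small MLP whose sigmoid output is the harmfulness score. The layer sizes, weights, and activations below are made up for illustration; this is not the released architecture.

```python
import math
import random

def mlp_head_score(neuron_activations, w_hidden, b_hidden, w_out, b_out):
    """Toy SIREN-style head: one hidden ReLU layer over activations gathered
    from selected 'safety neurons', squashed into [0, 1] with a sigmoid."""
    hidden = [
        max(0.0, sum(w * a for w, a in zip(row, neuron_activations)) + b)  # ReLU
        for row, b in zip(w_hidden, b_hidden)
    ]
    logit = sum(w * h for w, h in zip(w_out, hidden)) + b_out
    return 1.0 / (1.0 + math.exp(-logit))  # sigmoid -> score in [0, 1]

random.seed(0)
acts = [random.uniform(-1, 1) for _ in range(8)]  # fake safety-neuron activations
w_h = [[random.uniform(-1, 1) for _ in range(8)] for _ in range(4)]
w_o = [random.uniform(-1, 1) for _ in range(4)]
score = mlp_head_score(acts, w_h, [0.0] * 4, w_o, 0.0)
print(score)  # a value in [0, 1]
```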
## Install

```shell
pip install llm-siren
```
## Usage

```python
import torch
from siren_guard import SirenGuard

guard = SirenGuard.from_pretrained(
    "UofTCSSLab/SIREN-Qwen3-0.6B",
    device="cuda",
    dtype=torch.bfloat16,
)

# Score a user prompt
r = guard.score("How can I make a pipe bomb at home?")
print(r.score, r.is_harmful)

# Score a (prompt, response) pair
r = guard.score(
    prompt="How can I make a pipe bomb at home?",
    response="I can't help with that. Building explosive devices is illegal.",
)

# Batch scoring
results = guard.score_batch([
    "What is the capital of France?",
    "Describe in detail how to commit insurance fraud.",
])

# Streaming: feed the growing assistant text after each generation chunk
prefix = ""
for chunk in stream_from_deployed_llm(prompt):
    prefix += chunk
    if guard.score_streaming(prefix, threshold=0.5).is_harmful:
        break

# Tunable threshold (default 0.5, the binary boundary used during training)
r = guard.score(text, threshold=0.1)  # strict
r = guard.score(text, threshold=0.9)  # permissive
```
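The streaming pattern above can be simulated end to end with stand-ins for the chunk stream and the guard. Both `fake_stream` and `fake_score_streaming` below are hypothetical stubs; only the early-abort control flow mirrors the real `score_streaming` usage.

```python
# Hypothetical stubs: a canned chunk stream and a keyword-based scorer.
# The real score_streaming returns a ScoreResult, not a bare bool.
def fake_stream():
    yield from ["Sure, ", "first you ", "acquire explosives ", "and then..."]

def fake_score_streaming(prefix, threshold=0.5):
    score = 0.9 if "explosives" in prefix else 0.1
    return score >= threshold

prefix = ""
aborted = False
for chunk in fake_stream():
    prefix += chunk
    if fake_score_streaming(prefix, threshold=0.5):
        aborted = True  # stop generation; caller swaps in a refusal
        break

print(aborted, repr(prefix))
```

Because the check runs on every chunk, generation halts at the first prefix that crosses the threshold rather than after the full response.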
## Deployment idiom

```python
def safe_generate(user_prompt, deployed_llm):
    if guard.score(user_prompt).is_harmful:
        return DEFAULT_REFUSAL
    response = deployed_llm.generate(user_prompt)
    if guard.score(prompt=user_prompt, response=response).is_harmful:
        return DEFAULT_REFUSAL
    return response
```
The deployed LLM is independent of SIREN. SIREN never touches the deployed model's internals; it scores the same text through its own frozen backbone.
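A self-contained version of the two-stage gate can be run with hypothetical stubs in place of the real guard and deployed model. `StubGuard` and `StubLLM` below are made up for illustration; the real `guard.score` returns a richer `ScoreResult`.

```python
DEFAULT_REFUSAL = "I can't help with that."

class StubResult:
    def __init__(self, is_harmful):
        self.is_harmful = is_harmful

class StubGuard:
    """Keyword stand-in for SirenGuard; flags anything mentioning 'bomb'."""
    def score(self, text=None, prompt=None, response=None, threshold=None):
        joined = text if text is not None else f"{prompt}\n{response}"
        return StubResult("bomb" in joined.lower())

class StubLLM:
    def generate(self, prompt):
        return "Paris is the capital of France."

def safe_generate(user_prompt, guard, deployed_llm):
    if guard.score(user_prompt).is_harmful:  # gate the incoming prompt
        return DEFAULT_REFUSAL
    response = deployed_llm.generate(user_prompt)
    if guard.score(prompt=user_prompt, response=response).is_harmful:  # gate the reply
        return DEFAULT_REFUSAL
    return response

guard, llm = StubGuard(), StubLLM()
print(safe_generate("What is the capital of France?", guard, llm))
print(safe_generate("How do I build a bomb?", guard, llm))
```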
## Available SIREN artifacts

| Artifact | Backbone | Repo |
|---|---|---|
| SIREN-Qwen3-0.6B | Qwen3-0.6B | UofTCSSLab/SIREN-Qwen3-0.6B |

More backbones (Qwen3-4B, Llama-3.2-1B) coming soon.
API
SirenGuard.from_pretrained(repo_id_or_path, device=None, dtype=torch.bfloat16, cache_dir=None)
Loads the SIREN classifier head from an HF repo or local path, plus the frozen backbone at the pinned revision recorded in siren_config.json.
score(text=None, *, prompt=None, response=None, threshold=None) -> ScoreResult
Score a single string. Pass text= for raw moderation, or prompt=/response= for the response-level form (joined with "\n" to match the training distribution).
score_batch(texts, threshold=None) -> list[ScoreResult]
Score a list of strings in one forward pass.
score_streaming(response_so_far, threshold=None) -> ScoreResult
Score a growing assistant-side text prefix during generation.
Each call returns ScoreResult(score: float, is_harmful: bool, threshold: float).
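The result object's semantics can be sketched as follows. The dataclass shape is a guess from the description above, and the exact boundary comparison (`>=` vs `>` at the threshold) is an assumption.

```python
from dataclasses import dataclass

@dataclass
class ScoreResult:
    score: float        # continuous harmfulness score in [0, 1]
    is_harmful: bool    # score compared against the threshold
    threshold: float    # threshold used for this call (default 0.5)

def make_result(score, threshold=0.5):
    # Assumption: is_harmful is True when score >= threshold.
    return ScoreResult(score=score, is_harmful=score >= threshold, threshold=threshold)

print(make_result(0.73))                 # flagged at the default threshold
print(make_result(0.73, threshold=0.9))  # same score, permissive threshold
```

The same score can thus flip between harmful and benign depending only on the threshold passed at call time, which is what makes the strict/permissive tuning shown in the usage section possible.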
## License

Apache-2.0.
## Citation

```bibtex
@article{jiao2026llm,
  title={LLM Safety From Within: Detecting Harmful Content with Internal Representations},
  author={Jiao, Difan and Liu, Yilun and Yuan, Ye and Tang, Zhenwei and Du, Linfeng and Wu, Haolun and Anderson, Ashton},
  journal={arXiv preprint arXiv:2604.18519},
  year={2026}
}
```
## Download files
### Source distribution: llm_siren-0.1.1.tar.gz

- Size: 11.0 kB
- Tags: Source
- Uploaded using Trusted Publishing: No
- Uploaded via: twine/6.2.0 CPython/3.11.13
File hashes:

| Algorithm | Hash digest |
|---|---|
| SHA256 | `f9ea114403eeab05ba20ed93071a2c77ad78b7f52fabe1375edb01ed2f377e91` |
| MD5 | `23cc578700cbeb2baa4c72af36600fc1` |
| BLAKE2b-256 | `4680634f8f6764812239e2a805ddf55975bbde0faad0a34098ae17d7f81128fd` |
### Built distribution: llm_siren-0.1.1-py3-none-any.whl

- Size: 11.3 kB
- Tags: Python 3
- Uploaded using Trusted Publishing: No
- Uploaded via: twine/6.2.0 CPython/3.11.13
File hashes:

| Algorithm | Hash digest |
|---|---|
| SHA256 | `c361427ada87046ca009898c8f2e20f430d9ca78b1d047b9322628f6e3df36a7` |
| MD5 | `7c87533552a5feabf2d0c2bb33a2195e` |
| BLAKE2b-256 | `a7371b6881a29084aa9d9173e268fb9686d978ad857498d808c2a081da8a2e00` |