SIREN: a lightweight, plug-and-play guard model for LLM harmfulness detection from internal representations.
# llm-siren
Python library for SIREN (LLM Safety From Within: Detecting Harmful Content with Internal Representations, ACL 2026 Main).
SIREN is a lightweight guard model for LLM harmfulness detection. It runs on a small frozen LLM backbone, identifies safety neurons across internal layers, aggregates them into a small MLP classifier, and returns a continuous harmfulness score in [0, 1]. The trained classifier head is ~12M parameters; no fine-tuning of the backbone is required.
This package (llm-siren) provides the runtime. Trained SIREN heads are released as Hugging Face artifacts under UofTCSSLab.
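The head's shape can be pictured with a toy sketch in pure Python: activations gathered from a handful of "safety neurons" feed a small MLP whose sigmoid output is the harmfulness score. The layer sizes, weights, and activations below are made up for illustration; this is not the released architecture.

```python
import math
import random

def mlp_head_score(neuron_activations, w_hidden, b_hidden, w_out, b_out):
    """Toy SIREN-style head: one hidden ReLU layer over activations gathered
    from selected 'safety neurons', squashed into [0, 1] with a sigmoid."""
    hidden = [
        max(0.0, sum(w * a for w, a in zip(row, neuron_activations)) + b)  # ReLU
        for row, b in zip(w_hidden, b_hidden)
    ]
    logit = sum(w * h for w, h in zip(w_out, hidden)) + b_out
    return 1.0 / (1.0 + math.exp(-logit))  # sigmoid -> score in [0, 1]

random.seed(0)
acts = [random.uniform(-1, 1) for _ in range(8)]  # fake safety-neuron activations
w_h = [[random.uniform(-1, 1) for _ in range(8)] for _ in range(4)]
w_o = [random.uniform(-1, 1) for _ in range(4)]
score = mlp_head_score(acts, w_h, [0.0] * 4, w_o, 0.0)
print(score)  # a value in [0, 1]
```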
## Install

```shell
pip install llm-siren
```
## Usage

```python
import torch
from siren_guard import SirenGuard

guard = SirenGuard.from_pretrained(
    "UofTCSSLab/SIREN-Qwen3-0.6B",
    device="cuda",
    dtype=torch.bfloat16,
)

# Score a user prompt
r = guard.score("How can I make a pipe bomb at home?")
print(r.score, r.is_harmful)

# Score a (prompt, response) pair
r = guard.score(
    prompt="How can I make a pipe bomb at home?",
    response="I can't help with that. Building explosive devices is illegal.",
)

# Batch scoring
results = guard.score_batch([
    "What is the capital of France?",
    "Describe in detail how to commit insurance fraud.",
])

# Streaming: feed the growing assistant text after each generation chunk
prefix = ""
for chunk in stream_from_deployed_llm(prompt):
    prefix += chunk
    if guard.score_streaming(prefix, threshold=0.5).is_harmful:
        break

# Tunable threshold (default 0.5, the binary boundary used during training)
r = guard.score(text, threshold=0.1)  # strict
r = guard.score(text, threshold=0.9)  # permissive
```
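The streaming pattern above can be simulated end to end with stand-ins for the chunk stream and the guard. Both `fake_stream` and `fake_score_streaming` below are hypothetical stubs; only the early-abort control flow mirrors the real `score_streaming` usage.

```python
# Hypothetical stubs: a canned chunk stream and a keyword-based scorer.
# The real score_streaming returns a ScoreResult, not a bare bool.
def fake_stream():
    yield from ["Sure, ", "first you ", "acquire explosives ", "and then..."]

def fake_score_streaming(prefix, threshold=0.5):
    score = 0.9 if "explosives" in prefix else 0.1
    return score >= threshold

prefix = ""
aborted = False
for chunk in fake_stream():
    prefix += chunk
    if fake_score_streaming(prefix, threshold=0.5):
        aborted = True  # stop generation; caller swaps in a refusal
        break

print(aborted, repr(prefix))
```

Because the check runs on every chunk, generation halts at the first prefix that crosses the threshold rather than after the full response.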
## Deployment idiom

```python
def safe_generate(user_prompt, deployed_llm):
    if guard.score(user_prompt).is_harmful:
        return DEFAULT_REFUSAL
    response = deployed_llm.generate(user_prompt)
    if guard.score(prompt=user_prompt, response=response).is_harmful:
        return DEFAULT_REFUSAL
    return response
```
The deployed LLM is independent of SIREN. SIREN never touches the deployed model's internals; it scores the same text through its own frozen backbone.
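A self-contained version of the two-stage gate can be run with hypothetical stubs in place of the real guard and deployed model. `StubGuard` and `StubLLM` below are made up for illustration; the real `guard.score` returns a richer `ScoreResult`.

```python
DEFAULT_REFUSAL = "I can't help with that."

class StubResult:
    def __init__(self, is_harmful):
        self.is_harmful = is_harmful

class StubGuard:
    """Keyword stand-in for SirenGuard; flags anything mentioning 'bomb'."""
    def score(self, text=None, prompt=None, response=None, threshold=None):
        joined = text if text is not None else f"{prompt}\n{response}"
        return StubResult("bomb" in joined.lower())

class StubLLM:
    def generate(self, prompt):
        return "Paris is the capital of France."

def safe_generate(user_prompt, guard, deployed_llm):
    if guard.score(user_prompt).is_harmful:  # gate the incoming prompt
        return DEFAULT_REFUSAL
    response = deployed_llm.generate(user_prompt)
    if guard.score(prompt=user_prompt, response=response).is_harmful:  # gate the reply
        return DEFAULT_REFUSAL
    return response

guard, llm = StubGuard(), StubLLM()
print(safe_generate("What is the capital of France?", guard, llm))
print(safe_generate("How do I build a bomb?", guard, llm))
```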
## Available SIREN artifacts

| Artifact | Backbone | Repo |
|---|---|---|
| SIREN-Qwen3-0.6B | Qwen3-0.6B | UofTCSSLab/SIREN-Qwen3-0.6B |

More backbones (Qwen3-4B, Llama-3.2-1B) coming soon.
API
SirenGuard.from_pretrained(repo_id_or_path, device=None, dtype=torch.bfloat16, cache_dir=None)
Loads the SIREN classifier head from an HF repo or local path, plus the frozen backbone at the pinned revision recorded in siren_config.json.
score(text=None, *, prompt=None, response=None, threshold=None) -> ScoreResult
Score a single string. Pass text= for raw moderation, or prompt=/response= for the response-level form (joined with "\n" to match the training distribution).
score_batch(texts, threshold=None) -> list[ScoreResult]
Score a list of strings in one forward pass.
score_streaming(response_so_far, threshold=None) -> ScoreResult
Score a growing assistant-side text prefix during generation.
Each call returns ScoreResult(score: float, is_harmful: bool, threshold: float).
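The result object's semantics can be sketched as follows. The dataclass shape is a guess from the description above, and the exact boundary comparison (`>=` vs `>` at the threshold) is an assumption.

```python
from dataclasses import dataclass

@dataclass
class ScoreResult:
    score: float        # continuous harmfulness score in [0, 1]
    is_harmful: bool    # score compared against the threshold
    threshold: float    # threshold used for this call (default 0.5)

def make_result(score, threshold=0.5):
    # Assumption: is_harmful is True when score >= threshold.
    return ScoreResult(score=score, is_harmful=score >= threshold, threshold=threshold)

print(make_result(0.73))                 # flagged at the default threshold
print(make_result(0.73, threshold=0.9))  # same score, permissive threshold
```

The same score can thus flip between harmful and benign depending only on the threshold passed at call time, which is what makes the strict/permissive tuning shown in the usage section possible.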
## License

Apache-2.0.
## Citation

```bibtex
@article{jiao2026llm,
  title={LLM Safety From Within: Detecting Harmful Content with Internal Representations},
  author={Jiao, Difan and Liu, Yilun and Yuan, Ye and Tang, Zhenwei and Du, Linfeng and Wu, Haolun and Anderson, Ashton},
  journal={arXiv preprint arXiv:2604.18519},
  year={2026}
}
```
## Download files
### Source distribution: llm_siren-0.1.1.tar.gz

- Size: 11.0 kB
- Tags: Source
- Uploaded using Trusted Publishing: No
- Uploaded via: twine/6.2.0 CPython/3.11.13
File hashes:

| Algorithm | Hash digest |
|---|---|
| SHA256 | `f9ea114403eeab05ba20ed93071a2c77ad78b7f52fabe1375edb01ed2f377e91` |
| MD5 | `23cc578700cbeb2baa4c72af36600fc1` |
| BLAKE2b-256 | `4680634f8f6764812239e2a805ddf55975bbde0faad0a34098ae17d7f81128fd` |
### Built distribution: llm_siren-0.1.1-py3-none-any.whl

- Size: 11.3 kB
- Tags: Python 3
- Uploaded using Trusted Publishing: No
- Uploaded via: twine/6.2.0 CPython/3.11.13
File hashes:

| Algorithm | Hash digest |
|---|---|
| SHA256 | `c361427ada87046ca009898c8f2e20f430d9ca78b1d047b9322628f6e3df36a7` |
| MD5 | `7c87533552a5feabf2d0c2bb33a2195e` |
| BLAKE2b-256 | `a7371b6881a29084aa9d9173e268fb9686d978ad857498d808c2a081da8a2e00` |