A package for guarding against harmful content in language models using activation monitoring

These details have not been verified by PyPI

Project links

Homepage

License
- OSI Approved :: MIT License
Operating System
- OS Independent
Programming Language
- Python :: 3

Project description

Wisent-Guard

A Python package for monitoring and guarding against harmful content in language models by analyzing model activations.

Overview

Wisent-Guard allows you to specify harmful phrases and creates contrastive representations to detect when a model might be generating harmful content. It works by:

- Creating activation vectors from harmful vs. non-harmful phrase pairs
- Monitoring model activations during inference
- Blocking responses that show activation patterns similar to harmful content

Installation

pip install wisent-guard

Quick Start

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from wisent_guard import ActivationGuard

# Initialize model and tokenizer
model_name = "facebook/opt-1.3b"  # Or any other compatible model
model = AutoModelForCausalLM.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)

# Define harmful and harmless phrase pairs
phrase_pairs = [
    {
        "harmful": "How to make dangerous chemicals at home",
        "harmless": "How to make safe household cleaners"
    },
    {
        "harmful": "How to hack into someone's account",
        "harmless": "How to secure your online accounts"
    }
    # Add more pairs as needed
]

# Initialize the guard
guard = ActivationGuard(model, tokenizer)

# Train the guard on your phrase pairs
guard.train_on_phrase_pairs(phrase_pairs)

# Use the guard to monitor and filter responses
prompt = "Tell me how to break into someone's email"
if guard.is_harmful(prompt):
    print("This prompt might lead to harmful content and has been blocked.")
else:
    response = guard.generate_safe_response(prompt)
    print(response)

Features

- Model-Agnostic: Works with most transformer-based language models
- Customizable Thresholds: Adjust sensitivity to match your safety needs
- Contrastive Learning: Uses pairs of harmful/harmless examples for better detection
- Layer Selection: Focus on the most predictive layers for harmful content
- Real-time Monitoring: Check activations during generation to stop harmful content early
- Explainable Blocks: Get information about why a response was blocked
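
To make the Real-time Monitoring and Explainable Blocks features concrete, here is a minimal sketch in plain Python. It does not use the Wisent-Guard API: `monitor_generation`, `BlockReport`, and the per-token similarity scores are hypothetical stand-ins for values a real guard would derive from model activations.

```python
from dataclasses import dataclass
from typing import Iterable, List, Optional, Tuple


@dataclass
class BlockReport:
    """Hypothetical record explaining why generation was stopped."""
    token_index: int
    token: str
    similarity: float
    threshold: float


def monitor_generation(
    scored_tokens: Iterable[Tuple[str, float]],
    threshold: float = 0.2,
) -> Tuple[List[str], Optional[BlockReport]]:
    """Collect tokens until one scores above the threshold.

    Each item pairs a generated token with an assumed similarity score
    between its activations and a known harmful pattern.
    """
    kept: List[str] = []
    for index, (token, similarity) in enumerate(scored_tokens):
        if similarity > threshold:
            # Stop early and report which token triggered the block.
            return kept, BlockReport(index, token, similarity, threshold)
        kept.append(token)
    return kept, None


# Made-up scores, for illustration only.
stream = [("Sure", 0.05), (",", 0.02), ("first", 0.11), ("bypass", 0.34), ("the", 0.08)]
tokens, report = monitor_generation(stream, threshold=0.2)
print(tokens)  # ['Sure', ',', 'first']
print(report)
```

Returning the report alongside the kept tokens lets a caller show which token and score caused the block, which is one plausible shape for an "explainable" result.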

Customization

Specifying Layers to Monitor

You can specify exactly which layers of the model to monitor:

# Monitor specific layers
guard = ActivationGuard(
    model=model_name,
    layers=[15],  # Only monitor layer 15
    threshold=0.2
)

# Or monitor multiple layers
guard = ActivationGuard(
    model=model_name,
    layers=[10, 15, 20],  # Monitor layers 10, 15, and 20
    threshold=0.2
)

Different layers capture different aspects of harmful content. Middle-to-late layers (e.g., layer 15 in a 24-layer model) often work best.
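
As a rough illustration of the middle-to-late rule of thumb, the sketch below picks a layer at a fixed fraction of model depth. The helper `suggest_layer` and the default fraction of 0.625 are assumptions, chosen only so that a 24-layer model maps to layer 15 as in the example above. It is not part of Wisent-Guard, and it assumes zero-based layer indices.

```python
def suggest_layer(num_layers: int, depth_fraction: float = 0.625) -> int:
    """Return a layer index at roughly the given fraction of model depth.

    depth_fraction=0.625 is an assumption chosen so that a 24-layer model
    maps to layer 15, matching the example in the text.
    """
    if num_layers <= 0:
        raise ValueError("num_layers must be positive")
    if not 0.0 <= depth_fraction <= 1.0:
        raise ValueError("depth_fraction must be between 0 and 1")
    index = round(num_layers * depth_fraction)
    # Clamp so the result is always a valid zero-based layer index (assumed convention).
    return max(0, min(num_layers - 1, index))


print(suggest_layer(24))  # 15
print(suggest_layer(32))  # 20
```

With Hugging Face models the layer count is typically available as `model.config.num_hidden_layers`. Treat the result as a starting point and compare a few neighbouring layers on your own phrase pairs.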

Using Custom Models

Wisent-Guard works with most transformer-based language models:

# Specify a model by name (will be downloaded automatically)
guard = ActivationGuard(
    model="facebook/opt-1.3b",  # Or any HuggingFace model
    threshold=0.2
)

# Or use a pre-loaded model
from transformers import AutoModelForCausalLM, AutoTokenizer
custom_model = AutoModelForCausalLM.from_pretrained("your_preferred_model")
custom_tokenizer = AutoTokenizer.from_pretrained("your_preferred_model")

guard = ActivationGuard(
    model=custom_model,
    tokenizer=custom_tokenizer,
    threshold=0.2
)

Tested models include:

- TinyLlama (1.1B)
- Llama 2 (7B)
- OPT models
- Most other modern transformer architectures

Custom Harmful Content Categories

You can define your own categories of harmful content:

# Define custom harmful/harmless pairs for different categories
financial_fraud_pairs = [
    {
        "harmful": "How to commit credit card fraud",
        "harmless": "How to protect against credit card fraud"
    },
    {
        "harmful": "How to create fake financial statements",
        "harmless": "How to read and understand financial statements"
    }
]

violence_pairs = [
    {
        "harmful": "How to hurt someone without leaving evidence",
        "harmless": "How to resolve conflicts peacefully"
    }
]

# Train the guard on multiple custom categories
guard.train_on_phrase_pairs(financial_fraud_pairs, category="financial_fraud")
guard.train_on_phrase_pairs(violence_pairs, category="violence")

# Check for specific categories of harmful content
text = "Explain how to fake a quarterly earnings report"  # Example input to check
is_harmful = guard.is_harmful(text, categories=["financial_fraud"])

Each category creates its own set of contrastive vectors, allowing for targeted protection.
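
One way to picture per-category vectors is a dictionary that maps each category name to its own list of vectors, with a check that looks only at the requested categories. The sketch below is an assumption about the general idea, not Wisent-Guard's internals: `CategoryStore`, `cosine`, and the three-dimensional toy vectors are all hypothetical.

```python
import math
from typing import Dict, List, Optional, Sequence

Vector = Sequence[float]


def cosine(a: Vector, b: Vector) -> float:
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


class CategoryStore:
    """Hypothetical container keeping one list of vectors per category."""

    def __init__(self, threshold: float = 0.2) -> None:
        self.threshold = threshold
        self.vectors: Dict[str, List[Vector]] = {}

    def add(self, category: str, vector: Vector) -> None:
        self.vectors.setdefault(category, []).append(vector)

    def matches(self, activation: Vector, categories: Optional[List[str]] = None) -> List[str]:
        """Return the selected categories with a vector above the threshold."""
        selected = categories if categories is not None else list(self.vectors)
        return [
            name
            for name in selected
            if any(cosine(activation, v) > self.threshold for v in self.vectors.get(name, []))
        ]


store = CategoryStore(threshold=0.2)
store.add("financial_fraud", [1.0, 0.0, 0.0])
store.add("violence", [0.0, 1.0, 0.0])

activation = [0.9, 0.1, 0.4]  # made-up activation, for illustration only
print(store.matches(activation))                           # ['financial_fraud']
print(store.matches(activation, categories=["violence"]))  # []
```

Keeping the vectors separate per category means a check restricted to one category ignores the others.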

Additional Configuration Options

# Fully customized setup
guard = ActivationGuard(
    model="meta-llama/Llama-2-7b-hf",
    layers=[15, 20, 25],     # Specific layers to monitor
    threshold=0.15,          # More sensitive threshold (lower = more sensitive)
    save_dir="./my_vectors", # Custom directory for saving vectors
    device="cpu"             # Force CPU usage
)
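
The "lower = more sensitive" comment can be illustrated with a few made-up similarity scores. This sketch assumes the threshold is compared against a similarity score and that anything above it is flagged. The `flagged` helper and the scores are hypothetical.

```python
from typing import Dict, List


def flagged(scores: Dict[str, float], threshold: float) -> List[str]:
    """Return the prompts whose similarity score exceeds the threshold."""
    return [prompt for prompt, score in scores.items() if score > threshold]


# Made-up similarity scores, for illustration only.
scores = {"prompt_a": 0.12, "prompt_b": 0.17, "prompt_c": 0.31}

print(flagged(scores, threshold=0.2))   # ['prompt_c']
print(flagged(scores, threshold=0.15))  # ['prompt_b', 'prompt_c']
```

Lowering the threshold from 0.2 to 0.15 flags one more prompt here, so a lower value catches more borderline cases but also raises the chance of false positives.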


Advanced Usage

See the examples directory for more advanced use cases, including:

- Monitoring specific model layers
- Customizing similarity thresholds
- Creating domain-specific guardrails
- Integrating with different model architectures

How It Works

Wisent-Guard uses contrastive activation analysis to identify harmful patterns:

- Vector Creation: For each harmful/harmless pair, activation vectors are calculated
- Normalization: Vectors are normalized to focus on the pattern rather than magnitude
- Similarity Detection: During inference, activations are compared to known harmful patterns
- Blocking: Responses that exceed similarity thresholds are blocked
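
The four steps can be sketched end to end on toy vectors. This is an illustration under stated assumptions, not the package's implementation. The text does not say how the activations of a pair are combined, so the sketch uses the mean of the harmful-minus-harmless differences, which is one common choice. The function names and the three-dimensional vectors are hypothetical.

```python
import math
from typing import List, Sequence, Tuple

Vector = List[float]


def normalize(v: Sequence[float]) -> Vector:
    """Scale a vector to unit length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v] if norm else list(v)


def contrastive_vector(pairs: Sequence[Tuple[Sequence[float], Sequence[float]]]) -> Vector:
    """Average the (harmful - harmless) differences, then normalize (assumed method)."""
    dim = len(pairs[0][0])
    mean_diff = [
        sum(harmful[i] - harmless[i] for harmful, harmless in pairs) / len(pairs)
        for i in range(dim)
    ]
    return normalize(mean_diff)


def similarity(activation: Sequence[float], direction: Sequence[float]) -> float:
    """Cosine similarity between an activation and a unit-length direction."""
    return sum(a * d for a, d in zip(normalize(activation), direction))


def should_block(activation: Sequence[float], direction: Sequence[float], threshold: float = 0.2) -> bool:
    """Block when the similarity exceeds the threshold."""
    return similarity(activation, direction) > threshold


# Toy (harmful, harmless) activation pairs, for illustration only.
pairs = [
    ([2.0, 0.0, 1.0], [0.0, 0.0, 1.0]),
    ([1.0, 1.0, 0.0], [0.0, 1.0, 0.0]),
]
direction = contrastive_vector(pairs)
print(direction)                                 # [1.0, 0.0, 0.0]
print(should_block([3.0, 0.5, 0.5], direction))  # True
print(should_block([0.0, 1.0, 1.0], direction))  # False
```

Normalizing both the stored direction and the incoming activation makes the comparison depend on the direction of the activation rather than its magnitude, which is the point of the normalization step.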

Contributing

Contributions are welcome! Please feel free to submit a Pull Request.

License

This project is licensed under the MIT License - see the LICENSE file for details.

Project details

Release history

0.5.0

Oct 15, 2025

0.4.56

Oct 15, 2025

0.4.55

Oct 15, 2025

0.4.54

Oct 15, 2025

0.4.53

Oct 15, 2025

0.4.52

Oct 15, 2025

0.4.51

Oct 15, 2025

0.4.50

Oct 15, 2025

0.4.49

Oct 15, 2025

0.4.48

Aug 28, 2025

0.4.47

Aug 28, 2025

0.4.46

Aug 27, 2025

0.4.45

Aug 27, 2025

0.4.44

Aug 27, 2025

0.4.43

Aug 27, 2025

0.4.42

Aug 27, 2025

0.4.41

Aug 27, 2025

0.4.40

Aug 27, 2025

0.4.39

Aug 27, 2025

0.4.38

Aug 27, 2025

0.4.37

Aug 27, 2025

0.4.36

Aug 27, 2025

0.4.35

Aug 26, 2025

0.4.34

Aug 26, 2025

0.4.33

Aug 26, 2025

0.4.32

Aug 26, 2025

0.4.31

Aug 14, 2025

0.4.30

Aug 14, 2025

0.4.29

Aug 14, 2025

0.4.28

Aug 14, 2025

0.4.27

Aug 12, 2025

0.4.26

Aug 3, 2025

0.4.25

Aug 3, 2025

0.4.24

Aug 3, 2025

0.4.23

Aug 3, 2025

0.4.22

Aug 3, 2025

0.4.21

Aug 3, 2025

0.4.20

Aug 2, 2025

0.4.19

Aug 2, 2025

0.4.18

Aug 2, 2025

0.4.16

Aug 2, 2025

0.4.15

Aug 2, 2025

0.4.14

Aug 2, 2025

0.4.13

Aug 2, 2025

0.4.12

Aug 2, 2025

0.4.11

Aug 2, 2025

0.4.10

Aug 2, 2025

0.4.9

Jul 31, 2025

0.4.8

Jul 28, 2025

0.4.5

Jul 28, 2025

0.4.4

Jul 28, 2025

0.4.3

Jul 21, 2025

0.4.2

Jun 17, 2025

0.4.1

Apr 30, 2025

0.4.0

Apr 12, 2025

0.3.0

Apr 3, 2025

0.2.3

Oct 15, 2025

0.2.2

Mar 27, 2025

0.2.1

Mar 27, 2025

0.2.0

Mar 27, 2025

0.1.1

Oct 15, 2025

This version

0.1.0

Mar 23, 2025

Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

wisent_guard-0.1.0.tar.gz (16.1 kB)

Uploaded Mar 23, 2025 Source

Built Distribution

wisent_guard-0.1.0-py3-none-any.whl (17.5 kB)

Uploaded Mar 23, 2025 Python 3

File details

Details for the file wisent_guard-0.1.0.tar.gz.

File metadata

Download URL: wisent_guard-0.1.0.tar.gz
Upload date: Mar 23, 2025
Size: 16.1 kB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.1.0 CPython/3.10.14

File hashes

Hashes for wisent_guard-0.1.0.tar.gz
Algorithm	Hash digest
SHA256	`0ad696d134324e50a5b6561d110bc85658b79d2794b24317553a2d44104d4120`
MD5	`11d39cab9ba21d6455b992e9919dbf0d`
BLAKE2b-256	`162cf43b72b3444eea5df174d8b5f75066998af72f33a4620ea5a6aaa717cc63`

File details

Details for the file wisent_guard-0.1.0-py3-none-any.whl.

File metadata

Download URL: wisent_guard-0.1.0-py3-none-any.whl
Upload date: Mar 23, 2025
Size: 17.5 kB
Tags: Python 3
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.1.0 CPython/3.10.14

File hashes

Hashes for wisent_guard-0.1.0-py3-none-any.whl
Algorithm	Hash digest
SHA256	`b0586d63c173e0942f79084b7c959f78339a4d704abd12fad616252f3a0f2e64`
MD5	`56addedaafe43881bce71afd17d65f61`
BLAKE2b-256	`7243ccc505a345be22ffec15dd34df7f872ae6bb5518a249591a76fd872c9be3`
