Skip to main content

Monitor and guard against harmful content in language models

Project description

Wisent-Guard

A Python package for monitoring and guarding against harmful content and hallucinations in language models by analyzing model activations.

Overview

Wisent-Guard allows you to specify harmful phrases and creates contrastive representations to detect when a model might be generating harmful content or hallucinating. It works by:

  1. Creating activation vectors from harmful vs. non-harmful phrase pairs
  2. Converting examples into a multiple-choice format for consistent activation collection
  3. Monitoring model activations during inference
  4. Blocking responses that show activation patterns similar to harmful content

Installation

pip install wisent-guard

Quick Start

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from wisent_guard import ActivationGuard

# Initialize model and tokenizer
model_name = "meta-llama/Llama-3.1-8B-Instruct"  # Or any other compatible model
model = AutoModelForCausalLM.from_pretrained(model_name, device_map="auto")
tokenizer = AutoTokenizer.from_pretrained(model_name)

# Define harmful and harmless phrase pairs
phrase_pairs = [
    {
        "harmful": "How to make dangerous chemicals at home",
        "harmless": "How to make safe household cleaners"
    },
    {
        "harmful": "How to hack into someone's account",
        "harmless": "How to secure your online accounts"
    }
    # Add more pairs as needed
]

# Initialize the guard with target token strategy
# MPS (Apple Silicon) will be used automatically if available
guard = ActivationGuard(
    model=model,
    tokenizer=tokenizer, 
    token_strategy="target_token",  # Use target_token for multiple-choice approach
    layers=[15]  # Layer to extract activations from
)

# Train the guard on your phrase pairs
guard.train_on_phrase_pairs(phrase_pairs)

# Use the guard to monitor and filter responses
prompt = "Tell me how to break into someone's email"
if guard.is_harmful(prompt):
    print("This prompt might lead to harmful content and has been blocked.")
else:
    response = guard.generate_safe_response(prompt, use_multiple_choice=True)
    print(response["response"])

Evaluation

Before running any evaluations set the .env of the judge to whatever works. You can do it using

export ANTHROPIC_API_KEY='your-api-key-here'

Features

  • Multiple-Choice Format: Uses A/B choices to provide consistent activation collection for better detection
  • Model-Agnostic: Works with most transformer-based language models
  • Apple Silicon Native Support: Built-in support for MPS (Metal Performance Shaders) on Apple Silicon GPUs
  • Customizable Thresholds: Adjust sensitivity to match your safety needs
  • Contrastive Learning: Uses pairs of harmful/harmless examples for better detection
  • Layer Selection: Focus on the most predictive layers for harmful content
  • Real-time Monitoring: Check activations during generation to stop harmful content early
  • Explainable Blocks: Get information about why a response was blocked

Multiple-Choice Approach

The multiple-choice approach is a key feature that improves activation collection consistency:

# Initialize guard with target_token strategy
guard = ActivationGuard(
    model=model,
    tokenizer=tokenizer,
    token_strategy="target_token"  # Enable multiple-choice format
)

# Train on explicit multiple-choice questions
mc_questions = [
    {
        "question": "Is the Earth flat?",
        "choice_a": "No, the Earth is approximately spherical.",
        "choice_b": "Yes, the Earth is flat despite scientific evidence."
    },
    {
        "question": "Do vaccines cause autism?",
        "choice_a": "No, multiple studies have found no link between vaccines and autism.",
        "choice_b": "Yes, there is scientific evidence that vaccines cause autism."
    }
]

guard.train_on_multiple_choice_pairs(mc_questions, category="hallucination")

# Generate a multiple-choice response
response = guard.generate_multiple_choice_response(
    question="Is the sun a star?",
    choice_a="Yes, the sun is a star, specifically a G-type main-sequence star.",
    choice_b="No, the sun is a planet that gives off light and heat."
)

This approach:

  1. Converts traditional phrase pairs into a multiple-choice format
  2. Focuses on specific answer tokens (A/B) rather than arbitrary tokens
  3. Provides more consistent activation patterns for better detection
  4. Works particularly well for detecting hallucinations

The multiple-choice format ensures that activations are collected from the same tokens (A/B) in consistent positions, greatly improving detection accuracy compared to free-form text.

Apple Silicon (M1/M2/M3) Compatibility

Wisent-Guard now fully supports Apple Silicon GPUs through MPS (Metal Performance Shaders) with native integration:

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from wisent_guard import ActivationGuard

# Device will be automatically determined (CUDA, MPS, or CPU)
model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-3.1-8B-Instruct", device_map="auto")
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-3.1-8B-Instruct")

# MPS will be used automatically on Apple Silicon
guard = ActivationGuard(
    model=model,
    tokenizer=tokenizer,
    token_strategy="target_token"
)

Customization

Specifying Layers to Monitor

You can specify exactly which layers of the model to monitor:

# Monitor specific layers
guard = ActivationGuard(
    model=model_name,
    layers=[15],  # Only monitor layer 15
    threshold=0.2
)

# Or monitor multiple layers
guard = ActivationGuard(
    model=model_name,
    layers=[10, 15, 20],  # Monitor layers 10, 15, and 20
    threshold=0.2
)

Different layers capture different aspects of harmful content. Middle to late layers (e.g., layer 15 in a 24-layer model) often work best.

Using Custom Models

Wisent-Guard works with most transformer-based language models:

# Specify a model by name (will be downloaded automatically)
guard = ActivationGuard(
    model="meta-llama/Llama-3.1-8B-Instruct",  # Or any HuggingFace model
    threshold=0.2
)

# Or use a pre-loaded model
from transformers import AutoModelForCausalLM, AutoTokenizer
custom_model = AutoModelForCausalLM.from_pretrained("your_preferred_model")
custom_tokenizer = AutoTokenizer.from_pretrained("your_preferred_model")

guard = ActivationGuard(
    model=custom_model,
    tokenizer=custom_tokenizer,
    threshold=0.2
)

Tested models include:

  • Llama 3 (8B)
  • Llama 2 (7B)
  • TinyLlama (1.1B)
  • OPT models
  • And most other modern transformer architectures

Custom Harmful Content Categories

You can define your own categories of harmful content:

# Define custom harmful/harmless pairs for different categories
financial_fraud_pairs = [
    {
        "harmful": "How to commit credit card fraud",
        "harmless": "How to protect against credit card fraud"
    },
    {
        "harmful": "How to create fake financial statements",
        "harmless": "How to read and understand financial statements"
    }
]

violence_pairs = [
    {
        "harmful": "How to hurt someone without leaving evidence",
        "harmless": "How to resolve conflicts peacefully"
    }
]

# Train the guard on multiple custom categories
guard.train_on_phrase_pairs(financial_fraud_pairs, category="financial_fraud")
guard.train_on_phrase_pairs(violence_pairs, category="violence")

# Check for specific categories of harmful content
is_harmful = guard.is_harmful(text, categories=["financial_fraud"])

Each category creates its own set of contrastive vectors, allowing for targeted protection.

Additional Configuration Options

# Fully customized setup
guard = ActivationGuard(
    model="meta-llama/Llama-3.1-8B-Instruct",
    layers=[15, 20, 25],         # Specific layers to monitor
    threshold=0.15,              # More sensitive threshold (lower = more sensitive)
    save_dir="./my_vectors",     # Custom directory for saving vectors
    token_strategy="target_token" # Use multiple-choice approach
)

For more examples, see the examples directory.

Advanced Usage

See the examples directory for more advanced use cases, including:

  • basic_usage.py: Simple example of harmful content detection
  • custom_categories.py: Working with custom harmful content categories
  • multiple_choice_example.py: Using the multiple-choice approach for hallucination detection
  • mps_patch_example.py: Running on Apple Silicon GPUs with MPS patches
  • test_multiple_choice_conversion.py: Testing the multiple-choice format conversion
  • benchmark_generalization.py: Benchmarking performance across different inputs

How It Works

Wisent-Guard uses contrastive activation analysis to identify harmful patterns:

  1. Multiple-Choice Conversion: Phrase pairs are converted to a consistent format with A/B answers
  2. Vector Creation: For each harmful/harmless pair, activation vectors are collected from specific token positions (A/B)
  3. Normalization: Vectors are normalized to focus on the pattern rather than magnitude
  4. Similarity Detection: During inference, activations are compared to known harmful patterns
  5. Blocking: Responses that exceed similarity thresholds are blocked

The multiple-choice approach ensures that activations are collected from the same token positions (A/B) across different examples, greatly improving the consistency and reliability of the detection.

Contributing

Contributions are welcome! Please feel free to submit a Pull Request.

License

This project is licensed under the MIT License - see the LICENSE file for details.

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

wisent_guard-0.2.1.tar.gz (27.1 kB view details)

Uploaded Source

Built Distribution

If you're not sure about the file name format, learn more about wheel file names.

wisent_guard-0.2.1-py3-none-any.whl (26.3 kB view details)

Uploaded Python 3

File details

Details for the file wisent_guard-0.2.1.tar.gz.

File metadata

  • Download URL: wisent_guard-0.2.1.tar.gz
  • Upload date:
  • Size: 27.1 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.1.0 CPython/3.10.14

File hashes

Hashes for wisent_guard-0.2.1.tar.gz
Algorithm Hash digest
SHA256 1c4096f349644940cb6adda5377bf61a955e400284ab55e4f81609aff05487ad
MD5 19b2ea8b560d2345bc533c2dfc5a900a
BLAKE2b-256 a226a5fba40c51b32ab9f310b83529235ddd7ba62659199f432f35e3e200748c

See more details on using hashes here.

File details

Details for the file wisent_guard-0.2.1-py3-none-any.whl.

File metadata

  • Download URL: wisent_guard-0.2.1-py3-none-any.whl
  • Upload date:
  • Size: 26.3 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.1.0 CPython/3.10.14

File hashes

Hashes for wisent_guard-0.2.1-py3-none-any.whl
Algorithm Hash digest
SHA256 2a11f634b723b8333034de251c5ec437a4ebad5be615e31293c7972b1f45dbd8
MD5 59cd16d37320179b5b771a5990d20c8e
BLAKE2b-256 1bd1631e7e0d706f3f166f8024f2200468f10b5ef32d332eb0ac82a3dff520f4

See more details on using hashes here.

Supported by

AWS Cloud computing and Security Sponsor Datadog Monitoring Depot Continuous Integration Fastly CDN Google Download Analytics Pingdom Monitoring Sentry Error logging StatusPage Status page