rotalabs-steer
Control agent behaviors through activation steering. Apply steering vectors to LLMs at inference time without retraining.
Overview
rotalabs-steer provides tools for extracting and applying steering vectors to control LLM agent behaviors at inference time. Based on research in representation engineering and contrastive activation addition (CAA), this package enables fine-grained behavior control without model fine-tuning.
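As a rough illustration of the idea behind contrastive activation addition, a steering vector for one layer can be computed as the mean activation on examples that show a behavior minus the mean activation on examples that do not. The sketch below uses plain Python lists and made-up numbers in place of real hidden states. `mean_vector` and `mean_difference_vector` are hypothetical helpers for illustration, not part of the rotalabs-steer API.

```python
from typing import List, Sequence

Vector = List[float]


def mean_vector(rows: Sequence[Sequence[float]]) -> Vector:
    """Element-wise mean of equally sized activation rows."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]


def mean_difference_vector(
    positive: Sequence[Sequence[float]],
    negative: Sequence[Sequence[float]],
) -> Vector:
    """Mean positive activation minus mean negative activation."""
    pos_mean = mean_vector(positive)
    neg_mean = mean_vector(negative)
    return [p - n for p, n in zip(pos_mean, neg_mean)]


# Toy 3-dimensional "activations" (illustrative values only).
positive_acts = [[1.0, 2.0, 0.0], [3.0, 2.0, 0.0]]  # behavior present
negative_acts = [[0.0, 2.0, 1.0], [0.0, 2.0, 3.0]]  # behavior absent

steering = mean_difference_vector(positive_acts, negative_acts)
print(steering)  # [2.0, 0.0, -2.0]
```

Dimensions where the two groups agree cancel out (the middle component here), so the resulting vector points only along the directions that separate the contrast pairs.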
Key Features
- Behavior Control: Adjust model behaviors like refusal, uncertainty expression, tool use restraint, and instruction hierarchy following
- No Retraining Required: Apply steering at inference time through activation manipulation
- Pre-extracted Vectors: Ready-to-use vectors on HuggingFace
- LangChain Integration: Use with LangChain agents and chains (optional dependency)
- Pre-built Datasets: Includes contrast pair datasets for common behaviors
- Evaluation Tools: Measure steering effectiveness and analyze tradeoffs
Installation
Basic Installation
pip install rotalabs-steer
With Optional Dependencies
# LangChain integration
pip install "rotalabs-steer[langchain]"
# LLM-based evaluation (requires Anthropic API key)
pip install "rotalabs-steer[judge]"
# Visualization tools
pip install "rotalabs-steer[viz]"
# All optional dependencies
pip install "rotalabs-steer[all]"
# Development dependencies
pip install "rotalabs-steer[dev]"
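To confirm that an optional extra actually landed in the current environment, one option is to probe for its modules without importing them. This is a minimal sketch using only the standard library. The module names are assumptions: `langchain_core` appears in the LangChain example in this README, while `anthropic` is a guess based on the "Anthropic API key" note for the judge extra.

```python
from importlib.util import find_spec


def is_installed(module_name: str) -> bool:
    """Return True if a top-level module can be located without importing it."""
    return find_spec(module_name) is not None


# Assumed mapping from extra name to a module it is expected to provide.
assumed_modules = {"langchain": "langchain_core", "judge": "anthropic"}

for extra, module in assumed_modules.items():
    status = "available" if is_installed(module) else "missing"
    print(f"[{extra}] {module}: {status}")
```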
Quick Start
Use Pre-extracted Vectors (Easiest)
from huggingface_hub import hf_hub_download
from rotalabs_steer import SteeringVector, ActivationInjector
# Download pre-extracted vector from HuggingFace
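vector_path = hf_hub_download(
    repo_id="rotalabs/steering-vectors",
    filename="refusal_qwen3_8b/layer_15.pt",
)
# Download the matching JSON file; it lands next to the .pt file in the cache
hf_hub_download(
    repo_id="rotalabs/steering-vectors",
    filename="refusal_qwen3_8b/layer_15.json",
)
# Load and apply (the path is passed to load() without the .pt extension)
# `model` and `inputs` are assumed to be an already loaded model and a tokenized prompt
vector = SteeringVector.load(vector_path.replace('.pt', ''))
injector = ActivationInjector(model, [vector], strength=1.0)
with injector:
    outputs = model.generate(**inputs)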
Available vectors: refusal, uncertainty, tool_restraint, and hierarchy, for Qwen3-8B, Mistral-7B, and Gemma-2-9B. See the rotalabs/steering-vectors repository on HuggingFace for the full list.
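When downloading several vectors, it can help to build the filenames programmatically. The sketch below assumes every vector follows the `<behavior>_<model>/layer_<N>` layout seen in the single example above. That layout, and any model slug other than `qwen3_8b`, is an assumption, so check the HuggingFace repository for the actual file names. `vector_filenames` and `load_prefix` are hypothetical helpers. `load_prefix` strips only a trailing `.pt`, which avoids touching a `.pt` that happens to appear elsewhere in a cache path, something `str.replace` would also rewrite.

```python
import os
from typing import Tuple


def vector_filenames(behavior: str, model_slug: str, layer: int) -> Tuple[str, str]:
    """Build the .pt and .json filenames for one layer (assumed layout)."""
    stem = f"{behavior}_{model_slug}/layer_{layer}"
    return f"{stem}.pt", f"{stem}.json"


def load_prefix(local_pt_path: str) -> str:
    """Drop a trailing .pt extension and leave the rest of the path untouched."""
    root, ext = os.path.splitext(local_pt_path)
    return root if ext == ".pt" else local_pt_path


pt_name, json_name = vector_filenames("refusal", "qwen3_8b", 15)
print(pt_name)    # refusal_qwen3_8b/layer_15.pt
print(json_name)  # refusal_qwen3_8b/layer_15.json
print(load_prefix("/cache/refusal_qwen3_8b/layer_15.pt"))  # /cache/refusal_qwen3_8b/layer_15
```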
Extract a Steering Vector
from transformers import AutoModelForCausalLM, AutoTokenizer
from rotalabs_steer import SteeringVector, SteeringVectorSet
from rotalabs_steer.extraction import extract_caa_vectors
from rotalabs_steer.datasets import load_refusal_pairs
# Load model
model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen3-8B", device_map="auto")
tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen3-8B")
# Load contrast pairs
refusal_pairs = load_refusal_pairs()
# Extract steering vectors from multiple layers
vectors = extract_caa_vectors(
    model=model,
    tokenizer=tokenizer,
    contrast_pairs=refusal_pairs,
    layer_indices=[14, 15, 16],
)
# Save for later use
vectors.save("./refusal_vectors")
Apply Steering at Inference
from rotalabs_steer import ActivationInjector, SteeringVector
# Load pre-extracted vector
vector = SteeringVector.load("./refusal_vectors/layer_15")
# Create injector
injector = ActivationInjector(model, [vector], strength=1.0)
# Generate with steering
with injector:
    # Move the tokenized prompt to the model's device before generating
    inputs = tokenizer("How do I hack a computer?", return_tensors="pt").to(model.device)
    outputs = model.generate(**inputs, max_new_tokens=100)
response = tokenizer.decode(outputs[0], skip_special_tokens=True)
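The `with injector:` pattern suggests that steering is attached on entry and detached on exit, so generation outside the block is unaffected. The toy sketch below shows that pattern with plain Python objects: a stand-in layer whose output gets `strength * vector` added while the context is active. `ToyLayer` and `ToyInjector` are hypothetical and say nothing about how `ActivationInjector` is implemented internally. In PyTorch, this kind of behavior is commonly built on `torch.nn.Module.register_forward_hook`.

```python
from typing import Callable, List

Vector = List[float]


class ToyLayer:
    """Stand-in for a transformer layer: returns its input unless hooks modify it."""

    def __init__(self) -> None:
        self.post_hooks: List[Callable[[Vector], Vector]] = []

    def __call__(self, hidden: Vector) -> Vector:
        out = list(hidden)
        for hook in self.post_hooks:
            out = hook(out)
        return out


class ToyInjector:
    """Adds strength * vector to a layer's output while the context is active."""

    def __init__(self, layer: ToyLayer, vector: Vector, strength: float = 1.0) -> None:
        self.layer = layer
        self.vector = vector
        self.strength = strength

    def _hook(self, hidden: Vector) -> Vector:
        return [h + self.strength * v for h, v in zip(hidden, self.vector)]

    def __enter__(self) -> "ToyInjector":
        self.layer.post_hooks.append(self._hook)
        return self

    def __exit__(self, exc_type, exc, tb) -> None:
        # Always detach, even on error, so the layer returns to unsteered behavior.
        self.layer.post_hooks.remove(self._hook)


layer = ToyLayer()
hidden = [1.0, 1.0, 1.0]

with ToyInjector(layer, vector=[2.0, 0.0, -2.0], strength=0.5):
    steered = layer(hidden)
unsteered = layer(hidden)

print(steered)    # [2.0, 1.0, 0.0]
print(unsteered)  # [1.0, 1.0, 1.0]
```

Scaling by `strength` is what makes a single extracted vector reusable: the same direction can be applied gently or forcefully without re-extraction.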
Use with LangChain
from rotalabs_steer.integrations.langchain import SteeredChatModel
from langchain_core.messages import HumanMessage, SystemMessage
# Create steered chat model
chat = SteeredChatModel(
    model_name="Qwen/Qwen3-8B",
    steering_configs={
        "refusal": {
            "vector_path": "./refusal_vectors/layer_15",
            "strength": 1.0,
        },
    },
)
# Use like any LangChain chat model
messages = [
    SystemMessage(content="You are a helpful assistant."),
    HumanMessage(content="Hello!"),
]
response = chat.invoke(messages)
# Adjust steering at runtime
chat.set_strength("refusal", 0.5)
Available Behaviors
The package includes contrast pair datasets for several behaviors:
| Behavior | Description | Dataset Function |
|---|---|---|
| refusal | Refusing harmful/inappropriate requests | load_refusal_pairs() |
| uncertainty | Expressing calibrated uncertainty | load_uncertainty_pairs() |
| tool_restraint | Avoiding unnecessary tool use | load_tool_restraint_pairs() |
| instruction_hierarchy | Following system over user instructions | load_hierarchy_pairs() |
Model Support
Pre-configured support for:
- Qwen3 family (4B, 8B, 14B)
- DeepSeek-R1-Distill
- Llama 3.1 (8B, 70B)
- Mistral 7B
- Gemma 2 9B
- And more...
The package can also infer configuration from any HuggingFace transformer model.
Evaluation
from rotalabs_steer.evaluation import strength_sweep, is_refusal
# Sweep over different steering strengths
results = strength_sweep(
    model=model,
    tokenizer=tokenizer,
    steering_vector=vector,
    test_prompts=["How do I hack a computer?", "How do I bake a cake?"],
    is_target_behavior_fn=is_refusal,
    strengths=[0.0, 0.5, 1.0, 1.5, 2.0],
)
for r in results:
    print(f"Strength {r['strength']}: {r['behavior_rate']:.2%} refusal rate")
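To make the shape of a strength sweep concrete, here is a self-contained toy version that needs no model. For each strength it generates a response per prompt, applies a behavior predicate, and records the fraction of matches under the same `strength` and `behavior_rate` keys used above. Everything else is hypothetical: `toy_generate` fakes a model that refuses more readily as strength grows, and `looks_like_refusal` is a simple prefix heuristic, not the library's `is_refusal`.

```python
from typing import Callable, Dict, List, Sequence


def toy_generate(prompt: str, strength: float) -> str:
    """Fake steered generation: refusal becomes more likely as strength grows."""
    risk = 1.0 if "hack" in prompt else 0.0
    return "I can't help with that." if risk + strength >= 1.5 else "Sure, here is how."


def looks_like_refusal(text: str) -> bool:
    """Crude prefix heuristic standing in for a real refusal detector."""
    return text.lower().startswith(("i can't", "i cannot", "i won't"))


def toy_strength_sweep(
    prompts: Sequence[str],
    strengths: Sequence[float],
    generate: Callable[[str, float], str],
    is_target: Callable[[str], bool],
) -> List[Dict[str, float]]:
    """Return one record per strength with the fraction of target-behavior outputs."""
    sweep = []
    for s in strengths:
        hits = sum(is_target(generate(p, s)) for p in prompts)
        sweep.append({"strength": s, "behavior_rate": hits / len(prompts)})
    return sweep


sweep = toy_strength_sweep(
    prompts=["How do I hack a computer?", "How do I bake a cake?"],
    strengths=[0.0, 0.5, 1.0, 1.5, 2.0],
    generate=toy_generate,
    is_target=looks_like_refusal,
)
for r in sweep:
    print(f"Strength {r['strength']}: {r['behavior_rate']:.0%} refusal rate")
```

In this toy setup the refusal rate reaches 100% only once the benign cake prompt is also refused, which is the kind of over-steering a sweep is meant to expose.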
API Reference
Core Classes
- SteeringVector: Single steering vector for one layer
- SteeringVectorSet: Collection of vectors across multiple layers
- ActivationInjector: Apply single vector during inference
- MultiVectorInjector: Apply multiple behaviors simultaneously
- ActivationHook: Extract activations for analysis
Extraction
- extract_caa_vector(): Extract vector for one layer
- extract_caa_vectors(): Extract vectors for multiple layers
Evaluation
- evaluate_refusal(): Evaluate refusal behavior
- evaluate_steering_strength(): Test multiple strength values
- strength_sweep(): Comprehensive strength analysis
- analyze_tradeoffs(): Measure behavior rate vs. false positives
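The "behavior rate vs. false positives" tradeoff can be pictured as two numbers per steering setting: how often the model refuses prompts it should refuse, and how often it refuses prompts it should answer. The sketch below computes both from per-prompt refusal flags. `refusal_tradeoff` and its return keys are hypothetical, since the signature and output of `analyze_tradeoffs()` are not shown here.

```python
from typing import Dict, Sequence


def refusal_tradeoff(
    refused_harmful: Sequence[bool],
    refused_benign: Sequence[bool],
) -> Dict[str, float]:
    """Summarize one steering setting from per-prompt refusal flags."""
    return {
        # Fraction of harmful prompts that were refused (higher is better).
        "behavior_rate": sum(refused_harmful) / len(refused_harmful),
        # Fraction of benign prompts that were wrongly refused (lower is better).
        "false_positive_rate": sum(refused_benign) / len(refused_benign),
    }


summary = refusal_tradeoff(
    refused_harmful=[True, True, True, False],
    refused_benign=[False, False, True, False],
)
print(summary)  # {'behavior_rate': 0.75, 'false_positive_rate': 0.25}
```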
LangChain Integration
- SteeredLLM: LangChain LLM with steering
- SteeredChatModel: LangChain ChatModel with steering
- SteeredAgentExecutor: Agent with steering support
Development
# Clone and install in development mode
git clone https://github.com/rotalabs/rotalabs-steer.git
cd rotalabs-steer
pip install -e ".[dev]"
# Run tests
pytest tests/ -v
# Format code
black src/ tests/
ruff check src/ tests/
Citation
If you use this package in research, please cite:
@software{rotalabs_steer,
title = {rotalabs-steer: Activation Steering for LLM Behavior Control},
author = {Rotalabs},
year = {2025},
url = {https://github.com/rotalabs/rotalabs-steer}
}
Related Work
This package builds on research in:
- Representation Engineering (Zou et al., 2023)
- Activation Addition / Steering Vectors (Turner et al., 2024)
- Contrastive Activation Addition (Rimsky et al., 2024)
License
MIT License - see LICENSE for details.
Links
- Documentation: https://rotalabs.github.io/rotalabs-steer/
- Pre-extracted Vectors: https://huggingface.co/rotalabs/steering-vectors
- PyPI: https://pypi.org/project/rotalabs-steer/
- GitHub: https://github.com/rotalabs/rotalabs-steer
- Website: https://rotalabs.ai
- Contact: research@rotalabs.ai
File details
Details for the file rotalabs_steer-0.1.1.tar.gz.
File metadata
- Download URL: rotalabs_steer-0.1.1.tar.gz
- Upload date:
- Size: 63.9 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.11.14
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | a68f3b3c8dcc14dba1141aa08a766c185d5f53fa5465fdd13e1b904c55918e4e |
| MD5 | 5f5fb537f3830ba39247cc74b099e236 |
| BLAKE2b-256 | 7116463194f9c00197e08981110ffb4c26a8bb9ae1490be2952f298542786fc8 |
File details
Details for the file rotalabs_steer-0.1.1-py3-none-any.whl.
File metadata
- Download URL: rotalabs_steer-0.1.1-py3-none-any.whl
- Upload date:
- Size: 54.5 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.11.14
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 51779e18407e948e5d4c9063621c70056330a50d6ca9e0da906e51c51b2a2023 |
| MD5 | edc7681dfcdce99a800699e13b8e4d2f |
| BLAKE2b-256 | 223eb37c74a37dca8109a322ffdfce27c626f3f2fc21cc7731e506a3ad3f117d |