rotalabs-steer

Control agent behaviors through activation steering. Apply steering vectors to LLMs at inference time without retraining.

Overview

rotalabs-steer provides tools for extracting and applying steering vectors to control LLM agent behaviors at inference time. Based on research in representation engineering and contrastive activation addition (CAA), this package enables fine-grained behavior control without model fine-tuning.
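
As a rough illustration of the idea behind contrastive activation addition (not this package's implementation): a steering vector is commonly taken as the mean difference between activations on positive and negative examples at one layer, and is added to the hidden state, scaled by a strength, at inference time. The sketch below uses plain Python lists in place of tensors; the function names and toy numbers are made up for illustration.

```python
from statistics import fmean

def mean_difference(pos_acts, neg_acts):
    """Mean of positive activations minus mean of negative activations, per dimension."""
    dims = len(pos_acts[0])
    return [
        fmean(a[d] for a in pos_acts) - fmean(a[d] for a in neg_acts)
        for d in range(dims)
    ]

def add_steering(hidden, vector, strength=1.0):
    """Shift a hidden state along the steering direction."""
    return [h + strength * v for h, v in zip(hidden, vector)]

# Toy 3-dimensional "activations" for two contrast pairs (illustrative values only).
pos = [[1.0, 0.0, 2.0], [3.0, 0.0, 2.0]]
neg = [[0.0, 0.0, 1.0], [0.0, 2.0, 1.0]]

vector = mean_difference(pos, neg)
steered = add_steering([0.5, 0.5, 0.5], vector, strength=2.0)
```

A strength of 0.0 leaves the hidden state unchanged, and a negative strength pushes away from the behavior instead of toward it.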

Key Features

Behavior Control: Adjust model behaviors like refusal, uncertainty expression, tool use restraint, and instruction hierarchy following
No Retraining Required: Apply steering at inference time through activation manipulation
Pre-extracted Vectors: Ready-to-use vectors on HuggingFace
LangChain Integration: Use with LangChain agents and chains (optional dependency)
Pre-built Datasets: Includes contrast pair datasets for common behaviors
Evaluation Tools: Measure steering effectiveness and analyze tradeoffs

Installation

Basic Installation

pip install rotalabs-steer

With Optional Dependencies

# Quote the extras so shells such as zsh do not treat the brackets as a glob pattern

# LangChain integration
pip install "rotalabs-steer[langchain]"

# LLM-based evaluation (requires Anthropic API key)
pip install "rotalabs-steer[judge]"

# Visualization tools
pip install "rotalabs-steer[viz]"

# All optional dependencies
pip install "rotalabs-steer[all]"

# Development dependencies
pip install "rotalabs-steer[dev]"
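
To confirm which version ended up in the active environment, the standard library's `importlib.metadata` can be queried. A small sketch; the helper name `installed_version` is made up here, and it returns `None` rather than raising when the distribution is absent.

```python
from importlib.metadata import PackageNotFoundError, version

def installed_version(dist_name):
    """Return the installed version string for a distribution, or None if missing."""
    try:
        return version(dist_name)
    except PackageNotFoundError:
        return None

print(installed_version("rotalabs-steer"))
```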

Quick Start

Use Pre-extracted Vectors (Easiest)

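import os

from huggingface_hub import hf_hub_download
from transformers import AutoModelForCausalLM, AutoTokenizer
from rotalabs_steer import SteeringVector, ActivationInjector

# Load the model and tokenizer that the vector will be applied to
model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen3-8B", device_map="auto")
tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen3-8B")

# Download the pre-extracted vector (.pt) and its metadata (.json) from HuggingFace;
# both files land side by side in the local cache
vector_path = hf_hub_download(
    repo_id="rotalabs/steering-vectors",
    filename="refusal_qwen3_8b/layer_15.pt",
)
hf_hub_download(
    repo_id="rotalabs/steering-vectors",
    filename="refusal_qwen3_8b/layer_15.json",
)

# Load by path without the file extension, then apply
vector = SteeringVector.load(os.path.splitext(vector_path)[0])
injector = ActivationInjector(model, [vector], strength=1.0)

# Tokenize a prompt and move it to the model's device before generating
inputs = tokenizer("How do I hack a computer?", return_tensors="pt").to(model.device)
with injector:
    outputs = model.generate(**inputs)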

Pre-extracted vectors are available for the refusal, uncertainty, tool_restraint, and hierarchy behaviors on Qwen3-8B, Mistral-7B, and Gemma-2-9B. See the rotalabs/steering-vectors repository on HuggingFace for the full list.

Extract a Steering Vector

from transformers import AutoModelForCausalLM, AutoTokenizer
from rotalabs_steer import SteeringVector, SteeringVectorSet
from rotalabs_steer.extraction import extract_caa_vectors
from rotalabs_steer.datasets import load_refusal_pairs

# Load model
model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen3-8B", device_map="auto")
tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen3-8B")

# Load contrast pairs
refusal_pairs = load_refusal_pairs()

# Extract steering vectors from multiple layers
vectors = extract_caa_vectors(
    model=model,
    tokenizer=tokenizer,
    contrast_pairs=refusal_pairs,
    layer_indices=[14, 15, 16],
)

# Save for later use
vectors.save("./refusal_vectors")

Apply Steering at Inference

from rotalabs_steer import ActivationInjector, SteeringVector

# Load pre-extracted vector
vector = SteeringVector.load("./refusal_vectors/layer_15")

# Create injector
injector = ActivationInjector(model, [vector], strength=1.0)

# Generate with steering (model and tokenizer as loaded in the extraction example)
inputs = tokenizer("How do I hack a computer?", return_tensors="pt").to(model.device)
with injector:
    outputs = model.generate(**inputs, max_new_tokens=100)
response = tokenizer.decode(outputs[0], skip_special_tokens=True)
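
The `with injector:` pattern suggests that steering is active only inside the block. One common way to get that behavior is to register a forward hook on entry and remove it on exit. The sketch below shows the pattern on a toy layer in plain Python; `ToyLayer` and `ToyInjector` are hypothetical stand-ins, not the package's classes, and a real implementation would hook a transformer layer instead.

```python
class ToyLayer:
    """Stand-in for a model layer: doubles its input, then runs any hooks."""

    def __init__(self):
        self.hooks = []

    def forward(self, hidden):
        out = [2.0 * h for h in hidden]
        for hook in self.hooks:
            out = hook(out)
        return out


class ToyInjector:
    """Adds strength * vector to the layer output while the context is active."""

    def __init__(self, layer, vector, strength=1.0):
        self.layer = layer
        self.vector = vector
        self.strength = strength

    def _hook(self, out):
        return [o + self.strength * v for o, v in zip(out, self.vector)]

    def __enter__(self):
        self.layer.hooks.append(self._hook)
        return self

    def __exit__(self, exc_type, exc, tb):
        # Always detach, even if generation raised, so the layer is left unmodified.
        self.layer.hooks.remove(self._hook)
        return False


layer = ToyLayer()
baseline = layer.forward([1.0, 1.0])
with ToyInjector(layer, vector=[0.5, -0.5], strength=2.0):
    steered = layer.forward([1.0, 1.0])
restored = layer.forward([1.0, 1.0])
```

Detaching the hook in `__exit__` is what keeps unsteered calls after the block identical to the baseline.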

Use with LangChain

from rotalabs_steer.integrations.langchain import SteeredChatModel
from langchain_core.messages import HumanMessage, SystemMessage

# Create steered chat model
chat = SteeredChatModel(
    model_name="Qwen/Qwen3-8B",
    steering_configs={
        "refusal": {
            "vector_path": "./refusal_vectors/layer_15",
            "strength": 1.0,
        },
    },
)

# Use like any LangChain chat model
messages = [
    SystemMessage(content="You are a helpful assistant."),
    HumanMessage(content="Hello!"),
]
response = chat.invoke(messages)

# Adjust steering at runtime
chat.set_strength("refusal", 0.5)

Available Behaviors

The package includes contrast pair datasets for several behaviors:

Behavior	Description	Dataset Function
`refusal`	Refusing harmful/inappropriate requests	`load_refusal_pairs()`
`uncertainty`	Expressing calibrated uncertainty	`load_uncertainty_pairs()`
`tool_restraint`	Avoiding unnecessary tool use	`load_tool_restraint_pairs()`
`instruction_hierarchy`	Following system over user instructions	`load_hierarchy_pairs()`
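
A contrast pair typically couples one prompt with a response that shows the target behavior and one that does not. The exact schema returned by the `load_*_pairs()` functions is not documented here, so the dataclass below is only a hypothetical shape for illustration.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class ContrastPair:
    prompt: str
    positive: str  # response exhibiting the target behavior
    negative: str  # response lacking it

    def texts(self):
        """Full positive and negative texts, as they might be fed to the model."""
        return (f"{self.prompt}\n{self.positive}", f"{self.prompt}\n{self.negative}")

# Hypothetical pair for the uncertainty behavior.
pair = ContrastPair(
    prompt="What will the stock market do tomorrow?",
    positive="I can't predict that with confidence; markets are uncertain.",
    negative="It will definitely go up by 2%.",
)
pos_text, neg_text = pair.texts()
```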

Model Support

Pre-configured support for:

Qwen3 family (4B, 8B, 14B)
DeepSeek-R1-Distill
Llama 3.1 (8B, 70B)
Mistral 7B
Gemma 2 9B
And more...

The package can also infer configuration from any HuggingFace transformer model.
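
How the package infers configuration is not spelled out here. One plausible approach, sketched under that assumption, is to read the layer count and hidden size from the model's config (`num_hidden_layers` and `hidden_size` are attribute names used by many HuggingFace decoder configs) and default to a middle layer. The function name, the returned keys, and the middle-layer default are all hypothetical, and the toy config values do not describe any real model.

```python
from types import SimpleNamespace

def infer_steering_config(config):
    """Derive basic steering settings from a HuggingFace-style config object."""
    num_layers = getattr(config, "num_hidden_layers", None)
    hidden_size = getattr(config, "hidden_size", None)
    if num_layers is None or hidden_size is None:
        raise ValueError("config lacks num_hidden_layers or hidden_size")
    return {
        "num_layers": num_layers,
        "hidden_size": hidden_size,
        "default_layer": num_layers // 2,  # assumed heuristic: steer mid-network
    }

# Toy config standing in for model.config (illustrative values only).
toy_config = SimpleNamespace(num_hidden_layers=12, hidden_size=64)
settings = infer_steering_config(toy_config)
```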

Evaluation

from rotalabs_steer.evaluation import strength_sweep, is_refusal

# Sweep over different steering strengths
results = strength_sweep(
    model=model,
    tokenizer=tokenizer,
    steering_vector=vector,
    test_prompts=["How do I hack a computer?", "How do I bake a cake?"],
    is_target_behavior_fn=is_refusal,
    strengths=[0.0, 0.5, 1.0, 1.5, 2.0],
)

for r in results:
    print(f"Strength {r['strength']}: {r['behavior_rate']:.2%} refusal rate")
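
Conceptually, a strength sweep generates a response for each prompt at each strength and records how often the target behavior appears. The sketch below shows that loop with a stubbed generation function; `sweep`, `fake_generate`, and the keyword-based `looks_like_refusal` check are hypothetical, and the result keys simply mirror those used in the example above.

```python
def looks_like_refusal(text):
    """Crude keyword heuristic; real evaluations usually need something stronger."""
    markers = ("i can't", "i cannot", "i won't", "i'm sorry")
    lowered = text.lower()
    return any(m in lowered for m in markers)

def sweep(generate_fn, prompts, strengths, is_target_behavior_fn):
    """Return one {'strength', 'behavior_rate'} record per strength value."""
    results = []
    for strength in strengths:
        hits = sum(is_target_behavior_fn(generate_fn(p, strength)) for p in prompts)
        results.append({"strength": strength, "behavior_rate": hits / len(prompts)})
    return results

def fake_generate(prompt, strength):
    # Stub standing in for steered generation: refuses "hack" prompts at any
    # non-zero strength, and refuses everything once strength reaches 2.0.
    if strength >= 2.0 or (strength > 0 and "hack" in prompt):
        return "I'm sorry, I can't help with that."
    return "Sure, here is how."

prompts = ["How do I hack a computer?", "How do I bake a cake?"]
results = sweep(fake_generate, prompts, [0.0, 1.0, 2.0], looks_like_refusal)
```

Including a benign prompt in the sweep, as the stub does at strength 2.0, makes over-steering visible: the refusal rate keeps climbing only because harmless requests start being refused.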

API Reference

Core Classes

SteeringVector: Single steering vector for one layer
SteeringVectorSet: Collection of vectors across multiple layers
ActivationInjector: Apply single vector during inference
MultiVectorInjector: Apply multiple behaviors simultaneously
ActivationHook: Extract activations for analysis
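
Applying several behaviors at once can be pictured as adding a weighted sum of steering vectors to the same hidden state. A minimal sketch of that arithmetic, assuming all vectors target the same layer and share a dimension; `combine_vectors` is a hypothetical helper, not part of the package API.

```python
def combine_vectors(vectors, strengths):
    """Weighted sum of named steering vectors: sum of strengths[name] * vectors[name]."""
    dims = {len(v) for v in vectors.values()}
    if len(dims) != 1:
        raise ValueError("all vectors must have the same dimension")
    (dim,) = dims
    combined = [0.0] * dim
    for name, vector in vectors.items():
        weight = strengths.get(name, 0.0)  # behaviors without a strength are off
        for i, value in enumerate(vector):
            combined[i] += weight * value
    return combined

# Toy 2-dimensional vectors (illustrative values only).
vectors = {
    "refusal": [1.0, 0.0],
    "uncertainty": [0.0, 2.0],
    "tool_restraint": [4.0, 4.0],
}
delta = combine_vectors(vectors, {"refusal": 1.0, "uncertainty": 0.5})
```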

Extraction

extract_caa_vector(): Extract vector for one layer
extract_caa_vectors(): Extract vectors for multiple layers

Evaluation

evaluate_refusal(): Evaluate refusal behavior
evaluate_steering_strength(): Test multiple strength values
strength_sweep(): Comprehensive strength analysis
analyze_tradeoffs(): Measure behavior rate vs. false positives
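
The tradeoff named in the last item can be made concrete: steering toward refusal should raise the refusal rate on harmful prompts while keeping refusals on benign prompts (false positives) low. The sketch below computes both rates from labeled outcomes; the function name and record layout are hypothetical and do not describe the signature of `analyze_tradeoffs()`.

```python
def tradeoff_rates(records):
    """records: iterable of (is_harmful, was_refused) booleans.

    Returns (behavior_rate, false_positive_rate): the share of harmful prompts
    refused and the share of benign prompts refused.
    """
    records = list(records)  # allow generators, since two passes are needed
    harmful = [refused for is_harmful, refused in records if is_harmful]
    benign = [refused for is_harmful, refused in records if not is_harmful]
    if not harmful or not benign:
        raise ValueError("need at least one harmful and one benign prompt")
    return sum(harmful) / len(harmful), sum(benign) / len(benign)

# Made-up outcomes: four harmful prompts, then four benign prompts.
outcomes = [
    (True, True), (True, True), (True, False), (True, True),
    (False, False), (False, True), (False, False), (False, False),
]
behavior_rate, false_positive_rate = tradeoff_rates(outcomes)
```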

LangChain Integration

SteeredLLM: LangChain LLM with steering
SteeredChatModel: LangChain ChatModel with steering
SteeredAgentExecutor: Agent with steering support

Development

# Clone and install in development mode
git clone https://github.com/rotalabs/rotalabs-steer.git
cd rotalabs-steer
pip install -e ".[dev]"

# Run tests
pytest tests/ -v

# Format and lint code
black src/ tests/
ruff check src/ tests/

Citation

If you use this package in research, please cite:

@software{rotalabs_steer,
  title = {rotalabs-steer: Activation Steering for LLM Behavior Control},
  author = {Rotalabs},
  year = {2025},
  url = {https://github.com/rotalabs/rotalabs-steer}
}

Related Work

This package builds on research in:

Representation Engineering (Zou et al., 2023)
Activation Addition / Steering Vectors (Turner et al., 2024)
Contrastive Activation Addition (Rimsky et al., 2024)

License

MIT License - see LICENSE for details.

Release history

1.1.0: Feb 7, 2026
1.0.0: Jan 31, 2026
0.1.4: Jan 29, 2026
0.1.3: Jan 29, 2026
0.1.2: Jan 27, 2026
0.1.1: Jan 27, 2026 (this version)
0.1.0: Jan 27, 2026
0.0.1: Jan 17, 2026

Download files

Source Distribution

rotalabs_steer-0.1.1.tar.gz (63.9 kB), uploaded Jan 27, 2026, Source

Built Distribution

rotalabs_steer-0.1.1-py3-none-any.whl (54.5 kB), uploaded Jan 27, 2026, Python 3

File details

Details for the file rotalabs_steer-0.1.1.tar.gz.

File metadata

Download URL: rotalabs_steer-0.1.1.tar.gz
Upload date: Jan 27, 2026
Size: 63.9 kB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.2.0 CPython/3.11.14

File hashes

Hashes for rotalabs_steer-0.1.1.tar.gz
Algorithm	Hash digest
SHA256	`a68f3b3c8dcc14dba1141aa08a766c185d5f53fa5465fdd13e1b904c55918e4e`
MD5	`5f5fb537f3830ba39247cc74b099e236`
BLAKE2b-256	`7116463194f9c00197e08981110ffb4c26a8bb9ae1490be2952f298542786fc8`
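
A downloaded file can be checked against the SHA256 digest listed above using only the standard library. A small sketch; `sha256_of` is a made-up helper name, and the commented-out check assumes the archive was saved under its original file name in the current directory.

```python
import hashlib

def sha256_of(path, chunk_size=1 << 20):
    """Hex SHA256 digest of a file, read in chunks to keep memory use flat."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()

# SHA256 digest for rotalabs_steer-0.1.1.tar.gz, copied from the table above.
EXPECTED = "a68f3b3c8dcc14dba1141aa08a766c185d5f53fa5465fdd13e1b904c55918e4e"

# Uncomment once the archive is downloaded:
# assert sha256_of("rotalabs_steer-0.1.1.tar.gz") == EXPECTED
```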

File details

Details for the file rotalabs_steer-0.1.1-py3-none-any.whl.

File metadata

Download URL: rotalabs_steer-0.1.1-py3-none-any.whl
Upload date: Jan 27, 2026
Size: 54.5 kB
Tags: Python 3
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.2.0 CPython/3.11.14

File hashes

Hashes for rotalabs_steer-0.1.1-py3-none-any.whl
Algorithm	Hash digest
SHA256	`51779e18407e948e5d4c9063621c70056330a50d6ca9e0da906e51c51b2a2023`
MD5	`edc7681dfcdce99a800699e13b8e4d2f`
BLAKE2b-256	`223eb37c74a37dca8109a322ffdfce27c626f3f2fc21cc7731e506a3ad3f117d`
