K-Steering
Table of Contents
- Repository Overview
- Introduction
- Features
- Quick Start
- API Usage
- K-Steering Example
- CAA-Steering Example
Repository Overview
A brief overview of the repository (only the major implementation files are shown).
k_steering/
├── k_steering/
│   ├── steering/
│   │   ├── base.py        # Base steering class
│   │   ├── k_steer.py     # K-Steering implementation
│   │   ├── trainer.py     # Steering classifier implementation
│   │   ├── caa.py         # CAA implementation
│   │   └── dataset.py     # External dataset integration
│   ├── evals/
│   │   ├── judges/
│   │   │   ├── base.py    # Base judge class
│   │   │   ├── tone.py    # Tone judge
│   │   │   ├── debate.py  # Debate judge
│   │   │   └── ood.py     # OOD judge (for parameter sweep)
│   ├── data/
│   ├── utils/
└── README.md
Introduction
K-Steering is a steering framework for training and applying non-linear control mechanisms to large language models (LLMs), enabling you to steer model behavior towards desired target attributes and away from undesired behaviors.
The framework is based on the paper Beyond Linear Steering: Unified Multi-Attribute Control for Language Models, which introduces Non-Linear K-Steering as a principled alternative to linear combinations of steering vectors for multi-attribute control.
Figure 1. An illustration of gradient-based K-Steering. For an activation vector $A$, we calculate a steering loss $L$ that penalizes higher logits from a classifier on $A$ for undesired labels and rewards higher logits for desired labels. By backpropagating this loss through the classifier, we obtain the steered activations $A' = A - \alpha \nabla L$.
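The update in Figure 1 can be illustrated with a toy sketch in plain Python. This is not the package's implementation: the two-layer classifier, its weights, and the helper names (`forward`, `steering_loss`, `steering_gradient`, `steer`) are all hypothetical, and the loss here is simply the sum of avoided logits minus the sum of target logits.

```python
import math

def forward(a, W1, b1, W2, b2):
    """Toy two-layer classifier: logits = W2 @ tanh(W1 @ a + b1) + b2."""
    hidden = [math.tanh(sum(w * x for w, x in zip(row, a)) + b)
              for row, b in zip(W1, b1)]
    logits = [sum(w * h for w, h in zip(row, hidden)) + b
              for row, b in zip(W2, b2)]
    return hidden, logits

def steering_loss(logits, target, avoid):
    """Loss falls as target logits rise and avoided logits drop."""
    return sum(logits[k] for k in avoid) - sum(logits[k] for k in target)

def steering_gradient(a, W1, b1, W2, b2, target, avoid):
    """Backpropagate the steering loss through the classifier to the activation."""
    hidden, _ = forward(a, W1, b1, W2, b2)
    # dL/dlogit_k is +1 for avoided labels, -1 for target labels, 0 otherwise.
    d_logits = [(k in avoid) - (k in target) for k in range(len(W2))]
    # Chain rule through W2 and the tanh non-linearity.
    d_hidden = [sum(d_logits[k] * W2[k][j] for k in range(len(W2)))
                * (1.0 - hidden[j] ** 2)
                for j in range(len(hidden))]
    # Chain rule through W1 gives the gradient with respect to the activation.
    return [sum(d_hidden[j] * W1[j][i] for j in range(len(W1)))
            for i in range(len(a))]

def steer(a, params, target, avoid, alpha=0.1):
    """One steering step: A' = A - alpha * grad(L)."""
    grad = steering_gradient(a, *params, target, avoid)
    return [x - alpha * g for x, g in zip(a, grad)]

# Hypothetical 2-d activation and classifier weights, chosen only for illustration.
params = ([[1.0, 0.0], [0.0, 1.0]], [0.0, 0.0],    # W1, b1
          [[1.0, -1.0], [-1.0, 1.0]], [0.0, 0.0])  # W2, b2
activation = [0.5, -0.2]
steered = steer(activation, params, target=[0], avoid=[1])
loss_before = steering_loss(forward(activation, *params)[1], [0], [1])
loss_after = steering_loss(forward(steered, *params)[1], [0], [1])
print(loss_before, loss_after)
```

Because the classifier is non-linear, the gradient depends on the current activation, so the steering direction changes from one input to the next instead of being a single fixed vector.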
In addition to K-Steering, the package also includes an implementation of Contrastive Activation Addition (CAA) for comparison and baseline steering experiments.
✨ Features
- K-Steering–based multi-attribute control with support for non-linear steering
- Native Contrastive Activation Addition (CAA) integration
- Flexible, modular configuration for steering behavior and classifier training
- Predefined behavioral tasks for rapid prototyping and experimentation
- Automatic parameter sweeps to find optimal steering coefficients via binary search
- Seamless dataset integration, supporting both Hugging Face and local datasets
- Built for research and interpretability, enabling controlled and analyzable generation workflows
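As a rough illustration of what a binary-search sweep over a steering coefficient could look like, the sketch below searches for the largest coefficient that a caller-supplied check still accepts. It is an assumption-laden sketch, not the package's sweep: `sweep_alpha` and `is_acceptable` are hypothetical names, and the search assumes acceptability is monotone (once a coefficient is rejected, every larger one is too).

```python
from typing import Callable

def sweep_alpha(
    is_acceptable: Callable[[float], bool],
    low: float = 0.0,
    high: float = 10.0,
    tol: float = 1e-2,
) -> float:
    """Return (within tol) the largest alpha in [low, high] that is accepted.

    `is_acceptable` is a hypothetical stand-in for a judge, for example one
    that checks whether steered generations are still coherent.
    """
    if not is_acceptable(low):
        raise ValueError("even the smallest coefficient is rejected")
    if is_acceptable(high):
        return high
    # Invariant: `low` is accepted, `high` is rejected.
    while high - low > tol:
        mid = (low + high) / 2
        if is_acceptable(mid):
            low = mid
        else:
            high = mid
    return low

# Toy check: pretend outputs degrade once alpha exceeds 3.7.
best = sweep_alpha(lambda alpha: alpha <= 3.7)
print(best)
```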
Quick Start
Get K-Steering running in minutes.
Try it in Google Colab
You can explore K-Steering without any local setup using the Colab notebook below.
(Includes installation, training, and inference examples)
The Colab notebook mirrors the examples below and is the recommended way to get started quickly.
📘 Documentation
For detailed explanations of the core concepts, terminology, and configuration arguments used throughout the package, see the Documentation.
Prerequisites
- Python 3.12 or higher
- uv - Fast Python package installer and resolver
To install uv, follow the instructions at https://docs.astral.sh/uv/getting-started/installation/
Installation
For now, we recommend running K-Steering locally from the root directory:
uv sync # for Environment Setup
This will create the environment and install all required dependencies.
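After syncing, a quick sanity check can confirm that the interpreter meets the Python 3.12 prerequisite and that the `k_steering` package is importable. The helper below is a hypothetical convenience script, not part of the package; it uses only the standard library.

```python
import importlib.util
import sys

def check_environment(package="k_steering", min_python=(3, 12)):
    """Report whether the interpreter and package requirements look satisfied."""
    return {
        "python_ok": sys.version_info >= min_python,
        "package_found": importlib.util.find_spec(package) is not None,
    }

print(check_environment())
```

Saved as, say, `check_env.py` in the repository root, it can be run inside the project environment with `uv run python check_env.py`.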
API Usage
See Examples for complete scripts that train the different steering models.
K-Steering (Non-Linear Steering)
This example shows how to use K-Steering to guide a language model’s behavior by training lightweight steering classifiers and applying them during inference.
1️⃣ Load Required Modules
from k_steering.steering.config import SteeringConfig
from k_steering.steering.k_steer import KSteering
2️⃣ Select a Base Model
# Hugging Face model to be steered
MODEL_NAME = "unsloth/Llama-3.2-1B-Instruct"
3️⃣ Configure Steering
Define which layers are used to train and apply steering.
steering_config = SteeringConfig(
    train_layer=1,        # Layer used to train the steering classifier
    steer_layers=[1, 3],  # Layers where steering is applied
)
4️⃣ Task and Generation Settings
TASK_NAME = "debates" # e.g., "debates" or "tones"
MAX_NEW_TOKENS = 100 # Maximum number of tokens to generate
MAX_SAMPLES = 10 # Maximum number of samples for training
GENERATION_KWARGS = {
    "max_new_tokens": MAX_NEW_TOKENS,
    "temperature": 1.0,
    "top_p": 0.9,
}
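For readers unfamiliar with the sampling settings, the sketch below shows in plain Python what `temperature` and `top_p` conventionally do to a next-token distribution: logits are divided by the temperature, and only the most likely tokens whose cumulative probability reaches `top_p` are kept. It is a simplified illustration of standard nucleus sampling, not code from K-Steering or the underlying model; `next_token_distribution` is a hypothetical helper name.

```python
import math

def next_token_distribution(logits, temperature=1.0, top_p=0.9):
    """Return {token_index: probability} after temperature scaling and top-p filtering."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    probs = [e / total for e in exps]
    # Keep the most likely tokens until their cumulative mass reaches top_p.
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, mass = [], 0.0
    for i in ranked:
        kept.append(i)
        mass += probs[i]
        if mass >= top_p:
            break
    # Renormalize over the surviving tokens.
    return {i: probs[i] / mass for i in kept}

# Toy logits whose softmax is roughly (0.5, 0.3, 0.15, 0.05).
toy_logits = [math.log(p) for p in (0.5, 0.3, 0.15, 0.05)]
print(next_token_distribution(toy_logits))
```

Lower temperatures concentrate probability on the top token, so fewer tokens survive the `top_p` cut; a temperature of 1.0 leaves the model's distribution unchanged before filtering.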
5️⃣ Initialize K-Steering
Wrap the base model with K-Steering.
steer_model = KSteering(
    model_name=MODEL_NAME,
    steering_config=steering_config,
)
6️⃣ Train Steering Classifiers
Train steering classifiers on task-specific data. Remove max_samples to use the full dataset.
steer_model.fit(
    task=TASK_NAME,
    max_samples=MAX_SAMPLES,
)
7️⃣ Generate Steered Outputs
prompts = [
    "Are political ideologies evolving in response to global challenges?"
]
output = steer_model.get_steered_output(
    prompts,
    target_labels=["Empirical Grounding"],  # Behaviors to encourage
    avoid_labels=["Straw Man Reframing"],   # Behaviors to suppress
    generation_kwargs=GENERATION_KWARGS,
)
print(output)
CAA Steering
The k-steering package also includes an implementation of Contrastive Activation Addition (CAA) for linear steering baselines.
# Per the repository overview, the CAA implementation lives in steering/caa.py
from k_steering.steering.caa import CAASteering
from k_steering.steering.config import SteeringConfig
# Hugging Face model to be steered
MODEL_NAME = "unsloth/Llama-3.2-1B-Instruct"
# Define how and where steering vectors are trained and applied
steering_config = SteeringConfig(
    train_layer=1,        # Layer index used to train the steering vectors
    pos=-1,               # Token position used to extract hidden activations
    steer_layers=[1, 3],  # Layers where the steering will be applied
)
# Name of the task used to load training data
# (e.g., "debates" or "tones")
TASK_NAME = "debates"
# Maximum number of tokens to generate
MAX_NEW_TOKENS = 100
# Maximum number of samples for training
MAX_SAMPLES = 10
# Standard generation parameters passed to the model
GENERATION_KWARGS = {
    "max_new_tokens": MAX_NEW_TOKENS,
    "temperature": 1.0,
    "top_p": 0.9,
}
# Create a CAASteering wrapper around the base model
steer_model = CAASteering(
    model_name=MODEL_NAME,
    steering_config=steering_config,
)
# Train steering vectors on task-specific data. Remove `max_samples` to use the full dataset.
steer_model.fit(
    task=TASK_NAME,
    max_samples=MAX_SAMPLES,
)
# Input prompts
prompts = [
    "Are political ideologies evolving in response to global challenges?"
]
# Generate steered output by encouraging and discouraging specific labels
output = steer_model.get_steered_output(
    prompts,
    target_labels=["Empirical Grounding"],  # Labels to steer *towards*
    avoid_labels=["Straw Man Reframing"],   # Labels to steer *away from*
    generation_kwargs=GENERATION_KWARGS,
)
print(output)
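For intuition about what a CAA baseline computes, the sketch below builds a steering vector as the difference between mean activations of contrasting examples at a chosen token position, then adds it to an activation. It is a toy illustration in plain Python under stated assumptions: the helper names (`mean_vector`, `caa_vector`, `add_steering`) and the tiny activation lists are hypothetical and do not reflect the package's internals.

```python
def mean_vector(vectors):
    """Element-wise mean of equally sized vectors."""
    return [sum(column) / len(vectors) for column in zip(*vectors)]

def caa_vector(positive_acts, negative_acts, pos=-1):
    """Difference of mean activations at token position `pos`.

    Each element of positive_acts / negative_acts is one example: a list of
    per-token activation vectors taken from the training layer.
    """
    positive = mean_vector([example[pos] for example in positive_acts])
    negative = mean_vector([example[pos] for example in negative_acts])
    return [p - n for p, n in zip(positive, negative)]

def add_steering(activation, vector, coeff=1.0):
    """Linear steering: shift an activation along the steering vector."""
    return [a + coeff * v for a, v in zip(activation, vector)]

# Hypothetical activations: two examples per class, two tokens each, 2-d vectors.
positive_acts = [[[0.0, 0.0], [1.0, 2.0]], [[0.0, 0.0], [3.0, 4.0]]]
negative_acts = [[[9.0, 9.0], [0.0, 1.0]], [[9.0, 9.0], [2.0, 1.0]]]

vector = caa_vector(positive_acts, negative_acts, pos=-1)  # uses the last token
steered = add_steering([0.5, 0.5], vector, coeff=2.0)
print(vector, steered)
```

Unlike the gradient-based approach, the same vector is added regardless of the input, which is what makes this a linear baseline.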