K-Steering

Repository Overview

A brief overview of the repository (major implementation files only):
k_steering/
├── k_steering/
│   ├── steering/
│   │   ├── base.py        # Base steering class
│   │   ├── k_steer.py     # K-Steering implementation
│   │   ├── trainer.py     # Steering classifier implementation
│   │   ├── caa.py         # CAA implementation
│   │   └── dataset.py     # External dataset integration
│   ├── evals/
│   │   └── judges/
│   │       ├── base.py    # Base judge class
│   │       ├── tone.py    # Tone judge
│   │       ├── debate.py  # Debate judge
│   │       └── ood.py     # OOD judge (for parameter sweeps)
│   ├── data/
│   └── utils/
└── README.md

Introduction

K-Steering is a steering framework for training and applying non-linear control mechanisms to large language models (LLMs), enabling you to steer model behavior towards desired target attributes and away from undesired behaviors.

The framework is based on the paper "Beyond Linear Steering: Unified Multi-Attribute Control for Language Models", which introduces Non-Linear K-Steering as a principled alternative to linear combinations of steering vectors for multi-attribute control.

Figure 1. An illustration of gradient-based K-Steering. For an activation vector $A$, we calculate a steering loss $L$ that penalizes higher logits from a classifier on $A$ for undesired labels and rewards higher logits for desired labels. By backpropagating this loss through the classifier, we obtain the gradient $\nabla L$ with respect to $A$ and the steered activations $A' = A - \alpha \nabla L$.
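The update in Figure 1 can be illustrated with a small, dependency-free sketch. Everything here is an assumption made for illustration: the classifier is a toy two-layer tanh network, the loss is taken to be the mean logit of the avoided labels minus the mean logit of the target labels (one loss that matches the description in the caption), and the function names are hypothetical. It is not the package's implementation, which operates on real model activations.

```python
import math


def mlp_logits(a, W1, W2):
    """Toy classifier: logits = W2 @ tanh(W1 @ a). Returns (logits, hidden)."""
    h = [math.tanh(sum(w * x for w, x in zip(row, a))) for row in W1]
    logits = [sum(w * x for w, x in zip(row, h)) for row in W2]
    return logits, h


def steering_loss_grad(a, W1, W2, target, avoid):
    """Gradient w.r.t. `a` of L = mean(avoid logits) - mean(target logits)."""
    _, h = mlp_logits(a, W1, W2)
    # dL/dlogit_k: positive for avoided labels, negative for target labels.
    dlogits = [0.0] * len(W2)
    for k in avoid:
        dlogits[k] += 1.0 / len(avoid)
    for k in target:
        dlogits[k] -= 1.0 / len(target)
    # Backpropagate through W2, then tanh, then W1.
    dh = [sum(dlogits[k] * W2[k][j] for k in range(len(W2))) for j in range(len(h))]
    dz = [g * (1.0 - hj * hj) for g, hj in zip(dh, h)]
    return [sum(dz[j] * W1[j][i] for j in range(len(W1))) for i in range(len(a))]


def steer(a, W1, W2, target, avoid, alpha):
    """One gradient step on the activation: A' = A - alpha * grad(L)."""
    grad = steering_loss_grad(a, W1, W2, target, avoid)
    return [x - alpha * g for x, g in zip(a, grad)]


# Identity weights, so each logit is tanh of one activation coordinate.
W1 = [[1.0, 0.0], [0.0, 1.0]]
W2 = [[1.0, 0.0], [0.0, 1.0]]
print(steer([0.0, 0.0], W1, W2, target=[0], avoid=[1], alpha=0.5))  # [0.5, -0.5]
```

Because the gradient is taken through a non-linear classifier, the steering direction depends on the current activation, which is what distinguishes this from adding a fixed vector.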

In addition to K-Steering, the package also includes an implementation of Contrastive Activation Addition (CAA) for comparison and baseline steering experiments.

✨ Features

  • K-Steering–based multi-attribute control with support for non-linear steering
  • Native Contrastive Activation Addition (CAA) integration
  • Flexible, modular configuration for steering behavior and classifier training
  • Predefined behavioral tasks for rapid prototyping and experimentation
  • Automatic parameter sweeps to find optimal steering coefficients via binary search
  • Seamless dataset integration, supporting both Hugging Face and local datasets
  • Built for research and interpretability, enabling controlled and analyzable generation workflows
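As a rough illustration of the binary-search sweep mentioned in the feature list, the sketch below searches for the largest steering coefficient that still passes some acceptance check. The function name, the `is_acceptable` callback, and the search bounds are all hypothetical; the criterion the package actually uses (for example, the OOD judge) is not specified here, and the sketch assumes the check is monotone: it passes up to some threshold and fails beyond it.

```python
from typing import Callable


def largest_acceptable_alpha(
    is_acceptable: Callable[[float], bool],
    lo: float = 0.0,
    hi: float = 64.0,
    tol: float = 1e-2,
) -> float:
    """Binary-search the largest alpha in [lo, hi] that passes `is_acceptable`.

    Assumes `is_acceptable(lo)` is True and that the check is monotone.
    """
    if is_acceptable(hi):
        return hi
    # Invariant: lo passes the check, hi fails it.
    while hi - lo > tol:
        mid = (lo + hi) / 2.0
        if is_acceptable(mid):
            lo = mid
        else:
            hi = mid
    return lo


# Stand-in check: pretend generations stay coherent up to alpha = 7.3.
best = largest_acceptable_alpha(lambda alpha: alpha <= 7.3)
print(round(best, 2))
```

In practice each call to the check would involve generating steered outputs and scoring them, so halving the interval keeps the number of expensive evaluations logarithmic in the search range.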

Quick Start

Get K-Steering running in minutes.

Try it in Google Colab

You can explore K-Steering without any local setup using the Colab notebook below.

👉 K-Steering Colab Notebook.

(Includes installation, training, and inference examples)

The Colab notebook mirrors the examples below and is the recommended way to get started quickly.

📘 Documentation

For detailed explanations of the core concepts, terminology, and configuration arguments used throughout the package, see the Documentation.

Prerequisites

  • Python 3.12 or higher
  • uv - Fast Python package installer and resolver

To install uv, follow the instructions at https://docs.astral.sh/uv/getting-started/installation/

Installation

For now, we recommend running K-Steering locally. From the repository root, run:

uv sync  # set up the environment

This will create the environment and install all required dependencies.

API Usage

See Examples for complete scripts that train different steering models.

K-Steering (Non-Linear Steering)

This example shows how to use K-Steering to guide a language model’s behavior by training lightweight steering classifiers and applying them during inference.


1️⃣ Load Required Modules

from k_steering.steering.config import SteeringConfig
from k_steering.steering.k_steer import KSteering

2️⃣ Select a Base Model

# Hugging Face model to be steered
MODEL_NAME = "unsloth/Llama-3.2-1B-Instruct"

3️⃣ Configure Steering

Define which layers are used to train and apply steering.

steering_config = SteeringConfig(
    train_layer=1,          # Layer used to train the steering classifier
    steer_layers=[1, 3],    # Layers where steering is applied
)

4️⃣ Task and Generation Settings

TASK_NAME = "debates"       # e.g., "debates" or "tones"
MAX_NEW_TOKENS = 100        # Maximum number of tokens to generate
MAX_SAMPLES = 10            # Maximum number of samples for training

GENERATION_KWARGS = {
    "max_new_tokens": MAX_NEW_TOKENS,
    "temperature": 1.0,
    "top_p": 0.9,
}
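The `temperature` and `top_p` entries are standard sampling controls. For readers unfamiliar with them, here is a minimal, dependency-free sketch of temperature scaling followed by nucleus (top-p) filtering over a list of logits. The function name is hypothetical, and this is a conceptual illustration only, not how the package or the underlying generation library implements sampling.

```python
import math


def top_p_filter(logits, temperature=1.0, top_p=0.9):
    """Return {token_index: probability} for the nucleus, renormalized to sum to 1."""
    # Temperature rescales logits before the softmax; values below 1 sharpen it.
    scaled = [logit / temperature for logit in logits]
    peak = max(scaled)
    exps = [math.exp(s - peak) for s in scaled]  # subtract max for stability
    total = sum(exps)
    probs = [e / total for e in exps]

    # Keep the most likely tokens until their cumulative mass reaches top_p.
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, cumulative = [], 0.0
    for i in order:
        kept.append(i)
        cumulative += probs[i]
        if cumulative >= top_p:
            break

    mass = sum(probs[i] for i in kept)
    return {i: probs[i] / mass for i in kept}


# Logits chosen so the softmax is roughly (0.5, 0.3, 0.15, 0.05).
logits = [math.log(p) for p in (0.5, 0.3, 0.15, 0.05)]
print(sorted(top_p_filter(logits, temperature=1.0, top_p=0.9)))  # [0, 1, 2]
```

With `top_p=0.9`, the least likely token is dropped and the next token is sampled from the remaining three.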

5️⃣ Initialize K-Steering

Wrap the base model with K-Steering.

steer_model = KSteering(
    model_name=MODEL_NAME,
    steering_config=steering_config,
)

6️⃣ Train Steering Classifiers

Train steering classifiers on task-specific data. Remove max_samples to use the full dataset.

steer_model.fit(
    task=TASK_NAME,
    max_samples=MAX_SAMPLES,
)

7️⃣ Generate Steered Outputs

prompts = [
    "Are political ideologies evolving in response to global challenges?"
]

output = steer_model.get_steered_output(
    prompts,
    target_labels=["Empirical Grounding"],     # Behaviors to encourage
    avoid_labels=["Straw Man Reframing"],      # Behaviors to suppress
    generation_kwargs=GENERATION_KWARGS,
)

print(output)

CAA Steering

The k-steering package also includes an implementation of Contrastive Activation Addition (CAA), following the CAA paper, for linear steering baselines.

from k_steering.steering.caa import CAASteering
from k_steering.steering.config import SteeringConfig

# Hugging Face model to be steered
MODEL_NAME = "unsloth/Llama-3.2-1B-Instruct"

# Define how and where steering vectors are trained and applied
steering_config = SteeringConfig(
    train_layer=1,          # Layer index used to train the steering vectors
    pos=-1,                 # Token position used to extract hidden activations
    steer_layers=[1, 3],    # Layers where the steering will be applied
)

# Name of the task used to load training data
# (e.g., "debates" or "tones")
TASK_NAME = "debates"

# Maximum number of tokens to generate
MAX_NEW_TOKENS = 100

# Maximum number of samples for training
MAX_SAMPLES = 10

# Standard generation parameters passed to the model
GENERATION_KWARGS = {
    "max_new_tokens": MAX_NEW_TOKENS,
    "temperature": 1.0,
    "top_p": 0.9,
}

# Create a CAASteering wrapper around the base model
steer_model = CAASteering(
    model_name=MODEL_NAME,
    steering_config=steering_config,
)

# Train steering vectors on task-specific data. Remove `max_samples` to use the full dataset.
steer_model.fit(
    task=TASK_NAME,
    max_samples=MAX_SAMPLES,
)

# Input prompts
prompts = [
    "Are political ideologies evolving in response to global challenges?"
]

# Generate steered output by encouraging and discouraging specific labels
output = steer_model.get_steered_output(
    prompts,
    target_labels=["Empirical Grounding"],     # Labels to steer *towards*
    avoid_labels=["Straw Man Reframing"],      # Labels to steer *away from*
    generation_kwargs=GENERATION_KWARGS,
)

print(output)
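For intuition about what a linear CAA baseline does, the sketch below computes a steering vector as the difference between the mean activation of "target" examples and the mean activation of "avoid" examples, then adds a scaled copy of it to a new activation. The helper names and the toy two-dimensional activations are hypothetical, and how the package builds its contrast sets from `target_labels` and `avoid_labels` is an assumption here rather than something taken from its source.

```python
def mean_vector(rows):
    """Element-wise mean of a list of equal-length activation vectors."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]


def caa_vector(target_acts, avoid_acts):
    """Steering vector = mean(target activations) - mean(avoid activations)."""
    mu_target = mean_vector(target_acts)
    mu_avoid = mean_vector(avoid_acts)
    return [t - a for t, a in zip(mu_target, mu_avoid)]


def apply_caa(activation, vector, alpha=1.0):
    """Add the scaled steering vector to an activation: A' = A + alpha * v."""
    return [x + alpha * v for x, v in zip(activation, vector)]


# Toy activations, e.g. hidden states taken at one layer and token position.
target_acts = [[1.0, 0.0], [3.0, 2.0]]
avoid_acts = [[0.0, 1.0], [0.0, 3.0]]

v = caa_vector(target_acts, avoid_acts)
print(v)                                     # [2.0, -1.0]
print(apply_caa([0.5, 0.5], v, alpha=2.0))   # [4.5, -1.5]
```

Unlike the gradient-based K-Steering update, the same vector is added regardless of the current activation, which is why CAA serves as the linear baseline.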
