
K-Steering

Repository Overview
Introduction
Features
Quick Start
API Usage
K-Steering (Non-Linear Steering)
CAA Steering

Repository Overview

A brief overview of the repository layout (major implementation files only):

k_steering/
├── k_steering/
│   ├── steering/
│   │   ├── base.py             # Base steering class
│   │   ├── k_steer.py          # K-Steering implementation
│   │   ├── trainer.py          # Steering classifier implementation
│   │   ├── caa.py              # CAA implementation
│   │   └── dataset.py          # External dataset integration
│   ├── evals/
│   │   └── judges/
│   │       ├── base.py         # Base judge class
│   │       ├── tone.py         # Tone judge
│   │       ├── debate.py       # Debate judge
│   │       └── ood.py          # OOD judge (for parameter sweeps)
│   ├── data/
│   └── utils/
└── README.md

Introduction

K-Steering is a steering framework for training and applying non-linear control mechanisms to large language models (LLMs), enabling you to steer model behavior towards desired target attributes and away from undesired behaviors.

The framework is based on the paper Beyond Linear Steering: Unified Multi-Attribute Control for Language Models, which introduces Non-Linear K-Steering as a principled alternative to linear combinations of steering vectors for multi-attribute control.

Figure 1. An illustration of gradient-based K-Steering. For an activation vector $A$, we compute a steering loss $L$ that penalizes high classifier logits on $A$ for undesired labels and rewards high logits for desired labels. Backpropagating this loss through the classifier gives the steered activations $A' = A - \alpha \nabla_A L$, where $\alpha$ is the steering coefficient (step size).
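To make the update in Figure 1 concrete, here is a minimal NumPy sketch of one gradient-based steering step. It assumes a toy one-hidden-layer tanh classifier and a loss equal to the mean logit of the labels to avoid minus the mean logit of the target labels. The classifier architecture, the exact loss, and the function names (`mlp_logits`, `steering_loss`, `k_steer_step`) are hypothetical illustrations, not the package's API.

```python
import numpy as np


def mlp_logits(a, W1, b1, W2, b2):
    """Toy one-hidden-layer classifier: activation vector -> (label logits, hidden units)."""
    h = np.tanh(W1 @ a + b1)
    return W2 @ h + b2, h


def steering_loss(logits, target, avoid):
    """Mean logit of the labels to avoid minus mean logit of the target labels."""
    return logits[avoid].mean() - logits[target].mean()


def k_steer_step(a, params, target, avoid, alpha):
    """One gradient step A' = A - alpha * dL/dA, with the gradient derived by hand."""
    W1, b1, W2, b2 = params
    logits, h = mlp_logits(a, W1, b1, W2, b2)
    # dL/dlogits: +1/|avoid| on labels to suppress, -1/|target| on labels to encourage
    g = np.zeros_like(logits)
    g[avoid] += 1.0 / len(avoid)
    g[target] -= 1.0 / len(target)
    # Backpropagate through the output layer, then tanh, then the input layer
    grad_h = W2.T @ g
    grad_pre = grad_h * (1.0 - h ** 2)
    grad_a = W1.T @ grad_pre
    return a - alpha * grad_a, grad_a


# Usage with random (untrained) weights, purely for illustration
rng = np.random.default_rng(0)
d, hidden, n_labels = 16, 32, 4
params = (
    rng.normal(size=(hidden, d)) / np.sqrt(d),
    rng.normal(size=hidden),
    rng.normal(size=(n_labels, hidden)) / np.sqrt(hidden),
    rng.normal(size=n_labels),
)
a = rng.normal(size=d)
target, avoid = [0], [2]  # hypothetical label indices
a_steered, grad = k_steer_step(a, params, target, avoid, alpha=1e-3)
```

Because the classifier is non-linear, the gradient direction depends on where $A$ sits, which is what distinguishes this update from adding a fixed vector.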

In addition to K-Steering, the package also includes an implementation of Contrastive Activation Addition (CAA) for comparison and baseline steering experiments.

✨ Features

K-Steering–based multi-attribute control with support for non-linear steering
Native Contrastive Activation Addition (CAA) integration
Flexible, modular configuration for steering behavior and classifier training
Predefined behavioral tasks for rapid prototyping and experimentation
Automatic parameter sweeps to find optimal steering coefficients via binary search
Seamless dataset integration, supporting both Hugging Face and local datasets
Built for research and interpretability, enabling controlled and analyzable generation workflows
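The coefficient sweep can be pictured as a bisection over the steering strength. The sketch below assumes that acceptability is monotone in the coefficient (small values pass, large values fail); `sweep_coefficient` and `is_acceptable` are hypothetical names. In a real sweep, `is_acceptable` would presumably generate steered text at that coefficient and score it, for example with the OOD judge listed in the repository overview, and the package's actual search criterion may differ.

```python
def sweep_coefficient(is_acceptable, low=0.0, high=10.0, tol=1e-3):
    """Binary-search the largest coefficient in [low, high] for which
    is_acceptable(coef) is True, assuming every coefficient below some
    threshold passes and every coefficient above it fails."""
    if not is_acceptable(low):
        raise ValueError("even the lowest coefficient is rejected")
    if is_acceptable(high):
        return high
    # Invariant: `low` always passes, `high` always fails
    while high - low > tol:
        mid = (low + high) / 2
        if is_acceptable(mid):
            low = mid
        else:
            high = mid
    return low


# Toy acceptance rule standing in for "outputs still judged in-distribution"
best = sweep_coefficient(lambda c: c <= 3.7)
```

Each step halves the search interval, so reaching a tolerance of `tol` over a range of width `W` takes about `log2(W / tol)` evaluations, which matters when every evaluation involves generation and judging.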

Quick Start

Get K-Steering running in a few minutes.

Try it in Google Colab

You can explore K-Steering without any local setup using the Colab notebook below.

👉 K-Steering Colab Notebook.

(Includes installation, training, and inference examples)

The Colab notebook mirrors the examples below and is the recommended way to get started quickly.

📘 Documentation

For detailed explanations of the core concepts, terminology, and configuration arguments used throughout the package, see the Documentation.

Prerequisites

Python 3.12 or higher
uv, a fast Python package installer and resolver

To install uv, follow the instructions at https://docs.astral.sh/uv/getting-started/installation/

Installation

For now, we recommend running K-Steering locally from the root directory of the repository:

uv sync   # set up the environment

This will create the environment and install all required dependencies.

API Usage

See Examples for complete scripts that train the different steering models.

K-Steering (Non-Linear Steering)

This example shows how to use K-Steering to guide a language model’s behavior by training lightweight steering classifiers and applying them during inference.

1️⃣ Load Required Modules

from k_steering.steering.config import SteeringConfig
from k_steering.steering.k_steer import KSteering

2️⃣ Select a Base Model

# Hugging Face model to be steered
MODEL_NAME = "unsloth/Llama-3.2-1B-Instruct"

3️⃣ Configure Steering

Define which layers are used to train and apply steering.

steering_config = SteeringConfig(
    train_layer=1,          # Layer used to train the steering classifier
    steer_layers=[1, 3],    # Layers where steering is applied
)
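As a rough mental model of these two fields, the NumPy sketch below reads the training activation from layer `train_layer` at one token position and applies a steering function after each layer listed in `steer_layers`. The helper names (`extract_training_features`, `forward_with_steering`), zero-based layer indexing, and steering after (rather than inside) each layer are assumptions for illustration only; the `pos` argument mirrors the `pos=-1` field used in the CAA example.

```python
import numpy as np


def extract_training_features(hidden_states, train_layer, pos=-1):
    """hidden_states: one (seq_len, d_model) array per layer.
    Returns the activation at token position `pos` of layer `train_layer`."""
    return hidden_states[train_layer][pos]


def forward_with_steering(x, layers, steer_layers, steer_fn):
    """Run a toy layer stack, applying steer_fn to the hidden state
    right after each layer whose index appears in steer_layers."""
    h = x
    for i, layer in enumerate(layers):
        h = layer(h)
        if i in steer_layers:
            h = steer_fn(h)
    return h


# Toy stack of 4 "layers" that each double the hidden state
layers = [lambda h: h * 2 for _ in range(4)]
# Trace: 0 -> 0 -> (0, then +1) = 1 -> 2 -> (4, then +1) = 5
out = forward_with_steering(np.zeros(3), layers, steer_layers=[1, 3], steer_fn=lambda h: h + 1)

# Fake per-layer activations: 4 layers, 3 tokens, d_model = 2
hidden_states = [np.arange(6.0).reshape(3, 2) + 10 * layer for layer in range(4)]
feats = extract_training_features(hidden_states, train_layer=1)  # last token of layer 1
```

The point of separating the two fields is that the classifier only ever sees activations from one layer, while its steering signal can be applied at several.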

4️⃣ Task and Generation Settings

TASK_NAME = "debates"       # e.g., "debates" or "tones"
MAX_NEW_TOKENS = 100        # Maximum number of tokens to generate
MAX_SAMPLES = 10            # Maximum number of samples for training

GENERATION_KWARGS = {
    "max_new_tokens": MAX_NEW_TOKENS,
    "temperature": 1.0,
    "top_p": 0.9,
}

5️⃣ Initialize K-Steering

Wrap the base model with K-Steering.

steer_model = KSteering(
    model_name=MODEL_NAME,
    steering_config=steering_config,
)

6️⃣ Train Steering Classifiers

Train steering classifiers on task-specific data. Remove max_samples to use the full dataset.

steer_model.fit(
    task=TASK_NAME,
    max_samples=MAX_SAMPLES,
)

7️⃣ Generate Steered Outputs

prompts = [
    "Are political ideologies evolving in response to global challenges?"
]

output = steer_model.get_steered_output(
    prompts,
    target_labels=["Empirical Grounding"],     # Behaviors to encourage
    avoid_labels=["Straw Man Reframing"],      # Behaviors to suppress
    generation_kwargs=GENERATION_KWARGS,
)

print(output)

CAA Steering

The k-steering package also includes an implementation of Contrastive Activation Addition (CAA), which serves as a linear steering baseline.

from k_steering.steering.k_steer import CAASteering
from k_steering.steering.config import SteeringConfig

# Hugging Face model to be steered
MODEL_NAME = "unsloth/Llama-3.2-1B-Instruct"

# Define where steering vectors are computed and applied
steering_config = SteeringConfig(
    train_layer=1,          # Layer whose activations are used to compute the steering vectors
    pos=-1,                 # Token position used to extract hidden activations (-1 = last token)
    steer_layers=[1, 3],    # Layers where steering is applied
)

# Name of the task used to load training data
# (e.g., "debates" or "tones")
TASK_NAME = "debates"

# Maximum number of tokens to generate
MAX_NEW_TOKENS = 100

# Maximum number of samples for training
MAX_SAMPLES = 10

# Standard generation parameters passed to the model
GENERATION_KWARGS = {
    "max_new_tokens": MAX_NEW_TOKENS,
    "temperature": 1.0,
    "top_p": 0.9,
}

# Create a CAASteering wrapper around the base model
steer_model = CAASteering(
    model_name=MODEL_NAME,
    steering_config=steering_config,
)

# Train steering vectors on task-specific data. Remove `max_samples` to use the full dataset.
steer_model.fit(
    task=TASK_NAME,
    max_samples=MAX_SAMPLES,
)

# Input prompts
prompts = [
    "Are political ideologies evolving in response to global challenges?"
]

# Generate steered output by encouraging and discouraging specific labels
output = steer_model.get_steered_output(
    prompts,
    target_labels=["Empirical Grounding"],     # Labels to steer towards
    avoid_labels=["Straw Man Reframing"],      # Labels to steer away from
    generation_kwargs=GENERATION_KWARGS,
)

print(output)
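For comparison with K-Steering, CAA steers linearly: a steering vector is the mean difference between activations for contrasting examples, and it is added, scaled by a coefficient, to hidden states during generation. The NumPy sketch below shows only that arithmetic. `caa_vector` and `apply_caa` are hypothetical helpers, and treating examples of the target labels as the positive set and examples of the avoided labels as the negative set is an assumption about how the package pairs them.

```python
import numpy as np


def caa_vector(pos_acts, neg_acts):
    """Mean activation of positive examples minus mean activation of negative
    examples; both arrays have shape (n_examples, d_model)."""
    return pos_acts.mean(axis=0) - neg_acts.mean(axis=0)


def apply_caa(h, vector, coef=1.0):
    """Add the scaled steering vector to every token's hidden state.
    h has shape (seq_len, d_model); the vector broadcasts over tokens."""
    return h + coef * vector


pos_acts = np.array([[1.0, 2.0], [3.0, 4.0]])  # toy "target label" activations
neg_acts = np.array([[0.0, 1.0], [2.0, 1.0]])  # toy "avoid label" activations
v = caa_vector(pos_acts, neg_acts)             # [2, 3] - [1, 1] = [1, 2]
steered = apply_caa(np.zeros((3, 2)), v, coef=0.5)
```

Unlike the K-Steering step, the same vector is added regardless of the current activation, which is what makes CAA a linear baseline.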

Project details

Release history

0.1.2 (this version), Apr 8, 2026

0.1.1 (yanked), Apr 8, 2026
Reason this release was yanked: numpy version not compatible

0.1.0 (yanked), Apr 7, 2026
Reason this release was yanked: only compatible with >=3.11

Download files

Source Distribution

k_steering-0.1.2.tar.gz (33.2 kB), uploaded Apr 8, 2026

Built Distribution

k_steering-0.1.2-py3-none-any.whl (48.8 kB, Python 3), uploaded Apr 8, 2026

File details

Details for the file k_steering-0.1.2.tar.gz.

File metadata

Download URL: k_steering-0.1.2.tar.gz
Upload date: Apr 8, 2026
Size: 33.2 kB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: uv/0.10.4

File hashes

Hashes for k_steering-0.1.2.tar.gz
Algorithm	Hash digest
SHA256	`7343a1065a8e6a9f998529b1b56bfe8781aa4fb75893d8b87ee942df3f716ba1`
MD5	`e9ed7e58ed963cfbeaaf925e6b21fe10`
BLAKE2b-256	`3c76e0f3e96b060e374401eb19eadabbbd01864a1fd5a0e6c224eb282eb60dd0`


File details

Details for the file k_steering-0.1.2-py3-none-any.whl.

File metadata

Download URL: k_steering-0.1.2-py3-none-any.whl
Upload date: Apr 8, 2026
Size: 48.8 kB
Tags: Python 3
Uploaded using Trusted Publishing? No
Uploaded via: uv/0.10.4

File hashes

Hashes for k_steering-0.1.2-py3-none-any.whl
Algorithm	Hash digest
SHA256	`95a1bfdcddf3248249133b54b1de05fd1dd56f2ff4af184566c87b14358472f6`
MD5	`a550c6466b97a98ac171dddbc06b2fed`
BLAKE2b-256	`86d7278e0f062ae49f2eb96424138191719d39a1df02a5a6cb71c7d5fa0f96c1`

