

Reason this release was yanked:

only compatible with >=3.11

Project description

K-Steering


Repository Overview

A brief overview of the repository (major implementation files only):
k_steering/
├── k_steering/
│   ├── steering/
│   │   ├── base.py         # Base steering class
│   │   ├── k_steer.py      # K-Steering implementation
│   │   ├── trainer.py      # Steering classifier implementation
│   │   ├── caa.py          # CAA implementation
│   │   └── dataset.py      # External dataset integration
│   ├── evals/
│   │   └── judges/
│   │       ├── base.py     # Base judge class
│   │       ├── tone.py     # Tone judge
│   │       ├── debate.py   # Debate judge
│   │       └── ood.py      # OOD judge (for parameter sweeps)
│   ├── data/
│   └── utils/
└── README.md

Introduction

K-Steering is a steering framework for training and applying non-linear control mechanisms to large language models (LLMs), enabling you to steer model behavior towards desired target attributes and away from undesired behaviors.

The framework is based on the paper "Beyond Linear Steering: Unified Multi-Attribute Control for Language Models", which introduces Non-Linear K-Steering as a principled alternative to linear combinations of steering vectors for multi-attribute control.

Figure 1. An illustration of gradient-based K-Steering. For an activation vector $A$, we compute a steering loss $\mathcal{L}$ that penalizes high classifier logits for undesired labels and rewards high logits for desired labels. Backpropagating this loss through the classifier gives the steered activations $A' = A - \alpha \nabla_A \mathcal{L}$, where $\alpha$ is the steering coefficient.
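The update in Figure 1 can be made concrete with a minimal pure-Python sketch of one gradient-based steering step. This is not the package's implementation, and it rests on three assumptions:

- The classifier is a hypothetical one-hidden-layer tanh network with random weights.
- The loss is the mean logit of the labels to avoid minus the mean logit of the target labels.
- The gradient is derived by hand rather than with autograd.

The names `classifier_logits`, `steering_loss` and `steer_step` are illustrative only.

```python
import math
import random

def classifier_logits(a, params):
    """Hypothetical classifier: logits = W2 @ tanh(W1 @ a + b1) + b2."""
    W1, b1, W2, b2 = params
    h = [math.tanh(sum(w * x for w, x in zip(row, a)) + b) for row, b in zip(W1, b1)]
    logits = [sum(w * x for w, x in zip(row, h)) + b for row, b in zip(W2, b2)]
    return h, logits

def steering_loss(logits, target, avoid):
    """Assumed loss: mean avoid-label logit minus mean target-label logit."""
    avoid_term = sum(logits[j] for j in avoid) / len(avoid)
    target_term = sum(logits[i] for i in target) / len(target)
    return avoid_term - target_term

def steer_step(a, params, target, avoid, alpha):
    """Return (A', grad) with A' = A - alpha * dL/dA, backpropagated by hand."""
    W1, _, W2, _ = params
    h, _ = classifier_logits(a, params)
    # dL/dlogits: +1/|avoid| on avoided labels, -1/|target| on target labels
    g = [0.0] * len(W2)
    for j in avoid:
        g[j] += 1.0 / len(avoid)
    for i in target:
        g[i] -= 1.0 / len(target)
    # Output layer: dL/dh = W2^T g
    dh = [sum(W2[k][m] * g[k] for k in range(len(W2))) for m in range(len(h))]
    # tanh layer: d tanh(z)/dz = 1 - tanh(z)^2
    dz = [d * (1.0 - hm * hm) for d, hm in zip(dh, h)]
    # Input layer: dL/dA = W1^T dz
    grad = [sum(W1[m][n] * dz[m] for m in range(len(W1))) for n in range(len(a))]
    return [x - alpha * gx for x, gx in zip(a, grad)], grad

# Toy setup: 4-dim activations, 8 hidden units, 3 labels (random weights).
rng = random.Random(0)
d, hidden, n_labels = 4, 8, 3
params = (
    [[rng.gauss(0, 1) for _ in range(d)] for _ in range(hidden)],
    [0.0] * hidden,
    [[rng.gauss(0, 1) for _ in range(hidden)] for _ in range(n_labels)],
    [0.0] * n_labels,
)
TARGET, AVOID = [0], [2]  # label indices to encourage / suppress

A = [rng.gauss(0, 1) for _ in range(d)]
A_steered, grad = steer_step(A, params, TARGET, AVOID, alpha=0.01)
loss_before = steering_loss(classifier_logits(A, params)[1], TARGET, AVOID)
loss_after = steering_loss(classifier_logits(A_steered, params)[1], TARGET, AVOID)
print(f"steering loss: {loss_before:.4f} -> {loss_after:.4f}")
```

The step moves $A$ against the gradient, so a small enough $\alpha$ lowers the steering loss. In a real model this update would be applied to hidden states during the forward pass rather than to a standalone vector.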

In addition to K-Steering, the package includes an implementation of Contrastive Activation Addition (CAA), a linear steering method, for comparison and as a baseline in steering experiments.

✨ Features

  • K-Steering–based multi-attribute control with support for non-linear steering
  • Native Contrastive Activation Addition (CAA) integration
  • Flexible, modular configuration for steering behavior and classifier training
  • Predefined behavioral tasks for rapid prototyping and experimentation
  • Automatic parameter sweeps to find optimal steering coefficients via binary search
  • Seamless dataset integration, supporting both Hugging Face and local datasets
  • Built for research and interpretability, enabling controlled and analyzable generation workflows
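The feature list mentions parameter sweeps by binary search. The sketch below is a rough illustration, not the package's sweep code. It finds the largest steering coefficient that still passes a quality check.

It assumes the check is monotone: it passes for every coefficient up to some threshold and fails above it. `passes_check` is a hypothetical stand-in for something like an OOD judge score.

```python
from typing import Callable

def largest_passing_alpha(
    passes_check: Callable[[float], bool],
    low: float = 0.0,
    high: float = 10.0,
    tol: float = 1e-3,
) -> float:
    """Binary search for the largest alpha in [low, high] with passes_check(alpha) True.

    Assumes passes_check(low) is True and that once the check fails for some
    alpha it also fails for every larger alpha.
    """
    if passes_check(high):
        return high
    while high - low > tol:
        mid = (low + high) / 2
        if passes_check(mid):
            low = mid   # mid is acceptable: search higher
        else:
            high = mid  # mid is too strong: search lower
    return low

# Hypothetical check: outputs stay acceptable up to alpha = 3.7
best = largest_passing_alpha(lambda alpha: alpha <= 3.7)
print(f"largest acceptable alpha ~ {best:.3f}")
```

Each iteration halves the interval, so the search needs about log2((high - low) / tol) evaluations of the check. This matters when every evaluation means generating and judging text.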

Quick Start

Get K-Steering running in minutes.

Try it in Google Colab

You can explore K-Steering without any local setup using the Colab notebook below.

👉 K-Steering Colab Notebook (includes installation, training, and inference examples)

The Colab notebook mirrors the examples below and is the recommended way to get started quickly.

📘 Documentation

For detailed explanations of the core concepts, terminology, and configuration arguments used throughout the package, see the Documentation.

Prerequisites

  • Python 3.12 or higher
  • uv - Fast Python package installer and resolver

To install uv, follow the instructions at https://docs.astral.sh/uv/getting-started/installation/

Installation

For now, we recommend running K-Steering locally. From the repository root, run:

uv sync

This will create the environment and install all required dependencies.

API Usage

See Examples for complete scripts that train different steering models.

K-Steering (Non-Linear Steering)

This example shows how to use K-Steering to guide a language model’s behavior by training lightweight steering classifiers and applying them during inference.


1️⃣ Load Required Modules

from k_steering.steering.config import SteeringConfig
from k_steering.steering.k_steer import KSteering

2️⃣ Select a Base Model

# Hugging Face model to be steered
MODEL_NAME = "unsloth/Llama-3.2-1B-Instruct"

3️⃣ Configure Steering

Define which layers are used to train and apply steering.

steering_config = SteeringConfig(
    train_layer=1,          # Layer used to train the steering classifier
    steer_layers=[1, 3],    # Layers where steering is applied
)

4️⃣ Task and Generation Settings

TASK_NAME = "debates"       # e.g., "debates" or "tones"
MAX_NEW_TOKENS = 100        # Maximum number of tokens to generate
MAX_SAMPLES = 10            # Maximum number of samples for training

GENERATION_KWARGS = {
    "max_new_tokens": MAX_NEW_TOKENS,
    "temperature": 1.0,
    "top_p": 0.9,
}
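`temperature` and `top_p` are standard sampling parameters for Hugging Face `generate`. In Hugging Face they only affect decoding when sampling (`do_sample=True`) is enabled. This README does not say exactly how `GENERATION_KWARGS` is forwarded, so the sketch below only shows what the two values conceptually do to a next-token distribution:

- Temperature rescales the logits before the softmax.
- Top-p (nucleus) filtering keeps the smallest set of most likely tokens whose cumulative probability reaches `top_p`, then renormalizes.

The logits are made-up numbers, and the function names are illustrative.

```python
import math
import random
from typing import Dict, List

def nucleus_distribution(logits: List[float], temperature: float, top_p: float) -> Dict[int, float]:
    """Apply temperature, then keep the smallest top-probability set with mass >= top_p."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)
    exps = [math.exp(x - m) for x in scaled]  # subtract max for numerical stability
    total = sum(exps)
    probs = [e / total for e in exps]

    kept, cumulative = {}, 0.0
    for idx in sorted(range(len(probs)), key=lambda i: probs[i], reverse=True):
        kept[idx] = probs[idx]
        cumulative += probs[idx]
        if cumulative >= top_p:
            break  # nucleus reached; at least one token is always kept
    norm = sum(kept.values())
    return {idx: p / norm for idx, p in kept.items()}

def sample_token(logits: List[float], temperature: float = 1.0, top_p: float = 0.9,
                 rng=random) -> int:
    """Draw one token index from the temperature-scaled, top-p-filtered distribution."""
    dist = nucleus_distribution(logits, temperature, top_p)
    tokens, weights = zip(*dist.items())
    return rng.choices(tokens, weights=weights, k=1)[0]

logits = [2.0, 1.0, 0.5, -1.0]  # made-up next-token logits
print(nucleus_distribution(logits, temperature=1.0, top_p=0.9))
print(sample_token(logits, rng=random.Random(0)))
```

With `temperature=1.0` and `top_p=0.9`, the least likely token (about 3% of the mass) is dropped and the other three are renormalized. Lowering `temperature` concentrates mass on the top token.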

5️⃣ Initialize K-Steering

Wrap the base model with K-Steering.

steer_model = KSteering(
    model_name=MODEL_NAME,
    steering_config=steering_config,
)

6️⃣ Train Steering Classifiers

Train steering classifiers on task-specific data. Remove max_samples to use the full dataset.

steer_model.fit(
    task=TASK_NAME,
    max_samples=MAX_SAMPLES,
)
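`fit` trains the steering classifiers on task data, using activations from the configured `train_layer`. The README does not describe the classifier architecture or training loop, so what follows is only a conceptual sketch under simplifying assumptions:

- Activations are plain float vectors, each labeled with one behavior.
- The classifier is a multi-class linear (softmax) model trained by full-batch gradient descent on cross-entropy.

K-Steering itself is about non-linear classifiers. A linear model is used here only to keep the code short, and `train_softmax_classifier` is a hypothetical name.

```python
import math
import random
from typing import List, Sequence, Tuple

def softmax(z: Sequence[float]) -> List[float]:
    m = max(z)
    e = [math.exp(v - m) for v in z]
    s = sum(e)
    return [v / s for v in e]

def train_softmax_classifier(
    X: Sequence[Sequence[float]], y: Sequence[int], n_labels: int,
    lr: float = 0.1, epochs: int = 200,
) -> Tuple[List[List[float]], List[float]]:
    """Full-batch gradient descent on mean cross-entropy for logits = W x + b."""
    d = len(X[0])
    W = [[0.0] * d for _ in range(n_labels)]
    b = [0.0] * n_labels
    for _ in range(epochs):
        gW = [[0.0] * d for _ in range(n_labels)]
        gb = [0.0] * n_labels
        for x, label in zip(X, y):
            p = softmax([sum(w * xi for w, xi in zip(W[k], x)) + b[k] for k in range(n_labels)])
            for k in range(n_labels):
                err = p[k] - (1.0 if k == label else 0.0)  # d(CE)/d(logit_k)
                gb[k] += err
                for j in range(d):
                    gW[k][j] += err * x[j]
        for k in range(n_labels):
            b[k] -= lr * gb[k] / len(X)
            for j in range(d):
                W[k][j] -= lr * gW[k][j] / len(X)
    return W, b

def predict(W, b, x) -> int:
    logits = [sum(w * xi for w, xi in zip(row, x)) + bk for row, bk in zip(W, b)]
    return max(range(len(logits)), key=logits.__getitem__)

# Toy "activations": three behaviors clustered around different directions.
rng = random.Random(0)
centers = [[3.0, 0.0], [0.0, 3.0], [-3.0, -3.0]]
X, y = [], []
for label, c in enumerate(centers):
    for _ in range(20):
        X.append([c[0] + rng.gauss(0, 0.5), c[1] + rng.gauss(0, 0.5)])
        y.append(label)

W, b = train_softmax_classifier(X, y, n_labels=3)
accuracy = sum(predict(W, b, x) == label for x, label in zip(X, y)) / len(X)
print(f"training accuracy: {accuracy:.2f}")
```

A classifier like this is what the steering loss is later backpropagated through. Its logits for target and avoid labels define which direction in activation space counts as "more" or "less" of a behavior.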

7️⃣ Generate Steered Outputs

prompts = [
    "Are political ideologies evolving in response to global challenges?"
]

output = steer_model.get_steered_output(
    prompts,
    target_labels=["Empirical Grounding"],     # Behaviors to encourage
    avoid_labels=["Straw Man Reframing"],      # Behaviors to suppress
    generation_kwargs=GENERATION_KWARGS,
)

print(output)

CAA Steering

The k-steering package also includes an implementation of Contrastive Activation Addition (CAA) for linear steering baselines.

from k_steering.steering.k_steer import CAASteering
from k_steering.steering.config import SteeringConfig

# Hugging Face model to be steered
MODEL_NAME = "unsloth/Llama-3.2-1B-Instruct"

# Define where steering vectors are extracted and applied
steering_config = SteeringConfig(
    train_layer=1,          # Layer index used to train the steering vectors
    pos=-1,                 # Token position used to extract hidden activations
    steer_layers=[1, 3],    # Layers where the steering will be applied
)

# Name of the task used to load training data
# (e.g., "debates" or "tones")
TASK_NAME = "debates"

# Maximum number of tokens to generate
MAX_NEW_TOKENS = 100

# Maximum number of samples for training
MAX_SAMPLES = 10

# Standard generation parameters passed to the model
GENERATION_KWARGS = {
    "max_new_tokens": MAX_NEW_TOKENS,
    "temperature": 1.0,
    "top_p": 0.9,
}

# Create a CAASteering wrapper around the base model
steer_model = CAASteering(
    model_name=MODEL_NAME,
    steering_config=steering_config,
)

# Train steering vectors on task-specific data. Remove `max_samples` to use the full dataset.
steer_model.fit(
    task=TASK_NAME,
    max_samples=MAX_SAMPLES,
)

# Input prompts
prompts = [
    "Are political ideologies evolving in response to global challenges?"
]

# Generate steered output by encouraging and discouraging specific labels
output = steer_model.get_steered_output(
    prompts,
    target_labels=["Empirical Grounding"],     # Labels to steer *towards*
    avoid_labels=["Straw Man Reframing"],      # Labels to steer *away from*
    generation_kwargs=GENERATION_KWARGS,
)

print(output)
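The core CAA idea behind this linear baseline fits in a few lines of pure Python:

- The steering vector is the mean activation of examples showing the target behavior minus the mean activation of examples showing the behavior to avoid.
- Steering adds a scaled copy of that vector to new activations.

This is not the package's `CAASteering` code, which works on real hidden states at the configured layers and token position. The toy vectors, helper names and `multiplier` parameter are illustrative assumptions.

```python
from typing import List, Sequence

Vector = List[float]

def mean_vector(vectors: Sequence[Sequence[float]]) -> Vector:
    """Element-wise mean of equally sized vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]

def caa_vector(positive: Sequence[Sequence[float]],
               negative: Sequence[Sequence[float]]) -> Vector:
    """Steering vector = mean(positive activations) - mean(negative activations)."""
    pos_mean, neg_mean = mean_vector(positive), mean_vector(negative)
    return [p - q for p, q in zip(pos_mean, neg_mean)]

def apply_caa(activation: Sequence[float], vector: Sequence[float],
              multiplier: float) -> Vector:
    """Add a scaled steering vector to an activation (negative multiplier steers away)."""
    return [a + multiplier * v for a, v in zip(activation, vector)]

# Toy activations at a single token position, 3 dims each.
positive = [[1.0, 2.0, 0.0], [3.0, 2.0, 0.0]]  # e.g. "Empirical Grounding" examples
negative = [[0.0, 0.0, 1.0], [0.0, 2.0, 1.0]]  # e.g. "Straw Man Reframing" examples

v = caa_vector(positive, negative)
print(v)                                          # [2.0, 1.0, -1.0]
print(apply_caa([0.5, 0.5, 0.5], v, multiplier=2.0))  # [4.5, 2.5, -1.5]
```

The shift added by CAA is the same for every input, which is what makes it a linear baseline. K-Steering instead recomputes a gradient through a classifier for each activation.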



Download files


Source Distribution

k_steering-0.1.0.tar.gz (33.2 kB)

Uploaded Source

Built Distribution


k_steering-0.1.0-py3-none-any.whl (48.8 kB)

Uploaded Python 3

File details

Details for the file k_steering-0.1.0.tar.gz.

File metadata

  • Download URL: k_steering-0.1.0.tar.gz
  • Upload date:
  • Size: 33.2 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: uv/0.10.4 {"installer":{"name":"uv","version":"0.10.4","subcommand":["publish"]},"python":null,"implementation":{"name":null,"version":null},"distro":{"name":"macOS","version":null,"id":null,"libc":null},"system":{"name":null,"release":null},"cpu":null,"openssl_version":null,"setuptools_version":null,"rustc_version":null,"ci":null}

File hashes

Hashes for k_steering-0.1.0.tar.gz
Algorithm    Hash digest
SHA256       b1b6b01617d69740a5d39e21f0ef5e34f5303cfa0c47e6342b619944aa19c213
MD5          daa1c36fbbc6a3a89a513baf30a1cd74
BLAKE2b-256  27ec1b62c385a024e3617a52035b9ec29931b89f6acaadf4da61ddc625c4dabe


File details

Details for the file k_steering-0.1.0-py3-none-any.whl.

File metadata

  • Download URL: k_steering-0.1.0-py3-none-any.whl
  • Upload date:
  • Size: 48.8 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: uv/0.10.4 {"installer":{"name":"uv","version":"0.10.4","subcommand":["publish"]},"python":null,"implementation":{"name":null,"version":null},"distro":{"name":"macOS","version":null,"id":null,"libc":null},"system":{"name":null,"release":null},"cpu":null,"openssl_version":null,"setuptools_version":null,"rustc_version":null,"ci":null}

File hashes

Hashes for k_steering-0.1.0-py3-none-any.whl
Algorithm    Hash digest
SHA256       ebfcc0ed79aff278fcdeb86b753fb9d2919103ffd6c511d5eb87d88b91a0c938
MD5          ed4bc54185c3242dbbb321e5e694eee2
BLAKE2b-256  f3ab9eae6106e908f94e96bf9112f3946ee61e045a981b9d7bad2be5737756df

