Representation Engineering

Representation Engineering (RepE)

This is the official repository for "Representation Engineering: A Top-Down Approach to AI Transparency"
by Andy Zou, Long Phan, Sarah Chen, James Campbell, Phillip Guo, Richard Ren, Alexander Pan, Xuwang Yin, Mantas Mazeika, Ann-Kathrin Dombrowski, Shashwat Goel, Nathaniel Li, Michael J. Byun, Zifan Wang, Alex Mallen, Steven Basart, Sanmi Koyejo, Dawn Song, Matt Fredrikson, Zico Kolter, and Dan Hendrycks.

Check out our website and demo here.

Introduction

In this paper, we introduce and characterize the emerging area of representation engineering (RepE), an approach to enhancing the transparency of AI systems that draws on insights from cognitive neuroscience. RepE places population-level representations, rather than neurons or circuits, at the center of analysis, equipping us with novel methods for monitoring and manipulating high-level cognitive phenomena in deep neural networks (DNNs). We provide baselines and an initial analysis of RepE techniques, showing that they offer simple yet effective solutions for improving our understanding and control of large language models. We showcase how these methods can provide traction on a wide range of safety-relevant problems, including truthfulness, memorization, power-seeking, and more, demonstrating the promise of representation-centered transparency research. We hope that this work catalyzes further exploration of RepE and fosters advancements in the transparency and safety of AI systems.

Installation

To install repe from the main branch of the GitHub repository, run:

git clone https://github.com/andyzoujm/representation-engineering.git
cd representation-engineering
pip install -e .

Quickstart

Our RepReading and RepControl pipelines inherit from the 🤗 Hugging Face pipelines for both classification and generation.

from transformers import pipeline  # Hugging Face pipeline factory used below
from repe import repe_pipeline_registry

# Register the 'rep-reading' and 'rep-control' tasks with Hugging Face pipelines
repe_pipeline_registry()

# ... initializing model and tokenizer ....

rep_reading_pipeline = pipeline("rep-reading", model=model, tokenizer=tokenizer)
rep_control_pipeline = pipeline("rep-control", model=model, tokenizer=tokenizer, **control_kwargs)
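
To give a feel for what a representation-reading step computes, here is a minimal conceptual sketch in pure Python. It is not the repe implementation and does not call the library: it assumes synthetic stand-in "hidden states" and uses a simple difference-of-means direction, chosen here only for brevity. All names (`reading_direction`, `fake_activation`, `concept`) are hypothetical.

```python
import random

def mean_vector(rows):
    """Element-wise mean of a list of equal-length vectors."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]

def reading_direction(pos_acts, neg_acts):
    """Unit vector pointing from the negative-class mean to the positive-class mean."""
    mean_pos = mean_vector(pos_acts)
    mean_neg = mean_vector(neg_acts)
    diff = [p - q for p, q in zip(mean_pos, mean_neg)]
    norm = sum(d * d for d in diff) ** 0.5
    return [d / norm for d in diff]

def score(activation, direction):
    """Projection of one activation vector onto the reading direction."""
    return sum(a * d for a, d in zip(activation, direction))

# Synthetic stand-in for hidden states (assumption: a real run would
# collect these from a model layer for contrasting prompts).
rng = random.Random(0)
dim = 8
concept = [1.0] + [0.0] * (dim - 1)  # hypothetical ground-truth concept axis

def fake_activation(sign):
    """Concept signal of the given sign plus Gaussian noise."""
    return [sign * 2.0 * c + rng.gauss(0.0, 0.5) for c in concept]

pos = [fake_activation(+1) for _ in range(50)]
neg = [fake_activation(-1) for _ in range(50)]

direction = reading_direction(pos, neg)
pos_scores = [score(a, direction) for a in pos]
neg_scores = [score(a, direction) for a in neg]
```

In this toy setup the projections of the two groups separate along the recovered direction, which is the kind of signal a reading pipeline would monitor.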

RepReading and RepControl Experiments

Check out the example frontiers of Representation Engineering (RepE), which contain both RepControl and RepReading implementations. We welcome community contributions as well!
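
As a rough illustration of the control side, the sketch below shifts a hidden-state vector along a fixed direction by a chosen strength. This is one simple form of steering under stated assumptions, not the repe control pipeline itself: the vectors are hand-written toy values and `steer` and `alpha` are hypothetical names.

```python
def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))

def steer(hidden, direction, alpha):
    """Return hidden + alpha * direction (a linear shift along the direction)."""
    return [h + alpha * d for h, d in zip(hidden, direction)]

# Toy values (assumption: in practice the direction would come from a
# reading step and the hidden state from a model layer).
direction = [0.6, 0.8, 0.0]   # unit-norm direction
hidden = [1.0, -1.0, 0.5]

steered = steer(hidden, direction, alpha=2.0)

before = dot(hidden, direction)   # projection before steering
after = dot(steered, direction)   # projection after steering
```

Because the direction has unit norm, the projection onto it moves by exactly `alpha`, while components orthogonal to the direction are left unchanged.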

RepE_eval

We also release RepE_eval, a language model evaluation framework based on RepReading that can serve as an additional baseline alongside zero-shot and few-shot evaluation on standard benchmarks. Please check out our paper for more details.
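
One way to picture a reading-based evaluation on a multiple-choice benchmark is to score each candidate answer's representation along a reading direction and pick the highest-scoring one. The sketch below is a hypothetical illustration of that idea with hand-written toy vectors. It does not use the RepE_eval code, and the data layout (`candidates`, `label`) is an assumption made for this example.

```python
def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))

def pick_answer(candidate_acts, direction):
    """Index of the candidate whose activation projects highest onto direction."""
    scores = [dot(a, direction) for a in candidate_acts]
    return max(range(len(scores)), key=scores.__getitem__)

def accuracy(questions, direction):
    """Fraction of questions where the top-scoring candidate matches the label."""
    correct = sum(
        pick_answer(q["candidates"], direction) == q["label"] for q in questions
    )
    return correct / len(questions)

# Toy benchmark (assumption: real activations would be extracted from a
# model for each question/answer pair).
direction = [1.0, 0.0]
questions = [
    {"candidates": [[0.1, 0.2], [0.9, 0.1], [-0.3, 0.4]], "label": 1},
    {"candidates": [[0.5, 0.0], [0.2, 0.9]], "label": 0},
    {"candidates": [[-1.0, 0.0], [-0.2, 0.3]], "label": 0},
]

acc = accuracy(questions, direction)
```

The resulting accuracy can then be compared against zero-shot and few-shot numbers on the same questions.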

Citation

If you find this useful in your research, please consider citing:

@misc{zou2023transparency,
      title={Representation Engineering: A Top-Down Approach to AI Transparency},
      author={Andy Zou and Long Phan and Sarah Chen and James Campbell and Phillip Guo and Richard Ren and Alexander Pan and Xuwang Yin and Mantas Mazeika and Ann-Kathrin Dombrowski and Shashwat Goel and Nathaniel Li and Michael J. Byun and Zifan Wang and Alex Mallen and Steven Basart and Sanmi Koyejo and Dawn Song and Matt Fredrikson and Zico Kolter and Dan Hendrycks},
      year={2023},
      eprint={2310.01405},
      archivePrefix={arXiv},
      primaryClass={cs.CL}
}

