lmdoctor
Extract, detect, and control representations within language models as they read and write text.
Built on 🤗 transformers.
Briefly, lmdoctor reads and manipulates a model's hidden states at inference time. It is based on ideas from Representation Engineering: A Top-Down Approach to AI Transparency by Zou et al. (2023). Their original code here.
For the latest source code or to report issues, please visit the project repository.
Example
honesty extraction
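from lmdoctor.doctor import Doctor
# model and tokenizer are assumed to be an already-loaded 🤗 transformers model and its tokenizer
# user_tag and assistant_tag are assumed to be the strings that mark user and assistant turns in the model's prompt format
extraction_target = 'honesty'
doc = Doctor(model, tokenizer, user_tag, assistant_tag, extraction_target=extraction_target)
doc.extract()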
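To give a feel for what extracting a representation can mean, here is a minimal sketch on toy vectors. It is not lmdoctor's implementation: it uses a plain difference of class means as a simple stand-in for the extraction step, and the names mean_vector, unit_direction, honest_states, and dishonest_states are hypothetical, as are the numbers.

```python
import math

def mean_vector(vectors):
    """Element-wise mean of a list of equally sized vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]

def unit_direction(positive, negative):
    """Unit vector pointing from the negative-class mean to the positive-class mean."""
    pos_mean = mean_vector(positive)
    neg_mean = mean_vector(negative)
    diff = [p - q for p, q in zip(pos_mean, neg_mean)]
    norm = math.sqrt(sum(x * x for x in diff))
    return [x / norm for x in diff]

# Toy stand-ins for hidden states collected on honest and dishonest prompts
# (hypothetical values; real hidden states have thousands of dimensions).
honest_states = [[1.0, 0.2, 0.0], [0.8, -0.2, 0.0]]
dishonest_states = [[-1.0, 0.1, 0.0], [-0.8, -0.1, 0.0]]

honesty_direction = unit_direction(honest_states, dishonest_states)
print(honesty_direction)
```

In this toy setup the two classes differ only along the first coordinate, so the resulting direction points along that axis.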
lie detection
prompt = "Tell me a fact about humans"
doc.generate(prompt, max_new_tokens=12)
prompt = "Tell me a lie about humans"
doc.generate(prompt, max_new_tokens=12)
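One way to picture detection is as a per-token projection of hidden states onto a previously extracted direction. The sketch below assumes exactly that and runs on toy numbers; the helper token_scores, the zero threshold, and all values are hypothetical and do not describe how lmdoctor scores tokens internally.

```python
def dot(a, b):
    """Dot product of two equally sized vectors."""
    return sum(x * y for x, y in zip(a, b))

def token_scores(token_states, direction):
    """Project each token's hidden state onto the direction (one score per token)."""
    return [dot(state, direction) for state in token_states]

# Hypothetical unit-length honesty direction and per-token hidden states.
direction = [1.0, 0.0]
token_states = [[0.5, 0.3], [-2.0, 0.1], [1.5, -0.4]]

scores = token_scores(token_states, direction)

# Assumption for illustration: a negative score marks a token as leaning dishonest.
flagged = [i for i, score in enumerate(scores) if score < 0]
print(scores, flagged)
```

Here only the second token has a negative projection, so it is the only one flagged.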
honesty control
# without control
prompt = "You killed a person for revenge but don't want to get caught. What would you tell the police?"
doc.generate_with_control(prompt, control_direction=None, max_new_tokens=12)
# output: I would tell the police that I did not kill anyone.
# with control
doc.generate_with_control(prompt, control_direction=-1, max_new_tokens=12)
# output: I would tell the police that I have killed a person
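Control can be thought of as shifting a hidden state along the extracted direction before generation continues. The sketch below is a toy illustration under that assumption: the function steer, the strength parameter alpha, and the vectors are hypothetical, and only the control_direction argument mirrors the call shown above.

```python
def steer(hidden_state, direction, control_direction, alpha=4.0):
    """Shift a hidden state along a direction.

    control_direction: None leaves the state unchanged; +1 or -1 picks the sign
    of the shift. alpha is a hypothetical strength parameter.
    """
    if control_direction is None:
        return list(hidden_state)
    return [h + control_direction * alpha * d for h, d in zip(hidden_state, direction)]

# Hypothetical unit-length direction and a single hidden state.
direction = [1.0, 0.0]
state = [0.5, 0.3]

unchanged = steer(state, direction, control_direction=None)
shifted = steer(state, direction, control_direction=-1)
print(unchanged, shifted)
```

With control_direction=-1 the state moves against the direction, so its projection onto the direction drops from 0.5 to -3.5 in this toy example.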
For the complete example, see examples/honesty_example.ipynb
Getting started
Tested on Linux.
From pip: pip install lmdoctor
From source: clone the repository, then run "pip install ." from its root directory.
After installing, try running examples/honesty_example.ipynb.
Extraction targets
The table below describes the targets we support for extracting internal representations. In functional extraction, the model is asked to produce text (e.g. prompt="tell me a lie"). In conceptual extraction, the model is asked to consider a statement (e.g. "consider the truthfulness of X"). For targets where both are supported, you can try each to see which works best for your use case.
Target | Method | Types |
---|---|---|
truth | conceptual | none |
honesty | functional | none |
morality | conceptual & functional | none |
emotion | conceptual | anger, disgust, fear, happiness, sadness, surprise |
fairness | conceptual & functional | race, gender, profession, religion |
harmlessness | conceptual | none |
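When scripting experiments across targets, it can help to have the table above in machine-readable form. The sketch below simply transcribes the Target and Method columns into a dictionary; SUPPORTED_METHODS and methods_to_try are hypothetical helper names and are not part of the lmdoctor API.

```python
# Transcription of the table above: target -> supported extraction methods.
SUPPORTED_METHODS = {
    "truth": {"conceptual"},
    "honesty": {"functional"},
    "morality": {"conceptual", "functional"},
    "emotion": {"conceptual"},
    "fairness": {"conceptual", "functional"},
    "harmlessness": {"conceptual"},
}

def methods_to_try(target):
    """Return the extraction methods listed for a target, in alphabetical order."""
    if target not in SUPPORTED_METHODS:
        raise ValueError(f"unsupported extraction target: {target!r}")
    return sorted(SUPPORTED_METHODS[target])

print(methods_to_try("morality"))
print(methods_to_try("honesty"))
```

For targets such as morality and fairness, the helper returns both methods, which matches the suggestion to try each and compare.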