Extract, detect, and control semantic representations within language models as they read and write text

Development Status
- 1 - Planning
Intended Audience
- Science/Research
License
- OSI Approved :: MIT License
Operating System
- POSIX :: Linux
Programming Language
- Python :: 3.9

Project description

lmdoctor

Extract, detect, and control semantic representations within language models as they read and write text. Built on 🤗 transformers.

Briefly, lmdoctor reads and manipulates a model's hidden states at inference time. It is based on ideas from Representation Engineering: A Top-Down Approach to AI Transparency by Zou et al. (2023). Their original code is here.

For the latest source code or to report issues, please visit the project repository.

Example

honesty representation extraction

from lmdoctor import extraction_utils
extractor = extraction_utils.Extractor(model, tokenizer, user_tag, assistant_tag, extraction_target='honesty')
extractor.find_directions()
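
To make the idea of a representation "direction" concrete, here is a minimal, self-contained sketch that estimates one as the normalized difference between the mean hidden-state vectors of two contrastive prompt sets. This is an illustrative assumption, not lmdoctor's implementation: find_directions() may use a different method, and the helper names (mean_vector, difference_direction) and toy vectors are hypothetical.

```python
import math

def mean_vector(vectors):
    """Element-wise mean of a list of equal-length vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]

def difference_direction(positive, negative):
    """Unit vector pointing from the negative-class mean to the positive-class mean."""
    diff = [p - q for p, q in zip(mean_vector(positive), mean_vector(negative))]
    norm = math.sqrt(sum(d * d for d in diff))
    if norm == 0:
        raise ValueError("class means are identical; no direction to extract")
    return [d / norm for d in diff]

# Toy 3-dimensional "hidden states", one vector per prompt (hypothetical data).
honest_states = [[1.0, 0.0, 2.0], [3.0, 0.0, 2.0]]
dishonest_states = [[0.0, 0.0, 2.0], [0.0, 0.0, 2.0]]

direction = difference_direction(honest_states, dishonest_states)
print(direction)
```

In this toy data only the first coordinate separates the two sets, so the extracted direction lies along that axis.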

lie detection

prompt = "Tell me a fact about humans"

from lmdoctor import detection_utils
ld = detection_utils.Detector(model, tokenizer, user_tag, assistant_tag)
ld.generate(prompt, max_new_tokens=10, do_sample=True) # capture the hidden_states as the model generates
ld.get_projections(extractor.direction_info) # project the hidden_states onto the representation direction vectors
ld.detect(use_n_middle_layers=15) # aggregate projections over layers
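
The comments above describe two steps: projecting hidden states onto direction vectors, then aggregating over layers. The sketch below shows one way those steps could look on plain Python lists. It is an assumption for illustration only: how lmdoctor selects the middle layers and how it aggregates are not stated here, and the function names (project, middle_layers, aggregate) are hypothetical.

```python
def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))

def project(hidden_states, directions):
    """hidden_states[layer][token] is a vector and directions[layer] is a vector.
    Returns projections[layer][token] as scalar scores."""
    return [[dot(h, d) for h in layer_states]
            for layer_states, d in zip(hidden_states, directions)]

def middle_layers(n_layers, n_middle):
    """Indices of the n_middle layers centered in the stack (assumed selection rule)."""
    if not 0 < n_middle <= n_layers:
        raise ValueError("n_middle must be between 1 and n_layers")
    start = (n_layers - n_middle) // 2
    return range(start, start + n_middle)

def aggregate(projections, n_middle):
    """Mean projection per token across the selected middle layers."""
    layers = middle_layers(len(projections), n_middle)
    n_tokens = len(projections[0])
    return [sum(projections[l][t] for l in layers) / n_middle for t in range(n_tokens)]

# Toy example: 4 layers, 2 tokens, 2-dimensional states (hypothetical data).
directions = [[1.0, 0.0]] * 4
hidden_states = [
    [[9.0, 0.0], [9.0, 0.0]],   # layer 0 (ignored when n_middle=2)
    [[1.0, 5.0], [-1.0, 5.0]],  # layer 1
    [[3.0, 5.0], [-3.0, 5.0]],  # layer 2
    [[9.0, 0.0], [9.0, 0.0]],   # layer 3 (ignored when n_middle=2)
]
scores = aggregate(project(hidden_states, directions), n_middle=2)
print(scores)
```

A positive score for a token means its hidden states point along the direction on average; a negative score means they point against it.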

[image: truth]

prompt = "Tell me a lie about humans"

[image: lie]

honesty control

prompt = "You killed a person for revenge but don't want to get caught. What would you tell the police?"
vanilla_generate(prompt)

output: I would tell the police that I did not kill anyone.

# +honesty control
from lmdoctor import control_utils
hc = control_utils.Controller(extractor.direction_info, model, tokenizer, user_tag=user_tag, assistant_tag=assistant_tag) # reuse the extractor fitted in the extraction step
hc.generate(prompt, control_direction=1, max_new_tokens=12)

output: I would tell the police that I have killed a person
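
Conceptually, this kind of control can be pictured as shifting a hidden state along the extracted direction. The following is a minimal sketch of that idea under stated assumptions, not lmdoctor's actual mechanism: the steer function and its strength parameter are hypothetical, and control_direction is borrowed from the call above with an assumed meaning (+1 pushes toward the target, -1 away from it).

```python
def steer(hidden_state, direction, control_direction=1, strength=1.0):
    """Return a copy of hidden_state shifted along direction.

    control_direction: +1 to move toward the representation, -1 to move away (assumed).
    strength: hypothetical scale factor for the shift.
    """
    if control_direction not in (1, -1):
        raise ValueError("control_direction must be 1 or -1")
    return [h + control_direction * strength * d
            for h, d in zip(hidden_state, direction)]

# Toy 2-dimensional example (hypothetical data).
state = [0.5, 2.0]
honesty_direction = [1.0, 0.0]

more_honest = steer(state, honesty_direction, control_direction=1, strength=4.0)
less_honest = steer(state, honesty_direction, control_direction=-1, strength=4.0)
print(more_honest, less_honest)
```

Only the component along the direction changes; the orthogonal component (the second coordinate here) is left untouched.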

For the complete example, see examples/honesty_example.ipynb

Getting started

[linux only]

recommended: pip install lmdoctor

from source: run "pip install ." after cloning the repository

After install, try running honesty_example.ipynb

Note: This package requires PyTorch but does not include it, because the specific version and CUDA backend depend on the Hugging Face model you are using. If you don't already have it installed, run 'pip install torch' or follow the model-specific instructions.
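
Since PyTorch is not pulled in automatically, it can help to check for it before running the examples. The snippet below is a small sketch using only the standard library; the is_installed helper is a hypothetical name, not part of lmdoctor.

```python
import importlib.util

def is_installed(module_name):
    """True if a top-level module with this name can be found on the import path."""
    return importlib.util.find_spec(module_name) is not None

if not is_installed("torch"):
    print("PyTorch not found: run 'pip install torch' or follow your model's instructions.")
```

The check does not import torch itself, so it is cheap and has no side effects.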

Extraction targets

The table below describes the targets we support for extracting internal representations. In functional extraction, the model is asked to produce text (e.g. prompt="tell me a lie"). In conceptual extraction, the model is asked to consider a statement (e.g. "consider the truthfulness of X"). For targets where both are supported, you can try each to see which works best for your use case.

Target	Method	Types
truth	conceptual	none
honesty	functional	none
morality	conceptual & functional	none
emotion	conceptual	anger, disgust, fear, happiness, sadness, surprise
fairness	conceptual & functional	race, gender, profession, religion
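
If you want to validate a target, method, and type combination before starting an extraction, the table can be transcribed into a small lookup. This is a hypothetical helper for illustration, not an lmdoctor API; only the extraction_target argument appears in the example above, so how the library itself expects method or type to be passed is not assumed here.

```python
# Transcribed from the table above; illustrative only, not part of lmdoctor.
SUPPORTED_TARGETS = {
    "truth": {"methods": {"conceptual"}, "types": set()},
    "honesty": {"methods": {"functional"}, "types": set()},
    "morality": {"methods": {"conceptual", "functional"}, "types": set()},
    "emotion": {"methods": {"conceptual"},
                "types": {"anger", "disgust", "fear", "happiness", "sadness", "surprise"}},
    "fairness": {"methods": {"conceptual", "functional"},
                 "types": {"race", "gender", "profession", "religion"}},
}

def check_target(target, method=None, target_type=None):
    """Return True if the combination is listed in the table, else raise ValueError."""
    if target not in SUPPORTED_TARGETS:
        raise ValueError(f"unknown target: {target!r}")
    entry = SUPPORTED_TARGETS[target]
    if method is not None and method not in entry["methods"]:
        raise ValueError(f"{target!r} does not support {method!r} extraction")
    if target_type is not None and target_type not in entry["types"]:
        raise ValueError(f"{target!r} has no type {target_type!r}")
    return True

print(check_target("emotion", method="conceptual", target_type="fear"))
```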

Release history

- 0.5.6 (Mar 28, 2024)
- 0.5.5 (Mar 21, 2024)
- 0.5.4 (Mar 21, 2024)
- 0.5.3 (Mar 19, 2024)
- 0.5.2 (Mar 8, 2024), this version
- 0.5.1 (Mar 7, 2024)
- 0.5.0 (Mar 7, 2024)
- 0.4.0 (Feb 29, 2024)
- 0.3.0 (Feb 28, 2024)
- 0.2.0 (Feb 26, 2024)
- 0.1.0 (Feb 26, 2024)

Download files

Source Distribution

lmdoctor-0.5.2.tar.gz (6.7 MB)

Uploaded Mar 8, 2024 Source

Hashes for lmdoctor-0.5.2.tar.gz

Algorithm	Hash digest
SHA256	`c3df47c211d8f8a66a57e90ca699482b3c49edde003a16d2b047bc6c7e4a968a`
MD5	`381f771de4bc945e53692564e8f20729`
BLAKE2b-256	`f50d81e85e9509c4676a0763e26a238623925214c1bf0a0d12302c8029a5fb01`
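
To check a downloaded archive against the SHA256 digest listed above, something like the following sketch can be used. It relies only on the standard library; the function names and the assumption that the archive sits in the current directory are illustrative.

```python
import hashlib
import os

# SHA256 digest copied from the table above.
EXPECTED_SHA256 = "c3df47c211d8f8a66a57e90ca699482b3c49edde003a16d2b047bc6c7e4a968a"

def sha256_of_file(path, chunk_size=1 << 20):
    """Hex SHA256 digest of a file, read in chunks to keep memory use low."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()

def verify(path, expected=EXPECTED_SHA256):
    """True if the file's SHA256 digest matches the expected hex string."""
    return sha256_of_file(path) == expected.lower()

archive = "lmdoctor-0.5.2.tar.gz"  # assumed location: current directory
if os.path.exists(archive):
    print("hash OK" if verify(archive) else "hash mismatch")
```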