Implementation of Reinforcement Learning from Human Feedback (RLHF)
InstructGoose
Paper: InstructGPT - Training language models to follow instructions with human feedback
Questions
- In the context of RLHF, how is the value-function loss $L_t^{VF}(\theta)$ calculated?
  - Is it the loss for the function the PPO agent uses to predict how much reward it will get if it generates the sequence?
- Do the RL model and the SFT model use the same tokenizer? Yes.
- I don’t know how to return the logits of the generation model.
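A partial answer to the first question, as a sketch only: in the PPO paper (Schulman et al., 2017) the value-function term is a squared error, $L_t^{VF}(\theta) = (V_\theta(s_t) - V_t^{\text{targ}})^2$, where $V_\theta(s_t)$ is the value head's prediction at token $t$ and $V_t^{\text{targ}}$ is a return target. The snippet below assumes the target is a plain discounted reward-to-go (PPO implementations commonly use GAE-based targets instead) and that the reward model scores only the final token. The names `discounted_returns` and `value_function_loss` are hypothetical and are not part of instruct_goose.

```python
def discounted_returns(rewards, gamma=1.0):
    """Reward-to-go for each position: G_t = r_t + gamma * G_{t+1}."""
    returns, running = [], 0.0
    for r in reversed(rewards):
        running = r + gamma * running
        returns.append(running)
    return returns[::-1]


def value_function_loss(values, targets):
    """Mean over time steps of (V_theta(s_t) - V_t^targ)^2."""
    if len(values) != len(targets):
        raise ValueError("values and targets must have the same length")
    squared_errors = [(v - g) ** 2 for v, g in zip(values, targets)]
    return sum(squared_errors) / len(squared_errors)


# Assumed toy episode: three generated tokens, reward only on the last one.
rewards = [0.0, 0.0, 1.0]
values = [0.5, 0.5, 1.0]  # hypothetical value-head predictions
targets = discounted_returns(rewards, gamma=1.0)
loss = value_function_loss(values, targets)
```

Under these assumptions the value head is trained to predict, at every token, the total reward the rest of the sequence will collect, which matches the intuition in the question above.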
Install
pip install instruct_goose
Resources
I used these resources to implement this project:
- Copied the `load_yaml` function from https://github.com/Dahoas/reward-modeling
- Learned how to build a dataset to train a reward model: https://wandb.ai/carperai/summarize_RLHF/reports/Implementing-RLHF-Learning-to-Summarize-with-trlX--VmlldzozMzAwODM2
- Learned how to add a value head to the PPO agent: https://github.com/lvwerra/trl