# nano-askllm

Unofficial implementation of the Ask-LLM paper "How to Train Data-Efficient LLMs" (arXiv:2402.09668).
## Installation

```shell
pip install nano-askllm
```
## Usage

### Scoring the C4 English dataset with the `flan-t5-small` model

Note: Flan-T5 models cannot properly tokenize multilingual text (e.g. Japanese).
```python
# pip install datasets sentencepiece accelerate

from transformers import T5ForConditionalGeneration, T5Tokenizer
from datasets import load_dataset
from nano_askllm import AskLLM

model_id = "google/flan-t5-small"
tokenizer = T5Tokenizer.from_pretrained(model_id)
model = T5ForConditionalGeneration.from_pretrained(model_id, device_map="auto")

dataset = load_dataset("allenai/c4", "en", split="train", streaming=True)

llm = AskLLM(tokenizer, model)

batch_size = 2
num_ask = 5

for i in range(num_ask):
    datapoints = [item["text"] for item in list(dataset.take(batch_size))]
    scores = llm.ask(datapoints)
    for score, datapoint in zip(scores.tolist(), datapoints):
        text = datapoint[:40].replace("\n", " ")
        print(f"score: {score:.4f}\ttext: {text}")
    dataset = dataset.skip(batch_size)
```
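The scores returned by `llm.ask` can then drive data selection: keep only the top-ranked datapoints and discard the rest. A minimal sketch of that selection step in plain Python (the score values and texts below are made up for illustration, not real model outputs):

```python
# Hypothetical Ask-LLM scores for five datapoints (illustrative values only).
scores = [0.91, 0.13, 0.67, 0.45, 0.88]
datapoints = ["doc_a", "doc_b", "doc_c", "doc_d", "doc_e"]

# Rank (score, datapoint) pairs by descending score and keep the top k.
top_k = 3
ranked = sorted(zip(scores, datapoints), key=lambda pair: pair[0], reverse=True)
selected = [text for _, text in ranked[:top_k]]
print(selected)  # ['doc_a', 'doc_e', 'doc_c']
```

In practice you would accumulate scores over many batches (or stream them to disk) before choosing a selection threshold.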
### Scoring the mC4 Japanese dataset with the `gemma-2b-it` model

Note: Gemma models require tweaking the prompt template and the yes tokens.
```python
# pip install datasets sentencepiece accelerate
# huggingface-cli login

from transformers import AutoModelForCausalLM, AutoTokenizer
from datasets import load_dataset
from nano_askllm import AskLLM

model_id = "google/gemma-2b-it"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")

dataset = load_dataset("allenai/c4", "ja", split="train", streaming=True)

prompt_template_prefix = "###\n"
prompt_template_postfix = """
###
Does the previous paragraph demarcated within ### and ### contain informative signal for pre-training a large-language model? An informative datapoint should be well-formatted, contain some usable knowledge of the world, and strictly NOT have any harmful, racist, sexist, etc. content.
OPTIONS: yes/no
ANSWER:"""

yes_tokens = ["yes", "Yes", "YES", " yes", " Yes", " YES"]

llm = AskLLM(
    tokenizer,
    model,
    prompt_template_prefix=prompt_template_prefix,
    prompt_template_postfix=prompt_template_postfix,
    yes_tokens=yes_tokens,
    max_tokens=512,  # You can increase it up to 8192 for gemma-2b-it.
)

batch_size = 2
num_ask = 5

for i in range(num_ask):
    datapoints = [item["text"] for item in list(dataset.take(batch_size))]
    scores = llm.ask(datapoints)
    for score, datapoint in zip(scores.tolist(), datapoints):
        text = datapoint[:40].replace("\n", " ")
        print(f"score: {score:.4f}\ttext: {text}")
    dataset = dataset.skip(batch_size)
```
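The `yes_tokens` list hints at how the score is computed: following the Ask-LLM paper, the quality score is derived from the probability mass the model puts on "yes"-style answers at the `ANSWER:` position. A rough, self-contained sketch of that idea (not the actual `nano_askllm` code; the tiny vocabulary and logits below are made up for illustration):

```python
import math

# Hypothetical next-token logits over a tiny 5-token vocabulary
# at the "ANSWER:" position of the prompt.
vocab = ["yes", "Yes", "no", "No", "maybe"]
logits = [2.0, 1.0, 0.5, 0.2, -1.0]

# Softmax over the full vocabulary.
exps = [math.exp(x) for x in logits]
total = sum(exps)
probs = [e / total for e in exps]

# The score sums the probabilities of all configured "yes" variants.
yes_tokens = {"yes", "Yes"}
score = sum(p for tok, p in zip(vocab, probs) if tok in yes_tokens)
print(f"{score:.4f}")
```

This is why Gemma needs its own `yes_tokens`: different tokenizers split or case the affirmative answer differently, so the variants to sum over differ per model.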
If you want to see the debug logs, set up the logger as follows:

```python
from logging import DEBUG, StreamHandler, getLogger

logger = getLogger("nano_askllm.askllm")
logger.setLevel(DEBUG)
handler = StreamHandler()
handler.setLevel(DEBUG)
logger.addHandler(handler)
```
## Development

```shell
poetry -V  # Poetry (version 1.5.1)
git clone https://github.com/susumuota/nano-askllm.git
cd nano-askllm
poetry install
poetry run pytest -s     # run pytest once
poetry run -- ptw -- -s  # watch for changes and rerun pytest
```
## Citation

```bibtex
@misc{sachdeva2024train,
      title={How to Train Data-Efficient LLMs},
      author={Noveen Sachdeva and Benjamin Coleman and Wang-Cheng Kang and Jianmo Ni and Lichan Hong and Ed H. Chi and James Caverlee and Julian McAuley and Derek Zhiyuan Cheng},
      year={2024},
      eprint={2402.09668},
      archivePrefix={arXiv},
      primaryClass={cs.LG}
}
```
## License

MIT License. See LICENSE for details.
## TODO
- Add Colab notebook
- Add examples using Hugging Face Datasets