Skip to main content

Unofficial implementation of the Ask-LLM paper 'How to Train Data-Efficient LLMs', arXiv:2402.09668.

Project description

nano-askllm

Unofficial implementation of the Ask-LLM paper 'How to Train Data-Efficient LLMs', arXiv:2402.09668.

PyPI GitHub License GitHub last commit

Ask-LLM prompt

Installation

pip install nano-askllm

Usage

  • Scoring C4 English dataset with flan-t5-small model.

Note: Flan-T5 models cannot tokenize multilingual text properly (e.g. Japanese).

# pip install datasets sentencepiece accelerate

from transformers import T5ForConditionalGeneration, T5Tokenizer
from datasets import load_dataset
from nano_askllm import AskLLM

model_id = "google/flan-t5-small"
tokenizer = T5Tokenizer.from_pretrained(model_id)
model = T5ForConditionalGeneration.from_pretrained(model_id, device_map="auto")

dataset = load_dataset("allenai/c4", "en", split="train", streaming=True)

llm = AskLLM(tokenizer, model)

batch_size = 2
num_ask = 5

for i in range(num_ask):
    datapoints = [item["text"] for item in list(dataset.take(batch_size))]
    scores = llm.ask(datapoints)
    for score, datapoint in zip(scores.tolist(), datapoints):
        text = datapoint[:40].replace("\n", " ")
        print(f"score: {score:.4f}\ttext: {text}")
    dataset = dataset.skip(batch_size)
  • Scoring mC4 Japanese dataset with gemma-2b-it model. gemma models need to tweak the prompt template and the yes tokens.
# pip install datasets sentencepiece accelerate
# hugginface-cli login

from transformers import AutoModelForCausalLM, AutoTokenizer
from datasets import load_dataset
from nano_askllm import AskLLM

model_id = "google/gemma-2b-it"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")

dataset = load_dataset("allenai/c4", "ja", split="train", streaming=True)

prompt_template_prefix = "###\n"
prompt_template_postfix = """
###

Does the previous paragraph demarcated within ### and ### contain informative signal for pre-training a large-language model? An informative datapoint should be well-formatted, contain some usable knowledge of the world, and strictly NOT have any harmful, racist, sexist, etc. content.

OPTIONS: yes/no
ANSWER:"""

yes_tokens = ["yes", "Yes", "YES", " yes", " Yes", " YES"]

llm = AskLLM(
    tokenizer,
    model,
    prompt_template_prefix=prompt_template_prefix,
    prompt_template_postfix=prompt_template_postfix,
    yes_tokens=yes_tokens,
    max_tokens=512,  # You can increase it up to 8192 for gemma-2b-it.
)

batch_size = 2
num_ask = 5

for i in range(num_ask):
    datapoints = [item["text"] for item in list(dataset.take(batch_size))]
    scores = llm.ask(datapoints)
    for score, datapoint in zip(scores.tolist(), datapoints):
        text = datapoint[:40].replace("\n", " ")
        print(f"score: {score:.4f}\ttext: {text}")
    dataset = dataset.skip(batch_size)

If you want to see the debug logs, you can set the logger as follows:

from logging import DEBUG, StreamHandler, getLogger

logger = getLogger("nano_askllm.askllm")
logger.setLevel(DEBUG)
handler = StreamHandler()
handler.setLevel(DEBUG)
logger.addHandler(handler)

Development

poetry -V  # Poetry (version 1.5.1)
git clone https://github.com/susumuota/nano-askllm.git
cd nano-askllm
poetry install
poetry run pytest -s     # run pytest once
poetry run -- ptw -- -s  # watch for changes and run pytest

Citation

@misc{sachdeva2024train,
      title={How to Train Data-Efficient LLMs},
      author={Noveen Sachdeva and Benjamin Coleman and Wang-Cheng Kang and Jianmo Ni and Lichan Hong and Ed H. Chi and James Caverlee and Julian McAuley and Derek Zhiyuan Cheng},
      year={2024},
      eprint={2402.09668},
      archivePrefix={arXiv},
      primaryClass={cs.LG}
}

License

MIT License. See LICENSE for details.

TODO

  • Add Colab notebook
  • Add examples using Hugging Face Datasets

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

nano_askllm-0.2.3.tar.gz (5.5 kB view details)

Uploaded Source

Built Distribution

nano_askllm-0.2.3-py3-none-any.whl (6.4 kB view details)

Uploaded Python 3

File details

Details for the file nano_askllm-0.2.3.tar.gz.

File metadata

  • Download URL: nano_askllm-0.2.3.tar.gz
  • Upload date:
  • Size: 5.5 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: poetry/1.5.1 CPython/3.10.12 Darwin/22.6.0

File hashes

Hashes for nano_askllm-0.2.3.tar.gz
Algorithm Hash digest
SHA256 8e1e20fc2630aca09077fb6b39816dacff1eae3c3c6478d17a4ecb168302ca9d
MD5 f7f86d1e3a54f06481507ddf4699bc63
BLAKE2b-256 b7ef8c0ef48034b9819549e2acbd34731fa9c6a5daf0c9cc2353c38cff522a39

See more details on using hashes here.

File details

Details for the file nano_askllm-0.2.3-py3-none-any.whl.

File metadata

  • Download URL: nano_askllm-0.2.3-py3-none-any.whl
  • Upload date:
  • Size: 6.4 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: poetry/1.5.1 CPython/3.10.12 Darwin/22.6.0

File hashes

Hashes for nano_askllm-0.2.3-py3-none-any.whl
Algorithm Hash digest
SHA256 9de686504c32391e2686239144d97ef91fc46dc64ce1e3dca190186a1b1276cd
MD5 aa5b5b6f12c316c33ddfd6f8b40717cd
BLAKE2b-256 d6f543f19077544fb44a4132d533cfd6d28d02dd5e0e5f4e06c07d3f26c54b20

See more details on using hashes here.

Supported by

AWS AWS Cloud computing and Security Sponsor Datadog Datadog Monitoring Fastly Fastly CDN Google Google Download Analytics Microsoft Microsoft PSF Sponsor Pingdom Pingdom Monitoring Sentry Sentry Error logging StatusPage StatusPage Status page