Find out where your model is perplexed!
perplexed
This library builds on Andrej Karpathy's idea of understanding a model's failure cases by looking at its worst predictions. Specifically, it calculates the perplexity of Large Language Models (LLMs) such as GPT-2 and BERT at the per-token level over a dataset. This shows where the model is perplexed and where it is not, which is useful for debugging and understanding the model.
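To make the per-token idea concrete, here is a minimal sketch that uses only the standard library. It assumes you already have the probability a model assigned to each observed token. The helper name per_token_perplexity and the aggregation rule (exponentiating the mean negative log-likelihood of each distinct token) are illustrative assumptions, not necessarily how perplexed computes its results internally.

```python
import math
from collections import Counter, defaultdict


def per_token_perplexity(tokens, probs):
    """Hypothetical helper: perplexity of each distinct token.

    tokens: the observed tokens, in order.
    probs:  the probability the model assigned to each observed token.
    """
    nll_sum = defaultdict(float)  # summed negative log-likelihood per token
    seen = Counter()              # number of occurrences per token
    for tok, p in zip(tokens, probs):
        nll_sum[tok] += -math.log(p)
        seen[tok] += 1
    # Assumed aggregation: exp of the mean negative log-likelihood.
    return Counter({tok: math.exp(nll_sum[tok] / seen[tok]) for tok in nll_sum})


# Made-up probabilities, for illustration only.
ppl = per_token_perplexity([" the", " cat", " the"], [0.5, 0.01, 0.25])
worst_token, worst_value = ppl.most_common(1)[0]
```

A token the model predicted with low probability ends up with a high perplexity, so sorting the Counter in descending order surfaces the tokens the model struggled with most.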
Install
pip install perplexed
How to use
Using the API
perplexed is designed to work with the HuggingFace ecosystem and is built on top of the transformers and datasets libraries. The API is designed to be simple and easy to use. The main function is perplexed, which takes in a model, dataset, and tokenizer and returns a simple Counter object with the perplexity of each token in the dataset. Here is an example of how to use it:
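from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer

from perplexed.core import perplexed

# GPT-Neo's tokenizer has no pad token by default, so reuse the end-of-sequence token
tokenizer = AutoTokenizer.from_pretrained("EleutherAI/gpt-neo-125M")
tokenizer.pad_token = tokenizer.eos_token
model = AutoModelForCausalLM.from_pretrained("EleutherAI/gpt-neo-125M")
# use the first 100 rows of the WikiText-2 test split
dataset = load_dataset("wikitext", "wikitext-2-raw-v1", split="test").select(range(100))
# filter out empty strings
dataset = dataset.filter(lambda x: len(x["text"]) > 0)
perplexity_cnt = perplexed(model, dataset, tokenizer=tokenizer, column="text", batch_size=1, device="cpu")
# show the 10 tokens with the highest perplexity
perplexity_cnt.most_common(10)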
[(' wired', 60983688.0),
(' 768', 21569838.0),
(' shatter', 12281687.0),
(' unsett', 8289435.0),
(' ignited', 6605209.0),
(' Tanz', 4834899.0),
(' Influence', 4153321.75),
(' Career', 4064189.0),
(' Television', 2325870.75),
(' Moral', 2243574.5)]
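Because the result is a standard collections.Counter, the usual Counter and dict idioms apply to it. The sketch below rebuilds a small Counter by hand from a few of the values printed above, as a stand-in for the real return value; the 1e7 threshold is an arbitrary choice for illustration.

```python
from collections import Counter

# Stand-in for the value returned by perplexed(...), copied from the output above.
perplexity_cnt = Counter({
    " wired": 60983688.0,
    " 768": 21569838.0,
    " shatter": 12281687.0,
    " Moral": 2243574.5,
})

# most_common() with no argument sorts every entry in descending order,
# so a reversed slice yields the lowest-perplexity tokens first.
least_perplexing = perplexity_cnt.most_common()[:-3:-1]

# Keep only the tokens whose perplexity exceeds an (arbitrary) threshold.
above_threshold = {tok: ppl for tok, ppl in perplexity_cnt.items() if ppl > 1e7}
```

Looking at both ends of the ranking helps separate tokens the model consistently struggles with from tokens it predicts comfortably.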