A library for manipulating an existing tokenizer.
Tokens-Deletion
A Python library for manipulating an existing tokenizer. The solution was tested on the Llama3-8B tokenizer.
Installation:
Installation from PyPI:
pip install tokenizerchanger
Usage:
changer = TokenizerChanger(tokenizer)
Creates a TokenizerChanger object. The class requires an existing tokenizer that will be changed, e.g. a PreTrainedTokenizerFast from the 🤗 tokenizers library.
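A minimal setup sketch. The import path and the use of transformers' AutoTokenizer are assumptions for illustration, not part of the documented API:

from transformers import AutoTokenizer
from tokenizerchanger import TokenizerChanger  # assumed import path

# Any fast (Rust-backed) tokenizer works; Llama3-8B is gated, so a freely
# available fast tokenizer is loaded here for illustration.
tokenizer = AutoTokenizer.from_pretrained("gpt2")
changer = TokenizerChanger(tokenizer)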
Deletion:
changer.delete_k_least_frequent_tokens(k=1000)
changer.delete_k_least_frequent_tokens(k=1000, exclude=list_of_tokens)
Deletes the k least frequent tokens. The exclude argument lists tokens that will be ignored during the deletion of the least frequent tokens.
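For example, a hedged sketch that protects special tokens from removal (the token strings are illustrative):

# Remove the 1000 least frequent tokens, but never touch the special tokens
changer.delete_k_least_frequent_tokens(
    k=1000,
    exclude=["<|begin_of_text|>", "<|end_of_text|>"],
)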
changer.delete_unwanted_tokens(list_of_unwanted_tokens)
Deletes all tokens from list_of_unwanted_tokens from the tokenizer.
changer.delete_tokens(list_of_unwanted_tokens)
Deletes exactly the list of unwanted tokens, in contrast to the delete_unwanted_tokens function, which deletes all tokens from the list and any tokens that contain an unwanted token as a substring.
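A short sketch of the difference; the token strings are made up for illustration:

# Removes "foo" and any token containing it, e.g. "foobar"
changer.delete_unwanted_tokens(["foo"])
# Removes only the exact token "foo"
changer.delete_tokens(["foo"])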
changer.delete_overlaps(vocab)
Finds and deletes all intersections of the tokenizer's vocabulary and the vocab variable from the tokenizer. Note that vocab must be a dict.
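A hedged sketch, assuming vocab maps token strings to ids as in a tokenizer's vocab file:

other_vocab = {"hello": 42, "world": 43}  # hypothetical dict vocabulary
changer.delete_overlaps(other_vocab)  # drops "hello"/"world" if the tokenizer has them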
changer.delete_inappropriate_merges(vocab)
Deletes all merges from the tokenizer which contradict the vocab variable. Note that vocab must be a list[str].
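A sketch under the assumption that merges producing tokens outside the given vocabulary are the ones considered contradictory:

allowed = ["h", "e", "he"]  # hypothetical list[str] vocabulary
changer.delete_inappropriate_merges(allowed)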
Addition:
The idea of creating such functions arose because the built-in functions do not add tokens/merges properly when some tokens have been deleted. As a result, encoding the same text can yield more tokens, even after the necessary tokens have been added.
changer.add_tokens(list_of_tokens)
Adds the tokens from the list. The token ids are assigned automatically.
changer.add_merges(list_of_merges)
Adds the merges from the list.
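A sketch of both calls; the token and merge strings are illustrative, and the space-separated merge format is an assumption based on the usual BPE merges file:

changer.add_tokens(["mytoken", "anothertoken"])  # ids are assigned automatically
changer.add_merges(["my token"])  # one merge per string, assumed space-separated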
"Get" functions:
changer.get_overlapping_tokens(vocab)
Returns the intersection between the tokenizer's vocabulary and the vocab variable. Note that vocab must be a dict.
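An illustrative sketch with a made-up vocabulary:

overlap = changer.get_overlapping_tokens({"hello": 0, "brandnewtoken": 1})
# Only entries already present in the tokenizer's vocabulary are returned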
changer.get_overlapping_megres(merges)
Returns the intersection between the tokenizer's merges and the merges variable. Note that merges must be a list.
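An illustrative sketch with made-up merges:

overlap = changer.get_overlapping_megres(["h e", "l o"])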
Saving:
changer.save_tokenizer(path)
Saves the current state of the changed tokenizer. Additionally, it saves the tokenizer configs into the path folder (./updated_tokenizer by default).
tokenizer = changer.updated_tokenizer()
Returns the changed tokenizer.
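Putting it together, a minimal end-to-end sketch (the save path is illustrative):

changer.delete_k_least_frequent_tokens(k=1000)
changer.save_tokenizer("./my_tokenizer")  # also writes the tokenizer configs
tokenizer = changer.updated_tokenizer()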