Library for manipulating the existing tokenizer.
Tokenizer-Changer
Python script for manipulating the existing tokenizer.
The solution was tested on the Llama3-8B tokenizer.
Installation:
Installation from PyPI:
pip install tokenizerchanger
Usage:
changer = TokenizerChanger(tokenizer)
Creates a TokenizerChanger object. It requires an existing tokenizer that will be changed, e.g. an instance of the PreTrainedTokenizerFast class from the 🤗 transformers library.
Deletion:
changer.delete_k_least_frequent_tokens(k=1000)
changer.delete_k_least_frequent_tokens(k=1000, exclude=list_of_tokens)
Deletes the k least frequent tokens. The exclude argument lists tokens that are skipped during the deletion, so they are kept even if they are among the least frequent.
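The selection rule described above can be illustrated with a small standalone sketch. It does not use the library: how TokenizerChanger measures token frequency is not stated here, so the sketch assumes a plain mapping from token to count, and the helper name `k_least_frequent` is hypothetical.

```python
from collections import Counter


def k_least_frequent(counts, k, exclude=()):
    """Return up to k tokens with the lowest counts, skipping excluded tokens.

    `counts` is an assumed token -> count mapping, not a library structure.
    """
    excluded = set(exclude)
    candidates = [
        (count, token) for token, count in counts.items() if token not in excluded
    ]
    # Sort by count first, then by token text so ties are deterministic.
    candidates.sort()
    return [token for _, token in candidates[:k]]


counts = Counter({"the": 120, "ing": 45, "zq": 1, "xj": 2, "<pad>": 0})

# "<pad>" has the lowest count but is protected by `exclude`.
print(k_least_frequent(counts, k=2, exclude=["<pad>"]))  # ['zq', 'xj']
```

Special tokens such as padding or end-of-text markers are typical candidates for `exclude`, since they may be rare in a corpus but are still needed.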
changer.delete_tokens(list_of_unwanted_tokens, include_substrings)
Deletes the unwanted tokens from the tokenizer. If include_substrings is True, every token that contains an unwanted token as a substring is deleted as well. Defaults to True.
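The difference between the two settings of include_substrings can be shown on a toy vocabulary. This is a sketch of the described behaviour, not the library's code: the helper `remove_tokens` and the dict-based vocabulary are assumptions made for illustration.

```python
def remove_tokens(vocab, unwanted, include_substrings=True):
    """Return a copy of `vocab` (token -> id) without the unwanted tokens."""
    unwanted = set(unwanted)

    def should_drop(token):
        if token in unwanted:
            return True
        # With include_substrings, any token containing an unwanted token goes too.
        return include_substrings and any(u in token for u in unwanted)

    return {tok: idx for tok, idx in vocab.items() if not should_drop(tok)}


vocab = {"cat": 0, "cats": 1, "dog": 2, "concat": 3}

exact_only = remove_tokens(vocab, ["cat"], include_substrings=False)
with_substrings = remove_tokens(vocab, ["cat"], include_substrings=True)

print(exact_only)       # {'cats': 1, 'dog': 2, 'concat': 3}
print(with_substrings)  # {'dog': 2}
```

The sketch leaves the remaining ids untouched; whether the library renumbers ids after deletion is not covered here.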
changer.delete_overlaps(vocab)
Finds all tokens shared by the tokenizer's vocabulary and the vocab variable, and deletes them from the tokenizer. Note that vocab should be a dict.
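As a rough picture of what removing an overlap means, the following sketch drops every token that also appears as a key in a second vocabulary. The function `drop_overlaps` is hypothetical and works on plain dicts rather than on a tokenizer object.

```python
def drop_overlaps(tokenizer_vocab, other_vocab):
    """Return tokenizer_vocab without the tokens that are also keys of other_vocab."""
    shared = tokenizer_vocab.keys() & other_vocab.keys()
    return {tok: idx for tok, idx in tokenizer_vocab.items() if tok not in shared}


tokenizer_vocab = {"hello": 0, "world": 1, "foo": 2}
other_vocab = {"foo": 10, "bar": 11}

# Only the token text is compared; the ids in other_vocab play no role.
print(drop_overlaps(tokenizer_vocab, other_vocab))  # {'hello': 0, 'world': 1}
```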
changer.delete_inappropriate_merges(vocab)
Deletes all merges from the tokenizer that contradict the vocab variable. Note that vocab should be a list[str].
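What counts as a contradiction is not spelled out above. One plausible reading, used in this sketch as an explicit assumption, is that a merge is consistent only if both of its parts and the merged result are present in the vocabulary. Merges are assumed to be space-separated strings such as "lo w", and `consistent_merges` is a hypothetical helper.

```python
def consistent_merges(merges, vocab):
    """Keep only merges whose parts and merged result all appear in vocab.

    Assumes each merge is a string "left right" with a single space.
    """
    allowed = set(vocab)
    kept = []
    for merge in merges:
        left, right = merge.split(" ")
        if left in allowed and right in allowed and left + right in allowed:
            kept.append(merge)
    return kept


merges = ["l o", "lo w", "e r", "low er"]
vocab = ["l", "o", "w", "e", "r", "lo", "low"]

# "e r" would produce "er", which is missing; "low er" needs "er" as an input.
print(consistent_merges(merges, vocab))  # ['l o', 'lo w']
```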
Addition:
These functions exist because the built-in functions do not add tokens and merges properly once some tokens have been deleted. As a result, the same text can encode to more tokens than before, even if the necessary tokens have been added.
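A toy BPE encoder helps show why a token in the vocabulary is not enough on its own. In this simplified sketch (not the library's or 🤗 tokenizers' implementation), segmentation is driven only by the merge list, so a word is split into more pieces whenever the merge that builds a token is missing. The function `bpe_encode` is hypothetical.

```python
def bpe_encode(word, merges):
    """Greedy toy BPE: repeatedly apply the highest-priority applicable merge."""
    rank = {pair: i for i, pair in enumerate(merges)}
    symbols = list(word)
    while len(symbols) > 1:
        pairs = [(symbols[i], symbols[i + 1]) for i in range(len(symbols) - 1)]
        best = min(pairs, key=lambda p: rank.get(p, float("inf")))
        if best not in rank:
            break  # no known merge applies any more
        merged, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                merged.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                merged.append(symbols[i])
                i += 1
        symbols = merged
    return symbols


full_merges = [("l", "o"), ("lo", "w")]
pruned_merges = [("l", "o")]  # the merge that builds "low" is gone

print(bpe_encode("low", full_merges))    # ['low']
print(bpe_encode("low", pruned_merges))  # ['lo', 'w']
```

Putting "low" back into a vocabulary would not change the second result in this sketch, because the encoder never reaches "low" without the ("lo", "w") merge. That is the motivation for adding tokens and merges together.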
changer.add_tokens(list_of_tokens)
Adds the tokens from the list. The indexes will be filled automatically.
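The text above only says that indexes are filled automatically. One possible policy, shown here purely as an assumption, is to reuse the lowest free ids first (for example gaps left by earlier deletions) and then continue past the end. The helper `add_tokens_sketch` is hypothetical and operates on a plain dict.

```python
def add_tokens_sketch(vocab, new_tokens):
    """Return a copy of vocab with new tokens assigned the lowest unused ids."""
    updated = dict(vocab)
    used = set(updated.values())
    next_id = 0
    for token in new_tokens:
        if token in updated:
            continue  # already present, keep its existing id
        while next_id in used:
            next_id += 1
        updated[token] = next_id
        used.add(next_id)
    return updated


vocab = {"a": 0, "b": 1, "d": 3}  # id 2 is free, e.g. after a deletion

result = add_tokens_sketch(vocab, ["c", "e", "a"])
print(result)  # {'a': 0, 'b': 1, 'd': 3, 'c': 2, 'e': 4}
```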
changer.add_merges(list_of_merges)
Adds the merges from the list.
"Get" functions:
changer.get_overlapping_tokens(vocab)
Returns the intersection between the tokenizer's vocabulary and the vocab variable. Note that vocab should be a dict.
changer.get_overlapping_megres(merges)
Returns the intersection between the tokenizer's merges and the merges variable. Note that merges should be a list.
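Both "get" functions compute an intersection. The sketch below shows the idea on plain Python data; the return types of the real methods are not documented here, so the lists returned by these hypothetical helpers are an assumption.

```python
def overlapping_tokens(tokenizer_vocab, other_vocab):
    """Tokens that are keys in both dict vocabularies, in sorted order."""
    return sorted(tokenizer_vocab.keys() & other_vocab.keys())


def overlapping_merges(tokenizer_merges, other_merges):
    """Merges present in both lists, keeping the tokenizer's merge order."""
    other = set(other_merges)
    return [merge for merge in tokenizer_merges if merge in other]


tokenizer_vocab = {"l": 0, "o": 1, "lo": 2, "w": 3}
other_vocab = {"lo": 7, "w": 8, "x": 9}

tokenizer_merges = ["l o", "lo w", "e r"]
other_merges = ["e r", "l o"]

print(overlapping_tokens(tokenizer_vocab, other_vocab))    # ['lo', 'w']
print(overlapping_merges(tokenizer_merges, other_merges))  # ['l o', 'e r']
```

Keeping the tokenizer's order for merges matters in this sketch because merge order encodes priority in BPE.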
Saving:
changer.save_tokenizer(path)
Saves the current state of the changed tokenizer. It also saves the tokenizer configs into the path folder (./updated_tokenizer by default).
tokenizer = changer.updated_tokenizer()
Returns the changed tokenizer.
Hashes for TokenizerChanger-0.2.1-py3-none-any.whl
Algorithm | Hash digest |
---|---|
SHA256 | d910830e045946152ead2efbc45030bc1cdda85444f50bf245e2ee0d289df279 |
MD5 | 4c41d917c5cda30c7f38171232132812 |
BLAKE2b-256 | 6b93d737d751a56d8ee05abe814106010e2f915030b52ee556ba3575e3137c61 |