Tokenizer Adapter
A simple tool for adapting a pre-trained Huggingface model to a new vocabulary with (almost) no training.
This technique can significantly reduce sequence lengths when a language model is applied to data with a specialized vocabulary (biology, medicine, law, other languages, etc.).
It should work with most Huggingface Hub language models (further testing is needed).
Everything runs on CPU.
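For example, a tokenizer trained on in-domain text encodes that text into fewer pieces than a generic one. A minimal sketch of the effect (the corpus is illustrative and far too small for real use; exact token counts depend on the data):

from transformers import AutoTokenizer

# Illustrative in-domain corpus; use a large real corpus in practice
corpus = ["Le patient présente une hypertension artérielle pulmonaire."] * 100

tokenizer = AutoTokenizer.from_pretrained("camembert-base")
new_tokenizer = tokenizer.train_new_from_iterator(corpus, vocab_size=300)

sentence = "Le patient présente une hypertension artérielle pulmonaire."
print(len(tokenizer(sentence)["input_ids"]))      # generic vocabulary: more, shorter pieces
print(len(new_tokenizer(sentence)["input_ids"]))  # domain vocabulary: fewer, longer pieces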
Install
pip install tokenizer-adapter --upgrade
Usage
It is recommended to train the new vocabulary from an existing tokenizer; the best and easiest way is to use the tokenizer.train_new_from_iterator(...) method.
from tokenizer_adapter import TokenizerAdapter
from transformers import AutoTokenizer, AutoModelForMaskedLM
BASE_MODEL_PATH = "camembert-base"
# A simple corpus
corpus = ["A first sentence", "A second sentence", "blablabla"]
# Load model and tokenizer
model = AutoModelForMaskedLM.from_pretrained(BASE_MODEL_PATH)
tokenizer = AutoTokenizer.from_pretrained(BASE_MODEL_PATH)
# Train new vocabulary from the old tokenizer
new_tokenizer = tokenizer.train_new_from_iterator(corpus, vocab_size=300)
# Default params should work in most cases
adapter = TokenizerAdapter()
# Patch the model with the new tokenizer
model = adapter.adapt_from_pretrained(new_tokenizer, model, tokenizer)
# Save the model and the new tokenizer
model.save_pretrained("my_new_model/")
new_tokenizer.save_pretrained("my_new_model/")
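The patched model can then be reloaded like any other Huggingface model. A quick sanity check (illustrative only; with such a tiny corpus and vocabulary the predictions will not be meaningful):

from transformers import pipeline

# Load the adapted model and its new tokenizer back from disk
fill_mask = pipeline("fill-mask", model="my_new_model/")
print(fill_mask(f"A first {fill_mask.tokenizer.mask_token}"))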
To rely on a custom tokenizer (experimental), you may need to use the custom_preprocessing argument.
Example using a RoBERTa-style tokenizer (similar to Phi-2's) for a CamemBERT model:
from tokenizer_adapter import TokenizerAdapter
from transformers import AutoTokenizer, AutoModelForMaskedLM
BASE_MODEL_PATH = "camembert-base"
NEW_CUSTOM_TOKENIZER = "roberta-base"
# A simple corpus
corpus = ["A first sentence", "A second sentence", "blablabla"]
# Load model and tokenizer
model = AutoModelForMaskedLM.from_pretrained(BASE_MODEL_PATH)
tokenizer = AutoTokenizer.from_pretrained(BASE_MODEL_PATH)
# Also load this custom tokenizer to train the new one
new_tokenizer = AutoTokenizer.from_pretrained(NEW_CUSTOM_TOKENIZER)
new_tokenizer = new_tokenizer.train_new_from_iterator(corpus, vocab_size=300)
# CamemBERT tokenizer relies on '▁' while the RoBERTa one relies on 'Ġ'
adapter = TokenizerAdapter(custom_preprocessing=lambda x: x.replace('Ġ', '▁'))
# Patch the model with the new tokenizer
model = adapter.adapt_from_pretrained(new_tokenizer, model, tokenizer)
# Save the model and the new tokenizer
model.save_pretrained("my_new_model/")
new_tokenizer.save_pretrained("my_new_model/")
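To see why the replacement is needed, compare how the two tokenizer families mark word boundaries (exact splits depend on each vocabulary):

from transformers import AutoTokenizer

camembert_tok = AutoTokenizer.from_pretrained("camembert-base")
roberta_tok = AutoTokenizer.from_pretrained("roberta-base")

# CamemBERT (SentencePiece) prefixes word-initial pieces with '▁',
# RoBERTa (byte-level BPE) prefixes non-initial words with 'Ġ'.
print(camembert_tok.tokenize("une phrase"))
print(roberta_tok.tokenize("une phrase"))

From the example above, custom_preprocessing appears to rewrite each token string of the new vocabulary into the base tokenizer's convention ('Ġ' to '▁' here) so the two vocabularies can be matched; check the project source before relying on this for other tokenizer families.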
Hashes for tokenizer_adapter-0.1.2-py3-none-any.whl

Algorithm | Hash digest
---|---
SHA256 | a378326ebe710255705be5689d6d4664dfc710bd210cd4e5a8df504fcfd73f16
MD5 | d63edd727fd72633760e036d0109127d
BLAKE2b-256 | 15f72ac73bcf4aa9424170df21cc56bf007c1df4a42deb6e28f2c265ad567642