A simple tool to adapt a pretrained language model to a new vocabulary
Project description
Tokenizer Adapter
A simple tool for adapting a pre-trained Huggingface model to a new vocabulary with (almost) no training.
This technique can significantly reduce sequence lengths when a language model is used on data with a domain-specific vocabulary (biology, medicine, law, other languages, etc.).
It should work for most Huggingface Hub language models (requires further testing).
Everything is run on CPU.
Install
pip install tokenizer-adapter --upgrade
Usage
It is recommended to use an existing tokenizer to train the new vocabulary.
The best and easiest way is to use the tokenizer.train_new_from_iterator(...) method.
from tokenizer_adapter import TokenizerAdapter
from transformers import AutoTokenizer, AutoModelForMaskedLM
BASE_MODEL_PATH = "camembert-base"
# A simple corpus
corpus = ["A first sentence", "A second sentence", "blablabla"]
# Load model and tokenizer
model = AutoModelForMaskedLM.from_pretrained(BASE_MODEL_PATH)
tokenizer = AutoTokenizer.from_pretrained(BASE_MODEL_PATH)
# Train new vocabulary from the old tokenizer
new_tokenizer = tokenizer.train_new_from_iterator(corpus, vocab_size=300)
# Default params should work in most cases
adapter = TokenizerAdapter()
# Patch the model with the new tokenizer
model = adapter.adapt_from_pretrained(new_tokenizer, model, tokenizer)
# Save the model and the new tokenizer
model.save_pretrained("my_new_model/")
new_tokenizer.save_pretrained("my_new_model/")
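As a quick sanity check (a minimal sketch, assuming the corpus, the original tokenizer and the my_new_model/ directory from the snippet above), you can reload the adapted model and compare sequence lengths between the old and the new tokenizer:
from transformers import AutoTokenizer, AutoModelForMaskedLM
# Reload the adapted model and its new tokenizer
new_model = AutoModelForMaskedLM.from_pretrained("my_new_model/")
new_tokenizer = AutoTokenizer.from_pretrained("my_new_model/")
text = "A first sentence"
# On in-domain text, the new vocabulary should usually produce fewer tokens
print(len(tokenizer(text)["input_ids"]), len(new_tokenizer(text)["input_ids"]))
# The adapted model is used like any other Huggingface model
outputs = new_model(**new_tokenizer(text, return_tensors="pt"))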
To rely on a custom tokenizer (experimental), you may need to use the custom_preprocessing argument.
Example using a RoBERTa-style tokenizer (similar to Phi-2's) for a CamemBERT model:
from tokenizer_adapter import TokenizerAdapter
from transformers import AutoTokenizer, AutoModelForMaskedLM
BASE_MODEL_PATH = "camembert-base"
NEW_CUSTOM_TOKENIZER = "roberta-base"
# A simple corpus
corpus = ["A first sentence", "A second sentence", "blablabla"]
# Load model and tokenizer
model = AutoModelForMaskedLM.from_pretrained(BASE_MODEL_PATH)
tokenizer = AutoTokenizer.from_pretrained(BASE_MODEL_PATH)
# Also load this custom tokenizer to train the new one
new_tokenizer = AutoTokenizer.from_pretrained(NEW_CUSTOM_TOKENIZER)
new_tokenizer = new_tokenizer.train_new_from_iterator(corpus, vocab_size=300)
# CamemBERT tokenizer relies on '▁' while the RoBERTa one relies on 'Ġ'
adapter = TokenizerAdapter(custom_preprocessing=lambda x: x.replace('Ġ', '▁'))
# Patch the model with the new tokenizer
model = adapter.adapt_from_pretrained(new_tokenizer, model, tokenizer)
# Save the model and the new tokenizer
model.save_pretrained("my_new_model/")
new_tokenizer.save_pretrained("my_new_model/")
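The reason for the mapping above (a minimal illustration, not part of the library): the two tokenizer families mark word boundaries with different prefix characters, so new-vocabulary tokens must be rewritten before they can be matched against the base model's vocabulary.
from transformers import AutoTokenizer
camembert_tok = AutoTokenizer.from_pretrained("camembert-base")
roberta_tok = AutoTokenizer.from_pretrained("roberta-base")
# CamemBERT (SentencePiece) prefixes word-initial pieces with '▁',
# RoBERTa (byte-level BPE) prefixes them with 'Ġ'
print(camembert_tok.tokenize("first sentence"))  # word-initial pieces carry the '▁' prefix
print(roberta_tok.tokenize("first sentence"))    # non-initial words carry the 'Ġ' prefix
# custom_preprocessing=lambda x: x.replace('Ġ', '▁') rewrites each new-vocabulary
# token into the base model's convention before lookup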
Project details
Release history
Download files
Source Distribution
tokenizer-adapter-0.1.2.tar.gz (8.9 kB)
Built Distribution
tokenizer_adapter-0.1.2-py3-none-any.whl (9.3 kB)
File details
Details for the file tokenizer-adapter-0.1.2.tar.gz.
File metadata
- Download URL: tokenizer-adapter-0.1.2.tar.gz
- Upload date:
- Size: 8.9 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/4.0.2 CPython/3.8.18
File hashes
Algorithm | Hash digest
---|---
SHA256 | bf0c3b859f78f941398b68b22d2c24346cbeecc421a37fa3dc3b932eb608dcbd
MD5 | 3578d6cafa4c71c6038cb8a8876606a4
BLAKE2b-256 | e8c2a92888effcc9bf63a38c6d8912647616519209a35a2ee4d9eabb50aa1035
File details
Details for the file tokenizer_adapter-0.1.2-py3-none-any.whl.
File metadata
- Download URL: tokenizer_adapter-0.1.2-py3-none-any.whl
- Upload date:
- Size: 9.3 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/4.0.2 CPython/3.8.18
File hashes
Algorithm | Hash digest
---|---
SHA256 | a378326ebe710255705be5689d6d4664dfc710bd210cd4e5a8df504fcfd73f16
MD5 | d63edd727fd72633760e036d0109127d
BLAKE2b-256 | 15f72ac73bcf4aa9424170df21cc56bf007c1df4a42deb6e28f2c265ad567642