Tools to adapt a pretrained model to a new vocabulary
Tokenizer Adapter
A simple tool to adapt a pretrained Hugging Face model to a new, domain-specific vocabulary with (almost) no training.
It should work with most language models from the Hugging Face Hub, but more testing is needed.
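To give an intuition for how a model can be moved to a new vocabulary without training, here is a minimal sketch of one common approach: each new token is split with the old tokenizer, and its embedding is set to the average of the old embeddings of those pieces. This is an illustration under assumptions, not necessarily what this package does internally. The names `init_new_embeddings`, `toy_old_tokenize`, `OLD_VOCAB` and `OLD_EMBEDDINGS` are hypothetical, and the toy greedy tokenizer stands in for a real one.

```python
from typing import Callable, List

# Hypothetical toy data: an "old" vocabulary and its 2-dimensional embeddings.
OLD_VOCAB = {"to": 0, "ken": 1, "izer": 2}
OLD_EMBEDDINGS = [[1.0, 0.0], [0.0, 1.0], [2.0, 2.0]]


def toy_old_tokenize(text: str) -> List[int]:
    """Greedy longest-match tokenizer over OLD_VOCAB (stand-in for a real tokenizer)."""
    ids = []
    pos = 0
    while pos < len(text):
        for end in range(len(text), pos, -1):
            piece = text[pos:end]
            if piece in OLD_VOCAB:
                ids.append(OLD_VOCAB[piece])
                pos = end
                break
        else:
            pos += 1  # no known piece starts here: skip one character
    return ids


def init_new_embeddings(
    new_vocab: List[str],
    old_tokenize: Callable[[str], List[int]],
    old_embeddings: List[List[float]],
) -> List[List[float]]:
    """Build one embedding row per new token by averaging old embedding rows."""
    dim = len(old_embeddings[0])
    new_embeddings = []
    for token in new_vocab:
        old_ids = old_tokenize(token)
        if not old_ids:
            # Nothing to average: fall back to a zero vector (an arbitrary choice).
            new_embeddings.append([0.0] * dim)
            continue
        rows = [old_embeddings[i] for i in old_ids]
        new_embeddings.append([sum(col) / len(rows) for col in zip(*rows)])
    return new_embeddings


new_embeddings = init_new_embeddings(["token", "izer"], toy_old_tokenize, OLD_EMBEDDINGS)
```

In this toy setup, "token" is split into "to" and "ken" by the old tokenizer, so its new embedding is the mean of those two rows, while "izer" already exists and keeps its old embedding.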
Install
pip install tokenizer-adapter
Usage
from tokenizer_adapter import TokenizerAdapter
from transformers import AutoTokenizer, AutoModelForMaskedLM
BASE_MODEL_PATH = "camembert-base"
# A simple corpus
corpus = ["A first sentence", "A second sentence", "blablabla"]
# Load model and tokenizer
model = AutoModelForMaskedLM.from_pretrained(BASE_MODEL_PATH)
tokenizer = AutoTokenizer.from_pretrained(BASE_MODEL_PATH)
# Train new vocabulary from the old tokenizer
new_tokenizer = tokenizer.train_new_from_iterator(corpus, vocab_size=300)
# Default params should work in most cases
adapter = TokenizerAdapter()
# Patch the model with the new tokenizer
model = adapter.adapt_from_pretrained(new_tokenizer, model, tokenizer)
# Save the model and the new tokenizer
model.save_pretrained("my_new_model/")
new_tokenizer.save_pretrained("my_new_model/")
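After patching, a quick sanity check can confirm that every id the new tokenizer can produce has a row in the model's input embedding matrix. The helper below is a hypothetical sketch, not part of this package. It relies only on `get_input_embeddings()` (available on Transformers models) and `len(tokenizer)`, and it assumes the embedding layer exposes a `weight` with a `shape`, as a PyTorch embedding layer does.

```python
def every_token_has_embedding(model, tokenizer) -> bool:
    """Return True if the input embedding matrix has at least one row per token id.

    Hypothetical helper: `model` is expected to provide get_input_embeddings(),
    and `tokenizer` is expected to support len().
    """
    num_rows = model.get_input_embeddings().weight.shape[0]
    return num_rows >= len(tokenizer)


# Example use with the objects from the snippet above (not executed here):
# assert every_token_has_embedding(model, new_tokenizer)
```

If the check fails, token ids at the end of the new vocabulary would index past the embedding matrix, so it is worth running once before saving or fine-tuning.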
File details
Details for the file tokenizer-adapter-0.1.0.tar.gz.
File metadata
- Download URL: tokenizer-adapter-0.1.0.tar.gz
- Upload date:
- Size: 7.9 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/4.0.2 CPython/3.8.18
File hashes
Algorithm | Hash digest
---|---
SHA256 | b0fc0884e490d136b9bd587686ae8570498d5047a1edb61ac5915c7a2edd1546
MD5 | 8141e64a5299ff46a80d762fedde78e9
BLAKE2b-256 | b61ff0f72e7d90e5e0332854a62d28abfed6630f65af3149ada1b4c2352d6b5e
File details
Details for the file tokenizer_adapter-0.1.0-py3-none-any.whl.
File metadata
- Download URL: tokenizer_adapter-0.1.0-py3-none-any.whl
- Upload date:
- Size: 8.5 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/4.0.2 CPython/3.8.18
File hashes
Algorithm | Hash digest
---|---
SHA256 | e4eb8756f313015bb5c412297683e79545058b015c33e545b18561c7558f318d
MD5 | 445944a317ce0bbacfa93e49512b6f6b
BLAKE2b-256 | c808b1136d77d00fc3cb8b29d4fc6bff39b39ba242a222c50ee65f6136737d24