A simple tool to adapt a pretrained language model to a new vocabulary

Tokenizer Adapter

A simple tool for adapting a pre-trained Hugging Face model to a new vocabulary with (almost) no training.

This technique can significantly reduce sequence length when a language model is used on data with a specific vocabulary (biology, medicine, law, other languages, etc.).

It should work with most language models from the Hugging Face Hub (further testing is needed).
Everything runs on CPU.
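
As an illustration of the sequence-length point above, here is a minimal sketch using two public tokenizers, with French text standing in for domain-specific data (the model names are only illustrative):

from transformers import AutoTokenizer

text = "Le tribunal a rendu une ordonnance de référé la semaine dernière."

# A tokenizer whose vocabulary does not match the data splits words into many subword pieces
english_tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
# A tokenizer trained on matching data keeps sequences much shorter
french_tokenizer = AutoTokenizer.from_pretrained("camembert-base")

print(len(english_tokenizer.tokenize(text)))  # noticeably longer
print(len(french_tokenizer.tokenize(text)))   # noticeably shorter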

Install

pip install tokenizer-adapter --upgrade

Usage

It is recommended to train the new vocabulary from an existing tokenizer.
The best and easiest way is to use the tokenizer.train_new_from_iterator(...) method, as in the example below.

from tokenizer_adapter import TokenizerAdapter
from transformers import AutoTokenizer, AutoModelForMaskedLM

BASE_MODEL_PATH = "camembert-base"

# A simple corpus
corpus = ["A first sentence", "A second sentence", "blablabla"]

# Load model and tokenizer
model = AutoModelForMaskedLM.from_pretrained(BASE_MODEL_PATH)
tokenizer = AutoTokenizer.from_pretrained(BASE_MODEL_PATH)

# Train new vocabulary from the old tokenizer
new_tokenizer = tokenizer.train_new_from_iterator(corpus, vocab_size=300)

# Default params should work in most cases
adapter = TokenizerAdapter()

# Patch the model with the new tokenizer
model = adapter.adapt_from_pretrained(new_tokenizer, model, tokenizer)

# Save the model and the new tokenizer
model.save_pretrained("my_new_model/")
new_tokenizer.save_pretrained("my_new_model/")
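
As a quick sanity check (a minimal sketch, assuming the files saved above), you can reload the patched model and run a fill-mask prediction with the adapted vocabulary. With such a tiny corpus the predictions will not be meaningful, but the pipeline should run end to end:

from transformers import pipeline

# Load the adapted model and tokenizer back from the directory saved above
fill_mask = pipeline("fill-mask", model="my_new_model/", tokenizer="my_new_model/")
print(fill_mask(f"A first {fill_mask.tokenizer.mask_token}"))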

To rely on a custom tokenizer (experimental), you may need the custom_preprocessing argument.
Example using a RoBERTa-style tokenizer (similar to Phi-2's) with a CamemBERT model:

from tokenizer_adapter import TokenizerAdapter
from transformers import AutoTokenizer, AutoModelForMaskedLM

BASE_MODEL_PATH = "camembert-base"
NEW_CUSTOM_TOKENIZER = "roberta-base"

# A simple corpus
corpus = ["A first sentence", "A second sentence", "blablabla"]

# Load model and tokenizer
model = AutoModelForMaskedLM.from_pretrained(BASE_MODEL_PATH)
tokenizer = AutoTokenizer.from_pretrained(BASE_MODEL_PATH)

# Also load this custom tokenizer to train the new one
new_tokenizer = AutoTokenizer.from_pretrained(NEW_CUSTOM_TOKENIZER)
new_tokenizer = new_tokenizer.train_new_from_iterator(corpus, vocab_size=300)

# CamemBERT tokenizer relies on '▁' while the RoBERTa one relies on 'Ġ'
adapter = TokenizerAdapter(custom_preprocessing=lambda x: x.replace('Ġ', '▁'))

# Patch the model with the new tokenizer
model = adapter.adapt_from_pretrained(new_tokenizer, model, tokenizer)

# Save the model and the new tokenizer
model.save_pretrained("my_new_model/")
new_tokenizer.save_pretrained("my_new_model/")
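
If you are unsure which word-boundary marker each tokenizer family uses, a quick check (an illustrative sketch, not part of the library) is to tokenize a short two-word string and inspect the output:

from transformers import AutoTokenizer

camembert_tokenizer = AutoTokenizer.from_pretrained("camembert-base")
roberta_tokenizer = AutoTokenizer.from_pretrained("roberta-base")

# SentencePiece-style tokenizers mark word starts with '▁'
print(camembert_tokenizer.tokenize("Hello world"))
# Byte-level BPE tokenizers (RoBERTa, GPT-2, Phi-2) mark word starts with 'Ġ'
print(roberta_tokenizer.tokenize("Hello world"))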

