
A simple tool to adapt a pretrained language model to a new vocabulary

Project description

Tokenizer Adapter

A simple tool to adapt a pretrained Hugging Face transformer model to a new, specialized vocabulary with minimal to no retraining.

This technique can significantly reduce sequence length and computational cost when applying a general-purpose language model to domain-specific data in fields such as biology, medicine, or law, or to text in other languages.

While a slight decrease in accuracy might be observed, especially with a significantly smaller vocabulary, this can typically be mitigated with a few steps of fine-tuning or further pre-training.

The library is designed to work with most language models available on the Hugging Face Hub and runs on the CPU by default.

Why Use Tokenizer Adapter?

Pretrained language models from the Hugging Face Hub, like roberta-base or modernbert, are trained on vast, general-domain text, and their tokenizers are optimized for that general vocabulary. When these models are run on text from a specific domain (e.g., legal documents or scientific papers), words are often over-split into many sub-tokens. This leads to:

  • Longer input sequences: This increases memory consumption and computational time.
  • Potential loss of semantic meaning: Sub-optimal tokenization can obscure the meaning of domain-specific terms.

Tokenizer Adapter solves this by reconfiguring the model's token embeddings to match a new, more efficient tokenizer trained on your target corpus. This results in shorter sequences, faster processing, and potentially better performance after fine-tuning.
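To make the over-splitting concrete, you can compare how a general-purpose tokenizer handles a domain-specific term. The snippet below is purely illustrative (the example sentence is an assumption, and exact splits vary by model and term):

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("roberta-base")

# A rare biomedical term is typically split into many sub-tokens
tokens = tokenizer.tokenize("Pembrolizumab inhibits the PD-1 receptor.")
print(len(tokens), tokens)
# A tokenizer trained on biomedical text could keep "Pembrolizumab"
# as a single token, shortening the sequence and preserving the
# term's identity.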

Installation

Install the package using pip:

pip install tokenizer-adapter --upgrade

How It Works

The core idea is to create a new, specialized vocabulary and tokenizer from your target corpus. Then, the TokenizerAdapter maps the embeddings of the original model's vocabulary to the new vocabulary. This is achieved by leveraging different methods to approximate the embeddings for the new tokens based on the existing ones.

The library offers several methods for this adaptation, including:

  • 'average': Averages the embeddings of the old tokens that constitute a new token. (Recommended; see the sketch after this list.)
  • 'first_attention': Uses the embedding of the first sub-token based on attention scores.
  • And many others like 'bos_attention', 'self_attention', 'frequency', 'svd', etc.
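
To illustrate the idea behind 'average', here is a conceptual sketch, not the library's internal code; old_tokenizer, old_emb, and new_vocab_tokens are hypothetical names:

import torch

def average_embedding(new_token, old_tokenizer, old_emb):
    # Conceptual sketch of the 'average' strategy: re-tokenize the new
    # token's surface form with the original tokenizer, then average the
    # corresponding rows of the original embedding matrix.
    old_ids = old_tokenizer(new_token, add_special_tokens=False)["input_ids"]
    return old_emb[old_ids].mean(dim=0)

# Hypothetical usage: build the new embedding matrix row by row
# new_emb = torch.stack(
#     [average_embedding(t, old_tokenizer, old_emb) for t in new_vocab_tokens]
# )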

Basic Usage

The most straightforward approach is to train a new tokenizer from your corpus using the train_new_from_iterator() method of an existing tokenizer.

from tokenizer_adapter import TokenizerAdapter
from transformers import AutoTokenizer, AutoModelForMaskedLM

# 1. Define paths and parameters
BASE_MODEL_PATH = "roberta-base"
SAVE_MODEL_PATH = "my-adapted-roberta"
VOCAB_SIZE = 5000  # Adjust based on your corpus size and domain specificity

# 2. Prepare your corpus (a list of strings)
# For a real-world scenario, this would be a large dataset
corpus = [
    "This is a sentence from my domain-specific corpus.",
    "It contains specialized terminology that the base model may not handle well.",
    # ... more sentences
]

# 3. Load the base model and tokenizer
model = AutoModelForMaskedLM.from_pretrained(BASE_MODEL_PATH)
tokenizer = AutoTokenizer.from_pretrained(BASE_MODEL_PATH)

# To run on a GPU, uncomment the following line:
# model.cuda()

# 4. Train a new tokenizer on your corpus
# This new tokenizer will be optimized for your data
new_tokenizer = tokenizer.train_new_from_iterator(corpus, vocab_size=VOCAB_SIZE)

# 5. Initialize the adapter and adapt the model
# The 'average' method is a robust default choice
adapter = TokenizerAdapter(method="average")
model = adapter.adapt_from_pretrained(model, new_tokenizer, tokenizer)

# 6. Save your new, adapted model and tokenizer
model.save_pretrained(SAVE_MODEL_PATH)
new_tokenizer.save_pretrained(SAVE_MODEL_PATH)

print(f"Model and tokenizer adapted and saved to {SAVE_MODEL_PATH}")

Advanced Usage: Custom Tokenizer (Experimental)

In some cases, you might want to use a tokenizer with a different architecture (e.g., adapting a CamemBERT model with a RoBERTa-style tokenizer). This is an experimental feature that may require a custom_preprocessing function to align the token representations.

For instance, CamemBERT uses ▁ as a prefix for sub-word units, while RoBERTa uses Ġ. The custom_preprocessing function can handle such differences.
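
You can observe the two conventions directly by tokenizing a short phrase with each tokenizer (the outputs shown in comments are typical, not guaranteed):

from transformers import AutoTokenizer

camembert_tok = AutoTokenizer.from_pretrained("camembert-base")
roberta_tok = AutoTokenizer.from_pretrained("roberta-base")

print(camembert_tok.tokenize("le chat noir"))  # e.g. ['▁le', '▁chat', '▁noir']
print(roberta_tok.tokenize("the black cat"))   # e.g. ['the', 'Ġblack', 'Ġcat']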

from tokenizer_adapter import TokenizerAdapter
from transformers import AutoTokenizer, AutoModelForMaskedLM

# 1. Define paths and parameters
BASE_MODEL_PATH = "camembert-base"
NEW_CUSTOM_TOKENIZER_PATH = "roberta-base"
SAVE_MODEL_PATH = "my-adapted-camembert"
VOCAB_SIZE = 5000

# 2. Prepare your corpus
corpus = [
    "Une phrase d'exemple pour notre corpus.",
    "Avec une terminologie spécifique au domaine.",
    # ...
]

# 3. Load the base model and its original tokenizer
model = AutoModelForMaskedLM.from_pretrained(BASE_MODEL_PATH)
original_tokenizer = AutoTokenizer.from_pretrained(BASE_MODEL_PATH)

# 4. Load the custom tokenizer and train it on your corpus
custom_tokenizer_template = AutoTokenizer.from_pretrained(NEW_CUSTOM_TOKENIZER_PATH)
new_tokenizer = custom_tokenizer_template.train_new_from_iterator(corpus, vocab_size=VOCAB_SIZE)

# 5. Define a preprocessing function to handle tokenization differences
# CamemBERT's '▁' vs. RoBERTa's 'Ġ'
def roberta_to_camembert_preprocessing(token):
    return token.replace('Ġ', '▁')

# 6. Initialize the adapter with the custom preprocessing and adapt the model
adapter = TokenizerAdapter(custom_preprocessing=roberta_to_camembert_preprocessing)
model = adapter.adapt_from_pretrained(model, new_tokenizer, original_tokenizer)

# 7. Save your adapted model and new tokenizer
model.save_pretrained(SAVE_MODEL_PATH)
new_tokenizer.save_pretrained(SAVE_MODEL_PATH)

print(f"Model with custom tokenizer adapted and saved to {SAVE_MODEL_PATH}")

Contributing

Contributions are welcome! If you find a bug or have a feature request, please open an issue on the GitHub repository.

License

This project is licensed under the Apache-2.0 License. See the LICENSE file for details.
