
A simple tool to adapt a pretrained language model to a new vocabulary

Project description

Tokenizer Adapter

A simple tool to adapt a pretrained Hugging Face transformer model to a new, specialized vocabulary with minimal to no retraining.

This technique can significantly reduce sequence length and computational cost when applying a general-purpose language model to domain-specific data in fields such as biology, medicine, or law, or to text in other languages.

While a slight decrease in accuracy might be observed, especially with a significantly smaller vocabulary, this can typically be mitigated with a few steps of fine-tuning or further pre-training.

The library is designed to work with most language models available on the Hugging Face Hub and runs on the CPU by default.

Why Use Tokenizer Adapter?

Pretrained language models from the Hugging Face Hub, like roberta-base or modernbert, are trained on vast, general-domain text, and their tokenizers are optimized for that general vocabulary. When these models are run on text from a specific domain (e.g., legal documents or scientific papers), words are often over-split into many sub-tokens. This leads to:

  • Longer input sequences: This increases memory consumption and computational time.
  • Potential loss of semantic meaning: Sub-optimal tokenization can obscure the meaning of domain-specific terms.

Tokenizer Adapter solves this by reconfiguring the model's token embeddings to match a new, more efficient tokenizer trained on your target corpus. This results in shorter sequences, faster processing, and potentially better performance after fine-tuning.
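To make the over-splitting concrete, you can compare how a general-purpose tokenizer handles a domain-specific term. The snippet below is purely illustrative (the example sentence is an assumption, and exact splits vary by model and term):

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("roberta-base")

# A rare biomedical term is typically split into many sub-tokens
tokens = tokenizer.tokenize("Pembrolizumab inhibits the PD-1 receptor.")
print(len(tokens), tokens)
# A tokenizer trained on biomedical text could keep "Pembrolizumab"
# as a single token, shortening the sequence and preserving the
# term's identity.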

Installation

Install the package using pip:

pip install tokenizer-adapter --upgrade

How It Works

The core idea is to create a new, specialized vocabulary and tokenizer from your target corpus. Then, the TokenizerAdapter maps the embeddings of the original model's vocabulary to the new vocabulary. This is achieved by leveraging different methods to approximate the embeddings for the new tokens based on the existing ones.

The library offers several methods for this adaptation, including:

  • 'average': Averages the embeddings of the old tokens that constitute a new token. (Recommended; see the sketch after this list.)
  • 'first_attention': Uses the embedding of the first sub-token based on attention scores.
  • And many others like 'bos_attention', 'self_attention', 'frequency', 'svd', etc.
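
To illustrate the idea behind 'average', here is a conceptual sketch, not the library's internal code; old_tokenizer, old_emb, and new_vocab_tokens are hypothetical names:

import torch

def average_embedding(new_token, old_tokenizer, old_emb):
    # Conceptual sketch of the 'average' strategy: re-tokenize the new
    # token's surface form with the original tokenizer, then average the
    # corresponding rows of the original embedding matrix.
    old_ids = old_tokenizer(new_token, add_special_tokens=False)["input_ids"]
    return old_emb[old_ids].mean(dim=0)

# Hypothetical usage: build the new embedding matrix row by row
# new_emb = torch.stack(
#     [average_embedding(t, old_tokenizer, old_emb) for t in new_vocab_tokens]
# )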

Basic Usage

The most straightforward approach is to train a new tokenizer from your corpus using the train_new_from_iterator() method of an existing tokenizer.

from tokenizer_adapter import TokenizerAdapter
from transformers import AutoTokenizer, AutoModelForMaskedLM

# 1. Define paths and parameters
BASE_MODEL_PATH = "roberta-base"
SAVE_MODEL_PATH = "my-adapted-roberta"
VOCAB_SIZE = 5000  # Adjust based on your corpus size and domain specificity

# 2. Prepare your corpus (a list of strings)
# For a real-world scenario, this would be a large dataset
corpus = [
    "This is a sentence from my domain-specific corpus.",
    "It contains specialized terminology that the base model may not handle well.",
    # ... more sentences
]

# 3. Load the base model and tokenizer
model = AutoModelForMaskedLM.from_pretrained(BASE_MODEL_PATH)
tokenizer = AutoTokenizer.from_pretrained(BASE_MODEL_PATH)

# To run on a GPU, uncomment the following line:
# model.cuda()

# 4. Train a new tokenizer on your corpus
# This new tokenizer will be optimized for your data
new_tokenizer = tokenizer.train_new_from_iterator(corpus, vocab_size=VOCAB_SIZE)

# 5. Initialize the adapter and adapt the model
# The 'average' method is a robust default choice
adapter = TokenizerAdapter(method="average")
model = adapter.adapt_from_pretrained(model, new_tokenizer, tokenizer)

# 6. Save your new, adapted model and tokenizer
model.save_pretrained(SAVE_MODEL_PATH)
new_tokenizer.save_pretrained(SAVE_MODEL_PATH)

print(f"Model and tokenizer adapted and saved to {SAVE_MODEL_PATH}")

Advanced Usage: Custom Tokenizer (Experimental)

In some cases, you might want to use a tokenizer with a different architecture (e.g., adapting a CamemBERT model with a RoBERTa-style tokenizer). This is an experimental feature that may require a custom_preprocessing function to align the token representations.

For instance, CamemBERT uses ▁ as a prefix for sub-word units, while RoBERTa uses Ġ. The custom_preprocessing function can handle such differences.
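
You can observe the two conventions directly by tokenizing a short phrase with each tokenizer (the outputs shown in comments are typical, not guaranteed):

from transformers import AutoTokenizer

camembert_tok = AutoTokenizer.from_pretrained("camembert-base")
roberta_tok = AutoTokenizer.from_pretrained("roberta-base")

print(camembert_tok.tokenize("le chat noir"))  # e.g. ['▁le', '▁chat', '▁noir']
print(roberta_tok.tokenize("the black cat"))   # e.g. ['the', 'Ġblack', 'Ġcat']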

from tokenizer_adapter import TokenizerAdapter
from transformers import AutoTokenizer, AutoModelForMaskedLM

# 1. Define paths and parameters
BASE_MODEL_PATH = "camembert-base"
NEW_CUSTOM_TOKENIZER_PATH = "roberta-base"
SAVE_MODEL_PATH = "my-adapted-camembert"
VOCAB_SIZE = 5000

# 2. Prepare your corpus
corpus = [
    "Une phrase d'exemple pour notre corpus.",
    "Avec une terminologie spécifique au domaine.",
    # ...
]

# 3. Load the base model and its original tokenizer
model = AutoModelForMaskedLM.from_pretrained(BASE_MODEL_PATH)
original_tokenizer = AutoTokenizer.from_pretrained(BASE_MODEL_PATH)

# 4. Load the custom tokenizer and train it on your corpus
custom_tokenizer_template = AutoTokenizer.from_pretrained(NEW_CUSTOM_TOKENIZER_PATH)
new_tokenizer = custom_tokenizer_template.train_new_from_iterator(corpus, vocab_size=VOCAB_SIZE)

# 5. Define a preprocessing function to handle tokenization differences
# CamemBERT's '▁' vs. RoBERTa's 'Ġ'
def roberta_to_camembert_preprocessing(token):
    return token.replace('Ġ', '▁')

# 6. Initialize the adapter with the custom preprocessing and adapt the model
adapter = TokenizerAdapter(custom_preprocessing=roberta_to_camembert_preprocessing)
model = adapter.adapt_from_pretrained(model, new_tokenizer, original_tokenizer)

# 7. Save your adapted model and new tokenizer
model.save_pretrained(SAVE_MODEL_PATH)
new_tokenizer.save_pretrained(SAVE_MODEL_PATH)

print(f"Model with custom tokenizer adapted and saved to {SAVE_MODEL_PATH}")

Contributing

Contributions are welcome! If you find a bug or have a feature request, please open an issue on the GitHub repository.

License

This project is licensed under the Apache-2.0 License. See the LICENSE file for details.
