transtokenizers
Token translation for language models
- GitHub: https://github.com/LAGoM-NLP/transtokenizer
- PyPI: https://pypi.org/project/trans-tokenizers/
- Licence: MIT
Features
- Translate a model from one language to another.
- Support for most scripts beyond Latin.
Installation
```
pip install trans-tokenizers
```
Usage
You also need an installation of fast_align to align the tokens. You can install it from the following repo: https://github.com/FremyCompany/fast_align.

To convert a Llama model from English to Dutch, you can use the following code. This may take a while.
```python
from transtokenizers import create_aligned_corpus, align, map_tokens, smooth_mapping, remap_model
from transformers import AutoTokenizer
import os

source_model = "meta-llama/Meta-Llama-3-8B"
target_tokenizer = "yhavinga/gpt-neo-1.3B-dutch"
export_dir = "en-nl-llama3-8b"

# Build a parallel corpus tokenized with both the source and target tokenizers.
corpus = create_aligned_corpus(
    source_language="en",
    target_language="nl",
    source_tokenizer=source_model,
    target_tokenizer=target_tokenizer,
)

# Align tokens with fast_align, then map and smooth the token translations.
mapped_tokens_file = align(corpus, fast_align_path="fast_align")
tokenized_possible_translations, untokenized_possible_translations = map_tokens(mapped_tokens_file, source_model, target_tokenizer)
smoothed_mapping = smooth_mapping(target_tokenizer, tokenized_possible_translations)

# Remap the model's embeddings to the target tokenizer and export the result.
model = remap_model(source_model, target_tokenizer, smoothed_mapping, source_model)
os.makedirs(export_dir, exist_ok=False)
new_tokenizer = AutoTokenizer.from_pretrained(target_tokenizer)
model.save_pretrained(export_dir)
new_tokenizer.save_pretrained(export_dir)
```
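The core idea behind the remapping step can be illustrated with a toy sketch (the function and data below are illustrative assumptions, not the library's internals): each target token's embedding is built as a weighted average of the embeddings of the source tokens it aligns to, so tokens with good translations start close to their source counterparts.

```python
import numpy as np

def remap_embeddings(source_emb, mapping):
    """Build target embeddings as weighted averages of aligned source rows.

    source_emb: (source_vocab, dim) array of source embeddings.
    mapping: list indexed by target token id; each entry is a list of
             (source_token_id, weight) pairs whose weights sum to 1.
    """
    dim = source_emb.shape[1]
    target_emb = np.zeros((len(mapping), dim))
    for tgt_id, pairs in enumerate(mapping):
        for src_id, weight in pairs:
            target_emb[tgt_id] += weight * source_emb[src_id]
    return target_emb

# Toy example: 3 source tokens, 2 target tokens.
source_emb = np.array([[1.0, 0.0],
                       [0.0, 1.0],
                       [2.0, 2.0]])
mapping = [
    [(0, 0.5), (1, 0.5)],  # target 0 aligns equally to source tokens 0 and 1
    [(2, 1.0)],            # target 1 maps directly to source token 2
]
print(remap_embeddings(source_emb, mapping))
# → [[0.5 0.5]
#    [2.  2. ]]
```

The actual library additionally smooths these weights over many possible translations per token, which is what `smooth_mapping` produces above.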
Credits
If this repo was useful to you, please cite the following paper:
```bibtex
@inproceedings{remy-delobelle2024transtokenization,
    title={Trans-Tokenization and Cross-lingual Vocabulary Transfers: Language Adaptation of {LLM}s for Low-Resource {NLP}},
    author={Remy, Fran{\c{c}}ois and Delobelle, Pieter and Avetisyan, Hayastan and Khabibullina, Alfiya and de Lhoneux, Miryam and Demeester, Thomas},
    booktitle={First Conference on Language Modeling},
    year={2024},
    url={https://openreview.net/forum?id=sBxvoDhvao}
}
```
File details
Details for the file trans_tokenizers-0.1.4.tar.gz.
File metadata
- Download URL: trans_tokenizers-0.1.4.tar.gz
- Upload date:
- Size: 12.1 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.1.0 CPython/3.12.3
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | `a00abd7ce304b48b27c80fbae254bc5aa9dea475593cd9a6212945e8b5920352` |
| MD5 | `c3a94bb1502e859c5a1221aa3a911f2e` |
| BLAKE2b-256 | `ec340effb73367a9ff9d30275b69cad18193af7ddd626a324866d391bca9f5e9` |
File details
Details for the file trans_tokenizers-0.1.4-py3-none-any.whl.
File metadata
- Download URL: trans_tokenizers-0.1.4-py3-none-any.whl
- Upload date:
- Size: 12.1 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.1.0 CPython/3.12.3
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | `bf327e90aee6094555284e7fbb59740b536a8ad9b0d31034ad306a4d5ba982dd` |
| MD5 | `4e055d5a7ea5b9952d0157d38266e361` |
| BLAKE2b-256 | `e908fe15b73bcfb922aab90edffcbbcd43e3a8ac5bbb5ba0cf12a888ef9905b0` |
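To verify a downloaded file against the digests listed above, you can compute the hash locally; a minimal sketch using only the standard library (the temporary file here stands in for the downloaded archive):

```python
import hashlib

def sha256_of(path):
    """Return the hex SHA256 digest of a file, read in chunks."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(8192), b""):
            h.update(chunk)
    return h.hexdigest()

# Example with a stand-in file; for a real check, point this at the
# downloaded trans_tokenizers archive and compare against the digest
# shown in the table above.
import os, tempfile
with tempfile.NamedTemporaryFile(delete=False) as f:
    f.write(b"hello")
    tmp = f.name
print(sha256_of(tmp))
os.remove(tmp)
```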