transtokenizers
Token translation for language models
- GitHub: https://github.com/LAGoM-NLP/transtokenizer
- PyPI: https://pypi.org/project/trans-tokenizers/
- Licence: MIT
Features
- Translate a model from one language to another.
- Support for most scripts beyond Latin.
Installation
pip install trans-tokenizers
Usage
You will need a compiled installation of fast_align to align the tokens. You can build it from the following repo: https://github.com/FremyCompany/fast_align.
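For reference, a minimal build sketch, assuming the fork follows the standard fast_align CMake setup; the path to the resulting fast_align binary is what you pass as fast_align_path in the code below:

git clone https://github.com/FremyCompany/fast_align
cd fast_align
mkdir build && cd build
cmake .. && make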
To convert a Llama model from English to Dutch, you can use the following code. This can take a while, since a parallel corpus has to be downloaded and aligned first.
from transtokenizers import create_aligned_corpus, align, map_tokens, smooth_mapping, remap_model
from transformers import AutoTokenizer
import os

source_model = "meta-llama/Meta-Llama-3-8B"
target_tokenizer = "yhavinga/gpt-neo-1.3B-dutch"
export_dir = "en-nl-llama3-8b"

# Build a tokenized parallel corpus for the source and target languages
corpus = create_aligned_corpus(
    source_language="en",
    target_language="nl",
    source_tokenizer=source_model,
    target_tokenizer=target_tokenizer,
)

# Align source and target tokens with fast_align
mapped_tokens_file = align(corpus, fast_align_path="fast_align")

# Derive candidate token translations from the word alignments
tokenized_possible_translations, untokenized_possible_translations = map_tokens(mapped_tokens_file, source_model, target_tokenizer)

# Smooth the token mapping
smoothed_mapping = smooth_mapping(target_tokenizer, tokenized_possible_translations)

# Remap the source model's embeddings onto the target tokenizer's vocabulary
model = remap_model(source_model, target_tokenizer, smoothed_mapping, source_model)

# Export the converted model together with its new tokenizer
os.makedirs(export_dir, exist_ok=False)
new_tokenizer = AutoTokenizer.from_pretrained(target_tokenizer)
model.save_pretrained(export_dir)
new_tokenizer.save_pretrained(export_dir)
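To sanity-check the result, the exported directory can be loaded like any other transformers checkpoint. A minimal sketch (the Dutch prompt is only an illustration, and output quality may be limited before the remapped model receives further training in the target language):

from transformers import AutoTokenizer, AutoModelForCausalLM

# Load the converted model and its Dutch tokenizer from the export directory
model = AutoModelForCausalLM.from_pretrained("en-nl-llama3-8b")
tokenizer = AutoTokenizer.from_pretrained("en-nl-llama3-8b")

# Generate a short continuation for a Dutch prompt
inputs = tokenizer("Er was eens", return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=20)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))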
Credits
If this repo was useful to you, please cite the following paper:
@inproceedings{remy-delobelle2024transtokenization,
    title={Trans-Tokenization and Cross-lingual Vocabulary Transfers: Language Adaptation of {LLM}s for Low-Resource {NLP}},
    author={Remy, Fran{\c{c}}ois and Delobelle, Pieter and Avetisyan, Hayastan and Khabibullina, Alfiya and de Lhoneux, Miryam and Demeester, Thomas},
    booktitle={First Conference on Language Modeling},
    year={2024},
    url={https://openreview.net/forum?id=sBxvoDhvao}
}
Download files
Source Distribution
trans_tokenizers-0.1.3.tar.gz (12.2 kB)
Built Distribution
trans_tokenizers-0.1.3-py3-none-any.whl (12.3 kB)
File details
Details for the file trans_tokenizers-0.1.3.tar.gz.
File metadata
- Download URL: trans_tokenizers-0.1.3.tar.gz
- Size: 12.2 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/5.1.1 CPython/3.10.4
File hashes
Algorithm | Hash digest
---|---
SHA256 | 1b028c12b4bfdbc471684cc8301b7dc866c771e60f66716b891c1b171c9fe805
MD5 | 487fb363145c1647db05d90973903d93
BLAKE2b-256 | e9563084e45ec61ed776d29d84cf4cac788238b2240e398df82488c240c07a7d
File details
Details for the file trans_tokenizers-0.1.3-py3-none-any.whl.
File metadata
- Download URL: trans_tokenizers-0.1.3-py3-none-any.whl
- Size: 12.3 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/5.1.1 CPython/3.10.4
File hashes
Algorithm | Hash digest
---|---
SHA256 | 3962c352f41b365b7ce731ccb2cf111a5e30401c14873c71f1e90c4dca07723f
MD5 | cb43e52b47aa2d102a6c6d465dd21223
BLAKE2b-256 | 5c101bef1ef2afe826cbf7380aee41fc215447d125de41f2a490712de911f428