
transtokenizers


Token translation for language models

Features

  • Translate a model from one language to another.
  • Support for most scripts beyond Latin.
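The core idea behind this vocabulary transfer is to initialize each target-language token's embedding as a weighted average of the source-model embeddings it aligns with. The sketch below illustrates that idea with a toy vocabulary; the token names, weights, and the `mapping` structure are all illustrative, not the library's actual API (the real weights come from corpus alignments).

```python
import numpy as np

# Toy source embedding matrix: 4 source tokens, dimension 3.
source_emb = np.array([
    [1.0, 0.0, 0.0],  # "house"
    [0.0, 1.0, 0.0],  # "the"
    [0.0, 0.0, 1.0],  # "hold"
    [1.0, 1.0, 0.0],  # "home"
])

# Hypothetical alignment weights: each target token id maps to a list of
# (source token id, weight) pairs, e.g. Dutch "huis" aligns mostly with
# "house" and a little with "home".
mapping = {
    0: [(0, 0.8), (3, 0.2)],  # "huis"
    1: [(1, 1.0)],            # "het"
}

# Initialize each target embedding as the normalized weighted average
# of its aligned source embeddings.
target_emb = np.zeros((len(mapping), source_emb.shape[1]))
for tgt_id, aligned in mapping.items():
    total = sum(w for _, w in aligned)
    for src_id, w in aligned:
        target_emb[tgt_id] += (w / total) * source_emb[src_id]

# "huis" ends up at 0.8*house + 0.2*home = [1.0, 0.2, 0.0]
```

The transformer layers themselves are kept as-is; only the embedding (and output) matrices are re-initialized this way, which is why the remapped model is a useful starting point for further training rather than a finished model.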

Installation

pip install trans-tokenizers

Usage

You need a working installation of fast_align to align the tokens. You can build it from the following repo: https://github.com/FremyCompany/fast_align.
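For context, fast_align reads one sentence pair per line, with source and target separated by " ||| ", and emits Pharaoh-format alignments (whitespace-separated "srcidx-tgtidx" pairs per line). The snippet below shows how such an alignment line relates tokens across the pair; the example strings are illustrative.

```python
# One input line for fast_align: source tokens ||| target tokens.
corpus_line = "the house ||| het huis"
# Corresponding Pharaoh-format output: "srcidx-tgtidx" links.
alignment_line = "0-0 1-1"

src_text, tgt_text = corpus_line.split(" ||| ")
src_tokens = src_text.split()
tgt_tokens = tgt_text.split()

# Turn "0-0 1-1" into aligned (source_token, target_token) pairs.
pairs = []
for link in alignment_line.split():
    i, j = map(int, link.split("-"))
    pairs.append((src_tokens[i], tgt_tokens[j]))

print(pairs)  # [('the', 'het'), ('house', 'huis')]
```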

To convert a Llama model from English to Dutch, you can use the following code. This might take a while, since it has to download and align a parallel corpus.

from transtokenizers import create_aligned_corpus, align, map_tokens, smooth_mapping, remap_model
from transformers import AutoTokenizer, AutoModelForCausalLM
import os

source_model = "meta-llama/Meta-Llama-3-8B"
target_tokenizer = "yhavinga/gpt-neo-1.3B-dutch"
export_dir = "en-nl-llama3-8b"

# 1. Build a parallel corpus tokenized with both tokenizers
corpus = create_aligned_corpus(
    source_language="en",
    target_language="nl",
    source_tokenizer=source_model,
    target_tokenizer=target_tokenizer,
)

# 2. Align the tokens of both sides with fast_align
mapped_tokens_file = align(corpus, fast_align_path="fast_align")

# 3. Map each target token to its candidate source tokens
tokenized_possible_translations, untokenized_possible_translations = map_tokens(mapped_tokens_file, source_model, target_tokenizer)

# 4. Smooth the token mapping
smoothed_mapping = smooth_mapping(target_tokenizer, tokenized_possible_translations)

# 5. Remap the source model's embeddings onto the target vocabulary
model = remap_model(source_model, target_tokenizer, smoothed_mapping, source_model)

# 6. Export the converted model together with the target tokenizer
os.makedirs(export_dir, exist_ok=False)
new_tokenizer = AutoTokenizer.from_pretrained(target_tokenizer)
model.save_pretrained(export_dir)
new_tokenizer.save_pretrained(export_dir)

Credits

If this repo was useful to you, please cite the following paper:

@inproceedings{remy-delobelle2024transtokenization,
    title={Trans-Tokenization and Cross-lingual Vocabulary Transfers: Language Adaptation of {LLM}s for Low-Resource {NLP}},
    author={Remy, Fran{\c{c}}ois and Delobelle, Pieter and Avetisyan, Hayastan and Khabibullina, Alfiya and de Lhoneux, Miryam and Demeester, Thomas},
    booktitle={First Conference on Language Modeling},
    year={2024},
    url={https://openreview.net/forum?id=sBxvoDhvao}
}
