
Project description

transtokenizers


Token translation for language models

Features

  • Translate a model from one language to another.
  • Support for most scripts beyond Latin.

Installation

pip install trans-tokenizers

Usage

You need a working installation of fast_align to align the tokens. You can build it from the following repo: https://github.com/FremyCompany/fast_align.
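fast_align expects one sentence pair per line, with the source and target separated by ` ||| `. A minimal sketch of producing that format (the sentence pairs here are illustrative, not from the library):

```python
# Write sentence pairs in the "source ||| target" format fast_align expects.
pairs = [
    ("the cat sits", "de kat zit"),
    ("good morning", "goedemorgen"),
]
lines = [f"{src} ||| {tgt}" for src, tgt in pairs]
with open("corpus.en-nl", "w", encoding="utf-8") as f:
    f.write("\n".join(lines) + "\n")
```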

To convert a Llama model from English to Dutch, you can use the following code. This might take a while, since building and aligning the corpus is compute-intensive.

from transtokenizers import create_aligned_corpus, align, map_tokens, smooth_mapping, remap_model
from transformers import AutoTokenizer, AutoModelForCausalLM
import os

source_model = "meta-llama/Meta-Llama-3-8B"
target_tokenizer = "yhavinga/gpt-neo-1.3B-dutch"
export_dir = "en-nl-llama3-8b"

# Step 1: build a tokenized, sentence-aligned parallel corpus.
corpus = create_aligned_corpus(
    source_language="en",
    target_language="nl",
    source_tokenizer=source_model,
    target_tokenizer=target_tokenizer,
)

# Step 2: align the tokens with fast_align (pass the path to the binary).
mapped_tokens_file = align(corpus, fast_align_path="fast_align")

# Step 3: map each target token to its candidate source tokens.
tokenized_possible_translations, untokenized_possible_translations = map_tokens(
    mapped_tokens_file, source_model, target_tokenizer
)

# Step 4: smooth the mapping into per-token weights.
smoothed_mapping = smooth_mapping(target_tokenizer, tokenized_possible_translations)

# Step 5: re-initialize the model's embeddings for the target tokenizer.
model = remap_model(source_model, target_tokenizer, smoothed_mapping, source_model)

# Export the remapped model together with the target tokenizer.
os.makedirs(export_dir, exist_ok=False)
new_tokenizer = AutoTokenizer.from_pretrained(target_tokenizer)
model.save_pretrained(export_dir)
new_tokenizer.save_pretrained(export_dir)
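Conceptually, the remapping step initializes each target-token embedding from a weighted mix of the source-token embeddings it maps to. A toy sketch of that idea in pure Python (the tokens, vectors, and weights below are made up for illustration; this is not the library's actual implementation):

```python
# Toy source embeddings: token -> 3-d vector.
source_emb = {
    "house": [1.0, 0.0, 2.0],
    "home":  [0.0, 1.0, 0.0],
}

# Hypothetical smoothed mapping: target token -> [(source token, weight)],
# with the weights summing to 1.
mapping = {
    "huis": [("house", 0.75), ("home", 0.25)],
}

def remap_embedding(target_token):
    """Weighted average of the mapped source embeddings."""
    dim = len(next(iter(source_emb.values())))
    out = [0.0] * dim
    for src_tok, weight in mapping[target_token]:
        for i, value in enumerate(source_emb[src_tok]):
            out[i] += weight * value
    return out

print(remap_embedding("huis"))  # [0.75, 0.25, 1.5]
```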

Credits

If this repo was useful to you, please cite the following paper:

@inproceedings{remy-delobelle2024transtokenization,
    title={Trans-Tokenization and Cross-lingual Vocabulary Transfers: Language Adaptation of {LLM}s for Low-Resource {NLP}},
    author={Remy, Fran{\c{c}}ois and Delobelle, Pieter and Avetisyan, Hayastan and Khabibullina, Alfiya and de Lhoneux, Miryam and Demeester, Thomas},
    booktitle={First Conference on Language Modeling},
    year={2024},
    url={https://openreview.net/forum?id=sBxvoDhvao}
}

Project details


Download files

Download the file for your platform.

Source Distribution

trans_tokenizers-0.1.3.tar.gz (12.2 kB)

Uploaded Source

Built Distribution

trans_tokenizers-0.1.3-py3-none-any.whl (12.3 kB)

Uploaded Python 3

File details

Details for the file trans_tokenizers-0.1.3.tar.gz.

File metadata

  • Download URL: trans_tokenizers-0.1.3.tar.gz
  • Size: 12.2 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/5.1.1 CPython/3.10.4

File hashes

Hashes for trans_tokenizers-0.1.3.tar.gz
Algorithm Hash digest
SHA256 1b028c12b4bfdbc471684cc8301b7dc866c771e60f66716b891c1b171c9fe805
MD5 487fb363145c1647db05d90973903d93
BLAKE2b-256 e9563084e45ec61ed776d29d84cf4cac788238b2240e398df82488c240c07a7d
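To verify a downloaded artifact against the SHA256 digest above, the standard library's hashlib suffices (the commented-out check assumes the sdist was downloaded to the current directory):

```python
import hashlib

def sha256_of(path, chunk_size=1 << 20):
    """Stream a file through SHA256 and return the hex digest."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()

# expected = "1b028c12b4bfdbc471684cc8301b7dc866c771e60f66716b891c1b171c9fe805"
# assert sha256_of("trans_tokenizers-0.1.3.tar.gz") == expected
```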


File details

Details for the file trans_tokenizers-0.1.3-py3-none-any.whl.

File hashes

Hashes for trans_tokenizers-0.1.3-py3-none-any.whl
Algorithm Hash digest
SHA256 3962c352f41b365b7ce731ccb2cf111a5e30401c14873c71f1e90c4dca07723f
MD5 cb43e52b47aa2d102a6c6d465dd21223
BLAKE2b-256 5c101bef1ef2afe826cbf7380aee41fc215447d125de41f2a490712de911f428

