Build and view token mappings between languages and tokenizers

These details have been verified by PyPI

Project links

Homepage

GitHub Statistics

Maintainers

procesaur

These details have not been verified by PyPI

Development Status
- 5 - Production/Stable
Intended Audience
- Developers
- Science/Research
License
- OSI Approved :: Apache Software License
Operating System
- OS Independent
Programming Language
- Python :: 3.12

Project description

token2token

Easy-to-make token mappings using one or two tokenizers and a parallel corpus.

Example

Suppose you want to align French and English at the subword-token level. You need:

- A French (HuggingFace) tokenizer
- An English tokenizer (could be the same one)
- A French-English parallel corpus (if none is provided, OpenSubtitles2024 from HuggingFace is used by default)
- This software

For each token in the first tokenizer, you get a list of possible matching tokens from the second tokenizer, with a score for each of them.

Alternatively, you can still use the old pipeline and get word mappings based on NLTK or another specialized tokenizer.

Usage

First, install the package from PyPI:

pip install token2token

or from source:

git clone https://github.com/procesaur/token2token
cd token2token
python setup.py install

Then, in Python, build the mapping and retrieve the top-5 matching tokens in the target language for any given token:

from token2token import Token2token
en2fr = Token2token.make(lang1="en", lang2="fr", tokenizer1="Qwen/Qwen3.5-0.8B", tokenizer2="Qwen/Qwen3.5-0.8B", n_lines=500000)
print(en2fr("Ġapple"))
# out: {'Ġpomme': 18.72391482536058, 'omm': 4.7151260350878825, 'nÃ©s': 2.887133318202845, 'Ġpommes': 2.8528411761126584, 'po': 2.799092675636191}
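
Token strings such as 'Ġpomme' and 'nÃ©s' in the output above are not corrupted: they look like the byte-level representation used by GPT-2-style BPE tokenizers, where 'Ġ' stands for a leading space and 'Ã©' for the two UTF-8 bytes of 'é'. The sketch below assumes the tokenizer follows that GPT-2 byte-to-character table (an assumption, not something the package documents) and converts such tokens back to readable text. The helper name readable is hypothetical.

```python
def bytes_to_unicode():
    """Rebuild the GPT-2 byte -> printable-character table."""
    # Printable bytes map to themselves; the rest are shifted to code points >= 256.
    bs = (list(range(ord("!"), ord("~") + 1))
          + list(range(ord("¡"), ord("¬") + 1))
          + list(range(ord("®"), ord("ÿ") + 1)))
    cs = bs[:]
    n = 0
    for b in range(256):
        if b not in bs:
            bs.append(b)
            cs.append(256 + n)
            n += 1
    return dict(zip(bs, map(chr, cs)))


# Inverse table: printable character -> original byte value.
UNICODE_TO_BYTE = {ch: b for b, ch in bytes_to_unicode().items()}


def readable(token):
    """Decode a byte-level token string into plain text (hypothetical helper)."""
    raw = bytes(UNICODE_TO_BYTE[ch] for ch in token)
    # A token may end in the middle of a multi-byte character, hence errors="replace".
    return raw.decode("utf-8", errors="replace")


print(repr(readable("Ġpomme")))
print(repr(readable("nÃ©s")))
```

Special tokens that contain characters outside the table would raise a KeyError here, so they need to be filtered out first.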

Alternatively, you can still use the old pipeline to get word mappings:

from token2token import Word2word
en2fr = Word2word.make(lang1="en", lang2="fr", n_lines=500000)
print(en2fr("apple"))
# out: {'pomme': 18.491287696990998, 'pommiers': 2.913168676725654, 'pommes': 2.8193681613734003, 'empoisonnés': 2.767322352478363, 'pommier': 1.8529305946107455}
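
Since scores are returned together with the candidates, they can be post-processed directly. The following is a minimal sketch, assuming the lookup returns a plain dict of candidate to positive score as printed above; the helper names best_candidate and relative_weights are hypothetical and not part of the package.

```python
def best_candidate(candidates):
    """Return the candidate with the highest score."""
    return max(candidates, key=candidates.get)


def relative_weights(candidates):
    """Rescale positive scores so that they sum to 1."""
    total = sum(candidates.values())
    return {word: score / total for word, score in candidates.items()}


# Sample result copied from the example output above.
candidates = {
    "pomme": 18.491287696990998,
    "pommiers": 2.913168676725654,
    "pommes": 2.8193681613734003,
    "empoisonnés": 2.767322352478363,
    "pommier": 1.8529305946107455,
}

print(best_candidate(candidates))
for word, weight in relative_weights(candidates).items():
    print(f"{word}: {weight:.3f}")
```

The rescaled values are only a convenience for comparing candidates of one source word; they are not probabilities produced by the method itself.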

The old pipeline has been modified:

- to use HuggingFace datasets for corpora,
- to output scores together with words, and
- to save in plain, human-readable JSON format.

In both cases, the custom lexicon can be loaded from the directory it is stored in (defaulting to the home directory on Linux or "C:\word2word" on Windows):

from token2token import Token2token
my_en2fr = Token2token.load("en", "fr")
# Loaded token2token custom token mapping from C:\word2word\en-fr.json

from token2token import Word2word
my_en2fr = Word2word.load("en", "fr", "data/pubmed.en-fr")
# Loaded token2word custom bilingual lexicon from C:\word2word\en-fr.json
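
Because the lexicon is saved as plain JSON, it can also be inspected without the package. The sketch below assumes a hypothetical schema in which each source entry maps to a dict of candidate to score; the real file layout may differ, so check a saved file before relying on it. A temporary file stands in for en-fr.json.

```python
import json
import tempfile
from pathlib import Path

# Hypothetical file contents: the real schema may differ.
lexicon = {"apple": {"pomme": 18.49, "pommes": 2.82}}

with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "en-fr.json"
    # ensure_ascii=False keeps accented characters readable in the file.
    path.write_text(json.dumps(lexicon, ensure_ascii=False, indent=2), encoding="utf-8")
    loaded = json.loads(path.read_text(encoding="utf-8"))

# Candidates for one entry, ordered from highest to lowest score.
top = sorted(loaded["apple"].items(), key=lambda kv: kv[1], reverse=True)
print(top)
```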

Supported Languages

As already mentioned, when no custom dataset is provided, the fallback is OpenSubtitles2024, which supports 94 languages.

Methodology

The approach computes top-k word translations based on the co-occurrence statistics between cross-lingual word pairs in a parallel corpus. There is also a correction term that controls for any confounding effect coming from other source words within the same sentence. The resulting method is an efficient and scalable approach that allows the construction of large bilingual dictionaries from any given parallel corpus, or of a (subword) token alignment between different languages and/or tokenizers.
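
To make the co-occurrence idea concrete, here is a simplified sketch that counts sentence-level co-occurrences in a toy parallel corpus and ranks candidates by pointwise mutual information (PMI). It deliberately omits the correction term described above, so it is not the scoring used by the package; the function name cooccurrence_pmi and the toy corpus are illustrative assumptions.

```python
import math
from collections import Counter
from itertools import product


def cooccurrence_pmi(pairs):
    """Score (source, target) pairs by PMI over aligned sentence pairs.

    pairs: iterable of (source_tokens, target_tokens), one per aligned sentence.
    """
    n = 0
    src_count, tgt_count, joint = Counter(), Counter(), Counter()
    for src, tgt in pairs:
        n += 1
        s, t = set(src), set(tgt)  # count each token once per sentence
        src_count.update(s)
        tgt_count.update(t)
        joint.update(product(s, t))
    scores = {}
    for (x, y), c in joint.items():
        # PMI = log( p(x, y) / (p(x) * p(y)) ), with sentence-level probabilities.
        pmi = math.log(c * n / (src_count[x] * tgt_count[y]))
        scores.setdefault(x, {})[y] = pmi
    return scores


corpus = [
    (["the", "apple"], ["la", "pomme"]),
    (["the", "house"], ["la", "maison"]),
    (["an", "apple"], ["une", "pomme"]),
    (["an", "orange"], ["une", "orange"]),
]
scores = cooccurrence_pmi(corpus)
best = max(scores["apple"], key=scores["apple"].get)
print(best, scores["apple"])
```

In this toy corpus, "la" and "une" co-occur with "apple" only because of the neighbouring articles "the" and "an". That is the kind of confounding from other source words that the correction term is meant to control.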

For more details, see the Methodology section of the original paper.

Multiprocessing

In both the Python interface and the command line interface, make uses multiprocessing with 8 CPUs by default. The number of CPU workers can be adjusted by setting num_workers=N (Python) or --num_workers N (command line).
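
As a rough illustration of why the worker count matters, counting over a corpus can be split into chunks, processed independently, and merged. The sketch below uses the standard multiprocessing.Pool; it is not the package's internal implementation, and all function names are hypothetical.

```python
from collections import Counter
from multiprocessing import Pool


def split_into_chunks(items, n_chunks):
    """Split a list into at most n_chunks contiguous, nearly equal slices."""
    size = max(1, -(-len(items) // n_chunks))  # ceiling division
    return [items[i:i + size] for i in range(0, len(items), size)]


def count_tokens(lines):
    """Count whitespace-separated tokens in one chunk of lines."""
    counts = Counter()
    for line in lines:
        counts.update(line.split())
    return counts


def merge_counts(partials):
    """Sum the per-chunk counters into one."""
    total = Counter()
    for partial in partials:
        total.update(partial)
    return total


def parallel_count(lines, num_workers=8):
    """Count tokens with one chunk per worker process."""
    chunks = split_into_chunks(lines, num_workers)
    with Pool(processes=num_workers) as pool:
        return merge_counts(pool.map(count_tokens, chunks))
```

On platforms that start workers by spawning a fresh interpreter (such as Windows), call parallel_count from inside an `if __name__ == "__main__":` block.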

References

If you use word2word for research, please cite our paper:

@inproceedings{choe2020word2word,
 author = {Yo Joong Choe and Kyubyong Park and Dongwoo Kim},
 title = {word2word: A Collection of Bilingual Lexicons for 3,564 Language Pairs},
 booktitle = {Proceedings of the 12th International Conference on Language Resources and Evaluation (LREC 2020)},
 year = {2020}
}

A citation for the token2token add-on is coming soon.

Authors

Mihailo Škorić, based on work by Kyubyong Park, Dongwoo Kim, and YJ Choe


Release history

This version

1.0.0

Jun 26, 2026

Download files


Source Distribution

token2token-1.0.0.tar.gz (15.3 kB view details)

Uploaded Jun 26, 2026 Source

Built Distribution


token2token-1.0.0-py3-none-any.whl (15.5 kB view details)

Uploaded Jun 26, 2026 Python 3

File details

Details for the file token2token-1.0.0.tar.gz.

File metadata

Download URL: token2token-1.0.0.tar.gz
Upload date: Jun 26, 2026
Size: 15.3 kB
Tags: Source
Uploaded using Trusted Publishing? Yes
Uploaded via: twine/6.1.0 CPython/3.13.12

File hashes

Hashes for token2token-1.0.0.tar.gz
Algorithm	Hash digest
SHA256	`e2c4be49c81e0ebb3d0fbb796d4fadfe4d530e8fa6bbdca37fec10674a9dc248`
MD5	`8f3b34fead0c79c43217d9c3b22c62cd`
BLAKE2b-256	`a3acf4944733850b4f8a35f9e8d01292114bbc4d711ca728876316e9335ae16f`

See more details on using hashes here.

Provenance

The following attestation bundles were made for token2token-1.0.0.tar.gz:

Publisher: python-publish.yml on procesaur/token2token

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

Statement:
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: token2token-1.0.0.tar.gz
- Subject digest: e2c4be49c81e0ebb3d0fbb796d4fadfe4d530e8fa6bbdca37fec10674a9dc248
- Sigstore transparency entry: 1968993347
- Sigstore integration time: Jun 26, 2026
Source repository:
- Permalink: procesaur/token2token@2932c068e0827ee630711cc75be6de16cce8d48a
- Branch / Tag: refs/tags/v1.0.1
- Owner: https://github.com/procesaur
- Access: public
Publication detail:
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: python-publish.yml@2932c068e0827ee630711cc75be6de16cce8d48a
- Trigger Event: release

File details

Details for the file token2token-1.0.0-py3-none-any.whl.

File metadata

Download URL: token2token-1.0.0-py3-none-any.whl
Upload date: Jun 26, 2026
Size: 15.5 kB
Tags: Python 3
Uploaded using Trusted Publishing? Yes
Uploaded via: twine/6.1.0 CPython/3.13.12

File hashes

Hashes for token2token-1.0.0-py3-none-any.whl
Algorithm	Hash digest
SHA256	`c1b944de7bf7756ec1530de6fb8d2cdf04e40f3e2119af5f29de37e114e54bf8`
MD5	`9706c0f7bd7211af7045c9619fa50d19`
BLAKE2b-256	`9a7b76ea2ed020d935a32374c3230c6013d95ad9abf994879a863e9326523ccd`

See more details on using hashes here.

Provenance

The following attestation bundles were made for token2token-1.0.0-py3-none-any.whl:

Publisher: python-publish.yml on procesaur/token2token

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

Statement:
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: token2token-1.0.0-py3-none-any.whl
- Subject digest: c1b944de7bf7756ec1530de6fb8d2cdf04e40f3e2119af5f29de37e114e54bf8
- Sigstore transparency entry: 1968993455
- Sigstore integration time: Jun 26, 2026
Source repository:
- Permalink: procesaur/token2token@2932c068e0827ee630711cc75be6de16cce8d48a
- Branch / Tag: refs/tags/v1.0.1
- Owner: https://github.com/procesaur
- Access: public
Publication detail:
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: python-publish.yml@2932c068e0827ee630711cc75be6de16cce8d48a
- Trigger Event: release
