Word Translator for 3,564 Language Pairs

word2word

Easy-to-use word-to-word translations for 3,564 language pairs.

Key Features

  • A large collection of freely & publicly available word-to-word translations for 3,564 language pairs across 62 unique languages.
  • Easy-to-use Python interface.
  • Constructed using an efficient approach whose output was quantitatively evaluated by proficient bilingual human labelers.

Usage

First, install the package using pip:

pip install word2word

or install from source:

git clone https://github.com/Kyubyong/word2word.git
cd word2word
python setup.py install

Then, in Python, download the model for a language pair and retrieve the top-5 translations of any given word in the target language:

from word2word import Word2word
en2fr = Word2word("en", "fr")
print(en2fr("apple"))
# out: ['pomme', 'pommes', 'pommier', 'tartes', 'fleurs']
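
A translator object like the one above can be applied token by token to gloss a short phrase. The following is a minimal sketch, not part of the package: `gloss` and `toy_translate` are hypothetical names, the toy dictionary stands in for a downloaded model so the snippet runs offline, and it assumes that an unknown word raises `KeyError`.

```python
def gloss(sentence, translate, unknown="<unk>"):
    """Return the top-ranked translation of each whitespace-separated token.

    `translate` is any callable that maps a word to a ranked list of
    candidate translations (for example, the en2fr object shown earlier).
    """
    glossed = []
    for token in sentence.lower().split():
        try:
            candidates = translate(token)
        except KeyError:
            # Assumption: out-of-vocabulary words raise KeyError.
            candidates = []
        glossed.append(candidates[0] if candidates else unknown)
    return glossed


# Toy stand-in for a real model; the entries are illustrative only.
toy_lexicon = {"apple": ["pomme", "pommes"], "red": ["rouge"]}


def toy_translate(word):
    return toy_lexicon[word]


glossed = gloss("Red apple pie", toy_translate)
print(glossed)
```

Passing `en2fr` in place of `toy_translate` should work the same way if the real object behaves as assumed here.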

Supported Languages

We provide top-k word-to-word translations across all available pairs from OpenSubtitles2018. This amounts to a total of 3,564 language pairs across 62 unique languages.

The full list is provided here.

Methodology

Our approach computes the top-k word-to-word translations based on the co-occurrence statistics between cross-lingual word pairs in a parallel corpus. We additionally introduce a correction term that controls for any confounding effect coming from other source words within the same sentence. The resulting method is an efficient and scalable approach that allows us to construct large bilingual dictionaries from any given parallel corpus.
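
To make the co-occurrence idea concrete, here is a minimal sketch on a toy parallel corpus. It is not the method from the paper: it ranks target words by the Dice coefficient over sentence-level co-occurrence counts and does not reproduce the correction term described above. The function name `top_k_translations` and the four sentence pairs are illustrative assumptions.

```python
from collections import Counter, defaultdict


def top_k_translations(sentence_pairs, k=2):
    """Rank target words for each source word by a co-occurrence score.

    Stand-in score (an assumption, not the paper's): the Dice coefficient
    2 * c(x, y) / (c(x) + c(y)), where counts are numbers of sentence pairs.
    """
    src_count = Counter()
    tgt_count = Counter()
    co_count = defaultdict(Counter)

    for src, tgt in sentence_pairs:
        src_words = set(src.lower().split())
        tgt_words = set(tgt.lower().split())
        src_count.update(src_words)
        tgt_count.update(tgt_words)
        for s in src_words:
            co_count[s].update(tgt_words)

    lexicon = {}
    for s, targets in co_count.items():
        scores = {
            t: 2 * c / (src_count[s] + tgt_count[t])
            for t, c in targets.items()
        }
        # Sort by descending score; break ties alphabetically for determinism.
        ranked = sorted(scores, key=lambda t: (-scores[t], t))
        lexicon[s] = ranked[:k]
    return lexicon


toy_corpus = [
    ("the apple is red", "la pomme est rouge"),
    ("the apple is green", "la pomme est verte"),
    ("the car is red", "la voiture est rouge"),
    ("the car is fast", "la voiture est rapide"),
]

lexicon = top_k_translations(toy_corpus, k=2)
print(lexicon["apple"][0], lexicon["car"][0], lexicon["red"][0])
```

On a corpus this small, function words such as "the" tie across several candidates, which hints at why a plain co-occurrence score benefits from a correction for other words in the same sentence.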

For more details, see the Methods section of our paper draft.

Comparisons with Existing Software

A popular publicly available dataset of word-to-word translations is facebookresearch/MUSE, which includes 110 bilingual dictionaries built with Facebook's internal translation tool. In comparison to MUSE, word2word does not rely on translation software and covers a much larger set of language pairs (3,564). word2word also provides the top-k word-to-word translations for up to 100k words (compared to 5k to 10k words in MUSE) and can be applied to any language pair for which a parallel corpus exists.

In terms of quality, a direct comparison between the two methods is difficult, but we did notice that MUSE's bilingual dictionaries involving non-European languages may not be as useful. For English-Vietnamese, we found that 80% of the 1,500 word pairs in the validation set simply paired a word with itself (e.g. crimson-crimson, Suzuki-Suzuki, Randall-Randall).
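
A check of this kind is straightforward to reproduce on any list of word pairs. The sketch below is illustrative only: `identical_pair_rate` is a hypothetical helper, and the sample combines the three pairs quoted above with one made-up entry rather than the actual MUSE validation set.

```python
def identical_pair_rate(pairs):
    """Return the fraction of (source, target) pairs whose words are identical."""
    pairs = list(pairs)
    if not pairs:
        return 0.0
    same = sum(1 for src, tgt in pairs if src.lower() == tgt.lower())
    return same / len(pairs)


# Illustrative sample, not real evaluation data.
sample_pairs = [
    ("crimson", "crimson"),
    ("Suzuki", "Suzuki"),
    ("Randall", "Randall"),
    ("water", "nước"),
]

rate = identical_pair_rate(sample_pairs)
print(f"{rate:.0%} of pairs are identical")
```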

For more details, see the Appendix of our paper draft.

References

If you use our software for research, please cite:

@misc{word2word2019,
  author = {Park, Kyubyong and Kim, Dongwoo and Choe, Yo Joong},
  title = {word2word},
  year = {2019},
  publisher = {GitHub},
  journal = {GitHub repository},
  howpublished = {\url{https://github.com/Kyubyong/word2word}}
}

(We may later update this BibTeX entry with a reference to our paper.)

All of our word-to-word translations were constructed from the publicly available OpenSubtitles2018 dataset:

@article{opensubtitles2016,
  title={Opensubtitles2016: Extracting large parallel corpora from movie and tv subtitles},
  author={Lison, Pierre and Tiedemann, J{\"o}rg},
  year={2016},
  publisher={European Language Resources Association}
}

Authors

Kyubyong Park, Dongwoo Kim, and YJ Choe
