Multilingual word embeddings.

Project description

Combine pre-trained word embedding models for different languages to vectorise multi-language documents.

This package includes a Python implementation of the method outlined in MLS2013, which allows word embeddings from one model to be translated into the vector space of another model.

This allows you to combine word embeddings from different languages, avoiding the expense and complexity of training bilingual models. With transvec, you can simply use pre-trained Word2Vec models for different languages to measure the similarity of words in different languages and produce document vectors for mixed-language text.

Installation

pip install transvec

Example

Let’s say we want to study a corpus of text that contains a mix of Russian and English. gensim has pre-trained models for both languages:

>>> import gensim.downloader
>>> ru_model = gensim.downloader.load("word2vec-ruscorpora-300")
>>> en_model = gensim.downloader.load("glove-wiki-gigaword-300")

Now assume you don’t have the resources to train a single model that understands both languages well (and you probably don’t). It would be nice to take advantage of the knowledge we have in these two pre-trained models instead. Let’s use the Russian model to compare Russian words and the English model to compare English words:

>>> en_model.similar_by_word("king", 1)
[('queen', 0.6336469054222107)]

>>> ru_model.similar_by_word("царь_NOUN", 1) # "king"
[('царица_NOUN', 0.7304918766021729)] # "queen"

As advertised, the models correctly find words with a similar meaning. What if we now wish to compare words from different languages?

>>> ru_model.similar_by_word("king", 1)
Traceback (most recent call last):
    ...
KeyError: "word 'king' not in vocabulary"

It doesn’t work, because the Russian model was not trained on English words. We could of course convert our word to a vector using the English model, and then look for the most similar vector in the Russian model:

>>> king_vector = en_model.get_vector("king")
>>> ru_model.similar_by_vector(king_vector, 1)
[('непроизводительный_ADJ', 0.21217751502990723)]

Our result (which, appropriately enough, means “unproductive”) makes no sense at all. The meaning is nothing like our input word. Why did this happen? Because the “king” vector is defined in the vector space of the English model, which has nothing to do with the vector space of the Russian model. The outputs of the two models are simply not comparable. To remedy this, we must translate the vector from the source space (English in this case) into the target space (Russian).

This is where transvec can help you. By providing pairs of words in the source language along with their translation into the target language, transvec can train a model that will translate the vector for a word in the source language to a vector in the target language:

>>> from transvec.transformers import TranslationWordVectorizer

>>> train = [
...     ("king", "царь_NOUN"), ("tsar", "царь_NOUN"),
...     ("man", "мужчина_NOUN"), ("woman", "женщина_NOUN")
... ]

>>> bilingual_model = TranslationWordVectorizer(en_model, ru_model).fit(train)

For the convenience of English speakers, we have defined English to be our target language in this case (the first model passed to TranslationWordVectorizer is the target; any further models are sources). This creates a model that can take inputs in either language, but its output will always be in English.

Now we can make comparisons across both languages:

>>> bilingual_model.similar_by_word("царь_NOUN", 1) # "tsar"
[('king', 0.8043200969696045)]

If the provided word does not exist in the source corpus, but does exist in the target corpus, the model will fall back to using the target language’s vector:

>>> bilingual_model.similar_by_word("king", 1)
[('queen', 0.6336469054222107)]

We can also get sensible results for words that aren’t in our training set (the quality will depend on how comprehensive your training word pairs are):

>>> bilingual_model.similar_by_word("царица_NOUN", 1) # "queen"
[('king', 0.7763221263885498)]

Note that you can provide regularisation parameters to the TranslationWordVectorizer to help improve these results if you need to.
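
For example, the regularisation strength is controlled by the alpha parameter of the TranslationWordVectorizer constructor, as mentioned in the “How does it work?” section below (the value here is illustrative, not a recommendation):

>>> regularised_model = TranslationWordVectorizer(
...     en_model, ru_model, alpha=1.0
... ).fit(train)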

Extra features

Bulk vectorisation

For convenience, TranslationWordVectorizer also implements the scikit-learn Transformer API, so you can easily vectorise large sets of data in a pipeline. If you provide a 2D matrix of words, each row is treated as a single document and reduced to a single vector: the mean of all the word vectors in the document. This is a simple, cheap way of approximating document vectors when your documents contain multiple languages.
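
As a rough sketch of what that looks like with the bilingual model trained above (the documents here are illustrative; the 300 columns come from the 300-dimensional models we loaded earlier):

>>> docs = [
...     ["king", "царь_NOUN"],
...     ["woman", "женщина_NOUN"]
... ]
>>> bilingual_model.transform(docs).shape
(2, 300)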

Multilingual models

The example above translates a single source language into a target language. You can, however, train a model that recognises multiple source languages: simply provide more than one source model when you initialise the model. Source languages will be prioritised in the order you define them. Note that your training data must now contain word tuples rather than word pairs, with the order of the words matching the order of your models, as in the sketch below.
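
For instance, assuming a hypothetical pre-trained French model fr_model, a model with English as the target and Russian and French as sources could be built like this (each tuple lists the same word in model order: English, Russian, French):

>>> train_multi = [
...     ("king", "царь_NOUN", "roi"),
...     ("man", "мужчина_NOUN", "homme"),
...     ("woman", "женщина_NOUN", "femme")
... ]
>>> trilingual_model = TranslationWordVectorizer(
...     en_model, ru_model, fr_model
... ).fit(train_multi)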

How does it work?

The full details are outlined in MLS2013, but at its core the method is just ordinary least squares regression. The paper observes that an approximately linear relationship exists between the vector spaces of monolingual models, meaning that a simple translation matrix can be used to map a vector from its native vector space to a similar point in a target vector space, placing it close to words in the target language with similar meanings.

Unlike the original paper, transvec uses ridge regression rather than OLS to derive this translation matrix: this helps prevent overfitting when you only have a small set of training word pairs. If you want to use OLS instead, simply set the regularisation parameter (alpha) to zero in the TranslationWordVectorizer constructor.
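
To make that concrete, here is a minimal standalone sketch of the regression using scikit-learn and the training pairs from the example above. This illustrates the technique only; it is not transvec’s internal code, and the alpha value is arbitrary:

>>> import numpy as np
>>> from sklearn.linear_model import Ridge

>>> # Source-language (Russian) vectors and their target-language
>>> # (English) counterparts, one row per training pair.
>>> X = np.stack([ru_model.get_vector(ru) for en, ru in train])
>>> Y = np.stack([en_model.get_vector(en) for en, ru in train])

>>> # Ridge regression learns a linear map from the source space to
>>> # the target space; alpha=0 would reduce this to OLS, as in MLS2013.
>>> reg = Ridge(alpha=1.0, fit_intercept=False).fit(X, Y)

>>> # Translate an unseen Russian vector into the English space and
>>> # look up its nearest English neighbour.
>>> queen_ru = ru_model.get_vector("царица_NOUN")
>>> queen_en = reg.predict(queen_ru.reshape(1, -1))[0]
>>> neighbours = en_model.similar_by_vector(queen_en, 1)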

References

[MLS2013] Tomas Mikolov, Quoc V. Le, Ilya Sutskever (2013). Exploiting Similarities among Languages for Machine Translation. arXiv:1309.4168.

