Compute Word Mover's Distance using any type of Word Embedding model

Word Mover's Distance

This package provides an implementation of Word Mover's Distance that works with a generic word embedding model.

I largely reused code available in the gensim library, in particular the wmdistance function, making it more general so that it can be used with other Word Embeddings models, such as GloVe.

You can find a real-world use of this package in my news summariser repository, where I use Word Mover's Distance to find the most similar sentences in a given news article.

How to install

The preferred way to install this package is through pip:

pip install word-mover-distance

On Mac and Linux it works like a charm. On Windows, however, you are likely to run into issues: pyemd needs some C++ dependencies at build time. A quick way to solve this is to install "Build Tools for Visual Studio 2019" as follows:

  • Go to the following page and download "Build Tools for Visual Studio 2019": https://visualstudio.microsoft.com/downloads/#build-tools-for-visual-studio-2019
  • Once the download has finished, double-click the .exe file and choose to install the C++ build tools
  • Check that "Windows 10 SDK" is also selected among the suggested packages (the newest version is fine), as this is the critical dependency
  • Once the installation has finished, reopen your PowerShell/Command Prompt and retry the installation with the original pip command

If storage or connection speed is critical for your use case, or you would like to know more about the issue, have a look at this Stack Overflow discussion.

Basic usage

Import the library:

from word_mover_distance import model

Initialise a Word Embedding object

You can pass the path where the model is stored:

my_model = model.WordEmbedding(model_fn="/path/where/my/model/is/stored.txt")

or you can pass the model itself, previously loaded (assuming your model is a dictionary whose keys are the words and whose values are the corresponding word vectors):

my_model = model.WordEmbedding(model=my_word_embedding_model)
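As an illustration of the dictionary layout described above, here is a minimal sketch that parses a GloVe-style text file (one word per line, followed by its space-separated vector components) into a plain Python dictionary. The helper name load_text_embeddings and the tiny sample file are hypothetical and not part of this package.

```python
import os
import tempfile


def load_text_embeddings(path):
    """Parse a GloVe-style text file into a {word: vector} dictionary.

    Assumes one word per line followed by its space-separated components.
    """
    embeddings = {}
    with open(path, encoding="utf-8") as handle:
        for line in handle:
            parts = line.rstrip().split(" ")
            if len(parts) < 2:
                continue  # skip blank or malformed lines
            embeddings[parts[0]] = [float(x) for x in parts[1:]]
    return embeddings


# Tiny made-up file, only to show the expected layout.
with tempfile.NamedTemporaryFile(
    "w", suffix=".txt", delete=False, encoding="utf-8"
) as tmp:
    tmp.write("obama 0.1 0.2 0.3\npresident 0.2 0.1 0.4\n")

my_word_embedding_model = load_text_embeddings(tmp.name)
os.remove(tmp.name)
print(sorted(my_word_embedding_model))  # ['obama', 'president']
```

The resulting dictionary has the shape expected by the model argument. The sketch stores vectors as lists of floats; if NumPy arrays are required (an assumption worth checking against your setup), convert each value with numpy.asarray before passing the dictionary in.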

Compute Word Mover's distance

s1 = 'Obama speaks to the media in Chicago'.lower().split()
s2 = 'The president spoke to the press in Chicago'.lower().split()
wmdistance = my_model.wmdistance(s1, s2)
print(wmdistance)  # 1.8119693993679309 (the exact value depends on the embedding model)

Remember that the wmdistance(s1, s2) method expects two lists of strings (List[str]) as input, so tokenise each sentence first!
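Since the method works on token lists, a small helper can turn raw sentences into List[str] and pick the closest sentence from a set of candidates. The following is only a sketch: tokenize, most_similar and toy_distance are hypothetical names, and toy_distance is a simple word-overlap stand-in (not Word Mover's Distance) so that the example runs without an embedding model. In real use you would pass my_model.wmdistance as the distance argument.

```python
import re


def tokenize(sentence):
    """Lowercase a sentence and return its word tokens as a list of strings."""
    return re.findall(r"[a-z0-9']+", sentence.lower())


def most_similar(query, candidates, distance):
    """Return the candidate sentence with the smallest distance to the query.

    `distance` is any callable taking two List[str] arguments,
    for example my_model.wmdistance.
    """
    query_tokens = tokenize(query)
    return min(candidates, key=lambda c: distance(query_tokens, tokenize(c)))


def toy_distance(tokens_a, tokens_b):
    """Stand-in so the sketch runs without a model: 1 - Jaccard overlap."""
    a, b = set(tokens_a), set(tokens_b)
    if not a | b:
        return 0.0  # two empty token lists are treated as identical
    return 1.0 - len(a & b) / len(a | b)


sentences = [
    "The president spoke to the press in Chicago.",
    "The band gave a concert in Japan.",
]
best = most_similar("Obama speaks to the media in Chicago", sentences, toy_distance)
print(best)  # The president spoke to the press in Chicago.
```

Unlike the plain .lower().split() used above, this tokenizer also drops punctuation, so "Chicago." and "Chicago" map to the same token.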
