Accelerated functions to calculate Word Mover's Distance

Fast Word Mover's Distance

Calculates Word Mover's Distance as described in "From Word Embeddings To Document Distances" by Matt Kusner, Yu Sun, Nicholas Kolkin and Kilian Weinberger.

The high-level logic is written in Python; the low-level functions related to linear programming are offloaded to the bundled native extension. The native extension can also be built as a generic shared library that does not depend on Python at all. Python 2.7 and older are not supported. The heavy lifting is done by google/or-tools.
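
For orientation, the linear program behind Word Mover's Distance can be written as below. This follows the formulation in the cited paper rather than anything specific to this package's internals; the symbols are the paper's, with d and d' the normalized bag-of-words weights of the two documents, x_i the embedding of item i, and T the transport matrix being optimized.

```latex
\[
\min_{T \ge 0} \sum_{i,j} T_{ij}\, c(i,j)
\quad \text{subject to} \quad
\sum_{j} T_{ij} = d_i \;\; \forall i, \qquad
\sum_{i} T_{ij} = d'_j \;\; \forall j,
\qquad c(i,j) = \lVert x_i - x_j \rVert_2
\]
```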

Installation

pip3 install wmd

Tested on Linux and macOS.

Usage

You need a numpy array of embeddings and an nbow model; that is, every sample is a weighted set of items, and every item is embedded.

import numpy
from wmd import WMD

# One float32 embedding vector per item; the row index is the item identifier.
embeddings = numpy.array([[0.1, 1], [1, 0.1]], dtype=numpy.float32)
# Sample identifier -> (human-readable name, item identifiers, float32 weights).
nbow = {"first":  ("#1", [0, 1], numpy.array([1.5, 0.5], dtype=numpy.float32)),
        "second": ("#2", [0, 1], numpy.array([0.75, 0.15], dtype=numpy.float32))}
calc = WMD(embeddings, nbow, vocabulary_min=2)
print(calc.nearest_neighbors("first"))
# Output: [('second', 0.10606599599123001)]
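
The printed distance can be checked by hand for this tiny case. The sketch below does not call the package; it assumes that the weights of each sample are normalized to sum to one and that the cost between items is the Euclidean distance between their embeddings. Both assumptions are inferred from the printed value, not stated in the text. With only two shared items, the only mass that has to move is the surplus on one of them.

```python
import numpy

embeddings = numpy.array([[0.1, 1], [1, 0.1]], dtype=numpy.float32)
w_first = numpy.array([1.5, 0.5], dtype=numpy.float32)
w_second = numpy.array([0.75, 0.15], dtype=numpy.float32)

# Assumption: each sample's weights are normalized to sum to one.
p = w_first / w_first.sum()    # about [0.75, 0.25]
q = w_second / w_second.sum()  # about [0.833, 0.167]

# Assumption: the transport cost is the Euclidean distance between embeddings.
cost = float(numpy.linalg.norm(embeddings[0] - embeddings[1]))

# Both samples use the same two items, so the optimal plan keeps the
# overlapping mass in place and moves only the surplus across.
moved = float(abs(p[0] - q[0]))
hand_distance = moved * cost
print(hand_distance)
```

The result agrees with the value printed above to within float32 precision.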

embeddings must support __getitem__, which returns an item by its identifier; in particular, numpy.ndarray matches that interface. nbow must be iterable, yielding sample identifiers, and must support __getitem__ by those identifiers, returning tuples of length 3. The first element is the human-readable name of the sample, the second is an iterable of item identifiers and the third is a numpy.ndarray with the corresponding weights. All numpy arrays must be float32. The result is a list of tuples, each holding a sample identifier and a relevancy index (the lower, the better).
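
As an illustration of the nbow layout described above, here is a minimal sketch that builds such a mapping from tokenized documents. The helper build_nbow, the vocabulary and the sample documents are hypothetical and not part of the package; a plain dict is used because it is iterable over its keys and supports __getitem__.

```python
import collections

import numpy


def build_nbow(documents, vocabulary):
    """Hypothetical helper: build an nbow mapping in the 3-tuple layout.

    documents: sample identifier -> (human-readable name, list of tokens)
    vocabulary: token -> item identifier (row index into the embeddings)
    """
    nbow = {}
    for sample_id, (name, tokens) in documents.items():
        # Count occurrences per item identifier, skipping unknown tokens.
        counts = collections.Counter(
            vocabulary[token] for token in tokens if token in vocabulary)
        item_ids = sorted(counts)
        # Weights must be a float32 numpy array aligned with item_ids.
        weights = numpy.array([counts[i] for i in item_ids],
                              dtype=numpy.float32)
        nbow[sample_id] = (name, item_ids, weights)
    return nbow


vocabulary = {"president": 0, "press": 1, "media": 2}
documents = {
    "a": ("Doc A", ["president", "press", "press"]),
    "b": ("Doc B", ["media", "president", "unknown"]),
}
nbow = build_nbow(documents, vocabulary)
print(nbow["a"])
```

Raw term counts are used as weights here for simplicity; any other non-negative weighting would fit the same layout.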

It is possible to use this package with spaCy:

import spacy
import wmd

# Load a model that ships with word vectors.
nlp = spacy.load('en_core_web_md')
# Passing a component object to add_pipe is the spaCy 2.x calling convention.
nlp.add_pipe(wmd.WMD.SpacySimilarityHook(nlp), last=True)
doc1 = nlp("Politician speaks to the media in Illinois.")
doc2 = nlp("The president greets the press in Chicago.")
print(doc1.similarity(doc2))

See also another example, which finds similar Wikipedia pages.

Building from source

Either build it as a Python package:

pip3 install git+https://github.com/src-d/wmd-relax

or use CMake:

git clone --recursive https://github.com/src-d/wmd-relax
cd wmd-relax
cmake -D CMAKE_BUILD_TYPE=Release .
make -j

Note the --recursive flag for git clone: this project uses source{d}'s fork of google/or-tools as a git submodule.

Tests

Tests are in test.py and use the stock unittest package.

Documentation

cd doc
make html

The generated files are placed in the doc/doxyhtml and doc/html directories.

Contributions

...are welcome! See CONTRIBUTING and code of conduct.

License

Apache 2.0
