Accelerated functions to calculate Word Mover's Distance

Fast Word Mover's Distance [![Build Status](https://travis-ci.org/src-d/wmd-relax.svg?branch=master)](https://travis-ci.org/src-d/wmd-relax) [![PyPI](https://img.shields.io/pypi/v/wmd.svg)](https://pypi.python.org/pypi/wmd) [![codecov](https://codecov.io/github/src-d/wmd-relax/coverage.svg)](https://codecov.io/gh/src-d/wmd-relax)
==========================

Calculates Word Mover's Distance as described in
[From Word Embeddings To Document Distances](http://www.cs.cornell.edu/~kilian/papers/wmd_metric.pdf)
by Matt Kusner, Yu Sun, Nicholas Kolkin and Kilian Weinberger.

<img src="https://blog.sourced.tech/post/lapjv/wmd.png" alt="Word Mover's Distance" width="200"/>
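To make the idea concrete: WMD is the minimum total distance that the weighted words of one document must travel in embedding space to match the words of another. The sketch below computes the relaxed lower bound described in the same paper, where every word simply sends all of its mass to the nearest word of the other document. It is an illustration only and is not meant to describe what this package does internally; the function name `relaxed_wmd` and the toy data are assumptions made for this example.

```python
import numpy as np


def relaxed_wmd(embeddings, ids_a, weights_a, ids_b, weights_b):
    """Relaxed WMD lower bound between two weighted bags of items (illustrative)."""
    # Normalize the weights so that each document carries a total mass of 1.
    wa = np.asarray(weights_a, dtype=np.float64)
    wb = np.asarray(weights_b, dtype=np.float64)
    wa = wa / wa.sum()
    wb = wb / wb.sum()
    emb = np.asarray(embeddings, dtype=np.float64)
    ea = emb[list(ids_a)]
    eb = emb[list(ids_b)]
    # Pairwise Euclidean distances, shape (len(ids_a), len(ids_b)).
    dists = np.linalg.norm(ea[:, None, :] - eb[None, :, :], axis=2)
    # Each item moves all of its mass to its nearest counterpart.
    a_to_b = float(np.dot(wa, dists.min(axis=1)))
    b_to_a = float(np.dot(wb, dists.min(axis=0)))
    # Both directions are lower bounds; the larger one is the tighter bound.
    return max(a_to_b, b_to_a)


# Hypothetical toy data: three 2-D embeddings.
toy_embeddings = np.array([[0.0, 0.0], [3.0, 4.0], [0.0, 1.0]])
print(relaxed_wmd(toy_embeddings, [0], [1.0], [1], [1.0]))  # single items 5 apart
```

When two documents use the same set of items, this bound is 0 regardless of the weights, which is why the exact distance needs a transport solver.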

The high-level logic is written in Python, while the low-level functions related to
linear programming are offloaded to the bundled native extension. The native
extension can also be built as a generic shared library that does not depend on Python.
**Python 2.7 and older are not supported.** The heavy lifting is done by
[google/or-tools](https://github.com/google/or-tools).


### Installation

```
pip3 install wmd
```
Tested on Linux and macOS.

### Usage

You need an embeddings numpy array and an nbow (normalized bag-of-words) model:
every sample is a weighted set of items, and every item has an embedding.

```python
import numpy
from wmd import WMD

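# One embedding vector per item; the row index is the item identifier.
embeddings = numpy.array([[0.1, 1], [1, 0.1]], dtype=numpy.float32)
# Sample identifier -> (human-readable name, item identifiers, float32 weights).
nbow = {"first": ("#1", [0, 1], numpy.array([1.5, 0.5], dtype=numpy.float32)),
        "second": ("#2", [0, 1], numpy.array([0.75, 0.15], dtype=numpy.float32))}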
calc = WMD(embeddings, nbow, vocabulary_min=2)
print(calc.nearest_neighbors("first"))
```
```
[('second', 0.10606599599123001)]
```
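The printed value can be reproduced by hand, which helps when sanity-checking your own inputs. The sketch below assumes that the weights of each sample are normalized to sum to 1 (an assumption, but one that is consistent with the output above). With only two items shared by both samples, the cheapest transport plan leaves the overlapping mass in place and moves the surplus from one item to the other.

```python
import numpy as np

embeddings = np.array([[0.1, 1], [1, 0.1]], dtype=np.float32)
w_first = np.array([1.5, 0.5], dtype=np.float64)
w_second = np.array([0.75, 0.15], dtype=np.float64)

# Assumed normalization: each sample carries a total mass of 1.
a = w_first / w_first.sum()    # [0.75, 0.25]
b = w_second / w_second.sum()  # about [0.8333, 0.1667]

# Euclidean distance between the two item embeddings.
e = embeddings.astype(np.float64)
ground = float(np.linalg.norm(e[0] - e[1]))

# Mass that has to move from item 1 to item 0 to turn `a` into `b`.
surplus = abs(a[0] - b[0])

hand_wmd = surplus * ground
print(hand_wmd)  # approximately 0.10607
```

The small difference from the library's printed value comes from float32 arithmetic.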

`embeddings` must support `__getitem__`, which returns an item's embedding by its
identifier; in particular, `numpy.ndarray` satisfies that interface.
`nbow` must be iterable, yielding sample identifiers, and must support
`__getitem__` by those identifiers, returning tuples of length 3.
The first element is the human-readable name of the sample, the
second is an iterable of item identifiers, and the third is a `numpy.ndarray`
with the corresponding weights. All numpy arrays must be float32. The return
value is a list of tuples, each holding a sample identifier and its distance
to the query sample (lower is better).
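As an illustration of the `nbow` interface described above, here is a minimal sketch that builds such a mapping from tokenized documents. The helper `build_nbow`, the `vocabulary` mapping, and the sample documents are hypothetical names introduced for this example; they are not part of the package. A plain `dict` is enough because it iterates over its keys and supports `__getitem__`.

```python
from collections import Counter

import numpy as np


def build_nbow(token_docs, vocabulary):
    """Build {sample_id: (name, item_ids, float32 weights)} from token lists.

    `token_docs` maps a sample identifier to (human-readable name, tokens);
    `vocabulary` maps a token to its row index in the embeddings array.
    """
    nbow = {}
    for key, (name, tokens) in token_docs.items():
        # Count only tokens that have an embedding; the rest are dropped.
        counts = Counter(vocabulary[t] for t in tokens if t in vocabulary)
        ids = sorted(counts)
        weights = np.array([counts[i] for i in ids], dtype=np.float32)
        nbow[key] = (name, ids, weights)
    return nbow


# Hypothetical vocabulary and documents.
vocabulary = {"president": 0, "press": 1, "media": 2}
docs = {
    "d1": ("Doc one", ["president", "press", "press", "unknown"]),
    "d2": ("Doc two", ["media"]),
}
nbow = build_nbow(docs, vocabulary)
print(nbow["d1"])
```

Raw counts are used as weights here; any other non-negative weighting (for example TF-IDF) would fit the same tuple layout.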

It is possible to use this package with [spaCy](https://github.com/explosion/spaCy):

```python
import spacy
import wmd

nlp = spacy.load('en_core_web_md')
nlp.add_pipe(wmd.WMD.SpacySimilarityHook(nlp), last=True)
doc1 = nlp("Politician speaks to the media in Illinois.")
doc2 = nlp("The president greets the press in Chicago.")
print(doc1.similarity(doc2))
```

See also another [example](spacy_example.py), which finds similar Wikipedia
pages.

### Building from source

Either build it as a Python package:

```
pip3 install git+https://github.com/src-d/wmd-relax
```

or use CMake:

```
git clone --recursive https://github.com/src-d/wmd-relax
cd wmd-relax
cmake -D CMAKE_BUILD_TYPE=Release .
make -j
```

Note the `--recursive` flag for `git clone`: this project uses source{d}'s
fork of [google/or-tools](https://github.com/google/or-tools) as a git submodule.

### Tests

Tests are in `test.py` and use the stock `unittest` package.

### Documentation

```
cd doc
make html
```

The generated files are in the `doc/doxyhtml` and `doc/html` directories.

### Contributions

...are welcome! See [CONTRIBUTING](CONTRIBUTING.md) and [code of conduct](CODE_OF_CONDUCT.md).

### License
[Apache 2.0](LICENSE.md)
