Accelerated functions to calculate Word Mover's Distance

Fast Word Mover's Distance

Calculates Word Mover's Distance as described in From Word Embeddings To Document Distances by Matt Kusner, Yu Sun, Nicholas Kolkin and Kilian Weinberger.
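
For reference, the distance defined in the cited paper can be written as the optimal transport problem below. The symbols follow the paper's notation rather than this package's API: d and d' are the normalized item weights of the two samples, x_i is the embedding of item i, and T_ij is the amount of weight moved from item i to item j.

```latex
\mathrm{WMD}(d, d') = \min_{T \ge 0} \sum_{i,j} T_{ij} \, \lVert x_i - x_j \rVert_2
\quad \text{subject to} \quad
\sum_{j} T_{ij} = d_i \;\; \forall i, \qquad
\sum_{i} T_{ij} = d'_j \;\; \forall j
```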

The high-level logic is written in Python; the low-level functions related to linear programming are offloaded to the bundled native extension. The native extension can also be built as a generic shared library that does not depend on Python at all. Python 2.7 and older are not supported. The heavy lifting is done by google/or-tools.

Installation

pip3 install wmd

Tested on Linux and macOS.

Usage

You need an embeddings numpy array and an nbow (normalized bag-of-words) model: every sample is a weighted set of items, and every item has an embedding.

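import numpy
from wmd import WMD

# One row per item: item i is embedded as embeddings[i].
embeddings = numpy.array([[0.1, 1], [1, 0.1]], dtype=numpy.float32)
# sample identifier -> (human-readable name, item identifiers, item weights)
nbow = {"first":  ("#1", [0, 1], numpy.array([1.5, 0.5], dtype=numpy.float32)),
        "second": ("#2", [0, 1], numpy.array([0.75, 0.15], dtype=numpy.float32))}
calc = WMD(embeddings, nbow, vocabulary_min=2)
# Rank the other samples by their distance to "first" (lower is better).
print(calc.nearest_neighbors("first"))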

[('second', 0.10606599599123001)]
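
The printed value can be reproduced by hand, which may help when sanity-checking your own inputs. The sketch below is plain Python and does not call the library. It assumes each sample's weights are normalized to sum to 1, an assumption that is consistent with the output above, and it relies on the fact that with only two items the cheapest plan simply moves the surplus weight from one item to the other.

```python
import math

# Values copied from the usage example.
embeddings = [[0.1, 1.0], [1.0, 0.1]]
first = [1.5, 0.5]
second = [0.75, 0.15]


def normalize(weights):
    """Scale weights so that they sum to 1 (assumed preprocessing step)."""
    total = sum(weights)
    return [w / total for w in weights]


p = normalize(first)   # [0.75, 0.25]
q = normalize(second)  # roughly [0.8333, 0.1667]

# Euclidean distance between the two item embeddings.
ground_distance = math.sqrt(
    sum((a - b) ** 2 for a, b in zip(embeddings[0], embeddings[1]))
)

# Two items only: the weight both samples share stays in place,
# and the surplus |p[0] - q[0]| travels the full ground distance.
moved_mass = abs(p[0] - q[0])
distance = moved_mass * ground_distance
print(distance)
```

The result is about 0.106066, which agrees with the library output up to float32 rounding. Samples with more than two items need a real transport solver, which is what the native extension provides.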

embeddings must support __getitem__, which returns an item's embedding by its identifier; numpy.ndarray satisfies that interface. nbow must be iterable, yielding sample identifiers, and must support __getitem__ by those identifiers, returning tuples of length 3. The first element is the human-readable name of the sample, the second is an iterable of item identifiers, and the third is a numpy.ndarray with the corresponding weights. All numpy arrays must be float32. The return value is a list of tuples, each holding a sample identifier and its relevancy index (lower is better).
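
Because nbow only needs iteration and __getitem__, it does not have to be a dict. The sketch below shows one way such a container could look, building the 3-tuples on demand from raw counts. LazyNBOW and its input layout are hypothetical names made up for illustration; the class has not been verified against the library, which may rely on methods beyond the two described above.

```python
import numpy


class LazyNBOW:
    """Hypothetical container exposing the interface described above."""

    def __init__(self, documents):
        # documents: sample identifier -> {item identifier: raw count}
        self._documents = documents

    def __iter__(self):
        # Iterating yields sample identifiers.
        return iter(self._documents)

    def __getitem__(self, sample_id):
        counts = self._documents[sample_id]
        item_ids = sorted(counts)
        # Weights must be a float32 numpy array aligned with item_ids.
        weights = numpy.array([counts[i] for i in item_ids], dtype=numpy.float32)
        return str(sample_id), item_ids, weights


nbow = LazyNBOW({"first": {0: 3, 1: 1}, "second": {0: 5, 1: 1}})
for sample_id in nbow:
    name, item_ids, weights = nbow[sample_id]
    print(name, item_ids, weights.dtype)
```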

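It is possible to use this package with spaCy. The snippet below passes a component object to nlp.add_pipe, which is the spaCy 2.x calling convention: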

import spacy
import wmd

nlp = spacy.load('en_core_web_md')
nlp.add_pipe(wmd.WMD.SpacySimilarityHook(nlp), last=True)
doc1 = nlp("Politician speaks to the media in Illinois.")
doc2 = nlp("The president greets the press in Chicago.")
print(doc1.similarity(doc2))

See also another example, which finds similar Wikipedia pages.

Building from source

Either build it as a Python package:

pip3 install git+https://github.com/src-d/wmd-relax

or use CMake:

git clone --recursive https://github.com/src-d/wmd-relax
cd wmd-relax
cmake -D CMAKE_BUILD_TYPE=Release .
make -j

Note the --recursive flag for git clone: this project uses source{d}'s fork of google/or-tools as a git submodule.

Tests

Tests are in test.py and use the stock unittest package.

Documentation

cd doc
make html

The generated files are placed in the doc/doxyhtml and doc/html directories.

Contributions

...are welcome! See CONTRIBUTING and code of conduct.

License

Apache 2.0

Release history

Version	Release date
1.3.2 (this version)	Oct 21, 2019
1.3.1	Apr 23, 2019
1.3.0	Oct 29, 2018
1.2.11	Aug 31, 2018
1.2.10	Aug 21, 2018
1.2.8	Jan 28, 2018
1.2.7	Jan 23, 2018
1.2.6	Jul 19, 2017
1.2.5	Jul 13, 2017
1.2.4	May 8, 2017
1.2.3	May 8, 2017
1.2.2	May 6, 2017
1.2.1	May 5, 2017
1.2.0	Apr 27, 2017
1.1.6	Apr 27, 2017
Download files

Source Distribution

wmd-1.3.2.tar.gz (104.6 kB)

Uploaded Oct 21, 2019 Source

Built Distribution

wmd-1.3.2-cp37-cp37m-manylinux1_x86_64.whl (637.1 kB)

Uploaded Oct 21, 2019 CPython 3.7m

File details

Details for the file wmd-1.3.2.tar.gz.

File metadata

Download URL: wmd-1.3.2.tar.gz
Upload date: Oct 21, 2019
Size: 104.6 kB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: twine/2.0.0 pkginfo/1.5.0.1 requests/2.22.0 setuptools/41.4.0 requests-toolbelt/0.9.1 tqdm/4.28.1 CPython/3.7.3

File hashes

Hashes for wmd-1.3.2.tar.gz
Algorithm	Hash digest
SHA256	`a85736214e2c8b7ad650727ab2d08f937c791fe48b1eb11e964a9aee9d79569f`
MD5	`6071a2633571041dec1cbbd855ffde97`
BLAKE2b-256	`e514e1d122e56607ae49999041f372fa14166eb1e3b838122118d706f9bf1620`
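
A downloaded archive can be checked against the SHA256 digest listed above using only the Python standard library. This is a minimal sketch: the helper names sha256_of and matches_published_hash are made up for illustration, and the file path in the usage note is an assumption about where the archive was saved.

```python
import hashlib

# SHA256 digest published above for wmd-1.3.2.tar.gz.
EXPECTED_SHA256 = "a85736214e2c8b7ad650727ab2d08f937c791fe48b1eb11e964a9aee9d79569f"


def sha256_of(path, chunk_size=1 << 20):
    """Return the hex SHA256 digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_published_hash(path, expected=EXPECTED_SHA256):
    """True if the file at path has the expected SHA256 digest."""
    return sha256_of(path) == expected
```

For example, matches_published_hash("wmd-1.3.2.tar.gz") should return True for an intact copy of the source distribution saved in the current directory.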

File details

Details for the file wmd-1.3.2-cp37-cp37m-manylinux1_x86_64.whl.

File metadata

Download URL: wmd-1.3.2-cp37-cp37m-manylinux1_x86_64.whl
Upload date: Oct 21, 2019
Size: 637.1 kB
Tags: CPython 3.7m
Uploaded using Trusted Publishing? No
Uploaded via: twine/2.0.0 pkginfo/1.5.0.1 requests/2.22.0 setuptools/41.4.0 requests-toolbelt/0.9.1 tqdm/4.28.1 CPython/3.7.3

File hashes

Hashes for wmd-1.3.2-cp37-cp37m-manylinux1_x86_64.whl
Algorithm	Hash digest
SHA256	`865054ad3404275858183f5f2ea1a0a2af70f1aea6565bd6849c51063cf6f9db`
MD5	`0465076c83836d50e5f06c0ba559489b`
BLAKE2b-256	`6f6a57e4c9258402481bdb02c370cec1b67bd2981fb68eb6e141fbb3d461415f`
