A string similarity measure based on the Earth Mover's Distance

N-gram Mover's Distance

Why another string matching algorithm?

Edit distance wasn't cutting it when I needed to look up a misspelled word in a dictionary:
- With an edit distance of 1 or 2, the results are not very useful
- With a distance >= 5, the results are meaningless
- The same goes for Damerau-Levenshtein
Edit distance is also pretty slow when looking up long words in a large dictionary:
- Even with a decent automaton or trie implementation
- NMD was designed with indexing in mind
- A simpler index could be used for Jaccard or cosine similarity over n-grams
  - todo: try this paper's algo
  - which referenced this paper
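
To make the Jaccard comparison above concrete, here is a minimal sketch of Jaccard similarity over character n-grams. It is not part of nmd; `char_ngrams` and `jaccard_similarity` are hypothetical names used only for illustration.

```python
def char_ngrams(text, n=2):
    """Return the set of character n-grams of `text` (empty if text is shorter than n)."""
    return {text[i:i + n] for i in range(len(text) - n + 1)}


def jaccard_similarity(a, b, n=2):
    """Jaccard similarity of the two strings' n-gram sets, in the range 0 to 1."""
    grams_a, grams_b = char_ngrams(a, n), char_ngrams(b, n)
    if not grams_a and not grams_b:
        return 1.0  # assumption: two strings without any n-grams count as identical
    return len(grams_a & grams_b) / len(grams_a | grams_b)


# 'hello' and 'yellow' share 3 bigrams (el, ll, lo) out of 6 distinct ones
print(jaccard_similarity('hello', 'yellow'))  # 0.5
```

Because the n-grams are compared as sets, this score ignores where in the string each n-gram occurs and how often it repeats.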

Usage

`ngram_movers_distance()`

A string distance metric. Use this to compare two strings.

from nmd import ngram_movers_distance

# n-gram mover's distance
print(ngram_movers_distance('hello', 'yellow'))

# similarity (inverted distance)
print(ngram_movers_distance('hello', 'yellow', invert=True))

# distance, normalized to the range 0 to 1 (inclusive of 0 and 1)
print(ngram_movers_distance('hello', 'yellow', normalize=True))

# similarity, normalized to the range 0 to 1 (inclusive of 0 and 1)
print(ngram_movers_distance('hello', 'yellow', invert=True, normalize=True))

`WordList`

Use this for dictionary lookups of words.

from nmd import WordList

# get words from a text file
with open('words_ms.txt', encoding='utf8') as f:
    words = set(f.read().split())

# index words
word_list = WordList((2, 4), filter_n=0)  # combined 2- and 4-grams seem to work best
for word in words:
    word_list.add_word(word)

# lookup a word
print(word_list.lookup('asalamalaikum'))  # -> 'assalamualaikum'
print(word_list.lookup('walaikumalasam'))  # -> 'waalaikumsalam'

`bow_ngram_movers_distance()`

WARNING: this requires scipy.optimize, so it is not available by default in the nmd namespace.
Use this to compare sequences of tokens (not necessarily unique).

from nmd.nmd_bow import bow_ngram_movers_distance
from tokenizer import unicode_tokenize

text_1 = 'Clementi Sports Hub'
text_2 = 'sport hubs clemmeti'
print(bow_ngram_movers_distance(bag_of_words_1=unicode_tokenize(text_1.casefold(), words_only=True),
                                bag_of_words_2=unicode_tokenize(text_2.casefold(), words_only=True),
                                invert=True,  # invert: return similarity instead of distance
                                normalize=True,  # return a score between 0 and 1
                                ))

todo

use less bizarre test strings
rename nmd_bow because it isn't really a bag-of-words, it's a token sequence
consider a real_quick_ratio-like optimization, or maybe calculate length bounds?
- needs a cutoff to actually speed things up, which makes a huge difference for difflib
- a sufficiently low cutoff is not unreasonable, although difflib's default of 0.6 might be a little high for nmd
- that said, difflib performs pretty badly at low similarities, so 0.6 is reasonable there
- for reference, difflib's SequenceMatcher.real_quick_ratio (with its ratio helper inlined):

def real_quick_ratio(self):
    """Return an upper bound on ratio() very quickly.

    This isn't defined beyond that it is an upper bound on .ratio(), and
    is faster to compute than either .ratio() or .quick_ratio().
    """

    la, lb = len(self.a), len(self.b)
    # can't have more matches than the number of elements in the shorter sequence
    matches, length = min(la, lb), la + lb
    if length:
        return 2.0 * matches / length
    return 1.0
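
As a sketch of why the cutoff matters, the snippet below runs difflib's cheap upper bounds before the expensive `ratio()` call, which is the same cascade `difflib.get_close_matches` uses. It uses difflib only, not nmd; `close_matches` is a hypothetical helper, and an equivalent upper bound for NMD would still have to be worked out.

```python
from difflib import SequenceMatcher


def close_matches(word, candidates, cutoff=0.6):
    """Return (ratio, candidate) pairs with ratio >= cutoff, best first.

    The cheap upper bounds are checked first, so ratio() only runs for
    candidates that could still reach the cutoff.
    """
    matcher = SequenceMatcher()
    matcher.set_seq2(word)  # difflib caches data about seq2, so keep the query there
    results = []
    for candidate in candidates:
        matcher.set_seq1(candidate)
        if (matcher.real_quick_ratio() >= cutoff
                and matcher.quick_ratio() >= cutoff):
            ratio = matcher.ratio()
            if ratio >= cutoff:
                results.append((ratio, candidate))
    return sorted(results, reverse=True)


# 'hi' and the long word are rejected by the length bound alone
print(close_matches('hello', ['yellow', 'help', 'hi', 'supercalifragilistic']))
```

With a cutoff of 0, every candidate passes both bounds and nothing is saved, which is why the optimization only pays off when a cutoff is set.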

create a better string container for the index, more like a set
- add(word: str)
- remove(word: str)
- clear()
- __contains__(word: str)
- __iter__()
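
A set-like container along the lines listed above might look like the sketch below. `IndexedWordSet` and its internals are hypothetical and are not how `WordList` is implemented; the point is only that `add`, `remove` and `clear` have to keep the n-gram index in sync with the stored words.

```python
from collections import defaultdict


class IndexedWordSet:
    """Hypothetical set-like word container that keeps an n-gram index in sync."""

    def __init__(self, n=2):
        self.n = n
        self._words = set()
        self._index = defaultdict(set)  # n-gram -> words containing that n-gram

    def _ngrams(self, word):
        return {word[i:i + self.n] for i in range(len(word) - self.n + 1)}

    def add(self, word):
        if word not in self._words:
            self._words.add(word)
            for gram in self._ngrams(word):
                self._index[gram].add(word)

    def remove(self, word):
        self._words.remove(word)  # raises KeyError if absent, like set.remove
        for gram in self._ngrams(word):
            self._index[gram].discard(word)
            if not self._index[gram]:
                del self._index[gram]  # drop n-grams that no word uses any more

    def clear(self):
        self._words.clear()
        self._index.clear()

    def __contains__(self, word):
        return word in self._words

    def __iter__(self):
        return iter(self._words)

    def __len__(self):
        return len(self._words)


words = IndexedWordSet()
words.add('hello')
words.add('yellow')
print('hello' in words)  # True
words.remove('hello')
print(sorted(words))  # ['yellow']
```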
better lookup
- add a min_similarity filter (float, based on normalized distance)
  - lookup(word: str, min_similarity: float = 0, filter: bool = True)
- try __contains__ first
  - try levenshtein automaton (distance=1) second?
    - sort by nmd, since most likely there will only be a few results
  - but how to get multiple results?
    - still need to run full search?
    - or maybe just return top 1 result?
- make the 3-gram filter optional
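
The lookup order sketched in the notes above (exact membership first, scored search with a `min_similarity` filter second) could look roughly like this. The `similarity` function is a difflib stand-in, not the NMD score, and `lookup` here is a hypothetical free function rather than the `WordList.lookup` method.

```python
from difflib import SequenceMatcher


def similarity(a, b):
    """Stand-in similarity in the range 0 to 1 (difflib ratio, not NMD)."""
    return SequenceMatcher(None, a, b).ratio()


def lookup(word, vocabulary, min_similarity=0.0):
    """Return (score, word) pairs with score >= min_similarity, best first."""
    if word in vocabulary:
        return [(1.0, word)]  # exact hit: skip the full search entirely
    scored = ((similarity(word, candidate), candidate) for candidate in vocabulary)
    return sorted((pair for pair in scored if pair[0] >= min_similarity), reverse=True)


vocabulary = {'hello', 'yellow', 'help'}
print(lookup('hello', vocabulary))  # exact hit only
print(lookup('helo', vocabulary, min_similarity=0.7))  # 'hello' first, then 'help'
```

Returning only the exact hit is the simplest choice; it leaves open the question raised above of how to return multiple results without running the full search.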
prefix lookup
- look for all strings that are approximately prefixed
- like existing index but not normalized and ignoring unmatched ngrams from target
bag of words
- use WMD with NMD word distances
- may require proper EMD implementation?

Publishing (notes for myself)

init
- pip install flit
- flit init
- make sure nmd/__init__.py contains a docstring and version
publish / update
- increment __version__ in nmd/__init__.py

Project details

Release history

- 0.0.6: Jul 13, 2023 (this version)
- 0.0.5: Aug 2, 2022
- 0.0.4: Aug 2, 2022
- 0.0.3: Aug 2, 2022
- 0.0.2: Aug 2, 2022
- 0.0.1: Aug 2, 2022

Download files

Source Distribution
- nmd-0.0.5.tar.gz (2.1 MB), uploaded Aug 2, 2022

Built Distribution
- nmd-0.0.5-py2.py3-none-any.whl (12.4 kB), uploaded Aug 2, 2022, Python 2 and Python 3

Hashes for nmd-0.0.5.tar.gz

Algorithm	Hash digest
SHA256	`e9f9bc7b308eabc1c7cbabc338f321dd94ff088d260ea0c44fdd1bbc439aa8cc`
MD5	`45dded17a55eebf2da9e99a131efa980`
BLAKE2b-256	`982b1620e1b91add22d8007fc3baad449f7ff29f2a5bd3019ff7e3451e90fe6f`

Hashes for nmd-0.0.5-py2.py3-none-any.whl

Algorithm	Hash digest
SHA256	`28efd1965d31b4d20797143a2e447d03766c48aa25ef6449fb4c788ac5611947`
MD5	`4830b00c076f0506988ad3b17c544cdf`
BLAKE2b-256	`e3e3250d8e751a3adde8c9f381d8c39f1f628558d72d38dd0e905090e0fd2a07`