A string similarity measure based on the Earth Mover's Distance
Project description
N-gram Mover's Distance
A string similarity measure based on Earth Mover's Distance
Why another string matching algorithm?
- Edit distance really wasn't cutting it when I needed to look up a dictionary for a misspelled word
- With an edit distance of 1 or 2, the results are not very useful
- With a distance >=5, the results are meaningless
- Same goes for Damerau-Levenshtein
- Also, edit distance is pretty slow when looking up long words in a large dictionary
- Even with a decent automaton or trie implementation
- NMD was designed with indexing in mind
- A simpler index could be used for Jaccard or cosine similarity over ngrams
- todo: try this paper's algo
- which referenced this paper
Usage
ngram_movers_distance()
- string distance metric, use this to compare two strings
from nmd import ngram_movers_distance
# n-gram mover's distance
print(ngram_movers_distance(f'hello', f'yellow'))
# similarity (inverted distance)
print(ngram_movers_distance(f'hello', f'yellow', invert=True))
# distance, normalized to the range 0 to 1 (inclusive of 0 and 1)
print(ngram_movers_distance(f'hello', f'yellow', normalize=True))
# similarity, normalized to the range 0 to 1 (inclusive of 0 and 1)
print(ngram_movers_distance(f'hello', f'yellow', invert=True, normalize=True))
WordList
- use this for dictionary lookups of words
from nmd import WordList
# get words from a text file
with open(f'words_ms.txt', encoding=f'utf8') as f:
words = set(f.read().split())
# index words
word_list = WordList((2, 4), filter_n=0) # combined 2- and 4-grams seem to work best
for word in words:
word_list.add_word(word)
# lookup a word
print(word_list.lookup(f'asalamalaikum')) # -> 'assalamualaikum'
print(word_list.lookup(f'walaikumalasam')) # -> 'waalaikumsalam'
bow_ngram_movers_distance()
- WARNING: requires
scipy.optimize
, so it's not available by default in thenmd
namespace - use this to compare sequences of tokens (not necessarily unique)
from nmd.nmd_bow import bow_ngram_movers_distance
from tokenizer import unicode_tokenize
text_1 = f'Clementi Sports Hub'
text_2 = f'sport hubs clemmeti'
print(bow_ngram_movers_distance(bag_of_words_1=unicode_tokenize(text_1.casefold(), words_only=True),
bag_of_words_2=unicode_tokenize(text_2.casefold(), words_only=True),
invert=True, # invert: return similarity instead of distance
normalize=True, # return a score between 0 and 1
))
todo
- use less bizarre test strings
- rename nmd_bow because it isn't really a bag-of-words, it's a token sequence
- consider a
real_quick_ratio
-like optimization, or maybe calculate length bounds?- needs a cutoff to actually speed up though, makes a huge difference for difflib
- a sufficiently low cutoff is not unreasonable, although the default of 0.6 might be a little high for nmd
- that said the builtin diff performs pretty badly at low similarities, so 0.6 is reasonable for them
def real_quick_ratio(self):
"""Return an upper bound on ratio() very quickly.
This isn't defined beyond that it is an upper bound on .ratio(), and
is faster to compute than either .ratio() or .quick_ratio().
"""
la, lb = len(self.a), len(self.b)
# can't have more matches than the number of elements in the shorter sequence
matches, length = min(la, lb), la + lb
if length:
return 2.0 * matches / length
return 1.0
- create a better string container for the index, more like a
set
add(word: str)
remove(word: str)
clear()
__contains__(word: str)
__iter__()
- better lookup
- add a min_similarity filter (float, based on normalized distance)
lookup(word: str, min_similarity: float = 0, filter: bool = True)
- try
__contains__
first- try levenshtein automaton (distance=1) second?
- sort by nmd, since most likely there will only be a few results
- but how to get multiple results?
- still need to run full search?
- or maybe just return top 1 result?
- try levenshtein automaton (distance=1) second?
- make the 3-gram filter optional
- add a min_similarity filter (float, based on normalized distance)
- prefix lookup
- look for all strings that are approximately prefixed
- like existing index but not normalized and ignoring unmatched ngrams from target
- bag of words
- use WMD with NMD word distances
- may require proper EMD implementation?
Publishing (notes for myself)
- init
pip install flit
flit init
- make sure
nmd/__init__.py
contains a docstring and version
- publish / update
- increment
__version__
innmd/__init__.py
- increment
Project details
Download files
Download the file for your platform. If you're not sure which to choose, learn more about installing packages.
Source Distribution
nmd-0.0.5.tar.gz
(2.1 MB
view hashes)
Built Distribution
nmd-0.0.5-py2.py3-none-any.whl
(12.4 kB
view hashes)