# N-gram Mover's Distance

A string similarity measure based on Earth Mover's Distance.
## Why another string matching algorithm?
- Edit distance wasn't cutting it when I needed to look up a misspelled word in a dictionary
  - With an edit distance of 1 or 2, the results are not very useful
  - With a distance >= 5, the results are meaningless
  - The same goes for Damerau-Levenshtein
- Edit distance is also pretty slow when looking up long words in a large dictionary
  - Even with a decent automaton or trie implementation
- NMD was designed with indexing in mind
  - A simpler index could be used for Jaccard or cosine similarity over n-grams
- todo: try this paper's algorithm
  - which referenced this paper
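As a point of comparison for the indexing idea above, here is a minimal stdlib-only sketch of Jaccard similarity over character n-grams. This is not part of the nmd package; the `ngrams` helper and its padding convention are assumptions for illustration.

```python
from collections import Counter


def ngrams(word, n=2):
    # pad both ends so short words still produce n-grams
    padded = f'\0{word}\0'
    return Counter(padded[i:i + n] for i in range(len(padded) - n + 1))


def jaccard_ngram_similarity(word_1, word_2, n=2):
    # multiset Jaccard: |A & B| / |A | B|
    a, b = ngrams(word_1, n), ngrams(word_2, n)
    intersection = sum((a & b).values())
    union = sum((a | b).values())
    return intersection / union if union else 1.0
```

An inverted index from n-gram to word works directly for this measure, since only shared n-grams contribute to the score.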
## Usage
### `nmd.ngram_movers_distance`

String distance metric; use this to compare two strings.
```python
from nmd.nmd import ngram_movers_distance

# n-gram mover's distance
print(ngram_movers_distance('hello', 'yellow'))

# similarity (inverted distance)
print(ngram_movers_distance('hello', 'yellow', invert=True))

# distance, normalized to the range 0 to 1 (inclusive)
print(ngram_movers_distance('hello', 'yellow', normalize=True))

# similarity, normalized to the range 0 to 1 (inclusive)
print(ngram_movers_distance('hello', 'yellow', invert=True, normalize=True))
```
### `nmd_index.WordList`

Use this for dictionary lookups of words.
```python
from nmd.nmd_index import WordList

# get words from a text file
with open('words_ms.txt', encoding='utf8') as f:
    words = set(f.read().split())

# index words
word_list = WordList((2, 4), filter_n=0)  # combined 2- and 4-grams seem to work best
for word in words:
    word_list.add_word(word)

# look up a word
print(word_list.lookup('asalamalaikum'))  # -> 'assalamualaikum'
print(word_list.lookup('walaikumalasam'))  # -> 'waalaikumsalam'
```
### `nmd_bow.bow_ngram_movers_distance`

Use this to compare sequences of tokens (not necessarily unique).
```python
from nmd.nmd_bow import bow_ngram_movers_distance
from tokenizer import unicode_tokenize

text_1 = 'Clementi Sports Hub'
text_2 = 'sport hubs clemmeti'

print(bow_ngram_movers_distance(bag_of_words_1=unicode_tokenize(text_1.casefold(), words_only=True),
                                bag_of_words_2=unicode_tokenize(text_2.casefold(), words_only=True),
                                invert=True,  # invert: return similarity instead of distance
                                normalize=True,  # return a score between 0 and 1
                                ))
```
## todo
- rename `nmd_bow`, because it isn't really a bag-of-words; it's a token sequence
- consider a `real_quick_ratio`-like optimization, or maybe calculate length bounds
  - it needs a cutoff to actually provide a speedup, and with one it makes a huge difference for difflib
  - a sufficiently low cutoff is not unreasonable, although difflib's default of 0.6 might be a little high for nmd
  - that said, the builtin diff performs pretty badly at low similarities, so 0.6 is reasonable for difflib
```python
def real_quick_ratio(self):
    """Return an upper bound on ratio() very quickly.

    This isn't defined beyond that it is an upper bound on .ratio(), and
    is faster to compute than either .ratio() or .quick_ratio().
    """
    la, lb = len(self.a), len(self.b)
    # can't have more matches than the number of elements in the shorter sequence
    matches, length = min(la, lb), la + lb
    if length:
        return 2.0 * matches / length
    return 1.0
```
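An analogous bound for nmd could be computed from string lengths alone. This is a hypothetical sketch, not part of the package: it assumes a padded word of length `L` yields `L + n - 1` n-grams, and that matched n-gram mass can't exceed the smaller n-gram count.

```python
def real_quick_bound(len_a, len_b, n=2):
    # hypothetical upper bound on normalized n-gram similarity,
    # computed from lengths only (no n-gram extraction needed)
    grams_a = len_a + n - 1  # assumed padded n-gram count for string a
    grams_b = len_b + n - 1  # assumed padded n-gram count for string b
    # at best, every n-gram of the shorter string is matched
    return 2.0 * min(grams_a, grams_b) / (grams_a + grams_b)
```

With a similarity cutoff, a lookup could skip any candidate whose bound already falls below the cutoff, which is where the difflib-style speedup would come from.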
- create a better string container for the index, more like a `set`
  - `add(word: str)`
  - `remove(word: str)`
  - `clear()`
  - `__contains__(word: str)`
  - `__iter__()`
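A minimal sketch of that interface, assuming a plain `set` backing store (`WordSet` is a hypothetical name; a real version would also keep the n-gram index in sync on every mutation):

```python
class WordSet:
    """Hypothetical set-like string container for the index."""

    def __init__(self):
        self._words = set()  # a real version would also maintain the n-gram index

    def add(self, word: str) -> None:
        self._words.add(word)

    def remove(self, word: str) -> None:
        # raises KeyError if the word is missing, matching set semantics
        self._words.remove(word)

    def clear(self) -> None:
        self._words.clear()

    def __contains__(self, word: str) -> bool:
        return word in self._words

    def __iter__(self):
        return iter(self._words)
```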
- better lookup
  - add a min_similarity filter (float, based on normalized distance)
    - `lookup(word: str, min_similarity: float = 0, filter: bool = True)`
  - try `__contains__` first
  - try levenshtein automaton (distance=1) second?
  - sort by nmd, since most likely there will only be a few results
    - but how to get multiple results?
      - still need to run the full search?
      - or maybe just return the top 1 result?
- make the 3-gram filter optional
  - add a min_similarity filter (float, based on normalized distance)
- prefix lookup
  - look for all strings that are approximately prefixed by the query
  - like the existing index, but not normalized, and ignoring unmatched n-grams from the target
- bag of words
  - use WMD with NMD word distances
  - may require a proper EMD implementation?
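Before committing to a proper EMD solver, the WMD idea above could be prototyped with greedy matching. This is a rough stdlib-only sketch, not the package's API: `word_distance` stands in for a normalized word-level NMD, and greedy matching only approximates the true minimum-cost transport.

```python
def greedy_bow_distance(words_1, words_2, word_distance):
    # greedy approximation of word mover's distance over two bags of words:
    # match each word in the first bag to its closest remaining word in the
    # second bag; every unmatched word costs 1 (the maximum normalized distance)
    remaining = list(words_2)
    total = 0.0
    for w1 in words_1:
        if not remaining:
            total += 1.0  # nothing left to match against
            continue
        best = min(remaining, key=lambda w2: word_distance(w1, w2))
        total += word_distance(w1, best)
        remaining.remove(best)
    total += len(remaining)  # leftover words in the second bag
    return total
```

Greedy matching can overpay when an early match steals a word that a later word needed, which is exactly the case a real EMD implementation would handle.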
## Publishing (notes for myself)

```shell
pip install flit
flit init
```