
N-gram Mover's Distance

a string similarity measure based on Earth Mover's Distance

Why another string matching algorithm?

  • Edit distance really wasn't cutting it when I needed to look up misspelled words in a dictionary
    • With an edit distance of 1 or 2, the results are not very useful
    • With a distance >= 5, the results are meaningless
    • The same goes for Damerau-Levenshtein
  • Edit distance is also pretty slow when looking up long words in a large dictionary
    • Even with a decent automaton or trie implementation
    • NMD was designed with indexing in mind
    • A simpler index could be used for Jaccard or cosine similarity over n-grams (see the sketch below)
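For reference, here's a minimal sketch of that simpler baseline: Jaccard similarity over character n-gram sets. The helper names and the boundary padding are illustrative assumptions, not part of this package.

def char_ngrams(text, n=2):
    # pad with a boundary marker so that short words still yield n-grams
    padded = f'\x00{text}\x00'
    return {padded[i:i + n] for i in range(len(padded) - n + 1)}


def jaccard_ngram_similarity(a, b, n=2):
    # intersection over union of the two n-gram sets;
    # unlike NMD, this ignores n-gram multiplicity and position
    grams_a, grams_b = char_ngrams(a, n), char_ngrams(b, n)
    if not grams_a and not grams_b:
        return 1.0
    return len(grams_a & grams_b) / len(grams_a | grams_b)


print(jaccard_ngram_similarity('hello', 'yellow'))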

Usage

ngram_movers_distance()

  • string distance metric, use this to compare two strings
from nmd import ngram_movers_distance

# n-gram mover's distance
print(ngram_movers_distance('hello', 'yellow'))

# similarity (inverted distance)
print(ngram_movers_distance('hello', 'yellow', invert=True))

# distance, normalized to the range 0 to 1 (inclusive of 0 and 1)
print(ngram_movers_distance('hello', 'yellow', normalize=True))

# similarity, normalized to the range 0 to 1 (inclusive of 0 and 1)
print(ngram_movers_distance('hello', 'yellow', invert=True, normalize=True))

WordList

  • use this for dictionary lookups of words
from nmd import WordList

# get words from a text file
with open('words_ms.txt', encoding='utf8') as f:
    words = set(f.read().split())

# index words
word_list = WordList((2, 4), filter_n=0)  # combined 2- and 4-grams seem to work best
for word in words:
    word_list.add_word(word)

# lookup a word
print(word_list.lookup('asalamalaikum'))  # -> 'assalamualaikum'
print(word_list.lookup('walaikumalasam'))  # -> 'waalaikumsalam'

bow_ngram_movers_distance()

  • WARNING: requires scipy.optimize, so it's not available by default in the nmd namespace
  • use this to compare sequences of tokens (not necessarily unique)
from nmd.nmd_bow import bow_ngram_movers_distance
from tokenizer import unicode_tokenize

text_1 = 'Clementi Sports Hub'
text_2 = 'sport hubs clemmeti'
print(bow_ngram_movers_distance(bag_of_words_1=unicode_tokenize(text_1.casefold(), words_only=True),
                                bag_of_words_2=unicode_tokenize(text_2.casefold(), words_only=True),
                                invert=True,  # invert: return similarity instead of distance
                                normalize=True,  # return a score between 0 and 1
                                ))
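
Note that the tokenizer import above is a separate package, not part of nmd. If it isn't available, a rough stand-in could look like the following; this is an assumption for illustration, not equivalent to unicode_tokenize.

import re


def simple_word_tokenize(text):
    # crude replacement for unicode_tokenize(..., words_only=True):
    # extract runs of word characters, dropping punctuation and spaces
    return re.findall(r'\w+', text)


print(simple_word_tokenize('sport hubs clemmeti'))  # -> ['sport', 'hubs', 'clemmeti']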

todo

  • use less bizarre test strings
  • rename nmd_bow, because it isn't really a bag-of-words, it's a token sequence
  • consider a real_quick_ratio-like optimization, or maybe calculate length bounds? (see the sketches below)
    • it needs a cutoff to actually speed things up, but with one it makes a huge difference for difflib
    • a sufficiently low cutoff is not unreasonable, although difflib's default of 0.6 might be a little high for NMD
    • that said, the builtin differ performs pretty badly at low similarities, so 0.6 is reasonable for difflib
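# difflib.SequenceMatcher.real_quick_ratio, reproduced here for reference: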
def real_quick_ratio(self):
    """Return an upper bound on ratio() very quickly.

    This isn't defined beyond that it is an upper bound on .ratio(), and
    is faster to compute than either .ratio() or .quick_ratio().
    """

    la, lb = len(self.a), len(self.b)
    # can't have more matches than the number of elements in the shorter sequence
    matches, length = min(la, lb), la + lb
    if length:
        return 2.0 * matches / length
    return 1.0
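
In the same spirit, here's a hedged sketch of a length-based upper bound for an NMD-style similarity. It assumes a word of length L yields roughly L - n + 1 n-grams; the function name and formula are illustrative, not the package's actual internals.

def quick_similarity_upper_bound(len_a, len_b, n=2):
    # matched n-grams can't exceed the n-gram count of the shorter word
    grams_a = max(len_a - n + 1, 0)
    grams_b = max(len_b - n + 1, 0)
    total = grams_a + grams_b
    if not total:
        return 1.0
    return 2.0 * min(grams_a, grams_b) / total

As with difflib, this only pays off when a min_similarity cutoff lets the lookup discard candidates without computing the full distance.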
  • create a better string container for the index, more like a set (sketched below)
    • add(word: str)
    • remove(word: str)
    • clear()
    • __contains__(word: str)
    • __iter__()
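
A minimal sketch of what that container might look like as a wrapper over the current index; everything here except WordList and add_word is an assumption.

from nmd import WordList


class WordSet:
    # set-like facade over the n-gram index; a sketch, not the package API
    def __init__(self, *index_args, **index_kwargs):
        self._index_args, self._index_kwargs = index_args, index_kwargs
        self._words = set()
        self._index = WordList(*index_args, **index_kwargs)

    def add(self, word: str):
        if word not in self._words:
            self._words.add(word)
            self._index.add_word(word)

    def remove(self, word: str):
        self._words.remove(word)  # raises KeyError if absent
        # true removal would also need the index to un-register the word's n-grams

    def clear(self):
        self._words.clear()
        self._index = WordList(*self._index_args, **self._index_kwargs)

    def __contains__(self, word: str):
        return word in self._words

    def __iter__(self):
        return iter(self._words)
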
  • better lookup
    • add a min_similarity filter (float, based on normalized distance)
      • lookup(word: str, min_similarity: float = 0, filter: bool = True)
    • try __contains__ first
      • try a Levenshtein automaton (distance=1) second?
        • sort those results by NMD, since most likely there will only be a few
      • but how would that get multiple results?
        • still need to run the full search?
        • or maybe just return the top result?
    • make the 3-gram filter optional
  • prefix lookup
    • look for all strings that approximately start with the query
    • like the existing index, but not normalized, and ignoring unmatched n-grams from the target
  • bag of words
    • use WMD with NMD word distances (see the sketch below)
    • may require a proper EMD implementation?
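
A hedged sketch of that bag-of-words idea, using NMD as the word-level ground distance and scipy's assignment solver as a stand-in for a proper EMD; the function name is an assumption, and the approximation is only exact when both bags are the same size with uniform weights.

import numpy as np
from scipy.optimize import linear_sum_assignment

from nmd import ngram_movers_distance


def wmd_via_assignment(words_1, words_2):
    # pairwise normalized NMD between every pair of words
    cost = np.array([[ngram_movers_distance(w1, w2, normalize=True)
                      for w2 in words_2]
                     for w1 in words_1])
    # Hungarian assignment as an EMD stand-in; unmatched words in the
    # larger bag are simply ignored here, which a real EMD would penalize
    rows, cols = linear_sum_assignment(cost)
    return cost[rows, cols].sum() / max(len(words_1), len(words_2))


print(wmd_via_assignment(['clementi', 'sports', 'hub'],
                         ['sport', 'hubs', 'clemmeti']))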

Publishing (notes for myself)

pip install flit
flit init
flit publish
