simplefuzzyset

A simpler python fuzzyset implementation.

These details have not been verified by PyPI

Project links

Homepage

Project description

Forked from https://github.com/axiak/fuzzyset

Removed levenshtein support
No dependencies on install

fuzzyset is a data structure that performs something akin to fulltext search against data to determine likely mispellings and approximate string matching.

Usage

The usage is simple. Just add a string to the set, and ask for it later by using either .get or []:

>>> a = fuzzyset.FuzzySet()
>>> a.add("michael axiak")
>>> a.get("micael asiak")
[(0.8461538461538461, u'michael axiak')]

The result will be a list of (score, mached_value) tuples. The score is between 0 and 1, with 1 being a perfect match.

For roughly 15% performance increase, there is also a Cython-implemented version called cfuzzyset. So you can write the following, akin to cStringIO and cPickle:

try:
    from cfuzzyset import cFuzzySet as FuzzySet
except ImportError:
    from fuzzyset import FuzzySet

Construction Arguments

iterable: An iterable that yields strings to initialize the data structure with

gram_size_lower: The lower bound of gram sizes to use, inclusive (see Theory of operation). Default: 2

gram_size_upper: The upper bound of gram sizes to use, inclusive (see Theory of operation). Default: 3

use_levenshtein: Whether or not to use the levenshtein distance to determine the match scoring. Default: True

Theory of operation

Adding to the data structure

First let’s look at adding a string, ‘michaelich’ to an empty set. We first break apart the string into n-grams (strings of length n). So trigrams of ‘michaelich’ would look like:

'-mi'
'mic'
'ich'
'cha'
'hae'
'ael'
'eli'
'lic'
'ich'
'ch-'

Note that fuzzyset will first normalize the string by removing non word characters except for spaces and commas and force everything to be lowercase.

Next the fuzzyset essentially creates a reverse index on those grams. Maintaining a dictionary that says:

'mic' -> (1, 0)
'ich' -> (2, 0)
...

And there’s a list that looks like:

[(3.31, 'michaelich')]

Note that we maintain this reverse index for all grams from gram_size_lower to gram_size_upper in the constructor. This becomes important in a second.

Retrieving

To search the data structure, we take the n-grams of the query string and perform a reverse index look up. To illustrate, let’s consider looking up 'michael' in our fictitious set containing 'michaelich' where the gram_size_upper and gram_size_lower parameters are default (3 and 2 respectively).

We begin by considering first all trigrams (the value of gram_size_upper). Those grams are:

'-mi'
'mic'
'ich'
'cha'
'el-'

Then we create a list of any element in the set that has at least one occurrence of a trigram listed above. Note that this is just a dictionary lookup 5 times. For each of these matched elements, we compute the cosine similarity between each element and the query string. We then sort to get the most similar matched elements.

If use_levenshtein is false, then we return all top matched elements with the same cosine similarity.

If use_levenshtein is true, then we truncate the possible search space to 50, compute a score based on the levenshtein distance (so that we handle transpositions), and return based on that.

In the event that none of the trigrams matched, we try the whole thing again with bigrams (note though that if there are no matches, the failure to match will be quick). Bigram searching will always be slower because there will be a much larger set to order.

Install

easy_install fuzzyset

License

BSD

Author

Mike Axiak <mike@axiak.net>

Project details

These details have not been verified by PyPI

Project links

Homepage

Release history Release notifications | RSS feed

This version

0.0.12

May 26, 2018

Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

simplefuzzyset-0.0.12.tar.gz (319.9 kB view details)

Uploaded May 26, 2018 Source

Built Distribution

If you're not sure about the file name format, learn more about wheel file names.

The dropdown lists show the available interpreters, ABIs, and platforms. Enable javascript to be able to filter the list of wheel files.

simplefuzzyset-0.0.12-py3-none-any.whl (7.0 kB view details)

Uploaded May 26, 2018 Python 3

File details

Details for the file simplefuzzyset-0.0.12.tar.gz.

File metadata

Download URL: simplefuzzyset-0.0.12.tar.gz
Upload date: May 26, 2018
Size: 319.9 kB
Tags: Source
Uploaded using Trusted Publishing? No

File hashes

Hashes for simplefuzzyset-0.0.12.tar.gz
Algorithm	Hash digest
SHA256	`9a1b30c38b6afb76c6600bdd66c1c1dc3d8505b082e9e3d466f60f40e8b7e1f2`
MD5	`a9caedd19fbd1873ccf7c0a7d32ed65d`
BLAKE2b-256	`cebc7e5d5eaa5566ade033cda9ff0eb51b0942ab2138288b445c469d2814cd2f`

See more details on using hashes here.

File details

Details for the file simplefuzzyset-0.0.12-py3-none-any.whl.

File metadata

Download URL: simplefuzzyset-0.0.12-py3-none-any.whl
Upload date: May 26, 2018
Size: 7.0 kB
Tags: Python 3
Uploaded using Trusted Publishing? No

File hashes

Hashes for simplefuzzyset-0.0.12-py3-none-any.whl
Algorithm	Hash digest
SHA256	`c7f146e600ac9fb8d87a013439903f6bb8337ce902e740d25b8ed5f41b8f44da`
MD5	`461072fe245bf76bd036077cab1e458a`
BLAKE2b-256	`858ed05f60a19b82edb7e40b8a8767f29b804f4212ea58e7d5c5432a85084936`

See more details on using hashes here.

simplefuzzyset 0.0.12

Navigation

Verified details

Maintainers

Unverified details

Project links

Meta

Classifiers

Project description

Usage

Construction Arguments

Theory of operation

Adding to the data structure

Retrieving

Install

License

Author

Project details

Verified details

Maintainers

Unverified details

Project links

Meta

Classifiers

Release history Release notifications | RSS feed

Download files

Source Distribution

Built Distribution

File details

File metadata

File hashes

File details

File metadata

File hashes