A simple Python fuzzyset implementation.

Project description

Note

This is a maintained fork of the fuzzyset package by Mike Axiak, which is unfortunately no longer maintained. This fork is available on PyPI as fuzzyset2.

fuzzyset is a data structure that performs something akin to full-text search against data to find likely misspellings and approximate string matches.

Usage

The usage is simple. Just add a string to the set, and ask for it later by using either .get or []:

>>> import fuzzyset
>>> a = fuzzyset.FuzzySet()
>>> a.add("michael axiak")
>>> a.get("micael asiak")
[(0.8461538461538461, 'michael axiak')]

The result will be a list of (score, matched_value) tuples. The score is between 0 and 1, with 1 being a perfect match.

For a roughly 15% performance increase, there is also a Cython-implemented version called cfuzzyset. You can import it with a fallback to the pure-Python version, akin to cStringIO and cPickle:

try:
    from cfuzzyset import cFuzzySet as FuzzySet
except ImportError:
    from fuzzyset import FuzzySet

Construction Arguments

  • iterable: An iterable that yields strings to initialize the data structure with

  • gram_size_lower: The lower bound of gram sizes to use, inclusive (see Theory of operation). Default: 2

  • gram_size_upper: The upper bound of gram sizes to use, inclusive (see Theory of operation). Default: 3

  • use_levenshtein: Whether or not to use the Levenshtein distance to determine the match scoring. Default: True

Theory of operation

Adding to the data structure

First, let’s look at adding a string, ‘michaelich’, to an empty set. We first break the string apart into n-grams (substrings of length n), padding it with ‘-’ at both ends. So the trigrams of ‘michaelich’ would look like:

'-mi'
'mic'
'ich'
'cha'
'hae'
'ael'
'eli'
'lic'
'ich'
'ch-'

Note that fuzzyset will first normalize the string by removing non-word characters, except for spaces and commas, and forcing everything to lowercase.
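The normalization and gram-splitting steps above can be sketched in a few lines. This is a minimal illustration, not the library's code: the helper names normalize and iter_grams and the exact regular expression are assumptions made for the example.

```python
import re


def normalize(value):
    # Lowercase, then drop everything that is not a word character,
    # a comma, or a space (hypothetical regex, chosen to match the
    # description above).
    return re.sub(r"[^\w, ]+", "", value.lower())


def iter_grams(value, gram_size=3):
    # Pad both ends with '-' so that the first and last characters
    # also appear in a full-size gram.
    padded = "-" + normalize(value) + "-"
    return [padded[i:i + gram_size]
            for i in range(len(padded) - gram_size + 1)]


print(iter_grams("Michaelich"))
```

Running it on 'Michaelich' yields the ten trigrams listed above, including 'ich' twice.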

Next, fuzzyset essentially creates a reverse index on those grams, maintaining a dictionary that says:

'mic' -> (1, 0)
'ich' -> (2, 0)
...

And there’s a list that looks like:

[(3.31, 'michaelich')]

Note that we maintain this reverse index for every gram size from gram_size_lower to gram_size_upper, as given to the constructor. This becomes important in a second.
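As a rough sketch of the bookkeeping described above, the class below keeps a gram -> (count, item index) dictionary plus one item list per gram size. The class name TinyIndex is hypothetical, and storing the Euclidean norm of the gram counts next to each string is an assumption about what the number in that list represents; the value the library stores may differ.

```python
import math
from collections import Counter, defaultdict


def gram_counts(value, gram_size):
    # Count each gram of the '-'-padded, lowercased string.
    padded = "-" + value.lower() + "-"
    return Counter(padded[i:i + gram_size]
                   for i in range(len(padded) - gram_size + 1))


class TinyIndex:
    """Hypothetical, simplified reverse index over n-grams."""

    def __init__(self, gram_size_lower=2, gram_size_upper=3):
        self.sizes = range(gram_size_lower, gram_size_upper + 1)
        # gram -> [(count, item_index), ...]; bigrams and trigrams can
        # share one dictionary because their lengths differ.
        self.match = defaultdict(list)
        # gram size -> [(norm, string), ...]
        self.items = {size: [] for size in self.sizes}

    def add(self, value):
        for size in self.sizes:
            counts = gram_counts(value, size)
            index = len(self.items[size])
            norm = math.sqrt(sum(c * c for c in counts.values()))
            self.items[size].append((norm, value))
            for gram, count in counts.items():
                self.match[gram].append((count, index))


idx = TinyIndex()
idx.add("michaelich")
print(idx.match["mic"], idx.match["ich"])
```

After adding 'michaelich', 'mic' maps to a count of 1 and 'ich' to a count of 2, both pointing at item 0, which mirrors the dictionary shown above.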

Retrieving

To search the data structure, we take the n-grams of the query string and perform a reverse-index lookup. To illustrate, let’s consider looking up 'michael' in our fictitious set containing 'michaelich', where the gram_size_upper and gram_size_lower parameters are at their defaults (3 and 2 respectively).

We begin by considering all trigrams first (the value of gram_size_upper). Those grams are:

'-mi'
'mic'
'ich'
'cha'
'hae'
'ael'
'el-'

Then we create a list of every element in the set that has at least one occurrence of a trigram listed above. Note that this is just 7 dictionary lookups. For each of these matched elements, we compute the cosine similarity between that element and the query string. We then sort to get the most similar matched elements.
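The lookup-and-score step can be sketched as follows. This is a simplified illustration under stated assumptions: build_index and cosine_search are hypothetical names, only one gram size is indexed, and the vectors being compared are plain gram-count vectors.

```python
import math
from collections import Counter, defaultdict


def gram_counts(value, gram_size=3):
    padded = "-" + value.lower() + "-"
    return Counter(padded[i:i + gram_size]
                   for i in range(len(padded) - gram_size + 1))


def build_index(strings, gram_size=3):
    index = defaultdict(list)   # gram -> [(count, item_index), ...]
    norms = []                  # Euclidean norm of each item's gram counts
    for i, s in enumerate(strings):
        counts = gram_counts(s, gram_size)
        norms.append(math.sqrt(sum(c * c for c in counts.values())))
        for gram, count in counts.items():
            index[gram].append((count, i))
    return index, norms


def cosine_search(query, strings, index, norms, gram_size=3):
    q = gram_counts(query, gram_size)
    q_norm = math.sqrt(sum(c * c for c in q.values()))
    dots = defaultdict(int)
    # One dictionary lookup per distinct query gram; only items that
    # share at least one gram with the query ever get a score.
    for gram, q_count in q.items():
        for count, i in index.get(gram, ()):
            dots[i] += q_count * count
    results = [(dot / (q_norm * norms[i]), strings[i])
               for i, dot in dots.items()]
    results.sort(reverse=True)
    return results


strings = ["michaelich", "zebra"]
index, norms = build_index(strings)
print(cosine_search("michael", strings, index, norms))
```

With these two strings, only 'michaelich' shares trigrams with 'michael', so it is the only element scored; 'zebra' is never touched.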

If use_levenshtein is false, then we return all top matched elements with the same cosine similarity.

If use_levenshtein is true, then we truncate the sorted candidates to the top 50, compute a score based on the Levenshtein distance (so that we handle transpositions), and return based on that.
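The rescoring step might look like the sketch below. The function names are hypothetical and the Levenshtein distance is implemented in pure Python to keep the example self-contained. Normalizing the distance by the longer string's length is an assumption, chosen because it reproduces the score from the usage example ('micael asiak' against 'michael axiak' is two edits over 13 characters, giving about 0.846).

```python
def levenshtein(a, b):
    # Classic dynamic-programming edit distance, keeping only one row.
    if len(a) < len(b):
        a, b = b, a
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,               # deletion
                current[j - 1] + 1,            # insertion
                previous[j - 1] + (ca != cb),  # substitution
            ))
        previous = current
    return previous[-1]


def levenshtein_score(a, b):
    # Assumed normalization: 1 minus distance over the longer length.
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0
    return 1.0 - levenshtein(a, b) / longest


def rescore(query, cosine_results, limit=50):
    # cosine_results is a list of (cosine_score, string), best first.
    top = cosine_results[:limit]
    rescored = [(levenshtein_score(query, s), s) for _, s in top]
    rescored.sort(reverse=True)
    return rescored


print(rescore("micael asiak", [(0.7, "michael axiak"), (0.2, "michelle")]))
```

The sketch returns every rescored candidate in order; the library may filter the list further before returning it.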

In the event that none of the trigrams matched, we try the whole thing again with bigrams (note though that if there are no matches, the failure to match will be quick). Bigram searching will always be slower because there will be a much larger set to order.
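The fallback from trigrams to bigrams can be sketched as a loop over gram sizes, from gram_size_upper down to gram_size_lower. The function names here are hypothetical, and for brevity the candidates are found by scanning the strings directly rather than through a reverse index.

```python
def grams(value, size):
    padded = "-" + value.lower() + "-"
    return {padded[i:i + size] for i in range(len(padded) - size + 1)}


def candidates(query, strings, size):
    # Keep every string that shares at least one gram with the query.
    q = grams(query, size)
    return [s for s in strings if q & grams(s, size)]


def search_with_fallback(query, strings, gram_size_lower=2, gram_size_upper=3):
    # Try the largest gram size first; drop to smaller sizes only
    # when nothing matched.
    for size in range(gram_size_upper, gram_size_lower - 1, -1):
        found = candidates(query, strings, size)
        if found:
            return size, found
    return None, []


strings = ["michaelich"]
print(search_with_fallback("michael", strings))  # matched on trigrams
print(search_with_fallback("ha", strings))       # falls back to bigrams
print(search_with_fallback("zz", strings))       # no match at any size
```

The query 'ha' shares no trigram with 'michaelich' ('-ha' and 'ha-' are absent) but does share the bigram 'ha', so it is only found on the second pass.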

Install

pip install fuzzyset2

Afterwards, you can import the package simply with:

try:
    from cfuzzyset import cFuzzySet as FuzzySet
except ImportError:
    from fuzzyset import FuzzySet

License

BSD

Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

fuzzyset2-0.2.0.tar.gz (379.2 kB)

Uploaded Source

Built Distributions

fuzzyset2-0.2.0-cp310-cp310-win_amd64.whl (38.2 kB)

Uploaded CPython 3.10 Windows x86-64

fuzzyset2-0.2.0-cp310-cp310-manylinux_2_5_x86_64.manylinux1_x86_64.manylinux_2_17_x86_64.manylinux2014_x86_64.whl (253.4 kB)

Uploaded CPython 3.10 manylinux: glibc 2.17+ x86-64 manylinux: glibc 2.5+ x86-64

fuzzyset2-0.2.0-cp310-cp310-macosx_10_15_x86_64.whl (44.9 kB)

Uploaded CPython 3.10 macOS 10.15+ x86-64

fuzzyset2-0.2.0-cp39-cp39-win_amd64.whl (38.2 kB)

Uploaded CPython 3.9 Windows x86-64

fuzzyset2-0.2.0-cp39-cp39-manylinux_2_5_x86_64.manylinux1_x86_64.manylinux_2_17_x86_64.manylinux2014_x86_64.whl (249.3 kB)

Uploaded CPython 3.9 manylinux: glibc 2.17+ x86-64 manylinux: glibc 2.5+ x86-64

fuzzyset2-0.2.0-cp39-cp39-macosx_10_15_x86_64.whl (44.9 kB)

Uploaded CPython 3.9 macOS 10.15+ x86-64

fuzzyset2-0.2.0-cp38-cp38-win_amd64.whl (38.2 kB)

Uploaded CPython 3.8 Windows x86-64

fuzzyset2-0.2.0-cp38-cp38-manylinux_2_5_x86_64.manylinux1_x86_64.manylinux_2_17_x86_64.manylinux2014_x86_64.whl (250.2 kB)

Uploaded CPython 3.8 manylinux: glibc 2.17+ x86-64 manylinux: glibc 2.5+ x86-64

fuzzyset2-0.2.0-cp38-cp38-macosx_10_14_x86_64.whl (43.9 kB)

Uploaded CPython 3.8 macOS 10.14+ x86-64

fuzzyset2-0.2.0-cp37-cp37m-win_amd64.whl (37.6 kB)

Uploaded CPython 3.7m Windows x86-64

fuzzyset2-0.2.0-cp37-cp37m-manylinux_2_5_x86_64.manylinux1_x86_64.manylinux_2_17_x86_64.manylinux2014_x86_64.whl (218.3 kB)

Uploaded CPython 3.7m manylinux: glibc 2.17+ x86-64 manylinux: glibc 2.5+ x86-64

fuzzyset2-0.2.0-cp37-cp37m-macosx_10_14_x86_64.whl (42.9 kB)

Uploaded CPython 3.7m macOS 10.14+ x86-64

fuzzyset2-0.2.0-cp36-cp36m-win_amd64.whl (41.9 kB)

Uploaded CPython 3.6m Windows x86-64

fuzzyset2-0.2.0-cp36-cp36m-manylinux_2_5_x86_64.manylinux1_x86_64.manylinux_2_17_x86_64.manylinux2014_x86_64.whl (217.2 kB)

Uploaded CPython 3.6m manylinux: glibc 2.17+ x86-64 manylinux: glibc 2.5+ x86-64

fuzzyset2-0.2.0-cp36-cp36m-macosx_10_14_x86_64.whl (43.1 kB)

Uploaded CPython 3.6m macOS 10.14+ x86-64

File details

Details for the file fuzzyset2-0.2.0.tar.gz.

File metadata

  • Download URL: fuzzyset2-0.2.0.tar.gz
  • Upload date:
  • Size: 379.2 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/3.7.1 importlib_metadata/4.10.1 pkginfo/1.8.2 requests/2.27.1 requests-toolbelt/0.9.1 tqdm/4.62.3 CPython/3.9.10

File hashes

Hashes for fuzzyset2-0.2.0.tar.gz
Algorithm Hash digest
SHA256 009de9edbe2dc7b7c88a0d905d59bbcb1354c319d9cad01482d9dc929b4e226d
MD5 8235a0461b6e46a128c215b032075d9d
BLAKE2b-256 0f87a274c7bd22b39dc878b3e54f601b491631c4d4bf13b0c9bda72a5015cd4a

See more details on using hashes here.

File details

Details for the file fuzzyset2-0.2.0-cp310-cp310-win_amd64.whl.

File metadata

  • Download URL: fuzzyset2-0.2.0-cp310-cp310-win_amd64.whl
  • Upload date:
  • Size: 38.2 kB
  • Tags: CPython 3.10, Windows x86-64
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/3.7.1 importlib_metadata/4.10.1 pkginfo/1.8.2 requests/2.27.1 requests-toolbelt/0.9.1 tqdm/4.62.3 CPython/3.9.10

File hashes

Hashes for fuzzyset2-0.2.0-cp310-cp310-win_amd64.whl
Algorithm Hash digest
SHA256 db8baf4d35230c5ee71846df79f41ae8f5056f7b3ba9f2256d4dfe8b7289eb7d
MD5 d544d4d689488d71a870fafd6a9eddc8
BLAKE2b-256 18f98e51533be53e4c6832fc036cad16b6c10b69b4d719e204381f39ee8b89a0

File details

Details for the file fuzzyset2-0.2.0-cp310-cp310-manylinux_2_5_x86_64.manylinux1_x86_64.manylinux_2_17_x86_64.manylinux2014_x86_64.whl.

File metadata

File hashes

Hashes for fuzzyset2-0.2.0-cp310-cp310-manylinux_2_5_x86_64.manylinux1_x86_64.manylinux_2_17_x86_64.manylinux2014_x86_64.whl
Algorithm Hash digest
SHA256 aadad13ffd00ebf849bc581e33eaed5e86b2f5b683afe09863313851c46517cf
MD5 334eeabe32ce306fe79157c3eb6a2501
BLAKE2b-256 8147f7cd30eb5e382db0cb8e3bb5851409644a2661e3e91079c56c80b5e4d16b

File details

Details for the file fuzzyset2-0.2.0-cp310-cp310-macosx_10_15_x86_64.whl.

File metadata

  • Download URL: fuzzyset2-0.2.0-cp310-cp310-macosx_10_15_x86_64.whl
  • Upload date:
  • Size: 44.9 kB
  • Tags: CPython 3.10, macOS 10.15+ x86-64
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/3.7.1 importlib_metadata/4.10.1 pkginfo/1.8.2 requests/2.27.1 requests-toolbelt/0.9.1 tqdm/4.62.3 CPython/3.9.10

File hashes

Hashes for fuzzyset2-0.2.0-cp310-cp310-macosx_10_15_x86_64.whl
Algorithm Hash digest
SHA256 41b8672eb59184ca21de3ebf658a7429cfe280a7a5c8d4212efc128342c6dad2
MD5 a7e5d747df5a8e3bcf7a68c24812ea65
BLAKE2b-256 c2a37b768e0cc8171d08cf8473a72dc0bda470e29a86e6d1eb0a90e97e888ed1

File details

Details for the file fuzzyset2-0.2.0-cp39-cp39-win_amd64.whl.

File metadata

  • Download URL: fuzzyset2-0.2.0-cp39-cp39-win_amd64.whl
  • Upload date:
  • Size: 38.2 kB
  • Tags: CPython 3.9, Windows x86-64
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/3.7.1 importlib_metadata/4.10.1 pkginfo/1.8.2 requests/2.27.1 requests-toolbelt/0.9.1 tqdm/4.62.3 CPython/3.9.10

File hashes

Hashes for fuzzyset2-0.2.0-cp39-cp39-win_amd64.whl
Algorithm Hash digest
SHA256 4e660d843f51893d885ba47e0b0efe32b90f59338f855ed61db6eab475b91461
MD5 a0cd7d479be67f1f11d6cbd0afc1d6c6
BLAKE2b-256 01f6147109d4e617b7aa010b33f1112f747583149a99f7cd5856f96ccdfb57f4

File details

Details for the file fuzzyset2-0.2.0-cp39-cp39-manylinux_2_5_x86_64.manylinux1_x86_64.manylinux_2_17_x86_64.manylinux2014_x86_64.whl.

File metadata

File hashes

Hashes for fuzzyset2-0.2.0-cp39-cp39-manylinux_2_5_x86_64.manylinux1_x86_64.manylinux_2_17_x86_64.manylinux2014_x86_64.whl
Algorithm Hash digest
SHA256 39324b60e03cc9c6cee859e175bf34e85fa125c931347ccc8d4097d7da0418b0
MD5 527828077f85339a7253fc57672683fb
BLAKE2b-256 62b362e6934abc2e7b9a328669895fa353761047b419e290d26b00c224225172

File details

Details for the file fuzzyset2-0.2.0-cp39-cp39-macosx_10_15_x86_64.whl.

File metadata

  • Download URL: fuzzyset2-0.2.0-cp39-cp39-macosx_10_15_x86_64.whl
  • Upload date:
  • Size: 44.9 kB
  • Tags: CPython 3.9, macOS 10.15+ x86-64
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/3.7.1 importlib_metadata/4.10.1 pkginfo/1.8.2 requests/2.27.1 requests-toolbelt/0.9.1 tqdm/4.62.3 CPython/3.9.10

File hashes

Hashes for fuzzyset2-0.2.0-cp39-cp39-macosx_10_15_x86_64.whl
Algorithm Hash digest
SHA256 68c9e63755cb01bba90440bdd4708336a9e987028f1ddd669e62b0a4c1196809
MD5 e98599be7815b07e2b1ec7490bf1f567
BLAKE2b-256 bec2d73480ece7ac4144c168568749bb8cb995ac2cc0481e4b0c7d21de1ec789

File details

Details for the file fuzzyset2-0.2.0-cp38-cp38-win_amd64.whl.

File metadata

  • Download URL: fuzzyset2-0.2.0-cp38-cp38-win_amd64.whl
  • Upload date:
  • Size: 38.2 kB
  • Tags: CPython 3.8, Windows x86-64
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/3.7.1 importlib_metadata/4.10.1 pkginfo/1.8.2 requests/2.27.1 requests-toolbelt/0.9.1 tqdm/4.62.3 CPython/3.9.10

File hashes

Hashes for fuzzyset2-0.2.0-cp38-cp38-win_amd64.whl
Algorithm Hash digest
SHA256 4d8c890cfab324e78b66693519587e3d49f8f43f57ebc7a9d01ff32e90d74035
MD5 f498a5e68b2ca6a6ba6ee86c0b5300df
BLAKE2b-256 3aacb92137b9cc566874744ffb256f2ae9f47bd5e22384ee9f34872381370f3f

File details

Details for the file fuzzyset2-0.2.0-cp38-cp38-manylinux_2_5_x86_64.manylinux1_x86_64.manylinux_2_17_x86_64.manylinux2014_x86_64.whl.

File metadata

File hashes

Hashes for fuzzyset2-0.2.0-cp38-cp38-manylinux_2_5_x86_64.manylinux1_x86_64.manylinux_2_17_x86_64.manylinux2014_x86_64.whl
Algorithm Hash digest
SHA256 10a57570bd02a5b0da2d044b093766e915b55a462988661cd21084986a493f6d
MD5 055bc9f35d5f075ee2713c78a6d758c5
BLAKE2b-256 8fc4c5b0192a7de4b99e6928b7191d67cdca0b7d39ce1a8ab53c8a8b39817548

File details

Details for the file fuzzyset2-0.2.0-cp38-cp38-macosx_10_14_x86_64.whl.

File metadata

  • Download URL: fuzzyset2-0.2.0-cp38-cp38-macosx_10_14_x86_64.whl
  • Upload date:
  • Size: 43.9 kB
  • Tags: CPython 3.8, macOS 10.14+ x86-64
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/3.7.1 importlib_metadata/4.10.1 pkginfo/1.8.2 requests/2.27.1 requests-toolbelt/0.9.1 tqdm/4.62.3 CPython/3.9.10

File hashes

Hashes for fuzzyset2-0.2.0-cp38-cp38-macosx_10_14_x86_64.whl
Algorithm Hash digest
SHA256 d29ff5e45770fafee1eb3daec1e384dfda365dd22e536c40731e05918d17a1b0
MD5 9ae33d1ef7f1222b5b7724bf78f1b2a0
BLAKE2b-256 0c86e25cd1eb21dca9dc2ab5c71d5aa896e2a44f4fada7081d623861eda13860

File details

Details for the file fuzzyset2-0.2.0-cp37-cp37m-win_amd64.whl.

File metadata

  • Download URL: fuzzyset2-0.2.0-cp37-cp37m-win_amd64.whl
  • Upload date:
  • Size: 37.6 kB
  • Tags: CPython 3.7m, Windows x86-64
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/3.7.1 importlib_metadata/4.10.1 pkginfo/1.8.2 requests/2.27.1 requests-toolbelt/0.9.1 tqdm/4.62.3 CPython/3.9.10

File hashes

Hashes for fuzzyset2-0.2.0-cp37-cp37m-win_amd64.whl
Algorithm Hash digest
SHA256 c1387bacfaaee71825b493b6a0e0f69a968d39c2d2432313807e6e05de1c8825
MD5 3760182b7011a979f138ca479d9a1bf4
BLAKE2b-256 f6fdee106fdd5b81f7ec4445e077db613578309f92c8bdfa02d1c8ea2ed81165

File details

Details for the file fuzzyset2-0.2.0-cp37-cp37m-manylinux_2_5_x86_64.manylinux1_x86_64.manylinux_2_17_x86_64.manylinux2014_x86_64.whl.

File metadata

File hashes

Hashes for fuzzyset2-0.2.0-cp37-cp37m-manylinux_2_5_x86_64.manylinux1_x86_64.manylinux_2_17_x86_64.manylinux2014_x86_64.whl
Algorithm Hash digest
SHA256 49abc521bafeba530d7dc3c79c963e4f1a78384fed67a46e66750017e9c3847d
MD5 fffcbbb31d7a4730ec3c313ecd11cd35
BLAKE2b-256 b2a1b0bff3602b89953887b74ff10a6f7ccaf69d7057bcc2b212f27f13dd5fe7

File details

Details for the file fuzzyset2-0.2.0-cp37-cp37m-macosx_10_14_x86_64.whl.

File metadata

  • Download URL: fuzzyset2-0.2.0-cp37-cp37m-macosx_10_14_x86_64.whl
  • Upload date:
  • Size: 42.9 kB
  • Tags: CPython 3.7m, macOS 10.14+ x86-64
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/3.7.1 importlib_metadata/4.10.1 pkginfo/1.8.2 requests/2.27.1 requests-toolbelt/0.9.1 tqdm/4.62.3 CPython/3.9.10

File hashes

Hashes for fuzzyset2-0.2.0-cp37-cp37m-macosx_10_14_x86_64.whl
Algorithm Hash digest
SHA256 beab5f6c6017cc72b91a4cd455e0a91a7ab3f6a7f1b8c355793018d10d7f816b
MD5 e0c7d5be2af4ce637f6ba6602c3312f3
BLAKE2b-256 641bb2be475f5c2ab8d1565f77aa6aea5ca306fd00af23424adef54c6f0168c2

File details

Details for the file fuzzyset2-0.2.0-cp36-cp36m-win_amd64.whl.

File metadata

  • Download URL: fuzzyset2-0.2.0-cp36-cp36m-win_amd64.whl
  • Upload date:
  • Size: 41.9 kB
  • Tags: CPython 3.6m, Windows x86-64
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/3.7.1 importlib_metadata/4.10.1 pkginfo/1.8.2 requests/2.27.1 requests-toolbelt/0.9.1 tqdm/4.62.3 CPython/3.9.10

File hashes

Hashes for fuzzyset2-0.2.0-cp36-cp36m-win_amd64.whl
Algorithm Hash digest
SHA256 d38458b9d04ece2fba17739572882be965e37fe6ab569968a3ed6d8c02664915
MD5 a8e0ead467538c65fc25aeabc2103e72
BLAKE2b-256 dec437c0c66efe212e1e4b70ff7e930808b61a75a75d38c4f9ad37635b288c15

File details

Details for the file fuzzyset2-0.2.0-cp36-cp36m-manylinux_2_5_x86_64.manylinux1_x86_64.manylinux_2_17_x86_64.manylinux2014_x86_64.whl.

File metadata

File hashes

Hashes for fuzzyset2-0.2.0-cp36-cp36m-manylinux_2_5_x86_64.manylinux1_x86_64.manylinux_2_17_x86_64.manylinux2014_x86_64.whl
Algorithm Hash digest
SHA256 41e6940e655bdc9189eaf6d442e21adbc9681951f1d00d1dbf8fb38a30006ff4
MD5 b3136c3771d7e9037b1feffd322f22cb
BLAKE2b-256 f1287a661124bc01fcb9f0462eefb39f0bc16446f6c6f1be619d8a87b2722196

File details

Details for the file fuzzyset2-0.2.0-cp36-cp36m-macosx_10_14_x86_64.whl.

File metadata

  • Download URL: fuzzyset2-0.2.0-cp36-cp36m-macosx_10_14_x86_64.whl
  • Upload date:
  • Size: 43.1 kB
  • Tags: CPython 3.6m, macOS 10.14+ x86-64
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/3.7.1 importlib_metadata/4.10.1 pkginfo/1.8.2 requests/2.27.1 requests-toolbelt/0.9.1 tqdm/4.62.3 CPython/3.9.10

File hashes

Hashes for fuzzyset2-0.2.0-cp36-cp36m-macosx_10_14_x86_64.whl
Algorithm Hash digest
SHA256 ab6182461ceb868c698de085659b562232d22f746d72cd6da6c23398d554af5e
MD5 a405aa48dc391f85114c7d9655d15383
BLAKE2b-256 75699c2252c6bad3da4457ee38cc3d25cb2cc0e68e347530adc633e0ad0d92d2
