
A simple Python fuzzyset implementation.

Project description

Note

This is a maintained fork of the fuzzyset package by Mike Axiak, which is unfortunately no longer maintained. This fork is available on PyPI as fuzzyset2.

fuzzyset is a data structure that performs something akin to full-text search over a set of strings. It supports approximate string matching, such as finding the likely intended spelling of a misspelled entry.

Usage

The usage is simple. Just add a string to the set, and ask for it later by using either .get or []:

>>> import fuzzyset
>>> a = fuzzyset.FuzzySet()
>>> a.add("michael axiak")
>>> a.get("micael asiak")
[(0.8461538461538461, 'michael axiak')]

The result will be a list of (score, matched_value) tuples. The score is between 0 and 1, with 1 being a perfect match.

For a roughly 15% performance increase, there is also a Cython implementation called cfuzzyset. You can use it when it is available and fall back to the pure-Python version otherwise, akin to cStringIO and cPickle:

try:
    from cfuzzyset import cFuzzySet as FuzzySet
except ImportError:
    from fuzzyset import FuzzySet

Construction Arguments

  • iterable: An iterable that yields strings to initialize the data structure with

  • gram_size_lower: The lower bound of gram sizes to use, inclusive (see Theory of operation). Default: 2

  • gram_size_upper: The upper bound of gram sizes to use, inclusive (see Theory of operation). Default: 3

  • use_levenshtein: Whether to use the Levenshtein distance to compute the match score. Default: True

Theory of operation

Adding to the data structure

First let’s look at adding a string, ‘michaelich’, to an empty set. We first break the string apart into n-grams (substrings of length n), padding each end with ‘-’. The trigrams of ‘michaelich’ look like:

'-mi'
'mic'
'ich'
'cha'
'hae'
'ael'
'eli'
'lic'
'ich'
'ch-'

Note that fuzzyset first normalizes the string by removing non-word characters (except for spaces and commas) and forcing everything to lowercase.
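The gram-splitting step above can be sketched in a few lines of plain Python. This is only an illustration, not the library's code: the name iterate_grams and the exact regular expression are assumptions chosen to match the description (lowercase, drop non-word characters except spaces and commas, pad with '-').

```python
import re

# Assumption: "non-word characters" means anything outside \w, with
# spaces and commas kept; the library's exact pattern may differ.
NON_WORD = re.compile(r"[^\w, ]+")

def iterate_grams(value, gram_size=3):
    """Return the n-grams of a normalized, '-'-padded string, in order."""
    simplified = "-" + NON_WORD.sub("", value.lower()) + "-"
    # Pad very short strings so that at least one gram is produced.
    if len(simplified) < gram_size:
        simplified += "-" * (gram_size - len(simplified))
    return [simplified[i:i + gram_size]
            for i in range(len(simplified) - gram_size + 1)]

print(iterate_grams("michaelich"))
print(iterate_grams("Michael!", gram_size=2))
```

The first call prints the ten trigrams listed above; the second shows that case and punctuation are dropped before the bigrams are taken.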

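Next, fuzzyset essentially creates a reverse index on those grams. It maintains a dictionary that maps each gram to the number of times it occurs in the string and the position of that string in the list shown below (note that 'ich' occurs twice):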

'mic' -> (1, 0)
'ich' -> (2, 0)
...

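And there’s a list that stores each string together with the magnitude of its gram-count vector (here sqrt(12), because 'ich' is counted twice):

[(3.46, 'michaelich')]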

Note that we maintain this reverse index for every gram size from gram_size_lower to gram_size_upper, as given to the constructor. This becomes important in a second.
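A minimal sketch of this bookkeeping, assuming the (count, position) tuple layout used in the illustration above. The names grams_of, add, index and items are hypothetical, normalization is reduced to lowercasing, and the real package may store its entries differently.

```python
import math
from collections import Counter, defaultdict

def grams_of(value, size):
    """N-grams of the lowercased, '-'-padded string (normalization omitted)."""
    padded = "-" + value.lower() + "-"
    return [padded[i:i + size] for i in range(len(padded) - size + 1)]

def add(index, items, value, lower=2, upper=3):
    """Index `value` under every gram size from `lower` to `upper`, inclusive."""
    for size in range(lower, upper + 1):
        counts = Counter(grams_of(value, size))
        # Magnitude of the gram-count vector, used later for cosine similarity.
        norm = math.sqrt(sum(c * c for c in counts.values()))
        position = len(items[size])
        items[size].append((norm, value.lower()))
        for gram, count in counts.items():
            index[gram].append((count, position))

index = defaultdict(list)   # gram -> [(count, position), ...]
items = defaultdict(list)   # gram size -> [(norm, string), ...]
add(index, items, "michaelich")
print(index["mic"])  # [(1, 0)]
print(index["ich"])  # [(2, 0)]
```

Grams of different sizes can share one dictionary because a bigram key can never collide with a trigram key.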

Retrieving

To search the data structure, we take the n-grams of the query string and perform a reverse index lookup. To illustrate, let’s consider looking up 'michael' in our example set containing 'michaelich', where gram_size_upper and gram_size_lower have their default values (3 and 2 respectively).

We begin by considering all trigrams first (the value of gram_size_upper). Those grams are:

'-mi'
'mic'
'ich'
'cha'
'hae'
'ael'
'el-'

Then we create a list of every element in the set that contains at least one of the trigrams listed above. Note that this is just 7 dictionary lookups, one per trigram. For each of these matched elements, we compute the cosine similarity between the gram counts of that element and those of the query string. We then sort to get the most similar matched elements.
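The scoring step can be illustrated as follows. This is a sketch under simplifying assumptions, not the package's implementation: cosine_matches and gram_counts are hypothetical names, and for brevity the function scans every stored string instead of going through the reverse index.

```python
import math
from collections import Counter

def gram_counts(value, size=3):
    """Count the n-grams of the lowercased, '-'-padded string."""
    padded = "-" + value.lower() + "-"
    return Counter(padded[i:i + size] for i in range(len(padded) - size + 1))

def cosine_matches(query, stored, size=3):
    """Rank stored strings that share at least one gram with the query."""
    q = gram_counts(query, size)
    q_norm = math.sqrt(sum(c * c for c in q.values()))
    results = []
    for value in stored:
        s = gram_counts(value, size)
        # Dot product over shared grams; a Counter returns 0 for missing grams.
        dot = sum(count * s[gram] for gram, count in q.items())
        if dot == 0:
            continue  # no gram in common, so not a candidate
        s_norm = math.sqrt(sum(c * c for c in s.values()))
        results.append((dot / (q_norm * s_norm), value))
    return sorted(results, reverse=True)

print(cosine_matches("michael", ["michaelich", "zzz"]))
```

Only 'michaelich' is returned, with a similarity of roughly 0.76; 'zzz' shares no trigram with the query and is never scored.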

If use_levenshtein is false, we return every matched element that ties for the highest cosine similarity.

If use_levenshtein is true, we keep only the 50 best candidates by cosine similarity, compute a score for each based on the Levenshtein distance (so that we handle transpositions), and return based on that.
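A sketch of the Levenshtein rescoring step follows. The normalization used here (1 minus the distance divided by the length of the longer string) is an assumption rather than something stated above, although it does reproduce the 0.846 score from the Usage example. The helper names and the 0.6 cosine value in the call are hypothetical.

```python
def levenshtein(a, b):
    """Classic dynamic-programming edit distance (insert, delete, substitute)."""
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,               # deletion
                current[j - 1] + 1,            # insertion
                previous[j - 1] + (ca != cb),  # substitution or match
            ))
        previous = current
    return previous[-1]

def levenshtein_score(query, candidate):
    # Assumption: similarity = 1 - distance / length of the longer string.
    longest = max(len(query), len(candidate))
    if longest == 0:
        return 1.0
    return 1 - levenshtein(query, candidate) / longest

def rescore(query, cosine_ranked, limit=50):
    """Keep the top `limit` cosine matches and re-rank them by edit distance."""
    top = cosine_ranked[:limit]
    return sorted(((levenshtein_score(query, value), value) for _, value in top),
                  reverse=True)

# The cosine score 0.6 is a placeholder; only the ordering of the input matters.
print(rescore("micael asiak", [(0.6, "michael axiak")]))
```

Two edits (insert 'h', replace 's' with 'x') over 13 characters give a score of about 0.846.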

If none of the trigrams match, we try the whole thing again with bigrams (the failed trigram attempt costs little, since a search with no matches fails quickly). Bigram searching will always be slower because there will be a much larger set of candidates to order.
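The fallback from larger to smaller gram sizes might look like the sketch below. The function names and the (size, matches) return value are hypothetical, chosen only to show which gram size produced the candidates; scoring is left out.

```python
from collections import Counter

def gram_counts(value, size):
    """Count the n-grams of the lowercased, '-'-padded string."""
    padded = "-" + value.lower() + "-"
    return Counter(padded[i:i + size] for i in range(len(padded) - size + 1))

def candidates(query, stored, size):
    """Stored strings sharing at least one gram of this size with the query."""
    q = gram_counts(query, size)
    return [v for v in stored if any(g in gram_counts(v, size) for g in q)]

def lookup(query, stored, lower=2, upper=3):
    """Try the largest gram size first; fall back to smaller sizes on a miss."""
    for size in range(upper, lower - 1, -1):
        found = candidates(query, stored, size)
        if found:
            return size, found
    return None, []

print(lookup("michael", ["michaelich"]))  # (3, ['michaelich'])
print(lookup("el", ["michaelich"]))       # (2, ['michaelich'])
print(lookup("xyz", ["michaelich"]))      # (None, [])
```

The query 'el' shares no trigram with 'michaelich' (its trigrams are '-el' and 'el-'), so it is only found on the bigram pass.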

Install

pip install fuzzyset2

Afterwards, you can import the package simply with:

try:
    from cfuzzyset import cFuzzySet as FuzzySet
except ImportError:
    from fuzzyset import FuzzySet

License

BSD
