A simple Python fuzzyset implementation.

Project description

Note

This is a maintained fork of the fuzzyset package by Mike Axiak, which is unfortunately no longer maintained. This fork is available on PyPI as fuzzyset2.

fuzzyset is a data structure that performs something akin to full-text search against data to find likely misspellings and approximate string matches.

Usage

The usage is simple. Just add a string to the set, and ask for it later by using either .get or []:

>>> import fuzzyset
>>> a = fuzzyset.FuzzySet()
>>> a.add("michael axiak")
>>> a.get("micael asiak")
[(0.8461538461538461, 'michael axiak')]

The result will be a list of (score, matched_value) tuples. The score is between 0 and 1, with 1 being a perfect match.

For a roughly 15% performance increase, there is also a Cython-implemented version called cfuzzyset. You can fall back to the pure-Python version when it is unavailable, in the same way as with cStringIO and cPickle:

try:
    from cfuzzyset import cFuzzySet as FuzzySet
except ImportError:
    from fuzzyset import FuzzySet

Construction Arguments

  • iterable: An iterable that yields strings to initialize the data structure with

  • gram_size_lower: The lower bound of gram sizes to use, inclusive (see Theory of operation). Default: 2

  • gram_size_upper: The upper bound of gram sizes to use, inclusive (see Theory of operation). Default: 3

  • use_levenshtein: Whether or not to use the Levenshtein distance to determine the match scoring. Default: True

Theory of operation

Adding to the data structure

First let’s look at adding a string, ‘michaelich’, to an empty set. We first break the string apart into n-grams (strings of length n), padding it with ‘-’ at both ends. The trigrams of ‘michaelich’ are:

'-mi'
'mic'
'ich'
'cha'
'hae'
'ael'
'eli'
'lic'
'ich'
'ch-'

Note that fuzzyset first normalizes the string by removing non-word characters (except for spaces and commas) and forcing everything to lowercase.
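The gram-splitting step can be sketched in a few lines of plain Python. This is a minimal illustration, not the package's own code: the function name iterate_grams and the exact regular expression are assumptions chosen to match the description above (lowercase, drop non-word characters except spaces and commas, pad with '-').

```python
import re

# Assumption: "non-word characters" is approximated by this pattern, which
# keeps letters, digits, underscores, spaces, and commas.
_NON_WORD = re.compile(r"[^\w, ]+")


def iterate_grams(value, gram_size=3):
    """Yield the '-'-padded n-grams of a normalized string."""
    simplified = "-" + _NON_WORD.sub("", value.lower()) + "-"
    # Pad very short strings so that at least one gram is produced.
    if len(simplified) < gram_size:
        simplified += "-" * (gram_size - len(simplified))
    for i in range(len(simplified) - gram_size + 1):
        yield simplified[i:i + gram_size]


trigrams = list(iterate_grams("Michaelich!"))
print(trigrams)
```

The call normalizes 'Michaelich!' to 'michaelich' before splitting, so it prints the same ten trigrams listed above.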

Next, fuzzyset essentially creates a reverse index on those grams, maintaining a dictionary that says:

'mic' -> (1, 0)
'ich' -> (2, 0)
...

And there’s a list that looks like:

[(3.31, 'michaelich')]

Note that we maintain this reverse index for every gram size from gram_size_lower to gram_size_upper, as given to the constructor. This becomes important in a second.
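As a rough sketch of the gram dictionary, the snippet below builds a reverse index for every gram size in the configured range. The names grams and build_index are hypothetical, the (count, item_index) tuple layout is read off the example above rather than taken from the package, and normalization is reduced to lowercasing to keep the sketch short.

```python
from collections import Counter, defaultdict


def grams(value, n):
    """Return the n-grams of value, padded with '-' at both ends."""
    padded = "-" + value.lower() + "-"
    return [padded[i:i + n] for i in range(len(padded) - n + 1)]


def build_index(strings, gram_size_lower=2, gram_size_upper=3):
    match_dict = defaultdict(list)  # gram -> [(count, item_index), ...]
    items = []                      # item_index -> stored string
    for value in strings:
        item_index = len(items)
        items.append(value.lower())
        # Index every gram size from the lower to the upper bound, inclusive.
        for n in range(gram_size_lower, gram_size_upper + 1):
            for gram, count in Counter(grams(value, n)).items():
                match_dict[gram].append((count, item_index))
    return match_dict, items


match_dict, items = build_index(["michaelich"])
print(match_dict["mic"])  # [(1, 0)]
print(match_dict["ich"])  # [(2, 0)]
```

Because bigrams and trigrams have different lengths, they can share one dictionary here without colliding.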

Retrieving

To search the data structure, we take the n-grams of the query string and perform a reverse-index lookup. To illustrate, let’s consider looking up 'michael' in our fictitious set containing 'michaelich', where the gram_size_upper and gram_size_lower parameters have their default values (3 and 2 respectively).

We begin by considering all trigrams first (the value of gram_size_upper). Those grams are:

'-mi'
'mic'
'ich'
'cha'
'hae'
'ael'
'el-'

Then we create a list of every element in the set that has at least one occurrence of a trigram listed above. Note that this is just 7 dictionary lookups. For each of these matched elements, we compute the cosine similarity between the element and the query string. We then sort to get the most similar matched elements.
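The lookup and cosine-similarity step might look roughly like the sketch below. It is an illustration under assumptions, not the package's implementation: the helper names (gram_counts, build, cosine_matches) are hypothetical, only one gram size is indexed, and each stored string is kept next to the Euclidean norm of its gram-count vector so that the similarity needs only the dot product at query time.

```python
import math
from collections import Counter, defaultdict


def gram_counts(value, n=3):
    """Count the '-'-padded n-grams of a lowercased string."""
    padded = "-" + value.lower() + "-"
    return Counter(padded[i:i + n] for i in range(len(padded) - n + 1))


def norm(counts):
    """Euclidean norm of a gram-count vector."""
    return math.sqrt(sum(c * c for c in counts.values()))


def build(strings, n=3):
    index = defaultdict(list)  # gram -> [(count, item_index), ...]
    items = []                 # item_index -> (norm, string)
    for value in strings:
        counts = gram_counts(value, n)
        for gram, count in counts.items():
            index[gram].append((count, len(items)))
        items.append((norm(counts), value))
    return index, items


def cosine_matches(query, index, items, n=3):
    query_counts = gram_counts(query, n)
    dot = defaultdict(int)  # item_index -> dot product with the query
    for gram, q_count in query_counts.items():
        # One dictionary lookup per distinct query gram.
        for count, item_index in index.get(gram, ()):
            dot[item_index] += q_count * count
    q_norm = norm(query_counts)
    scored = [(d / (q_norm * items[i][0]), items[i][1]) for i, d in dot.items()]
    return sorted(scored, reverse=True)


index, items = build(["michaelich", "zebra"])
result = cosine_matches("michael", index, items)
print(result)
```

Only 'michaelich' shares a trigram with the query, so 'zebra' never enters the candidate list and no similarity is computed for it.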

If use_levenshtein is False, then we return all top matched elements with the same cosine similarity.

If use_levenshtein is True, then we truncate the candidate list to the 50 best matches, compute a score based on the Levenshtein distance (so that we handle transpositions), and return results ranked by that score.
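A minimal sketch of the rescoring step follows. The edit-distance routine is a textbook dynamic-programming version, and normalizing by the longer string's length is an assumption. It does reproduce the score from the Usage example: 'micael asiak' is two edits away from the 13-character 'michael axiak', and 1 - 2/13 is about 0.846. The names levenshtein, edit_score, and rescore are hypothetical.

```python
def levenshtein(a, b):
    """Classic edit distance (insertions, deletions, substitutions)."""
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,                  # deletion
                current[j - 1] + 1,               # insertion
                previous[j - 1] + (ch_a != ch_b)  # substitution or match
            ))
        previous = current
    return previous[-1]


def edit_score(a, b):
    """Map edit distance to a 0..1 score, 1 being a perfect match."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0
    return 1 - levenshtein(a, b) / longest


def rescore(query, cosine_ranked, limit=50):
    """Rescore the best `limit` cosine candidates by edit distance."""
    shortlist = cosine_ranked[:limit]
    rescored = [(edit_score(query, value), value) for _, value in shortlist]
    return sorted(rescored, reverse=True)


score = edit_score("micael asiak", "michael axiak")
print(round(score, 3))  # 0.846
```

The rescore function expects the (score, value) list already sorted by cosine similarity, so slicing it keeps only the strongest candidates before the more expensive edit-distance pass.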

In the event that none of the trigrams match, we try the whole thing again with bigrams (note, though, that if there are no matches at all, the failure to match will be quick). Bigram searching will always be slower because there will be a much larger set to order.
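The fallback from larger to smaller grams can be sketched as a simple loop. This is an illustration only: the search function is a toy that scans every stored string instead of using the reverse index, and the names and the None return value for "no match" are assumptions of this sketch.

```python
def grams(value, n):
    """Return the set of '-'-padded n-grams of a lowercased string."""
    padded = "-" + value.lower() + "-"
    return {padded[i:i + n] for i in range(len(padded) - n + 1)}


def search(query, stored, n):
    """Toy search: every stored string sharing at least one n-gram."""
    wanted = grams(query, n)
    return [value for value in stored if wanted & grams(value, n)]


def get_with_fallback(query, stored, gram_size_lower=2, gram_size_upper=3):
    # Try the largest gram size first, then work down to the smallest.
    for n in range(gram_size_upper, gram_size_lower - 1, -1):
        matches = search(query, stored, n)
        if matches:
            return n, matches
    return None


stored = ["michaelich"]
print(get_with_fallback("michael", stored))  # (3, ['michaelich'])
print(get_with_fallback("chx", stored))      # (2, ['michaelich'])
print(get_with_fallback("zzz", stored))      # None
```

The query 'chx' shares no trigram with 'michaelich' but does share the bigram 'ch', so it is only found on the second pass.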

Install

pip install fuzzyset2

Afterwards, you can import the package simply with:

try:
    from cfuzzyset import cFuzzySet as FuzzySet
except ImportError:
    from fuzzyset import FuzzySet

License

BSD
