A simple Python fuzzyset implementation.

Project description

fuzzyset is a data structure that performs something akin to full-text search against data to determine likely misspellings and perform approximate string matching.

Usage

The usage is simple. Just add a string to the set, and ask for it later by using either .get or []:

>>> import fuzzyset
>>> a = fuzzyset.FuzzySet()
>>> a.add("michael axiak")
>>> a.get("micael asiak")
[(0.8461538461538461, 'michael axiak')]

The result will be a list of (score, matched_value) tuples. The score is between 0 and 1, with 1 being a perfect match.

For a roughly 15% performance increase, there is also a Cython-implemented version called cfuzzyset, so you can write the following, akin to cStringIO and cPickle:

try:
    from cfuzzyset import cFuzzySet as FuzzySet
except ImportError:
    from fuzzyset import FuzzySet

Construction Arguments

  • iterable: An iterable that yields strings to initialize the data structure with

  • gram_size_lower: The lower bound of gram sizes to use, inclusive (see Theory of operation). Default: 2

  • gram_size_upper: The upper bound of gram sizes to use, inclusive (see Theory of operation). Default: 3

  • use_levenshtein: Whether or not to use the Levenshtein distance to determine the match scoring. Default: True

Theory of operation

Adding to the data structure

First, let’s look at adding a string, ‘michaelich’, to an empty set. We first break the string apart into n-grams (strings of length n), padding each end with ‘-’. The trigrams of ‘michaelich’ look like:

'-mi'
'mic'
'ich'
'cha'
'hae'
'ael'
'eli'
'lic'
'ich'
'ch-'

Note that fuzzyset first normalizes the string by removing non-word characters (except spaces and commas) and forcing everything to lowercase.
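The normalization and gram-splitting steps described above can be sketched as follows. This is a minimal illustration rather than the library's own code: the helper names normalize and iter_grams and the exact regular expression are assumptions, while the '-' padding mirrors the trigram listing above.

```python
import re


def normalize(value):
    """Lowercase and drop everything except word characters, spaces and commas."""
    return re.sub(r"[^\w, ]+", "", value.lower())


def iter_grams(value, gram_size=3):
    """Yield every n-gram of the normalized value, padded with '-' at both ends."""
    padded = "-" + normalize(value) + "-"
    for start in range(len(padded) - gram_size + 1):
        yield padded[start:start + gram_size]


print(list(iter_grams("Michaelich")))
# ['-mi', 'mic', 'ich', 'cha', 'hae', 'ael', 'eli', 'lic', 'ich', 'ch-']
```

Note that 'ich' appears twice in the output, which is why the index described next has to keep a count per gram.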

Next, fuzzyset essentially creates a reverse index on those grams, maintaining a dictionary that says:

'mic' -> (1, 0)
'ich' -> (2, 0)
...

And there’s a list that looks like:

[(3.31, 'michaelich')]

Note that we maintain this reverse index for all grams from gram_size_lower to gram_size_upper in the constructor. This becomes important in a second.
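A rough sketch of how such a reverse index might be built for every gram size is shown below. The names build_index, index and items are hypothetical, the (count, position) tuple order follows the dictionary shown above, and treating the number stored beside each string as the Euclidean norm of its gram counts is an assumption that the text does not spell out.

```python
import math
import re
from collections import Counter, defaultdict


def iter_grams(value, gram_size):
    """Yield the '-'-padded n-grams of the lowercased, cleaned-up value."""
    padded = "-" + re.sub(r"[^\w, ]+", "", value.lower()) + "-"
    for start in range(len(padded) - gram_size + 1):
        yield padded[start:start + gram_size]


def build_index(values, gram_size_lower=2, gram_size_upper=3):
    """Return (index, items), each a dict keyed by gram size."""
    sizes = range(gram_size_lower, gram_size_upper + 1)
    index = {n: defaultdict(list) for n in sizes}
    items = {n: [] for n in sizes}
    for value in values:
        for n in sizes:
            counts = Counter(iter_grams(value, n))
            position = len(items[n])
            # Assumption: store the Euclidean norm of the gram-count vector.
            norm = math.sqrt(sum(c * c for c in counts.values()))
            items[n].append((norm, value))
            for gram, count in counts.items():
                # gram -> (occurrences in this string, position of the string)
                index[n][gram].append((count, position))
    return index, items


index, items = build_index(["michaelich"])
print(index[3]["mic"])  # [(1, 0)]
print(index[3]["ich"])  # [(2, 0)]
```

Keeping one index per gram size costs extra memory, but it lets a lookup switch gram sizes without re-scanning the stored strings.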

Retrieving

To search the data structure, we take the n-grams of the query string and perform a reverse index lookup. To illustrate, let’s consider looking up 'michael' in our fictitious set containing 'michaelich', where the gram_size_upper and gram_size_lower parameters have their default values (3 and 2 respectively).

We begin by considering all trigrams first (the value of gram_size_upper). Those grams are:

'-mi'
'mic'
'ich'
'cha'
'hae'
'ael'
'el-'

Then we create a list of every element in the set that has at least one occurrence of a trigram listed above. Note that this is just 7 dictionary lookups. For each of these matched elements, we compute the cosine similarity between it and the query string. We then sort to get the most similar matched elements.
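The candidate lookup and cosine scoring can be illustrated with the following sketch. It is not the library's implementation: the function names grams and cosine_candidates are hypothetical, normalization is reduced to lowercasing, and the index is rebuilt on every call purely to keep the example self-contained.

```python
import math
from collections import Counter


def grams(value, gram_size=3):
    """Count the '-'-padded n-grams of a lowercased string."""
    padded = "-" + value.lower() + "-"
    return Counter(padded[i:i + gram_size]
                   for i in range(len(padded) - gram_size + 1))


def cosine_candidates(query, stored, gram_size=3):
    """Score every stored string that shares at least one gram with the query."""
    index = {}   # gram -> list of (count, position in `stored`)
    norms = []   # Euclidean norm of each stored string's gram counts
    for position, value in enumerate(stored):
        counts = grams(value, gram_size)
        norms.append(math.sqrt(sum(c * c for c in counts.values())))
        for gram, count in counts.items():
            index.setdefault(gram, []).append((count, position))

    query_counts = grams(query, gram_size)
    query_norm = math.sqrt(sum(c * c for c in query_counts.values()))

    # One dictionary lookup per distinct query gram accumulates the dot products.
    dot = Counter()
    for gram, query_count in query_counts.items():
        for count, position in index.get(gram, ()):
            dot[position] += query_count * count

    scored = [(d / (query_norm * norms[p]), stored[p]) for p, d in dot.items()]
    return sorted(scored, reverse=True)


print(cosine_candidates("michael", ["michaelich", "zebra"]))
```

Here 'zebra' shares no trigram with the query, so it is never scored at all; only 'michaelich' comes back, with a similarity of roughly 0.76.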

If use_levenshtein is false, then we return all top matched elements with the same cosine similarity.

If use_levenshtein is true, then we truncate the possible search space to the top 50 candidates, compute a score based on the Levenshtein distance (so that we handle transpositions), and return results based on that score.
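As an illustration of the Levenshtein step, the sketch below re-scores a candidate list with a plain dynamic-programming edit distance. The names levenshtein_score and rerank are hypothetical, and dividing the distance by the length of the longer string is an assumption; it does reproduce the score of about 0.846 shown in the Usage example (2 edits over 13 characters).

```python
def levenshtein(a, b):
    """Classic edit distance counting insertions, deletions and substitutions."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,                       # delete char_a
                current[j - 1] + 1,                    # insert char_b
                previous[j - 1] + (char_a != char_b),  # substitute or keep
            ))
        previous = current
    return previous[-1]


def levenshtein_score(a, b):
    """Map the distance to 0..1 (assumed scaling: divide by the longer length)."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0
    return 1.0 - levenshtein(a, b) / longest


def rerank(query, candidates, limit=50):
    """Re-score the best `limit` (cosine_score, value) pairs, best first."""
    rescored = [(levenshtein_score(query, value), value)
                for _, value in candidates[:limit]]
    return sorted(rescored, reverse=True)


print(levenshtein("micael asiak", "michael axiak"))  # 2
print(rerank("abc", [(0.9, "abd"), (0.8, "abc")])[0])  # (1.0, 'abc')
```

Capping the candidates before re-scoring keeps the quadratic-time edit distance from dominating the lookup.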

In the event that none of the trigrams matched, we try the whole thing again with bigrams (note though that if there are no matches, the failure to match will be quick). Bigram searching will always be slower because there will be a much larger set to order.
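The fallback from larger to smaller grams can be sketched as a simple loop. The function name get_with_fallback is hypothetical, and for brevity the sketch scans the stored strings directly instead of using the reverse index and omits scoring altogether.

```python
def grams(value, gram_size):
    """Return the set of '-'-padded n-grams of a lowercased string."""
    padded = "-" + value.lower() + "-"
    return {padded[i:i + gram_size]
            for i in range(len(padded) - gram_size + 1)}


def get_with_fallback(query, stored, gram_size_lower=2, gram_size_upper=3):
    """Return (gram_size, matches) for the largest gram size that matches anything."""
    for gram_size in range(gram_size_upper, gram_size_lower - 1, -1):
        query_grams = grams(query, gram_size)
        matches = [value for value in stored
                   if query_grams & grams(value, gram_size)]
        if matches:
            return gram_size, matches
    return None


print(get_with_fallback("michael", ["michaelich"]))  # (3, ['michaelich'])
print(get_with_fallback("mxc", ["michaelich"]))      # (2, ['michaelich'])
print(get_with_fallback("xyz", ["michaelich"]))      # None
```

The second call shares no trigram with 'michaelich' and only succeeds at the bigram level through '-m', which shows why bigram matching casts a wider net.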

Install

pip install fuzzyset2

License

BSD

Author
