fuzzysearch is useful for finding approximate subsequence matches
Fuzzy search: Find parts of long text or data, allowing for some changes/typos.
Easy, fast, and just works!
>>> from fuzzysearch import find_near_matches
>>> find_near_matches('PATTERN', '---PATERN---', max_l_dist=1)
[Match(start=3, end=9, dist=1, matched="PATERN")]
- Two simple functions to use: one for in-memory data and one for files
- Fastest search algorithm is chosen automatically
- Levenshtein Distance metric with configurable parameters
- Separately configure the max. allowed distance, substitutions, deletions and/or insertions
- Advanced algorithms with optional C and Cython optimizations
- Properly handles Unicode; special optimizations for binary data
- Simple installation:
  - pip install fuzzysearch just works
  - pure-Python fallbacks for compiled modules
  - only one dependency (attrs)
- Extensively tested
- Free software: MIT license
For more info, see the documentation.
Installation
fuzzysearch supports Python versions 2.7 and 3.5+, as well as PyPy 2.7 and 3.6.
$ pip install fuzzysearch
This will work even if installing the C and Cython extensions fails, using pure-Python fallbacks.
Usage
Just call find_near_matches() with the sub-sequence you’re looking for, the sequence to search, and the matching parameters:
>>> from fuzzysearch import find_near_matches
# search for 'PATTERN' with a maximum Levenshtein Distance of 1
>>> find_near_matches('PATTERN', '---PATERN---', max_l_dist=1)
[Match(start=3, end=9, dist=1, matched="PATERN")]
To search in a file, use find_near_matches_in_file() similarly:
>>> from fuzzysearch import find_near_matches_in_file
>>> with open('data_file', 'rb') as f:
...     find_near_matches_in_file(b'PATTERN', f, max_l_dist=1)
[Match(start=3, end=9, dist=1, matched="PATERN")]
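To make the max_l_dist parameter concrete: the Levenshtein distance between two sequences is the smallest number of single-item substitutions, insertions and deletions that turns one into the other. The following is a minimal pure-Python sketch of that metric, assuming nothing beyond the standard library. It is not how fuzzysearch computes matches, and the helper name levenshtein is made up for this illustration.

```python
def levenshtein(a, b):
    """Return the Levenshtein distance between sequences a and b.

    Illustrative helper only; not part of fuzzysearch.
    """
    # previous[j] = distance between the processed prefix of a and b[:j]
    previous = list(range(len(b) + 1))
    for i, item_a in enumerate(a, start=1):
        current = [i]  # turning a[:i] into an empty sequence takes i deletions
        for j, item_b in enumerate(b, start=1):
            cost = 0 if item_a == item_b else 1
            current.append(min(
                previous[j] + 1,         # drop item_a
                current[j - 1] + 1,      # add item_b
                previous[j - 1] + cost,  # keep or substitute
            ))
        previous = current
    return previous[-1]


print(levenshtein('PATTERN', 'PATERN'))  # 1
```

This agrees with dist=1 in the Match shown above: 'PATERN' is 'PATTERN' with one 'T' missing.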
Examples
fuzzysearch is great for ad-hoc searches of genetic data, such as DNA or protein sequences, before reaching for “heavier”, domain-specific tools like BioPython:
>>> sequence = '''\
GACTAGCACTGTAGGGATAACAATTTCACACAGGTGGACAATTACATTGAAAATCACAGATTGGTCACACACACA
TTGGACATACATAGAAACACACACACATACATTAGATACGAACATAGAAACACACATTAGACGCGTACATAGACA
CAAACACATTGACAGGCAGTTCAGATGATGACGCCCGACTGATACTCGCGTAGTCGTGGGAGGCAAGGCACACAG
GGGATAGG'''
>>> subsequence = 'TGCACTGTAGGGATAACAAT' # distance = 1
>>> find_near_matches(subsequence, sequence, max_l_dist=2)
[Match(start=3, end=24, dist=1, matched="TAGCACTGTAGGGATAACAAT")]
BioPython sequences are also supported:
>>> from Bio.Seq import Seq
# note: Bio.Alphabet was removed in Biopython 1.78; this example needs an older Biopython
>>> from Bio.Alphabet import IUPAC
>>> sequence = Seq('''\
GACTAGCACTGTAGGGATAACAATTTCACACAGGTGGACAATTACATTGAAAATCACAGATTGGTCACACACACA
TTGGACATACATAGAAACACACACACATACATTAGATACGAACATAGAAACACACATTAGACGCGTACATAGACA
CAAACACATTGACAGGCAGTTCAGATGATGACGCCCGACTGATACTCGCGTAGTCGTGGGAGGCAAGGCACACAG
GGGATAGG''', IUPAC.unambiguous_dna)
>>> subsequence = Seq('TGCACTGTAGGGATAACAAT', IUPAC.unambiguous_dna)
>>> find_near_matches(subsequence, sequence, max_l_dist=2)
[Match(start=3, end=24, dist=1, matched="TAGCACTGTAGGGATAACAAT")]
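For intuition about what such a search computes, here is a textbook dynamic-programming sketch (often attributed to Sellers) that finds the closest approximate occurrence of a sub-sequence anywhere in a longer sequence. It is an assumption-laden illustration rather than fuzzysearch's algorithm: the library selects among faster strategies, and the function name best_match_end is hypothetical. It reports only the distance and the end index of the best match, not the start.

```python
def best_match_end(subsequence, sequence):
    """Return (distance, end) of the closest approximate occurrence.

    Illustrative helper only; not part of fuzzysearch.
    """
    m = len(subsequence)
    # column[i] = fewest edits needed to align subsequence[:i] with some
    # stretch of `sequence` that ends at the current position
    column = list(range(m + 1))
    best = (m, 0)  # matching nothing at all costs m deletions
    for end, item in enumerate(sequence, start=1):
        new_column = [0]  # free start: a match may begin at any position
        for i in range(1, m + 1):
            cost = 0 if subsequence[i - 1] == item else 1
            new_column.append(min(
                column[i - 1] + cost,   # match or substitution
                column[i] + 1,          # skip an item of the sequence
                new_column[i - 1] + 1,  # skip an item of the sub-sequence
            ))
        column = new_column
        if column[m] < best[0]:
            best = (column[m], end)
    return best


print(best_match_end('PATTERN', '---PATERN---'))  # (1, 9)
```

The cost is proportional to len(subsequence) * len(sequence), which is fine for short ad-hoc inputs but is one reason to prefer an optimized library for long sequences.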
Matching Criteria
The search function supports four possible match criteria, which may be supplied in any combination:
- maximum Levenshtein distance (max_l_dist)
- maximum number of substitutions (max_substitutions)
- maximum number of deletions (max_deletions); a “deletion” skips a character in the sub-sequence
- maximum number of insertions (max_insertions); an “insertion” skips a character in the sequence
Not supplying a criterion means that there is no limit for it. For this reason, one must always supply max_l_dist and/or all other criteria.
>>> find_near_matches('PATTERN', '---PATERN---', max_l_dist=1)
[Match(start=3, end=9, dist=1, matched="PATERN")]
# this will not match since max-deletions is set to zero
>>> find_near_matches('PATTERN', '---PATERN---', max_l_dist=1, max_deletions=0)
[]
# note that a deletion + insertion may be combined to match a substitution
>>> find_near_matches('PATTERN', '---PAT-ERN---', max_deletions=1, max_insertions=1, max_substitutions=0)
[Match(start=3, end=10, dist=1, matched="PAT-ERN")] # the Levenshtein distance is still 1
# ... but deletion + insertion may also match other, non-substitution differences
>>> find_near_matches('PATTERN', '---PATERRN---', max_deletions=1, max_insertions=1, max_substitutions=0)
[Match(start=3, end=10, dist=2, matched="PATERRN")]
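The way the separate limits interact can be sketched with a small brute-force checker. This is a hypothetical helper (matches_within is not a fuzzysearch function) that compares two whole strings instead of searching inside a longer one, and treats None as "no limit". Raising ValueError when the criteria are under-specified is a choice made for this sketch; it mirrors the rule above but is not a statement about the library's behavior.

```python
from functools import lru_cache


def matches_within(subsequence, candidate, max_substitutions=None,
                   max_insertions=None, max_deletions=None, max_l_dist=None):
    """Check whether `candidate` equals `subsequence` within the limits.

    Illustrative helper only; not part of fuzzysearch.
    """
    limits = (max_substitutions, max_insertions, max_deletions)
    if max_l_dist is None and any(limit is None for limit in limits):
        raise ValueError('supply max_l_dist and/or all three other criteria')
    unlimited = float('inf')
    max_subs, max_ins, max_dels = (
        unlimited if limit is None else limit for limit in limits)
    max_total = unlimited if max_l_dist is None else max_l_dist

    @lru_cache(maxsize=None)
    def search(i, j, subs, ins, dels):
        # i and j are the current positions in subsequence and candidate
        if (subs > max_subs or ins > max_ins or dels > max_dels
                or subs + ins + dels > max_total):
            return False
        if i == len(subsequence) and j == len(candidate):
            return True
        if i < len(subsequence) and j < len(candidate):
            if subsequence[i] == candidate[j]:
                if search(i + 1, j + 1, subs, ins, dels):
                    return True
            elif search(i + 1, j + 1, subs + 1, ins, dels):  # substitution
                return True
        # insertion: skip a character in the candidate
        if j < len(candidate) and search(i, j + 1, subs, ins + 1, dels):
            return True
        # deletion: skip a character in the sub-sequence
        if i < len(subsequence) and search(i + 1, j, subs, ins, dels + 1):
            return True
        return False

    return search(0, 0, 0, 0, 0)


print(matches_within('PATTERN', 'PATERN', max_l_dist=1))                   # True
print(matches_within('PATTERN', 'PATERN', max_l_dist=1, max_deletions=0))  # False
```

With max_substitutions=0, max_insertions=1 and max_deletions=1, the same checker accepts both 'PAT-ERN' and 'PATERRN', matching the two results shown above.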
When to Use Other Tools
Use case: Search through a list of strings for almost-exactly matching strings. For example, searching through a list of names for possible slight variations of a certain name.
Suggestion: Consider using fuzzywuzzy.
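As a rough illustration of that use case, the sketch below compares one name against each entry in a list of whole strings. It uses difflib from the Python standard library as a stand-in so that it runs without extra dependencies; it is not fuzzywuzzy's API, and the helper name, the sample names and the cutoff of 0.8 are all assumptions made for this example.

```python
import difflib


def close_names(name, candidates, cutoff=0.8):
    """Return up to three candidates that look like slight variations of name.

    Illustrative helper only; compares whole strings against each other.
    """
    return difflib.get_close_matches(name, candidates, n=3, cutoff=cutoff)


names = ['Jonathan Smith', 'John Smith', 'Jon Smyth', 'Jane Doe']
print(close_names('John Smith', names))
```

Note the difference from find_near_matches(): here each candidate is scored as a whole, whereas fuzzysearch looks for a sub-sequence inside one long sequence.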
History
0.7.3 (2020-06-27)
Fixed segmentation faults due to wrong handling of inputs in bytes-like-only functions in C extensions.
0.7.2 (2020-05-07)
Added PyPy support.
Several minor bug fixes.
0.7.1 (2020-04-05)
Dropped support for Python 3.4.
Removed deprecation warning with Python 3.8.
Fixed a couple of nasty bugs.
0.7.0 (2020-01-14)
Added matched attribute to Match objects containing the matched part of the sequence.
Added support for CPython 3.8. Now supporting CPython 2.7 and 3.4-3.8.
0.6.2 (2019-04-22)
Fix calling search_exact() without passing end_index.
Fix edge case: max. dist >= sub-sequence length.
0.6.1 (2018-12-08)
Fixed some C compiler warnings for the C and Cython modules
0.6.0 (2018-12-07)
Dropped support for Python versions 2.6, 3.2 and 3.3
Added support and testing for Python 3.7
Optimized the n-grams Levenshtein search for long sub-sequences
Further optimized the n-grams Levenshtein search
Cython versions of the optimized parts of the n-grams Levenshtein search
0.5.0 (2017-09-05)
Fixed search_exact_byteslike() to support supplying start and end indexes
Added support for lists, tuples and other Sequence types to search_exact()
Fixed a bug where find_near_matches() could return a wrong Match.end with max_l_dist=0
Added more tests and improved some existing ones.
0.4.0 (2017-07-06)
Added support and testing for Python 3.5 and 3.6
Many small improvements to README, setup.py and CI testing
0.3.0 (2015-02-12)
Added C extensions for several search functions as well as internal functions
Use C extensions if available, or pure-Python implementations otherwise
setup.py attempts to build C extensions, but installs without if build fails
Added --noexts setup.py option to avoid trying to build the C extensions
Greatly improved testing and coverage
0.2.2 (2014-03-27)
Added support for searching through BioPython Seq objects
Added specialized search function allowing only substitutions and insertions
Fixed several bugs
0.2.1 (2014-03-14)
Fixed major match grouping bug
0.2.0 (2013-03-13)
New utility function find_near_matches() for easier use
Additional documentation
0.1.0 (2013-11-12)
Two working implementations
Extensive test suite; all tests passing
Full support for Python 2.6-2.7 and 3.1-3.3
Bumped status from Pre-Alpha to Alpha
0.0.1 (2013-11-01)
First release on PyPI.