fuzzysearch is useful for finding approximate subsequence matches
Fuzzy search: Find parts of long text or data, allowing for some changes/typos.
Easy, fast, and just works!
>>> find_near_matches('PATTERN', '---PATERN---', max_l_dist=1)
[Match(start=3, end=9, dist=1, matched="PATERN")]
Two simple functions to use: one for in-memory data and one for files
Fastest search algorithm is chosen automatically
Levenshtein Distance metric with configurable parameters
Separately configure the max. allowed distance, substitutions, deletions and/or insertions
Advanced algorithms with optional C and Cython optimizations
Properly handles Unicode; special optimizations for binary data
Simple installation:
  pip install fuzzysearch just works
  pure-Python fallbacks for compiled modules
  only one dependency (attrs)
Extensively tested
Free software: MIT license
For more info, see the documentation.
Installation
fuzzysearch supports Python versions 2.7 and 3.5+, as well as PyPy 2.7 and 3.6.
$ pip install fuzzysearch
This will work even if installing the C and Cython extensions fails, using pure-Python fallbacks.
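The fallback behaviour described above follows a common packaging pattern: try to import a compiled module and, if that fails, define an equivalent in pure Python. The following is only a minimal sketch of that pattern. The module name _hypothetical_fast_ext and the helper count_differences are made up for illustration and are not part of fuzzysearch.

```python
# Sketch of an "optional compiled extension" import pattern.
# _hypothetical_fast_ext is a made-up module name, NOT a fuzzysearch module.
try:
    from _hypothetical_fast_ext import count_differences  # compiled version
    USING_COMPILED = True
except ImportError:
    USING_COMPILED = False

    def count_differences(a, b):
        """Pure-Python fallback: count positions where two
        equal-length sequences differ."""
        if len(a) != len(b):
            raise ValueError("sequences must have the same length")
        return sum(1 for x, y in zip(a, b) if x != y)


# Callers use the same name regardless of which version was loaded.
print(USING_COMPILED, count_differences("PATTERN", "PATTERM"))
```

Either way, calling code does not need to know which implementation it received.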
Usage
Just call find_near_matches() with the sub-sequence you’re looking for, the sequence to search, and the matching parameters:
>>> from fuzzysearch import find_near_matches
# search for 'PATTERN' with a maximum Levenshtein Distance of 1
>>> find_near_matches('PATTERN', '---PATERN---', max_l_dist=1)
[Match(start=3, end=9, dist=1, matched="PATERN")]
To search in a file, use find_near_matches_in_file() similarly:
>>> from fuzzysearch import find_near_matches_in_file
>>> with open('data_file', 'rb') as f:
... find_near_matches_in_file(b'PATTERN', f, max_l_dist=1)
[Match(start=3, end=9, dist=1, matched="PATERN")]
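To make the meaning of max_l_dist concrete, here is a brute-force sketch that tries every candidate slice of the sequence and keeps the closest one. It is not how fuzzysearch searches (the library picks optimized algorithms automatically), and the names levenshtein and naive_best_near_match are illustrative only. It uses the standard library alone and is only practical for short inputs.

```python
def levenshtein(a, b):
    """Classic dynamic-programming Levenshtein distance (row by row)."""
    previous = list(range(len(b) + 1))
    for i, x in enumerate(a, start=1):
        current = [i]
        for j, y in enumerate(b, start=1):
            current.append(min(previous[j - 1] + (x != y),  # match / substitute
                               previous[j] + 1,             # skip a char of a
                               current[j - 1] + 1))         # skip a char of b
        previous = current
    return previous[-1]


def naive_best_near_match(subsequence, sequence, max_l_dist):
    """Return (start, end, dist) of the first slice with the smallest
    distance <= max_l_dist, or None if no slice qualifies."""
    best = None
    shortest = max(1, len(subsequence) - max_l_dist)
    longest = len(subsequence) + max_l_dist
    for start in range(len(sequence)):
        for length in range(shortest, longest + 1):
            end = start + length
            if end > len(sequence):
                break
            dist = levenshtein(subsequence, sequence[start:end])
            if dist <= max_l_dist and (best is None or dist < best[2]):
                best = (start, end, dist)
    return best


print(naive_best_near_match('PATTERN', '---PATERN---', max_l_dist=1))
# (3, 9, 1)
```

The slice from 3 to 9 is 'PATERN', which agrees with the start, end and dist values in the examples above. The same function accepts bytes, since it only iterates and compares elements.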
Examples
fuzzysearch is great for ad-hoc searches of genetic data, such as DNA or protein sequences, before reaching for “heavier”, domain-specific tools like BioPython:
>>> sequence = '''\
GACTAGCACTGTAGGGATAACAATTTCACACAGGTGGACAATTACATTGAAAATCACAGATTGGTCACACACACA
TTGGACATACATAGAAACACACACACATACATTAGATACGAACATAGAAACACACATTAGACGCGTACATAGACA
CAAACACATTGACAGGCAGTTCAGATGATGACGCCCGACTGATACTCGCGTAGTCGTGGGAGGCAAGGCACACAG
GGGATAGG'''
>>> subsequence = 'TGCACTGTAGGGATAACAAT' # distance = 1
>>> find_near_matches(subsequence, sequence, max_l_dist=2)
[Match(start=3, end=24, dist=1, matched="TAGCACTGTAGGGATAACAAT")]
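BioPython sequences are also supported. Note that this example uses Bio.Alphabet, which was removed in Biopython 1.78; with newer Biopython versions, create the Seq objects without the alphabet argument: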
>>> from Bio.Seq import Seq
>>> from Bio.Alphabet import IUPAC
>>> sequence = Seq('''\
GACTAGCACTGTAGGGATAACAATTTCACACAGGTGGACAATTACATTGAAAATCACAGATTGGTCACACACACA
TTGGACATACATAGAAACACACACACATACATTAGATACGAACATAGAAACACACATTAGACGCGTACATAGACA
CAAACACATTGACAGGCAGTTCAGATGATGACGCCCGACTGATACTCGCGTAGTCGTGGGAGGCAAGGCACACAG
GGGATAGG''', IUPAC.unambiguous_dna)
>>> subsequence = Seq('TGCACTGTAGGGATAACAAT', IUPAC.unambiguous_dna)
>>> find_near_matches(subsequence, sequence, max_l_dist=2)
[Match(start=3, end=24, dist=1, matched="TAGCACTGTAGGGATAACAAT")]
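To see where the sub-sequence and the reported match differ, the two strings from the plain-string example above can be compared with the standard library's difflib. This is only a quick inspection aid: difflib is not part of fuzzysearch, it does not compute Levenshtein alignments in general, and describe_differences is an illustrative helper name.

```python
import difflib

subsequence = 'TGCACTGTAGGGATAACAAT'
matched = 'TAGCACTGTAGGGATAACAAT'  # the "matched" text reported above


def describe_differences(a, b):
    """List the non-equal opcodes between a and b as
    (operation, text_in_a, text_in_b) tuples."""
    matcher = difflib.SequenceMatcher(None, a, b, autojunk=False)
    return [(tag, a[i1:i2], b[j1:j2])
            for tag, i1, i2, j1, j2 in matcher.get_opcodes()
            if tag != 'equal']


print(describe_differences(subsequence, matched))
# [('insert', '', 'A')]
```

The single difference is one extra 'A' in the sequence, which is consistent with the reported dist=1.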
Matching Criteria
The search function supports four possible match criteria, which may be supplied in any combination:
maximum Levenshtein distance (max_l_dist)
maximum # of substitutions (max_substitutions)
maximum # of deletions (max_deletions; “delete” = skip a character in the sub-sequence)
maximum # of insertions (max_insertions; “insert” = skip a character in the sequence)
Not supplying a criterion means that there is no limit for it. For this reason, one must always supply max_l_dist and/or all other criteria.
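The rule above can be written as a small check. The sketch below is reasoning about the criteria, not fuzzysearch's own validation code, and effective_max_distance is a hypothetical helper name. It relies on one observation: if all three per-operation limits are given, their sum also caps the total number of edits.

```python
def effective_max_distance(max_substitutions=None, max_insertions=None,
                           max_deletions=None, max_l_dist=None):
    """Return an upper bound on the number of edits the criteria allow.

    Raises ValueError when the criteria leave the search unbounded, i.e.
    when max_l_dist is missing and at least one per-operation limit is
    missing too.
    """
    per_operation = (max_substitutions, max_insertions, max_deletions)
    bounds = []
    if max_l_dist is not None:
        bounds.append(max_l_dist)
    if all(limit is not None for limit in per_operation):
        bounds.append(sum(per_operation))
    if not bounds:
        raise ValueError(
            "supply max_l_dist and/or all three per-operation limits")
    return min(bounds)


print(effective_max_distance(max_l_dist=1, max_deletions=0))  # 1
print(effective_max_distance(max_substitutions=0,
                             max_insertions=1, max_deletions=1))  # 2
```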
>>> find_near_matches('PATTERN', '---PATERN---', max_l_dist=1)
[Match(start=3, end=9, dist=1, matched="PATERN")]
# this will not match since max-deletions is set to zero
>>> find_near_matches('PATTERN', '---PATERN---', max_l_dist=1, max_deletions=0)
[]
# note that a deletion + insertion may be combined to match a substitution
>>> find_near_matches('PATTERN', '---PAT-ERN---', max_deletions=1, max_insertions=1, max_substitutions=0)
[Match(start=3, end=10, dist=1, matched="PAT-ERN")] # the Levenshtein distance is still 1
# ... but deletion + insertion may also match other, non-substitution differences
>>> find_near_matches('PATTERN', '---PATERRN---', max_deletions=1, max_insertions=1, max_substitutions=0)
[Match(start=3, end=10, dist=2, matched="PATERRN")]
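To relate the three operation types to a concrete match, the sketch below aligns a sub-sequence with a matched string and counts the operations along one optimal alignment, using the same definitions as above (a deletion skips a character of the sub-sequence, an insertion skips a character of the sequence). It is a textbook dynamic-programming backtrace, not fuzzysearch's implementation, and edit_operations is an illustrative name. When several optimal alignments exist, it prefers substitutions.

```python
def edit_operations(subsequence, matched):
    """Return (distance, substitutions, insertions, deletions) for one
    optimal alignment of subsequence against matched."""
    n, m = len(subsequence), len(matched)
    # dist[i][j] = Levenshtein distance of subsequence[:i] and matched[:j]
    dist = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        dist[i][0] = i
    for j in range(1, m + 1):
        dist[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = 0 if subsequence[i - 1] == matched[j - 1] else 1
            dist[i][j] = min(dist[i - 1][j - 1] + cost,  # match / substitute
                             dist[i - 1][j] + 1,         # deletion
                             dist[i][j - 1] + 1)         # insertion

    # Walk back from the bottom-right corner, counting operations.
    subs = ins = dels = 0
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0:
            cost = 0 if subsequence[i - 1] == matched[j - 1] else 1
            if dist[i][j] == dist[i - 1][j - 1] + cost:
                subs += cost
                i, j = i - 1, j - 1
                continue
        if i > 0 and dist[i][j] == dist[i - 1][j] + 1:
            dels += 1
            i -= 1
        else:
            ins += 1
            j -= 1
    return dist[n][m], subs, ins, dels


print(edit_operations('PATTERN', 'PATERN'))   # (1, 0, 0, 1)
print(edit_operations('PATTERN', 'PAT-ERN'))  # (1, 1, 0, 0)
```

For 'PATERRN' the distance is 2, and it can be reached either with two substitutions or with one deletion plus one insertion. That is why the last search above still finds a match with max_substitutions=0.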
When to Use Other Tools
Use case: Search through a list of strings for almost-exactly matching strings. For example, searching through a list of names for possible slight variations of a certain name.
Suggestion: Consider using fuzzywuzzy.
History
0.7.3 (2020-06-27)
Fixed segmentation faults due to wrong handling of inputs in bytes-like-only functions in C extensions.
0.7.2 (2020-05-07)
Added PyPy support.
Several minor bug fixes.
0.7.1 (2020-04-05)
Dropped support for Python 3.4.
Removed deprecation warning with Python 3.8.
Fixed a couple of nasty bugs.
0.7.0 (2020-01-14)
Added matched attribute to Match objects containing the matched part of the sequence.
Added support for CPython 3.8. Now supporting CPython 2.7 and 3.4-3.8.
0.6.2 (2019-04-22)
Fix calling search_exact() without passing end_index.
Fix edge case: max. dist >= sub-sequence length.
0.6.1 (2018-12-08)
Fixed some C compiler warnings for the C and Cython modules
0.6.0 (2018-12-07)
Dropped support for Python versions 2.6, 3.2 and 3.3
Added support and testing for Python 3.7
Optimized the n-grams Levenshtein search for long sub-sequences
Further optimized the n-grams Levenshtein search
Cython versions of the optimized parts of the n-grams Levenshtein search
0.5.0 (2017-09-05)
Fixed search_exact_byteslike() to support supplying start and end indexes
Added support for lists, tuples and other Sequence types to search_exact()
Fixed a bug where find_near_matches() could return a wrong Match.end with max_l_dist=0
Added more tests and improved some existing ones.
0.4.0 (2017-07-06)
Added support and testing for Python 3.5 and 3.6
Many small improvements to README, setup.py and CI testing
0.3.0 (2015-02-12)
Added C extensions for several search functions as well as internal functions
Use C extensions if available, or pure-Python implementations otherwise
setup.py attempts to build C extensions, but installs without if build fails
Added --noexts setup.py option to avoid trying to build the C extensions
Greatly improved testing and coverage
0.2.2 (2014-03-27)
Added support for searching through BioPython Seq objects
Added specialized search function allowing only substitutions and insertions
Fixed several bugs
0.2.1 (2014-03-14)
Fixed major match grouping bug
0.2.0 (2014-03-13)
New utility function find_near_matches() for easier use
Additional documentation
0.1.0 (2013-11-12)
Two working implementations
Extensive test suite; all tests passing
Full support for Python 2.6-2.7 and 3.1-3.3
Bumped status from Pre-Alpha to Alpha
0.0.1 (2013-11-01)
First release on PyPI.