fuzzysearch is useful for finding approximate subsequence matches
Fuzzy search: Find parts of long text or data, allowing for some changes/typos.
Easy, fast, and just works!
>>> find_near_matches('PATTERN', '---PATERN---', max_l_dist=1)
[Match(start=3, end=9, dist=1, matched="PATERN")]
Two simple functions to use: one for in-memory data and one for files
Fastest search algorithm is chosen automatically
Levenshtein Distance metric with configurable parameters
Separately configure the max. allowed distance, substitutions, deletions and/or insertions
Advanced algorithms with optional C and Cython optimizations
Properly handles Unicode; special optimizations for binary data
Simple installation:
    pip install fuzzysearch just works
    pure-Python fallbacks for compiled modules
    only one dependency (attrs)
Extensively tested
Free software: MIT license
For more info, see the documentation.
Installation
fuzzysearch supports Python versions 2.7 and 3.5+.
$ pip install fuzzysearch
This will work even if installing the C and Cython extensions fails, using pure-Python fallbacks.
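A fallback of this kind is commonly built around a guarded import. The following is only a minimal sketch of that general pattern, not fuzzysearch's actual code: the module name _hypothetical_c_ext and the function count_differences are made up for illustration.

```python
def _py_count_differences(a, b):
    """Pure-Python fallback: count positions where two equal-length sequences differ."""
    return sum(1 for x, y in zip(a, b) if x != y)


try:
    # Hypothetical compiled module; the name is an assumption for illustration only.
    from _hypothetical_c_ext import count_differences
except ImportError:
    # The extension is unavailable (e.g. it failed to build), so expose the
    # slower pure-Python version under the same name.
    count_differences = _py_count_differences

print(count_differences('PATTERN', 'PATTERM'))
```

Callers only ever use count_differences, so they do not need to know which implementation was picked.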
Usage
Just call find_near_matches() with the sub-sequence you’re looking for, the sequence to search, and the matching parameters:
>>> from fuzzysearch import find_near_matches
# search for 'PATTERN' with a maximum Levenshtein Distance of 1
>>> find_near_matches('PATTERN', '---PATERN---', max_l_dist=1)
[Match(start=3, end=9, dist=1, matched="PATERN")]
To search in a file, use find_near_matches_in_file() similarly:
>>> from fuzzysearch import find_near_matches_in_file
>>> with open('data_file', 'rb') as f:
...     find_near_matches_in_file(b'PATTERN', f, max_l_dist=1)
[Match(start=3, end=9, dist=1, matched="PATERN")]
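To give a feel for what such a search computes, here is a small pure-Python sketch of a classic dynamic-programming approach (often attributed to Sellers). It is not necessarily what fuzzysearch does internally, and the function name near_match_ends is hypothetical. It reports only the end index and distance of each near match, not the start index or the matched text.

```python
def near_match_ends(subsequence, sequence, max_l_dist):
    """Return (end, dist) pairs where some part of `sequence` ending at index
    `end` (exclusive) is within `max_l_dist` edits of `subsequence`."""
    m = len(subsequence)
    # prev[i] = smallest distance between subsequence[:i] and any part of the
    # sequence ending at the previous position.
    prev = list(range(m + 1))
    results = []
    for end, char in enumerate(sequence, start=1):
        cur = [0]  # a match may start anywhere, so the empty prefix costs 0
        for i in range(1, m + 1):
            cost = 0 if subsequence[i - 1] == char else 1
            cur.append(min(
                prev[i - 1] + cost,  # exact match or substitution
                prev[i] + 1,         # insertion: skip a character in the sequence
                cur[i - 1] + 1,      # deletion: skip a character in the sub-sequence
            ))
        if cur[m] <= max_l_dist:
            results.append((end, cur[m]))
        prev = cur
    return results


print(near_match_ends('PATTERN', '---PATERN---', max_l_dist=1))  # [(9, 1)]
```

The end index and distance agree with the Match(start=3, end=9, dist=1, ...) result shown above.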
Examples
fuzzysearch is great for ad-hoc searches of genetic data, such as DNA or protein sequences, before reaching for “heavier”, domain-specific tools like BioPython:
>>> sequence = '''\
GACTAGCACTGTAGGGATAACAATTTCACACAGGTGGACAATTACATTGAAAATCACAGATTGGTCACACACACA
TTGGACATACATAGAAACACACACACATACATTAGATACGAACATAGAAACACACATTAGACGCGTACATAGACA
CAAACACATTGACAGGCAGTTCAGATGATGACGCCCGACTGATACTCGCGTAGTCGTGGGAGGCAAGGCACACAG
GGGATAGG'''
>>> subsequence = 'TGCACTGTAGGGATAACAAT' # distance = 1
>>> find_near_matches(subsequence, sequence, max_l_dist=2)
[Match(start=3, end=24, dist=1, matched="TAGCACTGTAGGGATAACAAT")]
BioPython sequences are also supported:
>>> from Bio.Seq import Seq
>>> from Bio.Alphabet import IUPAC  # needs Biopython < 1.78; Bio.Alphabet was removed in 1.78
>>> sequence = Seq('''\
GACTAGCACTGTAGGGATAACAATTTCACACAGGTGGACAATTACATTGAAAATCACAGATTGGTCACACACACA
TTGGACATACATAGAAACACACACACATACATTAGATACGAACATAGAAACACACATTAGACGCGTACATAGACA
CAAACACATTGACAGGCAGTTCAGATGATGACGCCCGACTGATACTCGCGTAGTCGTGGGAGGCAAGGCACACAG
GGGATAGG''', IUPAC.unambiguous_dna)
>>> subsequence = Seq('TGCACTGTAGGGATAACAAT', IUPAC.unambiguous_dna)
>>> find_near_matches(subsequence, sequence, max_l_dist=2)
[Match(start=3, end=24, dist=1, matched="TAGCACTGTAGGGATAACAAT")]
Matching Criteria
The search function supports four possible match criteria, which may be supplied in any combination:
maximum Levenshtein distance (max_l_dist)
maximum number of substitutions (max_substitutions)
maximum number of deletions (max_deletions; “delete” = skip a character in the sub-sequence)
maximum number of insertions (max_insertions; “insert” = skip a character in the sequence)
Not supplying a criterion means that there is no limit for it. For this reason, one must always supply max_l_dist, all three of the other criteria, or both; otherwise the total number of allowed changes would be unbounded.
>>> find_near_matches('PATTERN', '---PATERN---', max_l_dist=1)
[Match(start=3, end=9, dist=1, matched="PATERN")]
# this will not match since max_deletions is set to zero
>>> find_near_matches('PATTERN', '---PATERN---', max_l_dist=1, max_deletions=0)
[]
# note that a deletion + insertion may be combined to match a substitution
>>> find_near_matches('PATTERN', '---PAT-ERN---', max_deletions=1, max_insertions=1, max_substitutions=0)
[Match(start=3, end=10, dist=1, matched="PAT-ERN")] # the Levenshtein distance is still 1
# ... but deletion + insertion may also match other, non-substitution differences
>>> find_near_matches('PATTERN', '---PATERRN---', max_deletions=1, max_insertions=1, max_substitutions=0)
[Match(start=3, end=10, dist=2, matched="PATERRN")]
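The way the separate limits interact can be checked by hand with a small pure-Python sketch. It does not search; it only tests whether one candidate string can be reached from the sub-sequence within the given limits. The function name within_limits and its parameters are hypothetical, and this is not fuzzysearch's implementation.

```python
from functools import lru_cache


def within_limits(subsequence, candidate, max_subs, max_ins, max_dels):
    """True if `candidate` can be aligned to `subsequence` without exceeding
    the given numbers of substitutions, insertions and deletions."""

    @lru_cache(maxsize=None)
    def ok(i, j, subs, ins, dels):
        if i == len(subsequence) and j == len(candidate):
            return True
        if i < len(subsequence) and j < len(candidate):
            if subsequence[i] == candidate[j]:
                # Characters agree: advance both at no cost.
                if ok(i + 1, j + 1, subs, ins, dels):
                    return True
            elif subs < max_subs and ok(i + 1, j + 1, subs + 1, ins, dels):
                return True
        # Insertion: skip a character in the sequence (the candidate).
        if j < len(candidate) and ins < max_ins and ok(i, j + 1, subs, ins + 1, dels):
            return True
        # Deletion: skip a character in the sub-sequence.
        if i < len(subsequence) and dels < max_dels and ok(i + 1, j, subs, ins, dels + 1):
            return True
        return False

    return ok(0, 0, 0, 0, 0)


# One deletion is enough to turn 'PATTERN' into 'PATERN' ...
print(within_limits('PATTERN', 'PATERN', max_subs=0, max_ins=0, max_dels=1))   # True
# ... but with no deletions allowed, nothing else can make it shorter.
print(within_limits('PATTERN', 'PATERN', max_subs=1, max_ins=1, max_dels=0))   # False
# A deletion plus an insertion stands in for the forbidden substitution.
print(within_limits('PATTERN', 'PAT-ERN', max_subs=0, max_ins=1, max_dels=1))  # True
```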
When to Use Other Tools
Use case: Search through a list of strings for almost-exactly matching strings. For example, searching through a list of names for possible slight variations of a certain name.
Suggestion: Consider using fuzzywuzzy.
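For a quick first attempt at this kind of whole-string comparison without adding a dependency, the standard library's difflib.get_close_matches() may already be enough. A minimal sketch follows; the list of names and the cutoff value are made-up illustration values, and difflib's similarity ratio is a different measure from Levenshtein distance.

```python
import difflib

# Made-up sample data for illustration.
names = ['Jonathon', 'Nathan', 'Maria']

# Keep at most 3 candidates whose similarity ratio to the query is >= 0.8.
close = difflib.get_close_matches('Jonathan', names, n=3, cutoff=0.8)
print(close)  # ['Jonathon']
```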
History
0.7.1 (2020-04-05)
Dropped support for Python 3.4.
Removed deprecation warning with Python 3.8.
Fixed a couple of nasty bugs.
0.7.0 (2020-01-14)
Added matched attribute to Match objects containing the matched part of the sequence.
Added support for CPython 3.8. Now supporting CPython 2.7 and 3.4-3.8.
0.6.2 (2019-04-22)
Fix calling search_exact() without passing end_index.
Fix edge case: max. dist >= sub-sequence length.
0.6.1 (2018-12-08)
Fixed some C compiler warnings for the C and Cython modules
0.6.0 (2018-12-07)
Dropped support for Python versions 2.6, 3.2 and 3.3
Added support and testing for Python 3.7
Optimized the n-grams Levenshtein search for long sub-sequences
Further optimized the n-grams Levenshtein search
Cython versions of the optimized parts of the n-grams Levenshtein search
0.5.0 (2017-09-05)
Fixed search_exact_byteslike() to support supplying start and end indexes
Added support for lists, tuples and other Sequence types to search_exact()
Fixed a bug where find_near_matches() could return a wrong Match.end with max_l_dist=0
Added more tests and improved some existing ones.
0.4.0 (2017-07-06)
Added support and testing for Python 3.5 and 3.6
Many small improvements to README, setup.py and CI testing
0.3.0 (2015-02-12)
Added C extensions for several search functions as well as internal functions
Use C extensions if available, or pure-Python implementations otherwise
setup.py attempts to build C extensions, but installs without if build fails
Added --noexts setup.py option to avoid trying to build the C extensions
Greatly improved testing and coverage
0.2.2 (2014-03-27)
Added support for searching through BioPython Seq objects
Added specialized search function allowing only substitutions and insertions
Fixed several bugs
0.2.1 (2014-03-14)
Fixed major match grouping bug
0.2.0 (2014-03-13)
New utility function find_near_matches() for easier use
Additional documentation
0.1.0 (2013-11-12)
Two working implementations
Extensive test suite; all tests passing
Full support for Python 2.6-2.7 and 3.1-3.3
Bumped status from Pre-Alpha to Alpha
0.0.1 (2013-11-01)
First release on PyPI.