fuzzysearch is useful for finding approximate subsequence matches
Project description
Fuzzy search: Find parts of long text or data, allowing for some changes/typos.
Easy, fast, and just works!
>>> find_near_matches('PATTERN', '---PATERN---', max_l_dist=1)
[Match(start=3, end=9, dist=1, matched="PATERN")]
Two simple functions to use: one for in-memory data and one for files
Fastest search algorithm is chosen automatically
Levenshtein Distance metric with configurable parameters
Separately configure the max. allowed distance, substitutions, deletions and/or insertions
Advanced algorithms with optional C and Cython optimizations
Properly handles Unicode; special optimizations for binary data
Simple installation:
pip install fuzzysearch just works
pure-Python fallbacks for compiled modules
only one dependency (attrs)
Extensively tested
Free software: MIT license
For more info, see the documentation.
Installation
fuzzysearch supports Python versions 2.7 and 3.5+, as well as PyPy 2.7 and 3.6.
$ pip install fuzzysearch
This will work even if installing the C and Cython extensions fails, using pure-Python fallbacks.
Usage
Just call find_near_matches() with the sub-sequence you’re looking for, the sequence to search, and the matching parameters:
>>> from fuzzysearch import find_near_matches
# search for 'PATTERN' with a maximum Levenshtein Distance of 1
>>> find_near_matches('PATTERN', '---PATERN---', max_l_dist=1)
[Match(start=3, end=9, dist=1, matched="PATERN")]
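To make the dist=1 in the result above concrete, here is a minimal pure-Python sketch of what a near match is: a part of the sequence whose Levenshtein distance from the sub-sequence is within the limit. This is a brute-force illustration only, not the algorithm fuzzysearch uses; the names levenshtein and naive_near_matches are hypothetical and not part of the library.

```python
def levenshtein(a, b):
    """Classic dynamic-programming Levenshtein distance between a and b."""
    prev = list(range(len(b) + 1))  # distances from the empty prefix of a
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(
                prev[j] + 1,         # drop ca (a character of a with no partner in b)
                curr[j - 1] + 1,     # drop cb (a character of b with no partner in a)
                prev[j - 1] + cost,  # match or substitute
            ))
        prev = curr
    return prev[-1]


def naive_near_matches(subsequence, sequence, max_l_dist):
    """Return (start, end, dist) for every slice of sequence within max_l_dist.

    Hypothetical helper: it tries every candidate slice, so it is far slower
    than a real search and does not merge overlapping matches.
    """
    results = []
    n = len(subsequence)
    min_len = max(1, n - max_l_dist)
    max_len = n + max_l_dist
    for start in range(len(sequence)):
        for length in range(min_len, max_len + 1):
            end = start + length
            if end > len(sequence):
                break
            dist = levenshtein(subsequence, sequence[start:end])
            if dist <= max_l_dist:
                results.append((start, end, dist))
    return results


print(levenshtein('PATTERN', 'PATERN'))                   # 1
print(naive_near_matches('PATTERN', '---PATERN---', 1))   # [(3, 9, 1)]
```

The single (3, 9, 1) tuple corresponds to the start, end and dist fields of the Match shown above.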
To search in a file, use find_near_matches_in_file() similarly:
>>> from fuzzysearch import find_near_matches_in_file
>>> with open('data_file', 'rb') as f:
... find_near_matches_in_file(b'PATTERN', f, max_l_dist=1)
[Match(start=3, end=9, dist=1, matched="PATERN")]
Examples
fuzzysearch is great for ad-hoc searches of genetic data, such as DNA or protein sequences, before reaching for “heavier”, domain-specific tools like BioPython:
>>> sequence = '''\
GACTAGCACTGTAGGGATAACAATTTCACACAGGTGGACAATTACATTGAAAATCACAGATTGGTCACACACACA
TTGGACATACATAGAAACACACACACATACATTAGATACGAACATAGAAACACACATTAGACGCGTACATAGACA
CAAACACATTGACAGGCAGTTCAGATGATGACGCCCGACTGATACTCGCGTAGTCGTGGGAGGCAAGGCACACAG
GGGATAGG'''
>>> subsequence = 'TGCACTGTAGGGATAACAAT' # distance = 1
>>> find_near_matches(subsequence, sequence, max_l_dist=2)
[Match(start=3, end=24, dist=1, matched="TAGCACTGTAGGGATAACAAT")]
BioPython sequences are also supported:
>>> from Bio.Seq import Seq
>>> from Bio.Alphabet import IUPAC  # requires Biopython < 1.78; Bio.Alphabet was removed in 1.78
>>> sequence = Seq('''\
GACTAGCACTGTAGGGATAACAATTTCACACAGGTGGACAATTACATTGAAAATCACAGATTGGTCACACACACA
TTGGACATACATAGAAACACACACACATACATTAGATACGAACATAGAAACACACATTAGACGCGTACATAGACA
CAAACACATTGACAGGCAGTTCAGATGATGACGCCCGACTGATACTCGCGTAGTCGTGGGAGGCAAGGCACACAG
GGGATAGG''', IUPAC.unambiguous_dna)
>>> subsequence = Seq('TGCACTGTAGGGATAACAAT', IUPAC.unambiguous_dna)
>>> find_near_matches(subsequence, sequence, max_l_dist=2)
[Match(start=3, end=24, dist=1, matched="TAGCACTGTAGGGATAACAAT")]
Matching Criteria
The search function supports four possible match criteria, which may be supplied in any combination:
maximum Levenshtein distance (max_l_dist)
maximum # of substitutions (max_substitutions)
maximum # of deletions (max_deletions; “delete” = skip a character in the sub-sequence)
maximum # of insertions (max_insertions; “insert” = skip a character in the sequence)
Not supplying a criterion means that there is no limit for it. For this reason, one must always supply max_l_dist and/or all other criteria.
>>> find_near_matches('PATTERN', '---PATERN---', max_l_dist=1)
[Match(start=3, end=9, dist=1, matched="PATERN")]
# this will not match since max-deletions is set to zero
>>> find_near_matches('PATTERN', '---PATERN---', max_l_dist=1, max_deletions=0)
[]
# note that a deletion + insertion may be combined to match a substitution
>>> find_near_matches('PATTERN', '---PAT-ERN---', max_deletions=1, max_insertions=1, max_substitutions=0)
[Match(start=3, end=10, dist=1, matched="PAT-ERN")] # the Levenshtein distance is still 1
# ... but deletion + insertion may also match other, non-substitution differences
>>> find_near_matches('PATTERN', '---PATERRN---', max_deletions=1, max_insertions=1, max_substitutions=0)
[Match(start=3, end=10, dist=2, matched="PATERRN")]
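The interplay between the separate limits can be sketched with a small recursive check. The code below is an illustration under stated assumptions, not fuzzysearch's implementation: can_match is a hypothetical helper that tests whether one whole candidate string can be aligned with the sub-sequence within the given budgets (it does not search inside a longer sequence). It uses the definitions given above: a deletion skips a character in the sub-sequence, an insertion skips a character in the candidate.

```python
from functools import lru_cache


def can_match(subsequence, candidate, max_subs, max_ins, max_dels):
    """True if candidate aligns with subsequence within the three budgets."""

    @lru_cache(maxsize=None)
    def go(i, j, subs, ins, dels):
        # i indexes subsequence, j indexes candidate; subs/ins/dels are the
        # remaining budgets for each kind of edit.
        if i == len(subsequence) and j == len(candidate):
            return True
        if i < len(subsequence) and j < len(candidate):
            if subsequence[i] == candidate[j]:
                if go(i + 1, j + 1, subs, ins, dels):      # exact character match
                    return True
            elif subs > 0 and go(i + 1, j + 1, subs - 1, ins, dels):  # substitution
                return True
        if i < len(subsequence) and dels > 0:
            if go(i + 1, j, subs, ins, dels - 1):          # deletion: skip in sub-sequence
                return True
        if j < len(candidate) and ins > 0:
            if go(i, j + 1, subs, ins - 1, dels):          # insertion: skip in candidate
                return True
        return False

    return go(0, 0, max_subs, max_ins, max_dels)


# 'PATERN' is one character short, so it needs a deletion:
print(can_match('PATTERN', 'PATERN', max_subs=0, max_ins=0, max_dels=1))   # True
print(can_match('PATTERN', 'PATERN', max_subs=1, max_ins=1, max_dels=0))   # False
# With substitutions forbidden, one deletion plus one insertion still covers these:
print(can_match('PATTERN', 'PAT-ERN', max_subs=0, max_ins=1, max_dels=1))  # True
print(can_match('PATTERN', 'PATERRN', max_subs=0, max_ins=1, max_dels=1))  # True
```

The second call mirrors the max_deletions=0 example above: no combination of substitutions and insertions can make up for the missing character.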
When to Use Other Tools
Use case: Search through a list of strings for almost-exactly matching strings. For example, searching through a list of names for possible slight variations of a certain name.
Suggestion: Consider using fuzzywuzzy.
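To show the shape of this use case without extra dependencies, here is a small sketch using difflib from the Python standard library. It is not fuzzywuzzy's API and uses a different similarity measure; the list of names and the cutoff value are made up for the example.

```python
from difflib import get_close_matches

# Hypothetical data: a list of whole strings, not one long text to search in.
names = ['Jonathon', 'Johnathan', 'Nathan', 'Maria']

# Compare the query against each complete string and keep the closest ones.
close = get_close_matches('Jonathan', names, n=3, cutoff=0.8)
print(close)
```

Here each list entry is compared as a whole, whereas find_near_matches() looks for the sub-sequence anywhere inside a longer sequence.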
History
0.7.1 (2020-04-05)
Dropped support for Python 3.4.
Removed deprecation warning with Python 3.8.
Fixed a couple of nasty bugs.
0.7.0 (2020-01-14)
Added matched attribute to Match objects containing the matched part of the sequence.
Added support for CPython 3.8. Now supporting CPython 2.7 and 3.4-3.8.
0.6.2 (2019-04-22)
Fixed calling search_exact() without passing end_index.
Fixed edge case: max. dist >= sub-sequence length.
0.6.1 (2018-12-08)
Fixed some C compiler warnings for the C and Cython modules
0.6.0 (2018-12-07)
Dropped support for Python versions 2.6, 3.2 and 3.3
Added support and testing for Python 3.7
Optimized the n-grams Levenshtein search for long sub-sequences
Further optimized the n-grams Levenshtein search
Added Cython versions of the optimized parts of the n-grams Levenshtein search
0.5.0 (2017-09-05)
Fixed search_exact_byteslike() to support supplying start and end indexes
Added support for lists, tuples and other Sequence types to search_exact()
Fixed a bug where find_near_matches() could return a wrong Match.end with max_l_dist=0
Added more tests and improved some existing ones.
0.4.0 (2017-07-06)
Added support and testing for Python 3.5 and 3.6
Many small improvements to README, setup.py and CI testing
0.3.0 (2015-02-12)
Added C extensions for several search functions as well as internal functions
Use C extensions if available, or pure-Python implementations otherwise
setup.py attempts to build C extensions, but installs without if build fails
Added --noexts setup.py option to avoid trying to build the C extensions
Greatly improved testing and coverage
0.2.2 (2014-03-27)
Added support for searching through BioPython Seq objects
Added specialized search function allowing only substitutions and insertions
Fixed several bugs
0.2.1 (2014-03-14)
Fixed major match grouping bug
0.2.0 (2014-03-13)
New utility function find_near_matches() for easier use
Additional documentation
0.1.0 (2013-11-12)
Two working implementations
Extensive test suite; all tests passing
Full support for Python 2.6-2.7 and 3.1-3.3
Bumped status from Pre-Alpha to Alpha
0.0.1 (2013-11-01)
First release on PyPI.