fuzzysearch is useful for finding approximate subsequence matches
Project description
fuzzysearch is a Python library for fuzzy substring searches. It implements efficient ad-hoc searching for approximate sub-sequences. Matching is done using a generalized Levenshtein Distance metric, with configurable parameters.
Free software: MIT license
Documentation: http://fuzzysearch.rtfd.org.
Installation
Just install using pip:
$ pip install fuzzysearch
Features
Fuzzy sub-sequence search: Find parts of a sequence which match a given sub-sequence.
Easy to use: A single function to call which returns a list of matches.
Set a maximum Levenshtein Distance for matches, including individual limits for the number of substitutions, insertions and/or deletions allowed for near-matches.
Includes optimized implementations for specific use-cases, e.g. allowing only substitutions.
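To make the idea of a fuzzy sub-sequence search concrete, here is a minimal pure-Python sketch of the textbook dynamic-programming approach: it computes the smallest Levenshtein distance between a pattern and any substring of a text. It is an illustration only and is not claimed to be how fuzzysearch is implemented; the function name min_substring_distance is hypothetical and not part of the library.

```python
def min_substring_distance(pattern, text):
    """Smallest Levenshtein distance between `pattern` and any substring of `text`.

    Illustrative sketch only; `min_substring_distance` is a hypothetical
    helper, not a fuzzysearch function.
    """
    # prev[i] = best distance between pattern[:i] and a substring of the
    # text ending at the current position. Before reading any text, the
    # only option is to drop all i pattern elements.
    prev = list(range(len(pattern) + 1))
    best = prev[-1]
    for ch in text:
        # An empty pattern prefix matches an empty substring at cost 0,
        # which is what lets a match start anywhere in the text.
        cur = [0]
        for i, p in enumerate(pattern, start=1):
            cur.append(min(
                prev[i - 1] + (p != ch),  # match (cost 0) or substitution
                prev[i] + 1,              # extra element in the text
                cur[i - 1] + 1,           # pattern element skipped
            ))
        prev = cur
        best = min(best, prev[-1])
    return best


print(min_substring_distance('PATTERN', '---PATERN---'))  # 1
```

A search with a maximum distance then amounts to asking whether this value is within the limit, plus keeping track of where the matches start and end.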
Simple Examples
Just call find_near_matches() with the sequence to search, the sub-sequence you’re looking for, and the matching parameters:
>>> from fuzzysearch import find_near_matches
# search for 'PATTERN' with a maximum Levenshtein Distance of 1
>>> find_near_matches('PATTERN', '---PATERN---', max_l_dist=1)
[Match(start=3, end=9, dist=1)]
>>> sequence = '''\
GACTAGCACTGTAGGGATAACAATTTCACACAGGTGGACAATTACATTGAAAATCACAGATTGGTCACACACACA
TTGGACATACATAGAAACACACACACATACATTAGATACGAACATAGAAACACACATTAGACGCGTACATAGACA
CAAACACATTGACAGGCAGTTCAGATGATGACGCCCGACTGATACTCGCGTAGTCGTGGGAGGCAAGGCACACAG
GGGATAGG'''
>>> subsequence = 'TGCACTGTAGGGATAACAAT' # distance = 1
>>> find_near_matches(subsequence, sequence, max_l_dist=2)
[Match(start=3, end=24, dist=1)]
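Each result carries start, end and dist fields, as shown in the output above, so the matched text can be recovered by slicing the searched sequence. The sketch below assumes only those three fields. To stay runnable without the library installed, it uses a stand-in namedtuple called Match, and the helper matched_substrings is hypothetical; with fuzzysearch installed, the list returned by find_near_matches() would be passed in instead.

```python
from collections import namedtuple

# Stand-in with the same three fields as the results shown above.
# This is an assumption for illustration, not the library's own class.
Match = namedtuple('Match', ['start', 'end', 'dist'])


def matched_substrings(sequence, matches):
    """Return (matched_text, dist) pairs, closest matches first.

    Hypothetical helper; relies only on the start/end/dist fields.
    """
    ordered = sorted(matches, key=lambda m: (m.dist, m.start))
    return [(sequence[m.start:m.end], m.dist) for m in ordered]


text = '---PATERN---'
matches = [Match(start=3, end=9, dist=1)]  # the result from the first example
print(matched_substrings(text, matches))   # [('PATERN', 1)]
```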
Advanced Search Criteria
The search function supports four possible match criteria, which may be supplied in any combination:
maximum Levenshtein distance
maximum number of substitutions
maximum number of deletions (elements appearing in the pattern searched for, which are skipped in the matching sub-sequence)
maximum number of insertions (elements added in the matching sub-sequence which don’t appear in the pattern searched for)
Not supplying a criterion means that there is no limit for it. For this reason, one must always supply either max_l_dist or all three of the other criteria (or both), so that the total number of differences is bounded.
>>> find_near_matches('PATTERN', '---PATERN---', max_l_dist=1)
[Match(start=3, end=9, dist=1)]
# this will not match since max_deletions is set to zero
>>> find_near_matches('PATTERN', '---PATERN---', max_l_dist=1, max_deletions=0)
[]
# note that a deletion + insertion may be combined to match a substitution
>>> find_near_matches('PATTERN', '---PAT-ERN---', max_deletions=1, max_insertions=1, max_substitutions=0)
[Match(start=3, end=10, dist=1)] # the Levenshtein distance is still 1
# ... but deletion + insertion may also match other, non-substitution differences
>>> find_near_matches('PATTERN', '---PATERRN---', max_deletions=1, max_insertions=1, max_substitutions=0)
[Match(start=3, end=10, dist=2)]
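The requirement above, that max_l_dist and/or all of the other criteria be supplied, can be read as: the total number of differences must end up bounded. The sketch below works out that bound from the four criteria. The function effective_max_l_dist is a hypothetical helper written for this explanation, not part of fuzzysearch, and the library's own argument checking may differ.

```python
def effective_max_l_dist(max_substitutions=None, max_insertions=None,
                         max_deletions=None, max_l_dist=None):
    """Upper bound on the total distance implied by the given criteria.

    Hypothetical helper for illustration; None means "no limit".
    """
    per_operation = (max_substitutions, max_insertions, max_deletions)
    if None not in per_operation:
        # All three per-operation limits given: their sum bounds the total.
        total = sum(per_operation)
        return total if max_l_dist is None else min(total, max_l_dist)
    if max_l_dist is None:
        # At least one operation is unlimited and there is no overall cap.
        raise ValueError('supply max_l_dist and/or all three other criteria')
    return max_l_dist


# Same limits as in the last two examples above: at most one deletion and
# one insertion, so a match can be at distance 2 at most.
print(effective_max_l_dist(max_substitutions=0, max_insertions=1,
                           max_deletions=1))  # 2
```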
History
0.3.0 (2015-02-12)
Added C extensions for several search functions as well as internal functions
Use C extensions if available, or pure-Python implementations otherwise
setup.py attempts to build C extensions, but installs without if build fails
Added --noexts setup.py option to avoid trying to build the C extensions
Greatly improved testing and coverage
0.2.2 (2014-03-27)
Added support for searching through BioPython Seq objects
Added specialized search function allowing only substitutions and insertions
Fixed several bugs
0.2.1 (2014-03-14)
Fixed major match grouping bug
0.2.0 (2014-03-13)
New utility function find_near_matches() for easier use
Additional documentation
0.1.0 (2013-11-12)
Two working implementations
Extensive test suite; all tests passing
Full support for Python 2.6-2.7 and 3.1-3.3
Bumped status from Pre-Alpha to Alpha
0.0.1 (2013-11-01)
First release on PyPI.