fuzzysearch is useful for finding approximate subsequence matches
Project description
fuzzysearch is a Python library for fuzzy substring searches. It implements efficient ad-hoc searching for approximate sub-sequences. Matching is done using a generalized Levenshtein Distance metric, with configurable parameters.
Free software: MIT license
Documentation: http://fuzzysearch.rtfd.org.
Installation
Just install using pip:
$ pip install fuzzysearch
Features
Fuzzy sub-sequence search: Find parts of a sequence which match a given sub-sequence.
Easy to use: A single function to call which returns a list of matches.
Set a maximum Levenshtein Distance for matches, including individual limits for the number of substitutions, insertions and/or deletions allowed for near-matches.
Includes optimized implementations for specific use-cases, e.g. allowing only substitutions.
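As background for the distance metric mentioned above, the following is a minimal sketch of the plain (non-generalized) Levenshtein distance between two whole sequences. The function name levenshtein is hypothetical and is not part of fuzzysearch; the library's own implementations are more elaborate and search inside a longer sequence rather than comparing two complete ones.

```python
def levenshtein(a, b):
    """Return the Levenshtein distance between sequences a and b.

    Illustrative helper only; not part of the fuzzysearch API.
    """
    # prev[j] is the distance between the prefix of a handled so far and b[:j]
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, start=1):
        cur = [i]  # turning a[:i] into an empty sequence costs i removals
        for j, y in enumerate(b, start=1):
            cur.append(min(
                prev[j] + 1,             # remove x
                cur[j - 1] + 1,          # add y
                prev[j - 1] + (x != y),  # substitute x with y (free if equal)
            ))
        prev = cur
    return prev[-1]


print(levenshtein('PATTERN', 'PATERN'))   # prints 1 (one 'T' removed)
print(levenshtein('PATTERN', 'PAT-ERN'))  # prints 1 (one 'T' substituted)
```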
Simple Examples
Just call find_near_matches() with the sequence to search, the sub-sequence you’re looking for, and the matching parameters:
>>> from fuzzysearch import find_near_matches
# search for 'PATTERN' with a maximum Levenshtein Distance of 1
>>> find_near_matches('PATTERN', '---PATERN---', max_l_dist=1)
[Match(start=3, end=9, dist=1)]
>>> sequence = '''\
GACTAGCACTGTAGGGATAACAATTTCACACAGGTGGACAATTACATTGAAAATCACAGATTGGTCACACACACA
TTGGACATACATAGAAACACACACACATACATTAGATACGAACATAGAAACACACATTAGACGCGTACATAGACA
CAAACACATTGACAGGCAGTTCAGATGATGACGCCCGACTGATACTCGCGTAGTCGTGGGAGGCAAGGCACACAG
GGGATAGG'''
>>> subsequence = 'TGCACTGTAGGGATAACAAT' # distance = 1
>>> find_near_matches(subsequence, sequence, max_l_dist=2)
[Match(start=3, end=24, dist=1)]
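To build intuition for what such a search computes, here is a naive pure-Python sketch based on the classic dynamic-programming approach to approximate substring matching. It is an assumption-laden illustration and not how fuzzysearch is implemented: the name naive_near_matches is hypothetical, it returns plain (start, end, dist) tuples rather than Match objects, and it reports every qualifying end position without grouping overlapping matches.

```python
def naive_near_matches(pattern, text, max_l_dist):
    """Return (start, end, dist) tuples for every end position in text where
    some substring ending there is within max_l_dist of pattern.

    Illustrative helper only; not part of the fuzzysearch API.
    """
    m = len(pattern)
    # Column for zero consumed text characters: cell i holds (distance, start)
    # for aligning pattern[:i] against an empty substring starting at index 0.
    prev = [(i, 0) for i in range(m + 1)]
    results = []
    for j, ch in enumerate(text, start=1):
        # An empty pattern prefix matches for free, so a match may start at j.
        cur = [(0, j)]
        for i in range(1, m + 1):
            d_diag, s_diag = prev[i - 1]
            d_left, s_left = prev[i]
            d_up, s_up = cur[i - 1]
            cur.append(min(
                (d_diag + (pattern[i - 1] != ch), s_diag),  # match or substitute
                (d_left + 1, s_left),  # extra character in the text
                (d_up + 1, s_up),      # pattern character missing from the text
            ))
        dist, start = cur[m]
        if dist <= max_l_dist:
            results.append((start, j, dist))
        prev = cur
    return results


print(naive_near_matches('PATTERN', '---PATERN---', 1))  # prints [(3, 9, 1)]
```

The cost of this sketch grows with len(pattern) * len(text), which is one reason to prefer the library's optimized search functions for real workloads.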
Advanced Search Criteria
The search function supports four possible match criteria, which may be supplied in any combination:
maximum Levenshtein distance
maximum number of substitutions
maximum number of deletions (elements appearing in the pattern being searched for, which are skipped in the matching sub-sequence)
maximum number of insertions (elements added in the matching sub-sequence which don’t appear in the pattern being searched for)
Not supplying a criterion means that there is no limit for it. For this reason, one must always supply max_l_dist and/or all three of the other criteria; otherwise the search would be unbounded.
>>> find_near_matches('PATTERN', '---PATERN---', max_l_dist=1)
[Match(start=3, end=9, dist=1)]
# this will not match since max-deletions is set to zero
>>> find_near_matches('PATTERN', '---PATERN---', max_l_dist=1, max_deletions=0)
[]
# note that a deletion + insertion may be combined to match a substitution
>>> find_near_matches('PATTERN', '---PAT-ERN---', max_deletions=1, max_insertions=1, max_substitutions=0)
[Match(start=3, end=10, dist=1)] # the Levenshtein distance is still 1
# ... but deletion + insertion may also match other, non-substitution differences
>>> find_near_matches('PATTERN', '---PATERRN---', max_deletions=1, max_insertions=1, max_substitutions=0)
[Match(start=3, end=10, dist=2)]
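The way the individual limits interact can be sketched with a small recursive check. This is only an illustration under stated assumptions: within_limits is a hypothetical helper, not part of fuzzysearch; it compares the pattern against one complete candidate string instead of searching a longer sequence; and it requires every limit to be an integer, whereas the library treats an omitted criterion as unlimited.

```python
from functools import lru_cache


def within_limits(pattern, candidate, max_subs, max_ins, max_dels):
    """Return True if candidate differs from pattern by at most max_subs
    substitutions, max_ins insertions and max_dels deletions.

    Illustrative helper only; not part of the fuzzysearch API.
    """
    @lru_cache(maxsize=None)
    def go(i, j, subs, ins, dels):
        if i == len(pattern) and j == len(candidate):
            return True
        both_left = i < len(pattern) and j < len(candidate)
        # Equal elements match at no cost.
        if both_left and pattern[i] == candidate[j]:
            if go(i + 1, j + 1, subs, ins, dels):
                return True
        # Substitution: consume one element from each side.
        if subs and both_left and go(i + 1, j + 1, subs - 1, ins, dels):
            return True
        # Insertion: an extra element in the candidate.
        if ins and j < len(candidate) and go(i, j + 1, subs, ins - 1, dels):
            return True
        # Deletion: a pattern element skipped in the candidate.
        if dels and i < len(pattern) and go(i + 1, j, subs, ins, dels - 1):
            return True
        return False

    return go(0, 0, max_subs, max_ins, max_dels)


# A missing 'T' needs one deletion, so forbidding deletions rules it out.
print(within_limits('PATTERN', 'PATERN', 0, 0, 1))   # prints True
print(within_limits('PATTERN', 'PATERN', 1, 1, 0))   # prints False
# With substitutions forbidden, a deletion plus an insertion can stand in.
print(within_limits('PATTERN', 'PAT-ERN', 0, 1, 1))  # prints True
```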
History
0.3.0 (2015-02-12)
Added C extensions for several search functions as well as internal functions
Use C extensions if available, or pure-Python implementations otherwise
setup.py attempts to build C extensions, but installs without if build fails
Added --noexts setup.py option to avoid trying to build the C extensions
Greatly improved testing and coverage
0.2.2 (2014-03-27)
Added support for searching through BioPython Seq objects
Added specialized search function allowing only substitutions and insertions
Fixed several bugs
0.2.1 (2014-03-14)
Fixed major match grouping bug
0.2.0 (2014-03-13)
New utility function find_near_matches() for easier use
Additional documentation
0.1.0 (2013-11-12)
Two working implementations
Extensive test suite; all tests passing
Full support for Python 2.6-2.7 and 3.1-3.3
Bumped status from Pre-Alpha to Alpha
0.0.1 (2013-11-01)
First release on PyPI.