fuzzysearch is useful for finding approximate subsequence matches
fuzzysearch is a Python library for fuzzy substring searches. It implements efficient ad-hoc searching for approximate sub-sequences. Matching is done using a generalized Levenshtein Distance metric, with configurable parameters.
Just install using pip:
$ pip install fuzzysearch
- Fuzzy sub-sequence search: Find parts of a sequence which match a given sub-sequence.
- Easy to use: A single function to call which returns a list of matches.
- Set a maximum Levenshtein Distance for matches, including individual limits for the number of substitutions, insertions and/or deletions allowed for near-matches.
- Includes optimized implementations for specific use-cases, e.g. allowing only substitutions.
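To illustrate why a substitutions-only search can be treated as a special case, here is a minimal pure-Python sketch. It is not fuzzysearch's implementation, and the function name `find_substitution_only_matches` is hypothetical. When no insertions or deletions are allowed, every candidate match has exactly the length of the pattern, so counting mismatches in a sliding window is enough.

```python
def find_substitution_only_matches(pattern, text, max_substitutions):
    """Return (start, end, mismatches) for each window of `text` that differs
    from `pattern` in at most `max_substitutions` positions.

    Illustrative only: a naive O(len(text) * len(pattern)) scan, not the
    optimized routine shipped with fuzzysearch.
    """
    m = len(pattern)
    matches = []
    for start in range(len(text) - m + 1):
        window = text[start:start + m]
        # Count positions where the window and the pattern disagree.
        mismatches = sum(a != b for a, b in zip(pattern, window))
        if mismatches <= max_substitutions:
            matches.append((start, start + m, mismatches))
    return matches


# One substituted character ('X' instead of the second 'T') is found...
print(find_substitution_only_matches('PATTERN', '---PATXERN---', 1))
# ...but a dropped character is not, since that would need a deletion.
print(find_substitution_only_matches('PATTERN', '---PATERN---', 1))
```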
Just call find_near_matches() with the sequence to search, the sub-sequence you’re looking for, and the matching parameters:
>>> from fuzzysearch import find_near_matches
>>> # search for 'PATTERN' with a maximum Levenshtein Distance of 1
>>> find_near_matches('PATTERN', '---PATERN---', max_l_dist=1)
[Match(start=3, end=9, dist=1)]
>>> sequence = '''\
GACTAGCACTGTAGGGATAACAATTTCACACAGGTGGACAATTACATTGAAAATCACAGATTGGTCACACACACA
TTGGACATACATAGAAACACACACACATACATTAGATACGAACATAGAAACACACATTAGACGCGTACATAGACA
CAAACACATTGACAGGCAGTTCAGATGATGACGCCCGACTGATACTCGCGTAGTCGTGGGAGGCAAGGCACACAG
GGGATAGG'''
>>> subsequence = 'TGCACTGTAGGGATAACAAT' # distance = 1
>>> find_near_matches(subsequence, sequence, max_l_dist=2)
[Match(start=3, end=24, dist=1)]
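To make the `max_l_dist` parameter in the examples above more concrete, the following is a rough pure-Python sketch of approximate substring matching under plain Levenshtein distance, using a textbook dynamic-programming scan. It is an illustration under stated assumptions, not how fuzzysearch is implemented: the name `near_match_ends` is hypothetical, and it reports only where a near-match ends, without the start indices or match grouping that `find_near_matches()` provides.

```python
def near_match_ends(pattern, text, max_l_dist):
    """Return (end, dist) pairs such that some substring of `text` ending at
    index `end` (exclusive) is within `max_l_dist` edits of `pattern`.

    Illustrative sketch only; not part of the fuzzysearch API.
    """
    m = len(pattern)
    # prev[i] is the smallest edit distance between pattern[:i] and any
    # substring of the text ending at the previously processed position.
    prev = list(range(m + 1))
    hits = []
    for end, char in enumerate(text, start=1):
        # A match may start anywhere, so matching an empty prefix costs 0.
        cur = [0]
        for i in range(1, m + 1):
            cur.append(min(
                prev[i - 1] + (pattern[i - 1] != char),  # match or substitution
                prev[i] + 1,      # extra character in the text (insertion)
                cur[i - 1] + 1,   # pattern character skipped (deletion)
            ))
        if cur[m] <= max_l_dist:
            hits.append((end, cur[m]))
        prev = cur
    return hits


print(near_match_ends('PATTERN', '---PATERN---', max_l_dist=1))  # [(9, 1)]
```

The reported end index and distance agree with the `end=9, dist=1` of the first example above.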
Advanced Search Criteria
The search function supports four possible match criteria, which may be supplied in any combination:
- maximum Levenshtein distance
- maximum # of substitutions
- maximum # of deletions (elements appearing in the pattern searched for, which are skipped in the matching sub-sequence)
- maximum # of insertions (elements added in the matching sub-sequence which don’t appear in the pattern searched for)
Not supplying a criterion means that there is no limit for it. For this reason, one must always supply max_l_dist and/or all other criteria.
>>> find_near_matches('PATTERN', '---PATERN---', max_l_dist=1)
[Match(start=3, end=9, dist=1)]
>>> # this will not match since max_deletions is set to zero
>>> find_near_matches('PATTERN', '---PATERN---', max_l_dist=1, max_deletions=0)
[]
>>> # note that a deletion + insertion may be combined to match a substitution
>>> find_near_matches('PATTERN', '---PAT-ERN---', max_deletions=1, max_insertions=1, max_substitutions=0)
[Match(start=3, end=10, dist=1)] # the Levenshtein distance is still 1
>>> # ... but deletion + insertion may also match other, non-substitution differences
>>> find_near_matches('PATTERN', '---PATERRN---', max_deletions=1, max_insertions=1, max_substitutions=0)
[Match(start=3, end=10, dist=2)]
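The rule that one must supply max_l_dist and/or all three other criteria can be read as a requirement that the total number of edits be bounded. The sketch below spells out that reasoning. The helper `effective_max_l_dist` is hypothetical and not part of fuzzysearch; it assumes that an omitted criterion is represented as `None`.

```python
def effective_max_l_dist(max_substitutions=None, max_insertions=None,
                         max_deletions=None, max_l_dist=None):
    """Return the tightest upper bound on the total number of edits implied
    by the given criteria, or raise ValueError if the search is unbounded.

    Hypothetical helper for illustration; None means "no limit".
    """
    per_type = (max_substitutions, max_insertions, max_deletions)
    if all(limit is not None for limit in per_type):
        # Every edit is a substitution, insertion or deletion, so the three
        # per-type limits together cap the total number of edits.
        summed = sum(per_type)
        return summed if max_l_dist is None else min(max_l_dist, summed)
    if max_l_dist is None:
        # At least one edit type is unlimited and nothing caps the total.
        raise ValueError(
            "supply max_l_dist and/or all of max_substitutions, "
            "max_insertions and max_deletions"
        )
    return max_l_dist


print(effective_max_l_dist(max_l_dist=1))                   # 1
print(effective_max_l_dist(max_substitutions=0,
                           max_insertions=1,
                           max_deletions=1))                # 2
```

The second call corresponds to the last two examples above, where matches with a distance of up to 2 are possible even though max_l_dist was not given.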
- Added C extensions for several search functions as well as internal functions
- Use C extensions if available, or pure-Python implementations otherwise
- setup.py attempts to build C extensions, but installs without if build fails
- Added --noexts setup.py option to avoid trying to build the C extensions
- Greatly improved testing and coverage
- Added support for searching through BioPython Seq objects
- Added specialized search function allowing only substitutions and insertions
- Fixed several bugs
- Fixed major match grouping bug
- New utility function find_near_matches() for easier use
- Additional documentation
- Two working implementations
- Extensive test suite; all tests passing
- Full support for Python 2.6-2.7 and 3.1-3.3
- Bumped status from Pre-Alpha to Alpha
- First release on PyPI.