
fuzzysearch is useful for finding approximate subsequence matches

Project description


Easy fuzzy search that just works, fast!

>>> find_near_matches('PATTERN', '---PATERN---', max_l_dist=1)
[Match(start=3, end=9, dist=1)]
  • Approximate sub-string searches

  • A single, simple function to use

    • Chooses the fastest available search mechanism based on the given input

  • Uses the Levenshtein Distance metric with configurable parameters

    • Separately configure the max. allowed distance, substitutions, deletions and insertions

  • Advanced algorithms with optional C and Cython optimizations

  • Extensively tested

  • Free software: MIT license

For more info, see the documentation.

Installation

$ pip install fuzzysearch

This will work even if installing the C and Cython extensions fails, using pure-Python fallbacks.

Usage

Just call find_near_matches() with the sub-sequence you’re looking for, the sequence to search, and the matching parameters:

>>> from fuzzysearch import find_near_matches
# search for 'PATTERN' with a maximum Levenshtein Distance of 1
>>> find_near_matches('PATTERN', '---PATERN---', max_l_dist=1)
[Match(start=3, end=9, dist=1)]
>>> sequence = '''\
GACTAGCACTGTAGGGATAACAATTTCACACAGGTGGACAATTACATTGAAAATCACAGATTGGTCACACACACA
TTGGACATACATAGAAACACACACACATACATTAGATACGAACATAGAAACACACATTAGACGCGTACATAGACA
CAAACACATTGACAGGCAGTTCAGATGATGACGCCCGACTGATACTCGCGTAGTCGTGGGAGGCAAGGCACACAG
GGGATAGG'''
>>> subsequence = 'TGCACTGTAGGGATAACAAT' # distance = 1
>>> find_near_matches(subsequence, sequence, max_l_dist=2)
[Match(start=3, end=24, dist=1)]
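The start and end values in the results above behave like ordinary Python slice bounds (end is exclusive), so the matched text can be recovered by slicing the searched sequence. The following is a minimal sketch of that idea. So that it runs without the library installed, it uses a stand-in Match namedtuple with the same three fields shown in the output above; the stand-in and the matched_text() helper are illustrative assumptions, not part of fuzzysearch.

```python
from collections import namedtuple

# Stand-in with the same fields as the Match results printed above.
# This is an assumption for illustration, not the library's own class.
Match = namedtuple('Match', ['start', 'end', 'dist'])


def matched_text(sequence, match):
    """Return the part of `sequence` covered by `match` (end is exclusive)."""
    return sequence[match.start:match.end]


match = Match(start=3, end=9, dist=1)
print(matched_text('---PATERN---', match))  # PATERN
```

With real results, the same slice can be applied to each item returned by find_near_matches().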

Matching Criteria

The search function supports four possible match criteria, which may be supplied in any combination:

  • maximum Levenshtein distance (max_l_dist)

  • maximum # of substitutions (max_substitutions)

  • maximum # of deletions (max_deletions; “delete” = skip a character in the sub-sequence)

  • maximum # of insertions (max_insertions; “insert” = skip a character in the sequence)

Not supplying a criterion means that there is no limit for it. For this reason, one must always supply max_l_dist, or all three of the other criteria, or both.
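To make the Levenshtein distance criterion concrete, here is a small pure-Python sketch that computes the smallest Levenshtein distance between a sub-sequence and any substring of a sequence, using the textbook dynamic-programming approach (often attributed to Sellers). It is only an illustration of the metric: the function name min_substring_distance() is hypothetical, and this is not how fuzzysearch itself performs its searches.

```python
def min_substring_distance(subsequence, sequence):
    """Smallest Levenshtein distance between `subsequence` and any
    substring of `sequence` (illustrative sketch, not the library's code)."""
    # prev[i] = cost of aligning subsequence[:i] with the best substring
    # ending just before the current position in the sequence.
    prev = list(range(len(subsequence) + 1))
    best = prev[-1]  # cost of matching against the empty substring
    for ch in sequence:
        cur = [0]  # a match may start anywhere, so the empty prefix costs 0
        for i, p in enumerate(subsequence, start=1):
            cur.append(min(
                prev[i - 1] + (p != ch),  # exact match or substitution
                prev[i] + 1,              # insertion: skip a character in the sequence
                cur[i - 1] + 1,           # deletion: skip a character in the sub-sequence
            ))
        best = min(best, cur[-1])
        prev = cur
    return best


print(min_substring_distance('PATTERN', '---PATERN---'))  # 1
```

A search with max_l_dist=1 can only succeed when this value is at most 1, which is the case for the 'PATTERN' example used throughout this page.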

>>> find_near_matches('PATTERN', '---PATERN---', max_l_dist=1)
[Match(start=3, end=9, dist=1)]

# this will not match since max_deletions is set to zero
>>> find_near_matches('PATTERN', '---PATERN---', max_l_dist=1, max_deletions=0)
[]

# note that a deletion + insertion may be combined to match a substitution
>>> find_near_matches('PATTERN', '---PAT-ERN---', max_deletions=1, max_insertions=1, max_substitutions=0)
[Match(start=3, end=10, dist=1)] # the Levenshtein distance is still 1

# ... but deletion + insertion may also match other, non-substitution differences
>>> find_near_matches('PATTERN', '---PATERRN---', max_deletions=1, max_insertions=1, max_substitutions=0)
[Match(start=3, end=10, dist=2)]
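The interplay between the separate limits can be checked by hand with a small sketch. The function below tests whether a sub-sequence can be aligned with one specific candidate string without exceeding per-operation budgets, using the same meaning of "delete" and "insert" as the list above. The within_limits() name and its brute-force memoized search are assumptions made for illustration; they are not the library's API or algorithm.

```python
from functools import lru_cache


def within_limits(subsequence, candidate,
                  max_substitutions, max_insertions, max_deletions):
    """True if `subsequence` can be aligned with the whole of `candidate`
    without exceeding any of the three per-operation limits."""

    @lru_cache(maxsize=None)
    def walk(i, j, subs, ins, dels):
        # i indexes the sub-sequence, j the candidate; the rest are budgets.
        if i == len(subsequence) and j == len(candidate):
            return True
        if i < len(subsequence) and j < len(candidate):
            if subsequence[i] == candidate[j]:
                if walk(i + 1, j + 1, subs, ins, dels):
                    return True
            elif subs > 0 and walk(i + 1, j + 1, subs - 1, ins, dels):
                return True  # substitution
        if j < len(candidate) and ins > 0 and walk(i, j + 1, subs, ins - 1, dels):
            return True  # insertion: skip a character in the candidate
        if i < len(subsequence) and dels > 0 and walk(i + 1, j, subs, ins, dels - 1):
            return True  # deletion: skip a character in the sub-sequence
        return False

    return walk(0, 0, max_substitutions, max_insertions, max_deletions)


# One deletion plus one insertion can stand in for a substitution ...
print(within_limits('PATTERN', 'PAT-ERN', 0, 1, 1))  # True
# ... but a deletion alone cannot.
print(within_limits('PATTERN', 'PAT-ERN', 0, 0, 1))  # False
```

The same check also accepts 'PATERRN' with one deletion and one insertion, which mirrors the last example above.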

History

0.6.1 (2018-12-08)

  • Fixed some C compiler warnings for the C and Cython modules

0.6.0 (2018-12-07)

  • Dropped support for Python versions 2.6, 3.2 and 3.3

  • Added support and testing for Python 3.7

  • Optimized the n-grams Levenshtein search for long sub-sequences

  • Further optimized the n-grams Levenshtein search

  • Cython versions of the optimized parts of the n-grams Levenshtein search

0.5.0 (2017-09-05)

  • Fixed search_exact_byteslike() to support supplying start and end indexes

  • Added support for lists, tuples and other Sequence types to search_exact()

  • Fixed a bug where find_near_matches() could return a wrong Match.end with max_l_dist=0

  • Added more tests and improved some existing ones.

0.4.0 (2017-07-06)

  • Added support and testing for Python 3.5 and 3.6

  • Many small improvements to README, setup.py and CI testing

0.3.0 (2015-02-12)

  • Added C extensions for several search functions as well as internal functions

  • Use C extensions if available, or pure-Python implementations otherwise

  • setup.py attempts to build C extensions, but installs without if build fails

  • Added --noexts setup.py option to avoid trying to build the C extensions

  • Greatly improved testing and coverage

0.2.2 (2014-03-27)

  • Added support for searching through BioPython Seq objects

  • Added specialized search function allowing only substitutions and insertions

  • Fixed several bugs

0.2.1 (2014-03-14)

  • Fixed major match grouping bug

0.2.0 (2014-03-13)

  • New utility function find_near_matches() for easier use

  • Additional documentation

0.1.0 (2013-11-12)

  • Two working implementations

  • Extensive test suite; all tests passing

  • Full support for Python 2.6-2.7 and 3.1-3.3

  • Bumped status from Pre-Alpha to Alpha

0.0.1 (2013-11-01)

  • First release on PyPI.
