
Fast inexact text searching with suffix arrays


nearmiss

NEARby MISmatch Search

nearmiss is a fast inexact text matching tool. It finds repeats of a region around a specific anchor string throughout a text, optionally allowing matches with substitutions.

It is primarily intended for finding near-match sections of DNA in the vicinity of specific anchor sequences. The current substitution alphabet is limited to ACGT.

The speed of nearmiss comes from a C extension that uses Yuta Mori's SA-IS suffix array library and pointer magic. Searching for anchors takes O(|anchor| log |text|) time. Searching for repeats takes O(a(sw)^d log t) time, where

  • a is the number of anchors found
  • s is the size of the substitution alphabet
  • w is the size of the matching window
  • d is the maximum desired number of substitutions to allow in the window
  • and t is the size of the search text
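
To see where the (sw)^d factor comes from, the sketch below enumerates every string within d substitutions of a window over the ACGT alphabet. It is an illustration only, not the C extension's code, and the name `substitution_variants` is hypothetical. The exact number of variants is the sum over k ≤ d of C(w, k)(s-1)^k, which never exceeds (sw)^d; assuming each variant is then looked up in the suffix array, each lookup contributes the log t factor.

```python
from itertools import combinations, product
from math import comb

ALPHABET = "ACGT"  # substitution alphabet stated in the README


def substitution_variants(window, max_distance, alphabet=ALPHABET):
    """Yield (distance, variant) for every string within max_distance
    substitutions of window. Hypothetical helper, for illustration only."""
    for k in range(max_distance + 1):
        # Choose which k positions of the window are substituted
        for positions in combinations(range(len(window)), k):
            # At each chosen position, try every letter except the original
            choices = [[c for c in alphabet if c != window[p]] for p in positions]
            for letters in product(*choices):
                variant = list(window)
                for p, c in zip(positions, letters):
                    variant[p] = c
                yield k, "".join(variant)


s, w, d = len(ALPHABET), 3, 2
variants = list(substitution_variants("CTA", d))
exact = sum(comb(w, k) * (s - 1) ** k for k in range(d + 1))
print(len(variants), exact, (s * w) ** d)  # prints: 37 37 144
```

For a window of 3 bases and up to 2 substitutions there are 37 candidate strings, comfortably under the (sw)^d = 144 bound. The count grows quickly with d, which is why small values of max_distance matter for speed.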

Use

>>> from nearmiss import Searcher
>>> seq = "TACTANGGnnnTAAAAGnGG"
>>> searcher = Searcher(seq)
>>> searcher.find_anchors("GG")
[6, 18]
>>> searcher.find_anchors("nGG")
[17]
>>> searcher.find_repeat_counts("GG", (-4, -2), max_distance=1)
{18: [1, 0], 6: [1, 0]}
>>> searcher.find_repeat_counts("GG", (-4, -2), max_distance=2)
{18: [1, 0, 1], 6: [1, 0, 1]}
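
The anchor results above can be cross-checked with a small pure-Python suffix array search. This is a minimal sketch of the general technique (binary search over sorted suffixes), not the nearmiss implementation: the suffix array is built naively by sorting rather than with SA-IS, and the names `build_suffix_array` and `find_anchors_naive` are hypothetical.

```python
def build_suffix_array(text):
    # Naive construction by sorting suffixes; fine for short strings.
    # nearmiss itself uses the SA-IS library for this step.
    return sorted(range(len(text)), key=lambda i: text[i:])


def find_anchors_naive(text, suffix_array, anchor):
    """Return the sorted start positions of anchor in text."""
    m = len(anchor)

    def prefix(rank):
        start = suffix_array[rank]
        return text[start:start + m]

    # Lower bound: first suffix whose m-character prefix is >= anchor
    lo, hi = 0, len(suffix_array)
    while lo < hi:
        mid = (lo + hi) // 2
        if prefix(mid) < anchor:
            lo = mid + 1
        else:
            hi = mid
    first = lo

    # Upper bound: first suffix whose m-character prefix is > anchor
    hi = len(suffix_array)
    while lo < hi:
        mid = (lo + hi) // 2
        if prefix(mid) <= anchor:
            lo = mid + 1
        else:
            hi = mid

    # Every suffix ranked in [first, lo) starts with the anchor
    return sorted(suffix_array[first:lo])


seq = "TACTANGGnnnTAAAAGnGG"
sa = build_suffix_array(seq)
print(find_anchors_naive(seq, sa, "GG"))   # [6, 18]
print(find_anchors_naive(seq, sa, "nGG"))  # [17]
```

Each binary search step compares at most |anchor| characters, which gives the O(|anchor| log |text|) anchor search time. The comparison is case-sensitive, so "nGG" matches at position 17 but not at the "NGG" starting at position 5.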

For more detailed information, see the source documentation with pydoc nearmiss.Searcher or help(nearmiss.Searcher).

To limit the number of threads nearmiss uses without changing the source code, set the environment variable OMP_NUM_THREADS to the desired number of threads.
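
The variable can also be set from within Python. The sketch below assumes, as is typical for OpenMP runtimes, that OMP_NUM_THREADS is read when the runtime initializes, so it is set before nearmiss is first imported. The value "2" is an arbitrary example.

```python
import os

# Set the thread limit before the first import of nearmiss, on the assumption
# that the OpenMP runtime reads the variable at initialization.
os.environ["OMP_NUM_THREADS"] = "2"  # example value, pick what suits your machine

# Import only after the variable is set:
# from nearmiss import Searcher
```

Setting the variable in the shell before launching Python has the same effect and avoids any dependence on import order.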

Installing

Non-python dependencies

nearmiss uses OpenMP to drastically speed up mismatch searching on many anchors. To install the OpenMP runtime on Debian/Ubuntu systems, run sudo apt-get install libomp5.

with pip

pip install nearmiss

from source

Run pip install . in the source directory.


