Pattern Buffer 🧮

Fast pattern counting on small-alphabet sequences with GPU acceleration, built on PyTorch.

Originally designed to count patterns in DNA sequences with ambiguous bases as defined by the IUPAC code, but can be extended to any pattern counting task where the original sequences and target patterns can be converted into a sensible categorical encoding.

[!NOTE] The "small-alphabet" qualifier matters because the additional memory used by categorical encoding grows with the number of unique values in the alphabet. With DNA and IUPAC encoding this stays tractable: most IUPAC codes represent combinations of the four DNA bases, so although there are 15 nucleotide codes, each one can be embedded in a vector of length 4 (A, C, G, T).
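
To make the note above concrete, here is a minimal pure-Python sketch of a multi-hot encoding over the four bases. The table is the standard IUPAC nucleotide code; the names `multi_hot` and `encode` are hypothetical and are not part of Pattern Buffer, whose own embedding comes from generate_iupac_embedding and may be laid out differently.

```python
# Illustrative only: a multi-hot IUPAC encoding over the four bases A, C, G, T.
# The function names are hypothetical and are not part of pattern_buffer.
BASES = "ACGT"

# Standard IUPAC nucleotide codes and the bases each one can stand for.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT", "K": "GT", "M": "AC",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG",
    "N": "ACGT",
}


def multi_hot(code: str) -> list[int]:
    """Return a length-4 vector with a 1 for every base the code allows."""
    allowed = IUPAC[code]
    return [1 if base in allowed else 0 for base in BASES]


def encode(sequence: str) -> list[list[int]]:
    """Encode a sequence as one length-4 vector per position."""
    return [multi_hot(code) for code in sequence]


print(len(IUPAC))      # 15 codes ...
print(multi_hot("A"))  # [1, 0, 0, 0]  ... but every vector has length 4
print(multi_hot("R"))  # [1, 0, 1, 0]
print(multi_hot("N"))  # [1, 1, 1, 1]
```

A plain one-hot encoding of the same alphabet would need 15 channels per position instead of 4, which is where the memory saving comes from.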

Usage

[!IMPORTANT] Queries can have different lengths, but Pattern Buffer currently only supports counting them in sequences that all share the same length.

Pattern Buffer offers both a functional and an object-oriented (OO) interface. The functional interface is geared towards one-time use, and the OO interface towards repeated use (e.g. file parsing or PyTorch DataLoaders).

To demonstrate, we'll first create some sample sequences and queries. As we're using IUPAC nucleotide sequences, we can use the provided generate_iupac_embedding function to provide the embedding tensor.

from pattern_buffer import generate_iupac_embedding
sequences = ["AACGAATCAAAAT", "AACAGTTCAAAAT", "AACAGTTCGYGGA", "AACAAGATCAGGA"]
queries = ["AAA", "AGT", "AAACA", "AAR", "GYGGA"]
embedding = generate_iupac_embedding()

From here, the count_queries function counts every query in every sequence in a single call, returning one row per sequence and one column per query:

from pattern_buffer import count_queries
count_queries(sequences, queries, embedding)
# tensor([[2, 0, 0, 2, 0],
#         [2, 1, 0, 2, 0],
#         [0, 1, 0, 0, 0],
#         [0, 0, 0, 1, 0]])
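
For cross-checking results like the tensor above on small inputs, a brute-force counter can help. The sketch below is not the library's algorithm, and `naive_count` is a hypothetical helper. It rests on one assumption: a sequence character matches a query code only when it is one of the unambiguous bases that code allows. That assumption was chosen because it reproduces the output shown above (including the 0 for "GYGGA" in the third sequence); how the library itself treats ambiguous codes inside sequences is not stated here.

```python
# Brute-force cross-check in pure Python; not how pattern_buffer works internally.
# Assumption: only unambiguous sequence characters (A, C, G, T) can match.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT", "K": "GT", "M": "AC",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG",
    "N": "ACGT",
}


def naive_count(sequence: str, query: str) -> int:
    """Count overlapping occurrences of an IUPAC query in a sequence."""
    n, m = len(sequence), len(query)
    return sum(
        all(sequence[i + j] in IUPAC[query[j]] for j in range(m))
        for i in range(n - m + 1)
    )


sequences = ["AACGAATCAAAAT", "AACAGTTCAAAAT", "AACAGTTCGYGGA", "AACAAGATCAGGA"]
queries = ["AAA", "AGT", "AAACA", "AAR", "GYGGA"]

# One row per sequence, one column per query.
counts = [[naive_count(s, q) for q in queries] for s in sequences]
for row in counts:
    print(row)
# [2, 0, 0, 2, 0]
# [2, 1, 0, 2, 0]
# [0, 1, 0, 0, 0]
# [0, 0, 0, 1, 0]
```

This runs in time proportional to sequence length times query length for each pair, so it is only suitable as a sanity check, not as a replacement for the GPU path.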

Alternatively, create an instance of the PatternBuffer class and use its .count method to count occurrences in new sequences. This avoids re-calculating the query embeddings and the support tensor on every call, so it is well suited to fast repeated counting:

from pattern_buffer import PatternBuffer
pb = PatternBuffer(query_strings=queries, embedding=embedding)
pb.count(sequences)
# tensor([[2, 0, 0, 2, 0],
#         [2, 1, 0, 2, 0],
#         [0, 1, 0, 0, 0],
#         [0, 0, 0, 1, 0]])
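
When sequences arrive as a stream (for example while parsing a file), one way to reuse a single PatternBuffer instance is to feed it fixed-size batches. The helper below is a hypothetical sketch, not part of the package: it accepts any callable that takes a list of sequences, such as the `pb.count` method shown above, and yields each batch together with its result.

```python
from itertools import islice
from typing import Callable, Iterable, Iterator


def count_in_batches(
    counter: Callable, sequences: Iterable[str], batch_size: int
) -> Iterator[tuple]:
    """Yield (batch, counter(batch)) for consecutive batches of sequences.

    `counter` is any callable taking a list of sequences, e.g. `pb.count`.
    """
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    iterator = iter(sequences)
    while True:
        batch = list(islice(iterator, batch_size))
        if not batch:
            return
        yield batch, counter(batch)


# Intended use, with `pb` built as in the example above and `reads` being any
# iterable of equal-length sequences (both names are placeholders):
#
#     for batch, counts in count_in_batches(pb.count, reads, batch_size=1024):
#         ...
```

Because the sequences within a call must share one length, every batch passed to the counter still has to be uniform in length.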

Limitations

  • Currently, all input sequences must have the same length; queries can already differ in length.
  • If all of your patterns contain only unambiguous characters, this encoding scheme is likely overkill and other software would be more efficient.
  • The software is designed for use with GPU acceleration, and will likely under-perform on CPU when compared to CPU-optimised counting schemes.
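
Until variable-length input is supported, one possible workaround for the equal-length requirement is to group sequences by length and count each group separately. The sketch below uses only the standard library; `group_by_length` is a hypothetical helper, not part of Pattern Buffer, and it keeps the original indices so results can be mapped back afterwards.

```python
from collections import defaultdict


def group_by_length(sequences):
    """Map each length to the (original index, sequence) pairs of that length."""
    groups = defaultdict(list)
    for index, sequence in enumerate(sequences):
        groups[len(sequence)].append((index, sequence))
    return dict(groups)


reads = ["AACG", "AC", "TTGA", "GT"]
for length, members in group_by_length(reads).items():
    indices = [i for i, _ in members]
    batch = [s for _, s in members]
    # Each batch is uniform in length, so it could be passed to a counter
    # such as PatternBuffer.count in a single call.
    print(length, indices, batch)
# 4 [0, 2] ['AACG', 'TTGA']
# 2 [1, 3] ['AC', 'GT']
```

With many distinct lengths this produces many small batches, which may cost GPU efficiency.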

Future work

  • Allow dynamic input lengths with padding
  • Automatic encoding detection from pattern analysis
  • FFT-based convolutions for long patterns
