Fast pattern counting on small-alphabet sequences with GPU acceleration
Pattern Buffer
Fast pattern counting on small-alphabet sequences with GPU acceleration, built on PyTorch's Tensor.
Pattern Buffer was originally designed to count patterns in DNA sequences containing ambiguous bases, as defined by the IUPAC code, but it can be extended to any pattern counting task where the original sequences and target patterns can be converted into a sensible categorical encoding.
[!NOTE] The "small-alphabet" qualifier matters because the additional memory used by categorical encoding grows with the number of unique values in the alphabet. With DNA and IUPAC encoding this is tractable because many of the codes represent combinations of DNA bases, so although there are 15 IUPAC nucleotide codes, each one only needs to be embedded in a vector of length 4 (A, C, G, T).
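To make the note above concrete, here is a minimal pure-Python sketch of how each IUPAC code can be written as a length-4 multi-hot vector over (A, C, G, T). The names `IUPAC_BASES` and `encode_code` are introduced here for illustration only; this is not necessarily the layout of the tensor returned by generate_iupac_embedding.

```python
# Order of the four channels in the encoding.
BASES = "ACGT"

# Standard IUPAC nucleotide codes and the bases each one stands for.
IUPAC_BASES = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT", "K": "GT", "M": "AC",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG", "N": "ACGT",
}

def encode_code(code):
    """Return a length-4 multi-hot vector: 1 where the code allows that base."""
    allowed = IUPAC_BASES[code]
    return [1 if base in allowed else 0 for base in BASES]

print(len(IUPAC_BASES))   # 15 codes ...
print(encode_code("A"))   # [1, 0, 0, 0]
print(encode_code("R"))   # [1, 0, 1, 0]  (A or G)
print(encode_code("N"))   # [1, 1, 1, 1]  (any base)
```

All 15 codes fit in the same four channels, which is why the memory cost stays at 4 values per position rather than 15.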
Usage
[!IMPORTANT] While the queries can all be different lengths, Pattern Buffer currently only supports counting them on sequences that are all the same length.
Pattern Buffer can be used through either a broadly functional or an object-oriented (OO) interface. The functional interface is geared towards one-time use, and the OO interface towards repeated use (e.g. file parsing or PyTorch DataLoaders).
To demonstrate, we'll first create some sample sequences and queries. As we're using IUPAC nucleotide sequences, we can use the provided generate_iupac_embedding function to provide the embedding tensor.
>>> from pattern_buffer import generate_iupac_embedding
>>> # Create sample sequences and queries
>>> sequences = ["AACGAATCAAAAT", "AACAGTTCAAAAT", "AACAGTTCGYGGA", "AACAAGATCAGGA"]
>>> queries = ["AAA", "AGT", "AAACA", "AAR", "GYGGA"]
>>> embedding = generate_iupac_embedding()
From here, we can use the count_queries function to count the queries in a single function call:
>>> from pattern_buffer import count_queries
>>> count_queries(sequences, queries, embedding)
tensor([[2, 0, 0, 2, 0],
        [2, 1, 0, 2, 0],
        [0, 1, 0, 0, 0],
        [0, 0, 0, 1, 0]])
Alternatively, we can create an instance of the PatternBuffer class and use its .count method to count occurrences in new sequences. This avoids re-calculating the query embeddings and the support tensor on every call, so it is well suited to fast repeated counting:
>>> from pattern_buffer import PatternBuffer
>>> pb = PatternBuffer(query_strings=queries, embedding=embedding)
>>> pb.count(sequences)
tensor([[2, 0, 0, 2, 0],
        [2, 1, 0, 2, 0],
        [0, 1, 0, 0, 0],
        [0, 0, 0, 1, 0]])
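For sanity-checking counts on small inputs, a brute-force counter in plain Python can be handy. The sketch below is not how Pattern Buffer works internally (the library operates on PyTorch tensors); `IUPAC_SETS` and `count_reference` are hypothetical names introduced here. It counts overlapping matches and assumes that an ambiguity code appearing in a sequence (such as the Y in the third sample sequence) never matches a query position. Under that assumption it agrees with the counts shown above for the sample data.

```python
# IUPAC nucleotide codes mapped to the bases they stand for.
IUPAC_SETS = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT", "K": "GT", "M": "AC",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG", "N": "ACGT",
}

def count_reference(sequences, queries):
    """Brute-force count of overlapping query matches, one row per sequence."""
    table = []
    for seq in sequences:
        row = []
        for query in queries:
            hits = 0
            for start in range(len(seq) - len(query) + 1):
                window = seq[start:start + len(query)]
                # Assumption: only unambiguous sequence bases can match a query code.
                if all(base in IUPAC_SETS[code] for base, code in zip(window, query)):
                    hits += 1
            row.append(hits)
        table.append(row)
    return table

sequences = ["AACGAATCAAAAT", "AACAGTTCAAAAT", "AACAGTTCGYGGA", "AACAAGATCAGGA"]
queries = ["AAA", "AGT", "AAACA", "AAR", "GYGGA"]
for row in count_reference(sequences, queries):
    print(row)
```

This runs in time proportional to sequences × queries × sequence length × query length, so it is only suitable as a cross-check, not as a replacement for the tensor-based counting.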
Limitations
- Currently, the program expects all input sequences to have the same length, although queries can differ in length.
- If all of your patterns contain only unique (non-ambiguous) characters, this encoding scheme is likely overkill and other software would be more efficient.
- The software is designed for use with GPU acceleration, and will likely under-perform on CPU when compared to CPU-optimised counting schemes.
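Until variable-length input is supported, one possible workaround for the first limitation is to group sequences by length and count each group separately. The helper below, `group_by_length`, is a hypothetical name introduced here and is not part of the package; it is a sketch of the bookkeeping only, and keeps the original positions so results can be put back in input order.

```python
from collections import defaultdict

def group_by_length(sequences):
    """Map each sequence length to a list of (original_index, sequence) pairs."""
    groups = defaultdict(list)
    for index, seq in enumerate(sequences):
        groups[len(seq)].append((index, seq))
    return dict(groups)

mixed = ["AACGAATC", "AGT", "AACAGTTC", "GGA"]
for length, items in group_by_length(mixed).items():
    indices = [index for index, _ in items]
    batch = [seq for _, seq in items]
    # Each batch now holds equal-length sequences and could be passed to
    # pb.count(batch) or count_queries(batch, queries, embedding).
    print(length, indices, batch)
```

Each batch costs one counting call, so this works best when the input contains only a few distinct lengths.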
Future work
- Allow dynamic input lengths with padding
- Automatic encoding detection from pattern analysis
- FFT-based convolutions for long patterns