Fast pattern counting on small-alphabet sequences with GPU acceleration

License
- OSI Approved :: MIT License
Operating System
- OS Independent
Programming Language
- Python :: 3

Project description

Pattern Buffer 🧮

Fast pattern counting on small-alphabet sequences with GPU acceleration, built on PyTorch.

Pattern Buffer was originally designed to count patterns in DNA sequences with ambiguous bases as defined by the IUPAC code, but it can be extended to any pattern counting task where the original sequences and target patterns can be converted into a sensible categorical encoding.

Note: The "small-alphabet" qualifier matters because the additional memory used by categorical encoding grows with the number of unique values in the alphabet. With DNA and IUPAC encoding this is tractable because many of the codes represent combinations of DNA bases, so although there are 15 IUPAC nucleotide codes, each one only needs to be embedded in a vector of length 4 (A, C, G, T).
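To make the length-4 encoding concrete, here is a minimal pure-Python sketch of how the 15 IUPAC codes can each be written as a multi-hot vector over A, C, G, T. The names `IUPAC_CODES`, `multi_hot` and `encode` are illustrative assumptions and are not part of the package; the tensor returned by `generate_iupac_embedding` may be laid out differently.

```python
# Illustrative sketch only: not the package's implementation.
BASES = "ACGT"

# Standard IUPAC nucleotide codes and the bases each one stands for.
IUPAC_CODES = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT", "K": "GT", "M": "AC",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG",
    "N": "ACGT",
}


def multi_hot(code: str) -> list[int]:
    """Return a length-4 vector with a 1 for every base the code allows."""
    allowed = IUPAC_CODES[code]
    return [1 if base in allowed else 0 for base in BASES]


def encode(sequence: str) -> list[list[int]]:
    """Encode a sequence as one length-4 row per character."""
    return [multi_hot(code) for code in sequence]


print(len(IUPAC_CODES))  # 15
print(multi_hot("R"))    # [1, 0, 1, 0]  (A or G)
print(encode("GYN"))     # [[0, 0, 1, 0], [0, 1, 0, 1], [1, 1, 1, 1]]
```

The width of each row depends on the number of underlying bases (4), not on the number of codes (15), which is why the memory cost stays small for this alphabet.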

Usage

Pattern Buffer offers both a functional and an object-oriented (OO) interface. The functional interface is geared towards one-time use, and the OO interface towards repeated use (e.g. file parsing or PyTorch DataLoaders).

To demonstrate, we'll first create some sample sequences and queries. As we're using IUPAC nucleotide sequences, we can use the provided generate_iupac_embedding function to provide the embedding tensor.

from pattern_buffer import generate_iupac_embedding
sequences = [
    "AACGAATCAAAAT",
    "AACAGTTCAAAAATTAGT",
    "AGTTCGYGGA",
    "AACAAGATCAGGAAAGCTGACTTGATG",
]
queries = ["AAA", "AGT", "AAACA", "AAR", "GYGGA"]
embedding = generate_iupac_embedding()

From here, we can use the function count_queries to count the queries in a single function call:

from pattern_buffer import count_queries
count_queries(sequences, queries, embedding)
# tensor([[0, 0, 0, 0, 0],
#         [3, 1, 0, 3, 0],
#         [0, 1, 0, 0, 1],
#         [1, 0, 0, 3, 0]])

Alternatively, create an instance of the PatternBuffer class and use its .count method to count occurrences in new sequences. This avoids re-calculating the query embeddings and the support tensor on each call, so it is well suited to fast repeated counting:

from pattern_buffer import PatternBuffer
pb = PatternBuffer(query_strings=queries, embedding=embedding)
pb.count(sequences)
# tensor([[0, 0, 0, 0, 0],
#         [3, 1, 0, 3, 0],
#         [0, 1, 0, 0, 1],
#         [1, 0, 0, 3, 0]])
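The benefit of preparing the queries once can be illustrated with a small pure-Python stand-in. This is a sketch under stated assumptions, not the package's algorithm: `NaiveCounter` is a hypothetical class, it returns nested lists rather than a tensor, and it assumes a sequence symbol matches a query symbol when every base the sequence symbol allows is also allowed by the query symbol. The package's own matching rule may differ.

```python
# Hypothetical stand-in for illustration; not the pattern_buffer API.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT", "K": "GT", "M": "AC",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG", "N": "ACGT",
}


def to_sets(text):
    """Turn each character into the set of bases it may represent."""
    return [frozenset(IUPAC[ch]) for ch in text]


class NaiveCounter:
    def __init__(self, queries):
        # Encode the queries once so repeated calls do not redo this work.
        self.encoded_queries = [to_sets(q) for q in queries]

    def count(self, sequences):
        """Return one row per sequence and one column per query."""
        encoded = [to_sets(s) for s in sequences]
        return [
            [self._count_one(seq, query) for query in self.encoded_queries]
            for seq in encoded
        ]

    @staticmethod
    def _count_one(seq, query):
        # Slide the query over the sequence, counting overlapping matches.
        hits = 0
        for start in range(len(seq) - len(query) + 1):
            window = seq[start:start + len(query)]
            # Assumed rule: sequence symbol must be a subset of query symbol.
            if all(s <= q for s, q in zip(window, query)):
                hits += 1
        return hits


counter = NaiveCounter(["AAA", "AAR", "GYGGA"])
print(counter.count(["CAAAG", "AGTTCGYGGA"]))  # [[1, 2, 0], [0, 0, 1]]
```

The same `counter` object can then be reused for every new batch of sequences, which mirrors how a single PatternBuffer instance is meant to be reused.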

Limitations

- If all of your patterns contain unique (non-ambiguous) characters then this encoding scheme is likely overkill and other software would be more efficient.
- The software is designed for use with GPU acceleration, and will likely under-perform on CPU when compared to CPU-optimised counting schemes.
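For the unambiguous case described in the first limitation, plain string searching is often enough. The helper below is a hypothetical sketch (the name `count_overlapping` is not from the package) that counts overlapping occurrences of an exact pattern using only the standard library.

```python
def count_overlapping(sequence: str, pattern: str) -> int:
    """Count occurrences of an exact pattern, including overlapping ones."""
    if not pattern:
        raise ValueError("pattern must be non-empty")
    count = 0
    start = sequence.find(pattern)
    while start != -1:
        count += 1
        # Resume one character later so overlapping matches are found.
        start = sequence.find(pattern, start + 1)
    return count


print(count_overlapping("CAAAAT", "AAA"))  # 2
print("CAAAAT".count("AAA"))               # 1, str.count skips overlaps
```

Note that the built-in `str.count` only reports non-overlapping matches, so it is not a drop-in replacement when overlapping occurrences should be counted.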

Future work

- Allow dynamic input lengths with padding
- Automatic encoding detection from pattern analysis
- FFT-based convolutions for long patterns

Release history

- 0.0.4 (this version): Aug 22, 2024
- 0.0.3: Nov 11, 2023
- 0.0.2: Nov 9, 2023
- 0.0.1: Nov 9, 2023

Download files

Source Distribution

pattern_buffer-0.0.4.tar.gz (5.3 kB)

Uploaded Aug 22, 2024 Source

Built Distribution

pattern_buffer-0.0.4-py3-none-any.whl (5.9 kB)

Uploaded Aug 22, 2024 Python 3

File details

Details for the file pattern_buffer-0.0.4.tar.gz.

File metadata

Download URL: pattern_buffer-0.0.4.tar.gz
Upload date: Aug 22, 2024
Size: 5.3 kB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: twine/5.1.1 CPython/3.9.19

File hashes

Hashes for pattern_buffer-0.0.4.tar.gz
Algorithm	Hash digest
SHA256	`098229fa4553a01f9b6eff9488906f68c7a1a40cf8c076b9bb84c591bb418398`
MD5	`6c389c902f2402e66b29eb53b93e4537`
BLAKE2b-256	`1e0f64e811acec60ba7338e527d7063111468db00d88ca9f7362a099f6bf294d`

File details

Details for the file pattern_buffer-0.0.4-py3-none-any.whl.

File metadata

Download URL: pattern_buffer-0.0.4-py3-none-any.whl
Upload date: Aug 22, 2024
Size: 5.9 kB
Tags: Python 3
Uploaded using Trusted Publishing? No
Uploaded via: twine/5.1.1 CPython/3.9.19

File hashes

Hashes for pattern_buffer-0.0.4-py3-none-any.whl
Algorithm	Hash digest
SHA256	`bec7adf61bf9ce619c1575482bcadefb01c97334e94825e2660b470fa69c0e0a`
MD5	`e56fed8ebde36127e133aaa6b0bd67c7`
BLAKE2b-256	`cdd199163b28eae3eb42a007a6df44dc19ec4000f1740a36be4e0000a3c4baa6`
