torchmers
Efficient k-mer counting implemented with PyTorch primitives and out-of-the-box GPU support.
Features:
- ⚡ Blazingly fast. Count 9-mers over the entire human genome in just about 30 seconds on a desktop GPU.
- ⚡ Flexible. User-friendly Python API. CLI soon to come.
- ⚡ GPU acceleration. Supports Nvidia GPUs and most other PyTorch hardware backends.
- ⚡ Deep Learning ready. Easy integration into PyTorch-based deep learning models.
- ⚡ Format agnostic. Works on any kind of sequence. Built-in tokenizers include 4-letter DNA and 20-letter amino acid codes. Natural language can be used as well.
Motivation
K-mer frequency analysis is a simple yet highly important tool for genomics. Practitioners typically have to choose between two options: (1) implementations in scripting languages such as Python that are slow but flexible, or (2) tools in compiled languages that are efficient but might have limiting APIs.
Why can't we get the best of both worlds? Although PyTorch and similar frameworks were developed specifically for deep learning, they offer highly efficient implementations of numerical operations, optimized for vectorized execution on CPUs and numerous hardware accelerators. With a bit of hacking, these primitive operations can be utilized for compute-intensive problems in genomics such as k-mer counting, which are typically not treated as numerical problems.
How does it work?
torchmers is based on two primitive PyTorch operations: unfold and scatter_add. unfold extracts sliding local blocks from a tensor, similar to the windows of a convolutional layer. Each value within a block is offset depending on its position, and all values within a block are summed to produce a distinct index for each possible k-mer. scatter_add is then used to increment a count tensor at each of these k-mer indices. Some additional hacking is required for handling padding or varying sequence lengths, but this is a topic for another time.
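The idea can be sketched in a few lines of plain PyTorch. This is a minimal illustration, not the actual torchmers implementation: the function name `count_k_mers_sketch`, the `alphabet_size` parameter, and the choice of powers of the alphabet size as positional offsets are assumptions made for this example. It handles a single unpadded sequence with at least k tokens.

```python
import torch


def count_k_mers_sketch(tokens: torch.Tensor, k: int, alphabet_size: int = 4) -> torch.Tensor:
    """Count k-mers in a 1-D tensor of integer tokens in [0, alphabet_size).

    Hypothetical helper for illustration; not part of torchmers.
    """
    tokens = tokens.long()
    # Sliding windows of length k with step 1: shape [n - k + 1, k].
    windows = tokens.unfold(0, k, 1)
    # Positional weights alphabet_size^(k-1), ..., alphabet_size^0 (assumed encoding).
    weights = torch.tensor(
        [alphabet_size ** i for i in range(k - 1, -1, -1)],
        dtype=torch.long,
        device=tokens.device,
    )
    # The weighted sum maps every window to a unique index in [0, alphabet_size^k).
    indices = (windows * weights).sum(dim=-1)
    counts = torch.zeros(alphabet_size ** k, dtype=torch.long, device=tokens.device)
    # Add 1 to the count tensor at the index of every observed k-mer.
    counts.scatter_add_(0, indices, torch.ones_like(indices))
    return counts


# 'ACGTA' encoded with the assumed mapping A=0, C=1, G=2, T=3.
example_counts = count_k_mers_sketch(torch.tensor([0, 1, 2, 3, 0]), k=2)
```

Because every step is a tensor operation, the same code runs on whatever device holds `tokens`.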
Usage
Installation
pip install torchmers
Functional API
The main k-mer counting logic of the package is implemented by the count_k_mers function. It expects sequences to be already tokenized (that is, numericalized) and provided as either NumPy arrays or PyTorch tensors. To convert a biological sequence into this format, a Tokenizer can be used.
from torchmers import Tokenizer, count_k_mers
sequence = 'CGCTATAAAAGGGC'
tokenizer = Tokenizer.from_name('DNA')
tokens = tokenizer.encode(sequence)
counts = count_k_mers(tokens, k=5)
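To make "numericalized" concrete, here is a small pure-Python sketch of what a DNA tokenizer does and how a position in a count vector could be mapped back to a k-mer. The alphabet order `ACGT`, the helper names `encode_dna` and `index_to_k_mer`, and the most-significant-letter-first index layout are assumptions for illustration; the vocabulary and layout actually used by torchmers may differ.

```python
# Assumed alphabet order; Tokenizer.from_name('DNA') may use a different one.
DNA_ALPHABET = 'ACGT'


def encode_dna(sequence):
    """Map each base to its integer position in the alphabet (hypothetical helper)."""
    return [DNA_ALPHABET.index(base) for base in sequence.upper()]


def index_to_k_mer(index, k, alphabet=DNA_ALPHABET):
    """Decode a count-vector index back into a k-mer string (assumed layout)."""
    letters = []
    for _ in range(k):
        index, remainder = divmod(index, len(alphabet))
        letters.append(alphabet[remainder])
    return ''.join(reversed(letters))


example_tokens = encode_dna('CGCTATAAAAGGGC')
```

With a 4-letter alphabet and k=5, a full count vector has 4 ** 5 = 1024 entries, and a sequence of 14 tokens contributes 14 - 5 + 1 = 10 k-mers.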
GPU support
torchmers uses PyTorch primitives under the hood. As long as PyTorch implements the unfold and scatter_add operations, among others, for some hardware backend, it can in principle be used for accelerated k-mer counting. This includes, of course, Nvidia GPUs, TPUs and possibly also Apple M-series chips, although I don't have the hardware to test the latter.
count_k_mers simply uses the device of the sequence tensor it has been given, but this can also be overridden with the device argument:
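import torch
# Both options are equivalent
tokens = torch.tensor(tokenizer.encode(sequence), device='cuda')
counts = count_k_mers(tokens, k=5)  # uses the device of `tokens`
counts = count_k_mers(tokenizer.encode(sequence), k=5, device='cuda')  # explicit device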
PyTorch Module API
torchmers provides a PyTorch Module which can be integrated into neural networks to encode (DNA) sequences into k-mer spectra.
import torch
import torch.nn as nn
from torchmers.modules import KMerFrequencyEncoder
sequences = torch.randint(0, 4, (32, 256))  # random DNA tokens in {0, 1, 2, 3}; upper bound is exclusive
encoder = KMerFrequencyEncoder(k=5)
encoder(sequences) # [32, 4 ** 5]
Note that KMerFrequencyEncoder expects input sequences to be numericalized and formatted as PyTorch tensors.
Padding
If sequences have been padded, a tensor of sequence lengths with shape [batch_size] can be passed as the seq_lens argument to KMerFrequencyEncoder.forward(). Alternatively, a boolean mask of shape [batch_size, seq_len] is supported as well, where tokens marked with False will not be considered for k-mer counting.
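The two padding descriptions carry the same information for right-padded batches. The sketch below shows how a tensor of lengths could be turned into the equivalent boolean mask in plain PyTorch. The helper name `lengths_to_mask` is hypothetical, and whether torchmers performs this conversion internally is not stated here.

```python
import torch


def lengths_to_mask(seq_lens: torch.Tensor, max_len: int) -> torch.Tensor:
    """Build a [batch_size, max_len] mask that is True for real tokens.

    Hypothetical helper; assumes padding is appended at the end of each sequence.
    """
    positions = torch.arange(max_len, device=seq_lens.device)
    # Broadcast [1, max_len] against [batch_size, 1].
    return positions.unsqueeze(0) < seq_lens.unsqueeze(1)


example_mask = lengths_to_mask(torch.tensor([3, 1]), max_len=4)
```

A mask is the more general form, since it can also mark tokens in the middle of a sequence as excluded.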
Examples
The examples/ folder contains Jupyter notebooks with different usage examples.
Benchmarks
Note: The "python native" implementation is certainly not highly optimized but should represent a simple Python baseline. The GPU benchmarks were run on an RTX 4090.
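For orientation, a simple pure-Python baseline of this kind might look like the sketch below. It is an assumption about what such a baseline typically looks like, not the code used for the benchmarks; the name `count_k_mers_python` is hypothetical.

```python
from collections import Counter


def count_k_mers_python(sequence, k):
    """Count k-mers by slicing the string at every position (hypothetical baseline)."""
    return Counter(sequence[i:i + k] for i in range(len(sequence) - k + 1))


baseline_counts = count_k_mers_python('CGCTATAAAAGGGC', 5)
```

Each k-mer is materialized as a separate string and hashed, which is what makes this approach slow on genome-scale inputs.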
Contributing
Contributions are welcome. Please run the tests with python -m pytest tests/ before submitting your work.
Future work
This project started out of frustration with slow on-the-fly computation of k-mer frequency spectra for deep learning applications when done in pure Python. I don't have very many other personal use cases, but I would be glad if users could provide me with their feature requests.