
torchmers

Efficient k-mer counting built on PyTorch primitives, with out-of-the-box GPU support.

Features:

  • Blazingly fast. Counts 9-mers over the entire human genome in about 30 seconds on a desktop GPU.
  • Flexible. User-friendly Python API; a CLI is coming soon.
  • GPU acceleration. Supports Nvidia GPUs and most other PyTorch hardware backends.
  • Deep learning ready. Integrates easily into PyTorch-based deep learning models.
  • Format agnostic. Works on any kind of sequence. Built-in tokenizers cover the 4-letter DNA and 20-letter amino acid alphabets, and natural language can be used as well.

Motivation

K-mer frequency analysis is a simple yet highly important tool for genomics. Practitioners typically have to choose between two options: (1) implementations in scripting languages such as Python that are slow but flexible, or (2) tools in compiled languages that are efficient but might have limiting APIs.

Why can't we get the best of both worlds? PyTorch and similar frameworks were developed specifically for deep learning, but at their core they offer highly efficient implementations of numerical operations, optimized for vectorized execution on CPUs and numerous hardware accelerators. With a bit of hacking, these primitives can be applied to compute-intensive problems in genomics such as k-mer counting, which are typically not treated as numerical problems.

How does it work?

torchmers is based on two primitive PyTorch operations: unfold and scatter_add. unfold extracts sliding local blocks from a tensor, similar to the windows used by a convolution layer. Each value within a block is weighted according to its position, and the weighted values are summed up to produce a distinct index for every possible k-mer. scatter_add is then used to accumulate counts at these k-mer indices in a count tensor. Some additional hacking is required to handle padding and varying sequence lengths, but that is a topic for another time.
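
To make this concrete, here is a minimal, self-contained sketch of the unfold-plus-scatter_add idea, assuming a 4-letter DNA alphabet and a base-4 indexing convention. It illustrates the technique only; it is not torchmers' actual implementation, and the function name count_k_mers_sketch is made up for this example.

import torch

def count_k_mers_sketch(tokens: torch.Tensor, k: int, alphabet_size: int = 4) -> torch.Tensor:
    # unfold extracts every sliding window of length k: shape [num_windows, k]
    windows = tokens.unfold(0, k, 1)
    # Weight each position by a power of the alphabet size so that each
    # window sums to a unique index in [0, alphabet_size ** k)
    weights = alphabet_size ** torch.arange(k - 1, -1, -1, device=tokens.device)
    indices = (windows * weights).sum(dim=-1)
    # scatter_add accumulates one count per k-mer index
    counts = torch.zeros(alphabet_size ** k, dtype=torch.long, device=tokens.device)
    counts.scatter_add_(0, indices, torch.ones_like(indices))
    return counts

tokens = torch.randint(0, 4, (1000,))          # a random, already numericalized "DNA" sequence
print(count_k_mers_sketch(tokens, k=5).sum())  # 996 = 1000 - 5 + 1 windows counted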

Usage

Installation

pip install torchmers

Functional API

The main k-mer counting logic of the package is implemented by the count_k_mers function. It expects sequences that are already tokenized, or rather numericalized, provided as either NumPy arrays or PyTorch tensors. A Tokenizer can be used to convert a biological sequence into this format.

from torchmers import Tokenizer, count_k_mers

sequence = 'CGCTATAAAAGGGC'

tokenizer = Tokenizer.from_name('DNA')
tokens = tokenizer.encode(sequence)

counts = count_k_mers(tokens, k=5)
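
Since every k-mer corresponds to a single integer index, the returned counts can be inspected with ordinary tensor operations. The snippet below is a rough sketch that recovers the most frequent 5-mer; the base-4, most-significant-digit-first index convention and the A/C/G/T alphabet order are assumptions that may differ from the Tokenizer's actual mapping.

import torch

from torchmers import Tokenizer, count_k_mers

# Assumed conventions (not confirmed by the torchmers docs): k-mer index is
# sum(token_i * 4 ** (k - 1 - i)) and the DNA alphabet order is A, C, G, T.
alphabet = 'ACGT'
k = 5

tokenizer = Tokenizer.from_name('DNA')
tokens = tokenizer.encode('CGCTATAAAAGGGC')
counts = torch.as_tensor(count_k_mers(tokens, k=k))

top_index = int(counts.argmax())
digits = [(top_index // 4 ** (k - 1 - i)) % 4 for i in range(k)]
print(''.join(alphabet[d] for d in digits))  # most frequent 5-mer in the sequence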

GPU support

torchmers uses PyTorch primitives under the hood. As long as PyTorch implements the unfold and scatter_add operations, among others, for a given hardware backend, that backend can in principle be used for accelerated k-mer counting. This includes, of course, Nvidia GPUs, TPUs and possibly also Apple M-series chips via the MPS backend, although I don't have the hardware to test the latter.

The count_k_mers function will simply use the device of the sequence tensor it is given, but this can also be overridden with the device argument:

import torch

# Both options are equivalent
tokens = torch.tensor(tokenizer.encode(sequence), device='cuda')
counts = count_k_mers(tokens, k=5)
# or, equivalently, pass the device explicitly
counts = count_k_mers(tokens, k=5, device='cuda')

PyTorch Module API

torchmers provides a PyTorch Module which can be integrated into neural networks to encode (DNA) sequences into k-mer spectra.

import torch

from torchmers.modules import KMerFrequencyEncoder

sequences = torch.randint(0, 4, (32, 256))

encoder = KMerFrequencyEncoder(k=5)
encoder(sequences)  # [32, 4 ** 5]

Note that KMerFrequencyEncoder expects input sequences to be numericalized and formatted as PyTorch tensors.
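
For context, here is a hedged sketch of how the encoder might be dropped into a small model. The classifier head, layer sizes and class count are illustrative assumptions; only the [batch_size, 4 ** k] output shape comes from the example above.

import torch
import torch.nn as nn

from torchmers.modules import KMerFrequencyEncoder

class KMerClassifier(nn.Module):
    """Toy model: k-mer spectrum followed by a linear classification head."""

    def __init__(self, k: int = 5, num_classes: int = 2):
        super().__init__()
        self.encoder = KMerFrequencyEncoder(k=k)
        self.head = nn.Linear(4 ** k, num_classes)  # one input feature per possible k-mer

    def forward(self, sequences: torch.Tensor) -> torch.Tensor:
        spectra = self.encoder(sequences).float()  # [batch_size, 4 ** k]; cast for the linear layer
        return self.head(spectra)

model = KMerClassifier(k=5)
logits = model(torch.randint(0, 4, (32, 256)))  # [32, 2]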

Padding

If sequences have been padded, a tensor of sequence lengths with shape [batch_size] can be passed as the seq_lens argument to KMerFrequencyEncoder.forward(). Alternatively, a boolean mask of shape [batch_size, seq_len] is supported as well, where tokens marked with False are not considered for k-mer counting.
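
A brief sketch of the first variant, reusing the batch shapes from the module example above; the padding value and the random lengths are illustrative.

import torch

from torchmers.modules import KMerFrequencyEncoder

encoder = KMerFrequencyEncoder(k=5)

sequences = torch.randint(0, 4, (32, 256))        # padded batch (padding value is illustrative)
seq_lens = torch.randint(100, 257, (32,))         # true length of each sequence

# Positions beyond each sequence's length are ignored for counting
spectra = encoder(sequences, seq_lens=seq_lens)   # [32, 4 ** 5]

# A boolean mask of shape [32, 256] (False = excluded) is supported as well,
# e.g. torch.arange(256)[None, :] < seq_lens[:, None]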

Examples

The examples/ folder contains Jupyter notebooks with different usage examples.

Benchmarks

[Figure: torchmers performance benchmark]

Note: The "python native" implementation is certainly not highly optimized, but it should represent a simple Python baseline. The GPU benchmarks were run on an RTX 4090.

Contributing

Contributions are welcome. Please run the tests with python -m pytest tests/ before submitting your work.

Future work

This project started out of frustration with how slow on-the-fly computation of k-mer frequency spectra is in pure Python for deep learning applications. I don't have many other personal use cases, but I would be glad if users could provide me with their feature requests.
