
torchmers

Efficient k-mer counting implemented with PyTorch primitives and out-of-the-box GPU support.

Features:

  • Blazingly fast. Count 9-mers over the entire human genome in about 30 seconds on a desktop GPU.
  • Flexible. User-friendly Python API; a CLI is coming soon.
  • GPU acceleration. Supports Nvidia GPUs and most other PyTorch hardware backends.
  • Deep learning ready. Integrates easily into PyTorch-based deep learning models.
  • Format agnostic. Works on any kind of sequence. Built-in tokenizers cover the 4-letter DNA alphabet and the 20-letter amino acid code; natural language can be used as well.

Motivation

K-mer frequency analysis is a simple yet highly important tool for genomics. Practitioners typically have to choose between two options: (1) implementations in scripting languages such as Python that are slow but flexible, or (2) tools in compiled languages that are efficient but might have limiting APIs.

Why can't we get the best of both worlds? PyTorch and similar frameworks were developed specifically for deep learning, but at their core they offer highly efficient implementations of numerical operations, optimized for vectorized execution on CPUs and numerous hardware accelerators. With a bit of hacking, these primitives can be applied to compute-intensive problems in genomics such as k-mer counting, which are typically not treated as numerical problems.

How does it work?

torchmers is based on two primitive PyTorch operations: unfold and scatter_add. unfold extracts sliding local blocks from a tensor, much like the sliding window of a convolutional layer. Each token within a block is weighted according to its position, and the weighted values are summed to yield a distinct integer index for each possible k-mer. scatter_add then simply adds one to a count tensor at each k-mer index. Some additional hacking is required to handle padding and varying sequence lengths, but that is a topic for another time.
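As a rough illustration, here is a minimal sketch of the idea in plain PyTorch. The function name and the base-alphabet-size positional weighting are my reconstruction of the scheme described above, not necessarily the library's exact implementation:

import torch

def count_k_mers_sketch(tokens, k, alphabet_size=4):
    # Sliding windows of length k over the token sequence: [num_windows, k]
    windows = tokens.unfold(0, k, 1)
    # Weight each position by a power of the alphabet size, so every
    # possible k-mer maps to a unique index in [0, alphabet_size ** k)
    weights = alphabet_size ** torch.arange(k - 1, -1, -1)
    indices = (windows * weights).sum(dim=-1)
    # Add one to the count tensor at each observed k-mer index
    counts = torch.zeros(alphabet_size ** k, dtype=torch.long)
    counts.scatter_add_(0, indices, torch.ones_like(indices))
    return counts

# e.g. count 3-mers over a random DNA-like token sequence
counts = count_k_mers_sketch(torch.randint(0, 4, (1000,)), k=3)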

Usage

Installation

pip install torchmers

Functional API

The main k-mer counting logic of the package is implemented by the count_k_mers function. It expects sequences that have already been tokenized, or rather numericalized, and provided as either NumPy arrays or PyTorch tensors. To convert a biological sequence into this format, a Tokenizer can be used.

from torchmers import Tokenizer, count_k_mers

sequence = 'CGCTATAAAAGGGC'

tokenizer = Tokenizer.from_name('DNA')
tokens = tokenizer.encode(sequence)

counts = count_k_mers(tokens, k=5)
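The result is a flat count tensor with one entry per possible 5-mer (4 ** 5 = 1024 for the 4-letter DNA alphabet), so ordinary tensor operations apply (assuming the returned counts are a PyTorch tensor), for example:

most_frequent = counts.argmax()  # index of the most frequent 5-mer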

GPU support

torchmers uses PyTorch primitives under the hood. As long as PyTorch implements the unfold and scatter_add operations, among others, for a given hardware backend, it can in principle be used for accelerated k-mer counting. This includes, of course, Nvidia GPUs and TPUs, and possibly also Apple M-series chips, although I don't have the hardware to test the latter.

The count_k_mers function will simply use the device of the sequence tensor it is given, but this can also be overridden with the device argument:

import torch

# Both options are equivalent:
# (1) create the token tensor directly on the GPU
tokens = torch.tensor(tokenizer.encode(sequence), device='cuda')
counts = count_k_mers(tokens, k=5)

# (2) pass the target device explicitly
counts = count_k_mers(tokens, k=5, device='cuda')

PyTorch Module API

torchmers provides a PyTorch Module which can be integrated into neural networks to encode (DNA) sequences into k-mer spectra.

import torch

from torchmers.modules import KMerFrequencyEncoder

# A batch of 32 numericalized sequences of length 256 over the 4-letter DNA alphabet
sequences = torch.randint(0, 4, (32, 256))

encoder = KMerFrequencyEncoder(k=5)
encoder(sequences)  # [32, 4 ** 5]

Note that KMerFrequencyEncoder expects input sequences to be numericalized and formatted as PyTorch tensors.
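For instance, the Tokenizer from the functional API can be used to build such a batch. A small sketch, assuming equal-length sequences so the rows stack into a single tensor:

from torchmers import Tokenizer

tokenizer = Tokenizer.from_name('DNA')

# Two equal-length DNA strings stacked into a [2, 256] batch tensor
batch = torch.tensor([tokenizer.encode(s) for s in ['ACGT' * 64, 'TGCA' * 64]])
counts = encoder(batch)  # [2, 4 ** 5]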

Padding

If sequences have been padded, a tensor of sequence lengths with shape [batch_size] can be passed as the seq_lens argument to KMerFrequencyEncoder.forward(). Alternatively, a boolean mask of shape [batch_size, seq_len] is supported as well, where tokens marked with False are not considered for k-mer counting.
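For example, continuing with the batch from the Module API section, where only the first 128 tokens of each sequence are assumed to be valid:

# Lengths variant: seq_lens is the argument documented above
seq_lens = torch.full((32,), 128)
counts = encoder(sequences, seq_lens=seq_lens)

# An equivalent boolean mask could be built like this
# (True marks tokens that are counted)
mask = torch.arange(256).unsqueeze(0) < seq_lens.unsqueeze(1)  # [32, 256]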

Examples

The examples/ folder contains Jupyter notebooks with different usage examples.

Benchmarks

[Figure: torchmers performance benchmark]

Note: The "python native" implementation is certainly not highly optimized, but it should represent a simple Python baseline. The GPU benchmarks were run on an RTX 4090.

Contributing

Contributions are welcome. Please run the tests with python -m pytest tests/ before submitting your work.

Future work

This project started out of frustration with how slow on-the-fly computation of k-mer frequency spectra for deep learning applications is when done in pure Python. I don't have many other personal use cases myself, but I would be glad if users could provide me with their feature requests.
