Pipeline for efficient genomic data processing.
GenVarLoader
GenVarLoader provides a fast, memory-efficient data loader for training sequence models on genetic variation. For example, it can be used to train a DNA language model on human genetic variation (e.g. Nucleotide Transformer) or to train sequence-to-function models with genetic variation (e.g. BigRNA).
Features
- Avoids writing any sequences to disk
- Generates haplotypes up to 1,000 times faster than reading a FASTA file
- Generates tracks up to 450 times faster than reading a BigWig
- Supports indels and re-aligns tracks to haplotypes that have them
- Extensible to new file formats: drop a feature request! Currently supports VCF, PGEN, and BigWig
Tutorial
Installation
pip install genvarloader
PyTorch is not included as a dependency since installing it may require platform-specific instructions.
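For example, one might install PyTorch first and then GenVarLoader; the plain `pip install torch` below is just one possibility, so consult pytorch.org for a command matched to your platform and CUDA version:
pip install torch  # see pytorch.org for a platform/CUDA-specific command
pip install genvarloader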
Write a gvl.Dataset
GenVarLoader has both a CLI and a Python API for writing datasets. The Python API provides some extra flexibility, such as support for multi-task objectives.
genvarloader cool_dataset.gvl interesting_regions.bed --variants cool_variants.vcf --bigwig-table samples_to_bigwigs.csv --length 2048 --max-jitter 128
Where samples_to_bigwigs.csv has columns sample and path, mapping each sample to its BigWig.
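For illustration, such a table might look like the following (the sample names and paths are hypothetical):
sample,path
sample_A,/data/bigwigs/sample_A.bw
sample_B,/data/bigwigs/sample_B.bw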
This could equivalently be done in Python as:
import genvarloader as gvl

gvl.write(
    path="cool_dataset.gvl",
    bed="interesting_regions.bed",
    variants="cool_variants.vcf",
    bigwigs=gvl.BigWigs.from_table("bigwig", "samples_to_bigwigs.csv"),
    length=2048,
    max_jitter=128,
)
Open a gvl.Dataset and get a PyTorch DataLoader
import genvarloader as gvl

dataset = gvl.Dataset.open(path="cool_dataset.gvl", reference="hg38.fa")
train_dataset = dataset.subset_to(regions=train_regions, samples=train_samples)
train_dataloader = train_dataset.to_dataloader(batch_size=32, shuffle=True, num_workers=1)

# use it in your training loop
for haplotypes, tracks in train_dataloader:
    ...
Inspect specific instances
dataset[99]  # 100th instance of the raveled dataset
dataset[0, 9]  # first region, 10th sample
dataset.isel(regions=0, samples=9)  # same instance, selected by integer position
dataset.sel(regions=dataset.get_bed()[0], samples=dataset.samples[9])  # same instance, selected by label
dataset[:10]  # first 10 instances
dataset[:10, :5]  # first 10 regions and 5 samples
Transform the data on-the-fly
import seqpro as sp
from einops import rearrange

def transform(haplotypes, tracks):
    ohe = sp.DNA.ohe(haplotypes)
    ohe = rearrange(ohe, "batch length alphabet -> batch alphabet length")
    return ohe, tracks

transformed_dataset = dataset.with_settings(transform=transform)
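As a minimal usage sketch (the batch size here is an arbitrary choice), the transformed dataset can be wrapped in a DataLoader like any other:
train_dataloader = transformed_dataset.to_dataloader(batch_size=32, shuffle=True)
for ohe_haplotypes, tracks in train_dataloader:
    # ohe_haplotypes is now one-hot encoded with shape (batch, alphabet, length), per the rearrange above
    ...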
Pre-computing transformed tracks
Suppose we want to return tracks that are the z-scored, log(CPM + 1) version of the original. Sometimes it is better to write this to disk to avoid having to recompute it during training or inference.
import numpy as np

# We'll assume we already have an array of total counts for each sample.
# This usually can't be derived from a gvl.Dataset since it only has data for specific regions.
total_counts = np.load('total_counts.npy')  # shape: (samples) float32

# We'll compute the mean and std of log(CPM + 1) using the training split
means = np.empty((train_dataset.n_regions, train_dataset.region_length), np.float32)
stds = np.empty_like(means)
just_tracks = train_dataset.with_settings(return_sequences=False, jitter=0)
for i in range(len(means)):
    cpm = np.log1p(just_tracks[i, :] / total_counts[:, None])
    means[i] = cpm.mean(0)
    stds[i] = cpm.std(0)
# Define our transformation
def z_log_cpm(dataset_indices, region_indices, sample_indices, tracks: gvl.Ragged[np.float32]):
    # In the event that the dataset only has SNPs, the full-length tracks will all be the same length,
    # so we can reshape the ragged data into a regular array.
    _tracks = tracks.data.reshape(-1, dataset.region_length)
    # Otherwise, we would have to leave `tracks` as a gvl.Ragged array to accommodate different lengths.
    # In that case, we could do the transformation with a Numba-compiled function instead (see the sketch below).
    # original tracks -> log(CPM + 1) -> z-score
    _tracks = np.log1p(_tracks / total_counts[sample_indices, None])
    _tracks = (_tracks - means[region_indices]) / stds[region_indices]
    return gvl.Ragged.from_offsets(_tracks.ravel(), tracks.shape, tracks.offsets)
# This can take about as long as writing the original tracks or longer, depending on the transformation.
dataset_with_zlogcpm = dataset.write_transformed_track("z-log-cpm", "bigwig", transform=z_log_cpm)
# The dataset now has both tracks available, "bigwig" and "z-log-cpm", and we can choose to return either one or both.
haps_and_zlogcpm = dataset_with_zlogcpm.with_settings(return_tracks="z-log-cpm")
# If we re-opened the dataset after running this then we could write...
dataset = gvl.Dataset.open("cool_dataset.gvl", "hg38.fa", return_tracks="z-log-cpm")
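As mentioned in the comment above, truly ragged tracks (e.g. when indels change track lengths) could be handled with a Numba-compiled function. The following is only a rough sketch of that idea, not part of GenVarLoader's API: the helper names are hypothetical, the offsets layout (length n + 1 with entry i spanning data[offsets[i]:offsets[i + 1]]) is an assumption, and the z-scoring step is omitted for brevity since per-position means and stds would also need ragged handling.
import numba as nb
import numpy as np

@nb.njit(parallel=True, cache=True)
def log_cpm_ragged(data, offsets, sample_indices, total_counts):
    # data holds the flat values of the ragged array; entry i spans data[offsets[i]:offsets[i + 1]]
    out = np.empty_like(data)
    for i in nb.prange(len(offsets) - 1):
        start, end = offsets[i], offsets[i + 1]
        out[start:end] = np.log1p(data[start:end] / total_counts[sample_indices[i]])
    return out

def ragged_log_cpm(dataset_indices, region_indices, sample_indices, tracks):
    # Transform each ragged entry independently and rebuild a gvl.Ragged with the same offsets.
    data = log_cpm_ragged(tracks.data, tracks.offsets, sample_indices, total_counts)
    return gvl.Ragged.from_offsets(data, tracks.shape, tracks.offsets)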
Performance tips
- GenVarLoader uses multithreading extensively, so it's best to use 0 or 1 workers with your PyTorch DataLoader.
- A GenVarLoader Dataset is most efficient when given batches of indices, rather than one index at a time. A PyTorch DataLoader uses one index at a time by default, so if you want to use a custom PyTorch Sampler you should wrap it with a PyTorch BatchSampler before passing it to Dataset.to_dataloader() (see the sketch below).
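A rough sketch of the sampler-wrapping idea, assuming to_dataloader() forwards a sampler keyword to the underlying torch.utils.data.DataLoader (the `sampler=` keyword name is an assumption, not a documented signature):
from torch.utils.data import BatchSampler, RandomSampler

# Any custom torch.utils.data.Sampler works here; RandomSampler is just a stand-in.
sampler = RandomSampler(train_dataset)
batch_sampler = BatchSampler(sampler, batch_size=32, drop_last=False)
# NOTE: the `sampler=` keyword is an assumption about to_dataloader()'s signature.
train_dataloader = train_dataset.to_dataloader(sampler=batch_sampler, num_workers=1)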
File details
Details for the file genvarloader-0.4.0.tar.gz.
File metadata
- Download URL: genvarloader-0.4.0.tar.gz
- Upload date:
- Size: 137.6 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: maturin/1.5.1
File hashes
Algorithm | Hash digest
---|---
SHA256 | a74b8b7acad7c980c1cf9f88a760d9fcfc9fa8c98d7259d6d0133ea5c96f54d9
MD5 | a24419565be44985352e7028d405dab6
BLAKE2b-256 | 7bc0518dda75616636ae243f867ad184c6fc94cc46745bc2fe731febee34b702
File details
Details for the file genvarloader-0.4.0-cp39-abi3-manylinux_2_28_x86_64.whl.
File metadata
- Download URL: genvarloader-0.4.0-cp39-abi3-manylinux_2_28_x86_64.whl
- Upload date:
- Size: 408.4 kB
- Tags: CPython 3.9+, manylinux: glibc 2.28+ x86-64
- Uploaded using Trusted Publishing? No
- Uploaded via: maturin/1.5.1
File hashes
Algorithm | Hash digest
---|---
SHA256 | 18c08f48e1cf98811913134c406c26fba27f96fe27f1ac5d79e6512350f3de5f
MD5 | 98f3ab53c58c4aede7db58338ca2d297
BLAKE2b-256 | 57c926b0013f1b95eca99dbcee9f2c9f634c3358d9f23f586f0823d33d77dd97