genvarloader

Pipeline for efficient genomic data processing.

Project description

GenVarLoader provides a fast, memory-efficient data loader for training sequence models on genetic variation. For example, it can be used to train a DNA language model on human genetic variation (e.g. Nucleotide Transformer) or to train sequence-to-function models with genetic variation (e.g. BigRNA).

Features

Avoids writing any sequences to disk
Generates haplotypes up to 1,000 times faster than reading a FASTA file
Generates tracks up to 450 times faster than reading a BigWig
Supports indels and re-aligns tracks to haplotypes that have them
Extensible to new file formats: drop a feature request! Currently supports VCF, PGEN, and BigWig

Tutorial

Installation

pip install genvarloader

PyTorch is not installed as a dependency because installing it may require platform-specific instructions, so install it separately.

Write a `gvl.Dataset`

GenVarLoader has both a CLI and Python API for writing datasets. The Python API provides some extra flexibility, for example for a multi-task objective.

genvarloader cool_dataset.gvl interesting_regions.bed --variants cool_variants.vcf --bigwig-table samples_to_bigwigs.csv --length 2048 --max-jitter 128

Here, `samples_to_bigwigs.csv` has the columns `sample` and `path`, which map each sample to its BigWig file.

This could equivalently be done in Python as:

import genvarloader as gvl

gvl.write(
    path="cool_dataset.gvl",
    bed="interesting_regions.bed",
    variants="cool_variants.vcf",
    bigwigs=gvl.BigWigs.from_table("bigwig", "samples_to_bigwigs.csv"),
    length=2048,
    max_jitter=128,
)
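
As a minimal sketch of the table described above, the following builds CSV text with the `sample` and `path` columns using only the standard library. The sample names are taken from the examples in this README; the BigWig paths and the helper name `table_text` are hypothetical and only serve as an illustration.

```python
import csv
import io

# Hypothetical BigWig paths, for illustration only.
sample_to_bigwig = {
    "David": "bigwigs/David.bw",
    "Aaron": "bigwigs/Aaron.bw",
}


def table_text(mapping):
    """Render a sample -> BigWig path mapping as CSV text with `sample` and `path` columns."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(["sample", "path"])
    for sample, path in mapping.items():
        writer.writerow([sample, path])
    return buffer.getvalue()


csv_text = table_text(sample_to_bigwig)
print(csv_text)

# To save the table next to your data, write the text to a file, e.g.:
# with open("samples_to_bigwigs.csv", "w", newline="") as f:
#     f.write(csv_text)
```

Building the text first and writing it in a separate step makes it easy to check the header and rows before anything touches disk.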

Open a `gvl.Dataset` and get a PyTorch DataLoader

import genvarloader as gvl

dataset = gvl.Dataset.open(path="cool_dataset.gvl", reference="hg38.fa")
train_samples = ["David", "Aaron"]
train_dataset = dataset.subset_to(regions="train_regions.bed", samples=train_samples)
train_dataloader = train_dataset.to_dataloader(batch_size=32, shuffle=True, num_workers=1)

# use it in your training loop
for haplotypes, tracks in train_dataloader:
    ...

Inspect specific instances

dataset[99]  # 100th instance of the raveled dataset
dataset[0, 9]  # first region, 10th sample
dataset.isel(regions=0, samples=9)
dataset.sel(regions=dataset.get_bed()[0], samples=dataset.samples[9])
dataset[:10]  # first 10 instances
dataset[:10, :5]  # first 10 regions and 5 samples
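
To see how a flat index such as `dataset[99]` could relate to a (region, sample) pair, here is a small NumPy sketch. It assumes the dataset is raveled in row-major order over (regions, samples), which is the usual NumPy meaning of "raveled" but is not confirmed here; the dataset shape is hypothetical.

```python
import numpy as np

# Hypothetical dataset shape, for illustration only.
n_regions, n_samples = 20, 10

# Assumption: flat indices run over samples fastest, i.e. row-major order
# over (regions, samples).
flat_index = 99
region, sample = np.unravel_index(flat_index, (n_regions, n_samples))
print(region, sample)  # 9 9

# Converting back recovers the flat index.
round_trip = np.ravel_multi_index((region, sample), (n_regions, n_samples))
print(round_trip)  # 99
```

Under this assumption, `dataset[99]` and `dataset[9, 9]` would refer to the same instance for a dataset of this shape.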

Transform the data on-the-fly

import seqpro as sp
from einops import rearrange

def transform(haplotypes, tracks):
    ohe = sp.DNA.ohe(haplotypes)
    ohe = rearrange(ohe, "batch length alphabet -> batch alphabet length")
    return ohe, tracks

transformed_dataset = dataset.with_settings(transform=transform)
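
To make the shapes in this transform concrete, here is a NumPy-only sketch of one-hot encoding followed by the same axis swap. It is an illustration of the idea, not how `seqpro` implements it, and it assumes haplotypes are single-byte characters of shape (batch, length) and that the alphabet order is A, C, G, T. The example sequences are made up.

```python
import numpy as np

# Two hypothetical haplotypes of length 5, stored as single-byte characters.
haplotypes = np.array([list("ACGTA"), list("TTGAC")], dtype="S1")  # (batch, length)

# View the bytes as integers so they can be compared against the alphabet.
codes = haplotypes.view(np.uint8)                    # (batch, length)
alphabet = np.frombuffer(b"ACGT", dtype=np.uint8)    # assumed order: A, C, G, T

# Compare every base against every alphabet letter: (batch, length, alphabet).
ohe = (codes[..., None] == alphabet).astype(np.uint8)

# Same axis swap as "batch length alphabet -> batch alphabet length".
channels_first = ohe.transpose(0, 2, 1)

print(ohe.shape, channels_first.shape)  # (2, 5, 4) (2, 4, 5)
```

The channels-first layout is the one that PyTorch 1D convolutions expect, which is the usual reason for the rearrangement.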

Pre-computing transformed tracks

Suppose we want to return tracks that are the z-scored, log(CPM + 1) version of the original. Sometimes it is better to write this to disk to avoid having to recompute it during training or inference.

import numpy as np

# We'll assume we already have an array of total counts for each sample.
# This usually can't be derived from a gvl.Dataset since it only has data for specific regions.
total_counts = np.load('total_counts.npy')  # shape: (samples) float32

# We'll compute the mean and std log(CPM + 1) using the training split
means = np.empty((train_dataset.n_regions, train_dataset.region_length), np.float32)
stds = np.empty_like(means)
just_tracks = train_dataset.with_settings(return_sequences=False, jitter=0)
for region in range(len(means)):
    # tracks for one region across all samples -> log(CPM + 1)
    log_cpm = np.log1p(just_tracks[region, :] / total_counts[:, None] * 1e6)
    means[region] = log_cpm.mean(0)
    stds[region] = log_cpm.std(0)

# Define our transformation
def z_log_cpm(dataset_indices, region_indices, sample_indices, tracks: gvl.Ragged[np.float32]):
    # In the event that the dataset only has SNPs, the full length tracks will all be the same length.
    # So, we can reshape the ragged data into a regular array.
    _tracks = tracks.data.reshape(-1, dataset.region_length)

    # Otherwise, we would have to leave `tracks` as a gvl.Ragged array to accommodate different lengths.
    # In that case, we could do the transformation with a Numba compiled function instead.

    # original tracks -> log(CPM + 1) -> z-score
    _tracks = np.log1p(_tracks / total_counts[sample_indices, None] * 1e6)
    _tracks = (_tracks - means[region_indices]) / stds[region_indices]

    return gvl.Ragged.from_offsets(_tracks.ravel(), tracks.shape, tracks.offsets)

# This can take about as long as writing the original tracks or longer, depending on the transformation.
dataset_with_zlogcpm = dataset.write_transformed_track("z-log-cpm", "bigwig", transform=z_log_cpm)

# The dataset now has both tracks available, "bigwig" and "z-log-cpm", and we can choose to return either one or both.
haps_and_zlogcpm = dataset_with_zlogcpm.with_settings(return_tracks="z-log-cpm")

# If we re-opened the dataset after running this then we could write...
dataset = gvl.Dataset.open("cool_dataset.gvl", "hg38.fa", return_tracks="z-log-cpm")
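
The arithmetic behind the z-scored log(CPM + 1) transform can be checked on a small synthetic array without GenVarLoader. This is a sketch under stated assumptions: the counts and library sizes are randomly generated or made up, and the guard against zero standard deviation is an addition of this sketch, not part of the tutorial code above.

```python
import numpy as np

rng = np.random.default_rng(0)
n_samples, length = 4, 6

# Hypothetical raw coverage for one region and per-sample library sizes.
tracks = rng.poisson(5.0, size=(n_samples, length)).astype(float)
total_counts = np.array([1e6, 2e6, 5e5, 4e6])

# raw counts -> log(CPM + 1)
log_cpm = np.log1p(tracks / total_counts[:, None] * 1e6)

# Per-position statistics across samples.
mean = log_cpm.mean(0)
std = log_cpm.std(0)

# Guard against positions with zero variance (an addition in this sketch).
safe_std = np.where(std > 0, std, 1.0)
z = (log_cpm - mean) / safe_std

print(z.shape)
```

After the transform, each position has mean zero across samples and, wherever the original standard deviation was nonzero, unit standard deviation.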

Performance tips

GenVarLoader uses multithreading extensively, so it's best to use 0 or 1 workers with your PyTorch DataLoader.
A GenVarLoader Dataset is most efficient when given batches of indices rather than one index at a time. By default, a PyTorch DataLoader requests one index at a time, so if you want to use a custom PyTorch Sampler, wrap it with a PyTorch BatchSampler before passing it to Dataset.to_dataloader().
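
As a sketch of the wrapping step, the following uses PyTorch's `BatchSampler` around a `SequentialSampler`, which stands in for a custom sampler. It requires PyTorch to be installed. How the resulting batch sampler is passed to `Dataset.to_dataloader()` is not shown because the exact argument is not specified here; check the method's signature.

```python
from torch.utils.data import BatchSampler, SequentialSampler

# Stand-in for a custom sampler over 10 dataset indices.
sampler = SequentialSampler(range(10))

# Group the sampler's indices into lists of up to 4 indices each.
batch_sampler = BatchSampler(sampler, batch_size=4, drop_last=False)

batches = list(batch_sampler)
print(batches)  # [[0, 1, 2, 3], [4, 5, 6, 7], [8, 9]]
```

Each element yielded by the batch sampler is a list of indices, so the dataset receives a whole batch per request instead of a single index.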
