
Pipeline for efficient genomic data processing.

Project description

GenVarLoader

GenVarLoader provides a fast, memory-efficient data loader for training sequence models on genetic variation. For example, it can be used to train a DNA language model (e.g. Nucleotide Transformer) on human genetic variation.

Features

  • Respects memory budget
  • Supports insertions and deletions
  • Scales to 100,000s of individuals
  • Fast!
  • Extensible to new file formats (drop a feature request!)
  • Coming soon: re-aligning tracks (e.g. expression, chromatin accessibility) to genetic variation (e.g. BigRNA)

Installation

pip install genvarloader

PyTorch is not included as a dependency because its installation requires platform-specific instructions.
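For example, a typical install looks like the following, but use the command generated by the selector at pytorch.org for your platform and CUDA version:

pip install torch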

An optional dependency is TensorStore (version >= 0.1.50) for writing genotypes as a Zarr store and using TensorStore for I/O. This dramatically speeds up data loading when training a model on genetic variation, which requires approximately uniform random sampling across the genome. Standard bioinformatics variant formats like VCF, BCF, and PGEN unfortunately do not have a data layout conducive to this. TensorStore is not included as a dependency because of a dependency conflict that, within the scope of GenVarLoader, does not cause any issues. GenVarLoader is developed with Poetry, and I am waiting for the ability to override/ignore sub-dependencies before including TensorStore as an explicit dependency.
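If you want the TensorStore/Zarr workflow, you can install TensorStore yourself alongside GenVarLoader, e.g.:

pip install 'tensorstore>=0.1.50'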

Quick Start

import genvarloader as gvl

ref_fasta = 'reference.fasta'
variants = 'variants.pgen' # highly recommended to convert VCFs to PGEN (see the example below)
regions_of_interest = 'regions.bed'
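If your variants are in VCF, one way to convert them to PGEN is with plink2 (an illustrative command; adjust file names and options for your data):

plink2 --vcf variants.vcf.gz --make-pgen --out variants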

Create readers for each file providing sequence data:

ref = gvl.Fasta(name='ref', path=ref_fasta, pad='N')  # reference genome reader
var = gvl.Pgen(variants)  # genotype reader for the PGEN file
varseq = gvl.FastaVariants(name='varseq', reference=ref, variants=var)  # applies variants to the reference to yield variant-aware sequences

Put them together and get a PyTorch DataLoader:

gvloader = gvl.GVL(
    readers=varseq,  # reader(s) providing data for each batch
    bed=regions_of_interest,  # regions of interest (BED)
    fixed_length=1000,  # length of each returned window
    batch_size=16,
    max_memory_gb=8,  # memory budget for loading data
    batch_dims=['sample', 'ploid'],  # dimensions to batch over
    shuffle=True,
)

dataloader = gvloader.torch_dataloader()

And now you're ready to use the dataloader however you need to:

# implement your training loop
for batch in dataloader:
    ...
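As a minimal sketch, assuming each batch maps reader names to arrays (so the sequences produced by the 'varseq' reader would live under that key; confirm the structure of one batch before building on it):

for batch in dataloader:
    haps = batch['varseq']  # assumed key: the name given to the FastaVariants reader
    ...  # encode, forward pass, compute loss, optimizer step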



Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

genvarloader-0.3.0rc1.tar.gz (40.9 kB)

Uploaded Source

Built Distribution

genvarloader-0.3.0rc1-py3-none-any.whl (47.1 kB)

Uploaded Python 3

File details

Details for the file genvarloader-0.3.0rc1.tar.gz.

File metadata

  • Download URL: genvarloader-0.3.0rc1.tar.gz
  • Upload date:
  • Size: 40.9 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: poetry/1.7.1 CPython/3.8.18 Linux/4.18.0-477.21.1.el8_8.x86_64

File hashes

Hashes for genvarloader-0.3.0rc1.tar.gz
Algorithm    Hash digest
SHA256       1e506000657b8111e42740fa613b3403a9c73c7d7222fdadcc7812e422d717d3
MD5          9e8e4e67ed0021c38c829c8d0cb3d450
BLAKE2b-256  9aaff8cb82d83824f74e7c2f3c00cf1acbcc91be31f1a608a9e5b5c61ebff5f9
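To check a downloaded archive against the SHA256 digest above, you can compute the hash locally, for example:

sha256sum genvarloader-0.3.0rc1.tar.gz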


File details

Details for the file genvarloader-0.3.0rc1-py3-none-any.whl.

File metadata

  • Download URL: genvarloader-0.3.0rc1-py3-none-any.whl
  • Upload date:
  • Size: 47.1 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: poetry/1.7.1 CPython/3.8.18 Linux/4.18.0-477.21.1.el8_8.x86_64

File hashes

Hashes for genvarloader-0.3.0rc1-py3-none-any.whl
Algorithm    Hash digest
SHA256       42ed3df6dca4a51dc1743d414b5aaba7737e09a639155cb7abd38a6076f69c8e
MD5          c4d101b18634c97fcae8f71fc65b58e4
BLAKE2b-256  6070cb9824e5bfb52d113eb05e588af954e101e4672ce8cab0566017e52892f4

