Pipeline for efficient genomic data processing.

GenVarLoader

GenVarLoader provides a fast, memory-efficient data loader for training sequence models on genetic variation. For example, it can be used to train a DNA language model on human genetic variation (e.g. Nucleotide Transformer).

Features

  • Respects memory budget
  • Supports insertions and deletions
  • Scales to 100,000s of individuals
  • Fast!
  • Extensible to new file formats (drop a feature request!)
  • Coming soon: re-aligning tracks (e.g. expression, chromatin accessibility) to genetic variation (e.g. BigRNA)

Installation

pip install genvarloader

PyTorch is not included as a dependency because installing it requires platform-specific instructions. Install it separately.
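
Since PyTorch has to be installed separately, it can be handy to check for it before going further. The snippet below is a minimal sketch using only the standard library; `is_installed` is a hypothetical helper name, not part of GenVarLoader.

```python
import importlib.util


def is_installed(module_name: str) -> bool:
    """Return True if a top-level module can be found in this environment."""
    return importlib.util.find_spec(module_name) is not None


if not is_installed("torch"):
    print("PyTorch not found; install it separately.")
```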

Quick Start

import genvarloader as gvl

ref_fasta = 'reference.fasta'
variants = 'variants.pgen' # highly recommended to convert VCFs to PGEN
regions_of_interest = 'regions.bed'

Create readers for each file providing sequence data:

ref = gvl.Fasta(name='ref', path=ref_fasta, pad='N')
var = gvl.Pgen(variants)
varseq = gvl.FastaVariants(name='varseq', reference=ref, variants=var)
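
To give an intuition for what combining a reference with variants means, here is a conceptual sketch in plain Python. It is not GenVarLoader's implementation: `Variant` and `apply_variants` are hypothetical names, and the record layout (0-based position, REF allele, ALT allele) is a simplified assumption modeled on VCF-style records.

```python
from typing import List, Tuple

# Assumed layout: (0-based position, REF allele, ALT allele)
Variant = Tuple[int, str, str]


def apply_variants(reference: str, variants: List[Variant]) -> str:
    """Build a haplotype by replacing each REF allele with its ALT allele.

    Variants must not overlap; they are applied from left to right.
    """
    pieces = []
    cursor = 0  # next reference position that has not been copied yet
    for pos, ref_allele, alt_allele in sorted(variants):
        if pos < cursor:
            raise ValueError(f"variant at {pos} overlaps a previous variant")
        if reference[pos:pos + len(ref_allele)] != ref_allele:
            raise ValueError(f"REF allele mismatch at position {pos}")
        pieces.append(reference[cursor:pos])  # unchanged reference sequence
        pieces.append(alt_allele)             # ALT replaces REF
        cursor = pos + len(ref_allele)
    pieces.append(reference[cursor:])         # remainder after the last variant
    return "".join(pieces)


reference = "ACGTACGTAC"
variants = [
    (1, "C", "T"),    # SNP
    (3, "T", "TGG"),  # insertion of GG
    (6, "GT", "G"),   # deletion of T
]
haplotype = apply_variants(reference, variants)
print(haplotype)  # ATGTGGACGAC
```

Note that the haplotype is one base longer than the reference here. Insertions and deletions change sequence length, which is one reason a loader that supports them needs a fixed output length for batching.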

Put them together and get a PyTorch DataLoader (torch.utils.data.DataLoader):

gvloader = gvl.GVL(
    readers=varseq,
    bed=regions_of_interest,
    fixed_length=1000,
    batch_size=16,
    max_memory_gb=8,
    batch_dims=['sample', 'ploid'],
    shuffle=True,
)

dataloader = gvloader.torch_dataloader()
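
To see why a memory budget such as `max_memory_gb` matters at scale, consider a back-of-envelope estimate. This is only a rough sketch, not how GenVarLoader accounts for memory: it assumes one byte per nucleotide, diploid samples, and decimal gigabytes, and `max_regions_in_memory`, `n_samples`, `ploidy`, and `bytes_per_base` are hypothetical names introduced for illustration.

```python
def max_regions_in_memory(
    max_memory_gb: float,
    fixed_length: int,
    n_samples: int,
    ploidy: int = 2,
    bytes_per_base: int = 1,
) -> int:
    """Estimate how many regions fit in the budget for all samples at once."""
    budget_bytes = int(max_memory_gb * 1e9)  # assumes 1 GB = 10**9 bytes
    bytes_per_region = fixed_length * n_samples * ploidy * bytes_per_base
    return budget_bytes // bytes_per_region


# 8 GB budget, 1000 bp regions, 100,000 diploid individuals
n = max_regions_in_memory(max_memory_gb=8, fixed_length=1000, n_samples=100_000)
print(n)  # 40
```

Under these assumptions only a few dozen regions fit in memory at once for a cohort of this size, so data has to be loaded in chunks rather than all at once.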

And now you're ready to use the dataloader however you need to:

# implement your training loop
for batch in dataloader:
    ...
