Pipeline for efficient genomic data processing.

GenVarLoader

GenVarLoader provides a fast, memory-efficient data loader for training sequence models on genetic variation. For example, it can be used to train a DNA language model (e.g. Nucleotide Transformer) on human genetic variation.

Features

  • Respects memory budget
  • Supports insertions and deletions
  • Scales to hundreds of thousands of individuals
  • Fast!
  • Extensible to new file formats (drop a feature request!)
  • Coming soon: re-aligning tracks (e.g. expression, chromatin accessibility) to genetic variation (e.g. BigRNA)

Installation

pip install genvarloader

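PyTorch is not installed as a dependency because its installation requires special instructions. Install it separately by following the PyTorch installation instructions for your platform.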

Quick Start

import genvarloader as gvl

ref_fasta = 'reference.fasta'
variants = 'variants.pgen' # highly recommended to convert VCFs to PGEN
regions_of_interest = 'regions.bed'
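
The comment above recommends converting VCFs to PGEN. One common way to do that is with PLINK 2, which is a separate tool and not part of GenVarLoader. The sketch below only assembles and prints the command, using PLINK 2's `--vcf`, `--make-pgen`, and `--out` options. The helper names and the file names `variants.vcf.gz` and `variants` are hypothetical, and the sketch assumes `plink2` is on your PATH if you actually run the conversion.

```python
import shutil
import subprocess
from pathlib import Path


def build_plink2_command(vcf_path, out_prefix):
    """Return the argument list for converting a VCF to PGEN with PLINK 2."""
    return [
        "plink2",
        "--vcf", str(vcf_path),    # input VCF (may be bgzipped)
        "--make-pgen",             # write .pgen/.pvar/.psam files
        "--out", str(out_prefix),  # output prefix, without extension
    ]


def convert_vcf_to_pgen(vcf_path, out_prefix):
    """Run the conversion and return the path of the resulting .pgen file."""
    if shutil.which("plink2") is None:
        raise RuntimeError("plink2 was not found on PATH")
    subprocess.run(build_plink2_command(vcf_path, out_prefix), check=True)
    return Path(f"{out_prefix}.pgen")


# Hypothetical file names; printing the command lets you inspect it before running.
command = build_plink2_command("variants.vcf.gz", "variants")
print(" ".join(command))
```

Calling `convert_vcf_to_pgen("variants.vcf.gz", "variants")` would then produce the `variants.pgen` file used in the Quick Start.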

Create readers for each file providing sequence data:

ref = gvl.Fasta(name='ref', path=ref_fasta, pad='N')
var = gvl.Pgen(variants)
varseq = gvl.FastaVariants(name='varseq', reference=ref, variants=var)
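
Conceptually, a reader that combines a reference with variants has to apply each sample's alleles to the reference sequence, and insertions and deletions change the length of the result. The sketch below illustrates that idea in plain Python. It is not GenVarLoader's implementation: the function `apply_variants`, the 0-based `(position, ref, alt)` tuples, and the truncate-or-pad strategy for reaching a fixed length are all assumptions made for illustration.

```python
def apply_variants(reference, variants, fixed_length, pad="N"):
    """Apply non-overlapping variants to a reference string.

    variants: iterable of (position, ref_allele, alt_allele), 0-based positions.
    The result is truncated or right-padded to exactly fixed_length characters.
    """
    pieces = []
    cursor = 0  # next reference position that has not been copied yet
    for pos, ref_allele, alt_allele in sorted(variants):
        if pos < cursor:
            raise ValueError("overlapping variants are not handled in this sketch")
        if reference[pos:pos + len(ref_allele)] != ref_allele:
            raise ValueError(f"REF allele mismatch at position {pos}")
        pieces.append(reference[cursor:pos])  # unchanged reference up to the variant
        pieces.append(alt_allele)             # substitute the alternate allele
        cursor = pos + len(ref_allele)        # skip the replaced reference bases
    pieces.append(reference[cursor:])
    haplotype = "".join(pieces)
    return haplotype[:fixed_length].ljust(fixed_length, pad)


reference = "ACGTACGTAC"
variants = [
    (1, "C", "T"),    # SNP
    (3, "TAC", "T"),  # deletion of "AC"
    (8, "A", "AGG"),  # insertion of "GG"
]
print(apply_variants(reference, variants, fixed_length=12))  # ATGTGTAGGCNN
```

Padding with `N` mirrors the `pad='N'` argument passed to `gvl.Fasta` above, although how GenVarLoader itself handles length changes may differ.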

Put them together and get a PyTorch DataLoader (torch.utils.data.DataLoader):

gvloader = gvl.GVL(
    readers=varseq,
    bed=regions_of_interest,
    fixed_length=1000,
    batch_size=16,
    max_memory_gb=8,
    batch_dims=['sample', 'ploid'],
    shuffle=True,
)

dataloader = gvloader.torch_dataloader()

And now you're ready to use the dataloader however you need to:

# implement your training loop
for batch in dataloader:
    ...
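
What goes inside the loop depends on your model. As one possibility, the sketch below one-hot encodes the sequences in each batch. It uses a stand-in generator instead of the real dataloader so it runs without GenVarLoader or PyTorch. The batch layout shown here (a dict keyed by the reader name `'varseq'` holding DNA strings) is an assumption for illustration only; inspect a real batch to see what GenVarLoader actually yields.

```python
NUCLEOTIDES = "ACGT"


def one_hot(sequence):
    """Encode a DNA string as rows of 4 floats; N and other symbols become all zeros."""
    return [
        [1.0 if base == nt else 0.0 for nt in NUCLEOTIDES]
        for base in sequence.upper()
    ]


def fake_dataloader():
    """Hypothetical stand-in for the real dataloader (assumed batch layout)."""
    yield {"varseq": ["ACGTN", "TTGCA"]}
    yield {"varseq": ["NNACG", "GGGTA"]}


n_sequences = 0
for batch in fake_dataloader():
    encoded = [one_hot(seq) for seq in batch["varseq"]]
    n_sequences += len(encoded)
    # A real training loop would convert `encoded` to a tensor here,
    # run the model's forward pass, and update its parameters.

print(n_sequences)  # 4
```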

Download files

Source Distribution

genvarloader-0.1.14.tar.gz (30.3 kB)

Uploaded Source

Built Distribution

genvarloader-0.1.14-py3-none-any.whl (35.3 kB)

Uploaded Python 3

File details

Details for the file genvarloader-0.1.14.tar.gz.

File metadata

  • Download URL: genvarloader-0.1.14.tar.gz
  • Size: 30.3 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: poetry/1.7.1 CPython/3.8.18 Linux/4.18.0-477.21.1.el8_8.x86_64

File hashes

Hashes for genvarloader-0.1.14.tar.gz

Algorithm    Hash digest
SHA256       b3c1711de904494f4327eace032575f1944dfca476d98d67b581a2471fc7061c
MD5          72d833480eb3682b498173668fed94cb
BLAKE2b-256  cbdf05eefd66fa1e78f218e892c131313eb8469c8fcbc53be56f16b34a63fb58

File details

Details for the file genvarloader-0.1.14-py3-none-any.whl.

File metadata

  • Download URL: genvarloader-0.1.14-py3-none-any.whl
  • Size: 35.3 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: poetry/1.7.1 CPython/3.8.18 Linux/4.18.0-477.21.1.el8_8.x86_64

File hashes

Hashes for genvarloader-0.1.14-py3-none-any.whl

Algorithm    Hash digest
SHA256       6e22a87098fc18570a135cc3680477ebc4712b6b8f690fb6fb0d8569d94d300b
MD5          2f1cdbd511a2a5777366f69813b26f84
BLAKE2b-256  ce98ff7d13ff5aedc7ab5df1175813742e2637cb638cb966c22e99101627d39a
