Pipeline for efficient genomic data processing.

Project description

GenVarLoader

GenVarLoader provides a fast, memory-efficient data loader for training sequence models on genetic variation. For example, it can be used to train a DNA language model (e.g. Nucleotide Transformer) on human genetic variation.

Features

  • Respects memory budget
  • Supports insertions and deletions
  • Scales to hundreds of thousands of individuals
  • Fast!
  • Extensible to new file formats (drop a feature request!)
  • Coming soon: re-aligning tracks (e.g. expression, chromatin accessibility) to genetic variation (e.g. BigRNA)
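
To make the "insertions and deletions" point concrete, here is a minimal sketch of what applying variants to a reference window involves, and why a fixed output length then needs truncation or padding. It is not GenVarLoader's implementation; the helper names `apply_variants` and `to_fixed_length` and the toy data are hypothetical, and variants are assumed to be non-overlapping with 0-based positions.

```python
from typing import List, Tuple

# (0-based position, REF allele, ALT allele); hypothetical toy representation
Variant = Tuple[int, str, str]


def apply_variants(reference: str, variants: List[Variant]) -> str:
    """Return the haplotype obtained by applying non-overlapping variants."""
    pieces = []
    cursor = 0  # next reference position that has not been copied yet
    for pos, ref, alt in sorted(variants):
        if pos < cursor:
            raise ValueError(f"variant at {pos} overlaps a previous variant")
        if reference[pos:pos + len(ref)] != ref:
            raise ValueError(f"REF allele mismatch at position {pos}")
        pieces.append(reference[cursor:pos])  # unchanged stretch before the variant
        pieces.append(alt)                    # alternate allele replaces REF
        cursor = pos + len(ref)               # skip the replaced reference bases
    pieces.append(reference[cursor:])         # remainder after the last variant
    return "".join(pieces)


def to_fixed_length(seq: str, length: int, pad: str = "N") -> str:
    """Truncate or right-pad a sequence so it is exactly `length` characters."""
    return seq[:length].ljust(length, pad)


reference = "ACGTACGTAC"
variants = [
    (2, "G", "GTT"),  # insertion: two extra bases
    (5, "CG", "C"),   # deletion: one base removed
    (9, "C", "A"),    # single-nucleotide substitution
]

haplotype = apply_variants(reference, variants)
print(haplotype)                       # ACGTTTACTAA
print(to_fixed_length(haplotype, 12))  # ACGTTTACTAAN
```

Because indels change the length of the haplotype relative to the reference, a loader that returns fixed-length sequences has to decide how to pad or trim; the right-padding with `N` shown here is only one possible choice.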

Installation

pip install genvarloader

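PyTorch is not installed automatically as a dependency because its installation depends on your platform and hardware; install it separately by following the official PyTorch instructions.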
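
Since PyTorch has to be installed separately, it can help to check for it before building a dataloader. The following is a small sketch using only the standard library; the helper name `has_module` is hypothetical.

```python
import importlib.util


def has_module(name: str) -> bool:
    """Return True if a top-level module can be found without importing it."""
    return importlib.util.find_spec(name) is not None


if has_module("torch"):
    print("PyTorch is available.")
else:
    print("PyTorch not found; install it separately before requesting a dataloader.")
```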

Quick Start

import genvarloader as gvl

ref_fasta = 'reference.fasta'
variants = 'variants.pgen' # highly recommended to convert VCFs to PGEN
regions_of_interest = 'regions.bed'
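
The comment above recommends converting VCFs to PGEN. One way to do this, assuming the PLINK 2 command-line tool (`plink2`) is installed and on your PATH, is sketched below. The helper name `plink2_vcf_to_pgen_cmd` and the file names are hypothetical; the sketch only builds and prints the command so it can be inspected before running.

```python
import shlex
from typing import List


def plink2_vcf_to_pgen_cmd(vcf_path: str, out_prefix: str) -> List[str]:
    """Build a plink2 command that converts a VCF into PGEN/PVAR/PSAM files."""
    return ["plink2", "--vcf", vcf_path, "--make-pgen", "--out", out_prefix]


cmd = plink2_vcf_to_pgen_cmd("variants.vcf.gz", "variants")
print(shlex.join(cmd))

# To actually run the conversion (requires plink2 to be installed):
#   import subprocess
#   subprocess.run(cmd, check=True)
```

With these arguments, PLINK 2 writes `variants.pgen` together with the accompanying `variants.pvar` and `variants.psam` files.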

Create readers for each file providing sequence data:

ref = gvl.Fasta(name='ref', path=ref_fasta, pad='N')
var = gvl.Pgen(variants)
varseq = gvl.FastaVariants(name='varseq', reference=ref, variants=var)

Put them together and get a PyTorch DataLoader (torch.utils.data.DataLoader):

gvloader = gvl.GVL(
    readers=varseq,
    bed=regions_of_interest,
    fixed_length=1000,
    batch_size=16,
    max_memory_gb=8,
    batch_dims=['sample', 'ploid'],
    shuffle=True,
)
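
To get a feel for how `fixed_length`, `batch_size`, and `max_memory_gb` relate, here is a back-of-the-envelope sketch. How GenVarLoader actually spends its memory budget is not described here; the sketch assumes one byte per nucleotide and 1 GB = 1024**3 bytes, and the helper names are hypothetical.

```python
def bytes_per_batch(batch_size: int, fixed_length: int, bytes_per_base: int = 1) -> int:
    """Size of one batch of sequences, assuming a fixed cost per nucleotide."""
    return batch_size * fixed_length * bytes_per_base


def max_sequences_in_budget(max_memory_gb: float, fixed_length: int, bytes_per_base: int = 1) -> int:
    """How many fixed-length sequences fit in the budget (assumes 1 GB = 1024**3 bytes)."""
    budget_bytes = int(max_memory_gb * 1024**3)
    return budget_bytes // (fixed_length * bytes_per_base)


# Values taken from the quick start configuration.
print(bytes_per_batch(batch_size=16, fixed_length=1000))             # 16000
print(max_sequences_in_budget(max_memory_gb=8, fixed_length=1000))   # 8589934
```

Under these assumptions a single batch is tiny compared with the budget, which suggests the budget mainly limits how much data can be buffered ahead of batching rather than the size of any one batch.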

dataloader = gvloader.torch_dataloader()

And now you're ready to use the dataloader however you need to:

# implement your training loop
for batch in dataloader:
    ...
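
A common first step inside such a loop is one-hot encoding the nucleotides. The structure of each batch is not described here, so the sketch below uses a hypothetical stand-in, `fake_dataloader`, that yields dicts keyed by the reader name `'varseq'` with plain strings as sequences; adapt it to whatever the real dataloader yields.

```python
from typing import Dict, Iterator, List

ALPHABET = "ACGT"
INDEX = {base: i for i, base in enumerate(ALPHABET)}


def one_hot(seq: str) -> List[List[int]]:
    """One-hot encode a DNA string; bases outside ACGT (e.g. N) become all zeros."""
    rows = []
    for base in seq.upper():
        row = [0] * len(ALPHABET)
        if base in INDEX:
            row[INDEX[base]] = 1
        rows.append(row)
    return rows


def fake_dataloader() -> Iterator[Dict[str, List[str]]]:
    """Hypothetical stand-in for the real dataloader, yielding two small batches."""
    yield {"varseq": ["ACGT", "NNAC"]}
    yield {"varseq": ["ttga", "GGGN"]}


n_seen = 0
for batch in fake_dataloader():
    encoded = [one_hot(seq) for seq in batch["varseq"]]
    n_seen += len(encoded)  # a real loop would pass `encoded` to the model here

print(n_seen)  # 4
```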

Download files

Source Distribution

genvarloader-0.1.16.tar.gz (1.8 MB)

Uploaded Source

Built Distribution

genvarloader-0.1.16-py3-none-any.whl (2.7 MB)

Uploaded Python 3

File details

Details for the file genvarloader-0.1.16.tar.gz.

File metadata

  • Download URL: genvarloader-0.1.16.tar.gz
  • Upload date:
  • Size: 1.8 MB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: poetry/1.7.1 CPython/3.8.18 Linux/4.18.0-477.21.1.el8_8.x86_64

File hashes

Hashes for genvarloader-0.1.16.tar.gz

Algorithm    Hash digest
SHA256       531ec43a64c5f351c3cb4cadde16a5294112d24482a63ce6a17690d2f2decf9b
MD5          29d6d54be1c2eaeb949147bcf0bdb1cc
BLAKE2b-256  3d68f5e9b14707f4b589779bee5f95ee0b85431ea64b8cbf5c2f84a7a80051e3

File details

Details for the file genvarloader-0.1.16-py3-none-any.whl.

File metadata

  • Download URL: genvarloader-0.1.16-py3-none-any.whl
  • Upload date:
  • Size: 2.7 MB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: poetry/1.7.1 CPython/3.8.18 Linux/4.18.0-477.21.1.el8_8.x86_64

File hashes

Hashes for genvarloader-0.1.16-py3-none-any.whl

Algorithm    Hash digest
SHA256       fb56b3dddf3379d4994df2ac7c349a92d9828338cd23a0122e05c40d1f804107
MD5          bb231e7b08fa54a886adfad976ed9269
BLAKE2b-256  24a3211ce74e053b05ac434a24137dc84669b135f5209537fbdf1035c7c396c4
