Pipeline for efficient genomic data processing.

GenVarLoader

GenVarLoader provides a fast, memory-efficient data loader for training sequence models on genetic variation. For example, it can be used to train a DNA language model on human genetic variation (e.g. Nucleotide Transformer).

Features

  • Respects memory budget
  • Supports insertions and deletions
  • Scales to hundreds of thousands of individuals
  • Fast!
  • Extensible to new file formats (drop a feature request!)
  • Coming soon: re-aligning tracks (e.g. expression, chromatin accessibility) to genetic variation (e.g. BigRNA)

Installation

pip install genvarloader

PyTorch is not included as a dependency because installing it requires special instructions; install it separately.
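
Because PyTorch has to be installed separately, it can help to confirm that it is importable before building a data loader. The following is a minimal sketch using only the standard library; the helper name `has_module` is hypothetical and not part of GenVarLoader.

```python
import importlib.util


def has_module(name: str) -> bool:
    """Return True if a top-level module called `name` can be found."""
    return importlib.util.find_spec(name) is not None


# Hypothetical pre-flight check; it only reports, it does not install anything.
if not has_module("torch"):
    print("PyTorch not found; install it following the official PyTorch instructions.")
```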

Quick Start

import genvarloader as gvl

ref_fasta = 'reference.fasta'
variants = 'variants.pgen' # highly recommended to convert VCFs to PGEN
regions_of_interest = 'regions.bed'
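
The regions file is a BED file: tab-separated columns of chromosome, start, and end, with 0-based, half-open coordinates. As a sketch, the following writes a small `regions.bed` with two windows. The chromosome names, positions, and the helper `write_bed` are made up for illustration; in practice the contig names would need to match those used in the reference and variant files.

```python
from pathlib import Path


def write_bed(path, regions):
    """Write (chrom, start, end) tuples as a 3-column, tab-separated BED file."""
    lines = [f"{chrom}\t{start}\t{end}" for chrom, start, end in regions]
    Path(path).write_text("\n".join(lines) + "\n")


# Hypothetical 1,000 bp windows; BED starts are 0-based and ends are exclusive.
example_regions = [("chr1", 10_000, 11_000), ("chr2", 50_000, 51_000)]
write_bed("regions.bed", example_regions)
```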

Create readers for each file providing sequence data:

ref = gvl.Fasta(name='ref', path=ref_fasta, pad='N')
var = gvl.Pgen(variants)
varseq = gvl.FastaVariants(name='varseq', reference=ref, variants=var)
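
Conceptually, a variant-aware sequence reader applies a sample's alleles to the reference to produce a haplotype. The sketch below illustrates that idea in plain Python for substitutions, insertions, and deletions. It is not GenVarLoader's implementation; `apply_variants` and the toy data are hypothetical.

```python
def apply_variants(reference, variants):
    """Apply non-overlapping variants to a reference string.

    `variants` is a list of (pos, ref, alt) tuples with 0-based positions,
    where `ref` is the reference allele starting at `pos`.
    """
    haplotype = []
    cursor = 0  # next reference position not yet copied
    for pos, ref, alt in sorted(variants):
        if pos < cursor:
            raise ValueError(f"variant at {pos} overlaps a previous variant")
        if reference[pos:pos + len(ref)] != ref:
            raise ValueError(f"REF allele mismatch at {pos}")
        haplotype.append(reference[cursor:pos])  # unchanged reference bases
        haplotype.append(alt)                    # alternate allele
        cursor = pos + len(ref)                  # skip the replaced bases
    haplotype.append(reference[cursor:])
    return "".join(haplotype)


reference = "ACGTACGT"
variants = [
    (1, "C", "T"),    # substitution
    (3, "T", "TGG"),  # insertion of GG after position 3
    (5, "CG", "C"),   # deletion of the G at position 6
]
print(apply_variants(reference, variants))  # ATGTGGACT
```

Note that insertions and deletions change the haplotype length (here 8 bases become 9), so any fixed-length output has to pad or trim the result.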

Put them together and get a PyTorch DataLoader (torch.utils.data.DataLoader):

gvloader = gvl.GVL(
    readers=varseq,
    bed=regions_of_interest,
    fixed_length=1000,
    batch_size=16,
    max_memory_gb=8,
    batch_dims=['sample', 'ploid'],
    shuffle=True,
)

dataloader = gvloader.torch_dataloader()
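
To get a feel for what a memory budget means at this scale, a back-of-the-envelope estimate helps. The sketch below assumes one byte per nucleotide, 1 GB = 10^9 bytes, and a diploid cohort, and counts how many fixed-length regions fit in the budget when every sample's haplotypes are held at once. This is illustrative arithmetic only, not how GenVarLoader allocates memory; `regions_per_budget` and the cohort size are hypothetical.

```python
def regions_per_budget(max_memory_gb, n_samples, ploidy, fixed_length, bytes_per_base=1):
    """Estimate how many regions fit in memory across all samples and haplotypes."""
    budget_bytes = int(max_memory_gb * 1e9)
    bytes_per_region = n_samples * ploidy * fixed_length * bytes_per_base
    return budget_bytes // bytes_per_region


# 8 GB budget, 100,000 diploid samples, 1,000 bp regions
print(regions_per_budget(8, 100_000, 2, 1000))  # 40
```

Under these assumptions only 40 regions' worth of haplotypes fit in memory at a time, which is why data at this scale has to be loaded in chunks rather than materialized all at once.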

You can now iterate over the dataloader in your training loop:

# implement your training loop
for batch in dataloader:
    ...
