Pipeline for efficient genomic data processing.
GenVarLoader
GenVarLoader provides a fast, memory-efficient data loader for training sequence models on genetic variation. For example, it can be used to train a DNA language model on human genetic variation (e.g. Nucleotide Transformer).
Features
- Respects memory budget
- Supports insertions and deletions
- Scales to hundreds of thousands of individuals
- Fast!
- Extensible to new file formats (drop a feature request!)
- Coming soon: re-aligning tracks (e.g. expression, chromatin accessibility) to genetic variation (e.g. BigRNA)
Installation
pip install genvarloader
PyTorch is not included as a dependency because installing it requires special instructions; install it separately.
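Since PyTorch has to be installed separately, it can help to check whether it is visible in the current environment before requesting a DataLoader. The following is a minimal sketch using only the standard library; the helper name `torch_is_installed` is hypothetical and not part of GenVarLoader.

```python
import importlib.util


def torch_is_installed() -> bool:
    """Return True if a package named `torch` can be located, without importing it."""
    return importlib.util.find_spec("torch") is not None


if torch_is_installed():
    print("PyTorch was found in this environment.")
else:
    print("PyTorch was not found; install it following the official PyTorch instructions.")
```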
Quick Start
import genvarloader as gvl
ref_fasta = 'reference.fasta'
variants = 'variants.pgen' # highly recommended to convert VCFs to PGEN
regions_of_interest = 'regions.bed'
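The comment above recommends converting VCFs to PGEN. One common way to do that is the PLINK 2 command-line tool, whose `--vcf`, `--make-pgen`, and `--out` flags write `<prefix>.pgen`, `<prefix>.pvar`, and `<prefix>.psam`. The sketch below only assembles and optionally runs that command; it assumes `plink2` is installed and on your PATH, the input name `variants.vcf.gz` is hypothetical, and the helper names are illustrative rather than part of GenVarLoader.

```python
import shutil
import subprocess
from typing import List


def build_plink2_command(vcf_path: str, out_prefix: str) -> List[str]:
    """Assemble a PLINK 2 command that converts a VCF into PGEN/PVAR/PSAM files."""
    return ["plink2", "--vcf", vcf_path, "--make-pgen", "--out", out_prefix]


def convert_vcf_to_pgen(vcf_path: str, out_prefix: str) -> bool:
    """Run the conversion if plink2 is available; return False if it is not installed."""
    if shutil.which("plink2") is None:
        return False
    subprocess.run(build_plink2_command(vcf_path, out_prefix), check=True)
    return True


# Hypothetical input file; the output prefix matches 'variants.pgen' used above.
cmd = build_plink2_command("variants.vcf.gz", "variants")
print(" ".join(cmd))
```

Calling `convert_vcf_to_pgen("variants.vcf.gz", "variants")` would then run the command and raise `subprocess.CalledProcessError` if PLINK 2 reports a failure.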
Create readers for each file providing sequence data:
ref = gvl.Fasta(name='ref', path=ref_fasta, pad='N')
var = gvl.Pgen(variants)
varseq = gvl.FastaVariants(name='varseq', reference=ref, variants=var)
Put them together and get a torch.utils.data.DataLoader:
gvloader = gvl.GVL(
    readers=varseq,
    bed=regions_of_interest,
    fixed_length=1000,
    batch_size=16,
    max_memory_gb=8,
    batch_dims=['sample', 'ploid'],
    shuffle=True,
)
dataloader = gvloader.torch_dataloader()
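The `bed` argument above points at a BED file of regions. As a rough sketch of how such a file could be produced, the snippet below writes the three standard BED columns (chromosome, 0-based start, exclusive end) as tab-separated text. The chromosome names and coordinates are made up, the helper `write_bed` is hypothetical, and the 1000 bp interval width is chosen only to mirror `fixed_length=1000`; whether GenVarLoader expects extra columns or a particular contig naming is not stated here.

```python
import csv
import os
import tempfile
from typing import Iterable, Tuple

Region = Tuple[str, int, int]  # (chromosome, 0-based start, exclusive end)


def write_bed(path: str, regions: Iterable[Region]) -> None:
    """Write regions as a minimal 3-column, tab-separated BED file."""
    with open(path, "w", newline="") as f:
        writer = csv.writer(f, delimiter="\t", lineterminator="\n")
        for chrom, start, end in regions:
            if not 0 <= start < end:
                raise ValueError(f"invalid interval: {chrom}:{start}-{end}")
            writer.writerow([chrom, start, end])


# Made-up example regions, each 1000 bp wide.
example_regions = [("chr1", 10_000, 11_000), ("chr2", 500, 1_500)]

# Write to a temporary directory so the demo leaves no files behind.
with tempfile.TemporaryDirectory() as tmp_dir:
    bed_path = os.path.join(tmp_dir, "regions.bed")
    write_bed(bed_path, example_regions)
    with open(bed_path) as f:
        bed_text = f.read()

print(bed_text, end="")
```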
And now you're ready to use the dataloader however you need to:
# implement your training loop
for batch in dataloader:
    ...
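The layout of each batch is not described here, so before writing the training loop it may be worth inspecting a single batch. The sketch below is one way to do that; `describe_batch` is a hypothetical helper, and it assumes only that a batch is a mapping, a list or tuple, or a single array-like object with an optional `shape` attribute. `FakeArray` and its shape are stand-ins so the example runs without PyTorch, not what GenVarLoader returns.

```python
from collections.abc import Mapping
from typing import Any, Dict


def describe(value: Any) -> str:
    """Summarize a value by type name and, if it has one, its shape."""
    shape = getattr(value, "shape", None)
    if shape is not None:
        return f"{type(value).__name__} shape={tuple(shape)}"
    return type(value).__name__


def describe_batch(batch: Any) -> Dict[str, str]:
    """Return a {name: summary} view of a batch, whatever container it uses."""
    if isinstance(batch, Mapping):
        return {str(key): describe(val) for key, val in batch.items()}
    if isinstance(batch, (list, tuple)):
        return {str(i): describe(val) for i, val in enumerate(batch)}
    return {"batch": describe(batch)}


class FakeArray:
    """Stand-in for a tensor or array; only carries a shape."""

    def __init__(self, shape):
        self.shape = shape


# Made-up batch used purely to demonstrate the helper.
summary = describe_batch({"varseq": FakeArray((4, 10))})
print(summary)
```

With a real loader, the equivalent call would be `describe_batch(next(iter(dataloader)))`.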
File details
Details for the file genvarloader-0.1.12.tar.gz.
File metadata
- Download URL: genvarloader-0.1.12.tar.gz
- Upload date:
- Size: 1.8 MB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: poetry/1.7.0 CPython/3.8.18 Linux/4.18.0-477.21.1.el8_8.x86_64
File hashes
Algorithm | Hash digest
---|---
SHA256 | 462548345d70ec5e45960fa5ce02d1103c06507972c9b167d98a9b3edf837733
MD5 | 9d7a68b42e369ef90c82cf3ab073c6fa
BLAKE2b-256 | 8d77e25c1a83f09179d7f25e78caa5f6020aeca2bf9c18c085d2e0cc1a80d767
File details
Details for the file genvarloader-0.1.12-py3-none-any.whl.
File metadata
- Download URL: genvarloader-0.1.12-py3-none-any.whl
- Upload date:
- Size: 2.7 MB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: poetry/1.7.0 CPython/3.8.18 Linux/4.18.0-477.21.1.el8_8.x86_64
File hashes
Algorithm | Hash digest
---|---
SHA256 | 7a03de39d705921bdf3b79cd6e9fbc5bb2aaf6ec78984ee05d5e46d9a118d108
MD5 | b6f9b1de933b30d5b128e62f6233cd0a
BLAKE2b-256 | e4cc49089403d3a5beb2930d882eba54d4b7050f6354d0d2cefc52cc6ebb9384