Pipeline for efficient genomic data processing.
GenVarLoader
GenVarLoader provides a fast, memory-efficient data loader for training sequence models on genetic variation. For example, it can be used to train a DNA language model on human genetic variation (e.g. Nucleotide Transformer).
Features
- Respects memory budget
- Supports insertions and deletions
- Scales to hundreds of thousands of individuals
- Fast!
- Extensible to new file formats (drop a feature request!)
- Coming soon: re-aligning tracks (e.g. expression, chromatin accessibility) to genetic variation (e.g. BigRNA)
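To illustrate why insertions and deletions need special handling, here is a minimal, conceptual sketch of turning a reference sequence plus variants into a haplotype of fixed length. It is not GenVarLoader's implementation: the `Variant` record and the `apply_variants` and `to_fixed_length` helpers are hypothetical names introduced for this example, and the right-padding with `N` is only an assumed convention.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Variant:
    """Hypothetical minimal variant record (not a GenVarLoader type)."""

    pos: int  # 0-based position of the first reference base that is replaced
    ref: str  # reference allele
    alt: str  # alternate allele


def apply_variants(reference: str, variants: list) -> str:
    """Build a haplotype by substituting each ALT allele for its REF allele.

    Assumes the variants do not overlap one another.
    """
    pieces = []
    cursor = 0  # next reference position that has not been copied yet
    for v in sorted(variants, key=lambda v: v.pos):
        if v.pos < cursor:
            raise ValueError(f"overlapping variant at position {v.pos}")
        if reference[v.pos : v.pos + len(v.ref)] != v.ref:
            raise ValueError(f"REF allele mismatch at position {v.pos}")
        pieces.append(reference[cursor : v.pos])  # unchanged reference bases
        pieces.append(v.alt)                      # substituted allele
        cursor = v.pos + len(v.ref)
    pieces.append(reference[cursor:])
    return "".join(pieces)


def to_fixed_length(seq: str, length: int, pad: str = "N") -> str:
    """Truncate or right-pad so that every sequence has the same length."""
    return seq[:length].ljust(length, pad)


reference = "ACGTACGTAC"
variants = [
    Variant(pos=1, ref="C", alt="T"),    # SNP
    Variant(pos=3, ref="T", alt="TGG"),  # insertion of GG
    Variant(pos=6, ref="GTA", alt="G"),  # deletion of TA
]
haplotype = apply_variants(reference, variants)
print(haplotype)                       # ATGTGGACGC
print(to_fixed_length(haplotype, 12))  # ATGTGGACGCNN
```

Because indels change the sequence length, a haplotype generally no longer lines up with reference coordinates, which is why a fixed output length needs some padding or truncation rule.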
Installation
pip install genvarloader
PyTorch is not included as a dependency because installing it requires special instructions.
An optional dependency is TensorStore (version >=0.1.50), which is used to write genotypes as a Zarr store and to perform I/O. This dramatically speeds up data loading when training a model on genetic variation, which requires approximately uniform random sampling across the genome. Standard bioinformatics variant formats like VCF, BCF, and PGEN unfortunately do not have a data layout conducive to this access pattern. TensorStore is not included as a dependency due to a dependency conflict that, within the scope of GenVarLoader, does not cause any issues. GenVarLoader is developed with Poetry, and I am waiting for the ability to override or ignore sub-dependencies before including TensorStore as an explicit dependency.
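Since PyTorch and TensorStore must be installed separately, it can be handy to check what is present in the current environment. The following is a minimal sketch using only the standard library; the helper names (`version_tuple`, `installed_version`, `meets_minimum`) are hypothetical, and the version parsing is deliberately simplistic (it compares leading numeric components only).

```python
from importlib import metadata


def version_tuple(version):
    """Parse leading numeric components, e.g. '0.1.50' -> (0, 1, 50)."""
    parts = []
    for piece in version.split("."):
        digits = ""
        for ch in piece:
            if not ch.isdigit():
                break
            digits += ch
        if not digits:
            break
        parts.append(int(digits))
    return tuple(parts)


def installed_version(distribution):
    """Return the installed version string, or None if it is not installed."""
    try:
        return metadata.version(distribution)
    except metadata.PackageNotFoundError:
        return None


def meets_minimum(installed, minimum):
    """True if a version is installed and is at least the minimum."""
    return installed is not None and version_tuple(installed) >= version_tuple(minimum)


torch_version = installed_version("torch")
tensorstore_version = installed_version("tensorstore")

print("PyTorch installed:", torch_version is not None)
print("TensorStore >= 0.1.50:", meets_minimum(tensorstore_version, "0.1.50"))
```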
Quick Start
import genvarloader as gvl
ref_fasta = 'reference.fasta'
variants = 'variants.pgen' # highly recommended to convert VCFs to PGEN
regions_of_interest = 'regions.bed'
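If you do not yet have a regions file, the sketch below writes a tiny BED file for experimentation. BED is tab-separated with 0-based, half-open coordinates. The regions themselves, the `write_bed` helper, and the `chr`-prefixed chromosome names are assumptions made for illustration; chromosome names would need to match those used in your reference and variant files.

```python
import tempfile
from pathlib import Path


def write_bed(path, regions):
    """Write (chrom, start, end) tuples as tab-separated BED lines."""
    with open(path, "w") as handle:
        for chrom, start, end in regions:
            if not 0 <= start < end:
                raise ValueError(f"invalid interval: {chrom}:{start}-{end}")
            handle.write(f"{chrom}\t{start}\t{end}\n")


# Hypothetical regions; BED coordinates are 0-based and half-open,
# so each of these intervals spans exactly 1000 bases.
regions = [
    ("chr1", 1_000_000, 1_001_000),
    ("chr2", 5_000_000, 5_001_000),
]

# Written to a temporary directory to avoid overwriting an existing file.
bed_path = Path(tempfile.mkdtemp()) / "regions.bed"
write_bed(bed_path, regions)
print(bed_path.read_text())
```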
Create readers for each file providing sequence data:
ref = gvl.Fasta(name='ref', path=ref_fasta, pad='N')
var = gvl.Pgen(variants)
varseq = gvl.FastaVariants(name='varseq', reference=ref, variants=var)
Put them together and get a torch.utils.data.DataLoader:
gvloader = gvl.GVL(
    readers=varseq,
    bed=regions_of_interest,
    fixed_length=1000,
    batch_size=16,
    max_memory_gb=8,
    batch_dims=['sample', 'ploid'],
    shuffle=True,
)
dataloader = gvloader.torch_dataloader()
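As a rough back-of-the-envelope check on what these settings imply, the sketch below estimates how many fixed-length sequences fit into the memory budget. It assumes one byte per nucleotide, 1 GB = 10^9 bytes, and no overhead; GenVarLoader's actual memory accounting may differ, and `sequences_within_budget` is a hypothetical helper, not part of the library.

```python
def sequences_within_budget(max_memory_gb, fixed_length, bytes_per_base=1):
    """Estimate how many sequences of fixed_length fit in the budget.

    Assumes 1 GB = 10**9 bytes and ignores any per-sequence overhead.
    """
    budget_bytes = int(max_memory_gb * 1e9)
    return budget_bytes // (fixed_length * bytes_per_base)


# Values taken from the Quick Start configuration.
n_sequences = sequences_within_budget(max_memory_gb=8, fixed_length=1000)
n_batches = n_sequences // 16  # batch_size=16

print(n_sequences)  # 8000000
print(n_batches)    # 500000
```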
And now you're ready to use the dataloader however you need to:
# implement your training loop
for batch in dataloader:
    ...
File details
Details for the file genvarloader-0.3.2.tar.gz.
File metadata
- Download URL: genvarloader-0.3.2.tar.gz
- Upload date:
- Size: 150.1 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: maturin/1.5.1
File hashes
Algorithm | Hash digest
---|---
SHA256 | 48f6625036355c1175da80cd9734a6319f1d8da55e85cd46e7dd4d83b44f03a5
MD5 | ed17c2e027959273040eca7b31828de8
BLAKE2b-256 | 07ce37f4d063e0bf14cacd102380dda368b44894c9baa7ea96c951cac2e3694a
File details
Details for the file genvarloader-0.3.2-cp39-abi3-manylinux_2_28_x86_64.whl.
File metadata
- Download URL: genvarloader-0.3.2-cp39-abi3-manylinux_2_28_x86_64.whl
- Upload date:
- Size: 424.9 kB
- Tags: CPython 3.9+, manylinux: glibc 2.28+ x86-64
- Uploaded using Trusted Publishing? No
- Uploaded via: maturin/1.5.1
File hashes
Algorithm | Hash digest
---|---
SHA256 | 9a0e4159c22416a3ed19b0c58ede05268122e3b4246322f6df327401ccc22620
MD5 | 1b08f42def4e07e042564e96d9bd153a
BLAKE2b-256 | 219aa2e17830601cdc1c3eea52684de71a9373965f8055164a6cb08a41a4490e