Package for loading data from bgen files
Another bgen parser
This is a package for reading and writing bgen files.
This package uses Cython to wrap C++ code for parsing bgen files. It is fairly quick: it can parse genotypes from 500,000 individuals at ~300 variants per second within a single Python process (~450 million probabilities per second with a 3 GHz CPU). Decompressing the genotype probabilities can be the slow step: zlib decompression takes 80% of the total time, and using zstd-compressed genotypes is ~2X faster.
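The throughput figures above can be checked with a little arithmetic. This is only a back-of-envelope sketch: it assumes three genotype probabilities per sample (a biallelic, diploid, unphased variant), which the text does not state explicitly, and the variable names are purely illustrative.

```python
# Back-of-envelope check of the quoted throughput figures.
n_samples = 500_000        # individuals per variant (from the text)
variants_per_second = 300  # approximate parsing rate (from the text)
probs_per_sample = 3       # assumption: biallelic, diploid, unphased

probs_per_second = n_samples * variants_per_second * probs_per_sample
print(f"{probs_per_second:,} probabilities per second")  # 450,000,000

# Time budget per variant, and the share spent in zlib decompression.
ms_per_variant = 1000 / variants_per_second
zlib_fraction = 0.80       # share of total time quoted in the text
zlib_ms = ms_per_variant * zlib_fraction
print(f"{ms_per_variant:.2f} ms per variant, {zlib_ms:.2f} ms of it in zlib")
```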
This has been optimized for UK Biobank bgen files (i.e. bgen version 1.2 with zlib-compressed 8-bit genotype probabilities), but the other bgen versions and zstd compression have also been tested using example bgen files.
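To make "8-bit genotype probabilities" concrete: in bgen version 1.2, each stored probability is an unsigned integer of B bits that is read back as x / (2^B - 1), and the last probability for each sample is not stored but implied by the others summing to one. The sketch below assumes plain rounding, whereas the bgen specification describes a more careful rounding rule. `encode` and `decode` are hypothetical helpers for illustration, not functions from this package.

```python
def encode(p, bit_depth=8):
    """Quantize a probability in [0, 1] to an unsigned integer of bit_depth bits."""
    scale = (1 << bit_depth) - 1  # 255 for 8 bits
    return round(p * scale)

def decode(x, bit_depth=8):
    """Map a stored integer back to a probability."""
    return x / ((1 << bit_depth) - 1)

# One sample, biallelic diploid unphased: three genotype probabilities.
probs = (0.2, 0.5, 0.3)

# Only the first two values are stored; the third is implied.
stored = [encode(p) for p in probs[:-1]]
decoded = [decode(x) for x in stored]
decoded.append(1.0 - sum(decoded))

max_error = max(abs(a - b) for a, b in zip(probs, decoded))
print(stored, decoded, max_error)
```

With 8 bits the quantization error on each stored value is at most half a step (0.5/255), so the implied last value is off by at most one step (1/255).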
Install
pip install bgen
Usage
from bgen import BgenReader, BgenWriter
bfile = BgenReader(BGEN_PATH)
rsids = bfile.rsids()
# select a variant by indexing
var = bfile[1000]
# pull out genotype probabilities
probs = var.probabilities # returns 2D numpy array
dosage = var.minor_allele_dosage # returns 1D numpy array for biallelic variant
# iterate through every variant in the file
with BgenReader(BGEN_PATH, delay_parsing=True) as bfile:
    for var in bfile:
        dosage = var.minor_allele_dosage
# get all variants in a genomic region
variants = bfile.fetch('21', 10000, 5000000)
# or for writing bgen files
import numpy as np
from bgen import BgenWriter
geno = np.array([[1, 0, 0], [0, 1, 0], [0, 0, 1]]).astype(np.float64)
with BgenWriter(BGEN_PATH, n_samples=3) as bfile:
    bfile.add_variant(varid='var1', rsid='rs1', chrom='chr1', pos=1,
                      alleles=['A', 'G'], genotypes=geno)
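Before writing, it can help to sanity-check the genotype array: one row per sample, every row the same width, and each row of probabilities summing to one. The helper below is a hypothetical sketch in plain Python, not part of this package, and whether BgenWriter performs similar checks itself is not stated here. It does not handle missing genotypes. For a numpy array, pass `geno.tolist()`.

```python
def check_genotypes(rows, n_samples, tol=1e-6):
    """Raise ValueError if rows is not a valid genotype probability matrix."""
    if len(rows) != n_samples:
        raise ValueError(f"expected {n_samples} rows, got {len(rows)}")
    width = len(rows[0]) if len(rows) > 0 else 0
    for i, row in enumerate(rows):
        if len(row) != width:
            raise ValueError(f"row {i} has {len(row)} columns, expected {width}")
        if any(p < 0 or p > 1 for p in row):
            raise ValueError(f"row {i} has a probability outside [0, 1]")
        if abs(sum(row) - 1.0) > tol:
            raise ValueError(f"row {i} sums to {sum(row)}, not 1")
    return True

# Same values as the writing example above, as nested lists.
geno_rows = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
print(check_genotypes(geno_rows, n_samples=3))
```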
API documentation
class BgenReader(path, sample_path='', delay_parsing=False)
    # opens a bgen file. If a bgenix index exists for the file, the index file
    # will be opened automatically for quicker access of specific variants.
    Arguments:
      path: path to bgen file
      sample_path: optional path to sample file. Samples will be given integer IDs
          if a sample file is not given and sample IDs are not found in the bgen file
      delay_parsing: True/False option to allow for not loading all variants into
          memory when the BgenReader is opened. This can save time when iterating
          across variants in the file

    Attributes:
      samples: list of sample IDs
      header: BgenHeader with info about the bgen version and compression.

    Methods:
      slicing: BgenVars can be accessed by slicing the BgenReader e.g. bfile[1000]
      iteration: variants in a BgenReader can be looped over e.g. for x in bfile: print(x)
      fetch(chrom, start=None, stop=None): get all variants within a genomic region
      drop_variants(list[int]): drops variants by index from being used in analyses
      with_rsid(rsid): returns BgenVar with given rsid
      at_position(pos): returns BgenVar at a given position
      varids(): returns list of varids for variants in the bgen file.
      rsids(): returns list of rsids for variants in the bgen file. NOTE: this
          order can differ from the order of variants in the bfile if an index
          file is present.
      chroms(): returns list of chromosomes for variants in the bgen file. NOTE:
          order can differ from variant order if an index file is present.
      positions(): returns list of positions for variants in the bgen file.
class BgenVar(handle, offset, layout, compression, n_samples):
    # Note: this isn't called directly, but instead returned from BgenReader methods
    Attributes:
      varid: ID for variant
      rsid: reference SNP ID for variant
      chrom: chromosome variant is on
      pos: nucleotide position variant is at
      alleles: list of alleles for variant
      is_phased: True/False for whether variant has phased genotype data
      ploidy: list of ploidy for each sample. Samples are ordered as per BgenReader.samples
      minor_allele: the least common allele (for biallelic variants)
      minor_allele_dosage: 1D numpy array of minor allele dosages for each sample
      alt_dosage: 1D numpy array of alt allele dosages for each sample
      probabilities: 2D numpy array of genotype probabilities, one sample per row

    BgenVars can be pickled e.g. pickle.dumps(var)
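The relationship between the probabilities and the two dosage attributes can be sketched in plain Python. This assumes a biallelic, diploid, unphased variant whose probability columns are ordered as (two copies of the first allele, one of each, two copies of the second allele), as in the bgen format, and that the second allele is treated as the alt allele. The function names mirror the attribute names for readability only: this is an illustration, not the package's implementation.

```python
def alt_dosage(probs):
    """Expected count of the second (alt) allele per sample."""
    return [p_het + 2.0 * p_hom_alt for _, p_het, p_hom_alt in probs]

def minor_allele_dosage(probs):
    """Expected count of whichever allele is rarer across all samples."""
    alt = alt_dosage(probs)
    alt_freq = sum(alt) / (2 * len(alt))  # two allele copies per sample
    if alt_freq <= 0.5:
        return alt
    # The first allele is rarer; its dosage is 2 minus the alt dosage
    # (valid when each row of probabilities sums to one).
    return [2.0 - d for d in alt]

probs = [[1, 0, 0], [0, 1, 0], [0, 0, 1], [1, 0, 0]]
print(alt_dosage(probs))           # [0.0, 1.0, 2.0, 0.0]
print(minor_allele_dosage(probs))  # [0.0, 1.0, 2.0, 0.0]
```

Here the alt allele frequency is 3/8, so the alt allele is the minor allele and both dosages agree.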
class BgenWriter(path, n_samples, samples=[], compression='zstd', layout=2, metadata=None)
    # opens a bgen file to write variants to. Automatically makes a bgenix index file
    Arguments:
      path: path to write data to
      n_samples: number of samples that you have data for
      samples: list of sample IDs (length must equal n_samples)
      compression: compression type: None, 'zstd', or 'zlib' (default='zstd')
      layout: bgen layout format (default=2)
      metadata: any additional metadata you want to include in the file (as str)

    Methods:
      add_variant(varid, rsid, chrom, pos, alleles, genotypes, ploidy=2,
                  phased=False, bit_depth=8)
        Arguments:
          varid: variant ID e.g. 'var1'
          rsid: reference SNP ID e.g. 'rs1'
          chrom: chromosome the variant is on e.g. 'chr1'
          pos: nucleotide position of the variant e.g. 100
          alleles: list of allele strings e.g. ['A', 'C']
          genotypes: numpy array of genotype probabilities, ordered as per the
              bgen samples e.g. np.array([[0, 0, 1], [0.5, 0.5, 0]])
          ploidy: ploidy state, either as integer to indicate constant ploidy
              (e.g. 2), or numpy array of ploidy values per sample, e.g. np.array([1, 2, 2])
          phased: whether the genotypes are for phased data or not (default=False)
          bit_depth: how many bits to store each genotype as (1-32, default=8)
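The number of probabilities each sample carries depends on ploidy, allele count, and phasing. The sketch below gives the counts defined by the bgen layout 2 format: for unphased data, the number of ways to pick `ploidy` alleles with repetition; for phased data, one probability per allele per haplotype. `n_probabilities` is a hypothetical helper, and how BgenWriter expects the `genotypes` array to be shaped for phased or mixed-ploidy data is not covered here.

```python
from math import comb

def n_probabilities(ploidy, n_alleles, phased=False):
    """Number of genotype probabilities per sample in bgen layout 2."""
    if phased:
        # One probability per allele for each haplotype.
        return ploidy * n_alleles
    # Unphased: multisets of size `ploidy` drawn from `n_alleles` alleles.
    return comb(ploidy + n_alleles - 1, n_alleles - 1)

print(n_probabilities(2, 2))               # 3, e.g. AA, AG, GG
print(n_probabilities(1, 2))               # 2, haploid sample
print(n_probabilities(2, 3))               # 6, triallelic diploid
print(n_probabilities(2, 2, phased=True))  # 4, two haplotypes x two alleles
```

The diploid biallelic case gives three columns, matching the three-column `geno` array in the usage example.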