Package for loading data from bgen files
Project description
Another bgen parser
This is a package for reading and writing bgen files.
This package uses Cython to wrap C++ code for parsing bgen files. It's fairly quick: it can parse genotypes from 500,000 individuals at ~300 variants per second within a single Python process (~450 million probabilities per second with a 3GHz CPU). Decompressing the genotype probabilities can be the slow step: zlib decompression takes 80% of the total time, and using zstd-compressed genotypes is ~2X faster.
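The two throughput figures quoted above agree with each other if every individual contributes three genotype probabilities per variant. That per-variant count is an assumption (biallelic, unphased, diploid data); the text only states the totals. A quick arithmetic check:

```python
# Back-of-the-envelope check of the quoted throughput.
# Assumption: each variant is biallelic and unphased diploid, so every
# individual has 3 genotype probabilities (AA, AB, BB).
n_individuals = 500_000
variants_per_second = 300
probs_per_genotype = 3

probs_per_second = n_individuals * variants_per_second * probs_per_genotype
print(f"{probs_per_second:,} probabilities per second")
```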
This has been optimized for UKBiobank bgen files (i.e. bgen version 1.2 with zlib-compressed 8-bit genotype probabilities), but the other bgen versions and zstd compression have also been tested using example bgen files.
Install
pip install bgen
Usage
from bgen import BgenReader, BgenWriter
bfile = BgenReader(BGEN_PATH)
rsids = bfile.rsids()
# select a variant by indexing
var = bfile[1000]
# pull out genotype probabilities
probs = var.probabilities # returns 2D numpy array
dosage = var.minor_allele_dosage # returns 1D numpy array for biallelic variant
# iterate through every variant in the file
with BgenReader(BGEN_PATH, delay_parsing=True) as bfile:
    for var in bfile:
        dosage = var.minor_allele_dosage
# get all variants in a genomic region
variants = bfile.fetch('21', 10000, 5000000)
# or for writing bgen files
import numpy as np
from bgen import BgenWriter
geno = np.array([[1, 0, 0], [0, 1, 0], [0, 0, 1]]).astype(np.float64)
with BgenWriter(BGEN_PATH, n_samples=3) as bfile:
    bfile.add_variant(varid='var1', rsid='rs1', chrom='chr1', pos=1,
                      alleles=['A', 'G'], genotypes=geno)
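Before writing, it can help to check that a genotype array has a plausible shape. The sketch below is not part of the bgen package: `expected_columns` and `check_genotypes` are hypothetical helpers. It assumes unphased data with one row per sample and one column per possible genotype, and it does not handle missing values.

```python
import math

import numpy as np


def expected_columns(ploidy: int, n_alleles: int) -> int:
    """Number of unphased genotype combinations for one sample."""
    # Count multisets of size `ploidy` drawn from `n_alleles` alleles.
    return math.comb(ploidy + n_alleles - 1, n_alleles - 1)


def check_genotypes(geno: np.ndarray, n_samples: int,
                    n_alleles: int = 2, ploidy: int = 2) -> None:
    """Raise ValueError if `geno` does not look like unphased probabilities."""
    expected_shape = (n_samples, expected_columns(ploidy, n_alleles))
    if geno.shape != expected_shape:
        raise ValueError(f"expected shape {expected_shape}, got {geno.shape}")
    if (geno < 0).any() or (geno > 1).any():
        raise ValueError("probabilities must lie in [0, 1]")
    if not np.allclose(geno.sum(axis=1), 1.0):
        raise ValueError("each row of probabilities must sum to 1")


# Same array as in the writing example: 3 samples, diploid, 2 alleles.
geno = np.array([[1, 0, 0], [0, 1, 0], [0, 0, 1]], dtype=np.float64)
check_genotypes(geno, n_samples=3)  # passes silently
```

For diploid samples and two alleles this gives three columns, which matches the array used in the writing example.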
API documentation
class BgenReader(path, sample_path='', delay_parsing=False)
    # opens a bgen file. If a bgenix index exists for the file, the index file
    # will be opened automatically for quicker access of specific variants.
    Arguments:
      path: path to bgen file
      sample_path: optional path to sample file. Samples will be given integer IDs
          if sample file is not given and sample IDs not found in the bgen file
      delay_parsing: True/False option to allow for not loading all variants into
          memory when the BgenReader is opened. This can save time when iterating
          across variants in the file
    Attributes:
      samples: list of sample IDs
      header: BgenHeader with info about the bgen version and compression.
    Methods:
      slicing: BgenVars can be accessed by slicing the BgenReader e.g. bfile[1000]
      iteration: variants in a BgenReader can be looped over e.g. for x in bfile: print(x)
      fetch(chrom, start=None, stop=None): get all variants within a genomic region
      drop_variants(list[int]): drops variants by index from being used in analyses
      with_rsid(rsid): returns BgenVar with given rsid
      at_position(pos): returns BgenVar at a given position
      varids(): returns list of varids for variants in the bgen file.
      rsids(): returns list of rsids for variants in the bgen file.
      chroms(): returns list of chromosomes for variants in the bgen file.
      positions(): returns list of positions for variants in the bgen file.
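As a rough illustration of how the reader methods combine, the sketch below summarizes a genomic region. `region_summary` is a hypothetical helper, not part of the package. It relies only on the documented `fetch()` method and on the `rsid`, `pos` and `minor_allele_dosage` attributes of each returned variant, so it accepts any object that behaves like an open reader.

```python
from typing import Any, List, Tuple


def region_summary(reader: Any, chrom: str, start: int,
                   stop: int) -> List[Tuple[str, int, float]]:
    """Return (rsid, pos, mean minor allele dosage) for variants in a region.

    `reader` is expected to behave like an open BgenReader.
    """
    rows = []
    for var in reader.fetch(chrom, start, stop):
        dosage = var.minor_allele_dosage
        # sum() and len() work for numpy arrays as well as plain lists.
        mean = float(sum(dosage)) / len(dosage)
        rows.append((var.rsid, var.pos, mean))
    return rows
```

With a real file this might be called as `region_summary(bfile, '21', 10000, 5000000)` inside a `with BgenReader(BGEN_PATH) as bfile:` block. Samples with missing data would make the mean NaN, which this sketch does not try to handle.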
class BgenVar(handle, offset, layout, compression, n_samples):
    # Note: this isn't called directly, but instead returned from BgenReader methods
    Attributes:
      varid: ID for variant
      rsid: reference SNP ID for variant
      chrom: chromosome variant is on
      pos: nucleotide position variant is at
      alleles: list of alleles for variant
      is_phased: True/False for whether variant has phased genotype data
      ploidy: list of ploidy for each sample. Samples are ordered as per BgenReader.samples
      minor_allele: the least common allele (for biallelic variants)
      minor_allele_dosage: 1D numpy array of minor allele dosages for each sample
      alt_dosage: 1D numpy array of alt allele dosages for each sample
      probabilities: 2D numpy array of genotype probabilities, one sample per row
    BgenVars can be pickled e.g. pickle.dumps(var)
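To make the relationship between the probabilities and the dosage attributes concrete, here is a minimal sketch for unphased, diploid, biallelic variants. It assumes the three columns are ordered homozygous first allele, heterozygous, homozygous second allele, as in the bgen format. The function names are hypothetical, and this is not necessarily how the package computes its own values.

```python
import numpy as np


def allele_dosages(probs: np.ndarray) -> tuple:
    """Return (first_allele_dosage, second_allele_dosage) per sample."""
    # Expected copies of the first allele: 2 * P(hom first) + 1 * P(het).
    first = 2 * probs[:, 0] + probs[:, 1]
    # Expected copies of the second allele: 1 * P(het) + 2 * P(hom second).
    second = probs[:, 1] + 2 * probs[:, 2]
    return first, second


def minor_allele_dosage(probs: np.ndarray) -> np.ndarray:
    """Dosage of whichever allele has the smaller total across samples."""
    first, second = allele_dosages(probs)
    # nansum ignores samples whose probabilities are missing (NaN).
    return first if np.nansum(first) < np.nansum(second) else second


probs = np.array([[1.0, 0.0, 0.0],
                  [0.0, 1.0, 0.0],
                  [1.0, 0.0, 0.0]])
print(minor_allele_dosage(probs))
```

Here the second allele is the rarer one, so the result is its per-sample dosage.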
class BgenWriter(path, n_samples, samples=[], compression='zstd', layout=2, metadata=None)
    # opens a bgen file to write variants to. Automatically makes a bgenix index file
    Arguments:
      path: path to write data to
      n_samples: number of samples that you have data for
      samples: list of sample IDs (length must equal n_samples)
      compression: compression type: None, 'zstd', or 'zlib' (default='zstd')
      layout: bgen layout format (default=2)
      metadata: any additional metadata you want to include in the file (as str)
    Methods:
      add_variant(varid, rsid, chrom, pos, alleles, genotypes, ploidy=2,
                  phased=False, bit_depth=8)
        Arguments:
          varid: variant ID e.g. 'var1'
          rsid: reference SNP ID e.g. 'rs1'
          chrom: chromosome the variant is on e.g. 'chr1'
          pos: nucleotide position of the variant e.g. 100
          alleles: list of allele strings e.g. ['A', 'C']
          genotypes: numpy array of genotype probabilities, ordered as per the
              bgen samples e.g. np.array([[0, 0, 1], [0.5, 0.5, 0]])
          ploidy: ploidy state, either as integer to indicate constant ploidy
              (e.g. 2), or numpy array of ploidy values per sample, e.g. np.array([1, 2, 2])
          phased: whether the genotypes are for phased data or not (default=False)
          bit_depth: how many bits to store each genotype as (1-32, default=8)
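The bit_depth argument trades file size against precision. In bgen layout 2, each probability is stored as an unsigned integer scaled by 2^bit_depth - 1, so stored values are multiples of 1 / (2^bit_depth - 1). The sketch below shows the resulting rounding error for a single value. `quantize` is a hypothetical helper, and it ignores the extra step the format takes to keep each sample's stored probabilities summing to one.

```python
def quantize(prob: float, bit_depth: int = 8) -> float:
    """Round a probability to the nearest value storable in `bit_depth` bits."""
    if not 1 <= bit_depth <= 32:
        raise ValueError("bit_depth must be between 1 and 32")
    levels = 2 ** bit_depth - 1  # largest integer that fits in bit_depth bits
    return round(prob * levels) / levels


# Compare the rounding error at two bit depths.
for bits in (8, 16):
    stored = quantize(0.3, bits)
    print(bits, stored, abs(stored - 0.3))
```

The worst-case error for one value is half a step, i.e. 0.5 / (2^bit_depth - 1), which is about 0.002 at the default of 8 bits.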