Package for loading data from bgen files
Another bgen reader
This is a package for reading bgen files.
This package uses Cython to wrap C++ code for parsing bgen files. It is fairly quick: a single Python process can parse genotypes from 500,000 individuals at ~300 variants per second (~450 million probabilities per second on a 3 GHz CPU).
This has been optimized for UK Biobank bgen files (i.e. bgen version 1.2 with zlib-compressed 8-bit genotype probabilities), but the other bgen versions and zstd compression have also been tested using example bgen files.
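The two throughput figures are consistent with each other if each sample contributes three genotype probabilities per variant. That count is an assumption here (it holds for an unphased, biallelic, diploid variant: AA, AB, BB), and the check is only a quick piece of arithmetic:

```python
n_samples = 500_000        # individuals, from the figure quoted above
variants_per_second = 300  # approximate parse rate quoted above
probs_per_sample = 3       # assumption: AA, AB, BB per sample per variant

probs_per_second = n_samples * variants_per_second * probs_per_sample
print(f"{probs_per_second:,} probabilities per second")
```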
Install
pip install bgen
Usage
from bgen import BgenFile
bfile = BgenFile(BGEN_PATH)
rsids = bfile.rsids()
# select a variant by indexing
var = bfile[1000]
# pull out genotype probabilities
probs = var.probabilities # returns 2D numpy array
dosage = var.minor_allele_dosage # returns 1D numpy array for biallelic variant
# iterate through every variant in the file
with BgenFile(BGEN_PATH, delay_parsing=True) as bfile:
    for var in bfile:
        dosage = var.minor_allele_dosage
# get all variants in a genomic region
variants = bfile.fetch('21', 10000, 5000000)
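As a rough illustration of working with the dosages shown above, the sketch below estimates a minor allele frequency per variant. The helper names (`mean_dosage`, `minor_allele_frequencies`) are hypothetical and not part of the package. It assumes diploid samples and that missing genotypes show up as NaN, both of which are worth checking against your own data. Stand-in objects are used so the sketch runs without a bgen file.

```python
import math
from types import SimpleNamespace

def mean_dosage(dosages):
    """Mean of the non-missing dosages, or None if every value is missing."""
    observed = [float(d) for d in dosages if not math.isnan(d)]
    if not observed:
        return None
    return sum(observed) / len(observed)

def minor_allele_frequencies(variants):
    """Map rsid -> estimated minor allele frequency (assumes diploid samples)."""
    freqs = {}
    for var in variants:
        mean = mean_dosage(var.minor_allele_dosage)
        if mean is not None:
            freqs[var.rsid] = mean / 2  # two allele copies per diploid sample
    return freqs

# Stand-in for variants; only the rsid and minor_allele_dosage attributes are used.
demo_variants = [
    SimpleNamespace(rsid='rs1', minor_allele_dosage=[0.0, 1.0, 2.0, float('nan')]),
    SimpleNamespace(rsid='rs2', minor_allele_dosage=[float('nan')]),
]
demo_freqs = minor_allele_frequencies(demo_variants)
print(demo_freqs)
```

With a real file, the same function could in principle be passed the result of `bfile.fetch('21', 10000, 5000000)`, since it only reads the `rsid` and `minor_allele_dosage` attributes.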
API documentation
class BgenFile(path, sample_path='', delay_parsing=False)
# opens a bgen file. If a bgenix index exists for the file, the index file
# will be opened automatically for quicker access of specific variants.
Arguments:
path: path to bgen file
sample_path: optional path to sample file. Samples will be given integer IDs
if sample file is not given and sample IDs not found in the bgen file
delay_parsing: True/False option to allow for not loading all variants into
memory when the BgenFile is opened. This can save time when iterating
across variants in the file
Attributes:
samples: list of sample IDs
header: BgenHeader with info about the bgen version and compression.
Methods:
indexing: BgenVars can be accessed by indexing the BgenFile e.g. bfile[1000]
iteration: variants in a BgenFile can be looped over e.g. for x in bfile: print(x)
fetch(chrom, start=None, stop=None): get all variants within a genomic region
drop_variants(list[int]): drops variants by index from being used in analyses
with_rsid(rsid): returns BgenVar with given rsid
at_position(pos): returns BgenVar with given position
varids(): returns list of varids for variants in the bgen file.
rsids(): returns list of rsids for variants in the bgen file.
chroms(): returns list of chromosomes for variants in the bgen file.
positions(): returns list of positions for variants in the bgen file.
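Since drop_variants takes a list of integer indices, one way to build that list is from the output of rsids(). The sketch below is a minimal example of that idea. The helper name `indices_to_drop` is hypothetical, and the rsID values are made up for the demonstration.

```python
def indices_to_drop(rsids, unwanted):
    """Return the positions in `rsids` whose ID appears in `unwanted`."""
    unwanted = set(unwanted)  # set lookup keeps this fast for long rsid lists
    return [i for i, rsid in enumerate(rsids) if rsid in unwanted]

# Made-up IDs standing in for the list returned by bfile.rsids()
example_rsids = ['rs10', 'rs20', 'rs30', 'rs20']
drop = indices_to_drop(example_rsids, ['rs20', 'rs99'])
print(drop)
```

With an open file, the result could then be passed on as `bfile.drop_variants(indices_to_drop(bfile.rsids(), unwanted))`.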
class BgenVar(handle, offset, layout, compression, n_samples):
# Note: this isn't called directly, but instead returned from BgenFile methods
Attributes:
varid: ID for variant
rsid: reference SNP ID for variant
chrom: chromosome variant is on
pos: nucleotide position variant is at
alleles: list of alleles for variant
is_phased: True/False for whether variant has phased genotype data
ploidy: list of ploidy for each sample. Samples are ordered as per BgenFile.samples
minor_allele: the least common allele (for biallelic variants)
minor_allele_dosage: 1D numpy array of minor allele dosages for each sample
probabilities: 2D numpy array of genotype probabilities, one sample per row
BgenVars can be pickled e.g. pickle.dumps(var)
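To show how the probabilities and minor_allele_dosage attributes relate, here is a sketch that derives a dosage from genotype probabilities. It is not the package's implementation. It assumes an unphased, biallelic, diploid variant with columns ordered AA, AB, BB, and it does not handle missing values. Both function names are hypothetical.

```python
def second_allele_dosage(probabilities):
    """Expected copies of the second allele per sample: P(AB) + 2 * P(BB)."""
    return [row[1] + 2 * row[2] for row in probabilities]

def minor_allele_dosage_sketch(probabilities):
    """Dosage of whichever allele is rarer across the samples."""
    dosage = second_allele_dosage(probabilities)
    if sum(dosage) / len(dosage) > 1:
        # Mean above 1 copy means the second allele is the common one,
        # so report copies of the first allele instead.
        dosage = [2 - d for d in dosage]
    return dosage

# One row per sample, columns assumed to be P(AA), P(AB), P(BB)
example_probs = [
    [1.0, 0.0, 0.0],
    [0.0, 1.0, 0.0],
    [0.0, 0.0, 1.0],
    [0.0, 0.0, 1.0],
]
example_dosage = minor_allele_dosage_sketch(example_probs)
print(example_dosage)
```

Plain lists are used so the sketch runs without numpy. The same row indexing works on a 2D numpy array.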