Package for loading data from bgen files
Another bgen reader
This is a package for reading bgen files.
This package uses Cython to wrap C++ code for parsing bgen files. It is fairly quick: it can parse genotypes from 500,000 individuals at ~300 variants per second within a single Python process (~450 million probabilities per second with a 3 GHz CPU). Decompressing the genotype probabilities is the slow step. zlib decompression takes 80% of the total time, so zstd-compressed genotypes would likely be much faster, perhaps 2-3X.
This has been optimized for UK Biobank bgen files (i.e. bgen version 1.2 with zlib-compressed 8-bit genotype probabilities), but the other bgen versions and zstd compression have also been tested using example bgen files.
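The throughput figures above can be checked with a little arithmetic. The sketch below assumes three probabilities per sample (one per genotype at a diploid, biallelic, unphased variant), and applies Amdahl's law to the 80% decompression share to estimate how a faster decompressor might affect the whole run. The function names are made up for this illustration and are not part of the package.

```python
def probabilities_per_second(n_samples, variants_per_second, probs_per_sample=3):
    """Genotype probabilities parsed per second.

    probs_per_sample=3 is an assumption: one probability per genotype
    (hom-first, het, hom-second) at a diploid, biallelic, unphased variant.
    """
    return n_samples * variants_per_second * probs_per_sample


def overall_speedup(decompress_fraction, decompress_speedup):
    """Amdahl's law: whole-run speedup when only decompression gets faster."""
    remaining = (1.0 - decompress_fraction) + decompress_fraction / decompress_speedup
    return 1.0 / remaining


# 500,000 samples at 300 variants per second
print(probabilities_per_second(500_000, 300))  # 450000000

# Overall speedup if decompression alone became 2x, 4x or 8x faster
for factor in (2, 4, 8):
    print(factor, round(overall_speedup(0.8, factor), 2))
```

Under these assumptions the overall gain is capped at 5X, however fast decompression becomes, because the remaining 20% of the time is unaffected.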
Install
pip install bgen
Usage
from bgen import BgenFile
bfile = BgenFile(BGEN_PATH)
rsids = bfile.rsids()
# select a variant by indexing
var = bfile[1000]
# pull out genotype probabilities
probs = var.probabilities # returns 2D numpy array
dosage = var.minor_allele_dosage # returns 1D numpy array for biallelic variant
# iterate through every variant in the file
with BgenFile(BGEN_PATH, delay_parsing=True) as bfile:
    for var in bfile:
        dosage = var.minor_allele_dosage
# get all variants in a genomic region
variants = bfile.fetch('21', 10000, 5000000)
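As an illustration of what the per-variant dosages can be used for, here is a minimal sketch that estimates a minor allele frequency from a dosage vector. It assumes diploid samples and that missing genotypes show up as NaN; both are assumptions rather than documented behaviour. `minor_allele_frequency` and `scan_frequencies` are hypothetical helpers, not functions provided by the package.

```python
import math


def minor_allele_frequency(dosages):
    """Estimate minor allele frequency from per-sample dosages (0 to 2).

    Assumes diploid samples and that missing genotypes appear as NaN.
    Accepts any iterable of floats, including a 1D numpy array.
    """
    total = 0.0
    n_called = 0
    for dosage in dosages:
        if math.isnan(dosage):
            continue  # skip samples without a genotype call
        total += dosage
        n_called += 1
    if n_called == 0:
        return float('nan')
    return total / (2 * n_called)


def scan_frequencies(bgen_path):
    """Yield (rsid, frequency) for every variant in a bgen file."""
    # Imported here so the helper above works without the package installed.
    from bgen import BgenFile
    with BgenFile(bgen_path, delay_parsing=True) as bfile:
        for var in bfile:
            yield var.rsid, minor_allele_frequency(var.minor_allele_dosage)


# Toy dosages for five samples, one of them missing
print(minor_allele_frequency([0.0, 1.0, 2.0, float('nan'), 0.0]))  # 0.375
```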
API documentation
class BgenFile(path, sample_path='', delay_parsing=False)
# opens a bgen file. If a bgenix index exists for the file, the index file
# will be opened automatically for quicker access of specific variants.
Arguments:
path: path to bgen file
sample_path: optional path to sample file. Samples will be given integer IDs
if sample file is not given and sample IDs not found in the bgen file
delay_parsing: True/False option to allow for not loading all variants into
memory when the BgenFile is opened. This can save time when iterating
across variants in the file
Attributes:
samples: list of sample IDs
header: BgenHeader with info about the bgen version and compression.
Methods:
indexing: BgenVars can be accessed by indexing the BgenFile e.g. bfile[1000]
iteration: variants in a BgenFile can be looped over e.g. for x in bfile: print(x)
fetch(chrom, start=None, stop=None): get all variants within a genomic region
drop_variants(list[int]): drops variants by index from being used in analyses
with_rsid(rsid): returns BgenVar with given rsid
at_position(pos): returns BgenVar with given position
varids(): returns list of varids for variants in the bgen file.
rsids(): returns list of rsids for variants in the bgen file.
chroms(): returns list of chromosomes for variants in the bgen file.
positions(): returns list of positions for variants in the bgen file.
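One possible way to combine chroms(), positions() and drop_variants() is to collect the indices of variants that repeat an earlier chromosome and position, then drop them. The sketch below only computes the index list; `duplicate_position_indices` is a hypothetical helper, the toy inputs stand in for the lists returned by the package, and whether dropping repeated positions is appropriate depends on the analysis.

```python
def duplicate_position_indices(chroms, positions):
    """Return indices of variants whose (chrom, pos) repeats an earlier variant."""
    seen = set()
    duplicates = []
    for i, key in enumerate(zip(chroms, positions)):
        if key in seen:
            duplicates.append(i)
        else:
            seen.add(key)
    return duplicates


# Stand-ins for bfile.chroms() and bfile.positions(); values are made up.
chroms = ['21', '21', '21', '22']
positions = [100, 100, 250, 100]
print(duplicate_position_indices(chroms, positions))  # [1]

# With an open BgenFile this might be used as:
#   to_drop = duplicate_position_indices(bfile.chroms(), bfile.positions())
#   bfile.drop_variants(to_drop)
```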
class BgenVar(handle, offset, layout, compression, n_samples):
# Note: this isn't called directly, but instead returned from BgenFile methods
Attributes:
varid: ID for variant
rsid: reference SNP ID for variant
chrom: chromosome variant is on
pos: nucleotide position variant is at
alleles: list of alleles for variant
is_phased: True/False for whether variant has phased genotype data
ploidy: list of ploidy for each sample. Samples are ordered as per BgenFile.samples
minor_allele: the least common allele (for biallelic variants)
minor_allele_dosage: 1D numpy array of minor allele dosages for each sample
probabilities: 2D numpy array of genotype probabilities, one sample per row
BgenVars can be pickled e.g. pickle.dumps(var)
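To show how the probabilities and a dosage relate, here is a minimal sketch for a diploid, biallelic, unphased variant. It assumes each row holds three probabilities in the order (homozygous first allele, heterozygous, homozygous second allele); that column order is an assumption here, not something stated above. The function names are hypothetical. The result is the dosage of the second allele, which matches the minor allele dosage only when the second allele is the minor one.

```python
def second_allele_dosage(prob_row):
    """Expected count of the second allele for one sample.

    Assumes prob_row is (P(hom first), P(het), P(hom second)).
    """
    p_hom_first, p_het, p_hom_second = prob_row
    return p_het + 2.0 * p_hom_second


def dosages(prob_rows):
    """Apply second_allele_dosage to each row (one row per sample)."""
    return [second_allele_dosage(row) for row in prob_rows]


# Toy probabilities for three samples
rows = [
    [1.0, 0.0, 0.0],    # confidently homozygous for the first allele
    [0.0, 1.0, 0.0],    # confidently heterozygous
    [0.25, 0.5, 0.25],  # uncertain call
]
print(dosages(rows))  # [0.0, 1.0, 1.0]
```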