Package for loading data from bgen files
Another bgen reader
This is a package for reading bgen files.
This package uses Cython to wrap C++ code for parsing bgen files. It's fairly quick: it can parse genotypes from 500,000 individuals at ~300 variants per second within a single Python process (~450 million probabilities per second with a 3GHz CPU).
This has been optimized for UKBiobank bgen files (i.e. bgen version 1.2 with zlib-compressed 8-bit genotype probabilities), but the other bgen versions and zstd compression have also been tested using example bgen files.
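The two throughput figures quoted above are consistent with each other if each sample contributes three genotype probabilities per variant. That factor of three is an assumption here (it holds for biallelic, diploid, unphased genotypes); a quick arithmetic check:

```python
# Back-of-the-envelope check of the quoted throughput.
n_samples = 500_000          # individuals, as quoted above
variants_per_second = 300    # approximate parsing rate, as quoted above

# Assumption (not stated in the text): biallelic, diploid, unphased data,
# so each sample has three genotype probabilities (AA, AB, BB).
probs_per_sample = 3

probs_per_second = n_samples * variants_per_second * probs_per_sample
print(f"{probs_per_second:,} probabilities per second")
```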
Install
pip install bgen
Usage
from bgen import BgenFile
bfile = BgenFile(BGEN_PATH)
rsids = bfile.rsids()
# select a variant by indexing
var = bfile[1000]
# pull out genotype probabilities
probs = var.probabilities # returns 2D numpy array
dosage = var.minor_allele_dosage # returns 1D numpy array for biallelic variant
# iterate through every variant in the file
with BgenFile(BGEN_PATH, delay_parsing=True) as bfile:
    for var in bfile:
        dosage = var.minor_allele_dosage
# get all variants in a genomic region
variants = bfile.fetch('21', 10000, 5000000)
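As a rough sketch of what could be done with the variants returned for a region, the helper below computes the mean minor allele dosage per variant. The function name `mean_dosage_by_rsid` is hypothetical and not part of the package; it relies only on the `rsid` and `minor_allele_dosage` attributes documented here, and skipping NaN values is an assumption about how missing data might appear.

```python
import math

def mean_dosage_by_rsid(variants):
    """Map each variant's rsid to its mean minor allele dosage.

    `variants` is any iterable of objects exposing `rsid` and
    `minor_allele_dosage`, e.g. the result of BgenFile.fetch().
    NaN entries are skipped (an assumption about missing data).
    """
    means = {}
    for var in variants:
        values = [float(d) for d in var.minor_allele_dosage]
        observed = [d for d in values if not math.isnan(d)]
        means[var.rsid] = sum(observed) / len(observed) if observed else math.nan
    return means

# Possible usage with a real file (BGEN_PATH as in the usage above):
# with BgenFile(BGEN_PATH) as bfile:
#     print(mean_dosage_by_rsid(bfile.fetch('21', 10000, 5000000)))
```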
API documentation
class BgenFile(path, sample_path='', delay_parsing=False)
    # opens a bgen file. If a bgenix index exists for the file, the index file
    # will be opened automatically for quicker access to specific variants.
    Arguments:
      path: path to bgen file
      sample_path: optional path to sample file. Samples will be given integer IDs
          if a sample file is not given and sample IDs are not found in the bgen file
      delay_parsing: True/False option to allow for not loading all variants into
          memory when the BgenFile is opened. This can save time when iterating
          across variants in the file
    Attributes:
      samples: list of sample IDs
      header: BgenHeader with info about the bgen version and compression.
    Methods:
      indexing: BgenVars can be accessed by indexing the BgenFile e.g. bfile[1000]
      iteration: variants in a BgenFile can be looped over e.g. for x in bfile: print(x)
      fetch(chrom, start=None, stop=None): get all variants within a genomic region
      drop_variants(list[int]): drops variants by index from being used in analyses
      with_rsid(rsid): returns BgenVar with given rsid
      at_position(pos): returns BgenVar at given position
      varids(): returns list of varids for variants in the bgen file.
      rsids(): returns list of rsids for variants in the bgen file.
      chroms(): returns list of chromosomes for variants in the bgen file.
      positions(): returns list of positions for variants in the bgen file.
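The list-returning methods above can be combined with indexing and drop_variants(). The sketch below is illustrative only: `index_by_rsid` and `indices_outside` are hypothetical helpers, not part of the package, and they assume that rsids(), chroms() and positions() return lists in the same order as the variant indices, with chromosomes given as strings.

```python
def index_by_rsid(bfile):
    """Map each rsid to the index of its first occurrence in the file."""
    lookup = {}
    for i, rsid in enumerate(bfile.rsids()):
        lookup.setdefault(rsid, i)  # keep the first index for duplicated rsids
    return lookup

def indices_outside(bfile, chrom, start, stop):
    """Indices of variants that are not on `chrom` within [start, stop]."""
    return [
        i
        for i, (c, p) in enumerate(zip(bfile.chroms(), bfile.positions()))
        if c != chrom or not (start <= p <= stop)
    ]

# Possible usage with an open BgenFile called bfile:
# var = bfile[index_by_rsid(bfile)['rs123']]   # 'rs123' is a placeholder ID
# bfile.drop_variants(indices_outside(bfile, '21', 10000, 5000000))
```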
class BgenVar(handle, offset, layout, compression, n_samples):
    # Note: this isn't called directly, but instead returned from BgenFile methods
    Attributes:
      varid: ID for variant
      rsid: reference SNP ID for variant
      chrom: chromosome variant is on
      pos: nucleotide position variant is at
      alleles: list of alleles for variant
      is_phased: True/False for whether variant has phased genotype data
      ploidy: list of ploidy for each sample. Samples are ordered as per BgenFile.samples
      minor_allele: the least common allele (for biallelic variants)
      minor_allele_dosage: 1D numpy array of minor allele dosages for each sample
      probabilities: 2D numpy array of genotype probabilities, one sample per row

    BgenVars can be pickled e.g. pickle.dumps(var)
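To make the relationship between the genotype probabilities and an allele dosage concrete, here is a minimal sketch. It assumes an unphased, diploid, biallelic variant whose probability rows are ordered [P(AA), P(AB), P(BB)]; that column order is an assumption, and `allele_dosages` is a hypothetical helper rather than the package's own implementation. Whether the result matches minor_allele_dosage depends on which allele is the minor one.

```python
def allele_dosages(probabilities, allele_index=1):
    """Expected number of copies of one allele for each sample.

    Assumes each row is [P(AA), P(AB), P(BB)] for an unphased, diploid,
    biallelic variant. allele_index=0 counts allele A, 1 counts allele B.
    Works on nested lists or on a 2D numpy array iterated row by row.
    """
    dosages = []
    for p_aa, p_ab, p_bb in probabilities:
        if allele_index == 1:
            dosages.append(p_ab + 2 * p_bb)  # one copy in AB, two in BB
        else:
            dosages.append(p_ab + 2 * p_aa)  # one copy in AB, two in AA
    return dosages

rows = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.25, 0.5, 0.25]]
dosage_b = allele_dosages(rows)                  # dosage of allele B
dosage_a = allele_dosages(rows, allele_index=0)  # dosage of allele A
```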