Package for loading data from bgen files
Another bgen parser
This is a package for reading and writing bgen files.
This package uses Cython to wrap C++ code for parsing bgen files. It's fairly quick: it can parse genotypes from 500,000 individuals at ~300 variants per second within a single Python process (~450 million probabilities per second with a 3 GHz CPU). Decompressing the genotype probabilities can be the slow step: zlib decompression takes 80% of the total time, and using zstd compressed genotypes is ~2X faster.
This has been optimized for UKBiobank bgen files (i.e. bgen version 1.2 with zlib compressed 8-bit genotype probabilities), but the other bgen versions and zstd compression have also been tested using example bgen files.
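The throughput figures above can be checked with a little arithmetic. This is only a sketch: it assumes biallelic, diploid, unphased variants (three genotype probabilities per sample), and the one-million-variant file is a made-up example rather than a measured benchmark.

```python
# Figures quoted in the description above
n_samples = 500_000
variants_per_second = 300

# Assumption: biallelic, diploid, unphased variants, which carry three
# genotype probabilities per sample (hom-ref, het, hom-alt)
probs_per_sample = 3

probs_per_second = n_samples * variants_per_second * probs_per_sample
print(f"{probs_per_second:,} probabilities per second")

# Hypothetical workload: time to scan a file with one million variants
n_variants = 1_000_000
minutes = n_variants / variants_per_second / 60
print(f"~{minutes:.0f} minutes to scan {n_variants:,} variants")
```

At that rate a single process needs roughly an hour per million variants, so splitting work by chromosome or region across processes is worth considering for large files.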
Install
pip install bgen
Usage
from bgen import BgenReader, BgenWriter
bfile = BgenReader(BGEN_PATH)
rsids = bfile.rsids()
# select a variant by indexing
var = bfile[1000]
# pull out genotype probabilities
probs = var.probabilities # returns 2D numpy array
dosage = var.minor_allele_dosage # returns 1D numpy array for biallelic variant
# iterate through every variant in the file
with BgenReader(BGEN_PATH, delay_parsing=True) as bfile:
    for var in bfile:
        dosage = var.minor_allele_dosage
# get all variants in a genomic region
variants = bfile.fetch('21', 10000, 5000000)
# or for writing bgen files
import numpy as np
from bgen import BgenWriter
geno = np.array([[1, 0, 0], [0, 1, 0], [0, 0, 1]]).astype(np.float64)
with BgenWriter(BGEN_PATH, n_samples=3) as bfile:
    bfile.add_variant(varid='var1', rsid='rs1', chrom='chr1', pos=1,
                      alleles=['A', 'G'], genotypes=geno)
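To make the link between the 2D `probabilities` array and the 1D dosage arrays in the reading example more concrete, here is a rough sketch that derives dosages by hand with numpy. It assumes a biallelic, diploid, unphased variant whose three columns are ordered P(AA), P(AB), P(BB). The helper names `allele_dosages` and `minor_dosage` are hypothetical and not part of the package, which may compute its dosages differently.

```python
import numpy as np

def allele_dosages(probs):
    """Expected allele counts per sample from an (n_samples, 3) array.

    Assumed column order: P(AA), P(AB), P(BB).
    """
    probs = np.asarray(probs, dtype=np.float64)
    first = 2 * probs[:, 0] + probs[:, 1]    # expected copies of allele A
    second = probs[:, 1] + 2 * probs[:, 2]   # expected copies of allele B
    return first, second

def minor_dosage(probs):
    """Return (allele index, dosage array) for the less common allele."""
    first, second = allele_dosages(probs)
    # nansum skips samples whose probabilities are missing (NaN)
    if np.nansum(first) < np.nansum(second):
        return 0, first
    return 1, second

probs = np.array([[1.0, 0.0, 0.0],
                  [0.5, 0.5, 0.0],
                  [0.0, 1.0, 0.0],
                  [1.0, 0.0, 0.0]])
idx, dosage = minor_dosage(probs)
print(idx, dosage)
```

Here the second allele is the rarer one, so its per-sample expected count is returned.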
API documentation
class BgenReader(path, sample_path='', delay_parsing=False)
    # opens a bgen file. If a bgenix index exists for the file, the index file
    # will be opened automatically for quicker access of specific variants.
    Arguments:
        path: path to bgen file
        sample_path: optional path to sample file. Samples will be given integer IDs
            if sample file is not given and sample IDs not found in the bgen file
        delay_parsing: True/False option to allow for not loading all variants into
            memory when the BgenReader is opened. This can save time when iterating
            across variants in the file
    Attributes:
        samples: list of sample IDs
        header: BgenHeader with info about the bgen version and compression.
    Methods:
        indexing: BgenVars can be accessed by indexing the BgenReader e.g. bfile[1000]
        iteration: variants in a BgenReader can be looped over e.g. for x in bfile: print(x)
        fetch(chrom, start=None, stop=None): get all variants within a genomic region
        drop_variants(list[int]): drops variants by index from being used in analyses
        with_rsid(rsid): returns BgenVar with given rsid
        at_position(pos): returns BgenVar at a given position
        varids(): returns list of varids for variants in the bgen file.
        rsids(): returns list of rsids for variants in the bgen file.
        chroms(): returns list of chromosomes for variants in the bgen file.
        positions(): returns list of positions for variants in the bgen file.
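The per-variant lists returned by `chroms()` and `positions()` can also be combined by hand, for example to find the integer indices that `bfile[i]` or `drop_variants` expect. The sketch below uses plain lists as stand-ins for those return values. `indices_in_region` is a hypothetical helper, not part of the package, and its inclusive bounds are an assumption of the sketch rather than a statement about how `fetch` treats its endpoints.

```python
def indices_in_region(chroms, positions, chrom, start=None, stop=None):
    """Return indices of variants on `chrom` with start <= pos <= stop.

    Hypothetical helper; inclusive bounds are an assumption of this sketch.
    """
    hits = []
    for i, (c, pos) in enumerate(zip(chroms, positions)):
        if c != chrom:
            continue
        if start is not None and pos < start:
            continue
        if stop is not None and pos > stop:
            continue
        hits.append(i)
    return hits

# Stand-in data; with a real file these would come from
# bfile.chroms() and bfile.positions()
chroms = ['21', '21', '21', '22']
positions = [5000, 10000, 750000, 20000]

in_region = indices_in_region(chroms, positions, '21', 10000, 5000000)
print(in_region)
```

The resulting list could then be used to index the reader one variant at a time, or inverted to build a list of indices to drop.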
class BgenVar(handle, offset, layout, compression, n_samples):
    # Note: this isn't called directly, but instead returned from BgenReader methods
    Attributes:
        varid: ID for variant
        rsid: reference SNP ID for variant
        chrom: chromosome variant is on
        pos: nucleotide position variant is at
        alleles: list of alleles for variant
        is_phased: True/False for whether variant has phased genotype data
        ploidy: list of ploidy for each sample. Samples are ordered as per BgenReader.samples
        minor_allele: the least common allele (for biallelic variants)
        minor_allele_dosage: 1D numpy array of minor allele dosages for each sample
        alt_dosage: 1D numpy array of alt allele dosages for each sample
        probabilities: 2D numpy array of genotype probabilities, one sample per row
    BgenVars can be pickled e.g. pickle.dumps(var)
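The width of the `probabilities` array depends on ploidy, the number of alleles, and whether the data are phased. As a rough guide, the sketch below counts the probabilities per sample using the enumeration in the bgen layout 2 specification. It assumes the array holds the full set of probabilities for each sample; how the package handles samples of mixed ploidy is not covered here. `n_probability_columns` is a hypothetical helper, not part of the package.

```python
from math import comb

def n_probability_columns(ploidy, n_alleles, phased):
    """Number of probabilities per sample for one variant.

    Phased: one probability per allele for each haplotype.
    Unphased: one probability per unordered genotype, i.e. the number of
    ways to choose `ploidy` alleles from `n_alleles` with repetition.
    """
    if phased:
        return ploidy * n_alleles
    return comb(ploidy + n_alleles - 1, n_alleles - 1)

print(n_probability_columns(2, 2, phased=False))  # biallelic diploid, unphased
print(n_probability_columns(2, 2, phased=True))   # biallelic diploid, phased
print(n_probability_columns(2, 3, phased=False))  # triallelic diploid, unphased
```

The first case matches the three-column array used in the writing example in the usage section.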
class BgenWriter(path, n_samples, samples=[], compression='zstd', layout=2, metadata=None)
    # opens a bgen file to write variants to. Automatically makes a bgenix index file
    Arguments:
        path: path to write data to
        n_samples: number of samples that you have data for
        samples: list of sample IDs (length must equal n_samples)
        compression: compression type: None, 'zstd', or 'zlib' (default='zstd')
        layout: bgen layout format (default=2)
        metadata: any additional metadata you want to include in the file (as str)
    Methods:
        add_variant(varid, rsid, chrom, pos, alleles, genotypes, ploidy=2,
                    phased=False, bit_depth=8)
            Arguments:
                varid: variant ID e.g. 'var1'
                rsid: reference SNP ID e.g. 'rs1'
                chrom: chromosome the variant is on e.g. 'chr1'
                pos: nucleotide position of the variant e.g. 100
                alleles: list of allele strings e.g. ['A', 'C']
                genotypes: numpy array of genotype probabilities, ordered as per the
                    bgen samples e.g. np.array([[0, 0, 1], [0.5, 0.5, 0]])
                ploidy: ploidy state, either as integer to indicate constant ploidy
                    (e.g. 2), or numpy array of ploidy values per sample, e.g. np.array([1, 2, 2])
                phased: whether the genotypes are for phased data or not (default=False)
                bit_depth: how many bits to store each genotype as (1-32, default=8)
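The `bit_depth` argument trades file size against precision. In bgen layout 2, each probability is stored as an unsigned integer out of 2^bit_depth - 1, so higher bit depths leave smaller rounding errors. The sketch below illustrates that for a single value. `quantize` and `dequantize` are hypothetical helpers, not the package's encoder, and a real encoder also has to keep each sample's probabilities summing to one after rounding, which this sketch ignores.

```python
def quantize(prob, bit_depth=8):
    """Map a probability in [0, 1] to an unsigned integer of `bit_depth` bits."""
    if not 1 <= bit_depth <= 32:
        raise ValueError("bit_depth must be between 1 and 32")
    if not 0.0 <= prob <= 1.0:
        raise ValueError("prob must be between 0 and 1")
    max_value = (1 << bit_depth) - 1  # 255 for 8 bits
    return round(prob * max_value)

def dequantize(value, bit_depth=8):
    """Recover the approximate probability from its stored integer."""
    return value / ((1 << bit_depth) - 1)

# Compare the rounding error for the same probability at two bit depths
for bits in (8, 16):
    stored = quantize(0.25, bits)
    recovered = dequantize(stored, bits)
    print(bits, stored, abs(recovered - 0.25))
```

Probabilities of exactly 0 or 1, as in hard-called genotypes, survive any bit depth unchanged, so the default of 8 bits loses nothing for such data.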