Package for loading data from bgen files
Another bgen parser
This is a package for reading and writing bgen files.
This package uses Cython to wrap C++ code for parsing bgen files. It's fairly quick: it can parse genotypes from 500,000 individuals at ~300 variants per second within a single Python process (~450 million probabilities per second with a 3 GHz CPU). Decompressing the genotype probabilities can be the slow step: zlib decompression takes 80% of the total time, and using zstd-compressed genotypes is ~2X faster.
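The throughput figures can be cross-checked with a little arithmetic. This sketch assumes unphased, diploid, biallelic variants, i.e. three genotype probabilities per sample; that assumption is not stated above, but it is the one that reproduces the ~450 million figure.

```python
# Assumption: biallelic, diploid, unphased data, so each sample has three
# genotype probabilities (hom-ref, het, hom-alt) per variant.
n_samples = 500_000
variants_per_second = 300
probs_per_sample = 3

probs_per_second = n_samples * variants_per_second * probs_per_sample
print(f"{probs_per_second:,} probabilities per second")

# If zlib decompression takes 80% of the total time, this is roughly how
# many milliseconds of each variant's budget go to decompression.
ms_per_variant = 1000 / variants_per_second
zlib_ms_per_variant = 0.8 * ms_per_variant
print(f"{zlib_ms_per_variant:.2f} ms of {ms_per_variant:.2f} ms per variant")
```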
It has been optimized for UKBiobank bgen files (i.e. bgen version 1.2 with zlib-compressed 8-bit genotype probabilities), but the other bgen versions and zstd compression have also been tested using example bgen files.
Install
pip install bgen
Usage
from bgen import BgenReader, BgenWriter
bfile = BgenReader(BGEN_PATH)
rsids = bfile.rsids()
# select a variant by indexing
var = bfile[1000]
# pull out genotype probabilities
probs = var.probabilities # returns 2D numpy array
dosage = var.minor_allele_dosage # returns 1D numpy array for biallelic variant
# iterate through every variant in the file
with BgenReader(BGEN_PATH, delay_parsing=True) as bfile:
    for var in bfile:
        dosage = var.minor_allele_dosage
# get all variants in a genomic region
variants = bfile.fetch('21', 10000, 5000000)
# or for writing bgen files
import numpy as np
from bgen import BgenWriter
geno = np.array([[1, 0, 0], [0, 1, 0], [0, 0, 1]]).astype(np.float64)
with BgenWriter(BGEN_PATH, n_samples=3) as bfile:
    bfile.add_variant(varid='var1', rsid='rs1', chrom='chr1', pos=1,
                      alleles=['A', 'G'], genotypes=geno)
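As an illustration of what the iteration pattern is useful for, here is a minimal sketch that turns per-variant dosages into minor allele frequencies. The helper `minor_allele_frequencies` is hypothetical (not part of the package), it assumes diploid samples with no missing data, and it runs on stand-in objects so that no bgen file is needed.

```python
from types import SimpleNamespace

def minor_allele_frequencies(variants):
    """Map rsid -> minor allele frequency, assuming diploid samples."""
    freqs = {}
    for var in variants:
        dosage = list(var.minor_allele_dosage)
        # each diploid sample carries two allele copies
        freqs[var.rsid] = sum(dosage) / (2 * len(dosage))
    return freqs

# Stand-ins for BgenVar objects, with made-up dosages for four samples.
fake_variants = [
    SimpleNamespace(rsid='rs1', minor_allele_dosage=[0.0, 1.0, 0.0, 1.0]),
    SimpleNamespace(rsid='rs2', minor_allele_dosage=[0.0, 0.0, 0.0, 1.0]),
]
freqs = minor_allele_frequencies(fake_variants)
print(freqs)
```

With a real file, the same function could presumably be given a reader opened as `BgenReader(BGEN_PATH, delay_parsing=True)`, since only the `rsid` and `minor_allele_dosage` attributes are used.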
API documentation
class BgenReader(path, sample_path='', delay_parsing=False)
    # opens a bgen file. If a bgenix index exists for the file, the index file
    # will be opened automatically for quicker access of specific variants.
    Arguments:
      path: path to bgen file
      sample_path: optional path to sample file. Samples will be given integer IDs
          if the sample file is not given and sample IDs are not found in the bgen file
      delay_parsing: True/False option to allow for not loading all variants into
          memory when the BgenReader is opened. This can save time when iterating
          across variants in the file
    Attributes:
      samples: list of sample IDs
      header: BgenHeader with info about the bgen version and compression.
    Methods:
      slicing: BgenVars can be accessed by slicing the BgenReader e.g. bfile[1000]
      iteration: variants in a BgenReader can be looped over e.g. for x in bfile: print(x)
      fetch(chrom, start=None, stop=None): get all variants within a genomic region
      drop_variants(list[int]): drops variants by index from being used in analyses
      with_rsid(rsid): returns BgenVar with given rsid
      at_position(pos): returns BgenVar at a given position
      varids(): returns list of varids for variants in the bgen file.
      rsids(): returns list of rsids for variants in the bgen file.
      chroms(): returns list of chromosomes for variants in the bgen file.
      positions(): returns list of positions for variants in the bgen file.
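The `fetch` signature above lends itself to small region helpers. The sketch below is one possible wrapper, not something the package provides: `variants_near` and `FakeReader` are hypothetical names, and the stand-in reader treats both region bounds as inclusive, which may not match how the real `fetch` handles its boundaries.

```python
from types import SimpleNamespace

def variants_near(reader, chrom, centre, flank=1000):
    """List (rsid, pos) for variants within `flank` bases of `centre`."""
    start = max(1, centre - flank)
    stop = centre + flank
    return [(var.rsid, var.pos) for var in reader.fetch(chrom, start, stop)]

class FakeReader:
    """Stand-in exposing only fetch(); not part of the bgen package."""
    def __init__(self, variants):
        self._variants = variants

    def fetch(self, chrom, start=None, stop=None):
        for var in self._variants:
            if var.chrom != chrom:
                continue
            if start is not None and var.pos < start:
                continue
            if stop is not None and var.pos > stop:
                continue
            yield var

reader = FakeReader([
    SimpleNamespace(rsid='rs1', chrom='21', pos=9_500),
    SimpleNamespace(rsid='rs2', chrom='21', pos=10_400),
    SimpleNamespace(rsid='rs3', chrom='21', pos=50_000),
    SimpleNamespace(rsid='rs4', chrom='22', pos=10_000),
])
nearby = variants_near(reader, '21', centre=10_000, flank=1_000)
print(nearby)
```

Because the helper only relies on `fetch(chrom, start, stop)` and the `rsid` and `pos` attributes, an opened `BgenReader` should be usable in place of the stand-in.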
class BgenVar(handle, offset, layout, compression, n_samples):
    # Note: this isn't called directly, but instead returned from BgenReader methods
    Attributes:
      varid: ID for variant
      rsid: reference SNP ID for variant
      chrom: chromosome variant is on
      pos: nucleotide position variant is at
      alleles: list of alleles for variant
      is_phased: True/False for whether variant has phased genotype data
      ploidy: list of ploidy for each sample. Samples are ordered as per BgenReader.samples
      minor_allele: the least common allele (for biallelic variants)
      minor_allele_dosage: 1D numpy array of minor allele dosages for each sample
      alt_dosage: 1D numpy array of alt allele dosages for each sample
      probabilities: 2D numpy array of genotype probabilities, one sample per row
    BgenVars can be pickled e.g. pickle.dumps(var)
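To make the relationship between `probabilities`, `alt_dosage` and `minor_allele_dosage` concrete, here is a minimal sketch for the unphased, diploid, biallelic case. It assumes each row is ordered (hom-ref, het, hom-alt) and uses plain lists instead of numpy arrays; it illustrates the idea and is not the package's implementation.

```python
def alt_dosage(probabilities):
    """Expected alt-allele count per sample from (hom-ref, het, hom-alt) rows."""
    return [het + 2 * hom_alt for _hom_ref, het, hom_alt in probabilities]

def minor_allele_dosage(probabilities):
    """Expected count of the less common allele per sample (diploid only)."""
    alt = alt_dosage(probabilities)
    # A mean alt dosage above 1 means the alt allele frequency exceeds 0.5,
    # so the ref allele is the minor one and the dosage is flipped.
    if sum(alt) > len(alt):
        return [2 - d for d in alt]
    return alt

# Made-up probabilities for four samples.
probs = [
    [1.0, 0.0, 0.0],  # hom-ref
    [0.0, 1.0, 0.0],  # het
    [0.0, 0.0, 1.0],  # hom-alt
    [0.0, 0.0, 1.0],  # hom-alt
]
alt = alt_dosage(probs)
minor = minor_allele_dosage(probs)
print(alt, minor)
```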
class BgenWriter(path, n_samples, samples=[], compression='zstd', layout=2, metadata=None)
    # opens a bgen file to write variants to. Automatically makes a bgenix index file
    Arguments:
      path: path to write data to
      n_samples: number of samples that you have data for
      samples: list of sample IDs (its length must equal n_samples)
      compression: compression type: None, 'zstd', or 'zlib' (default='zstd')
      layout: bgen layout format (default=2)
      metadata: any additional metadata you want to include in the file (as str)
    Methods:
      add_variant(varid, rsid, chrom, pos, alleles, genotypes, ploidy=2,
                  phased=False, bit_depth=8)
        Arguments:
          varid: variant ID e.g. 'var1'
          rsid: reference SNP ID e.g. 'rs1'
          chrom: chromosome the variant is on e.g. 'chr1'
          pos: nucleotide position of the variant e.g. 100
          alleles: list of allele strings e.g. ['A', 'C']
          genotypes: numpy array of genotype probabilities, ordered as per the
              bgen samples e.g. np.array([[0, 0, 1], [0.5, 0.5, 0]])
          ploidy: ploidy state, either as integer to indicate constant ploidy
              (e.g. 2), or numpy array of ploidy values per sample, e.g. np.array([1, 2, 2])
          phased: whether the genotypes are for phased data or not (default=False)
          bit_depth: how many bits to store each genotype as (1-32, default=8)
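The `bit_depth` argument controls how finely probabilities can be represented. As a rough sketch, assuming the layout 2 convention of storing each probability as an integer out of 2^bit_depth - 1, the precision at a given depth can be estimated as below. The `quantise` and `dequantise` helpers are hypothetical and use simple rounding; the real encoder also has to keep each sample's stored values summing to the full scale, which this ignores.

```python
def quantise(prob, bit_depth=8):
    """Nearest stored integer for a probability at the given bit depth."""
    if not 1 <= bit_depth <= 32:
        raise ValueError('bit_depth must be between 1 and 32')
    scale = (1 << bit_depth) - 1
    return round(prob * scale)

def dequantise(value, bit_depth=8):
    """Probability represented by a stored integer."""
    return value / ((1 << bit_depth) - 1)

stored_8 = quantise(0.25, bit_depth=8)
stored_16 = quantise(0.25, bit_depth=16)
error_8 = abs(dequantise(stored_8, 8) - 0.25)
error_16 = abs(dequantise(stored_16, 16) - 0.25)
print(stored_8, error_8)
print(stored_16, error_16)
```

Higher bit depths reduce the rounding error at the cost of larger files, so the 8-bit default is a trade-off between precision and size.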