Bgen file format reader
Project description
bgen-reader
A BGEN file format reader.
BGEN is a file format for storing large genetic datasets. It supports both unphased genotypes and phased haplotype data with variable ploidy and number of alleles. It was designed to provides a compact data representation without sacrificing variant access performance.
This Python package is a wrapper around the bgen library, a low-memory footprint reader that efficiently reads BGEN files. It fully supports the BGEN format specifications: 1.2 and 1.3; as well as their optional compressed formats.
Table of Contents
Install
The recommended way to install this package is via conda
conda install -c conda-forge bgen-reader
Alternatively, it can be installed using the pip command
pip install bgen-reader
However, this method will require that the bgen C library has been installed before.
Usage
The following examples assume you have downloaded the example.bgen
,
haplotypes.bgen
, and complex.bgen
files (found in this repository) to the
directory you are executing Python.
Unphased genotype
>>> from bgen_reader import read_bgen
>>>
>>> bgen = read_bgen("example.bgen", verbose=False)
>>>
>>> print(bgen["variants"].head())
id rsid chrom pos nalleles allele_ids
0 SNPID_2 RSID_2 01 2000 2 A,G
1 SNPID_3 RSID_3 01 3000 2 A,G
2 SNPID_4 RSID_4 01 4000 2 A,G
3 SNPID_5 RSID_5 01 5000 2 A,G
4 SNPID_6 RSID_6 01 6000 2 A,G
>>> print(bgen["samples"].head())
id
0 sample_001
1 sample_002
2 sample_003
3 sample_004
4 sample_005
>>> print(len(bgen["genotype"]))
199
>>> p = bgen["genotype"][0].compute()
>>> print(p)
[[ nan nan nan]
[0.02780236 0.00863674 0.9635609 ]
[0.01736504 0.04968414 0.93295083]
...
[0.01419069 0.02810669 0.95770262]
[0.91949463 0.05206298 0.02844239]
[0.00244141 0.98410029 0.0134583 ]]
>>> print(p.shape)
(500, 3)
The example.bgen
file can be found in the example
folder, as
well as the next ones.
Phased genotype
>>> from bgen_reader import read_bgen
>>> bgen = read_bgen("haplotypes.bgen", verbose=False)
>>>
>>> print(bgen["variants"].head())
id rsid chrom pos nalleles allele_ids
0 SNP1 RS1 1 1 2 A,G
1 SNP2 RS2 1 2 2 A,G
2 SNP3 RS3 1 3 2 A,G
3 SNP4 RS4 1 4 2 A,G
>>> print(bgen["samples"].head())
id
0 sample_0
1 sample_1
2 sample_2
3 sample_3
>>> # Print the estimated probabilities for the first variant
>>> # and second individual.
>>> print(bgen["genotype"][0, 1].compute())
[0. 1. 1. 0.]
>>> # Is it a phased one?
>>> print(bgen["X"][0, 1].compute().sel(data="phased").item())
1
>>> # How many haplotypes?
>>> print(bgen["X"][0, 1].compute().sel(data="ploidy").item())
2
>>> # And how many alleles?
>>> print(bgen["variants"].loc[0, "nalleles"])
2
>>> # Therefore, the first haplotype has probability 100%
>>> # of having the allele
>>> print(bgen["variants"].loc[0, "allele_ids"].split(",")[1])
G
>>> # And the second haplotype has probability 100% of having
>>> # the first allele
>>> print(bgen["variants"].loc[0, "allele_ids"].split(",")[0])
A
Complex file
>>> from bgen_reader import read_bgen, convert_to_dosage
>>>
>>> bgen = read_bgen("complex.bgen", verbose=False)
>>>
>>> print(bgen["variants"])
id rsid chrom pos nalleles allele_ids
0 V1 01 1 2 A,G
1 V2.1 V2 01 2 2 A,G
2 V3 01 3 2 A,G
3 M4 01 4 3 A,G,T
4 M5 01 5 2 A,G
5 M6 01 7 4 A,G,GT,GTT
6 M7 01 7 6 A,G,GT,GTT,GTTT,GTTTT
7 M8 01 8 7 A,G,GT,GTT,GTTT,GTTTT,GTTTTT
8 M9 01 9 8 A,G,GT,GTT,GTTT,GTTTT,GTTTTT,GTTTTTT
9 M10 01 10 2 A,G
>>> print(bgen["samples"])
id
0 sample_0
1 sample_1
2 sample_2
3 sample_3
>>> # Print the estimated probabilities for the first variant
>>> # and second individual.
>>> print(bgen["genotype"][0, 1].compute())
[ 1. 0. 0. nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan
nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan]
>>> # The NaN elements are a by-product of the heterogenous
>>> # ploidy and number of alleles across variants and samples.
>>> # For example, the 9th variant for the 4th individual
>>> # has ploidy
>>> ploidy = bgen["X"][8, 3].compute().sel(data="ploidy").item()
>>> print(ploidy)
2
>>> # and number of alleles equal to
>>> nalleles = bgen["variants"].loc[8, "nalleles"]
>>> print(nalleles)
8
>>> # Its probability distribution is given by the array
>>> p = bgen["genotype"][8, 3].compute()
>>> print(p)
[0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 1. 0. 0. 0. 0. 0. 0. 0.
0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
>>> # of size
>>> print(len(p))
36
>>> # Since the 9th variant for the 4th individual is
>>> # unphased,
>>> print(bgen["X"][8, 3].compute().sel(data="phased").item())
0
>>> # the estimated probabilities imply the dosage
>>> # (or expected number of alleles)
>>> print(convert_to_dosage(p, nalleles, ploidy))
[0. 1. 0. 0. 0. 1. 0. 0.]
Dosage
For a genotype with ploidy two and locus with two possible alleles, the dosage is defined as the expectation of the number of the reference alleles. It is common to define the reference allele as being the one has lower frequency under the given dataset. The following example demonstrate that case.
>>> from bgen_reader import read_bgen, allele_expectation, example_files
>>> from bgen_reader import compute_dosage
>>>
>>> with example_files("example.32bits.bgen") as filepath:
... bgen = read_bgen(filepath, verbose=False)
... e = allele_expectation(bgen["genotype"], nalleles=2, ploidy=2)
... dosage = compute_dosage(e)
... print(dosage.shape)
... print(dosage)
(199, 500)
[[ nan 0.06424146 0.08441421 ... 0.05648808 1.89105224 0.98898311]
[1.98779296 1.97802735 0.02111815 ... 1.95492412 1.00897216 1.02255316]
[ nan 0.06424146 0.08441421 ... 0.05648808 1.89105224 0.98898311]
...
[ nan 0.06424146 0.08441421 ... 0.05648808 1.89105224 0.98898311]
[1.98779296 1.97802735 0.02111815 ... 1.95492412 1.00897216 1.02255316]
[1.98779296 1.97802735 0.02111815 ... 1.95492412 1.00897216 1.02255316]]
The function compute_dosage
also accepts the argument ref
from which the reference
alleles can be specified. (Consult help(bgen_reader.compute_dosage)
for the full
specification.)
Another example now querying specific locus and sample.
>>> from texttable import Texttable
>>>
>>> from bgen_reader import (
>>> read_bgen,
>>> allele_expectation,
>>> example_files,
>>> compute_dosage,
>>> allele_frequency,
>>> )
>>>
>>> sampleid = "sample_005"
>>> rsid = "RSID_6"
>>>
>>> with example_files("example.32bits.bgen") as filepath:
... bgen = read_bgen(filepath, verbose=False)
...
... locus = bgen["variants"].query("rsid == '{}'".format(rsid)).index[0]
... sample = bgen["samples"].query("id == '{}'".format(sampleid)).index[0]
...
... nalleles = bgen["variants"].loc[locus, "nalleles"].item()
... ploidy = 2
...
... p = bgen["genotype"][locus, sample].compute()
... # For unphased genotypes only.
... e = allele_expectation(bgen["genotype"][locus, sample], nalleles, ploidy)
...
... alleles = bgen["variants"].loc[locus, "allele_ids"].split(",")
...
... tab = Texttable()
...
... tab.add_rows(
... [
... ["", "AA", "AG", "GG", "E[.]"],
... ["p"] + list(p) + [1.0],
... ["#" + alleles[0], 2, 1, 0, e[0]],
... ["#" + alleles[1], 0, 1, 2, e[1]],
... ]
... )
>>>
>>> print(tab.draw())
>>> print("variant: {}".format(rsid))
>>> print("sample : {}".format(sampleid))
>>>
>>> e = allele_expectation(bgen["genotype"], nalleles, ploidy)
>>>
>>> freq = allele_frequency(e)[locus]
>>> print("Frequency of locus {}:".format(rsid))
>>> print(" {}: {:f}".format(alleles[0], freq[0]))
>>> print(" {}: {:f}".format(alleles[1], freq[1]))
>>>
>>> # Alleles with minor allele frequencies accordong to the provided expections are used
>>> # references by default.
>>> dos = compute_dosage(e)
>>> print()
>>> print("Dosage: {:f}".format(dos[locus, sample]))
>>> print()
+----+-------+-------+-------+-------+
| | AA | AG | GG | E[.] |
+====+=======+=======+=======+=======+
| p | 0.012 | 0.987 | 0.001 | 1 |
+----+-------+-------+-------+-------+
| #A | 2 | 1 | 0 | 1.011 |
+----+-------+-------+-------+-------+
| #G | 0 | 1 | 2 | 0.989 |
+----+-------+-------+-------+-------+
variant: RSID_6
sample : sample_005
Frequency of locus RSID_6:
A: 0.458462
G: 0.541538
Dosage: 0.088409
Troubleshooting
fatal error: bgen.h: No such file or directory
This means that bgen C library is not installed (or could not be found). Please, follow the instructions in https://github.com/limix/bgen to install it, and try installing bgen-reader again.
Problems
If you encounter any issue, please, submit it.
Authors
License
This project is licensed under the MIT License.
Project details
Release history Release notifications | RSS feed
Download files
Download the file for your platform. If you're not sure which to choose, learn more about installing packages.
Source Distribution
Built Distributions
Hashes for bgen_reader-2.0.8-cp37-cp37m-win_amd64.whl
Algorithm | Hash digest | |
---|---|---|
SHA256 | 615a715b9ed08fe593cec597b0b7fb3af94031ad089c71379f955ed6c4f13425 |
|
MD5 | 1fc90850d669666ca45b1c4a27466b47 |
|
BLAKE2b-256 | 6455c6d0322f59ed7e85cd43b86fac56d53edf18a51d56bb90a0f34d17103f74 |
Hashes for bgen_reader-2.0.8-cp37-cp37m-manylinux1_x86_64.whl
Algorithm | Hash digest | |
---|---|---|
SHA256 | dd27e54864810d4db02b705169f4bcf78f1e16ba4d07c4b25ae2cd499e78f01b |
|
MD5 | 3d1bb5a5ccf330bbc99eedbfeed0a23d |
|
BLAKE2b-256 | 814262e506b20659f275a891692b3dba6703d1154c442e927a9217f3275a3617 |
Hashes for bgen_reader-2.0.8-cp37-cp37m-macosx_10_6_intel.whl
Algorithm | Hash digest | |
---|---|---|
SHA256 | c20038dcc710337c6d5a83f85ac97662b0b2c4a071c6b2cac4eab6b73c5f50f3 |
|
MD5 | 31e1fd4988c8e8091832eb0c7f245845 |
|
BLAKE2b-256 | f2d930e8149e2c30da1a47bf15f19575a43daa2209528f5fd274cea1dc714ad8 |
Hashes for bgen_reader-2.0.8-cp36-cp36m-win_amd64.whl
Algorithm | Hash digest | |
---|---|---|
SHA256 | 4676db81b0b9729663f42392e8a0738c0596cd5bb45f216c90a45cea8739ec97 |
|
MD5 | 5e8ad6043a56421c974c1ddd267f161d |
|
BLAKE2b-256 | a69e5a92b3fd8c623aaa422504d52abfcf8dab0db72ba246f596e1ad8fe69a58 |
Hashes for bgen_reader-2.0.8-cp36-cp36m-manylinux1_x86_64.whl
Algorithm | Hash digest | |
---|---|---|
SHA256 | d7ecd7e82b27fa860bcad91c5260ae10a0f29b2db8acb4fea590f9cc34a7a832 |
|
MD5 | 55bf0af7033d8c64c003f9c9f7a197d1 |
|
BLAKE2b-256 | 0e754c8694a55a7a7a576d8cc753613f63444191d4f90893eb9bb385d22ed7aa |
Hashes for bgen_reader-2.0.8-cp36-cp36m-macosx_10_6_intel.whl
Algorithm | Hash digest | |
---|---|---|
SHA256 | d928949c4c0014a1f2f7977f2b6907dbadb36cc8483ac49d42d495d6b5adccd2 |
|
MD5 | 115810aa526602ede13ca25110115ef4 |
|
BLAKE2b-256 | a573ab1899137589d5cf1f11da82412e001d8f598dac2665fb55e7a1cd7cf85b |
Hashes for bgen_reader-2.0.8-cp27-cp27mu-manylinux1_x86_64.whl
Algorithm | Hash digest | |
---|---|---|
SHA256 | 8b9698859e73b8e2f9e3e85f005e9ce294a8fb6fb4d2319e9d37fbba80443c34 |
|
MD5 | 6b88029e95173a1947d06c7b7096df8b |
|
BLAKE2b-256 | 47b94ce17dd26f609f04f4172295063cae81f8b735f0f7a7c463bc8fb5c74c5d |
Hashes for bgen_reader-2.0.8-cp27-cp27m-manylinux1_x86_64.whl
Algorithm | Hash digest | |
---|---|---|
SHA256 | 525f9cd1b8dd949b3bf059f803774f0ce428fa04ddb36b6ad83bfb59e1bb7514 |
|
MD5 | 853a0c0b8e0d767a0aed80f3e9026100 |
|
BLAKE2b-256 | 58de61852f713b0d8262eda7022743841d56a0b4f2c37633d4fa8c82d03fee3a |
Hashes for bgen_reader-2.0.8-cp27-cp27m-macosx_10_6_intel.whl
Algorithm | Hash digest | |
---|---|---|
SHA256 | aa25a69fa45acc8ef8ae6947ec6f763f4d1382aabb59dfba3a81e99a658074af |
|
MD5 | c0ed65715de86cb50144329bc1b63dfd |
|
BLAKE2b-256 | 661c32ea7c04dfce35ad7c35ea9d2c046ec8de628255eacc161178e6a1deb597 |