Bgen file format reader

These details have not been verified by PyPI

Project links

Development Status
- 5 - Production/Stable
Environment
- Console
License
- OSI Approved :: MIT License
Operating System
- OS Independent
Programming Language
- Python

Project description

bgen-reader

BGEN is a file format for storing large genetic datasets. It supports both unphased genotypes and phased haplotype data with variable ploidy and number of alleles. It was designed to provides a compact data representation without sacrificing variant access performance.

This Python package is a wrapper around the bgen library, a low-memory footprint reader that efficiently reads BGEN files. It fully supports the BGEN format specifications: 1.2 and 1.3; as well as their optional compressed formats.

Install
Usage
Troubleshooting
- fatal error: bgen.h: No such file or directory
Problems
Authors
License

Install

The recommended way to install this package is via conda

conda install -c conda-forge bgen-reader

Alternatively, it can be installed using the pip command

pip install bgen-reader

However, this method will require that the bgen C library has been installed before.

Usage

The following examples assume you have downloaded the example.bgen, haplotypes.bgen, and complex.bgen files (found in this repository) to the directory you are executing Python.

Unphased genotype

>>> from bgen_reader import read_bgen
>>>
>>> bgen = read_bgen("example.bgen", verbose=False)
>>>
>>> print(bgen["variants"].head())

        id    rsid chrom   pos  nalleles allele_ids
0  SNPID_2  RSID_2    01  2000         2        A,G
1  SNPID_3  RSID_3    01  3000         2        A,G
2  SNPID_4  RSID_4    01  4000         2        A,G
3  SNPID_5  RSID_5    01  5000         2        A,G
4  SNPID_6  RSID_6    01  6000         2        A,G

>>> print(bgen["samples"].head())

           id
0  sample_001
1  sample_002
2  sample_003
3  sample_004
4  sample_005

>>> print(len(bgen["genotype"]))

>>> p = bgen["genotype"][0].compute()
>>> print(p)

[[       nan        nan        nan]
 [0.02780236 0.00863674 0.9635609 ]
 [0.01736504 0.04968414 0.93295083]
 ...
 [0.01419069 0.02810669 0.95770262]
 [0.91949463 0.05206298 0.02844239]
 [0.00244141 0.98410029 0.0134583 ]]

>>> print(p.shape)

(500, 3)

The example.bgen file can be found in the example folder, as well as the next ones.

Phased genotype

>>> from bgen_reader import read_bgen
>>> bgen = read_bgen("haplotypes.bgen", verbose=False)
>>>
>>> print(bgen["variants"].head())

     id rsid chrom  pos  nalleles allele_ids
0  SNP1  RS1     1    1         2        A,G
1  SNP2  RS2     1    2         2        A,G
2  SNP3  RS3     1    3         2        A,G
3  SNP4  RS4     1    4         2        A,G

>>> print(bgen["samples"].head())

         id
0  sample_0
1  sample_1
2  sample_2
3  sample_3

>>> # Print the estimated probabilities for the first variant
>>> # and second individual.
>>> print(bgen["genotype"][0, 1].compute())

[0. 1. 1. 0.]

>>> # Is it a phased one?
>>> print(bgen["X"][0, 1].compute().sel(data="phased").item())

>>> # How many haplotypes?
>>> print(bgen["X"][0, 1].compute().sel(data="ploidy").item())

>>> # And how many alleles?
>>> print(bgen["variants"].loc[0, "nalleles"])

>>> # Therefore, the first haplotype has probability 100%
>>> # of having the allele
>>> print(bgen["variants"].loc[0, "allele_ids"].split(",")[1])

>>> # And the second haplotype has probability 100% of having
>>> # the first allele
>>> print(bgen["variants"].loc[0, "allele_ids"].split(",")[0])

Complex file

>>> from bgen_reader import read_bgen, convert_to_dosage
>>>
>>> bgen = read_bgen("complex.bgen", verbose=False)
>>>
>>> print(bgen["variants"])

     id rsid chrom  pos  nalleles                            allele_ids
0         V1    01    1         2                                   A,G
1  V2.1   V2    01    2         2                                   A,G
2         V3    01    3         2                                   A,G
3         M4    01    4         3                                 A,G,T
4         M5    01    5         2                                   A,G
5         M6    01    7         4                            A,G,GT,GTT
6         M7    01    7         6                 A,G,GT,GTT,GTTT,GTTTT
7         M8    01    8         7          A,G,GT,GTT,GTTT,GTTTT,GTTTTT
8         M9    01    9         8  A,G,GT,GTT,GTTT,GTTTT,GTTTTT,GTTTTTT
9        M10    01   10         2                                   A,G

>>> print(bgen["samples"])

         id
0  sample_0
1  sample_1
2  sample_2
3  sample_3

>>> # Print the estimated probabilities for the first variant
>>> # and second individual.
>>> print(bgen["genotype"][0, 1].compute())

[ 1.  0.  0. nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan
 nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan]

>>> # The NaN elements are a by-product of the heterogenous
>>> # ploidy and number of alleles across variants and samples.
>>> # For example, the 9th variant for the 4th individual
>>> # has ploidy
>>> ploidy = bgen["X"][8, 3].compute().sel(data="ploidy").item()
>>> print(ploidy)

>>> # and number of alleles equal to
>>> nalleles = bgen["variants"].loc[8, "nalleles"]
>>> print(nalleles)

>>> # Its probability distribution is given by the array
>>> p = bgen["genotype"][8, 3].compute()
>>> print(p)

[0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 1. 0. 0. 0. 0. 0. 0. 0.
 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]

>>> # of size
>>> print(len(p))

>>> # Since the 9th variant for the 4th individual is
>>> # unphased,
>>> print(bgen["X"][8, 3].compute().sel(data="phased").item())

>>> # the estimated probabilities imply the dosage
>>> # (or expected number of alleles)
>>> print(convert_to_dosage(p, nalleles, ploidy))

[0. 1. 0. 0. 0. 1. 0. 0.]

Dosage

For a genotype with ploidy two and locus with two possible alleles, the dosage is defined as the expectation of the number of the reference alleles. It is common to define the reference allele as being the one has lower frequency under the given dataset. The following example demonstrate that case.

>>> from bgen_reader import read_bgen, allele_expectation, example_files
>>> from bgen_reader import compute_dosage
>>>
>>> with example_files("example.32bits.bgen") as filepath:
...     bgen = read_bgen(filepath, verbose=False)
...     e = allele_expectation(bgen["genotype"], nalleles=2, ploidy=2)
...     dosage = compute_dosage(e)
...     print(dosage.shape)
...     print(dosage)

(199, 500)
[[       nan 0.06424146 0.08441421 ... 0.05648808 1.89105224 0.98898311]
[1.98779296 1.97802735 0.02111815 ... 1.95492412 1.00897216 1.02255316]
[       nan 0.06424146 0.08441421 ... 0.05648808 1.89105224 0.98898311]
...
[       nan 0.06424146 0.08441421 ... 0.05648808 1.89105224 0.98898311]
[1.98779296 1.97802735 0.02111815 ... 1.95492412 1.00897216 1.02255316]
[1.98779296 1.97802735 0.02111815 ... 1.95492412 1.00897216 1.02255316]]

The function compute_dosage also accepts the argument ref from which the reference alleles can be specified. (Consult help(bgen_reader.compute_dosage) for the full specification.)

Another example now querying specific locus and sample.

>>> from texttable import Texttable
>>>
>>> from bgen_reader import (
>>>     read_bgen,
>>>     allele_expectation,
>>>     example_files,
>>>     compute_dosage,
>>>     allele_frequency,
>>> )
>>>
>>> sampleid = "sample_005"
>>> rsid = "RSID_6"
>>>
>>> with example_files("example.32bits.bgen") as filepath:
...     bgen = read_bgen(filepath, verbose=False)
...
...     locus = bgen["variants"].query("rsid == '{}'".format(rsid)).index[0]
...     sample = bgen["samples"].query("id == '{}'".format(sampleid)).index[0]
...
...     nalleles = bgen["variants"].loc[locus, "nalleles"].item()
...     ploidy = 2
...
...     p = bgen["genotype"][locus, sample].compute()
...     # For unphased genotypes only.
...     e = allele_expectation(bgen["genotype"][locus, sample], nalleles, ploidy)
...
...     alleles = bgen["variants"].loc[locus, "allele_ids"].split(",")
...
...     tab = Texttable()
...
...     tab.add_rows(
...         [
...             ["", "AA", "AG", "GG", "E[.]"],
...             ["p"] + list(p) + [1.0],
...             ["#" + alleles[0], 2, 1, 0, e[0]],
...             ["#" + alleles[1], 0, 1, 2, e[1]],
...         ]
...     )
>>>
>>> print(tab.draw())
>>> print("variant: {}".format(rsid))
>>> print("sample : {}".format(sampleid))
>>>
>>> e = allele_expectation(bgen["genotype"], nalleles, ploidy)
>>>
>>> freq = allele_frequency(e)[locus]
>>> print("Frequency of locus {}:".format(rsid))
>>> print("    {}: {:f}".format(alleles[0], freq[0]))
>>> print("    {}: {:f}".format(alleles[1], freq[1]))
>>>
>>> # Alleles with minor allele frequencies accordong to the provided expections are used
>>> # references by default.
>>> dos = compute_dosage(e)
>>> print()
>>> print("Dosage: {:f}".format(dos[locus, sample]))
>>> print()

+----+-------+-------+-------+-------+
|    |  AA   |  AG   |  GG   | E[.]  |
+====+=======+=======+=======+=======+
| p  | 0.012 | 0.987 | 0.001 | 1     |
+----+-------+-------+-------+-------+
| #A | 2     | 1     | 0     | 1.011 |
+----+-------+-------+-------+-------+
| #G | 0     | 1     | 2     | 0.989 |
+----+-------+-------+-------+-------+
variant: RSID_6
sample : sample_005

Frequency of locus RSID_6:
    A: 0.458462
    G: 0.541538

Dosage: 0.088409

Troubleshooting

fatal error: bgen.h: No such file or directory

This means that bgen C library is not installed (or could not be found). Please, follow the instructions in https://github.com/limix/bgen to install it, and try installing bgen-reader again.

Problems

If you encounter any issue, please, submit it.

Authors

Danilo Horta

License

This project is licensed under the MIT License.

Project details

These details have not been verified by PyPI

Project links

Development Status
- 5 - Production/Stable
Environment
- Console
License
- OSI Approved :: MIT License
Operating System
- OS Independent
Programming Language
- Python

Release history Release notifications | RSS feed

4.0.9

Oct 28, 2024

4.0.8

Jun 23, 2021

4.0.7

Nov 5, 2020

4.0.6

Sep 16, 2020

4.0.5

Aug 11, 2020

4.0.4

Apr 8, 2020

4.0.3

Apr 7, 2020

4.0.2

Apr 7, 2020

4.0.1

Apr 7, 2020

4.0.0

Apr 7, 2020

3.0.7

Oct 22, 2019

3.0.6

May 9, 2019

3.0.5

Apr 8, 2019

3.0.4

Feb 2, 2019

3.0.3

Feb 1, 2019

3.0.2

Jan 18, 2019

2.0.9

May 9, 2019

This version

2.0.8

Nov 17, 2018

2.0.7

Oct 8, 2018

2.0.6

Sep 26, 2018

2.0.5

Jul 17, 2018

2.0.4

Jul 12, 2018

2.0.3

Jul 10, 2018

2.0.2

Jul 9, 2018

2.0.1

Jun 14, 2018

2.0.0

Jun 13, 2018

1.1.6

Jun 9, 2018

1.1.5

Jun 9, 2018

1.1.4

Jun 8, 2018

1.1.3

Jun 7, 2018

1.1.2

Apr 5, 2018

1.1.1

Apr 5, 2018

1.1.0

Apr 5, 2018

1.0.4

Jan 28, 2018

1.0.1

Nov 7, 2017

1.0.0

Sep 20, 2017

0.1.17

Aug 30, 2017

0.1.16

Aug 16, 2017

0.1.15

Aug 15, 2017

0.1.14

Aug 14, 2017

0.1.13

Aug 14, 2017

0.1.12

Aug 10, 2017

0.1.11

Aug 10, 2017

0.1.10

Jun 28, 2017

0.1.9

Jun 26, 2017

0.1.8

Jun 26, 2017

0.1.7

Jun 23, 2017

0.1.6

Jun 23, 2017

0.1.5

Jun 23, 2017

0.1.4

Jun 23, 2017

0.1.2

Jun 23, 2017

0.1.1

Jun 23, 2017

0.1.0

Jun 22, 2017

Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

bgen-reader-2.0.8.tar.gz (1.3 MB view hashes)

Uploaded Nov 17, 2018 Source

Built Distributions

bgen_reader-2.0.8-cp37-cp37m-win_amd64.whl (887.1 kB view hashes)

Uploaded Nov 17, 2018 CPython 3.7m Windows x86-64

bgen_reader-2.0.8-cp37-cp37m-manylinux1_x86_64.whl (1.2 MB view hashes)

Uploaded Nov 17, 2018 CPython 3.7m

bgen_reader-2.0.8-cp37-cp37m-macosx_10_6_intel.whl (920.6 kB view hashes)

Uploaded Nov 17, 2018 CPython 3.7m macOS 10.6+ intel

bgen_reader-2.0.8-cp36-cp36m-win_amd64.whl (887.1 kB view hashes)

Uploaded Nov 17, 2018 CPython 3.6m Windows x86-64

bgen_reader-2.0.8-cp36-cp36m-manylinux1_x86_64.whl (1.2 MB view hashes)

Uploaded Nov 17, 2018 CPython 3.6m

bgen_reader-2.0.8-cp36-cp36m-macosx_10_6_intel.whl (920.6 kB view hashes)

Uploaded Nov 17, 2018 CPython 3.6m macOS 10.6+ intel

bgen_reader-2.0.8-cp27-cp27mu-manylinux1_x86_64.whl (1.2 MB view hashes)

Uploaded Nov 17, 2018 CPython 2.7mu

bgen_reader-2.0.8-cp27-cp27m-manylinux1_x86_64.whl (1.2 MB view hashes)

Uploaded Nov 17, 2018 CPython 2.7m

bgen_reader-2.0.8-cp27-cp27m-macosx_10_6_intel.whl (920.5 kB view hashes)

Uploaded Nov 17, 2018 CPython 2.7m macOS 10.6+ intel

Hashes for bgen-reader-2.0.8.tar.gz

Hashes for bgen-reader-2.0.8.tar.gz
Algorithm	Hash digest
SHA256	`801f217e4f22fe5acc67c09b47af1ce9f5e944c039b9b544420ca8fe20895542`
MD5	`b1b5eb007d367c5be0e94737a1342258`
BLAKE2b-256	`3f91d3d02092e10a36c1abd71191c5b109541c114499b845bb80117e3408eb15`

Hashes for bgen_reader-2.0.8-cp37-cp37m-win_amd64.whl

Hashes for bgen_reader-2.0.8-cp37-cp37m-win_amd64.whl
Algorithm	Hash digest
SHA256	`615a715b9ed08fe593cec597b0b7fb3af94031ad089c71379f955ed6c4f13425`
MD5	`1fc90850d669666ca45b1c4a27466b47`
BLAKE2b-256	`6455c6d0322f59ed7e85cd43b86fac56d53edf18a51d56bb90a0f34d17103f74`

Hashes for bgen_reader-2.0.8-cp37-cp37m-manylinux1_x86_64.whl

Hashes for bgen_reader-2.0.8-cp37-cp37m-manylinux1_x86_64.whl
Algorithm	Hash digest
SHA256	`dd27e54864810d4db02b705169f4bcf78f1e16ba4d07c4b25ae2cd499e78f01b`
MD5	`3d1bb5a5ccf330bbc99eedbfeed0a23d`
BLAKE2b-256	`814262e506b20659f275a891692b3dba6703d1154c442e927a9217f3275a3617`

Hashes for bgen_reader-2.0.8-cp37-cp37m-macosx_10_6_intel.whl

Hashes for bgen_reader-2.0.8-cp37-cp37m-macosx_10_6_intel.whl
Algorithm	Hash digest
SHA256	`c20038dcc710337c6d5a83f85ac97662b0b2c4a071c6b2cac4eab6b73c5f50f3`
MD5	`31e1fd4988c8e8091832eb0c7f245845`
BLAKE2b-256	`f2d930e8149e2c30da1a47bf15f19575a43daa2209528f5fd274cea1dc714ad8`

Hashes for bgen_reader-2.0.8-cp36-cp36m-win_amd64.whl

Hashes for bgen_reader-2.0.8-cp36-cp36m-win_amd64.whl
Algorithm	Hash digest
SHA256	`4676db81b0b9729663f42392e8a0738c0596cd5bb45f216c90a45cea8739ec97`
MD5	`5e8ad6043a56421c974c1ddd267f161d`
BLAKE2b-256	`a69e5a92b3fd8c623aaa422504d52abfcf8dab0db72ba246f596e1ad8fe69a58`

Hashes for bgen_reader-2.0.8-cp36-cp36m-manylinux1_x86_64.whl

Hashes for bgen_reader-2.0.8-cp36-cp36m-manylinux1_x86_64.whl
Algorithm	Hash digest
SHA256	`d7ecd7e82b27fa860bcad91c5260ae10a0f29b2db8acb4fea590f9cc34a7a832`
MD5	`55bf0af7033d8c64c003f9c9f7a197d1`
BLAKE2b-256	`0e754c8694a55a7a7a576d8cc753613f63444191d4f90893eb9bb385d22ed7aa`

Hashes for bgen_reader-2.0.8-cp36-cp36m-macosx_10_6_intel.whl

Hashes for bgen_reader-2.0.8-cp36-cp36m-macosx_10_6_intel.whl
Algorithm	Hash digest
SHA256	`d928949c4c0014a1f2f7977f2b6907dbadb36cc8483ac49d42d495d6b5adccd2`
MD5	`115810aa526602ede13ca25110115ef4`
BLAKE2b-256	`a573ab1899137589d5cf1f11da82412e001d8f598dac2665fb55e7a1cd7cf85b`

Hashes for bgen_reader-2.0.8-cp27-cp27mu-manylinux1_x86_64.whl

Hashes for bgen_reader-2.0.8-cp27-cp27mu-manylinux1_x86_64.whl
Algorithm	Hash digest
SHA256	`8b9698859e73b8e2f9e3e85f005e9ce294a8fb6fb4d2319e9d37fbba80443c34`
MD5	`6b88029e95173a1947d06c7b7096df8b`
BLAKE2b-256	`47b94ce17dd26f609f04f4172295063cae81f8b735f0f7a7c463bc8fb5c74c5d`

Hashes for bgen_reader-2.0.8-cp27-cp27m-manylinux1_x86_64.whl

Hashes for bgen_reader-2.0.8-cp27-cp27m-manylinux1_x86_64.whl
Algorithm	Hash digest
SHA256	`525f9cd1b8dd949b3bf059f803774f0ce428fa04ddb36b6ad83bfb59e1bb7514`
MD5	`853a0c0b8e0d767a0aed80f3e9026100`
BLAKE2b-256	`58de61852f713b0d8262eda7022743841d56a0b4f2c37633d4fa8c82d03fee3a`

Hashes for bgen_reader-2.0.8-cp27-cp27m-macosx_10_6_intel.whl

Hashes for bgen_reader-2.0.8-cp27-cp27m-macosx_10_6_intel.whl
Algorithm	Hash digest
SHA256	`aa25a69fa45acc8ef8ae6947ec6f763f4d1382aabb59dfba3a81e99a658074af`
MD5	`c0ed65715de86cb50144329bc1b63dfd`
BLAKE2b-256	`661c32ea7c04dfce35ad7c35ea9d2c046ec8de628255eacc161178e6a1deb597`