Bgen file format reader

bgen-reader

A BGEN file format reader.

BGEN is a file format for storing large genetic datasets. It supports both unphased genotypes and phased haplotype data with variable ploidy and number of alleles. It was designed to provide a compact data representation without sacrificing variant access performance.

This Python package is a wrapper around the bgen library, a reader with a low memory footprint that efficiently reads BGEN files. It fully supports versions 1.2 and 1.3 of the BGEN format specification, including their optional compressed formats.

Install

The recommended way to install this package is via conda:

conda install -c conda-forge bgen-reader

Alternatively, it can be installed with pip:

pip install bgen-reader

However, this method requires the bgen C library to be installed beforehand.
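After either installation route, a quick way to check that the package is visible to the current Python environment is to look it up without importing it. This is a minimal sketch using only the standard library; the module name bgen_reader is the one used in the usage examples of this document.

```python
import importlib.util

# Look up the package without importing it; None means it is not installed
# in the environment this interpreter is running in.
spec = importlib.util.find_spec("bgen_reader")
installed = spec is not None
print("bgen_reader installed:", installed)
```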

Usage

Unphased genotype

>>> from bgen_reader import read_bgen
>>>
>>> bgen = read_bgen("example.bgen", verbose=False)
>>>
>>> print(bgen["variants"].head())
        id    rsid chrom   pos  nalleles allele_ids
0  SNPID_2  RSID_2    01  2000         2        A,G
1  SNPID_3  RSID_3    01  3000         2        A,G
2  SNPID_4  RSID_4    01  4000         2        A,G
3  SNPID_5  RSID_5    01  5000         2        A,G
4  SNPID_6  RSID_6    01  6000         2        A,G
>>> print(bgen["samples"].head())
           id
0  sample_001
1  sample_002
2  sample_003
3  sample_004
4  sample_005
>>> print(len(bgen["genotype"]))
199
>>> p = bgen["genotype"][0].compute()
>>> print(p)
[[       nan        nan        nan]
 [0.02780236 0.00863674 0.9635609 ]
 [0.01736504 0.04968414 0.93295083]
 ...
 [0.01419069 0.02810669 0.95770262]
 [0.91949463 0.05206298 0.02844239]
 [0.00244141 0.98410029 0.0134583 ]]
>>> print(p.shape)
(500, 3)

The example.bgen file, as well as the files used in the following examples, can be found in the example folder.
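Each row of the probability array printed above holds three values for one sample. For a biallelic, diploid, unphased variant these can be reduced to a single expected allele count. The sketch below assumes the three columns are ordered as P(AA), P(AG), P(GG); the helper expected_alt_dosage is hypothetical and not part of bgen-reader, and the rows are copied by hand from the printed output rather than read from a file.

```python
import math

# Rows copied from the printed output above: one row per sample.
# Assumption: the columns are P(AA), P(AG), P(GG).
probs = [
    [math.nan, math.nan, math.nan],
    [0.02780236, 0.00863674, 0.9635609],
    [0.91949463, 0.05206298, 0.02844239],
]


def expected_alt_dosage(row):
    """Expected number of copies of the second allele (hypothetical helper)."""
    # Genotype k carries k copies of the second allele.
    # A row of NaN (missing sample) yields NaN.
    return sum(k * p for k, p in enumerate(row))


dosages = [expected_alt_dosage(row) for row in probs]
print(dosages)
```

A sample with missing data keeps a NaN dosage, so it can be filtered out afterwards instead of being silently treated as zero.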

Phased genotype

>>> from bgen_reader import read_bgen
>>> bgen = read_bgen("haplotypes.bgen", verbose=False)
>>>
>>> print(bgen["variants"].head())
     id rsid chrom  pos  nalleles allele_ids
0  SNP1  RS1     1    1         2        A,G
1  SNP2  RS2     1    2         2        A,G
2  SNP3  RS3     1    3         2        A,G
3  SNP4  RS4     1    4         2        A,G
>>> print(bgen["samples"].head())
         id
0  sample_0
1  sample_1
2  sample_2
3  sample_3
>>> # Print the estimated probabilities for the first variant
>>> # and second individual.
>>> print(bgen["genotype"][0, 1].compute())
[0. 1. 1. 0.]
>>> # Is it a phased one?
>>> print(bgen["X"][0, 1].compute().sel(data="phased").item())
1
>>> # How many haplotypes?
>>> print(bgen["X"][0, 1].compute().sel(data="ploidy").item())
2
>>> # And how many alleles?
>>> print(bgen["variants"].loc[0, "nalleles"])
2
>>> # Therefore, the first haplotype has probability 100%
>>> # of having the second allele
>>> print(bgen["variants"].loc[0, "allele_ids"].split(",")[1])
G
>>> # And the second haplotype has probability 100% of having
>>> # the first allele
>>> print(bgen["variants"].loc[0, "allele_ids"].split(",")[0])
A
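The reading of the phased probability vector in the session above can be sketched in plain Python. The sketch assumes that a phased vector stores nalleles probabilities per haplotype, one haplotype after another, which is consistent with the interpretation given in the comments above; the values are copied from that session rather than read from haplotypes.bgen.

```python
# Values copied from the phased example above.
p = [0.0, 1.0, 1.0, 0.0]
ploidy = 2
nalleles = 2
allele_ids = "A,G".split(",")

# Assumption: the vector stores `nalleles` probabilities per haplotype,
# one haplotype after another.
haplotypes = [p[h * nalleles:(h + 1) * nalleles] for h in range(ploidy)]

# Most probable allele for each haplotype.
calls = [allele_ids[hap.index(max(hap))] for hap in haplotypes]
print(haplotypes)
print(calls)
```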

Complex file

>>> from bgen_reader import read_bgen, convert_to_dosage
>>>
>>> bgen = read_bgen("complex.bgen", verbose=False)
>>>
>>> print(bgen["variants"])
     id rsid chrom  pos  nalleles                            allele_ids
0         V1    01    1         2                                   A,G
1  V2.1   V2    01    2         2                                   A,G
2         V3    01    3         2                                   A,G
3         M4    01    4         3                                 A,G,T
4         M5    01    5         2                                   A,G
5         M6    01    7         4                            A,G,GT,GTT
6         M7    01    7         6                 A,G,GT,GTT,GTTT,GTTTT
7         M8    01    8         7          A,G,GT,GTT,GTTT,GTTTT,GTTTTT
8         M9    01    9         8  A,G,GT,GTT,GTTT,GTTTT,GTTTTT,GTTTTTT
9        M10    01   10         2                                   A,G
>>> print(bgen["samples"])
         id
0  sample_0
1  sample_1
2  sample_2
3  sample_3
>>> # Print the estimated probabilities for the first variant
>>> # and second individual.
>>> print(bgen["genotype"][0, 1].compute())
[ 1.  0.  0. nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan
 nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan]
>>> # The NaN elements are a by-product of the heterogeneous
>>> # ploidy and number of alleles across variants and samples.
>>> # For example, the 9th variant for the 4th individual
>>> # has ploidy
>>> ploidy = bgen["X"][8, 3].compute().sel(data="ploidy").item()
>>> print(ploidy)
2
>>> # and number of alleles equal to
>>> nalleles = bgen["variants"].loc[8, "nalleles"]
>>> print(nalleles)
8
>>> # Its probability distribution is given by the array
>>> p = bgen["genotype"][8, 3].compute()
>>> print(p)
[0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 1. 0. 0. 0. 0. 0. 0. 0.
 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
>>> # of size
>>> print(len(p))
36
>>> # Since the 9th variant for the 4th individual is
>>> # unphased,
>>> print(bgen["X"][8, 3].compute().sel(data="phased").item())
0
>>> # the estimated probabilities imply the dosage
>>> # (or expected number of alleles)
>>> print(convert_to_dosage(p, nalleles, ploidy))
[0. 1. 0. 0. 0. 1. 0. 0.]
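The array size of 36 and the dosage shown above can be reproduced by enumerating the unphased genotypes explicitly. The sketch below is not the implementation of convert_to_dosage; it is a plain-Python illustration that assumes genotypes are ordered by their reversed sorted allele tuple (colex order), which is the ordering we understand the BGEN specification to use. The helper names genotypes_in_bgen_order and allele_dosage are hypothetical.

```python
from itertools import combinations_with_replacement


def genotypes_in_bgen_order(nalleles, ploidy):
    """Enumerate unphased genotypes as sorted tuples of allele indices."""
    combos = combinations_with_replacement(range(nalleles), ploidy)
    # Assumed ordering: sort by the reversed tuple (colex order).
    return sorted(combos, key=lambda g: g[::-1])


def allele_dosage(p, nalleles, ploidy):
    """Expected count of each allele under the genotype distribution p."""
    dosage = [0.0] * nalleles
    for prob, genotype in zip(p, genotypes_in_bgen_order(nalleles, ploidy)):
        for allele in genotype:
            dosage[allele] += prob
    return dosage


nalleles, ploidy = 8, 2
genotypes = genotypes_in_bgen_order(nalleles, ploidy)

# Rebuild the printed array: all zeros except a one at position 16.
p = [0.0] * len(genotypes)
p[16] = 1.0

print(len(genotypes))
print(genotypes[16])
print(allele_dosage(p, nalleles, ploidy))
```

Under this assumed ordering, position 16 corresponds to the genotype with allele indices 1 and 5, which gives one expected copy of each of those two alleles and agrees with the dosage printed above.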

Problems

If you encounter any problem, please submit an issue.

Authors

License

This project is licensed under the MIT License.

