Bgen file format reader

bgen-reader

BGEN is a file format for storing large genetic datasets. It supports both unphased genotypes and phased haplotype data with variable ploidy and number of alleles. It was designed to provide a compact data representation without sacrificing variant access performance.

This Python package is a wrapper around the bgen library, a reader with a low memory footprint that efficiently reads BGEN files. It fully supports versions 1.2 and 1.3 of the BGEN format specification, as well as their optional compressed formats.

Install

The recommended way to install this package is via conda:

conda install -c conda-forge bgen-reader

Alternatively, it can be installed using the pip command:

pip install bgen-reader

However, this method requires that the bgen C library is already installed.

Usage

Unphased genotype

>>> from bgen_reader import read_bgen
>>>
>>> bgen = read_bgen("example.bgen", verbose=False)
>>>
>>> print(bgen["variants"].head())

        id    rsid chrom   pos  nalleles allele_ids
0  SNPID_2  RSID_2    01  2000         2        A,G
1  SNPID_3  RSID_3    01  3000         2        A,G
2  SNPID_4  RSID_4    01  4000         2        A,G
3  SNPID_5  RSID_5    01  5000         2        A,G
4  SNPID_6  RSID_6    01  6000         2        A,G

>>> print(bgen["samples"].head())

           id
0  sample_001
1  sample_002
2  sample_003
3  sample_004
4  sample_005

>>> print(len(bgen["genotype"]))

>>> p = bgen["genotype"][0].compute()
>>> print(p)

[[       nan        nan        nan]
 [0.02780236 0.00863674 0.9635609 ]
 [0.01736504 0.04968414 0.93295083]
 ...
 [0.01419069 0.02810669 0.95770262]
 [0.91949463 0.05206298 0.02844239]
 [0.00244141 0.98410029 0.0134583 ]]

>>> print(p.shape)

(500, 3)

The example.bgen file, along with the files used in the following examples, can be found in the example folder.
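As a rough illustration of what the (500, 3) probability array above represents, the sketch below computes the expected number of copies of the second allele for each sample. It assumes the three columns are ordered as homozygous first allele, heterozygous, homozygous second allele, and that every sample is diploid at a biallelic variant. The helper name `expected_second_allele_count` is hypothetical and not part of bgen-reader; the rows are copied from the output printed above.

```python
import math


def expected_second_allele_count(row):
    """Expected copies of the second allele for one diploid, biallelic sample.

    `row` is assumed to hold (P(hom. first), P(het.), P(hom. second)).
    A missing sample (all NaN) yields NaN.
    """
    if any(math.isnan(p) for p in row):
        return float("nan")
    _, p_het, p_hom_second = row
    return p_het + 2.0 * p_hom_second


# Three rows taken from the printed array above.
rows = [
    [float("nan"), float("nan"), float("nan")],
    [0.02780236, 0.00863674, 0.9635609],
    [0.91949463, 0.05206298, 0.02844239],
]

dosages = [expected_second_allele_count(row) for row in rows]
print(dosages)
```

The first sample has no genotype call, so its value stays NaN rather than being silently treated as zero.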

Phased genotype

>>> from bgen_reader import read_bgen
>>> bgen = read_bgen("haplotypes.bgen", verbose=False)
>>>
>>> print(bgen["variants"].head())

     id rsid chrom  pos  nalleles allele_ids
0  SNP1  RS1     1    1         2        A,G
1  SNP2  RS2     1    2         2        A,G
2  SNP3  RS3     1    3         2        A,G
3  SNP4  RS4     1    4         2        A,G

>>> print(bgen["samples"].head())

         id
0  sample_0
1  sample_1
2  sample_2
3  sample_3

>>> # Print the estimated probabilities for the first variant
>>> # and second individual.
>>> print(bgen["genotype"][0, 1].compute())

[0. 1. 1. 0.]

>>> # Is it a phased one?
>>> print(bgen["X"][0, 1].compute().sel(data="phased").item())

>>> # How many haplotypes?
>>> print(bgen["X"][0, 1].compute().sel(data="ploidy").item())

>>> # And how many alleles?
>>> print(bgen["variants"].loc[0, "nalleles"])

>>> # Therefore, the first haplotype has probability 100%
>>> # of having the second allele
>>> print(bgen["variants"].loc[0, "allele_ids"].split(",")[1])

>>> # And the second haplotype has probability 100% of having
>>> # the first allele
>>> print(bgen["variants"].loc[0, "allele_ids"].split(",")[0])
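The reading of the array `[0. 1. 1. 0.]` given above can be sketched in plain Python. The sketch assumes, consistently with that interpretation, that phased probabilities are stored haplotype by haplotype, with one entry per allele within each haplotype. The function `most_likely_alleles` is a hypothetical helper, not part of bgen-reader.

```python
def most_likely_alleles(probs, allele_ids, ploidy):
    """Return the most probable allele for each haplotype.

    `probs` is assumed to be laid out haplotype-major:
    [P(hap1 = allele 1), P(hap1 = allele 2), P(hap2 = allele 1), ...].
    """
    alleles = allele_ids.split(",")
    n = len(alleles)
    calls = []
    for h in range(ploidy):
        chunk = probs[h * n:(h + 1) * n]  # probabilities for haplotype h
        best = max(range(n), key=lambda a: chunk[a])
        calls.append(alleles[best])
    return calls


# Values from the example: first variant, second individual.
calls = most_likely_alleles([0.0, 1.0, 1.0, 0.0], "A,G", ploidy=2)
print(calls)  # ['G', 'A']
```

This matches the conclusion above: the first haplotype carries allele G and the second carries allele A.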

Complex file

>>> from bgen_reader import read_bgen, convert_to_dosage
>>>
>>> bgen = read_bgen("complex.bgen", verbose=False)
>>>
>>> print(bgen["variants"])

     id rsid chrom  pos  nalleles                            allele_ids
0         V1    01    1         2                                   A,G
1  V2.1   V2    01    2         2                                   A,G
2         V3    01    3         2                                   A,G
3         M4    01    4         3                                 A,G,T
4         M5    01    5         2                                   A,G
5         M6    01    7         4                            A,G,GT,GTT
6         M7    01    7         6                 A,G,GT,GTT,GTTT,GTTTT
7         M8    01    8         7          A,G,GT,GTT,GTTT,GTTTT,GTTTTT
8         M9    01    9         8  A,G,GT,GTT,GTTT,GTTTT,GTTTTT,GTTTTTT
9        M10    01   10         2                                   A,G

>>> print(bgen["samples"])

         id
0  sample_0
1  sample_1
2  sample_2
3  sample_3

>>> # Print the estimated probabilities for the first variant
>>> # and second individual.
>>> print(bgen["genotype"][0, 1].compute())

[ 1.  0.  0. nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan
 nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan]

>>> # The NaN elements are a by-product of the heterogeneous
>>> # ploidy and number of alleles across variants and samples.
>>> # For example, the 9th variant for the 4th individual
>>> # has ploidy
>>> ploidy = bgen["X"][8, 3].compute().sel(data="ploidy").item()
>>> print(ploidy)

>>> # and number of alleles equal to
>>> nalleles = bgen["variants"].loc[8, "nalleles"]
>>> print(nalleles)

>>> # Its probability distribution is given by the array
>>> p = bgen["genotype"][8, 3].compute()
>>> print(p)

[0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 1. 0. 0. 0. 0. 0. 0. 0.
 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]

>>> # of size
>>> print(len(p))

>>> # Since the 9th variant for the 4th individual is
>>> # unphased,
>>> print(bgen["X"][8, 3].compute().sel(data="phased").item())

>>> # the estimated probabilities imply the dosage
>>> # (or expected number of alleles)
>>> print(convert_to_dosage(p, nalleles, ploidy))

[0. 1. 0. 0. 0. 1. 0. 0.]
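To make the link between the 36-element probability array and the 8-element dosage more concrete, here is a minimal sketch of one way such a dosage could be computed. It is not the implementation of `convert_to_dosage`. It assumes ploidy 2 for this variant and sample (the only ploidy for which 8 alleles give 36 unphased genotypes) and that genotypes are enumerated in the colexicographic order described by the BGEN specification. The names `genotype_order` and `dosage_from_probs` are hypothetical.

```python
from itertools import combinations_with_replacement
from math import comb


def genotype_order(nalleles, ploidy):
    """Enumerate unphased genotypes as sorted allele-index tuples.

    Tuples are ordered by comparing their largest allele first
    (colexicographic order), which is the ordering assumed here.
    """
    genotypes = combinations_with_replacement(range(nalleles), ploidy)
    return sorted(genotypes, key=lambda g: g[::-1])


def dosage_from_probs(probs, nalleles, ploidy):
    """Expected number of copies of each allele given genotype probabilities."""
    dosage = [0.0] * nalleles
    for prob, genotype in zip(probs, genotype_order(nalleles, ploidy)):
        for allele in genotype:
            dosage[allele] += prob
    return dosage


nalleles, ploidy = 8, 2  # ploidy 2 is an assumption, see the text above

# Number of unphased genotypes: multisets of size `ploidy` over `nalleles`.
n_genotypes = comb(nalleles + ploidy - 1, ploidy)
print(n_genotypes)  # 36

# Same distribution as in the example: all mass on the 17th genotype.
probs = [0.0] * n_genotypes
probs[16] = 1.0

print(genotype_order(nalleles, ploidy)[16])  # (1, 5)
dosage = dosage_from_probs(probs, nalleles, ploidy)
print(dosage)  # [0.0, 1.0, 0.0, 0.0, 0.0, 1.0, 0.0, 0.0]
```

Under these assumptions the 17th genotype pairs the second allele with the sixth, which reproduces the dosage printed above. The same count also explains the NaN padding seen earlier: a variant with 2 alleles and ploidy 2 has only 3 genotypes, so the remaining 33 slots are filled with NaN.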

Problems

If you encounter any issues, please submit them.

Authors

Danilo Horta

License

This project is licensed under the MIT License.
