Package for loading data from bgen files

Another bgen reader

This is a package for reading bgen files.

This package uses Cython to wrap C++ code for parsing bgen files. It's fairly quick: it can parse genotypes from 500,000 individuals at ~300 variants per second within a single Python process (~450 million probabilities per second with a 3 GHz CPU). Decompressing the genotype probabilities is the slow step: zlib decompression takes about 80% of the total time. Using zstd-compressed genotypes would be much faster, perhaps 2-3X faster.
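
The throughput figures above can be checked with a little arithmetic. The sketch below assumes three genotype probabilities per sample (a diploid, biallelic variant) and applies Amdahl's law to the 80% decompression share. The function names and the 4x decompression speedup are illustrative assumptions, not measurements from this package.

```python
def probabilities_per_second(n_samples, variants_per_second, probs_per_sample=3):
    """Genotype probabilities parsed per second.

    probs_per_sample=3 assumes a diploid, biallelic variant
    (hom-ref, het, hom-alt).
    """
    return n_samples * variants_per_second * probs_per_sample


def overall_speedup(decompress_fraction, decompress_speedup):
    """Amdahl's law: whole-pipeline speedup when only decompression gets faster."""
    remaining = 1.0 - decompress_fraction
    return 1.0 / (remaining + decompress_fraction / decompress_speedup)


# 500,000 samples at ~300 variants per second
print(probabilities_per_second(500_000, 300))

# hypothetical: decompression takes 80% of the time and becomes 4x faster
print(round(overall_speedup(0.8, 4.0), 2))
```

Under these assumptions a decompressor that is 4x faster gives roughly a 2.5x overall speedup, which sits inside the 2-3X guess.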

This has been optimized for UK Biobank bgen files (i.e. bgen version 1.2 with zlib-compressed 8-bit genotype probabilities), but the other bgen versions and zstd compression have also been tested using example bgen files.
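
To make "8-bit genotype probabilities" concrete: in the bgen v1.2 layout each stored probability is an unsigned integer that is divided by 2^B - 1 (255 for B = 8), and the last probability for a sample is not stored but recovered as one minus the others. The sketch below decodes an already-decompressed byte string for an unphased, diploid, biallelic variant. `decode_biallelic_8bit` is a hypothetical helper for illustration, not part of this package, and it ignores missing-data flags and other ploidies.

```python
def decode_biallelic_8bit(raw):
    """Decode 8-bit probabilities for unphased, diploid, biallelic samples.

    raw holds two bytes per sample: P(hom-ref) and P(het), each scaled to 0..255.
    P(hom-alt) is implied, since the three probabilities sum to one.
    """
    if len(raw) % 2 != 0:
        raise ValueError("expected two bytes per sample")
    scale = 255  # 2**8 - 1
    decoded = []
    for i in range(0, len(raw), 2):
        hom_ref = raw[i] / scale
        het = raw[i + 1] / scale
        hom_alt = max(0.0, 1.0 - hom_ref - het)  # guard against rounding below zero
        decoded.append((hom_ref, het, hom_alt))
    return decoded


# three samples: certain hom-ref, certain het, certain hom-alt
for row in decode_biallelic_8bit(bytes([255, 0, 0, 255, 0, 0])):
    print(row)
```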

Install

pip install bgen

Usage

from bgen import BgenReader

BGEN_PATH = 'example.bgen'  # placeholder: path to your own bgen file

bfile = BgenReader(BGEN_PATH)
rsids = bfile.rsids()

# select a variant by indexing
var = bfile[1000]

# pull out genotype probabilities
probs = var.probabilities  # returns 2D numpy array
dosage = var.minor_allele_dosage  # returns 1D numpy array for biallelic variant

# get all variants in a genomic region
variants = bfile.fetch('21', 10000, 5000000)

# iterate through every variant in the file
with BgenReader(BGEN_PATH, delay_parsing=True) as bfile:
    for var in bfile:
        dosage = var.minor_allele_dosage

# or for writing bgen files
import numpy as np
from bgen import BgenWriter

OUT_PATH = 'output.bgen'  # placeholder: a new path, so the input file is not overwritten

# one row per sample, one column per genotype probability
geno = np.array([[1, 0, 0], [0, 1, 0], [0, 0, 1]]).astype(np.float64)
with BgenWriter(OUT_PATH, n_samples=3) as bfile:
    bfile.add_variant(varid='var1', rsid='rs1', chrom='chr1', pos=1,
                      alleles=['A', 'G'], genotypes=geno)
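
As a rough illustration of how the `probabilities` and dosage arrays in the usage above relate, the sketch below derives dosages from per-sample genotype probabilities. It assumes an unphased, diploid, biallelic variant with columns ordered hom-ref, het, hom-alt. The helper names are hypothetical, and the package computes these values internally, possibly differently (for example in how ties or missing data are handled).

```python
def alt_dosage(probs):
    """Expected alt allele count per sample: P(het) + 2 * P(hom-alt)."""
    return [het + 2.0 * hom_alt for _hom_ref, het, hom_alt in probs]


def minor_allele_dosage(probs):
    """Expected count of the less common allele per sample."""
    alt = alt_dosage(probs)
    alt_freq = sum(alt) / (2.0 * len(alt))
    if alt_freq <= 0.5:
        return alt  # alt allele is the minor allele
    return [2.0 - x for x in alt]  # ref allele is the minor allele


# one row per sample: hom-alt, het, hom-alt
probs = [[0.0, 0.0, 1.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
print(alt_dosage(probs))
print(minor_allele_dosage(probs))
```

Here the alt allele is the common one, so the minor allele dosage counts copies of the reference allele instead.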

API documentation

class BgenReader(path, sample_path='', delay_parsing=False)
  # opens a bgen file. If a bgenix index exists for the file, the index file
  # will be opened automatically for quicker access of specific variants.
  Arguments:
    path: path to bgen file
    sample_path: optional path to sample file. Samples will be given integer IDs
        if sample file is not given and sample IDs not found in the bgen file
    delay_parsing: True/False option to allow for not loading all variants into
        memory when the BgenReader is opened. This can save time when iterating
        across variants in the file

  Attributes:
    samples: list of sample IDs
    header: BgenHeader with info about the bgen version and compression.

  Methods:
    indexing: BgenVars can be accessed by indexing the BgenReader e.g. bfile[1000]
    iteration: variants in a BgenReader can be looped over e.g. for x in bfile: print(x)
    fetch(chrom, start=None, stop=None): get all variants within a genomic region
    drop_variants(list[int]): drops variants by index from being used in analyses
    with_rsid(rsid): returns BgenVar with given rsid
    at_position(pos): returns BgenVar with given position
    varids(): returns list of varids for variants in the bgen file.
    rsids(): returns list of rsids for variants in the bgen file.
    chroms(): returns list of chromosomes for variants in the bgen file.
    positions(): returns list of positions for variants in the bgen file.

class BgenVar(handle, offset, layout, compression, n_samples):
  # Note: this isn't called directly, but instead returned from BgenReader methods
  Attributes:
    varid: ID for variant
    rsid: reference SNP ID for variant
    chrom: chromosome variant is on
    pos: nucleotide position variant is at
    alleles: list of alleles for variant
    is_phased: True/False for whether variant has phased genotype data
    ploidy: list of ploidy for each sample. Samples are ordered as per BgenReader.samples
    minor_allele: the least common allele (for biallelic variants)
    minor_allele_dosage: 1D numpy array of minor allele dosages for each sample
    alt_dosage: 1D numpy array of alt allele dosages for each sample
    probabilities: 2D numpy array of genotype probabilities, one sample per row
  
  BgenVars can be pickled e.g. pickle.dumps(var)
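
The attributes above are enough for simple per-variant filters. As a sketch, the hypothetical helper below estimates minor allele frequency from `minor_allele_dosage` and `ploidy`. It assumes missing genotypes appear as NaN in the dosage array, which is an assumption rather than something stated here, and it uses a stand-in object in place of a real BgenVar.

```python
import math
from types import SimpleNamespace


def minor_allele_frequency(var):
    """Estimate minor allele frequency from a variant's dosages and ploidy.

    Samples whose dosage is NaN (assumed to mark missing data) are skipped.
    Returns NaN when no sample has a usable dosage.
    """
    total_dosage = 0.0
    total_alleles = 0
    for dosage, ploidy in zip(var.minor_allele_dosage, var.ploidy):
        if math.isnan(dosage):
            continue
        total_dosage += dosage
        total_alleles += ploidy
    if total_alleles == 0:
        return math.nan
    return total_dosage / total_alleles


# stand-in for a BgenVar: four diploid samples, one with a missing genotype
demo = SimpleNamespace(minor_allele_dosage=[0.0, 1.0, float('nan'), 0.5],
                       ploidy=[2, 2, 2, 2])
print(minor_allele_frequency(demo))
```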
