A Python package to download, process, and access dbSNP data.

Project description

dbSNP to Parquet Converter

A Python-package-focused rewrite of Weinstock's dbSNP-to-Parquet repository. This package provides convenient download, processing, and access for dbSNP data. Once the data has been processed, a simple Python interface can be used to work with it.

Installation

pyvariantdb is available on PyPI and can be installed with the package manager of your choice. For development we use pixi:

# install pixi
curl -fsSL https://pixi.sh/install.sh | sh
pixi update && pixi install

Usage

We recommend preparing the data from the command line before using the package, since download and processing take quite some time. By default, the data is stored at ~/.cache/pyvariantdb. This can be changed with an environment variable:

export PYVARIANTDB_HOME="/raid/cache/pyvariantdb"
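
A minimal sketch of how such an override can be resolved in Python, assuming the package simply prefers the environment variable over the default path. `resolve_cache_dir` is a hypothetical helper for illustration and is not part of the pyvariantdb API; the real package may resolve the path differently.

```python
import os
from pathlib import Path

# Default location named in the text above.
DEFAULT_CACHE = Path("~/.cache/pyvariantdb")


def resolve_cache_dir(env=None):
    """Return the cache directory, preferring PYVARIANTDB_HOME when set.

    Hypothetical helper: shown only to illustrate the override logic.
    """
    env = os.environ if env is None else env
    override = env.get("PYVARIANTDB_HOME")
    # Fall back to the default when the variable is unset or empty.
    path = Path(override) if override else DEFAULT_CACHE
    return path.expanduser()


print(resolve_cache_dir())
```

Passing a plain dictionary as `env` makes the logic easy to check without touching the real environment.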

Run the pipeline with:

# activate pixi
pixi shell -e default
# download dbsnp
pyvariantdb-download
# transform the database to parquets
pyvariantdb-make-dbsnp 
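
If the two steps should be driven from a Python script instead of an interactive shell, a sketch along the following lines may help. It assumes both console scripts are on the PATH (for example inside the activated pixi environment) and take no required arguments, as in the commands above. `run_pipeline` is a hypothetical wrapper, not something the package ships.

```python
import subprocess

# Console scripts named in the steps above, in the order they must run.
PIPELINE_COMMANDS = ("pyvariantdb-download", "pyvariantdb-make-dbsnp")


def run_pipeline(runner=subprocess.run):
    """Run each pipeline step in order, stopping at the first failure."""
    for command in PIPELINE_COMMANDS:
        # check=True raises CalledProcessError on a non-zero exit code,
        # so the conversion never starts after a failed download.
        runner([command], check=True)
```

Calling `run_pipeline()` then executes the download followed by the conversion; the `runner` parameter exists only so the call order can be tested without running the real commands.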

We aim to provide an easy interface for extracting genomic coordinates from dbSNP. The main functionality is deliberately simple: provide the rs-identifiers (with or without a chromosome) and the package fetches the corresponding coordinates. This requires that the pipeline described above has been run first.

Extracting coordinates is simple:

from pyvariantdb.lookup import SNPLookup

lookup = SNPLookup()
# P53 mutations, chromosome 17
rsids = ["rs1042522", "rs17878362", "rs1800372"]
df_all = lookup.query_all(rsids)
df_chr = lookup.query_chromosome(rsids, "17")
print(df_all)
print(df_chr) 
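
Identifier lists often come from spreadsheets or other tools and may contain stray whitespace, mixed case, or duplicates. The following sketch cleans such a list before it is passed to a lookup. `normalize_rsids` is a hypothetical helper, and the assumption that the lookup expects a lowercase "rs" prefix is ours, not something the package documents.

```python
import re

# An rs-identifier is the prefix "rs" followed by digits, e.g. "rs1042522".
_RSID_PATTERN = re.compile(r"rs([0-9]+)", re.IGNORECASE)


def normalize_rsids(raw_ids):
    """Return cleaned, de-duplicated rsIDs, keeping first-seen order."""
    seen = set()
    cleaned = []
    for raw in raw_ids:
        match = _RSID_PATTERN.fullmatch(str(raw).strip())
        if match is None:
            raise ValueError(f"not an rs-identifier: {raw!r}")
        rsid = f"rs{match.group(1)}"
        if rsid not in seen:
            seen.add(rsid)
            cleaned.append(rsid)
    return cleaned


print(normalize_rsids([" rs1042522", "RS17878362", "rs1042522"]))
# ['rs1042522', 'rs17878362']
```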

Processing Pipeline

pyvariantdb offers some quality-of-life improvements for working with dbSNP and the original repository. The original pipeline remains the same:

  1. Downloads dbSNP data (GRCh38 build 156)
  2. Filters for SNVs only
  3. Converts chromosome contigs to standard naming
  4. Splits data by chromosome
  5. Creates Parquet lookup tables with RSID mappings
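
To illustrate step 3, the sketch below maps RefSeq-style contig accessions, as used in dbSNP VCF releases, to conventional chromosome labels. It is not the pipeline's actual implementation: the function name, the `prefix` parameter, and the UCSC-style "M" label for the mitochondrial genome are assumptions made for this example.

```python
import re

# RefSeq chromosome accessions look like "NC_000017.11": a six-digit number
# followed by an assembly-specific version suffix.
_REFSEQ_PATTERN = re.compile(r"NC_([0-9]{6})\.[0-9]+")

# Accession numbers that do not map to a numbered autosome.
_SPECIAL_LABELS = {23: "X", 24: "Y", 12920: "M"}  # NC_012920 is mitochondrial


def refseq_to_chromosome(contig, prefix="chr"):
    """Map a RefSeq accession to a chromosome label, or None otherwise."""
    match = _REFSEQ_PATTERN.fullmatch(contig)
    if match is None:
        return None  # e.g. unplaced scaffolds such as NW_ or NT_ contigs
    number = int(match.group(1))
    if 1 <= number <= 22:
        return f"{prefix}{number}"
    label = _SPECIAL_LABELS.get(number)
    return f"{prefix}{label}" if label is not None else None


print(refseq_to_chromosome("NC_000017.11"))             # chr17
print(refseq_to_chromosome("NC_000023.11", prefix=""))  # X
```

Ignoring the version suffix keeps the mapping independent of the exact assembly patch, at the cost of not verifying that the accession really belongs to GRCh38.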

Output

The script entry point writes the following files to the configured cache directory:

  • dbSNP_156.bcf - Full filtered BCF file
  • dbSNP_156.chr*.bcf - Per-chromosome BCF files
  • dbSNP_156.chr*.lookup.parquet - Per-chromosome RSID lookup tables
  • dbSNP_156.lookup.parquet - Combined RSID lookup table
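
A short sketch for checking which per-chromosome lookup tables exist after a run. It assumes the files sit directly in the cache directory and that the `*` in the patterns above stands for the chromosome label. `find_lookup_tables` is a hypothetical helper, not part of the package.

```python
from pathlib import Path


def find_lookup_tables(cache_dir):
    """Map each chromosome label to its per-chromosome lookup table."""
    prefix, suffix = "dbSNP_156.chr", ".lookup.parquet"
    tables = {}
    for path in sorted(Path(cache_dir).glob(f"{prefix}*{suffix}")):
        # Strip the fixed prefix and suffix to recover the label, e.g. "17".
        label = path.name[len(prefix):-len(suffix)]
        tables[label] = path
    return tables
```

The combined table dbSNP_156.lookup.parquet does not match the per-chromosome pattern, so it is deliberately left out of the result.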

Download files

Source Distribution

pyvariantdb-1.3.tar.gz (13.6 kB)

Uploaded Source

Built Distribution

pyvariantdb-1.3-py3-none-any.whl (13.9 kB)

Uploaded Python 3

File details

Details for the file pyvariantdb-1.3.tar.gz.

File metadata

  • Download URL: pyvariantdb-1.3.tar.gz
  • Size: 13.6 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.2.0 CPython/3.12.12

File hashes

Hashes for pyvariantdb-1.3.tar.gz
Algorithm Hash digest
SHA256 2d04577f66c6c5b37bae01837fe718eadee0b54d50325f7042e8c075d78f7393
MD5 9c7846dce61c4c5ebd5d6ae5d9e3af44
BLAKE2b-256 4f4622d1298c171d30a5021dab7707a4b2156d5d66d57bdb4311c2f7a447844c

File details

Details for the file pyvariantdb-1.3-py3-none-any.whl.

File metadata

  • Download URL: pyvariantdb-1.3-py3-none-any.whl
  • Size: 13.9 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.2.0 CPython/3.12.12

File hashes

Hashes for pyvariantdb-1.3-py3-none-any.whl
Algorithm Hash digest
SHA256 6e5f8f43c1b312ea01856a8dad99642c7044753feb9a311232f6e7beb741f4c4
MD5 3f5ff30e2cf73dab97be3b87121e06d4
BLAKE2b-256 b0aa48c91641f84603448b569173fe4dad3fdb93ba48f5e9c6ddfb8db34a7e07
