
A Python package to download, process, and access dbSNP data.


dbSNP to Parquet Converter

A Python-package-focused rewrite of Weinstock's dbSNP to Parquet repository. The package makes it convenient to download, process, and access data from dbSNP. Once the data has been processed, a simple Python interface can be used to work with it.

Installation

pyvariantdb is available on PyPI and can be installed with your package manager of choice (for example, pip install pyvariantdb). For development we use pixi:

# install pixi
curl -fsSL https://pixi.sh/install.sh | sh
pixi update && pixi install

Usage

We recommend preparing the data from the command line before using the package, since downloading and processing take quite some time. By default the data is stored at ~/.cache/pyvariantdb. This location can be changed with the PYVARIANTDB_HOME environment variable:

export PYVARIANTDB_HOME="/raid/cache/pyvariantdb"
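The lookup order described above (environment variable first, then the default cache path) can be sketched in Python as follows. This is only an illustration of the documented behavior, not the package's actual implementation; the helper name resolve_cache_dir is hypothetical.

```python
import os
from pathlib import Path


def resolve_cache_dir(env=None):
    """Return the directory where the processed dbSNP data is expected.

    Hypothetical helper: mirrors the documented behavior only.
    """
    # Allow a custom mapping to be passed in so the logic is easy to test.
    env = os.environ if env is None else env
    override = env.get("PYVARIANTDB_HOME")
    if override:
        # An explicitly configured location takes precedence.
        return Path(override).expanduser()
    # Documented default location.
    return Path.home() / ".cache" / "pyvariantdb"


if __name__ == "__main__":
    print(resolve_cache_dir())
```

Checking this path before starting the download is a cheap way to confirm that the data will end up on a volume with enough free space.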

Execution of the pipeline can be done with:

# activate pixi
pixi shell -e default
# download dbsnp
pyvariantdb-download
# transform the database to parquets
pyvariantdb-make-dbsnp 

We aim to provide an easy interface for extracting genomic coordinates from dbSNP. Provide the rs identifiers, with or without a chromosome, and the package fetches the corresponding coordinates. This requires that the pipeline described above has already been run.

For example:

from pyvariantdb.lookup import SNPLookup

lookup = SNPLookup()
# P53 mutations, chromosome 17
rsids = ["rs1042522", "rs17878362", "rs1800372"]
df_all = lookup.query_all(rsids)
df_chr = lookup.query_chromosome(rsids, "17")
print(df_all)
print(df_chr) 

Processing Pipeline

pyvariantdb offers some quality-of-life improvements for working with dbSNP and the original repository. The pipeline itself is unchanged from the original:

  1. Downloads dbSNP data (build 156, GRCh38)
  2. Filters for SNVs only
  3. Converts chromosome contigs to standard naming
  4. Splits data by chromosome
  5. Creates Parquet lookup tables with RSID mappings
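To make step 3 more concrete: dbSNP releases identify chromosomes by RefSeq accessions such as NC_000017.11, and these need to be mapped to conventional chromosome names. The following is a minimal sketch of such a mapping, not the code used by the pipeline. The function name normalize_contig, the "chr" prefix, and the "M" label for the mitochondrial contig are assumptions made for illustration.

```python
import re

# RefSeq chromosome accessions look like NC_000017.11 (number, then version).
_REFSEQ_CHROM = re.compile(r"^NC_0*(\d+)\.\d+$")


def normalize_contig(contig, prefix="chr"):
    """Map a RefSeq chromosome accession to a conventional name.

    Hypothetical helper. Returns None for contigs that are not primary
    chromosomes (for example unplaced scaffolds with an NT_ or NW_ prefix).
    """
    match = _REFSEQ_CHROM.match(contig)
    if match is None:
        return None
    number = int(match.group(1))
    if 1 <= number <= 22:
        return f"{prefix}{number}"
    if number == 23:
        return f"{prefix}X"
    if number == 24:
        return f"{prefix}Y"
    if number == 12920:
        # Human mitochondrial genome; the "M" label is an assumption.
        return f"{prefix}M"
    return None


if __name__ == "__main__":
    for name in ["NC_000017.11", "NC_000023.11", "NT_187361.1"]:
        print(name, "->", normalize_contig(name))
```

The accession version (the part after the dot) is ignored on purpose, so the same mapping works regardless of which assembly patch the accession refers to.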

Output

The script entry point writes the following files to the configured cache directory:

  • dbSNP_156.bcf - Full filtered BCF file
  • dbSNP_156.chr*.bcf - Per-chromosome BCF files
  • dbSNP_156.chr*.lookup.parquet - Per-chromosome RSID lookup tables
  • dbSNP_156.lookup.parquet - Combined RSID lookup table
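To check which per-chromosome lookup tables are present after a pipeline run, the file naming pattern listed above can be matched directly. The sketch below uses only the standard library; the helper name find_chromosome_lookups is hypothetical, and the code relies solely on the file names given in the list, not on any knowledge of the Parquet schema.

```python
from pathlib import Path


def find_chromosome_lookups(cache_dir):
    """Return a dict mapping chromosome label to its lookup table path.

    Hypothetical helper based only on the documented file names.
    """
    prefix = "dbSNP_156.chr"
    suffix = ".lookup.parquet"
    tables = {}
    # The pattern requires ".chr", so the combined table and the BCF
    # files are not matched.
    for path in sorted(Path(cache_dir).glob(f"{prefix}*{suffix}")):
        chromosome = path.name[len(prefix):-len(suffix)]
        tables[chromosome] = path
    return tables


if __name__ == "__main__":
    default_dir = Path.home() / ".cache" / "pyvariantdb"
    for chromosome, path in find_chromosome_lookups(default_dir).items():
        print(chromosome, path)
```

An empty result suggests that the pipeline has not finished yet or that the data was written to a different cache directory.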
