A Python package to download, process, and access dbSNP data.
dbSNP to Parquet Converter
A rewrite of Weinstock's dbSNP-to-Parquet repository as a Python package. It provides convenient download, processing, and access to data from dbSNP. Once the data is processed, a simple Python interface can be used to work with it.
Installation
pyvariantdb is available on PyPI and can be installed with the package management tool of your choice. For development we use pixi:
# install pixi
curl -fsSL https://pixi.sh/install.sh | sh
pixi update && pixi install
Usage
We recommend preparing the data from the command line before using the package, since download and processing take quite some time. By default the data is stored at ~/.cache/pyvariantdb. This can be changed with an environment variable:
export PYVARIANTDB_HOME="/raid/cache/pyvariantdb"
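To make the default-versus-override behaviour concrete, here is a minimal sketch of how a cache directory could be resolved from the environment. The helper name `resolve_cache_dir` is hypothetical and is not part of pyvariantdb; only the variable name `PYVARIANTDB_HOME` and the default path come from the text above, and the package's actual resolution logic may differ.

```python
import os
from pathlib import Path

# Default location stated in the README.
DEFAULT_CACHE = "~/.cache/pyvariantdb"


def resolve_cache_dir(env=None):
    """Return the cache directory, preferring PYVARIANTDB_HOME if set.

    Hypothetical helper for illustration; not pyvariantdb's own API.
    """
    env = os.environ if env is None else env
    return Path(env.get("PYVARIANTDB_HOME", DEFAULT_CACHE)).expanduser()


if __name__ == "__main__":
    print(resolve_cache_dir())
```

Passing a plain dictionary instead of `os.environ` makes the helper easy to check without touching the real environment.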
Execution of the pipeline can be done with:
# activate pixi
pixi shell -e default
# download dbsnp
pyvariantdb-download
# transform the database to parquets
pyvariantdb-make-dbsnp
We aim to provide an easy interface for extracting genomic coordinates from dbSNP, so the main functionality is deliberately simple: provide the rs-identifiers (with or without a chromosome) and the package fetches the respective coordinates. This requires that the pipeline described above has been run first.
Extracting coordinates is simple:
from pyvariantdb.lookup import SNPLookup
lookup = SNPLookup()
# P53 mutations, chromosome 17
rsids = ["rs1042522", "rs17878362", "rs1800372"]
df_all = lookup.query_all(rsids)
df_chr = lookup.query_chromosome(rsids, "17")
print(df_all)
print(df_chr)
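Because lookups are keyed on rs-identifiers, it can help to check the input before querying. The following is a minimal sketch, assuming identifiers take the usual form of `rs` followed by digits; `split_rsids` is a hypothetical helper and not part of pyvariantdb.

```python
import re

# An rs-identifier is "rs" followed by one or more digits.
_RSID = re.compile(r"^rs\d+$")


def split_rsids(candidates):
    """Split candidates into (valid, invalid) rs-identifiers.

    Surrounding whitespace is stripped; order is preserved.
    Hypothetical helper for illustration only.
    """
    valid, invalid = [], []
    for raw in candidates:
        item = raw.strip()
        (valid if _RSID.match(item) else invalid).append(item)
    return valid, invalid


if __name__ == "__main__":
    ok, bad = split_rsids(["rs1042522", " rs17878362 ", "1800372", "rsABC"])
    print(ok)
    print(bad)
```

The valid list can then be passed to `lookup.query_all` or `lookup.query_chromosome` as in the example above.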
Processing Pipeline
pyvariantdb offers some quality-of-life improvements over the original repository for working with dbSNP.
The pipeline itself remains the same as the original:
- Downloads dbSNP data (GRCh38 build 156)
- Filters for SNVs only
- Converts chromosome contigs to standard naming
- Splits data by chromosome
- Creates Parquet lookup tables with RSID mappings
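The contig-renaming step can be sketched as follows. dbSNP VCFs label contigs with RefSeq accessions such as `NC_000017.11`; the function below maps the primary human assembly accessions to `chr`-prefixed names. This is an illustration only: `refseq_to_chrom` is a hypothetical name, the `chr` prefix is an assumption based on the output file names, and the pipeline's actual renaming mechanism is not described here.

```python
import re

# RefSeq chromosome accessions look like NC_000017.11 (accession.version).
_REFSEQ = re.compile(r"^NC_(\d{6})\.\d+$")


def refseq_to_chrom(contig):
    """Map a RefSeq chromosome accession to a chr-style name.

    Returns None for anything that is not a primary human chromosome
    (for example unplaced scaffolds). Hypothetical helper.
    """
    match = _REFSEQ.match(contig)
    if match is None:
        return None
    number = int(match.group(1))
    if 1 <= number <= 22:
        return f"chr{number}"  # autosomes
    if number == 23:
        return "chrX"
    if number == 24:
        return "chrY"
    if number == 12920:
        return "chrM"  # mitochondrial genome, NC_012920
    return None


if __name__ == "__main__":
    for name in ("NC_000017.11", "NC_000023.11", "NT_187361.1"):
        print(name, refseq_to_chrom(name))
```

The version suffix is ignored on purpose, so the mapping does not depend on a particular assembly patch level.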
Output
The script entry point writes the following files to the configured cache directory:
- dbSNP_156.bcf - Full filtered BCF file
- dbSNP_156.chr*.bcf - Per-chromosome BCF files
- dbSNP_156.chr*.lookup.parquet - Per-chromosome RSID lookup tables
- dbSNP_156.lookup.parquet - Combined RSID lookup table
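Since the pipeline takes a long time, it can be useful to check which outputs already exist before starting a lookup. The sketch below uses only the file name patterns listed above; `missing_outputs` is a hypothetical helper, not part of pyvariantdb, and it assumes the files sit directly in the cache directory.

```python
from pathlib import Path

# Glob patterns taken from the output file list in this README.
EXPECTED_PATTERNS = (
    "dbSNP_156.bcf",
    "dbSNP_156.chr*.bcf",
    "dbSNP_156.chr*.lookup.parquet",
    "dbSNP_156.lookup.parquet",
)


def missing_outputs(cache_dir):
    """Return the patterns that match no file in cache_dir.

    Hypothetical helper; assumes a flat directory layout.
    """
    root = Path(cache_dir)
    return [p for p in EXPECTED_PATTERNS if not any(root.glob(p))]


if __name__ == "__main__":
    cache = Path("~/.cache/pyvariantdb").expanduser()
    print(missing_outputs(cache))
```

An empty list suggests that every expected output is present; any remaining pattern points to a pipeline step that still needs to run.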