This package converts the huge gnomAD files into a SQLite database that is easy and fast to query. It extracts the minor allele frequency for each variant from a gnomAD VCF.

gnomAD_DB

Changelog

NEW version (April 2024)

  • release gnomAD WGS v4.1 and WES v4.1
    • More information here.

version (November 2023)

  • release gnomAD WGS v4.0 and WES v4.0
  • gnomad_version=["v2"|"v3"|"v4"] argument has to be specified when initializing the database
  • minor fixes

version (July 2022)

  • release gnomAD WGS v3.1.2
  • minor bug fixes

version (December 2021)

  • more variant features are available, check here
  • get_maf_from_df renamed to get_info_from_df
  • get_maf_from_str renamed to get_info_from_str
  • [DEPRECATED 11.2023] genome=["Grch37"|"Grch38"] argument has to be specified when initializing the database

Why and What

The Genome Aggregation Database (gnomAD) is a resource developed by an international coalition of investigators, with the goal of aggregating and harmonizing both exome and genome sequencing data from a wide variety of large-scale sequencing projects, and making summary data available for the wider scientific community.

This package condenses the huge gnomAD files (on average ~120 GB per chromosome) into a SQLite database smaller than 100 GB and lets scientists look up the variant annotations present in gnomAD (e.g. Allele Count, Depth, Minor Allele Frequency; here you can find all selected features for each genome version). A query containing 300,000 variants takes ~40 s.

It extracts about 23 variant annotations from a gnomAD VCF. You can find further information about the exact fields here.
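
To illustrate the idea behind the package, the sketch below stores a few variants in an in-memory SQLite table and looks them up by position. It is a minimal illustration and not the package's actual schema: the table name `variants`, the reduced column set, the helper names, and the AF values are all assumptions made up for this example.

```python
import sqlite3

# Toy table with a single annotation column (AF). The package itself
# extracts about 23 annotations per variant; the table name, columns,
# and values below are illustrative assumptions only.
conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE variants (
        chrom TEXT, pos INTEGER, ref TEXT, alt TEXT, AF REAL,
        PRIMARY KEY (chrom, pos, ref, alt)
    )
    """
)
toy_rows = [
    ("21", 9825790, "C", "T", 0.25),  # made-up allele frequencies
    ("21", 9825795, "G", "A", 0.5),
    ("1", 21, "T", "G", 0.125),
]
conn.executemany("INSERT INTO variants VALUES (?, ?, ?, ?, ?)", toy_rows)


def lookup_af(conn, chrom, pos, ref, alt):
    """Return the AF of one variant, or None if it is not in the table."""
    row = conn.execute(
        "SELECT AF FROM variants "
        "WHERE chrom = ? AND pos = ? AND ref = ? AND alt = ?",
        (chrom, pos, ref, alt),
    ).fetchone()
    return row[0] if row else None


def af_in_interval(conn, chrom, start, end):
    """Return (pos, AF) pairs for variants with start <= pos <= end."""
    return conn.execute(
        "SELECT pos, AF FROM variants "
        "WHERE chrom = ? AND pos BETWEEN ? AND ? ORDER BY pos",
        (chrom, start, end),
    ).fetchall()


print(lookup_af(conn, "21", 9825790, "C", "T"))
print(af_in_interval(conn, "21", 9825780, 9825799))
```

The composite primary key gives SQLite an index on (chrom, pos, ref, alt), which is what makes both the point lookup and the interval scan fast in this sketch.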

The package works for all currently available gnomAD releases (as of July 2022).

1. Download SQLite preprocessed files

I have preprocessed and created sqlite3 files for gnomAD for you, which can be easily downloaded from here. They contain all variants on the 24 standard chromosomes.

You can download them as follows:

from gnomad_db.database import gnomAD_DB
download_link = "https://zenodo.org/record/6818606/files/gnomad_db_v3.1.2.sqlite3.gz?download=1"
output_dir = "test_dir" # database_location
gnomAD_DB.download_and_unzip(download_link, output_dir)

NB: this takes ~30 min (at a network speed of 10mb/s).
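
The download is a gzip-compressed SQLite file. If you fetch the archive by other means (for example with a download manager), a decompression step along the following lines would be needed. This is a sketch using only the standard library, not the implementation of `download_and_unzip`; the function name `gunzip_file` is hypothetical.

```python
import gzip
import shutil
from pathlib import Path


def gunzip_file(src, dst, chunk_size=1024 * 1024):
    """Stream-decompress src (.gz) into dst without loading it into memory."""
    src, dst = Path(src), Path(dst)
    # Create the target directory if it does not exist yet.
    dst.parent.mkdir(parents=True, exist_ok=True)
    with gzip.open(src, "rb") as f_in, open(dst, "wb") as f_out:
        # Copy in fixed-size chunks so memory use stays small
        # even for archives of many gigabytes.
        shutil.copyfileobj(f_in, f_out, chunk_size)
    return dst
```

Before pointing `gnomAD_DB` at the directory, check which file name the package expects there; that name is not assumed in this sketch.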

Alternatively, you can create the database yourself. However, I recommend using the preprocessed files to save resources and time. If you do so, you can go to 2. API usage and explore the package and its great features!

2. API usage

Congratulations, your database is set up! Now it is time to learn how to use it.

First, install the package in the gnomad_db env or in the environment that you are going to use for your downstream analysis:

pip install gnomad_db

You can use the package as follows:

  1. Import modules
import pandas as pd
from gnomad_db.database import gnomAD_DB
  2. Initialize database connection
    Make sure to have the correct gnomad version!
# pass the directory that contains the database
database_location = "test_dir"
db = gnomAD_DB(database_location, gnomad_version="v3")
  3. Insert some test variants to run the examples below
    If you have downloaded the preprocessed sqlite3 files, you can skip this step, as you already have variants. Make sure to have the correct genome version!
# get some variants
var_df = pd.read_csv("data/test_vcf_gnomad_chr21_10000.tsv.gz", sep="\t", names=db.columns, index_col=False)
# IMPORTANT: The database internally removes the chr prefix (chr1->1)
# insert these variants
db.insert_variants(var_df)
  4. Query variant minor allele frequency
    These example variants are assembled to hg38!
# query some MAF scores
dummy_var_df = pd.DataFrame({
    "chrom": ["1", "21"],
    "pos": [21, 9825790],
    "ref": ["T", "C"],
    "alt": ["G", "T"]})

# query from dataframe AF column
db.get_info_from_df(dummy_var_df, "AF")

# query from dataframe AF and AF_popmax columns
db.get_info_from_df(dummy_var_df, "AF, AF_popmax")

# query from dataframe all columns
db.get_info_from_df(dummy_var_df, "*")

# query from string
db.get_info_from_str("21:9825790:C>T", "AF")
  5. You can also query minor allele frequencies over an interval
db.get_info_for_interval(chrom=21, interval_start=9825780, interval_end=9825799, query="AF")
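
The string form used by `get_info_from_str` above is `chrom:pos:ref>alt`, and the database stores chromosomes without the `chr` prefix. When variants come from another tool, a small helper like the one below can bring them into that shape before building the query dataframe. The function `parse_variant` is hypothetical and not part of gnomad_db; it is a sketch that assumes well-formed input.

```python
def parse_variant(variant):
    """Split 'chrom:pos:ref>alt' into a dict, dropping a leading 'chr'."""
    chrom, pos, alleles = variant.strip().split(":")
    ref, alt = alleles.split(">")
    # Mirror the database convention of chromosome names without 'chr'.
    if chrom.lower().startswith("chr"):
        chrom = chrom[3:]
    return {"chrom": chrom, "pos": int(pos), "ref": ref, "alt": alt}


variants = ["chr21:9825790:C>T", "1:21:T>G"]
records = [parse_variant(v) for v in variants]
print(records)
```

A list of such dicts can be passed to `pd.DataFrame(records)` to obtain the `chrom`, `pos`, `ref`, and `alt` columns used in the dataframe queries above.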

For more information on how to use the package, see the GettingStartedwithGnomAD_DB.ipynb notebook!

Citation

If you find our work useful, please consider citing us:

@misc{gnomad_db,
  author       = {Kalin Nonchev},
  title        = {gnomAD_DB: Scalable SQLite Database for gnomAD VCF Files},
  year         = {2021},
  publisher    = {GitHub},
  journal      = {GitHub repository},
  howpublished = {\url{https://github.com/KalinNonchev/gnomAD_DB}},
  note         = {Accessed: 2025-05-27}
}

Contact

If you have questions, please get in touch with Kalin Nonchev.

NB: The package is under development; suggestions for use cases, extensions, and any other feedback are welcome.
