
This package scales the huge gnomAD files to a SQLite database, which is easy and fast to query. It extracts from a gnomAD vcf the minor allele frequency for each variant.


gnomAD_MAF

The Genome Aggregation Database (gnomAD) is a resource developed by an international coalition of investigators, with the goal of aggregating and harmonizing both exome and genome sequencing data from a wide variety of large-scale sequencing projects, and making summary data available for the wider scientific community.

This package condenses the huge gnomAD files (on average ~120 GB per chromosome) into a SQLite database with a size of 56 GB for WGS v3.1.1 (about 760,000,000 variants), and allows scientists to look up minor allele frequencies of variants quickly (a query containing 300,000 variants takes ~40 s).

From a gnomAD VCF it extracts the ["AF", "AF_afr", "AF_eas", "AF_fin", "AF_nfe", "AF_asj", "AF_oth", "AF_popmax"] columns.
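To make the extraction step concrete, here is a minimal sketch of pulling those fields out of the INFO column of a single VCF data line. It is not the package's own preprocessing code: `extract_af_fields` is a hypothetical helper, and the example line with its frequency values is invented for illustration.

```python
# Hypothetical helper, not part of gnomad_db: read selected allele-frequency
# fields from the INFO column of one VCF data line.
AF_FIELDS = ["AF", "AF_afr", "AF_eas", "AF_fin", "AF_nfe", "AF_asj", "AF_oth", "AF_popmax"]


def extract_af_fields(vcf_line, fields=AF_FIELDS):
    # VCF data lines are tab-separated: CHROM POS ID REF ALT QUAL FILTER INFO ...
    chrom, pos, _id, ref, alt, _qual, _filter, info = vcf_line.rstrip("\n").split("\t")[:8]
    # INFO is a ';'-separated list of key=value pairs; flags carry no '='.
    info_map = dict(item.split("=", 1) for item in info.split(";") if "=" in item)
    record = {"chrom": chrom, "pos": int(pos), "ref": ref, "alt": alt}
    for name in fields:
        value = info_map.get(name)
        # Absent fields and the VCF missing marker '.' both become None.
        record[name] = float(value) if value not in (None, ".") else None
    return record


# Invented example line (the frequencies are not real gnomAD values).
example = "chr21\t9825790\t.\tC\tT\t.\tPASS\tAC=2;AF=0.0005;AF_afr=0.001;segdup"
print(extract_af_fields(example))
```

Fields that are absent from a line come back as `None`, so every record has the same keys and maps cleanly onto fixed database columns.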

The package works for all gnomAD releases available as of 2021.

1. Download SQLite preprocessed files

I have preprocessed gnomAD v2.1.1 and v3.1.1 and created sqlite3 files for them, which can be downloaded from the links below. They contain all variants on the 24 standard chromosomes.

gnomAD v3.1.1 (hg38, 759'302'267 variants) 25G zipped, 56G in total - https://zenodo.org/record/5045170/files/gnomad_db_v3.1.1.sqlite3.gz?download=1
gnomAD v2.1.1 (hg19, 261'942'336 variants) 9G zipped, 20G in total - https://zenodo.org/record/5045102/files/gnomad_db_v2.1.1.sqlite3.gz?download=1

You can download a database from Python as follows:

from gnomad_db.database import gnomAD_DB
download_link = "https://zenodo.org/record/5045102/files/gnomad_db_v2.1.1.sqlite3.gz?download=1"
output_dir = "test_dir" # database_location
gnomAD_DB.download_and_unzip(download_link, output_dir)

NB: this takes ~30 min (at a network speed of 10 mb/s).
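If you fetch the archive some other way (for example with wget), it still has to be decompressed before use. The following is a minimal sketch using only the standard library; `gunzip_file` is a hypothetical helper and is not how `download_and_unzip` is implemented.

```python
import gzip
import shutil
from pathlib import Path


def gunzip_file(gz_path, output_dir):
    """Decompress gz_path into output_dir, streaming so memory use stays small."""
    gz_path = Path(gz_path)
    output_dir = Path(output_dir)
    output_dir.mkdir(parents=True, exist_ok=True)
    # Drop the trailing ".gz": "db.sqlite3.gz" -> "db.sqlite3"
    out_path = output_dir / gz_path.with_suffix("").name
    with gzip.open(gz_path, "rb") as src, open(out_path, "wb") as dst:
        shutil.copyfileobj(src, dst, length=1024 * 1024)  # copy in 1 MiB chunks
    return out_path
```

Copying in chunks matters here because the unzipped databases are tens of gigabytes and would not fit in memory if read in one go.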

Alternatively, you can create the database yourself. However, I recommend using the preprocessed files to save resources and time. If you do so, you can go straight to 2. API usage and explore the package and its features!

1.1 Data preprocessing and SQL database creation

Start by downloading the vcf files from gnomAD in a single directory:

wget -c link_to_gnomAD.vcf.bgz

After that, specify the arguments in script_config.yaml:

database_location: "test_out" # where to create the database; make sure you have enough disk space
gnomad_vcf_location: "data" # where your *.vcf.bgz files are located
tables_location: "test_out" # where to store the preprocessed intermediate files; you can leave it like this
script_locations: "test_out" # where to store the scripts and check the progress of your jobs; you can leave it like this
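Since the database alone reaches 56G for v3.1.1, it can be worth checking free disk space before starting. This is a minimal sketch with the standard library; `has_enough_space` is a hypothetical helper, and the intermediate files need additional space that is not accounted for here.

```python
import shutil


def has_enough_space(path, required_gb):
    """Return True if the filesystem holding `path` has at least required_gb GiB free."""
    free_bytes = shutil.disk_usage(path).free
    return free_bytes >= required_gb * 1024 ** 3


# 56 is the unzipped v3.1.1 database size quoted above;
# "." stands in for an existing database_location directory.
print(has_enough_space(".", 56))
```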

Once this is done, run

conda env create -f environment.yaml
conda activate gnomad_db
python -m ipykernel install --user --name gnomad_db --display-name "gnomad_db"

to prepare your conda environment.

Finally, you can trigger the snakemake pipeline, which will create the SQL database:

snakemake --cores 12

2. API usage

Congratulations, your database is set up! Now it is time to learn how to use it.

First, install the package, either in the gnomad_db env or in the environment you are going to use for your downstream analysis:

pip install gnomad_db

You can use the package as follows:

  1. Import modules
import pandas as pd
from gnomad_db.database import gnomAD_DB
  2. Initialize the database connection
# pass the directory that contains the database
database_location = "test_dir"
db = gnomAD_DB(database_location)
  3. Insert some test variants to run the examples below
    If you have downloaded the preprocessed sqlite3 files, you can skip this step, as you already have variants.
# get some variants
var_df = pd.read_csv("data/test_vcf_gnomad_chr21_10000.tsv.gz", sep="\t", names=db.columns, index_col=False)
# IMPORTANT: the database internally removes the chr prefix (chr1->1)
# insert these variants
db.insert_variants(var_df)
  4. Query variant minor allele frequencies
    These example variants use hg38 coordinates!
# query some MAF scores
dummy_var_df = pd.DataFrame({
    "chrom": ["1", "21"], 
    "pos": [21, 9825790], 
    "ref": ["T", "C"], 
    "alt": ["G", "T"]})

# query the AF column for the variants in a dataframe
db.get_maf_from_df(dummy_var_df, "AF")

# query the AF and AF_popmax columns for the variants in a dataframe
db.get_maf_from_df(dummy_var_df, "AF, AF_popmax")

# query all columns for the variants in a dataframe
db.get_maf_from_df(dummy_var_df, "*")

# query from a variant string
db.get_maf_from_str("21:9825790:C>T", "AF")
  5. You can also query minor allele frequencies for all variants in a genomic interval
db.get_mafs_for_interval(chrom=21, interval_start=9825780, interval_end=9825799, query="AF")
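To illustrate why such lookups are fast, here is a minimal sketch of the same kind of point and interval query against a toy in-memory SQLite table. It is not the package's implementation. The table name, schema, helper functions and frequency values are all assumptions made for illustration, and the inclusive interval bounds are a choice of this sketch rather than documented behaviour of `get_mafs_for_interval`.

```python
import sqlite3


def normalize_chrom(chrom):
    """Strip a leading 'chr' so that 'chr21' and 21 both become '21'."""
    chrom = str(chrom)
    return chrom[3:] if chrom.startswith("chr") else chrom


# Toy in-memory table; the real database schema may differ.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE variants (chrom TEXT, pos INTEGER, ref TEXT, alt TEXT, AF REAL, "
    "PRIMARY KEY (chrom, pos, ref, alt))"
)
toy_rows = [  # invented frequencies, for illustration only
    ("chr21", 9825790, "C", "T", 0.01),
    ("chr21", 9825795, "G", "A", 0.20),
    ("chr1", 21, "T", "G", 0.05),
]
conn.executemany(
    "INSERT INTO variants VALUES (?, ?, ?, ?, ?)",
    [(normalize_chrom(c), p, r, a, af) for c, p, r, a, af in toy_rows],
)


def af_for_variant(conn, chrom, pos, ref, alt):
    """Return AF for one variant, or None if it is not in the table."""
    row = conn.execute(
        "SELECT AF FROM variants WHERE chrom = ? AND pos = ? AND ref = ? AND alt = ?",
        (normalize_chrom(chrom), pos, ref, alt),
    ).fetchone()
    return row[0] if row else None


def afs_for_interval(conn, chrom, start, end):
    """Return (pos, ref, alt, AF) for all variants with start <= pos <= end."""
    return conn.execute(
        "SELECT pos, ref, alt, AF FROM variants "
        "WHERE chrom = ? AND pos BETWEEN ? AND ? ORDER BY pos",
        (normalize_chrom(chrom), start, end),
    ).fetchall()


print(af_for_variant(conn, "chr21", 9825790, "C", "T"))
print(afs_for_interval(conn, 21, 9825780, 9825799))
```

The composite primary key on (chrom, pos, ref, alt) gives SQLite an index that serves both the exact-match lookup and the position range scan, so neither query has to read the whole table.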

For more information on how to use the package, see the GettingStartedwithGnomAD_DB.ipynb notebook!

NB: The package is under development; suggestions for use cases or extensions and any other feedback are welcome.
