gnomAD_MAF
The Genome Aggregation Database (gnomAD) is a resource developed by an international coalition of investigators, with the goal of aggregating and harmonizing both exome and genome sequencing data from a wide variety of large-scale sequencing projects, and making summary data available for the wider scientific community.
This package converts the huge gnomAD VCF files (on average ~120 GB per chromosome) into a SQLite database of ~56 GB for WGS v3.1.1 (about 760,000,000 variants), and lets scientists look up minor allele frequencies of variants very quickly (a query containing 300,000 variants takes ~40 s).
It extracts from a gnomAD VCF the ["AF", "AF_afr", "AF_eas", "AF_fin", "AF_nfe", "AF_asj", "AF_oth", "AF_popmax"] columns.
The package works for all gnomAD releases currently available (as of 2021).
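For illustration, pulling these AF fields out of a single VCF INFO string can be sketched like this (a simplified sketch of the idea only; the package itself handles full *.vcf.bgz files, and parse_af_fields is a hypothetical helper, not part of the package):

```python
# Sketch: extract the AF-related fields from one semicolon-separated VCF INFO string.
# The column names below are the ones listed above; everything else is illustrative.
AF_COLUMNS = ["AF", "AF_afr", "AF_eas", "AF_fin", "AF_nfe", "AF_asj", "AF_oth", "AF_popmax"]

def parse_af_fields(info: str) -> dict:
    """Parse 'KEY=VALUE;KEY=VALUE;...' and keep only the AF columns, as floats."""
    fields = dict(kv.split("=", 1) for kv in info.split(";") if "=" in kv)
    return {col: float(fields[col]) for col in AF_COLUMNS if col in fields}

example_info = "AC=5;AF=0.000123;AF_afr=0.0;AF_popmax=0.00045"
print(parse_af_fields(example_info))
# {'AF': 0.000123, 'AF_afr': 0.0, 'AF_popmax': 0.00045}
```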
1. Data preprocessing and SQL database creation
Start by downloading the VCF files from gnomAD into a single directory:
wget -c link_to_gnomAD.vcf.bgz
After that, specify the arguments in script_config.yaml:
database_location: "test_out" # where to create the database; make sure you have enough space on your device
gnomad_vcf_location: "data" # where your *.vcf.bgz files are located
tables_location: "test_out" # where to store the preprocessed intermediate files; you can leave it like this
script_locations: "test_out" # where to store the scripts (and check the progress of your jobs); you can leave it like this
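If you want to double-check such a configuration from Python, a minimal stdlib-only sketch for this flat key/value subset of YAML could look like the following (in practice you would simply use PyYAML's yaml.safe_load; read_flat_config is a hypothetical helper for illustration):

```python
# Minimal sketch: read flat `key: "value"` pairs from a config file,
# ignoring trailing "# ..." comments. For anything beyond this flat
# subset, use PyYAML (yaml.safe_load) instead.
import re

def read_flat_config(text: str) -> dict:
    config = {}
    for line in text.splitlines():
        m = re.match(r'^(\w+):\s*"([^"]*)"', line.strip())
        if m:
            config[m.group(1)] = m.group(2)
    return config

example = '''
database_location: "test_out"  # where to create the database
gnomad_vcf_location: "data"    # where your *.vcf.bgz files are located
'''
print(read_flat_config(example))
# {'database_location': 'test_out', 'gnomad_vcf_location': 'data'}
```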
Once this is done, run
conda env create -f environment.yaml
conda activate gnomad_db
python -m ipykernel install --user --name gnomad_db --display-name "gnomad_db"
to prepare your conda environment.
Finally, you can trigger the Snakemake pipeline, which will create the SQLite database:
snakemake --cores 12
2. API usage
Congratulations, your database is set up! Now it is time to learn how to use it.
First, install the package, either in the gnomad_db environment or in the one you are going to use for your downstream analysis:
pip install gnomad_db
You can use the package as follows:
- Import modules
import pandas as pd
from gnomad_db.database import gnomAD_DB
- Initialize database connection
# pass dir
database_location = "test_dir"
db = gnomAD_DB(database_location)
- Insert some test variants to run the examples below
# get some variants
var_df = pd.read_csv("data/test_vcf_gnomad_chr21_10000.tsv.gz", sep="\t", names=db.columns, index_col=False)
# IMPORTANT: The database removes internally chr prefix (chr1->1)
# insert these variants
db.insert_variants(var_df)
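The chr-prefix normalization noted in the comment above ("chr1" -> "1") can be sketched as follows (an illustrative sketch of the behavior, not the package's internal code; normalize_chrom is a hypothetical helper):

```python
# Sketch of the chr-prefix normalization the database applies internally,
# so that "chr21" and "21" refer to the same contig.
def normalize_chrom(chrom: str) -> str:
    """Strip a leading 'chr' prefix from a chromosome name."""
    return chrom[3:] if chrom.startswith("chr") else chrom

print(normalize_chrom("chr21"))  # 21
print(normalize_chrom("21"))     # 21
print(normalize_chrom("chrX"))   # X
```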
- Query variant minor allele frequency
# query some MAF scores
dummy_var_df = pd.DataFrame({
"chrom": ["1", "21"],
"pos": [21, 9825790],
"ref": ["T", "C"],
"alt": ["G", "T"]})
# query from dataframe AF column
db.get_maf_from_df(dummy_var_df, "AF")
# query from dataframe AF and AF_popmax columns
db.get_maf_from_df(dummy_var_df, "AF, AF_popmax")
# query from dataframe all columns
db.get_maf_from_df(dummy_var_df, "*")
# query from string
db.get_maf_from_str("21:9825790:C>T", "AF")
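The "chrom:pos:ref>alt" string accepted by get_maf_from_str can be assembled with a small helper like this (format_variant is a hypothetical convenience function, not part of the package):

```python
# Sketch: build the "chrom:pos:ref>alt" variant string used by get_maf_from_str.
def format_variant(chrom: str, pos: int, ref: str, alt: str) -> str:
    """Return the variant in 'chrom:pos:ref>alt' notation."""
    return f"{chrom}:{pos}:{ref}>{alt}"

print(format_variant("21", 9825790, "C", "T"))  # 21:9825790:C>T
```

For example, db.get_maf_from_str(format_variant("21", 9825790, "C", "T"), "AF") is equivalent to the string query above.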
- You can also query intervals of minor allele frequencies
db.get_mafs_for_interval(chrom=21, interval_start=9825780, interval_end=9825799, query="AF")
For more information on how to use the package, have a look at the GettingStartedwithGnomAD_DB.ipynb notebook!
NB: The package is under development; use-case suggestions, extensions, and feedback are welcome.