gnomAD_MAF
The Genome Aggregation Database (gnomAD) is a resource developed by an international coalition of investigators, with the goal of aggregating and harmonizing both exome and genome sequencing data from a wide variety of large-scale sequencing projects, and making summary data available for the wider scientific community.
This package converts the huge gnomAD VCF files (on average ~120 GB per chromosome) into a SQLite database of ~56 GB for WGS v3.1.1 (about 760,000,000 variants), and lets scientists look up minor allele frequencies of variants very quickly (a query containing 300,000 variants takes ~40 s).
It extracts from a gnomAD VCF the ["AF", "AF_afr", "AF_eas", "AF_fin", "AF_nfe", "AF_asj", "AF_oth", "AF_popmax"] columns.
The package works for all gnomAD releases currently available (as of 2021).
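For illustration, pulling these AF fields out of a single VCF INFO string can be sketched like this (a simplified sketch of the idea only; the package itself handles full *.vcf.bgz files, and parse_af_fields is a hypothetical helper, not part of the package):

```python
# Sketch: extract the AF-related fields from one semicolon-separated VCF INFO string.
# The column names below are the ones listed above; everything else is illustrative.
AF_COLUMNS = ["AF", "AF_afr", "AF_eas", "AF_fin", "AF_nfe", "AF_asj", "AF_oth", "AF_popmax"]

def parse_af_fields(info: str) -> dict:
    """Parse 'KEY=VALUE;KEY=VALUE;...' and keep only the AF columns, as floats."""
    fields = dict(kv.split("=", 1) for kv in info.split(";") if "=" in kv)
    return {col: float(fields[col]) for col in AF_COLUMNS if col in fields}

example_info = "AC=5;AF=0.000123;AF_afr=0.0;AF_popmax=0.00045"
print(parse_af_fields(example_info))
# {'AF': 0.000123, 'AF_afr': 0.0, 'AF_popmax': 0.00045}
```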
1. Data preprocessing and SQL database creation
Start by downloading the VCF files from gnomAD into a single directory:
wget -c link_to_gnomAD.vcf.bgz
After that, specify the arguments in script_config.yaml:
database_location: "test_out" # where to create the database; make sure you have enough space on your device
gnomad_vcf_location: "data" # where your *.vcf.bgz files are located
tables_location: "test_out" # where to store the preprocessed intermediate files; you can leave it like this
script_locations: "test_out" # where to store the scripts (and check the progress of your jobs); you can leave it like this
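If you want to double-check such a configuration from Python, a minimal stdlib-only sketch for this flat key/value subset of YAML could look like the following (in practice you would simply use PyYAML's yaml.safe_load; read_flat_config is a hypothetical helper for illustration):

```python
# Minimal sketch: read flat `key: "value"` pairs from a config file,
# ignoring trailing "# ..." comments. For anything beyond this flat
# subset, use PyYAML (yaml.safe_load) instead.
import re

def read_flat_config(text: str) -> dict:
    config = {}
    for line in text.splitlines():
        m = re.match(r'^(\w+):\s*"([^"]*)"', line.strip())
        if m:
            config[m.group(1)] = m.group(2)
    return config

example = '''
database_location: "test_out"  # where to create the database
gnomad_vcf_location: "data"    # where your *.vcf.bgz files are located
'''
print(read_flat_config(example))
# {'database_location': 'test_out', 'gnomad_vcf_location': 'data'}
```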
Once this is done, run
conda env create -f environment.yaml
conda activate gnomad_db
python -m ipykernel install --user --name gnomad_db --display-name "gnomad_db"
to prepare your conda environment.
Finally, you can trigger the Snakemake pipeline, which will create the SQLite database:
snakemake --cores 12
2. API usage
Congratulations, your database is set up! Now it is time to learn how to use it.
First, install the package, either in the gnomad_db environment or in the one you are going to use for your downstream analysis:
pip install gnomad_db
You can use the package as follows:
- Import modules
import pandas as pd
from gnomad_db.database import gnomAD_DB
- Initialize database connection
# pass dir
database_location = "test_dir"
db = gnomAD_DB(database_location)
- Insert some test variants to run the examples below
# get some variants
var_df = pd.read_csv("data/test_vcf_gnomad_chr21_10000.tsv.gz", sep="\t", names=db.columns, index_col=False)
# IMPORTANT: The database removes internally chr prefix (chr1->1)
# insert these variants
db.insert_variants(var_df)
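The chr-prefix normalization noted in the comment above ("chr1" -> "1") can be sketched as follows (an illustrative sketch of the behavior, not the package's internal code; normalize_chrom is a hypothetical helper):

```python
# Sketch of the chr-prefix normalization the database applies internally,
# so that "chr21" and "21" refer to the same contig.
def normalize_chrom(chrom: str) -> str:
    """Strip a leading 'chr' prefix from a chromosome name."""
    return chrom[3:] if chrom.startswith("chr") else chrom

print(normalize_chrom("chr21"))  # 21
print(normalize_chrom("21"))     # 21
print(normalize_chrom("chrX"))   # X
```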
- Query variant minor allele frequency
# query some MAF scores
dummy_var_df = pd.DataFrame({
"chrom": ["1", "21"],
"pos": [21, 9825790],
"ref": ["T", "C"],
"alt": ["G", "T"]})
# query from dataframe AF column
db.get_maf_from_df(dummy_var_df, "AF")
# query from dataframe AF and AF_popmax columns
db.get_maf_from_df(dummy_var_df, "AF, AF_popmax")
# query from dataframe all columns
db.get_maf_from_df(dummy_var_df, "*")
# query from string
db.get_maf_from_str("21:9825790:C>T", "AF")
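The "chrom:pos:ref>alt" string accepted by get_maf_from_str can be assembled with a small helper like this (format_variant is a hypothetical convenience function, not part of the package):

```python
# Sketch: build the "chrom:pos:ref>alt" variant string used by get_maf_from_str.
def format_variant(chrom: str, pos: int, ref: str, alt: str) -> str:
    """Return the variant in 'chrom:pos:ref>alt' notation."""
    return f"{chrom}:{pos}:{ref}>{alt}"

print(format_variant("21", 9825790, "C", "T"))  # 21:9825790:C>T
```

For example, db.get_maf_from_str(format_variant("21", 9825790, "C", "T"), "AF") is equivalent to the string query above.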
- You can also query intervals of minor allele frequencies
db.get_mafs_for_interval(chrom=21, interval_start=9825780, interval_end=9825799, query="AF")
For more information on how to use the package, have a look at the GettingStartedwithGnomAD_DB.ipynb notebook!
NB: The package is under development; use-case suggestions, extensions, and feedback are welcome.