Pipeline for joint calling, sample and variant QC for WGS germline variant calling data
Project description
Joint calling pipeline
A Hail based pipeline for post-processing and filtering of large scale genomic variant calling datasets.
- Combines GVCFs (generated by GATK4) to a Hail Matrix Table.
- Performs sample-level QC.
- Performs variant QC using a random forest model.
- Performs variant QC using a allele-specific VQSR model.
Usage
The workflow should be run using the CPG analysis runner.
Install the CPG analysis runner:
mamba install -c cpg -c conda-forge analysis-runner
Assuming the name of the dataset is fewgenomes
,
export DATASET=fewgenomes
Add joint-calling as a submodule to your analysis repositry. E.g., to a branch develop
:
cd $DATASET # change to the dataset repository
git submodule add -f -b develop https://github.com/populationgenomics/joint-calling
git submodule init
git submodule update
Run the workflow using the analysis runner on a given dataset::
# Assuming we already changed to the dataset repository root
DATASET=$(basename $(pwd))
analysis-runner \
--dataset $DATASET \
--access-level test \
--description "joint calling test" \
--output-dir "gs://cpg-$DATASET-hail/joint-calling/test" \
joint-calling/driver_for_analysis_runner.sh workflows/batch_workflow.py \
--dataset $DATASET
--access-level test \
Thsi command will use the test
access level, which means finding the input GVCFs in the test
bucket gs://cpg-$DATASET-test/gvcf/batch0/*.g.vcf.gz
, writing the resulting matrix tables to the temporary
bucket, gs://cpg-fewgenomes-temporary/mt/v0.mt
, and writing all other analysis files to gs://cpg-fewgenomes-temporary/joint-calling/v0
, e.g.:
gs://cpg-fewgenomes-temporary/joint-calling/v0/sample_qc
gs://cpg-fewgenomes-temporary/joint-calling/v0/variant_qc
To use the main bucket as an input, but write the results to the temporary bucket, run the workflow with the standard
access level:
DATASET=$(basename $(pwd))
analysis-runner \
--dataset $DATASET \
--access-level standard \
--description "joint calling standard" \
--output-dir "gs://cpg-$DATASET-hail/joint-calling" \
joint-calling/driver_for_analysis_runner.sh workflows/batch_workflow.py \
joint-calling/ workflows/batch_workflow.py \
--dataset $DATASET \
--access-level standard \
--version v0 \
--batch 0 \
--batch 1
It will find input GVCFs in the main
bucket, assuming the batch IDs are batch1
and batch2
: gs://cpg-$DATASET-main/gvcf/batch{0,1}/*.g.vcf.gz
; write the resulting matrix tables to the temporary
bucket: gs://cpg-fewgenomes-temporary/mt/v0.mt
, and other analysis files to gs://cpg-fewgenomes-temporary/joint-calling/v0
.
To make a "production" run that uses the inputs from main
and writes to main
and analysis
, run with a full access level:,
DATASET=$(basename $(pwd))
analysis-runner \
--dataset $DATASET \
--access-level full \
--description "joint calling full" \
--output-dir "gs://cpg-$DATASET-hail/joint-calling" \
joint-calling/driver_for_analysis_runner.sh workflows/batch_workflow.py \
--dataset $DATASET \
--access-level full \
--version v0 \
--batch 0 \
--batch 1
It will find input GVCFs in the main
bucket, assuming the batch IDs are batch1
and batch2
: gs://cpg-$DATASET-main/gvcf/batch{0,1}/*.g.vcf.gz
; write the resulting matrix tables to the main
bucket: gs://cpg-fewgenomes-main/mt/v0.mt
, and other analysis files to analysis
: gs://cpg-fewgenomes-analysis/joint-calling/v0
.
Overview of the pipeline steps
-
Find inputs. According to the specified
--dataset
and--batch
parameters, look atgs://cpg-<dataset>-main/gvcf/<batch-id>/*.g.vcf.gz
(orgs://cpg-<dataset>-temporary/gvcf/<batch-id>/*.g.vcf.gz
) for the GVCFs and a CSV file with QC metadata. -
Post-process the GVCFs:
- Run GATK ReblockGVCFs to annotate with allele-specific VCF INFO fields required for recalibration (QUALapprox, VarDP, RAW_MQandDP),
- Subset GVCF to non-alt chromosomes.
-
Run the GVCF combiner using
scripts/combine_gvcfs.py
. The script merges GVCFs into a sparse Matrix Table using Hail's vcf_combiner. -
Run the
scripts/sample_qc.py
script, that performs the sample-level QC, and generates a Table with the filtered sample IDs, as well as a metadata Table with metrics that were used for filtering (coverage, sex, ancestry, contamination, variant numbers/distributions, etc). -
Run the random forest approach to perform the variant QC.
-
Run the allele-specific VQSR approach to perform the variant QC.
Sample QC
The sample QC and random forest variant QC pipelines are largely a re-implementation and orchestration of the Hail methods used for the quality control of GnomAD release. Good summaries of gnomAD QC pipeline can be found in gnomAD update blog posts:
- https://macarthurlab.org/2017/02/27/the-genome-aggregation-database-gnomad
- https://macarthurlab.org/2018/10/17/gnomad-v2-1
- https://macarthurlab.org/2019/10/16/gnomad-v3-0
- https://gnomad.broadinstitute.org/blog/2020-10-gnomad-v3-1-new-content-methods-annotations-and-data-availability/#sample-and-variant-quality-control
- https://blog.hail.is/whole-exome-and-whole-genome-sequencing-recommendations/
Here we give a brief overview of the sample QC steps:
-
Compute sample QC metrics using Hail’s
sample_qc
module on all autosomal bi-allelic SNVs. -
Filter outlier samples using the following cutoffs. Note that the most up to date cutoffs are speified in the configuration file filter_cutoffs.yaml, which can be overridden with
--filter-cutoffs-file
.- Number of SNVs: < 2.4M or > 8M
- Number of singletons: > 400k
- Hom/het ratio: > 3.3
-
Filter using BAM-level metrics was performed when such metrics were available. We removed samples that were outliers for:
- Contamination: freemix > 5% (
call-UnmappedBamToAlignedBam/UnmappedBamToAlignedBam/*/call-CheckContamination/*.selfSM
/FREEMIX
) - Chimeras: > 5% (
call-AggregatedBamQC/AggregatedBamQC/*/call-CollectAggregationMetrics/*.alignment_summary_metrics
/PCT_CHIMERAS
) - Duplication: > 30% (
call-UnmappedBamToAlignedBam/UnmappedBamToAlignedBam/*/call-MarkDuplicates/*.duplicate_metrics
/PERCENT_DUPLICATION
) - Median insert size: < 250 (
call-AggregatedBamQC/AggregatedBamQC/*/call-CollectAggregationMetrics/*.insert_size_metrics
/MEDIAN_INSERT_SIZE
) - Median coverage < 18X (calculated from the GVCFs).
- Contamination: freemix > 5% (
-
Sex inferred for each sample with Hail's
impute_sex
. Filter samples with sex chromosome aneuploidies or ambiguous sex assignment. -
Note that all filtering above makes it exclude samples from the variant QC modelling, as well as from the AC/AF/AN frequency calculation. However, it keeps the samples in the final matrix table, with labels in
mt.meta.hardfilter
. -
Relatedness inferred between samples using Hail's
pc_relate
. Identified pairs of 1st and 2nd degree relatives. Filter to a set of unrelated individuals using Hail'smaximal_independent_set
that tries to keep as many samples as possible. When multiple samples could be selected, we kept the sample with the highest coverage. -
PCA was a ran on high-quality variants, and RF was trained using 16 principal components as features on samples with known ancestry. Ancestry was assigned to all samples for which the probability of that ancestry was >75%.
-
Hail
sample_qc
was used stratified by 8 ancestry assignment PCs. Within each PC, outliers were filtered if they are 4 median absolute deviations (MADs) away from the median for the following metrics:n_snp
,r_ti_tv
,r_insertion_deletion
,n_insertion
,n_deletion
,r_het_hom_var
,n_het
,n_hom_var
,n_transition
,n_transversion
, or 8 MADs away from the median number of singletons (n_singleton
metric).
Allele-specific VQSR
-
Export variants into a sites-only VCF and split it into SNPs and indels, as well as region-wise for parallel processing.
-
Run Gnarly Genotyper to perform "quick and dirty" joint genotyping.
-
Filter variants in a large callset (>1000) with the ExcessHet > 54.69.
-
Create SNP and indel recalibration models using the allele-specific version of GATK Variant Quality Score Recalibration VQSR, using the standard GATK training resources (HapMap, Omni, 1000 Genomes, Mills indels), with the following features:
- SNVs:
AS_FS
,AS_SOR
,AS_ReadPosRankSum
,AS_MQRankSum
,AS_QD
,AS_MQ
- Indels:
AS_FS
,AS_SOR
,AS_ReadPosRankSum
,AS_MQRankSum
,AS_QD
- No sample had a high quality genotype at this variant site (GQ>=20, DP>=10, and AB>=0.2 for heterozygotes) (all fields are populated by GATK)
InbreedingCoeff
< -0.3 (there was an excess of heterozygotes at the site compared to Hardy-Weinberg expectations) (InbreedingCoeff
is populated by GATK)
- SNVs:
-
Apply the models to the VCFs and combine them back into one VCF.
-
Import the VCF back to a matrix table.
VQSR pipeline is a compilation from the following 2 WDL workflows:
hail-ukbb-200k-callset/GenotypeAndFilter.AS.wdl
- The Broad VQSR workflow documented here, translated from WDL with a help of Janis.
Random forest variant QC
-
Gather information for the random forest model
-
Impute missing entries
-
Select variants for training examples
-
Train random forests model
-
Test resulting model on chr20
-
Save training data with metadata describing random forest parameters used
-
Apply random forest model to the full variant set.
Project details
Release history Release notifications | RSS feed
Download files
Download the file for your platform. If you're not sure which to choose, learn more about installing packages.
Source Distribution
Built Distribution
Hashes for joint_calling-0.2.0-py3-none-any.whl
Algorithm | Hash digest | |
---|---|---|
SHA256 | 8970d7cbca61e1c939c43924851478fc0a6e096f242f68d512095a35e1ad1b61 |
|
MD5 | cc7e7af6c1e9bdc0dabe9c731b75935a |
|
BLAKE2b-256 | 6b9da126650fd855256f5e33c777dd17a8c6634139944df1f60f8884968321f3 |