
CPG ClinVar Re-interpretation


ClinVar, re-summarised

Motivation

While building Talos, a tool for identifying clinically relevant variants in large cohorts, we used ClinVar ratings as a contributing factor in determining pathogenicity. During development we found that the default aggregate summaries generated by ClinVar are highly conservative; see the table here describing the aggregate classification logic.

Content

This repository contains an alternative algorithm (described here) for re-aggregating the individual ClinVar submissions, generating decisions which favour clear assignment of pathogenic/benign ratings instead of defaulting to 'conflicting'. These ratings are not intended as a replacement for ClinVar's own decisions, but may provide value by showing that, although conflicting submissions exist, there is a clear bias towards either benign or pathogenic ratings.

We aim to re-run this process monthly, and publish the resulting files on Zenodo. You can download this pre-generated bundle here: https://zenodo.org/records/16792026

Primary Outputs

  • Hail Table and TSV of all revised decisions
  • Hail Table and TSV of all Pathogenic missense changes, indexed on Transcript and Codon. This is usable as a PM5 annotation resource.

TSVs

  1. clinvar_decisions.tsv: A tab-separated file with headers, containing our re-summarised ClinVar decisions. Columns:

    • contig: the chromosome or contig of the variant
    • position: the position of the variant on the contig
    • reference: the reference allele at the variant position
    • alternate: the alternate allele at the variant position
    • clinical_significance: the clinical significance of the variant, as determined by our algorithm
    • gold_stars: the number of gold stars assigned to the variant, indicating the quality of the evidence supporting the asserted significance
    • allele_id: the unique identifier for the variant in ClinVar, accessible directly via URL like http://www.ncbi.nlm.nih.gov/clinvar?term=XXXXXXX[alleleid], or through ClinVar's web page using an 'advanced search' field
  2. clinvar_decisions.pm5.tsv: A tab-separated file with headers, containing our PM5 missense decisions. All ClinVar entries in this file are Pathogenic Missense changes. Columns:

    • transcript: the transcript ID of the gene in which the missense change occurs
    • codon: the codon position of the missense change in that transcript
    • clinvar_alleles: +-delimited String, each entry being an AlleleID::GoldStars string, where AlleleID is the unique identifier for the ClinVar allele, and GoldStars is the number of stars assigned to that allele. e.g. 12345::3+67890::1, indicating that allele 12345 has 3 stars, and allele 67890 has 1 star, and both affect the same codon in the same transcript.
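The clinvar_alleles format described above can be unpacked with a few lines of Python. This is a minimal sketch rather than code from the ClinvArbitration package: the names ClinvarAllele, parse_clinvar_alleles and clinvar_url are hypothetical helpers introduced for illustration, and the URL simply follows the pattern given for allele_id.

```python
from typing import NamedTuple


class ClinvarAllele(NamedTuple):
    """One entry of the clinvar_alleles column (hypothetical helper type)."""

    allele_id: str
    gold_stars: int


def parse_clinvar_alleles(field: str) -> list[ClinvarAllele]:
    """Split a '+'-delimited value into AlleleID::GoldStars pairs."""
    alleles = []
    for entry in field.strip().split('+'):
        allele_id, stars = entry.split('::')
        alleles.append(ClinvarAllele(allele_id, int(stars)))
    return alleles


def clinvar_url(allele_id: str) -> str:
    """Build the ClinVar search URL for an allele ID, using the pattern shown above."""
    return f'http://www.ncbi.nlm.nih.gov/clinvar?term={allele_id}[alleleid]'


# Example value taken from the column description
alleles = parse_clinvar_alleles('12345::3+67890::1')
# Pick the allele with the most gold stars at this transcript/codon
best = max(alleles, key=lambda allele: allele.gold_stars)
print(best.allele_id, best.gold_stars, clinvar_url(best.allele_id))
```

Keeping the allele ID as a string avoids any assumption about its numeric range; only the star count is converted to an integer so entries can be ranked.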

Usage

Download Results

We aim to generate data monthly, and publish the results on Zenodo. The latest version of the data can be found at:

https://zenodo.org/records/16777475

Local Running

Downloading input files

A Nextflow workflow is provided to run the ClinvArbitration process locally. To use this process you will need reference files:

  • a reference genome, in FASTA format
  • a GFF3 file, containing gene annotations for the reference genome
  • the files containing raw ClinVar submissions and variant details

A directory (data) and a script (download_data.sh) are provided to download and store the required files. Running this script from the data directory will download and unpack all required files. The location these files are downloaded to matches the expected location in the Nextflow config, so you can run the workflow immediately after downloading.

The ClinVar Variant and Submission summary files are updated weekly. You should delete your local copy and re-download each time you run this workflow, to ensure you're capturing the latest data.
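Since the ClinVar summary files change weekly, it can help to check the age of the local copies before starting a run. The sketch below is not part of this repository: the function stale_clinvar_files is hypothetical, and the default file names are an assumption about what download_data.sh stores in the data directory, so adjust them to match your local files.

```python
import time
from pathlib import Path

# Assumed file names; check the data directory for the actual names
CLINVAR_FILES = ('variant_summary.txt.gz', 'submission_summary.txt.gz')
SECONDS_PER_DAY = 86400


def stale_clinvar_files(data_dir, names=CLINVAR_FILES, max_age_days=7.0, now=None):
    """Return the listed files in data_dir whose modification time is older than max_age_days."""
    now = time.time() if now is None else now
    cutoff = now - max_age_days * SECONDS_PER_DAY
    stale = []
    for name in names:
        path = Path(data_dir) / name
        # Missing files are skipped: they need downloading, not deleting
        if path.is_file() and path.stat().st_mtime < cutoff:
            stale.append(path)
    return stale


if __name__ == '__main__':
    for path in stale_clinvar_files('data'):
        print(f'{path} is more than a week old; delete it and re-run download_data.sh')
```

The check uses file modification time only, so it reports when a copy was downloaded, not which ClinVar release it contains.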

Running the workflow

The ClinvArbitration workflow can be run containerised, or locally. By default, the reference data will be read from a directory called data, and the outputs written to a directory nextflow_outputs.

Local execution requires:

  • a Nextflow installation, to operate the workflow
  • a Python environment, with the ClinvArbitration package and its dependencies installed
    • install it with pip install . from the root of this repository
  • BCFtools, to annotate the ClinVar variants with gene information

nextflow -c nextflow/nextflow.config \
    run nextflow/clinvarbitration.nf

A containerised execution requires:

  • a Nextflow installation, to operate the workflow
  • a Docker installation, to run the workflow in a container

Step 1: build the Docker image:

docker build -t clinvarbitration:local .

Step 2: run the workflow using the Docker image:

nextflow -c nextflow/nextflow.config \
    run nextflow/clinvarbitration.nf \
    -with-docker clinvarbitration:local

CPG-Flow

Internally at CPG, this workflow is run using CPG-Flow, an in-house Hail Batch based workflow executor.

Once a Docker image has been built from the Dockerfile within this repository, the workflow can be triggered like so:

analysis-runner \
    --skip-repo-checkout \
    --image australia-southeast1-docker.pkg.dev/cpg-common/images-dev/clinvarbitration:PR_24 \
    --config new_clinvarbitration.toml \
    --dataset seqr \
    --description 'resummarise_clinvar' \
    -o resummarise_clinvar \
    --access-level test \
    run_workflow

A config file is required containing a few entries, some relating to this workflow specifically, some relating to cpg-flow setup:

  • workflow.driver_image: populated by analysis-runner, points to this docker image
  • site_blacklist: list of ClinVar submitters to ignore. Useful in removing noise, or blinding to self submissions
  • ref_fasta: required to run bcftools csq. Must match the genome_build
  • genome_build: used to decide whether ClinVar/Annotation is sourced using GRCh37 or GRCh38 (default)
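The constraints listed above can be checked before submitting a run. The following is a minimal sketch under stated assumptions: check_config is a hypothetical helper, not part of cpg-flow or ClinvArbitration; it takes the workflow-specific entries as a flat dictionary because their position within the TOML file is not specified here; and the exact spellings 'GRCh37' and 'GRCh38' are assumed to be the accepted values.

```python
# Assumed spellings of the accepted genome builds
VALID_BUILDS = {'GRCh37', 'GRCh38'}


def check_config(cfg: dict) -> list[str]:
    """Return a list of human-readable problems found in the workflow entries."""
    problems = []

    # genome_build falls back to GRCh38 when absent
    build = cfg.get('genome_build', 'GRCh38')
    if build not in VALID_BUILDS:
        problems.append(f'genome_build must be one of {sorted(VALID_BUILDS)}, got {build!r}')

    # ref_fasta is needed by bcftools csq and must match the genome build
    if not cfg.get('ref_fasta'):
        problems.append('ref_fasta is required')

    # site_blacklist is optional, but must be a list of submitter names
    blacklist = cfg.get('site_blacklist', [])
    if not isinstance(blacklist, list) or not all(isinstance(s, str) for s in blacklist):
        problems.append('site_blacklist must be a list of strings')

    return problems


# Placeholder values for illustration only
example = {'ref_fasta': 'path/to/reference.fa', 'site_blacklist': ['Example Submitter']}
print(check_config(example))
```

The check cannot confirm that ref_fasta actually matches genome_build; that still needs verifying against the reference file itself.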

Acknowledgements

  • ClinVar, for providing the data which this process is based on
