Package to examine de novo clustering

Denovonear

This code assesses whether de novo single-nucleotide variants (SNVs) are closer together within the coding sequence of a gene than expected by chance, or whether the amino acids affected by those SNVs are closer together in the protein structure than expected by chance. We use local-sequence-based mutation rates to account for differential mutability of regions. The default rates are per-trinucleotide (see Nature Genetics 46:944–950), but you can use your own rates, or even longer sequence contexts, such as 5-mers or 7-mers.
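The idea of "closer together than expected by chance" can be illustrated with a small simulation. This is only a sketch, not the package's implementation: the test statistic (geometric mean of pairwise distances), the function names, and the uniform per-site rates are all assumptions made for illustration.

```python
import math
import random
from itertools import combinations

def geomean_distance(positions):
    """Geometric mean of pairwise distances (+1 so identical sites avoid log(0))."""
    dists = [abs(a - b) for a, b in combinations(positions, 2)]
    return math.exp(sum(math.log(d + 1) for d in dists) / len(dists))

def simulate_p_value(observed, site_weights, iterations=10000, seed=1):
    """Proportion of rate-weighted random draws at least as clustered as observed."""
    rng = random.Random(seed)
    sites = list(site_weights)
    weights = [site_weights[s] for s in sites]
    observed_stat = geomean_distance(observed)
    hits = 0
    for _ in range(iterations):
        # sample as many sites as were observed, weighted by mutation rate
        sampled = rng.choices(sites, weights=weights, k=len(observed))
        if geomean_distance(sampled) <= observed_stat:
            hits += 1
    # add one to numerator and denominator so the p-value is never zero
    return (hits + 1) / (iterations + 1)

# hypothetical gene: 900 coding positions, each with the same rate
weights = {pos: 1e-8 for pos in range(1, 901)}
p_clustered = simulate_p_value([100, 102, 105], weights)
p_spread = simulate_p_value([50, 450, 850], weights)
```

With real data, the per-site weights would come from the sequence-context mutation rates rather than a uniform value, so highly mutable regions are sampled more often.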

Install

pip install denovonear

Usage

Analyse de novo mutation clustering in the coding sequence with the CLI tool:

denovonear cluster \
   --in data/example.grch38.dnms.txt \
   --gencode data/example.grch38.gtf \
   --fasta data/example.grch38.fa \
   --out output.txt

Or test clustering within protein structures:

denovonear cluster-structure \
   --in data/example.grch38.dnms.txt \
   --structures PATH_TO_STRUCTURES.tar \
   --gencode data/example.grch38.gtf \
   --fasta data/example.grch38.fa \
   --out output.structure.txt

Explanation of options:

  • --in: path to tab-separated table of de novo mutations. See example table below for columns, or example.grch38.dnms.txt in data folder.
  • --gencode: path to GENCODE annotations in GTF format for transcripts and exons e.g. example release. Can be gzipped, or uncompressed.
  • --fasta: path to genome fasta, matching genome build of gencode file
  • --structures: path to tar file containing PDB structures for all protein-coding genes. This has only been tested with AlphaFold human proteome tar files (e.g. https://ftp.ebi.ac.uk/pub/databases/alphafold/latest/UP000005640_9606_HUMAN_v4.tar). The code identifies the appropriate structure PDB by first searching for UniProt IDs via Ensembl (starting with the transcript ID). This only permits variants in the canonical transcript, and returns nan if a) no structure PDB is found, b) multiple structures exist for the UniProt IDs, c) multiple chains are in the structure file, d) the structure has missing residues or e) the number of residues in the structure file does not match what is expected from the CDS length. The carbon atom coordinates for each residue are used to place the amino acid position, and Euclidean distances are computed between these coordinates.
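As a rough illustration of the distance step described for --structures, the sketch below computes Euclidean distances between residue coordinates. The coordinates and the function name are hypothetical; reading coordinates from a PDB file is not shown.

```python
import math
from itertools import combinations

def pairwise_distances(coords):
    """Euclidean distance between every pair of residues.

    coords maps residue number -> (x, y, z) coordinates in angstroms.
    """
    return {
        (a, b): math.dist(coords[a], coords[b])
        for a, b in combinations(sorted(coords), 2)
    }

# hypothetical coordinates for three mutated residues
coords = {10: (0.0, 0.0, 0.0), 11: (3.0, 4.0, 0.0), 250: (3.0, 4.0, 12.0)}
distances = pairwise_distances(coords)
```

Distances in the structure do not have to track distances in the sequence, which is why the structure-based test can give a different answer from the coding-sequence test.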

If the --gencode or --fasta options are skipped (e.g. denovonear cluster --in INFILE --out OUTFILE), gene annotations will be retrieved via an Ensembl web service. In that case, you might need to specify --genome-build grch38 to ensure the gene coordinates match your de novo mutation coordinates.

  • --rates PATHS_TO_RATES_FILES
  • --rates-format context OR genome
  • --cache-folder PATH_TO_CACHE_DIR
  • --genome-build "grch37" or "grch38" (default=grch37)

The rates option operates in two ways. The first (which requires --rates-format to be "context") is to pass in one path to a tab-separated file with three columns: 'from', 'to', and 'mu_snp'. The 'from' column contains a DNA sequence of odd length, with the base to change at the central nucleotide. The 'to' column contains the same sequence with the central base modified. The 'mu_snp' column contains the probability of the change (per site per generation).
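Before passing a custom context rates file, it can help to check that it follows the layout described above. The following is a minimal sketch; the function name and the example rate values are hypothetical, and the package may apply its own checks.

```python
import csv
import io

def read_context_rates(handle):
    """Parse a 'from', 'to', 'mu_snp' table into {(from, to): rate}."""
    rates = {}
    for row in csv.DictReader(handle, delimiter='\t'):
        ref, alt, rate = row['from'], row['to'], float(row['mu_snp'])
        if len(ref) % 2 == 0 or len(ref) != len(alt):
            raise ValueError(f'bad context length: {ref} > {alt}')
        mid = len(ref) // 2
        # only the central base may differ between 'from' and 'to'
        flanks_match = ref[:mid] == alt[:mid] and ref[mid + 1:] == alt[mid + 1:]
        if ref[mid] == alt[mid] or not flanks_match:
            raise ValueError(f'only the central base should change: {ref} > {alt}')
        rates[(ref, alt)] = rate
    return rates

# hypothetical rates, for illustration only
example = 'from\tto\tmu_snp\nAAA\tACA\t1.5e-09\nACG\tATG\t1.0e-07\n'
rates = read_context_rates(io.StringIO(example))
```

The same check works for 5-mer or 7-mer contexts, since it only relies on the sequence length being odd.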

The second way to use the rates option is to pass in multiple paths to VCFs containing mutation rates for every genome position. This requires the --rates-format to be "genome". Currently the only supported rates files are ones from Roulette (https://www.biorxiv.org/content/10.1101/2022.08.20.504670v1), which can be found here: http://genetics.bwh.harvard.edu/downloads/Vova/Roulette/. This needs both the VCFs and their index files.

The cache folder defaults to making a folder named "cache" within the working directory. The genome build indicates which genome build the coordinates of the de novo variants are based on, and defaults to GRCh37.

Example de novo table

gene_name   chr    pos     consequence        snp_or_indel
OR4F5       chr1   69500   missense_variant   DENOVO-SNP
OR4F5       chr1   69450   missense_variant   DENOVO-SNP
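If you want to go from a table like this to the per-gene dictionaries used in the Python usage section, something along these lines may work. The consequence-to-category mapping and the function name are assumptions for illustration, not part of the package.

```python
import csv
import io
from collections import defaultdict

# assumption: map consequence terms onto the two categories used in the
# de_novos dict of the Python usage example; extend as needed
CATEGORY = {'missense_variant': 'missense', 'stop_gained': 'nonsense'}

def group_de_novos(handle):
    """Return {gene: {'missense': [positions], 'nonsense': [positions]}}."""
    genes = defaultdict(lambda: {'missense': [], 'nonsense': []})
    for row in csv.DictReader(handle, delimiter='\t'):
        category = CATEGORY.get(row['consequence'])
        if category is None or row['snp_or_indel'] != 'DENOVO-SNP':
            continue  # skip indels and unmapped consequences
        genes[row['gene_name']][category].append(int(row['pos']))
    return dict(genes)

table = (
    'gene_name\tchr\tpos\tconsequence\tsnp_or_indel\n'
    'OR4F5\tchr1\t69500\tmissense_variant\tDENOVO-SNP\n'
    'OR4F5\tchr1\t69450\tmissense_variant\tDENOVO-SNP\n'
)
de_novos = group_de_novos(io.StringIO(table))
```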

Python usage

from denovonear.gencode import Gencode
from denovonear.cluster_test import cluster_de_novos
from denovonear.load_mutation_rates import load_mutation_rates

gencode = Gencode('./data/example.grch38.gtf', './data/example.grch38.fa')
symbol = 'OR4F5'
de_novos = {'missense': [69500, 69450, 69400], 'nonsense': []}
rates = load_mutation_rates()
p_values = cluster_de_novos(de_novos, gencode[symbol], rates, iterations=1000000)

Pull out site-specific rates by creating Transcript objects, then get the rates by consequence at each site:

from denovonear.rate_limiter import RateLimiter
from denovonear.load_mutation_rates import load_mutation_rates
from denovonear.load_gene import construct_gene_object
from denovonear.site_specific_rates import SiteRates

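import asyncio

async def fetch_transcript(transcript_id):
    # extract transcript coordinates and sequence from Ensembl
    async with RateLimiter(per_second=15) as ensembl:
        return await construct_gene_object(ensembl, transcript_id)

# 'async with' only works inside a coroutine, so run it via asyncio
transcript = asyncio.run(fetch_transcript('ENST00000346085'))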

mut_rates = load_mutation_rates()
rates = SiteRates(transcript, mut_rates)

# rates are stored by consequence, but you can iterate through to find all
# possible sites in and around the CDS:
for cq in ['missense', 'nonsense', 'splice_lof', 'synonymous']:
    for site in rates[cq]:
        site['pos'] = transcript.get_position_on_chrom(site['pos'], site['offset'])

# or if you just want the summed rate
rates['missense'].get_summed_rate()

Identify transcripts containing de novo events

You can identify transcripts containing de novo events with the identify_transcripts.py script. This either identifies all transcripts for a gene with one or more de novo events, or identifies the minimal set of transcripts that contains all de novos (where transcripts are prioritised by number of de novo events, then by length of coding sequence). Transcripts can be identified with:

    denovonear transcripts \
        --de-novos data/example_de_novos.txt \
        --out output.txt \
        --all-transcripts

Other options are:

  • --minimise-transcripts in place of --all-transcripts, to find the minimal set of transcripts
  • --genome-build "grch37" or "grch38" (default=grch37)
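The prioritisation used for the minimal set (number of de novo events, then coding sequence length) can be sketched as a greedy selection. This is an illustration under assumptions, not the package's code: the function name, data layout, and transcript IDs are hypothetical, and a greedy pass is not guaranteed to find the smallest possible set in every case.

```python
def minimal_transcripts(transcripts, de_novos):
    """Greedily pick transcripts until every coverable de novo is contained.

    transcripts maps transcript ID -> (set of CDS positions, CDS length).
    """
    remaining = set(de_novos)
    chosen = []
    while remaining:
        # prefer the transcript with most remaining de novos, then longest CDS
        best = max(
            transcripts,
            key=lambda tx: (len(transcripts[tx][0] & remaining), transcripts[tx][1]),
        )
        covered = transcripts[best][0] & remaining
        if not covered:
            break  # leftover de novos fall outside every transcript
        chosen.append(best)
        remaining -= covered
    return chosen

# hypothetical transcripts and de novo positions
transcripts = {
    'tx_a': ({100, 150, 200}, 900),
    'tx_b': ({100, 150}, 1200),
    'tx_c': ({400}, 300),
}
selected = minimal_transcripts(transcripts, [100, 150, 200, 400])
```

Here tx_a is picked first because it contains three de novos, and tx_c is then needed for the remaining site; tx_b adds nothing new and is left out.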

Gene or transcript based mutation rates

You can generate mutation rates for either the union of alternative transcripts for a gene, or for a specific Ensembl transcript ID, with the construct_mutation_rates.py script. LoF and missense mutation rates can be generated with:

denovonear rates \
    --genes data/example_gene_ids.txt \
    --out output.txt

The tab-separated output file will contain one row per gene/transcript, with each line containing a transcript ID or gene symbol, a log10-transformed missense mutation rate, a log10-transformed nonsense mutation rate, and a log10-transformed synonymous mutation rate.
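Because the rates are log10-transformed, they need converting back before they can be used as probabilities. A minimal sketch, assuming the column order described above (ID, missense, nonsense, synonymous); the function name, gene name, and values are hypothetical, and any header line is simply skipped.

```python
import csv
import io

def read_rates(handle):
    """Convert log10-transformed rates back to probabilities."""
    rates = {}
    for row in csv.reader(handle, delimiter='\t'):
        if not row:
            continue
        name, *log_rates = row
        try:
            missense, nonsense, synonymous = (10 ** float(x) for x in log_rates[:3])
        except ValueError:
            continue  # header line or malformed row
        rates[name] = {
            'missense': missense,
            'nonsense': nonsense,
            'synonymous': synonymous,
        }
    return rates

# hypothetical output rows, for illustration only
output = 'GENE1\t-5.0\t-6.0\t-5.3\n'
rates = read_rates(io.StringIO(output))
```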
