Package to examine de novo clustering

Denovonear

This code assesses whether de novo single-nucleotide variants (SNVs) are closer together within the coding sequence of a gene than expected by chance, or whether the amino acids affected by those SNVs are closer together in the protein structure than expected by chance. We use local-sequence-based mutation rates to account for differential mutability of regions. The default rates are per-trinucleotide (see Nature Genetics 46:944–950), but you can use your own rates, or longer sequence contexts such as 5-mers or 7-mers.
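
As a rough illustration of the idea (not the package's own implementation), the sketch below runs a simulation test on positions within a coding sequence. It assumes the clustering statistic is the geometric mean of pairwise distances between variants, and that sites are sampled in proportion to a per-site mutation rate. The function names, the site_rates dictionary and the toy rates are all invented for this example.

```python
import math
import random
from itertools import combinations

def geomean_distance(positions):
    """Geometric mean of pairwise distances (plus one, so zero distances work)."""
    logs = [math.log(abs(a - b) + 1) for a, b in combinations(positions, 2)]
    return math.exp(sum(logs) / len(logs))

def simulate_p_value(observed, site_rates, iterations=10000, seed=1):
    """Proportion of simulations at least as clustered as the observed sites."""
    rng = random.Random(seed)
    sites = list(site_rates)
    weights = [site_rates[site] for site in sites]
    observed_stat = geomean_distance(observed)
    hits = 0
    for _ in range(iterations):
        # sample the same number of sites, weighted by mutability
        sampled = rng.choices(sites, weights=weights, k=len(observed))
        if geomean_distance(sampled) <= observed_stat:
            hits += 1
    # add one to numerator and denominator so the estimate is never zero
    return (hits + 1) / (iterations + 1)

# toy rates (invented): every tenth site is five times more mutable
site_rates = {pos: (5.0 if pos % 10 == 0 else 1.0) for pos in range(1, 301)}
p = simulate_p_value([100, 103, 105], site_rates, iterations=2000)
print(p)
```

Weighting the sampling by site rate is what stops highly mutable regions from looking like clusters purely because they mutate more often.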

Install

pip install denovonear

Usage

Analyse de novo mutation clustering in the coding sequence with the CLI tool:

denovonear cluster \
   --in data/example.grch38.dnms.txt \
   --gencode data/example.grch38.gtf \
   --fasta data/example.grch38.fa \
   --out output.txt

Or test clustering within protein structures:

denovonear cluster-structure \
   --in data/example.grch38.dnms.txt \
   --structures PATH_TO_STRUCTURES.tar \
   --gencode data/example.grch38.gtf \
   --fasta data/example.grch38.fa \
   --out output.structure.txt

Explanation of options:

  • --in: path to tab-separated table of de novo mutations. See example table below for columns, or example.grch38.dnms.txt in data folder.
  • --gencode: path to GENCODE annotations in GTF format for transcripts and exons e.g. example release. Can be gzipped, or uncompressed.
  • --fasta: path to genome fasta, matching genome build of gencode file
  • --structures: path to tar file containing PDB structures for all protein-coding genes. This has only been tested with AlphaFold human proteome tar files (e.g. https://ftp.ebi.ac.uk/pub/databases/alphafold/latest/UP000005640_9606_HUMAN_v4.tar). The code identifies the appropriate structure PDB by first searching for UniProt IDs via Ensembl (starting with the transcript ID). This only permits variants in the canonical transcript, and returns nan if a) no structure PDB is found, b) multiple structures exist for the UniProt IDs, c) multiple chains are in the structure file, d) the structure has missing residues, or e) the number of residues in the structure file does not match what is expected from the CDS length. The carbon atom coordinates for each residue are used to place the amino acid position, and Euclidean distances are computed between these coordinates.
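
The distance calculation described for --structures can be sketched as below. This is a minimal illustration rather than the package's code: it assumes the "carbon atom" is the alpha carbon (atom name CA) and reads the standard fixed-width PDB ATOM records. The function names and the two toy records are hypothetical.

```python
import math

def alpha_carbon_coords(pdb_text):
    """Map residue number -> (x, y, z) of its alpha carbon from PDB ATOM records."""
    coords = {}
    for line in pdb_text.splitlines():
        # PDB is fixed-width: atom name in columns 13-16, residue number in
        # columns 23-26, and x/y/z in columns 31-38, 39-46 and 47-54.
        if line.startswith("ATOM") and line[12:16].strip() == "CA":
            coords[int(line[22:26])] = (
                float(line[30:38]), float(line[38:46]), float(line[46:54]))
    return coords

def residue_distance(coords, first, second):
    """Euclidean distance between two residues, or nan if either is missing."""
    if first not in coords or second not in coords:
        return math.nan
    return math.dist(coords[first], coords[second])

# two toy ATOM records (invented), built with the fixed-width layout
template = "ATOM  {:5d} {:<4s} {:3s} {}{:4d}    {:8.3f}{:8.3f}{:8.3f}"
toy = "\n".join([
    template.format(1, " CA", "MET", "A", 1, 0.0, 0.0, 0.0),
    template.format(2, " CA", "ALA", "A", 2, 3.0, 4.0, 0.0),
])
coords = alpha_carbon_coords(toy)
print(residue_distance(coords, 1, 2))  # 5.0
```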

If the --gencode or --fasta options are skipped (e.g. denovonear cluster --in INFILE --out OUTFILE), gene annotations will be retrieved via an Ensembl web service. In that case, you might need to specify --genome-build grch38 to ensure the gene coordinates match your de novo mutation coordinates.

Other options are:

  • --rates PATHS_TO_RATES_FILES
  • --rates-format context OR genome
  • --cache-folder PATH_TO_CACHE_DIR
  • --genome-build "grch37" or "grch38" (default=grch37)

The rates option operates in two ways. The first (which requires --rates-format to be "context") is to pass in one path to a tab-separated file with three columns: 'from', 'to', and 'mu_snp'. The 'from' column contains a DNA sequence of odd length, with the base to change at the central nucleotide. The 'to' column contains the same sequence with the central base modified. The 'mu_snp' column contains the probability of the change (per site per generation).
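
To make the "context" format concrete, here is a small sketch that reads such a file and checks the constraints described above. It is not the package's loader; load_context_rates is a hypothetical helper and the rate values are invented.

```python
import csv
import io

def load_context_rates(handle):
    """Read 'from', 'to' and 'mu_snp' columns into {(from, to): rate}."""
    rates = {}
    for row in csv.DictReader(handle, delimiter="\t"):
        ref, alt, rate = row["from"], row["to"], float(row["mu_snp"])
        mid = len(ref) // 2
        if len(ref) % 2 == 0 or len(ref) != len(alt):
            raise ValueError(f"contexts must have equal, odd lengths: {ref} {alt}")
        flanks_match = ref[:mid] + ref[mid + 1:] == alt[:mid] + alt[mid + 1:]
        if ref[mid] == alt[mid] or not flanks_match:
            raise ValueError(f"only the central base may change: {ref} {alt}")
        rates[(ref, alt)] = rate
    return rates

# toy trinucleotide rates (invented values)
example = io.StringIO(
    "from\tto\tmu_snp\n"
    "AAA\tACA\t1.5e-09\n"
    "AAA\tAGA\t3.0e-09\n"
)
rates = load_context_rates(example)
print(rates)
```

The same check works unchanged for 5-mer or 7-mer contexts, since it only relies on the sequence length being odd.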

The second way to use the rates option is to pass in multiple paths to VCFs containing mutation rates for every genome position. This requires the --rates-format to be "genome". Currently the only supported rates files are ones from Roulette (https://www.biorxiv.org/content/10.1101/2022.08.20.504670v1), which can be found here: http://genetics.bwh.harvard.edu/downloads/Vova/Roulette/. This needs both the VCFs and their index files.

The cache folder defaults to making a folder named "cache" within the working directory. The genome build indicates which genome build the coordinates of the de novo variants are based on, and defaults to GRCh37.

Example de novo table

gene_name chr pos consequence snp_or_indel
OR4F5 chr1 69500 missense_variant DENOVO-SNP
OR4F5 chr1 69450 missense_variant DENOVO-SNP
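
If you want to go from a table like this to per-gene position lists, something along these lines may help. It is only a sketch: group_de_novos is a hypothetical helper, and the mapping from consequence terms to 'missense' and 'nonsense' groups is an assumption rather than the package's own classification.

```python
import csv
import io
from collections import defaultdict

# assumed mapping from consequence terms to the two groups (hypothetical)
GROUPS = {"missense_variant": "missense", "stop_gained": "nonsense"}

def group_de_novos(handle):
    """Return {gene: {'missense': [positions], 'nonsense': [positions]}} for SNVs."""
    genes = defaultdict(lambda: {"missense": [], "nonsense": []})
    for row in csv.DictReader(handle, delimiter="\t"):
        group = GROUPS.get(row["consequence"])
        # keep only SNVs with a consequence we know how to group
        if row["snp_or_indel"] == "DENOVO-SNP" and group is not None:
            genes[row["gene_name"]][group].append(int(row["pos"]))
    return dict(genes)

table = io.StringIO(
    "gene_name\tchr\tpos\tconsequence\tsnp_or_indel\n"
    "OR4F5\tchr1\t69500\tmissense_variant\tDENOVO-SNP\n"
    "OR4F5\tchr1\t69450\tmissense_variant\tDENOVO-SNP\n"
)
de_novos = group_de_novos(table)
print(de_novos)
```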

Python usage

from denovonear.gencode import Gencode
from denovonear.cluster_test import cluster_de_novos
from denovonear.load_mutation_rates import load_mutation_rates

# load transcript annotations and the matching genome sequence
gencode = Gencode('./data/example.grch38.gtf', './data/example.grch38.fa')
symbol = 'OR4F5'
# chromosome positions of the de novo SNVs, grouped by consequence
de_novos = {'missense': [69500, 69450, 69400], 'nonsense': []}
# with no path given, the default rates are loaded
rates = load_mutation_rates()
p_values = cluster_de_novos(de_novos, gencode[symbol], rates, iterations=1000000)

Pull out site-specific rates by creating Transcript objects, then get the rates by consequence at each site:

import asyncio

from denovonear.rate_limiter import RateLimiter
from denovonear.load_mutation_rates import load_mutation_rates
from denovonear.load_gene import construct_gene_object
from denovonear.site_specific_rates import SiteRates

async def get_transcript(transcript_id):
    # extract transcript coordinates and sequence from Ensembl
    async with RateLimiter(per_second=15) as ensembl:
        return await construct_gene_object(ensembl, transcript_id)

# 'async with' is only valid inside a coroutine, so run it via asyncio
transcript = asyncio.run(get_transcript('ENST00000346085'))

mut_rates = load_mutation_rates()
rates = SiteRates(transcript, mut_rates)

# rates are stored by consequence, but you can iterate through to find all
# possible sites in and around the CDS:
for cq in ['missense', 'nonsense', 'splice_lof', 'synonymous']:
    for site in rates[cq]:
        # convert each site's position to a chromosome coordinate
        site['pos'] = transcript.get_position_on_chrom(site['pos'], site['offset'])

# or if you just want the summed rate
rates['missense'].get_summed_rate()

Identify transcripts containing de novo events

You can identify transcripts containing de novo events with the denovonear transcripts command. This either identifies all transcripts for a gene with one or more de novo events, or identifies the minimal set of transcripts that contains all de novos (where transcripts are prioritised on the basis of number of de novo events, and length of coding sequence). Transcripts can be identified with:

denovonear transcripts \
    --de-novos data/example_de_novos.txt \
    --out output.txt \
    --all-transcripts

Other options are:

  • --minimise-transcripts in place of --all-transcripts, to find the minimal set of transcripts
  • --genome-build "grch37" or "grch38" (default=grch37)
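
One way to picture the minimal-set option is as a greedy selection, sketched below. This is an assumption about the general approach, not the package's code: minimal_transcripts is a hypothetical helper, each transcript is reduced to a set of positions it contains, and ties on the number of de novos are broken in favour of the longer coding sequence (the direction of that tie-break is also assumed).

```python
def minimal_transcripts(transcripts, de_novos):
    """Greedily pick transcripts until every coverable de novo is contained.

    transcripts: {transcript_id: {'sites': set of positions, 'cds_length': int}}
    """
    remaining = set(de_novos)
    chosen = []
    while remaining and transcripts:
        # prioritise by number of uncovered de novos, then by CDS length
        best = max(
            transcripts,
            key=lambda tx: (len(transcripts[tx]["sites"] & remaining),
                            transcripts[tx]["cds_length"]),
        )
        covered = transcripts[best]["sites"] & remaining
        if not covered:
            break  # leftover de novos fall outside every transcript
        chosen.append(best)
        remaining -= covered
    return chosen

# toy transcripts (invented IDs, sites and lengths)
transcripts = {
    "tx_a": {"sites": {100, 200}, "cds_length": 900},
    "tx_b": {"sites": {100, 200}, "cds_length": 1200},
    "tx_c": {"sites": {300}, "cds_length": 600},
}
print(minimal_transcripts(transcripts, [100, 200, 300]))  # ['tx_b', 'tx_c']
```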

Gene or transcript based mutation rates

You can generate mutation rates for either the union of alternative transcripts for a gene, or for a specific Ensembl transcript ID, with the denovonear rates command. Loss-of-function (LoF) and missense mutation rates can be generated with:

denovonear rates \
    --genes data/example_gene_ids.txt \
    --out output.txt

The tab-separated output file will contain one row per gene/transcript, with each line containing a transcript ID or gene symbol, a log10 transformed missense mutation rate, a log10 transformed nonsense mutation rate, and a log10 transformed synonymous mutation rate.
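
Because the rates are log10 transformed, they need converting back before use as probabilities. The sketch below assumes the column order given above (identifier, missense, nonsense, synonymous) and skips any line whose rate fields are not numeric, in case the file has a header. The read_rates helper and the example values are invented.

```python
import io

def read_rates(handle):
    """Yield (identifier, missense, nonsense, synonymous) as untransformed rates."""
    for line in handle:
        fields = line.rstrip("\n").split("\t")
        try:
            logged = [float(value) for value in fields[1:4]]
        except ValueError:
            continue  # skip a header line, if the file has one
        if len(logged) != 3:
            continue  # skip blank or truncated lines
        # undo the log10 transform
        yield (fields[0], *[10 ** value for value in logged])

# toy output (invented values; the header row is an assumption)
output = io.StringIO(
    "transcript_id\tmissense_rate\tnonsense_rate\tsynonymous_rate\n"
    "OR4F5\t-5.0\t-6.0\t-5.5\n"
)
for row in read_rates(output):
    print(row)
```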
