Assemble transcript sequence fragments near variants

isovar

Abundance quantification of distinct transcript sequences containing somatic variants from cancer RNAseq

Example

$ isovar-protein-sequences.py  \
    --vcf somatic-variants.vcf  \
    --bam rnaseq.bam \
    --genome hg19 \
    --min-reads 2 \
    --protein-sequence-length 30 \
    --output isovar-results.csv

  chr       pos ref alt                      amino_acids  \
0  22  46931060   A   C   FGVEAVDHGWPSMSSGSSWRASRGPPPPPR
1  22  46931062   G   A  CFGVEAVDHGWPPMSLAHGGPAVVHRLHPEA

   variant_aa_interval_start  variant_aa_interval_end ends_with_stop_codon  \
0                         16                       17                False
1                         16                       17                False

  frameshift  translations_count  supporting_variant_reads_count  \
0      False                   1                               1
1      False                   1                               1

   total_variant_reads  supporting_transcripts_count  total_transcripts  \
0                  130                             2                  2
1                  127                             2                  2

     gene
0  CELSR1
1  CELSR1

Algorithm/Design

The one line explanation of isovar: ProteinSequence = VariantSequence + ReferenceContext.

A little more detail about the algorithm:

1. Scan through an RNAseq BAM file and extract sequences overlapping a variant locus (represented by ReadAtLocus).
2. Make sure that the read contains the variant allele and split its sequence into prefix/alt/suffix string parts (represented by VariantRead).
3. Combine multiple VariantRead records into a VariantSequence.
4. Gather possible reading frames for distinct reference sequences around the variant locus (represented by ReferenceContext).
5. Use the reading frame from a ReferenceContext to translate a VariantSequence into a protein fragment (represented by Translation).
6. Multiple distinct variant sequences and reference contexts can generate the same translations, so we aggregate those equivalent Translation objects into a ProteinSequence.
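Steps 2, 3 and 5 can be illustrated with a small standalone sketch. This is not isovar's own code: the `VariantRead` fields, the helper names (`split_read`, `combine_reads`, `translate`), and the trimming strategy used to combine reads are all assumptions made for illustration. It handles only simple substitutions, and the reading frame offset that would normally come from a ReferenceContext is supplied by hand.

```python
from collections import Counter, namedtuple

# Hypothetical stand-in for the VariantRead record named above; the fields
# are assumptions for this sketch, not isovar's actual class definition.
VariantRead = namedtuple("VariantRead", ["prefix", "alt", "suffix"])

# Standard genetic code, with codons enumerated in TCAG order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO_ACIDS[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}


def split_read(read_seq, variant_offset, alt):
    """Step 2: return prefix/alt/suffix if the read carries the alt allele
    at the given 0-based offset into the read, otherwise None."""
    end = variant_offset + len(alt)
    if read_seq[variant_offset:end] != alt:
        return None
    return VariantRead(read_seq[:variant_offset], alt, read_seq[end:])


def combine_reads(variant_reads):
    """Step 3 (simplified): trim every read to the shortest prefix and
    suffix, then count how many reads support each distinct sequence."""
    if not variant_reads:
        return Counter()
    n_prefix = min(len(r.prefix) for r in variant_reads)
    n_suffix = min(len(r.suffix) for r in variant_reads)
    return Counter(
        r.prefix[len(r.prefix) - n_prefix:] + r.alt + r.suffix[:n_suffix]
        for r in variant_reads
    )


def translate(seq, frame_offset):
    """Step 5: translate complete codons starting at frame_offset,
    stopping at the first stop codon."""
    residues = []
    for i in range(frame_offset, len(seq) - 2, 3):
        residue = CODON_TABLE[seq[i:i + 3]]
        if residue == "*":
            break
        residues.append(residue)
    return "".join(residues)


# Toy data: two reads carry the alt allele "C", the third carries the reference.
reads = [("ATGGCCCAATTT", 6), ("GGCCCAATT", 4), ("ATGGCCAAATTT", 6)]
candidates = (split_read(seq, offset, "C") for seq, offset in reads)
variant_reads = [vr for vr in candidates if vr is not None]
sequence_counts = combine_reads(variant_reads)
(sequence, n_reads), = sequence_counts.most_common(1)
protein_fragment = translate(sequence, frame_offset=1)
```

In this toy example both variant reads collapse to the same trimmed sequence, so a single sequence with two supporting reads is translated. Trimming to the shortest flanks is just one simple way to merge reads; a real implementation may assemble overlapping reads into longer sequences instead.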

Since we may not want to deal with every possible translation of every distinct sequence detected around a variant, isovar sorts the variant sequences by the number of supporting reads and the reference contexts in order of protein length. A configurable number of translated protein fragments can then be kept from this ordering.
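The ordering can be pictured with a short sketch. It is only an illustration under stated assumptions: the function name `top_candidates`, the tuple layouts, and the parameter `max_translations` are hypothetical, and the sketch assumes that more supporting reads and longer proteins are preferred (the text above does not state the sort direction for protein length).

```python
from itertools import islice, product


def top_candidates(variant_sequences, reference_contexts, max_translations):
    """Pair variant sequences with reference contexts in priority order
    and keep only the first max_translations pairs.

    variant_sequences: list of (sequence, supporting_read_count) tuples
    reference_contexts: list of (context_name, protein_length) tuples
    Both layouts are assumptions made for this sketch.
    """
    # Most-supported variant sequences first.
    by_reads = sorted(variant_sequences, key=lambda vs: vs[1], reverse=True)
    # Assumed direction: contexts from longer proteins first.
    by_length = sorted(reference_contexts, key=lambda rc: rc[1], reverse=True)
    # product() varies the second argument fastest, so every context for the
    # best-supported sequence is visited before the next sequence.
    return list(islice(product(by_reads, by_length), max_translations))


selected = top_candidates(
    variant_sequences=[("AAA", 2), ("CCC", 5)],
    reference_contexts=[("ctx_short", 100), ("ctx_long", 300)],
    max_translations=3,
)
```

Here `selected` holds three of the four possible pairs; the pairing of the least-supported sequence with the shortest context is the one dropped.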
