Determine mutant protein sequences from RN using assembly around variants
Project description
Isovar
- Overview
- Python API
- Commandline
- Internal Design
- Other Isovar Commandline Tools
- Sequencing Recommendations
Overview
Isovar determines mutant protein subsequences around mutations from cancer RNAseq data.
Isovar works by:
-
collecting RNA reads which spanning the location of a variant,
-
filtering the RNA reads to those which support the mutation,
-
assembling mutant reads into longer coding sequences,
-
matching mutant coding sequences against reference annotated reading frames, and
-
translating coding sequences determined directly from RNA into mutant protein sequences.
The assembled coding sequences may incorporate proximal (germline and somatic) variants, along with any splicing alterations which occur due to modified splice signals.
Python API
In the example below, isovar.run_isovar
returns a list of isovar.IsovarResult
objects.
Each of these objects corresponds to a single input variant and contains all of the information about the RNA evidence at that variant's location and any mutant protein sequences which were assembled for the variant.
from isovar import run_isovar
isovar_results = run_isovar(
variants="cancer-mutations.vcf",
alignment_file="tumor-rna.bam")
# this code traverses every variant and prints the number
# of RNA reads which support the alt allele for variants
# which had a successfully assembled/translated protein sequence
for isovar_result in isovar_results:
# if any protein sequences were assembled from RNA
# then the one with most supporting reads can be
# accessed from a property called `top_protein_sequence`.
if isovar_result.top_protein_sequence is not None:
# print number of distinct fragments supporting the
# the variant allele for this mutation
print(isovar_result.variant, isovar_result.num_alt_fragments)
A collection of IsovarResult
objects can also be flattened into a Pandas DataFrame:
from isovar import run_isovar, isovar_results_to_dataframe
df = isovar_results_to_dataframe(
run_isovar(
variants="cancer-mutations.vcf",
alignment_file="tumor-rna.bam"))
Python API options for collecting RNA reads
To change how Isovar collects and filters RNA reads you can create
your own instance of the isovar.ReadCollector
class and pass it to run_isovar
.
from isovar import run_isovar, ReadCollector
# create a custom ReadCollector to change options for how RNA reads are processed
read_collector = ReadCollector(
use_duplicate_reads=True,
use_secondary_alignments=True,
use_soft_clipped_bases=True)
isovar_results = run_isovar(
variants="cancer-mutations.vcf",
alignment_file="tumor-rna.bam",
read_collector=read_collector)
Python API options for coding sequence assembly and translation
To change how Isovar assembles RNA reads into coding sequences, determines their
reading frames, and groups translated amino acid sequences you can create your
own instance of the isovar.ProteinSequenceCreator
class and pass it to run_isovar
.
from isovar import run_isovar, ProteinSequenceCreator
# create a custom ProteinSequenceCreator to change options for how
# protein sequences are assembled from RNA reads
protein_sequence_creator = ProteinSequenceCreator(
# number of amino acids we're aiming for, coding sequences
# might still give us a shorter sequence due to an early stop
# codon or poor coverage
protein_sequence_length=30,
# minimum number of reads covering each base of the coding sequence
min_variant_sequence_coverage=2,
# how much of a reference transcript should a coding sequence match before
# we use it to establish a reading frame
min_transcript_prefix_length=20,
# how many mismatches allowed between coding sequence (before the variant)
# and transcript (before the variant location)
max_transcript_mismatches=2,
# also count mismatches after the variant location toward
# max_transcript_mismatches
count_mismatches_after_variant=False,
# if more than one protein sequence can be assembled for a variant
# then drop any beyond this number
max_protein_sequences_per_variant=1,
# if set to False then coding sequence will be derived from
# a single RNA read with the variant closest to its center
variant_sequence_assembly=True,
# how many nucleotides must two reads overlap before they are combined
# into a single coding sequence
min_assembly_overlap_size=30)
isovar_results = run_isovar(
variants="cancer-mutations.vcf",
alignment_file="tumor-rna.bam",
protein_sequence_creator=protein_sequence_creator)
Python API for filtering results
You can filter a collection of IsovarResult
objects by any of their numerical properties using the filter_thresholds
option
of the run_isovar
function. The value expected for this argument is a dictionary whose keys have named like 'min_fraction_ref_reads'
or 'max_num_alt_fragments'
and whose values are numerical thresholds.
Everything after the 'min_'
or 'max_'
at the start of a key is expected to be the name of a property of IsovarResult
.
Many of the commonly accessed properties regarding RNA read evidence follow the pattern:
{num|fraction}_{ref|alt|other}_{reads|fragments}
For example, in the following code the results are filtered to have 10 or more alt reads supporting a variant and no more than 25% of the fragments supporting an allele other than the ref or alt.
from isovar import run_isovar
isovar_results = run_isovar(
variants="cancer-mutations.vcf",
alignment_file="tumor-rna.bam",
filter_thresholds={"min_num_alt_reads": 10, "max_fraction_other_fragments": 0.25})
for isovar_result in isovar_results:
# print each variant and whether it passed both filters
print(isovar_result.variant, isovar_result.passes_all_filters)
A variant which fails one or more filters is not excluded from the result collection but it has False
values in its corresponding
filter_values
dictionary property and will have a False
value for the passes_all_filters
property.
If a result collection is flattened into a DataFrame then each filter is included as a column.
It's also possible to filter on boolean properties (without numerical thresholds) by passing filter_flags
to run_isovar
. These boolean
properties can be further negated by prepending 'not_' to the property name, so that both 'protein_sequence_matches_predicted_effect'
and 'not_protein_sequence_matches_predicted_effect'
are valid names for filter_flags
.
Commandline
Basic example:
$ isovar \
--vcf somatic-variants.vcf \
--bam rnaseq.bam \
--protein-sequence-length 30 \
--output isovar-results.csv
Commandline options for loading variants
--vcf VCF Genomic variants in VCF format
--maf MAF Genomic variants in TCGA's MAF format
--variant CHR POS REF ALT
Individual variant as 4 arguments giving chromsome,
position, ref, and alt. Example: chr1 3848 C G. Use
'.' to indicate empty alleles for insertions or
deletions.
--genome GENOME What reference assembly your variant coordinates are
using. Examples: 'hg19', 'GRCh38', or 'mm9'. This
argument is ignored for MAF files, since each row
includes the reference. For VCF files, this is used if
specified, and otherwise is guessed from the header.
For variants specfied on the commandline with
--variant, this option is required.
--download-reference-genome-data
Automatically download genome reference data required
for annotation using PyEnsembl. Otherwise you must
first run 'pyensembl install' for the release/species
corresponding to the genome used in your VCF.
--json-variants JSON_VARIANTS
Path to Varcode.VariantCollection object serialized as
a JSON file.
Commandline options for loading aligned tumor RNA-seq reads
--bam BAM BAM file containing RNAseq reads
--min-mapping-quality MIN_MAPPING_QUALITY
Minimum MAPQ value to allow for a read (default 1)
--use-duplicate-reads
By default, reads which have been marked as duplicates
are excluded.Use this option to include duplicate
reads.
--drop-secondary-alignments
By default, secondary alignments are included in
reads, use this option to instead only use primary
alignments.
Commandline options for coding sequence assembly
--min-variant-sequence-coverage MIN_VARIANT_SEQUENCE_COVERAGE
Minimum number of reads supporting a variant sequence
(default 2)
--disable-variant-sequence-assembly
Disable assemble variant cDNA sequence from
overlapping reads
Commandline options for translating cDNA to protein sequence
--protein-sequence-length PROTEIN_SEQUENCE_LENGTH
--max-reference-transcript-mismatches MAX_REFERENCE_TRANSCRIPT_MISMATCHES
Maximum number of mismatches between variant sequence
reference sequence before a candidate reading frame is
ignored.
--count-mismatches-after-variant
If true, mismatches after the variant locus will count
toward the --max-reference-transcript-mismatches
filter.
--min-transcript-prefix-length MIN_TRANSCRIPT_PREFIX_LENGTH
Number of nucleotides before the variant we try to
match against a reference transcript. Values greater
than zero exclude variants near the start codon of
transcripts without 5' UTRs.
--max-protein-sequences-per-variant MAX_PROTEIN_SEQUENCES_PER_VARIANT
Commandline options for filtering
--min-alt-rna-reads MIN_ALT_RNA_READS
Minimum number of reads supporting variant allele
(default 3)
--min-alt-rna-fragments MIN_ALT_RNA_FRAGMENTS
Minimum number of fragments supporting variant allele
(default 2). Note that this option is the same as
--min-alt-rna-reads for single-end sequencing.
--min-alt-rna-fraction MIN_ALT_RNA_FRACTION
Minimum ratio of fragments supporting variant allele
to total RNA fragments (default 0.005).
--min-ratio-alt-to-other-fragments MIN_RATIO_ALT_TO_OTHER_FRAGMENTS
At loci where alleles other than the ref and a single
alt are supported, this parameter controls how many
more times fragments supporting the variant allele are
required relative to other non-reference alleles
(default 3.0).
Commandline options for writing an output CSV
--output OUTPUT Output CSV file
--output-columns OUTPUT_COLUMNS [OUTPUT_COLUMNS ...]
Subset of columns to write
Internal Design
The inputs to Isovar are one or more somatic variant call (VCF) files, along with a BAM file containing aligned tumor RNA reads. The following objects are used to aggregate information within Isovar:
-
LocusRead: Isovar examines each variant locus and extracts reads overlapping that locus, represented by
LocusRead
. TheLocusRead
representation allows filtering based on quality and alignment criteria (e.g. MAPQ > 0) which are thrown away in later stages of Isovar. -
AlleleRead: Once
LocusRead
objects have been filtered, they are converted into a simplified representation calledAlleleRead
. EachAlleleRead
contains only the cDNA sequences before, at, and after the variant locus. -
ReadEvidence: The set of
AlleleRead
objects overlapping a mutation's location may support many different distinct allele. TheReadEvidence
type represents the grouping of these reads into ref, alt and otherAlleleRead
sets, where ref reads agree with the reference sequence, alt reads agree with the given mutation, and other reads contain all non-ref/non-alt alleles. The alt reads will be used later to determine a mutant coding sequence, but the ref and other groups are also kept in case they are useful for filtering. -
VariantSequence: Overlapping
AlleleRead
s containing the same mutation are assembled into a longer sequence. TheVariantSequence
object represents this candidate coding sequence, as well as all theAlleleRead
objects which were used to create it. -
ReferenceContext: To determine the reading frame in which to translate a
VariantSequence
, Isovar looks at all Ensembl annotated transcripts overlapping the locus and collapses them into one or moreReferenceContext
object. EachReferenceContext
represents the cDNA sequence upstream of the variant locus and in which of the {0, +1, +2} reading frames it is translated. -
Translation: Use the reading frame from a
ReferenceContext
to translate aVariantSequence
into a protein fragment, represented byTranslation
. -
ProteinSequence: Multiple distinct variant sequences and reference contexts can generate the same translations, so we aggregate those equivalent
Translation
objects into aProteinSequence
. -
IsovarResult: Since a single variant locus might have reads which assemble into multiple incompatible coding sequences, an
IsovarResult
represents a variant and one or moreProteinSequence
objects which are associated with it. We typically don't want to deal with every possible translation of every distinct sequence detected around a variant, so the protein sequences are sorted by their number of supporting fragments and the best protein sequence is made easy to access. TheIsovarResult
object also has many informative properties suchnum_alt_fragments
,fraction_ref_reads
, &c.
Other Isovar Commandline Tools
- isovar-protein-sequences --vcf variants.vcf --bam rna.bam
- All protein sequences which can be assembled from RNA reads for any of the given variants.
- isovar-allele-counts --vcf variants.vcf --bam rna.bam
- Counts of reads and fragments supporting the ref, alt, and other alleles at all given variant locations.
- isovar-allele-reads --vcf variants.vcf --bam rna.bam
- Sequences of all reads overlapping any of the given variants.
- isovar-translations --vcf variants.vcf --bam rna.bam
- All possible translations of any assembled cDNA sequence containing any of the given variants in the reference frame of any matching transcript.
- isovar-reference-contexts --vcf variants.vcf
- Shows all candidate reference contexts (sequence and reading frame) before each variant, derived from overlapping reference coding transcripts.
- isovar-variant-reads --vcf variants.vcf --bam rna.bam
- Like the isovar-allele-reads command but limited only to reads which support the alt allele.
- isovar-variant-sequences --vcf variants.vcf --bam rna.bam
- Shows all assembled cDNA coding sequences supporting any of the given variants.
Sequencing Recommendations
Isovar works best with high quality / high coverage mRNA sequence data. This means that you will get best results from >100M paired-end reads sequenced on an Illumina HiSeq from a library enriched with poly-A capture. The number of reads varies depending on degree of RNA degradation and tumor purity. The read length will determine the longest protein sequence you can recover, since Isovar's cDNA assembly only considers reads that overlap a variant. With 100bp reads you will be able to assemble at most 199bp of sequence around a somatic single nucleotide variant, and consequently only be to determine 66 amino acids from the protein sequence. If you disable the cDNA assembly algorithm then a 100bp read will only be able to determine 33 amino acids.
Project details
Release history Release notifications | RSS feed
Download files
Download the file for your platform. If you're not sure which to choose, learn more about installing packages.
Source Distribution
File details
Details for the file isovar-1.0.8.tar.gz
.
File metadata
- Download URL: isovar-1.0.8.tar.gz
- Upload date:
- Size: 90.1 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/1.13.0 pkginfo/1.5.0.1 requests/2.22.0 setuptools/41.0.1 requests-toolbelt/0.9.1 tqdm/4.32.2 CPython/2.7.16
File hashes
Algorithm | Hash digest | |
---|---|---|
SHA256 | feeb1cc83e912b10f5fc335ed299caf75630c11ed74541926d19c66429fbed03 |
|
MD5 | 416b7ec3d81cd8369fea23eae145b5f4 |
|
BLAKE2b-256 | 80db7d4cdb69c95e6a6abbfabdff6145ad18a80125d4fb72ae90510c9547712c |