Skip to main content

Predict ancestral sequence of fungal repeat elements by correcting for RIP-like mutations in multi-sequence DNA alignments.

Project description

License: MIT

deRIP2

Predict progenitor sequence of fungal repeat families by correcting for RIP-like mutations (CpA --> TpA) and cytosine deamination (C --> T) events.

Mask RIP or deamination events from input alignment as ambiguous bases.

Table of contents

Algorithm overview

For each column in input alignment:

  • Check if number of gapped rows is greater than max gap proportion. If true, then a gap is added to the output sequence.
  • Set invariant column values in output sequence.
  • If at least X proportion of bases are C/T or G/A (i.e. maxSNPnoise = 0.4, then at least 0.6 of positions in column must be C/T or G/A).
  • If reaminate option is set then revert T-->C or A-->G.
  • If reaminate is not set then check for number of positions in RIP dinucleotide context (C/TpA or TpG/A).
  • If proportion of positions in column in RIP-like context => minRIPlike threshold, AND at least one substrate and one product motif (i.e. CpA and TpA) is present, perform RIP correction in output sequence.
  • For all remaining positions in output sequence (not filled by gap, reaminate, or RIP-correction) inherit sequence from input sequence with the fewest observed RIP events (or greatest GC content if not RIP detected or multiple sequences sharing min-RIP count).

Outputs:

  • Corrected sequence as fasta.
  • Optional, alignment with:
    • Corrected sequence appended.
    • With corrected positions masked as ambiguous bases.

Options and Usage

Installation

Requires Python => v3.6

Clone from this repository:

% git clone https://github.com/Adamtaranto/deRIP2.git && cd deRIP2 && pip install -e .

Install from PyPi.

% pip install derip2

Test installation.

# Print version number and exit.
% derip2 --version
derip2 0.0.2

# Get usage information
% derip2 --help

Example usage

For aligned sequences in 'myalignment.fa':

  • Any column >= 70% gap positions is not corrected.
  • Bases in column must be >= 80% C/T or G/A
  • At least 50% bases must be in RIP dinucleotide context (C/T as CpA / TpA)
  • Inherit all remaining uncorrected positions from least RIP'd sequence.
  • Mask all substrate and product motifs from corrected columns as ambiguous bases (i.e. CpA to TpA --> YpA)
derip2 --inAln myalignment.fa --format fasta \
--maxGaps 0.7 \
--maxSNPnoise 0.2 \
--minRIPlike 0.5 \
--outDir results \
--outName deRIPed_sequence.fa \
--outAlnName aligment_with_deRIP.fa \
--label deRIPseqName \
--mask 

Output:

  • results/deRIPed_sequence.fa
  • results/masked_aligment_with_deRIP.fa

Standard options

Usage: derip2 [-h] [--version] -i INALN
              [--format {clustal,emboss,fasta,fasta-m10,ig,nexus,phylip,phylip-sequential,phylip-relaxed,stockholm}]
              [-g MAXGAPS] [-a] [--maxSNPnoise MAXSNPNOISE]
              [--minRIPlike MINRIPLIKE] [--fillmaxgc] [--fillindex FILLINDEX]
              [--mask] [--noappend] [-d OUTDIR] [-o OUTNAME]
              [--outAlnName OUTALNNAME]
              [--outAlnFormat {clustal,emboss,fasta,fasta-m10,ig,nexus,phylip,phylip-sequential,phylip-relaxed,stockholm}]
              [--label LABEL]

Predict ancestral sequence of fungal repeat elements by correcting for RIP-
like mutations or cytosine deamination in multi-sequence DNA alignments. 
Optionally, mask corrected positions in alignment.

optional arguments:
  -h, --help            show this help message and exit
  --version             show program's version number and exit
  -i INALN, --inAln INALN
                        Multiple sequence alignment.
  --format {clustal,emboss,fasta,fasta-m10,ig,nexus,phylip,phylip-sequential,phylip-relaxed,stockholm}
                        Format of input alignment.
  -g MAXGAPS, --maxGaps MAXGAPS
                        Maximum proportion of gapped positions in column to be
                        tolerated before forcing a gap in final deRIP
                        sequence.
  -a, --reaminate       Correct deamination events in non-RIP contexts.
  --maxSNPnoise MAXSNPNOISE
                        Maximum proportion of conflicting SNPs permitted
                        before excluding column from RIP/deamination
                        assessment. i.e. By default a column with >= 0.5 'C/T'
                        bases will have 'TpA' positions logged as RIP events.
  --minRIPlike MINRIPLIKE
                        Minimum proportion of deamination events in RIP
                        context (5' CpA 3' --> 5' TpA 3') required for column
                        to deRIP'd in final sequence. Note: If 'reaminate'
                        option is set all deamination events will be corrected
  --fillmaxgc           By default uncorrected positions in the output
                        sequence are filled from the sequence with the lowest
                        RIP count. If this option is set remaining positions
                        are filled from the sequence with the highest G/C
                        content.
  --fillindex FILLINDEX
                        Force selection of alignment row to fill uncorrected
                        positions from by row index number (indexed from 0).
                        Note: Will override '--fillmaxgc' option.
  --mask                Mask corrected positions in alignment with degenerate
                        IUPAC codes.
  --noappend            If set, do not append deRIP'd sequence to output
                        alignment.
  -d OUTDIR, --outDir OUTDIR
                        Directory for deRIP'd sequence files to be written to.
  -o OUTNAME, --outName OUTNAME
                        Write deRIP sequence to this file.
  --outAlnName OUTALNNAME
                        Optional: If set write alignment including deRIP
                        sequence to this file.
  --outAlnFormat {clustal,emboss,fasta,fasta-m10,ig,nexus,phylip,phylip-sequential,phylip-relaxed,stockholm}
                        Optional: Write alignment including deRIP sequence to
                        file of format X.
  --label LABEL         Use label as name for deRIP'd sequence in output
                        files.

Issues

Submit feedback to the Issue Tracker

License

Software provided under MIT license.

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Files for derip2, version 0.0.2
Filename, size File type Python version Upload date Hashes
Filename, size derip2-0.0.2.tar.gz (13.0 kB) File type Source Python version None Upload date Hashes View hashes

Supported by

Elastic Elastic Search Pingdom Pingdom Monitoring Google Google BigQuery Sentry Sentry Error logging AWS AWS Cloud computing DataDog DataDog Monitoring Fastly Fastly CDN DigiCert DigiCert EV certificate StatusPage StatusPage Status page