Tool for ORF-calling and ORF-classification using ML approaches
TIdeS
Transcript Identification and Selection (TIdeS) is a method to identify putative open reading frames (pORFs) from a given transcriptome. It can also aid in the bulk decontamination of sequences from "messy" transcriptomic data.
Overall, TIdeS couples sequence composition with ML approaches to discern pORFs in the correct reading frame with substantial improvement over other popular tools, while providing support for additional non-standard genetic codes. Additionally, TIdeS can be used to classify ORFs into several user-defined categories from highly contaminated datasets (e.g., parasite + host, kleptoplasts, big "dirty" protists).
Dependencies
Installation
Note that TIdeS is only supported on UNIX systems (Linux and macOS).
Python's pip can be used to install TIdeS and its required Python packages.
pip install tides-ml
Then download the precompiled executables for the remaining dependencies.
Alternatively, you can do this through conda (note this will be updated):
# Create a new environment for TIdeS
conda create -n tides-ml
conda activate tides-ml
# Install the necessary packages (with minimum support versions)
conda install -c bioconda -c conda-forge diamond">=2.0.13" cd-hit">=4.8.1" barrnap">=0.9" kraken2">=2.0"
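Whichever installation route you use, it can help to confirm that the external tools are discoverable on your PATH before running TIdeS. The following is a minimal Python sketch, not part of TIdeS: the function name `missing_tools` is hypothetical, and the executable names (in particular `cd-hit-est`) are assumptions based on how these packages are usually installed.

```python
import shutil

# Executable names are assumptions; adjust them to match your installation.
TOOLS = ("diamond", "cd-hit-est", "barrnap", "kraken2")


def missing_tools(tools=TOOLS):
    """Return the names from `tools` that cannot be found on PATH."""
    return [name for name in tools if shutil.which(name) is None]


if __name__ == "__main__":
    missing = missing_tools()
    if missing:
        print("Not found on PATH:", ", ".join(missing))
    else:
        print("All external dependencies were found.")
```

An empty result means every listed executable was found; any name that is reported needs to be installed or added to PATH.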
Clone the repository.
git clone https://github.com/xxmalcala/TIdeS.git
Running TIdeS
The general syntax to run TIdeS is:
tides --fin <transcriptome-assembly> --taxon <taxon-name> --db <protein-database>
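If you launch TIdeS from a Python pipeline, the same call can be assembled as an argument list. This is only a sketch: `build_tides_cmd` is a hypothetical helper, `assembly.fasta` is a placeholder file name, and only flags documented in this README are used.

```python
import shlex


def build_tides_cmd(fin, taxon, db, threads=4):
    """Assemble the argument list for a basic TIdeS ORF-calling run."""
    return [
        "tides",
        "--fin", str(fin),
        "--taxon", str(taxon),
        "--db", str(db),
        "--threads", str(threads),
    ]


# Placeholder inputs; replace them with your own files and taxon name.
cmd = build_tides_cmd("assembly.fasta", "Op_me_Hsap", "tides_aa_db.dmnd")
print(shlex.join(cmd))
# To actually start the run: subprocess.run(cmd, check=True)
```

Passing a list rather than a single string avoids shell-quoting problems when paths contain spaces.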
Several example command lines and uses for TIdeS (i.e., ORF-calling and ORF classifying) are included in the examples folder. To run the examples, you need to be within the examples folder (e.g., ./orf_call_and_decontam.sh).
List of all options
Command | Comment |
---|---|
-h, --help | Print the help message. |
-i, --fin <STRING> | FASTA formatted file. |
-o, --taxon <STRING> | Name for your taxon, project, outputs. |
-t, --threads <INTEGER> | Number of available threads to use. Default value is 4. |
-d, --db <STRING> | Path to FASTA or DIAMOND formatted proteome database. |
-k, --kraken <STRING> | Kraken2 database to identify and filter non-eukaryotic sequences. |
--no-filter | Skip all transcript pre-processing. |
-p, --partials | Include partial ORFs for ORF calling. |
-id, --id <INTEGER> | Minimum % identity to remove redundant transcripts. Default value is 97. |
-l, --min-orf <INTEGER> | Minimum transcript length (bp) for ORF calling. Default value is 300. |
-ml, --max-orf <INTEGER> | Maximum transcript length (bp) for ORF calling. Default value is 10000. |
-e, --evalue <REAL> | Maximum e-value to infer reference ORFs. Default value is 1e-30. |
-gc, --gencode <STRING/INTEGER> | Genetic code to use for ORF calling and translation. Default value is 1. |
-s, --strand <STRING> | Strands to call ORFs (both/minus/plus). Default value is both. |
-c, --contam <STRING> | Path to annotated sequence table. If unset, TIdeS will assume a prior model is provided as well. |
-m, --model <STRING> | Path to a prior TIdeS run's model. These are the ".pkl" files. |
-k, --kmer <INTEGER> | kmer length to use. Default value is 3. |
-ov, --overlap | Permit overlapping kmers. |
--step <INTEGER> | Step-size for overlapping kmers. Default value is kmer-length/2. |
--clean | Remove intermediate filter-step files. |
-gz, --gzip | Compress TIdeS outputs when finished. |
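To make the kmer, overlap, and step options more concrete, here is a minimal Python sketch of k-mer counting with and without overlapping windows. It is not TIdeS's own implementation. The function `kmer_counts` is hypothetical, and two points are assumptions: that non-overlapping k-mers advance by the full k-mer length, and that the default step of kmer-length/2 uses integer division with a minimum of 1.

```python
from collections import Counter


def kmer_counts(seq, k=3, overlap=False, step=None):
    """Count k-mers in `seq`.

    Without overlap the window advances by k (adjacent k-mers).
    With overlap it advances by `step`, which defaults to k // 2
    (an assumed reading of "kmer-length/2"), never less than 1.
    """
    if k < 1:
        raise ValueError("k must be a positive integer")
    if overlap:
        stride = step if step is not None else max(1, k // 2)
    else:
        stride = k
    if stride < 1:
        raise ValueError("step must be a positive integer")
    seq = seq.upper()
    # Only full-length windows are counted; a trailing partial k-mer is dropped.
    return Counter(seq[i:i + k] for i in range(0, len(seq) - k + 1, stride))


non_overlapping = kmer_counts("ATGATGATG")            # windows start at 0, 3, 6
overlapping = kmer_counts("ATGATGATG", overlap=True)  # windows start at 0..6
```

Under these assumptions, overlapping windows give more k-mer observations per sequence, at the cost of counting out-of-frame k-mers alongside in-frame ones.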
ORF-Calling and Assessment
Reference protein database
Create a reference protein database for TIdeS (note you can use your own if you choose!). This will generate a database from six diverse eukaryotes, representing a broad yet compact database for subsequent ORF-calling.
Note that this database (tides_aa_db.dmnd) will be created in whichever directory you run this script from.
./TIdeS/util/prep_tides_db.sh
Inputs
- FASTA formatted transcriptome assembly
- Taxon name (e.g., Homo sapiens, Op_me_Hsap)
- Protein database (can be prepared by "prep_tides_db.sh" in the util folder)
tides -i <transcriptome-assembly> -o <taxon-name> -d <protein-database>
Decontamination
Inputs
- FASTA formatted transcriptome assembly
- Taxon/project name (e.g., Durinskia baltica, Dinotoms)
Optional (need one)
- Table of annotated sequence names (see examples folder)
- Path to Kraken2 database to identify putative non-eukaryotic sequences to remove
With a table of annotated sequences:
tides -i <transcriptome-assembly> -o <taxon-name> -c <annotated-seqs-table>
Using a Kraken2 database:
tides -i <transcriptome-assembly> -o <taxon-name> -c -k <kraken2-database>
Table of annotated sequences
The <annotated-seqs-table> should include sequence names and their category separated by tabs. Note that these sequences should be present within the input FASTA file as well. Please aim to include at least 25 sequences for each category, although more (up to ~200) is great!
seq1 human
seq2 lunch
seq3 lunch
seq4 human
seq5 lunch
...
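Before training, it may be worth checking the table against the FASTA file. The sketch below is not part of TIdeS. The helper names are hypothetical, and it assumes sequence names are matched on the first whitespace-delimited token of each FASTA header. It reports the number of usable sequences per category, names missing from the FASTA file, and categories below the suggested minimum of 25.

```python
import csv
from collections import Counter


def fasta_names(fasta_path):
    """Collect sequence names (first token after '>') from a FASTA file."""
    names = set()
    with open(fasta_path) as handle:
        for line in handle:
            if line.startswith(">"):
                tokens = line[1:].split()
                if tokens:
                    names.add(tokens[0])
    return names


def check_annotation_table(table_path, fasta_path, min_per_category=25):
    """Return (counts per category, names absent from FASTA, sparse categories)."""
    names = fasta_names(fasta_path)
    counts = Counter()
    absent = []
    with open(table_path, newline="") as handle:
        for row in csv.reader(handle, delimiter="\t"):
            if len(row) < 2 or not row[0].strip():
                continue  # skip blank or malformed rows
            seq_name, category = row[0].strip(), row[1].strip()
            if seq_name in names:
                counts[category] += 1
            else:
                absent.append(seq_name)
    # Categories with fewer usable sequences than the suggested minimum.
    sparse = sorted(cat for cat, n in counts.items() if n < min_per_category)
    return counts, absent, sparse
```

A category whose sequences are all missing from the FASTA file will not appear in the counts at all, so inspect the list of absent names as well as the sparse categories.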
Additional uses/approaches
More on how to run TIdeS and its uses can be found in the examples folder, including:
- ORF-Calling
- Classification of ORFs
- ORF-calling and sequence classifying with a previously trained model
- Preparing a simple proteome database and ORF-calling
- Naive approaches to inferring contamination
- Example FASTA and <annotated-seqs-table> files