A tool to assess protein-coding gene annotation
PSAURON
PSAURON is a machine learning model for rapid assessment of protein-coding gene annotation. Link to paper coming soon...
Installation
$ pip install psauron
PSAURON can run on GPU or CPU and depends on PyTorch, which can be awkward to install :disappointed:
It may help to install PSAURON in a virtual environment :slightly_smiling_face:
$ python3 -m venv /path/to/new/virtual/environment
$ source /path/to/new/virtual/environment/bin/activate
$ pip install psauron
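Since PSAURON can run on GPU or CPU, it can be useful to check what the PyTorch install in the active environment actually sees before scoring. The following is a minimal sketch, not part of PSAURON: the helper name `torch_device` is hypothetical, and it assumes that CUDA availability as reported by `torch.cuda.is_available()` is what decides whether a GPU can be used.

```shell
# Hypothetical helper (not shipped with PSAURON).
# Prints "gpu" or "cpu" based on what PyTorch reports,
# or "no-torch" if PyTorch cannot be imported at all.
torch_device() {
  python3 -c 'import torch; print("gpu" if torch.cuda.is_available() else "cpu")' 2>/dev/null \
    || echo "no-torch"
}

torch_device
```

If this prints "cpu", PSAURON should still be usable, and the -c flag described under Usage forces CPU explicitly.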
Quickstart
PSAURON takes as input a single multi-fasta file and outputs a .csv with scores for all reading frames.
By default, PSAURON uses all six frames of the nucleotide coding sequences (CDS).
$ psauron -i path_to_your_CDS.fa -o path_to_output.csv
You may also provide a multi-fasta with protein (amino acid) sequence.
$ psauron -i path_to_your_protein.faa -o path_to_output.csv -p
...or request that PSAURON score only the in-frame nucleotide sequence.
$ psauron -i path_to_your_CDS.fa -o path_to_output.csv -s
Note: internal stop codons are ignored by PSAURON. A high PSAURON score does not guarantee a sequence contains a valid ORF. This is intended behavior, as alternate frame scores are used by default to boost the power of the model.
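Because PSAURON ignores internal stop codons, a separate check may be worthwhile when valid ORFs matter. The sketch below is not part of PSAURON: `internal_stops` is a hypothetical helper that assumes the standard genetic code (TAA, TAG, TGA) and that each FASTA record starts in frame. It prints the ID of every record with a stop codon before its final codon.

```shell
# Hypothetical helper (not shipped with PSAURON).
# Lists FASTA records that contain a stop codon in frame 0 before the last codon.
# Assumes the standard genetic code and that each record starts in frame.
internal_stops() {
  awk '
    function check(   i, n, codon) {
      if (name == "") return
      n = length(seq)
      # Walk complete codons in frame 0, skipping the final complete codon.
      for (i = 1; i + 5 <= n; i += 3) {
        codon = toupper(substr(seq, i, 3))
        if (codon == "TAA" || codon == "TAG" || codon == "TGA") {
          print name
          return
        }
      }
    }
    # Header line: finish the previous record, then start a new one.
    /^>/ { check(); name = substr($1, 2); seq = ""; next }
    # Sequence line: strip whitespace and append.
    { gsub(/[ \t\r]/, ""); seq = seq $0 }
    END { check() }
  ' "$1"
}
```

Running `internal_stops path_to_your_CDS.fa` next to the PSAURON scores makes it possible to flag sequences that score well but cannot be a valid ORF.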
Usage
psauron [-h] -i INPUT_FASTA [-o OUTPUT_PATH] [-m MINIMUM_LENGTH] [-e EXCLUDE] [--inframe INFRAME] [--outframe OUTFRAME] [-c] [-s] [-p] [-v]
optional arguments:
-h, --help show this help message and exit
-i INPUT_FASTA, --input-fasta INPUT_FASTA
REQUIRED path to FASTA with spliced CDS sequence or protein sequence. A spliced CDS fasta can be created from a GTF/GFF and a reference FASTA by using gffread.
-o OUTPUT_PATH, --output-path OUTPUT_PATH
OPTIONAL path to output results file, default=./psauron_score.csv
-m MINIMUM_LENGTH, --minimum-length MINIMUM_LENGTH
OPTIONAL exclude all proteins shorter than m amino acids, default=5
-e EXCLUDE, --exclude EXCLUDE
OPTIONAL exclude any CDS where FASTA description contains given text (case invariant), e.g. "hypothetical", default=None
--inframe INFRAME OPTIONAL probability threshold used to determine final psauron score, in-frame, higher number decreases sensitivity and increases specificity, default=0.5, range=[0,1]
--outframe OUTFRAME OPTIONAL probability threshold used to determine final psauron score, out-of-frame, higher number increases sensitivity and decreases specificity, default=0.5, range=[0,1]
-c, --use-cpu OPTIONAL set -c to force usage of CPU instead of GPU, default=False
-s, --single-frame OPTIONAL set -s to score only the in-frame CDS, which may lower accuracy of the model, default=False
-p, --protein OPTIONAL set -p if your FASTA contains amino acid protein sequence, which may lower accuracy of the model, default=False
-v, --verbose OPTIONAL set -v for verbose output with progress bars etc., default=False
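The options above can be combined in one call. The following sketch uses only flags listed in the help text. The wrapper name `psauron_strict`, the `PSAURON_RUN` variable, and the chosen values (a minimum length of 50, an in-frame threshold of 0.8) are illustrative assumptions, not recommendations from the authors. By default it only prints the command it would run.

```shell
# Hypothetical wrapper: a stricter, CPU-only, verbose PSAURON run.
# Prints the command unless PSAURON_RUN=1 is set, so it is safe to try.
psauron_strict() {
  # $1 = input CDS FASTA, $2 = output CSV (both expanded before set replaces them)
  set -- psauron -i "$1" -o "$2" -m 50 -e hypothetical --inframe 0.8 -c -v
  if [ "${PSAURON_RUN:-0}" = 1 ]; then
    "$@"
  else
    printf '%s\n' "$*"
  fi
}

psauron_strict path_to_your_CDS.fa strict_scores.csv
```

Per the help text, raising --inframe trades sensitivity for specificity, and -e hypothetical skips any CDS whose FASTA description contains that word.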
The spliced CDS FASTA passed to -i can be created from a GTF/GFF and a reference FASTA by using gffread.
Example gffread commands to get CDS FASTA:
gffread -x CDS_FASTA.fa -g genome.fa input.gff
gffread -x CDS_FASTA.fa -g genome.fa input.gtf
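The gffread step and the PSAURON step can be chained. This is a minimal sketch that uses only the commands and flags shown in this README. The function name `annotate_and_score`, the `RUN` variable, and the derived `.cds.fa` file name are assumptions made for illustration.

```shell
# Hypothetical helper: extract spliced CDS with gffread, then score with PSAURON.
# Set RUN=echo to print the commands instead of executing them.
annotate_and_score() {
  genome=$1       # reference genome FASTA
  annotation=$2   # GFF or GTF annotation
  out_csv=$3      # where PSAURON should write its scores
  cds_fasta="${out_csv%.csv}.cds.fa"   # intermediate CDS FASTA next to the output

  ${RUN:-} gffread -x "$cds_fasta" -g "$genome" "$annotation" &&
  ${RUN:-} psauron -i "$cds_fasta" -o "$out_csv"
}
```

For example, `RUN=echo annotate_and_score genome.fa input.gff scores.csv` shows both commands without running them. Dropping `RUN=echo` runs them for real, and PSAURON starts only if gffread succeeds.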