PaiRwisE Sequence IDENtiTy. Calculate pairwise nucleotide identitywith respect to a reference sequence.
Project description
PRESIDENT: PaiRwisE Sequence IDENtiTy
Calculate pairwise nucleotide identity with respect to a reference sequence.
Given a reference and a query sequence, calculate pairwise nucleotide identity with respect to the reference sequence relative to the entire length of the reference. In the main metric, only informative nucleotides (A, T, G, C) are considered identical to each other. The tool also provides some further metrics (e.g. regarding ambiguous 'N's) and splits the input FASTA into valid and failed FASTA files for further processing.
Installation:
To install president with conda, run the commands below:
conda create -y -n president -c bioconda -c conda-forge president
conda activate president
Note that pblat
is a dependency and only runs on Linux. Alternatively, president can be installed with pip in an environment where pblat is in the PATH:
pip install pypresident
Another possibility is to use a docker image with all dependencies installed:
docker pull rkibioinf/president:latest
Usage:
The pairwise alignment can be run with the following console call:
president --query query.fasta --reference reference.fasta -x identity_threshold -n n_threshold -t threads -p /path/to/output/ -f prefix
To run an example, download a query FASTA and a reference FASTA from GitLab.
Run the alignment with the following command and identity of ACGT bases of 90% and maximal 5% Ns in the query. Note that multiple fasta sequences are allowed to be present in the query but not in the reference FASTA.
president -q NC_045512.2.20mis.fasta -r NC_045512.2.fasta -x 0.9 -n 0.05 -t 4 -p output/ -f test_
Output:
The script provides:
- a tab-separated file with the below listed columns
- a FASTA file with valid sequences
- a FASTA file with invalid sequences
The separation between the valid and invalid bin is mainly based on the defined identity/n thresholds (-x
, default: 0.9; -n
, default: 0.05) and further sanity checks (non-IUPAC characters).
Note that for multiple query inputs the valid and invalid sequence headers are NOT associate with their filename. This meta information must be retrieved from the report.
An example from the call described above can be found online.
Attention: If your query sequence includes gap symbols (-
or .
; for example from a previous alignment step), president
will remove them from the output sequence! Rational: Gap symbols should be not part of a consensus sequence (for which the tool was developed) and will therefore be removed. Until version 0.6.7 gap symbols were replaced with N
!
The transposed version of the output is shown below.
variable | value |
---|---|
query_name | NC_045512.2 Severe acute [...] |
reference_name | NC_045512.2 Severe acute [...] |
file_in_query | NC_045512.2.20mis.fasta |
file_in_ref | NC_045512.2.fasta |
ACGT Nucleotide identity | 0.9987 |
ACGT Nucleotide identity (ignoring Ns) | 0.9994 |
ACGT Nucleotide identity (ignoring non-ACGTNs) | 0.9994 |
qc_all_valid | True |
qc_is_empty_query | False |
qc_post_align_pass_threshold | True |
qc_post_aligned | True |
qc_post_aligned_all_valid | True |
qc_valid_length | True |
qc_valid_nucleotides | True |
qc_valid_pass_nthreshold | True |
acgt_bases | 29883 |
iupac_bases | 20 |
non_iupac_bases | 0 |
N_bases | 20 |
length_query | 29903 |
length_reference | 29903 |
LongestNGap | 20 |
Matches | 29864 |
Mismatches | 19 |
query_description | |
query_index | 0 |
Date | 2021-01-20 |
Definitions:
- query_name - FASTA ID of the query sequence + file origin, format :
- reference_name - FASTA ID of the reference sequence
- file_in_query - the input query file name
- file_in_ref - the input reference file name (note that for multiple input queries a tempororay file name is reported)
- ACGT Nucleotide identity - Percentage of ACGT matches to the reference (# matches / max(sequence_lengths))
- ACGT Nucleotide identity (ignoring Ns) - Percentage of ACGT matches to the reference (# matches / max(sequence_lengths) - #Ns in query)
- ACGT Nucleotide identity (ignoring non-ACGTNs) - Percentage of ACGT matches to the reference (# matches / max(sequence_lengths) - #non-ACGTNs in query)
- qc_all_valid - True if all checks below are True
- qc_is_empty_query - True if input query file is not empty
- qc_post_align_pass_threshold - True if 'ACGT Nucleotide identity' is greater than the
-x
parameter - qc_post_aligned - Set to True if the sequence was aligned (can be False either because of initial checks or failed alignments)
- qc_post_aligned_all_valid - True if all checks before alignment are True, alignment is only performed if this value is True
- qc_valid_length - True if the query is actually able to achieve the
-x
score even if the query is shorter/ longer - qc_valid_nucleotides - True if only valid IUPAC characters are in the query
- qc_valid_pass_nthreshold - True if
-n
percentage of Ns in the query is not exceeded - acgt_bases - number of ACGT bases
- iupac_bases - number of IUPAC bases
- non_iupac_bases - Number of non-IUPAC nucleotides in the query
- N_bases - the number of N bases
- length_query - the length of the query
- length_reference - the length of the reference
- LongestNGap - Lenght of the longest N gap
- Matches - the number of matches in the alignment
- Mismatches - the number of mismatches in the alignment
- query_description - query description, if available
- query_index - the position of the sequence in the query fasta input file (for multiple files the counter is not resetted)
- Date - the yyyy-mm-dd of the execution of the script
Note: max(sequence_lengths) is equal to max(length_query, length_reference).
Notes:
- nextstrain uses a quality threshold of < 3000 non-canonical nucleotides
ANI definition:
Contributing
If you would like to get involved, here is information on contribution guidelines and how to test the code locally.
You can contribute in multiple ways, e.g., reporting bugs, reviewing or refactoring code, requesting or implementing new features, etc.
Project details
Release history Release notifications | RSS feed
Download files
Download the file for your platform. If you're not sure which to choose, learn more about installing packages.
Source Distribution
Built Distribution
File details
Details for the file pypresident-0.6.8.tar.gz
.
File metadata
- Download URL: pypresident-0.6.8.tar.gz
- Upload date:
- Size: 35.7 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/4.0.1 CPython/3.9.15
File hashes
Algorithm | Hash digest | |
---|---|---|
SHA256 | 2d4672eef2532559747b3e6f61c611514a55fe947b535da92ca3af492f44569b |
|
MD5 | 912ce48f79e6036823166a6eb515c52a |
|
BLAKE2b-256 | 6684db12332cb95d53820b292c48b0b06b02b91aa09527674a1fa0be57b3ea1f |
File details
Details for the file pypresident-0.6.8-py3-none-any.whl
.
File metadata
- Download URL: pypresident-0.6.8-py3-none-any.whl
- Upload date:
- Size: 18.2 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/4.0.1 CPython/3.9.15
File hashes
Algorithm | Hash digest | |
---|---|---|
SHA256 | 9e826e23a505c04b2db815f7ee85f068482dcc382296d9b93afdc422e12ad2cf |
|
MD5 | 950288de7467a32cbc9e1521af7e7e3c |
|
BLAKE2b-256 | 51b414904a23b833a7443302247f268a748ff57f27f2ebaf485af66509e5aa8d |