PaiRwisE Sequence IDENtiTy. Calculate pairwise nucleotide identity with respect to a reference sequence.
PRESIDENT: PaiRwisE Sequence IDENtiTy
Calculate pairwise nucleotide identity with respect to a reference sequence.
Given a reference and a query sequence, president calculates the pairwise nucleotide identity of the query with respect to the reference, relative to the entire length of the reference. In the main metric, only informative nucleotides (A, T, G, C) count as matches. The tool also reports further metrics (e.g. regarding ambiguous 'N's) and splits the input FASTA into a valid and a failed FASTA file for further processing.
Installation:
To install president with conda, run the commands below:
conda create -y -n president -c bioconda -c conda-forge president
conda activate president
Note that pblat is a dependency and only runs on Linux. Alternatively, president can be installed with pip in an environment where pblat is in the PATH:
pip install president
Usage:
Installing pypresident provides the president command. The pairwise alignment can be run with the following console call:
president --query query.fasta --reference reference.fasta -x identity_threshold -t threads -p /path/to/output/prefix
To run an example, download a query FASTA and a reference FASTA from GitLab.
Run the alignment with the following command, which uses an ACGT identity threshold of 90% (-x 0.9). Note that the query FASTA may contain multiple sequences, but the reference FASTA may not.
president -q NC_045512.2.20mis.fasta -r NC_045512.2.fasta -x 0.9 -t 4 -p output/test
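When president is called from a pipeline script, it can help to assemble the call programmatically. The sketch below only builds the argument list from the flags shown above (-q, -r, -x, -t, -p). The helper name `build_president_command` and its defaults are illustrative assumptions, not part of the package.

```python
import shlex


def build_president_command(query, reference, threshold=0.9, threads=4,
                            prefix="output/test"):
    """Assemble the argument list for the example call shown above.

    Hypothetical helper: only the flags documented in this README are used.
    """
    return [
        "president",
        "-q", str(query),
        "-r", str(reference),
        "-x", str(threshold),
        "-t", str(threads),
        "-p", str(prefix),
    ]


cmd = build_president_command("NC_045512.2.20mis.fasta", "NC_045512.2.fasta")
# Show the call as it would be typed in a shell.
print(shlex.join(cmd))
```

To actually execute the call, the list can be passed to `subprocess.run(cmd, check=True)` in an environment where president and pblat are installed.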
Output:
The script provides:
- a tab-separated file with the columns listed below
- a FASTA file with valid sequences
- a FASTA file with invalid sequences
The separation between the valid and invalid bin is mainly based on the defined identity threshold (-x, default: 0.9) and further sanity checks (non-IUPAC characters, amount of 'N's and query length that cause sequence identity to drop below -x).
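The sanity checks can be pictured as a best-case estimate: if every ACGT base in the query matched the reference, could the identity still reach the threshold? The following is a minimal sketch of that idea, not the package's implementation. The function names, the set of ambiguity codes, and the use of `>=` for the comparison are assumptions made for illustration.

```python
from collections import Counter

ACGT = set("ACGT")
# IUPAC nucleotide ambiguity codes other than A, C, G, T (assumption: U and
# gap characters are treated as non-IUPAC in this sketch).
IUPAC_NON_ACGT = set("RYSWKMBDHVN")


def count_query_bases(sequence):
    """Count ACGT, other IUPAC, and non-IUPAC characters in a query."""
    counts = Counter(sequence.upper())
    n_acgt = sum(counts[base] for base in ACGT)
    n_iupac = sum(counts[base] for base in IUPAC_NON_ACGT)
    return {
        "acgt": n_acgt,
        "iupac_non_acgt": n_iupac,
        "non_iupac": len(sequence) - n_acgt - n_iupac,
    }


def can_reach_threshold(sequence, reference_length, threshold=0.9):
    """Best case: every ACGT base matches. Is the threshold still reachable?"""
    best_case_matches = count_query_bases(sequence)["acgt"]
    denominator = max(len(sequence), reference_length)
    return best_case_matches / denominator >= threshold


print(count_query_bases("ACGTNNRY?-"))
print(can_reach_threshold("ACGTACGTNN", reference_length=10))  # 8 of 10 usable
```

A query that fails this check cannot pass the identity threshold regardless of how well its remaining bases align.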
An example from the call described above can be found online.
The transposed version of the output is shown below.
variable | value |
---|---|
ID | NC_045512.2 |
Valid | True |
ACGT Nucleotide identity | 0.9987 |
ACGT Nucleotide identity (ignoring Ns) | 0.9994 |
ACGT Nucleotide identity (ignoring non-ACGTNs) | 1.0 |
Ambiguous Bases | 20.0 |
Query Length | 29903 |
Query #ACGT | 29883 |
Query #IUPAC-ACGT | 20.0 |
Query #non-IUPAC | 0.0 |
aligned | True |
passed_initial_qc | True |
Date | 2021-01-20 |
reference_length | 29903 |
reference | NC_045512.2.fasta |
query | NC_045512.2.20mis.fasta |
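The tab-separated report can be post-processed with the standard library. The sketch below assumes that the (non-transposed) file has one header row whose column names match the variables shown above and one row per query sequence; the helper name `read_report` is hypothetical, and the sample text reuses a few values from the example table.

```python
import csv
import io


def read_report(handle):
    """Parse a president-style TSV report into a list of dicts.

    Assumes the header names 'ID', 'Valid' and 'ACGT Nucleotide identity'.
    """
    rows = []
    for row in csv.DictReader(handle, delimiter="\t"):
        row["Valid"] = row["Valid"] == "True"
        row["ACGT Nucleotide identity"] = float(row["ACGT Nucleotide identity"])
        rows.append(row)
    return rows


# In-memory stand-in for the report file, using values from the table above.
sample = (
    "ID\tValid\tACGT Nucleotide identity\tQuery Length\n"
    "NC_045512.2\tTrue\t0.9987\t29903\n"
)
report = read_report(io.StringIO(sample))
valid_ids = [row["ID"] for row in report if row["Valid"]]
print(valid_ids)
```

For a real run, replace the `io.StringIO` object with an open file handle to the report written under the chosen output prefix.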
Definitions:
- ID - the fasta sequence name in the query
- Valid - True if 'ACGT Nucleotide identity' is greater than the -x parameter
- ACGT Nucleotide identity - Fraction of ACGT matches to the reference (# matches / max(sequence_lengths))
- ACGT Nucleotide identity (ignoring Ns) - Fraction of ACGT matches to the reference, with the Ns in the query removed from the denominator (# matches / (max(sequence_lengths) - #Ns in query))
- ACGT Nucleotide identity (ignoring non-ACGTNs) - Fraction of ACGT matches to the reference, with the non-ACGTN characters in the query removed from the denominator (# matches / (max(sequence_lengths) - #non-ACGTNs in query))
- Ambiguous Bases - Number of N nucleotides in the query.
- Query Length - Sequence length of the query.
- Query #ACGT - Number of ACGT in the query.
- Query #IUPAC-ACGT - Number of IUPAC characters that are not ACGT in the query.
- Query #non-IUPAC - Number of non-IUPAC characters in the query.
- aligned - True if the sequence got aligned with pblat
- passed_initial_qc - True if the sequence is long enough and has enough ACGT nucleotides (instead of Ns) to reach the identity threshold
- Date - the yyyy-mm-dd of the execution of the script
- reference_length - Sequence length of the reference
- reference - the reference file name
- query - the query file name
Note: max(sequence_lengths) is equal to max(length_query, length_reference).
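The three identity definitions can be written out as a small function. This is a sketch under one assumption: the denominators are read as max(sequence_lengths) minus the respective count. The function name, its parameters, and the toy numbers are illustrative and not taken from the package.

```python
def identity_metrics(matches, query_length, reference_length,
                     n_count, non_acgtn_count):
    """Compute the three identity values from the definitions above.

    Assumes each denominator is max(sequence_lengths) minus a count, and that
    the counts are smaller than the longest sequence (no zero division guard).
    """
    longest = max(query_length, reference_length)
    return {
        "identity": matches / longest,
        "identity_ignoring_ns": matches / (longest - n_count),
        "identity_ignoring_non_acgtn": matches / (longest - non_acgtn_count),
    }


# Toy example: 95 matching ACGT bases, both sequences 100 long,
# 3 Ns and 1 other non-ACGTN character in the query.
metrics = identity_metrics(matches=95, query_length=100, reference_length=100,
                           n_count=3, non_acgtn_count=1)
for name, value in metrics.items():
    print(name, round(value, 4))
# identity 0.95
# identity_ignoring_ns 0.9794
# identity_ignoring_non_acgtn 0.9596
```

Because the subtracted counts shrink the denominator, the two "ignoring" variants are never lower than the main metric for the same number of matches.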
Notes:
- nextstrain uses a quality threshold of < 3000 non-canonical nucleotides
ANI definition: