Skip to main content

PaiRwisE Sequence IDENtiTy. Calculate pairwise nucleotide identitywith respect to a reference sequence.

Project description

pipeline codecov

PRESIDENT: PaiRwisE Sequence IDENtiTy

Calculate pairwise nucleotide identity with respect to a reference sequence.

Given a reference and a query sequence, calculate pairwise nucleotide identity with respect to the reference sequence relative to the entire length of the reference. In the main metric, only informative nucleotides (A, T, G, C) are considered identical to each other. The tool also provides some further metrics (e.g. regarding ambiguous 'N's) and splits the input FASTA into valid and failed FASTA files for further processing.

Installation:

To install president with conda, run the commands below:

conda create -y -n president -c bioconda -c conda-forge president
conda activate president

Note that pblat is a dependency and only runs on Linux. Alternatively, president can be installed with pip in an environment where pblat is in the PATH:

pip install pypresident

Another possibility is to use a docker image with all dependencies installed:

docker pull rkibioinf/president:latest

Usage:

The pairwise alignment can be run with the following console call:

president --query query.fasta --reference reference.fasta -x identity_threshold -n n_threshold -t threads -p /path/to/output/ -f prefix

To run an example, download a query FASTA and a reference FASTA from GitLab.

Run the alignment with the following command and identity of ACGT bases of 90% and maximal 5% Ns in the query. Note that multiple fasta sequences are allowed to be present in the query but not in the reference FASTA.

president -q NC_045512.2.20mis.fasta -r NC_045512.2.fasta -x 0.9 -n 0.05 -t 4 -p output/ -f test_

Output:

The script provides:

  • a tab-separated file with the below listed columns
  • a FASTA file with valid sequences
  • a FASTA file with invalid sequences

The separation between the valid and invalid bin is mainly based on the defined identity/n thresholds (-x, default: 0.9; -n, default: 0.05) and further sanity checks (non-IUPAC characters). Note that for multiple query inputs the valid and invalid sequence headers are NOT associate with their filename. This meta information must be retrieved from the report. An example from the call described above can be found online.

Attention: If your query sequence includes gap symbols (- or .; for example from a previous alignment step), president will remove them from the output sequence! Rational: Gap symbols should be not part of a consensus sequence (for which the tool was developed) and will therefore be removed. Until version 0.6.7 gap symbols were replaced with N!

The transposed version of the output is shown below.

variable value
query_name NC_045512.2 Severe acute [...]
reference_name NC_045512.2 Severe acute [...]
file_in_query NC_045512.2.20mis.fasta
file_in_ref NC_045512.2.fasta
ACGT Nucleotide identity 0.9987
ACGT Nucleotide identity (ignoring Ns) 0.9994
ACGT Nucleotide identity (ignoring non-ACGTNs) 0.9994
qc_all_valid True
qc_is_empty_query False
qc_post_align_pass_threshold True
qc_post_aligned True
qc_post_aligned_all_valid True
qc_valid_length True
qc_valid_nucleotides True
qc_valid_pass_nthreshold True
acgt_bases 29883
iupac_bases 20
non_iupac_bases 0
N_bases 20
length_query 29903
length_reference 29903
LongestNGap 20
Matches 29864
Mismatches 19
query_description
query_index 0
Date 2021-01-20
Definitions:
  1. query_name - FASTA ID of the query sequence + file origin, format :
  2. reference_name - FASTA ID of the reference sequence
  3. file_in_query - the input query file name
  4. file_in_ref - the input reference file name (note that for multiple input queries a tempororay file name is reported)
  5. ACGT Nucleotide identity - Percentage of ACGT matches to the reference (# matches / max(sequence_lengths))
  6. ACGT Nucleotide identity (ignoring Ns) - Percentage of ACGT matches to the reference (# matches / max(sequence_lengths) - #Ns in query)
  7. ACGT Nucleotide identity (ignoring non-ACGTNs) - Percentage of ACGT matches to the reference (# matches / max(sequence_lengths) - #non-ACGTNs in query)
  8. qc_all_valid - True if all checks below are True
  9. qc_is_empty_query - True if input query file is not empty
  10. qc_post_align_pass_threshold - True if 'ACGT Nucleotide identity' is greater than the -x parameter
  11. qc_post_aligned - Set to True if the sequence was aligned (can be False either because of initial checks or failed alignments)
  12. qc_post_aligned_all_valid - True if all checks before alignment are True, alignment is only performed if this value is True
  13. qc_valid_length - True if the query is actually able to achieve the -x score even if the query is shorter/ longer
  14. qc_valid_nucleotides - True if only valid IUPAC characters are in the query
  15. qc_valid_pass_nthreshold - True if -n percentage of Ns in the query is not exceeded
  16. acgt_bases - number of ACGT bases
  17. iupac_bases - number of IUPAC bases
  18. non_iupac_bases - Number of non-IUPAC nucleotides in the query
  19. N_bases - the number of N bases
  20. length_query - the length of the query
  21. length_reference - the length of the reference
  22. LongestNGap - Lenght of the longest N gap
  23. Matches - the number of matches in the alignment
  24. Mismatches - the number of mismatches in the alignment
  25. query_description - query description, if available
  26. query_index - the position of the sequence in the query fasta input file (for multiple files the counter is not resetted)
  27. Date - the yyyy-mm-dd of the execution of the script

Note: max(sequence_lengths) is equal to max(length_query, length_reference).

Notes:
  • nextstrain uses a quality threshold of < 3000 non-canonical nucleotides

ANI definition:

Contributing

If you would like to get involved, here is information on contribution guidelines and how to test the code locally.

You can contribute in multiple ways, e.g., reporting bugs, reviewing or refactoring code, requesting or implementing new features, etc.

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

pypresident-0.6.8.tar.gz (35.7 kB view details)

Uploaded Source

Built Distribution

pypresident-0.6.8-py3-none-any.whl (18.2 kB view details)

Uploaded Python 3

File details

Details for the file pypresident-0.6.8.tar.gz.

File metadata

  • Download URL: pypresident-0.6.8.tar.gz
  • Upload date:
  • Size: 35.7 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/4.0.1 CPython/3.9.15

File hashes

Hashes for pypresident-0.6.8.tar.gz
Algorithm Hash digest
SHA256 2d4672eef2532559747b3e6f61c611514a55fe947b535da92ca3af492f44569b
MD5 912ce48f79e6036823166a6eb515c52a
BLAKE2b-256 6684db12332cb95d53820b292c48b0b06b02b91aa09527674a1fa0be57b3ea1f

See more details on using hashes here.

File details

Details for the file pypresident-0.6.8-py3-none-any.whl.

File metadata

  • Download URL: pypresident-0.6.8-py3-none-any.whl
  • Upload date:
  • Size: 18.2 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/4.0.1 CPython/3.9.15

File hashes

Hashes for pypresident-0.6.8-py3-none-any.whl
Algorithm Hash digest
SHA256 9e826e23a505c04b2db815f7ee85f068482dcc382296d9b93afdc422e12ad2cf
MD5 950288de7467a32cbc9e1521af7e7e3c
BLAKE2b-256 51b414904a23b833a7443302247f268a748ff57f27f2ebaf485af66509e5aa8d

See more details on using hashes here.

Supported by

AWS AWS Cloud computing and Security Sponsor Datadog Datadog Monitoring Fastly Fastly CDN Google Google Download Analytics Microsoft Microsoft PSF Sponsor Pingdom Pingdom Monitoring Sentry Sentry Error logging StatusPage StatusPage Status page