pypresident

PaiRwisE Sequence IDENtiTy. Calculate pairwise nucleotide identitywith respect to a reference sequence.

These details have not been verified by PyPI

Project links

GitHub Statistics

View statistics for this project via Libraries.io, or by using our public dataset on Google BigQuery

Project description

PRESIDENT: PaiRwisE Sequence IDENtiTy

Calculate pairwise nucleotide identity with respect to a reference sequence.

Given a reference and a query sequence, calculate pairwise nucleotide identity with respect to the reference sequence relative to the entire length of the reference. In the main metric, only informative nucleotides (A, T, G, C) are considered identical to each other. The tool also provides some further metrics (e.g. regarding ambiguous 'N's) and splits the input FASTA into valid and failed FASTA files for further processing.

Installation:

To install president with conda, run the commands below:

conda create -y -n president -c bioconda -c conda-forge president
conda activate president

Note that pblat is a dependency and only runs on Linux. Alternatively, president can be installed with pip in an environment where pblat is in the PATH:

pip install president

Usage:

pypresident installs the package and the pairwise alignment can be run with the following console call:

president --query query.fasta --reference reference.fasta -x identity_threshold -n n_threshold -t threads -p /path/to/output/ -f prefix

To run an example, download a query FASTA and a reference FASTA from GitLab.

Run the alignment with the following command and identity of ACGT bases of 93% and maximal 5% Ns in the query. Note that multiple fasta sequences are allowed to be present in the query but not in the reference FASTA.

president -q NC_045512.2.20mis.fasta -r NC_045512.2.fasta -x 0.9 -n 0.05 -t 4 -p output/ -f test_

Output:

The script provides:

a tab-separated file with the below listed columns
a FASTA file with valid sequences
a FASTA file with invalid sequences

The separation between the valid and invalid bin is mainly based on the defined identity/n thresholds (-x, default: 0.9; -n, default: 0.05) and further sanity checks (non-IUPAC characters).

An example from the call described above can be found online.

The transposed version of the output is shown below.

variable	value
ACGT Nucleotide identity	0.9987
ACGT Nucleotide identity (ignoring Ns)	0.9994
ACGT Nucleotide identity (ignoring non-ACGTNs)	0.9994
Date	2021-01-20
Matches	29864
Mismatches	19
N_bases	20
Ngap	20
acgt_bases	29883
file_in_query	NC_045512.2.20mis.fasta
file_in_ref	NC_045512.2.fasta
iupac_bases	20
length_query	29903
length_reference	29903
non_iupac_bases	0
qc_all_valid	True
qc_is_empty_query	False
qc_post_align_pass_threshold	True
qc_post_aligned	True
qc_post_aligned_all_valid	True
qc_valid_length	True
qc_valid_nucleotides	True
qc_valid_number_n	True
query_description
query_index	0
query_name	NC_045512.2 Severe acute [...]
reference_name	NC_045512.2 Severe acute [...]

Definitions:

ACGT Nucleotide identity - Percentage of ACGT matches to the reference (# matches / max(sequence_lengths))
ACGT Nucleotide identity (ignoring Ns) - Percentage of ACGT matches to the reference (# matches / max(sequence_lengths) - #Ns in query)
ACGT Nucleotide identity (ignoring non-ACGTNs) - Percentage of ACGT matches to the reference (# matches / max(sequence_lengths) - #non-ACGTNs in query)
Date - the yyyy-mm-dd of the execution of the script
Matches - the number of matches in the alignment
Mismatches - the number of mismatches in the alignment
N_bases - the number of N bases
LongestNGap - Lenght of the longest N gap
acgt_bases - number of ACGT bases
file_in_query - the input query file name
file_in_ref - the input reference file name
non_iupac_bases - Number of non-IUPAC nucleotides in the query
qc_all_valid - True if all checks below are True
qc_is_empty_query - True if input query file is not empty
qc_post_align_pass_threshold - True if 'ACGT Nucleotide identity' is greater than the -x parameter
qc_post_aligned - Set to True if the sequence was aligned (can be False either because of initial checks or failed alignments)
qc_post_aligned_all_valid - True if all checks before alignment are True, alignment is only performed if this value is True
qc_valid_length - True if the query is actually able to achieve the -x score even if the query is shorter/ longer
qc_valid_nucleotides - True if only valid IUPAC characters are in the query
qc_valid_pass_nthreshold - True if -n percentage of Ns in the query is not exceeded
query_description - query description, if available
query_index - the position of the sequence in the query fasta input file
query_name - FASTA ID of the query sequence
reference_name - FASTA ID of the reference sequence

Note: max(sequence_lengths) is equal to max(length_query, length_reference).

Notes:

nextstrain uses a quality threshold of < 3000 non-canonical nucleotides

ANI definition:

https://pubmed.ncbi.nlm.nih.gov/17220447/

Project details

These details have not been verified by PyPI

Project links

GitHub Statistics

View statistics for this project via Libraries.io, or by using our public dataset on Google BigQuery

Release history Release notifications | RSS feed

0.6.8

Nov 22, 2022

0.6.7

Nov 10, 2022

0.6.7b4 pre-release

Nov 9, 2022

0.6.7b3 pre-release

Nov 9, 2022

0.6.7b2 pre-release

Nov 9, 2022

0.6.7b1 pre-release

Nov 9, 2022

0.6.6

Feb 3, 2022

0.6.5

Feb 3, 2022

0.6.3

Apr 23, 2021

0.6.2

Apr 23, 2021

0.6.1

Apr 22, 2021

0.6.0

Mar 3, 2021

0.5.2

Feb 24, 2021

0.5.1

Feb 20, 2021

0.5.0

Feb 13, 2021

This version

0.4.0

Feb 12, 2021

0.3.0

Jan 29, 2021

0.2.2

Jan 27, 2021

0.2.1

Jan 21, 2021

0.2.0

Jan 20, 2021

0.1.1

Jan 15, 2021

0.1.0

Jan 15, 2021

Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

pypresident-0.4.0.tar.gz (33.0 kB view hashes)

Uploaded Feb 12, 2021 Source

Built Distribution

pypresident-0.4.0-py3-none-any.whl (15.7 kB view hashes)

Uploaded Feb 12, 2021 Python 3

Hashes for pypresident-0.4.0.tar.gz

Hashes for pypresident-0.4.0.tar.gz
Algorithm	Hash digest
SHA256	`b4f013f4a99b508febc0c38369e75d82f056941fa40d3b506ebfe15f8de982a7`
MD5	`9753dae338ee2a544eefdfb57a9b8c37`
BLAKE2b-256	`ea9d38c236cd49096eb9c064eb9c2c60efc919c185c86be42a6de47b2828f791`

Hashes for pypresident-0.4.0-py3-none-any.whl

Hashes for pypresident-0.4.0-py3-none-any.whl
Algorithm	Hash digest
SHA256	`0fff0d27a2135ef4c4def0a2f15d840a726dc351538fdaed818372ecd3008286`
MD5	`ffb6c408d6414b4f8087a77914fb1494`
BLAKE2b-256	`302582fd27b9408f50277f25e4c5dab9f749d64f5c72f079e50f915b4614160c`