pypresident

PaiRwisE Sequence IDENtiTy. Calculate pairwise nucleotide identity with respect to a reference sequence.

PRESIDENT: PaiRwisE Sequence IDENtiTy

Calculate pairwise nucleotide identity with respect to a reference sequence.

Given a reference and a query sequence (which can be fragmented), calculate the pairwise nucleotide identity of the query with respect to the reference, relative to the entire length of the reference. Only informative nucleotides (A, T, G, C) are counted as identical to each other.

Requirements:

To get president running, follow the steps below. Note that pblat only runs on Linux.

conda create -y -n president -c bioconda python=3.8 pblat
conda activate president
pip install pypresident

Usage:

Installing pypresident provides the president command. The pairwise alignment can then be run with the following console call:

president --query query.fasta --reference reference.fasta -x identity_threshold -p threads -o output.tsv

To run an example, download a query FASTA and a reference FASTA from GitLab.

Run the alignment with the following command, requiring an identity of ACGT bases of 93%. Note that the query may contain multiple FASTA sequences, but the reference may not.

president --query NC_045512.2.20mis.fasta --reference NC_045512.2.fasta -x 0.93 -p 8 -o report.tsv
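For scripted runs, the same call can be assembled from Python. The sketch below is only a thin wrapper around the console call shown above. The helper name build_president_command is hypothetical and not part of pypresident, and it uses only the flags that appear in this README.

```python
import shutil
import subprocess
from pathlib import Path


def build_president_command(query, reference, threshold, threads, output):
    """Return the argument list for the `president` console call.

    Hypothetical helper: only the flags documented in this README are used.
    """
    return [
        "president",
        "--query", str(query),
        "--reference", str(reference),
        "-x", str(threshold),
        "-p", str(threads),
        "-o", str(output),
    ]


cmd = build_president_command(
    "NC_045512.2.20mis.fasta", "NC_045512.2.fasta", 0.93, 8, "report.tsv"
)
print(" ".join(cmd))

# Only attempt the run when the tool is installed and both inputs exist.
inputs_present = all(Path(p).exists() for p in (cmd[2], cmd[4]))
if shutil.which("president") and inputs_present:
    subprocess.run(cmd, check=True)
```

Passing the arguments as a list avoids shell quoting problems with file names, and check=True raises an error if the command exits with a non-zero status.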

Output:

The script provides a tab-separated file with the following columns. An example from the call described above can be found online.

The transposed version of the output is shown below.

variable	value
ID	NC_045512.2 Severe acute respiratory syndrome coronavirus 2 isolate Wuhan-Hu-1, complete genome
Valid	True
Identity	0.9987
Ambiguous Identity	0.9994
Ambiguous Bases	20.0
Query Length	29903
passed_initial_qc	True
aligned	True
reference_length	29903
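The report can be post-processed with the Python standard library. The following is a minimal sketch, assuming the header row of the file uses the variable names listed above (ID, Valid, Identity, ...) and that booleans are written as True and False. The function names and the two sample rows are made up for illustration.

```python
import csv
import io


def read_report(handle):
    """Parse a tab-separated report into a list of dicts keyed by column name."""
    return list(csv.DictReader(handle, delimiter="\t"))


def valid_ids(rows):
    """Return the IDs of rows whose 'Valid' column reads 'True'."""
    return [row["ID"] for row in rows if row["Valid"].strip() == "True"]


# Tiny made-up report for illustration; a real one comes from `-o report.tsv`.
example = io.StringIO(
    "ID\tValid\tIdentity\n"
    "seq_a\tTrue\t0.9987\n"
    "seq_b\tFalse\t0.9100\n"
)
rows = read_report(example)
print(valid_ids(rows))
```

For a real report, open the file with open("report.tsv", newline="") and pass the handle to read_report instead of the in-memory example.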

Definitions:

ID - the fasta sequence name in the query
Valid - True if 'Identity' is greater than the -x parameter
Identity - Fraction of ACGT matches to the reference (# matches / max(sequence_lengths))
Ambiguous Identity - Fraction of ACGT matches to the reference, not counting Ns in the query (# matches / (max(sequence_lengths) - #Ns in query))
Ambiguous Bases - Number of N nucleotides in the query.
Query Length - Sequence length of the query.
passed_initial_qc - True if the sequence is long enough / has enough ACGT nucleotides (instead of Ns) to reach the identity threshold
aligned - True if the sequence got aligned with pblat
reference_length - Sequence length of the reference.

Note: max(sequence_lengths) is equal to max(length_query, length_reference).
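The definitions above can be written out as a short sketch. It is not taken from the pypresident source. It reads the Ambiguous Identity denominator as max(sequence_lengths) minus the number of Ns in the query, which agrees with the example values above (Identity 0.9987 and 20 ambiguous bases against Ambiguous Identity 0.9994). The can_reach_threshold check is an assumed reading of passed_initial_qc, and all function names are hypothetical.

```python
def identity(matches, query_length, reference_length):
    """# matches / max(sequence_lengths)."""
    return matches / max(query_length, reference_length)


def ambiguous_identity(matches, query_length, reference_length, n_count):
    """# matches / (max(sequence_lengths) - #Ns in query)."""
    return matches / (max(query_length, reference_length) - n_count)


def can_reach_threshold(acgt_count, query_length, reference_length, threshold):
    """Upper bound check (assumed): if every ACGT base in the query matched,
    would Identity reach the threshold?"""
    return identity(acgt_count, query_length, reference_length) >= threshold


# Illustrative numbers only: 90 matching bases, 5 Ns, both sequences 100 long.
print(identity(90, 100, 100))
print(round(ambiguous_identity(90, 100, 100, 5), 4))
print(can_reach_threshold(95, 100, 100, 0.93))
```

A check of this kind lets a query with too many Ns be rejected before the alignment step, since its Identity could not reach the threshold even with a perfect alignment.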

Notes:

Nextstrain uses a quality threshold of < 3000 non-canonical nucleotides.
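A threshold of this kind can be checked with a simple count. The sketch below treats every character other than A, C, G and T as non-canonical, which is an assumption (for example about how gap characters are handled), and the function names are hypothetical.

```python
def count_non_canonical(sequence):
    """Count characters other than A, C, G, T (case-insensitive)."""
    return sum(1 for base in sequence.upper() if base not in "ACGT")


def passes_non_canonical_limit(sequence, limit=3000):
    """True if the sequence has fewer than `limit` non-canonical nucleotides."""
    return count_non_canonical(sequence) < limit


# N, N, R and Y are the four non-canonical characters in this toy sequence.
toy = "ACGTNNRYacgt"
print(count_non_canonical(toy))
print(passes_non_canonical_limit(toy))
```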

ANI definition:

https://pubmed.ncbi.nlm.nih.gov/17220447/
