Quality control for phylogenetic pipelines using pytest
Phytest: Quality Control for Phylogenetic Analyses.
Documentation: https://phytest-devs.github.io/phytest
Code: https://github.com/phytest-devs/phytest
Tutorials: https://github.com/phytest-devs?q=example
Installation
Install phytest using pip:
pip install phytest
Quick Start
Phytest is a tool for automating quality control checks on sequence, tree and metadata files during phylogenetic analyses. Phytest ensures that phylogenetic analyses meet user-defined quality control tests.
Here we will create example data files to run our tests on.
Create an alignment FASTA file, example.fasta:
>Sequence_A
ATGAGATCCCCGATAGCGAGCTAGCGATCGCAGCGACTCAGCAGCTACAGCGCAGAGGAGAGAGAGGCCCCTATTTACTAGAGCTCCAGATATAGNNTAG
>Sequence_B
ATGAGATCCCCGATAGCGAGCTAGXGATCGCAGCGACTCAGCAGCTACAGCGCAGAGGAGAGAGAGGCCCCTATTTACTAGAGCTCCAGATATAGNNTAG
>Sequence_C
ATGAGA--CCCGATAGCGAGCTAGCGATCGCAGCGACTCAGCAGCTACAGCGCAGAGGAGAGAGAGGCCCCTATTTACTAGAGCTCCAGATATAGNNTAG
>Sequence_D
ATGAGATCCCCGATAGCGAGCTAGCGATNNNNNNNNNNNNNNNNNTACAGCGCAGAGGAGAGAGAGGCCCCTATTTACTAGAGCTCCAGATATAGNNTAG
Create a Newick tree file, example.tree:
(Sequence_A:1,Sequence_B:0.2,(Sequence_C:0.3,Sequence_D:0.4):0.5);
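Before writing any tests, it can help to confirm that the two files refer to the same samples. The sketch below is not part of phytest; it uses only the Python standard library, the helper names `fasta_names` and `newick_tip_names` are hypothetical, and the sequences are shortened stand-ins for the ones above. The regular expression only handles simple Newick strings like this one, where every tip has a branch length.

```python
import re

# Shortened stand-ins for example.fasta and example.tree
# (sequences truncated for brevity; names match the files above).
fasta_text = """>Sequence_A
ATGAGATCCC
>Sequence_B
ATGAGATCCC
>Sequence_C
ATGAGA--CC
>Sequence_D
ATGAGATCCC
"""
newick_text = "(Sequence_A:1,Sequence_B:0.2,(Sequence_C:0.3,Sequence_D:0.4):0.5);"


def fasta_names(text: str) -> list[str]:
    """Return record names taken from FASTA header lines (text after '>')."""
    return [line[1:].split()[0] for line in text.splitlines() if line.startswith(">")]


def newick_tip_names(text: str) -> list[str]:
    """Return tip labels from a simple Newick string.

    A tip label follows '(' or ',' and ends at ':'. Internal branch
    lengths follow ')' and are therefore skipped. Quoted labels,
    comments and tips without branch lengths are not handled.
    """
    return re.findall(r"[(,]([^():,;]+):", text)


names_in_alignment = fasta_names(fasta_text)
names_in_tree = newick_tip_names(newick_text)

# Report whether both files contain exactly the same set of names.
print(sorted(names_in_alignment) == sorted(names_in_tree))
```

For real data, a proper parser such as the one phytest itself relies on is the safer choice; this sketch only illustrates what "same names" means.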
Writing a test file
We want to enforce the following constraints on our data:
- The alignment has 4 sequences
- The sequences have a length of 100
- The sequences only contain the characters A, T, G, C, N and -
- The sequences contain only single-base deletions (no gap longer than 1)
- The longest stretch of Ns is at most 10
- The tree has 4 tips
- The tree is bifurcating
- The alignment and tree have the same names
- All internal branches are longer than the given threshold
- There are no outlier branches in the tree
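To make the sequence-level constraints concrete, here is a minimal standard-library sketch of what "valid alphabet" and "longest stretch" mean. It is an illustration only, not phytest's implementation: the helpers `has_valid_alphabet` and `longest_run` and the toy sequences are hypothetical.

```python
from itertools import groupby


def has_valid_alphabet(seq: str, alphabet: str = "ATGCN-") -> bool:
    """True if every character of seq is in the allowed alphabet."""
    return set(seq) <= set(alphabet)


def longest_run(seq: str, char: str) -> int:
    """Length of the longest uninterrupted run of char in seq (0 if absent)."""
    runs = (sum(1 for _ in group) for key, group in groupby(seq) if key == char)
    return max(runs, default=0)


# Toy sequences (hypothetical), each violating at most one constraint.
toy = {
    "clean": "ATGAGATCCCCGATAG",
    "bad_character": "GCTAGXGATC",           # contains 'X'
    "two_base_gap": "ATGAGA--CCCG",          # gap of length 2
    "long_N_run": "CGAT" + "N" * 12 + "TACA",  # 12 consecutive Ns
}

for name, seq in toy.items():
    print(
        name,
        has_valid_alphabet(seq),       # alphabet constraint
        longest_run(seq, "-") <= 1,    # single-base deletions only
        longest_run(seq, "N") <= 10,   # at most 10 consecutive Ns
    )
```

Each toy sequence corresponds to one kind of problem planted in the example alignment: an unexpected character, a gap longer than one base, and a long run of Ns.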
We can write these tests in a Python file, example.py:
from phytest import Alignment, Sequence, Tree

def test_alignment_has_4_sequences(alignment: Alignment):
    alignment.assert_length(4)

def test_alignment_has_a_width_of_100(alignment: Alignment):
    alignment.assert_width(100)

def test_sequences_only_contains_the_characters(sequence: Sequence):
    sequence.assert_valid_alphabet(alphabet="ATGCN-")

def test_single_base_deletions(sequence: Sequence):
    sequence.assert_longest_stretch_gaps(max=1)

def test_longest_stretch_of_Ns_is_10(sequence: Sequence):
    sequence.assert_longest_stretch_Ns(max=10)

def test_tree_has_4_tips(tree: Tree):
    tree.assert_number_of_tips(4)

def test_tree_is_bifurcating(tree: Tree):
    tree.assert_is_bifurcating()

def test_aln_tree_match_names(alignment: Alignment, tree: Tree):
    aln_names = [i.name for i in alignment]
    tree.assert_tip_names(aln_names)

def test_all_internal_branches_lengths_above_threshold(tree: Tree, threshold=1e-4):
    tree.assert_internal_branch_lengths(min=threshold)

def test_outlier_branches(tree: Tree):
    # Here we create a custom function to detect outliers
    import statistics
    tips = tree.get_terminals()
    branch_lengths = [t.branch_length for t in tips]
    cut_off = statistics.mean(branch_lengths) + statistics.stdev(branch_lengths)
    for tip in tips:
        assert tip.branch_length < cut_off, f"Outlier tip '{tip.name}' (branch length = {tip.branch_length})!"
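The cut-off in test_outlier_branches is the mean of the tip branch lengths plus one sample standard deviation. As a worked example, the sketch below applies the same arithmetic to the tip lengths written in example.tree, using only the standard library. The dictionary `tip_lengths` is typed in by hand for illustration rather than read from the file.

```python
import statistics

# Tip branch lengths copied by hand from example.tree.
tip_lengths = {
    "Sequence_A": 1.0,
    "Sequence_B": 0.2,
    "Sequence_C": 0.3,
    "Sequence_D": 0.4,
}

lengths = list(tip_lengths.values())

# Same rule as test_outlier_branches: mean + one sample standard deviation.
cut_off = statistics.mean(lengths) + statistics.stdev(lengths)

# A tip fails the test when its branch length is not below the cut-off.
outliers = [name for name, length in tip_lengths.items() if length >= cut_off]

print(f"cut-off = {cut_off:.3f}")
print(outliers)
```

With these values the mean is 0.475 and the sample standard deviation is about 0.359, so the cut-off is about 0.834 and only Sequence_A (branch length 1.0) exceeds it. One standard deviation is a strict threshold for so few tips; a larger multiplier may suit real trees better.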
Running Phytest
We can then run these tests on our data with phytest:
phytest examples/example.py -s examples/data/example.fasta -t examples/data/example.tree
Generate a report by adding --report report.html.
From the output we can see several tests failed:
FAILED examples/example.py::test_sequences_only_contains_the_characters[Sequence_B] - AssertionError: Invalid pattern found in 'Sequence_B'!
FAILED examples/example.py::test_single_base_deletions[Sequence_C] - AssertionError: Longest stretch of '-' in 'Sequence_C' > 1!
FAILED examples/example.py::test_longest_stretch_of_Ns_is_10[Sequence_D] - AssertionError: Longest stretch of 'N' in 'Sequence_D' > 10!
FAILED examples/example.py::test_outlier_branches - AssertionError: Outlier tip 'Sequence_A' (branch length = 1.0)!
Results (0.07s):
15 passed
4 failed
- examples/example.py:12 test_sequences_only_contains_the_characters[Sequence_B]
- examples/example.py:16 test_single_base_deletions[Sequence_C]
- examples/example.py:20 test_longest_stretch_of_Ns_is_10[Sequence_D]
- examples/example.py:32 test_outlier_branches
See the docs for more information: https://phytest-devs.github.io/phytest
Citation
If you use phytest, please cite the following paper:
Wytamma Wirth, Simon Mutch, Robert Turnbull, Sebastian Duchene, Phytest: quality control for phylogenetic analyses, Bioinformatics, Volume 38, Issue 22, 15 November 2022, Pages 5124–5125, https://doi.org/10.1093/bioinformatics/btac664
@article{10.1093/bioinformatics/btac664,
author = {Wirth, Wytamma and Mutch, Simon and Turnbull, Robert and Duchene, Sebastian},
title = "{{Phytest: quality control for phylogenetic analyses}}",
journal = {Bioinformatics},
volume = {38},
number = {22},
pages = {5124-5125},
year = {2022},
month = {10},
issn = {1367-4803},
doi = {10.1093/bioinformatics/btac664},
url = {https://doi.org/10.1093/bioinformatics/btac664},
eprint = {https://academic.oup.com/bioinformatics/article-pdf/38/22/5124/47153886/btac664.pdf},
}