Integrated pipeline for ML phylogenetic inference from ABI trace and FASTA data
AB12PHYLO
AB12PHYLO is an integrated, easy-to-use pipeline for Maximum Likelihood (ML) phylogenetic tree inference from ABI trace and FASTA
data.
At its core, AB12PHYLO runs parallelized instances of RAxML-NG (Kozlov et al. 2019) and a BLAST search in a reference database.
It enables visual, effortless sample identification based on phylogenetic position and sequence similarity, as well as population subset selection aided by metrics like Tajima's D for estimations of ongoing evolution.
AB12PHYLO was developed to identify plant pathogen populations possibly under balancing selection with Solanum chilense, especially in the genus Alternaria. As multi-gene phylogenies are still a widely used method in spite of the rise of whole-genome sequencing, future application to fungal phytopathogens or host plants might be possible.
Installation
The recommended way to install AB12PHYLO is via conda, as all external tools are available from the Bioconda channel, but any of the following approaches should work:
a) install to an existing python3 conda environment
conda activate <env>
conda install -c lkndl ab12phylo
where <env> is your environment. Please do not install to base! Then install a combination of the external tools you would like to use via
conda install -c bioconda "blast>=2.9.0" raxml-ng "gblocks=0.91b" mafft clustalo muscle t-coffee
Mind that some of these tools are not available for Windows as of 12 March 2021.
If starting the graphical ab12phylo fails with something like ValueError: Namespace Gtk not available or ModuleNotFoundError: No module named 'gi', or if nothing happens at all (on Windows), you are missing PyGObject, the python bindings for GTK3:
conda install -c conda-forge pygobject gtk3
If all the icons in the GUI are missing, install some:
conda install -c conda-forge adwaita-icon-theme hicolor-icon-theme
If you get an UnsatisfiableError in conda because of incompatible packages, please use the next approach:
b) install to a newly created python3 conda environment
You could run conda create -n <env> python=3.x (with x being 6, 7 or 8) and proceed as in a). Alternatively, download the environment file for your platform (Linux or Windows), then open a terminal or Anaconda Powershell in your download folder and set up the environment specified inside via
conda env create -f gtk-env.yaml
and then install AB12PHYLO:
conda activate gtk-env
conda install -c lkndl ab12phylo
c) install via pip from PyPI to your system python
This is possibly the method with the most freedom for users on Ubuntu or any Linux with GTK installed per default, but make sure you are definitely not in a conda environment, i.e. there is no (<env>) to the left of your shell prompt like this:
(<env>) foo@bar:~$ conda deactivate
foo@bar:~$
Install an MSA tool of your choice and possibly BLAST+ by yourself, then AB12PHYLO and all its python3 dependencies via
pip install ab12phylo
If starting ab12phylo fails with No module named 'gi', follow the PyGObject installation instructions. Please be careful, python is probably important for your system!
d) build AB12PHYLO from source using git and pip
Should work on any system and will always install the latest version:
git clone https://github.com/lkndl/ab12phylo
cd ab12phylo
pip install --upgrade pip
pip install .
Please see a) and c) if something goes wrong.
e) install via pip inside a conda environment
If you are on Linux and would like to install via pip inside your conda <env>, please check which python and which pip is active inside the environment (where you will see (<env>) to the left of your shell prompt, not shown above).
(<env>) foo@bar:~$ which python
/home/foo/anaconda3/envs/<env>/bin/python
(<env>) foo@bar:~$ which pip
/usr/bin/pip
(<env>) foo@bar:~$ which pip3
/home/foo/anaconda3/envs/<env>/bin/pip3
In the case outlined here, pip points to a version outside your conda installation, so use pip3. If neither points to your conda, re-start your shell and check the environment. Sometimes there is no pip at all, which can be fixed on Linux using your package manager (or sudo apt-get install python3-pip on Ubuntu), or in conda via conda install -c conda-forge pip.
If you see No module named 'gi', make sure to fix that with conda as described in a).
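The checks above can also be done from inside Python. The following is a minimal sketch, not part of AB12PHYLO: the function name describe_environment is made up for illustration. It reports which interpreter is running, where pip and pip3 resolve to, whether they live below the active environment, and whether the gi module (PyGObject) can be imported.

```python
import importlib.util
import shutil
import sys
from pathlib import Path


def describe_environment():
    """Report interpreter, pip locations and PyGObject availability (illustrative helper)."""
    prefix = Path(sys.prefix).resolve()
    report = {"python": sys.executable, "prefix": str(prefix)}
    for tool in ("pip", "pip3"):
        found = shutil.which(tool)
        report[tool] = found
        # True only if the executable lives somewhere below the active prefix
        report[f"{tool}_in_env"] = bool(found) and prefix in Path(found).resolve().parents
    # find_spec returns None if the top-level module 'gi' cannot be located
    report["gi_available"] = importlib.util.find_spec("gi") is not None
    return report


if __name__ == "__main__":
    for key, value in describe_environment().items():
        print(f"{key}: {value}")
```

Run it with the python of the environment you intend to install into. If pip_in_env is False but pip3_in_env is True, this matches the situation outlined above.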
Getting Started
AB12PHYLO has both a graphical and a commandline interface: ab12phylo and ab12phylo-cmd. The graphical version has its own help, which you will see if you start it with ab12phylo from your terminal.
The following tutorial is intended for the commandline tool. First, take a look at the options by running ab12phylo-cmd -h.
Test run
The pipeline comes with its own test data set. If you pass -test, it will read options from an auxiliary (or backup) config file at <ab12phylo_root>/ab12phylo_cmd/config/test_config.yaml and run on these. The test run is set to --verbose and runs its BLAST search with --no_remote.
Basic options
A simple real-world invocation might look like this:
ab12phylo-cmd -abi <seq_dir> \
-csv <wellsplates_dir> \
-g <barcode_gene> \
-rf <ref.fasta> \
-bst 1000 \
-dir <results>
where:
- <seq_dir> contains all input ABI trace files, ending in .ab1
- <wellsplates_dir> contains the .csv mappings of user-defined IDs to sequencer's isolate coordinates
- <barcode_gene> is the gene that was sequenced
- <ref.fasta> contains full GenBank reference records
- -bst = --bootstrap: 1000 bootstrap trees will be generated
- <results> is where results will be written
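If you prefer to launch the pipeline from a script, the invocation above can be assembled as an argument list. This is only a sketch: build_basic_command and the example paths are hypothetical, and only the flags shown in the basic invocation are used.

```python
import shlex


def build_basic_command(seq_dir, wellsplates_dir, gene, ref_fasta, results_dir, bootstrap=1000):
    """Assemble the argument list for the basic invocation (illustrative helper)."""
    return [
        "ab12phylo-cmd",
        "-abi", str(seq_dir),
        "-csv", str(wellsplates_dir),
        "-g", gene,
        "-rf", str(ref_fasta),
        "-bst", str(bootstrap),
        "-dir", str(results_dir),
    ]


# Placeholder paths; replace them with your own directories and files
cmd = build_basic_command("traces", "wellsplates", "ITS1F", "ref.fasta", "results")
print(shlex.join(cmd))

# To actually run it (requires AB12PHYLO to be installed):
# import subprocess
# subprocess.run(cmd, check=True)
```

Passing a list instead of a single string avoids shell quoting problems with paths that contain spaces.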
Detailed settings
ab12phylo-cmd has a lot of defaults, but still allows fine-grained access:
ab12phylo-cmd -rf <ref.fasta> \
-db <your_own> \
-dbpath <your_dir> \
-abiset <whitelist> \
-regex_3 <"rx1" "rx2" "rx3"> \
-algo <mafft-clustalo-muscle-tcoffee> \
-gbl balanced \
-local \
-i \
-p1
ab12phylo-cmd -p2 \
-bst 1000 \
-st [32,16] \
-s 4 \
-v
- default: AB12PHYLO will search for .ab1 and .csv files in or below the current working directory
- default: use the ./results subdirectory
- default: use the GTR+Γ model of evolution
- use <your_own> BLAST+ database; in <your_dir>
- only trace files listed in the <whitelist> will be read
- plate number, gene name and well will be parsed from the .ab1 filename using these three RegEx
- -algo will generate the MSA: mafft, clustalo, muscle or t_coffee
- -gbl sets Gblocks MSA trimming mode: skip, relaxed, balanced or strict
- -local skips online BLAST for sequences not in the local BLAST+ db
- -i or --info shows some more run details in the console
- -p1 run only part one, up until BLAST; also generates an MView HTML of the MSA to find funny sequences
For the second invocation:
- -p2 run part two of AB12PHYLO, starting with RAxML-NG
- -st: ML tree searches from 32 random and 16 parsimony-based starting trees
- -s fixes the random --seed to 4 for reproducibility
- -v or --verbose shows all logged events in the console
Results + Motif Search
You can import results from ab12phylo-cmd into the graphical ab12phylo. It's faster, easier and more capable when it comes to viewing and modifying trees. Open a new window/project, press Ctrl+i or Import from commandline version and select the folder with your commandline results.
Once ML tree inference, bootstrapping and BLAST have finished from the commandline, the pipeline will display a result.html in your web browser. This page contains a form that allows motif search across node attributes and calculates diversity metrics for the matching subset/subtree. Entering a space ' ' should match all samples, and entering multiple motifs separated by commas will select all leaves that match at least one motif.
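The "at least one motif" rule can be illustrated with a few lines of Python. This is a sketch under assumptions, not the CGI script shipped with the package: it matches plain substrings against leaf names only, whereas the real search runs across node attributes, and the leaf labels are made up. The special case of a single space matching all samples is not reproduced here.

```python
def match_leaves(leaf_names, query):
    """Return leaves whose name contains at least one comma-separated motif."""
    # Assumption: motifs are plain substrings; surrounding whitespace is ignored
    motifs = [m.strip() for m in query.split(",") if m.strip()]
    return [name for name in leaf_names if any(m in name for m in motifs)]


# Hypothetical leaf labels
leaves = ["Alternaria_solani_01", "Alternaria_alternata_02", "Stemphylium_sp_03"]
print(match_leaves(leaves, "solani, Stemphylium"))
# ['Alternaria_solani_01', 'Stemphylium_sp_03']
```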
To select a subtree, enter a motif that matches several leaves or at least two separate specific leaf motifs, with one of them as far left as possible. Entering a single, specific motif for a subtree search can be used to find the index of a node for tree modifications like rooting.
Once you hit MATCH or SUBTREE, the CGI script in the package will highlight the selection in the tree, as well as compute and display diversity statistics for this 'population'. It will also write and link a file with all selected sample IDs or the paths of the original .ab1 files in the /query subdirectory, which can be used as a whitelist file for a subsequent subset analysis run. Pass it via -sampleset if using sample IDs, or via -abiset if using file paths. File paths are recommended to reliably exclude outlier versions with identical base ID.
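For readers unfamiliar with Tajima's D, one of the diversity metrics mentioned in the introduction, here is a minimal sketch of the standard formula (Tajima 1989). It is not necessarily how AB12PHYLO computes the statistic. It assumes equal-length aligned sequences, treats every character (including gaps) as an ordinary state, and returns None for fewer than four sequences or no segregating sites, where the statistic is undefined.

```python
from itertools import combinations
from math import sqrt


def tajimas_d(seqs):
    """Tajima's D for equal-length aligned sequences; None if undefined."""
    n = len(seqs)
    if n < 4:
        return None
    # Number of segregating sites: alignment columns with more than one state
    seg_sites = sum(1 for col in zip(*seqs) if len(set(col)) > 1)
    if seg_sites == 0:
        return None
    # pi: mean number of pairwise differences
    pairs = list(combinations(seqs, 2))
    pi = sum(sum(a != b for a, b in zip(s, t)) for s, t in pairs) / len(pairs)
    # Constants from Tajima (1989)
    a1 = sum(1 / i for i in range(1, n))
    a2 = sum(1 / i ** 2 for i in range(1, n))
    b1 = (n + 1) / (3 * (n - 1))
    b2 = 2 * (n * n + n + 3) / (9 * n * (n - 1))
    c1 = b1 - 1 / a1
    c2 = b2 - (n + 2) / (a1 * n) + a2 / a1 ** 2
    e1 = c1 / a1
    e2 = c2 / (a1 ** 2 + a2)
    variance = e1 * seg_sites + e2 * seg_sites * (seg_sites - 1)
    return (pi - seg_sites / a1) / sqrt(variance)


# An intermediate-frequency variant gives a positive D, a singleton a negative D
print(tajimas_d(["ACGA", "ACGA", "ACGT", "ACGT"]) > 0)
print(tajimas_d(["ACGA", "ACGT", "ACGT", "ACGT"]) < 0)
```

Positive values indicate an excess of intermediate-frequency variants, which is the pattern expected under balancing selection; negative values indicate an excess of rare variants.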
If results are moved or sent, motif search will still be possible by using ab12phylo-view or starting a CGI server in the directory via python3 -m http.server --cgi <port>. Find the port on the intro tab.
BLAST
If none of the smaller BLAST+ databases are sufficient for your search, you are better off with a web BLAST. Collate your data by running -p1 --no_BLAST, then upload the <gene>/<gene>.fasta for the gene you wish to use for species annotation. Pass the result via --BLAST_xml. You can pass more than one file; AB12PHYLO will use the last annotation for a sample and ignore everything but the first hit.
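The precedence rule in the last sentence can be pictured as follows. This sketch does not parse BLAST XML and is not the package's code: it assumes each result file has already been reduced to a hypothetical mapping of sample ID to an ordered list of hits.

```python
def merge_annotations(result_files):
    """Keep the first hit per sample; files later in the list override earlier ones.

    result_files: list of {sample_id: [hit1, hit2, ...]} dicts, in the order
    the files were passed (hypothetical, simplified data structure).
    """
    annotation = {}
    for hits_by_sample in result_files:
        for sample, hits in hits_by_sample.items():
            if hits:
                # Only the first hit is used, everything else is ignored
                annotation[sample] = hits[0]
    return annotation


first = {"s1": ["Alternaria solani", "Alternaria alternata"], "s2": ["Alternaria alternata"]}
second = {"s1": ["Alternaria linariae"]}
print(merge_annotations([first, second]))
# {'s1': 'Alternaria linariae', 's2': 'Alternaria alternata'}
```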
BLAST API
AB12PHYLO is perfectly capable of running online BLAST searches without a BLAST+ installation. However, this is not a suitable main BLAST strategy as BLAST API queries are de-prioritised after just a few attempts. Accordingly, if several runs are attempted on the same data set, passing the --no_remote flag will use data from an earlier run by default. You can also leave out species annotation by skipping BLAST entirely with --no_BLAST.
BLAST+ Database
Sometimes, this pipeline might run headless on a server. To keep it from running head-first into a firewall when it attempts to update its BLAST+ database via FTP, please pre-supply an unzipped, ready-to-use BLAST+ database via -dbpath (and -db name). Find databases on the NCBI website, but web BLAST and --BLAST_xml might be both easier and faster.
Visualization
This is easier in the graphical ab12phylo. If you'd like to stick to the commandline:
ab12phylo-visualize + ab12phylo-view
ab12phylo-visualize will re-plot phylogenies and render a new results.html. An end user may use this to switch support values or plot an MSA visualization with -msa_viz. This will take some rendering time for wider alignments.
ab12phylo-view shows results of a previous run in a browser, with motif search enabled. Both commands accept a path to the AB12PHYLO results or default to the current directory (.), and are equivalent to appending to the original ab12phylo-cmd call.
ab12phylo-view <result_folder>
# or
cd <results_folder>
ab12phylo-view
# or
ab12phylo-cmd -c <my-config.yaml> -bst 1000 (...) -view
Support Values
For visualization, you can pick either Felsenstein Bootstrap Proportions (FBP) or Transfer Bootstrap Expectation (TBE) support values with -metric. Newick tree files for both types are generated anyway, so switching the support value metric only requires re-running ab12phylo-visualize.
MSA visualization
AB12PHYLO can plot an additional rectangular tree with an MSA visualization next to it by passing -msa_viz. This is single-threaded and takes some extra time for larger MSAs. If you run -p1, you can already inspect the MView HTML of the alignment for outliers without having to wait for the entire pipeline.
Tree modifications
You can drop nodes with -drop_nodes, replace nodes or subtrees with a placeholder via -replace_nodes, or root the tree with an outgroup sequence using -root. These options accept integer node indices from motif search.
Also see the VISUALIZATION section of --help.
Advanced use
Remote runs
tmux and --headless are highly recommended for remote runs, as well as pre-supplying a BLAST+ db, adapting or replacing the YAML config file and setting a fixed seed for reproducibility.
Genes and References
If data for several genes is supplied, AB12PHYLO will restrict the analysis to samples that are present for all genes. If this causes a lot of samples to be dropped, it might be worth leaving out a gene entirely by setting -g = --genes manually. If you somehow have No samples shared across all genes, there might be some trace files that seemingly belong to another gene.
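The restriction to shared samples is a set intersection, and checking how many samples survive without each gene can help decide which gene to leave out. The sketch below is illustrative only: the function names, the sample IDs and the third gene name geneX are made up.

```python
def shared_samples(samples_by_gene):
    """Return the sorted sample IDs that are present for every gene."""
    sample_sets = [set(ids) for ids in samples_by_gene.values()]
    if not sample_sets:
        return []
    return sorted(set.intersection(*sample_sets))


def shared_without(samples_by_gene, gene):
    """Shared samples if one gene is left out of the analysis."""
    rest = {g: ids for g, ids in samples_by_gene.items() if g != gene}
    return shared_samples(rest)


# Hypothetical data: geneX was only sequenced for one sample
data = {
    "ITS1F": ["s1", "s2", "s3"],
    "OPA10": ["s2", "s3", "s4"],
    "geneX": ["s3"],
}
print(shared_samples(data))           # ['s3']
print(shared_without(data, "geneX"))  # ['s2', 's3']
```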
Passing references is closely inter-linked: If a directory of reference files is supplied via -rd = --ref_dir, the package will try to match the .FASTA files inside to genes by their filename. For example, ITS1F.fasta will be matched to trace data from the ITS1F gene. Alternatively, an ordered list of reference files can be passed via -rf = --ref, and file names will be ignored.
If you're feeling this neat and precise and set both the genes and individual references, be careful: The pipeline will then deliberately match references and genes by order. Therefore, this will make a mess:
-g ITS1F OPA10 -rf ../opa10.fasta ITS1F.phy
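The difference between the two matching modes can be sketched like this. The helpers are hypothetical and the case-insensitive comparison of file stems is an assumption; the package may match filenames differently.

```python
from pathlib import Path


def match_by_filename(genes, ref_files):
    """Pair each gene with the file whose stem equals the gene name (as with -rd)."""
    # Assumption: comparison ignores case and the file extension
    by_stem = {Path(f).stem.lower(): f for f in ref_files}
    return {gene: by_stem.get(gene.lower()) for gene in genes}


def match_by_order(genes, ref_files):
    """Pair genes and references purely by position (as with -g plus -rf)."""
    return dict(zip(genes, ref_files))


genes = ["ITS1F", "OPA10"]
refs = ["../opa10.fasta", "ITS1F.phy"]
print(match_by_filename(genes, refs))
# {'ITS1F': 'ITS1F.phy', 'OPA10': '../opa10.fasta'}
print(match_by_order(genes, refs))
# {'ITS1F': '../opa10.fasta', 'OPA10': 'ITS1F.phy'}
```

The second result shows the mess from the example above: each gene is paired with the reference of the other one.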
RegEx
If you provide wellsplates mappings, AB12PHYLO will parse plate number, gene name and the sequencer's isolate coordinates from the .ab1 filename with a RegEx and fetch the user-defined ID from the corresponding .csv look-up table. To use your own, please consult --help and test your RegEx beforehand. From a bash shell, enclose the terms with double quotes ". The wellsplate number or ID is also parsed from the .csv filename with a RegEx.
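A quick way to try out patterns is Python's re module. The three expressions and the filename layout below are made up for illustration and are not the package defaults (see --help for those). The lookup dictionary stands in for the .csv table.

```python
import re

# Hypothetical patterns for filenames like plate01_ITS1F_A01.ab1
PLATE_RX = re.compile(r"plate(\d+)")
GENE_RX = re.compile(r"_([A-Za-z0-9]+)_")
WELL_RX = re.compile(r"_([A-H]\d{2})\.ab1$")


def parse_trace_name(filename, lookup):
    """Extract plate, gene and well, then fetch the user-defined ID from lookup."""
    matches = [rx.search(filename) for rx in (PLATE_RX, GENE_RX, WELL_RX)]
    if not all(matches):
        return None  # at least one pattern did not match
    plate, gene, well = (m.group(1) for m in matches)
    return {
        "plate": plate,
        "gene": gene,
        "well": well,
        # lookup maps (plate, well) to the user-defined sample ID
        "sample_id": lookup.get((plate, well)),
    }


lookup = {("01", "A01"): "isolate_7"}
print(parse_trace_name("plate01_ITS1F_A01.ab1", lookup))
# {'plate': '01', 'gene': 'ITS1F', 'well': 'A01', 'sample_id': 'isolate_7'}
```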
Provide a RegEx to --regex_rev and AB12PHYLO will look out for reverse reads, and add only their reverse complement to the data set.
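What this means for a single read is sketched below. The helper names, the filename suffix and the pattern are hypothetical, and the complement table only covers A, C, G, T and N rather than the full IUPAC alphabet.

```python
import re

# Complement table for unambiguous bases plus N (assumption: no other IUPAC codes)
COMPLEMENT = str.maketrans("ACGTNacgtn", "TGCANtgcan")


def reverse_complement(seq):
    """Complement every base, then reverse the sequence."""
    return seq.translate(COMPLEMENT)[::-1]


def orient_read(filename, seq, regex_rev):
    """Return reverse reads as their reverse complement, forward reads unchanged."""
    if re.search(regex_rev, filename):
        return reverse_complement(seq)
    return seq


# Hypothetical convention: reverse reads end in _R.ab1
print(orient_read("plate01_ITS1F_A01_R.ab1", "AACGT", r"_R\.ab1$"))  # ACGTT
print(orient_read("plate01_ITS1F_A01.ab1", "AACGT", r"_R\.ab1$"))    # AACGT
```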
Log File
If you're having trouble, look at the log file! It will be in your results directory and named like ab12phylo[|-p1|-p2][-view|-viz]?.log. Alternatively, you can set the --verbose flag and get the same information in real-time to your commandline. Your choice!
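To locate the log files of a run programmatically, the naming scheme above can be turned into a regular expression. This is a sketch based only on the pattern quoted in this section; find_logs and tail are illustrative helpers.

```python
import re
from pathlib import Path

# ab12phylo, optionally followed by -p1 or -p2, optionally by -view or -viz
LOG_RX = re.compile(r"^ab12phylo(-p1|-p2)?(-view|-viz)?\.log$")


def find_logs(results_dir):
    """Return matching log files in results_dir, most recently modified first."""
    logs = [
        p for p in Path(results_dir).iterdir()
        if p.is_file() and LOG_RX.match(p.name)
    ]
    return sorted(logs, key=lambda p: p.stat().st_mtime, reverse=True)


def tail(path, n=20):
    """Return the last n lines of a log file."""
    return Path(path).read_text(errors="replace").splitlines()[-n:]
```

For example, print("\n".join(tail(find_logs("results")[0]))) shows the end of the newest log, assuming the results directory is called results and contains at least one log file.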
Dependencies
Biopython, NumPy, pandas, Toytree, Toyplot, matplotlib, PyYAML, lxml, xmltramp2, svgutils, Pillow, Requests and Jinja2
External Tools
- BLAST+ version >=2.9 (optional, recommended if small or custom database)
- RAxML-NG (optional, included)
- an MSA tool: MAFFT, Clustal Omega, MUSCLE or T-Coffee (optional, slow online clients for EMBL interface included)
- Gblocks for MSA trimming (optional, included)
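To see which of these tools are already on your PATH, a short check with shutil.which is enough. The executable names below are the usual ones for these programs but are an assumption here; since some tools are bundled with the package or available through online clients, a missing entry does not necessarily mean the pipeline will fail.

```python
import shutil

# Assumed executable names for the external tools listed above
TOOLS = {
    "BLAST+": ["blastn"],
    "RAxML-NG": ["raxml-ng"],
    "MSA tool": ["mafft", "clustalo", "muscle", "t_coffee"],
    "Gblocks": ["Gblocks"],
}


def find_tools(tools=TOOLS):
    """Map each tool group to the executables that were found on PATH."""
    return {
        label: [exe for exe in names if shutil.which(exe)]
        for label, names in tools.items()
    }


if __name__ == "__main__":
    for label, found in find_tools().items():
        print(f"{label}: {', '.join(found) if found else 'not found on PATH'}")
```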