Skip to main content

Integrated pipeline for ML phylogenetic inference from ABI trace and FASTA data

Project description

AB12PHYLO

PyPI license github version PyPI Python version

AB12PHYLO is an integrated, easy-to-use pipeline for Maximum Likelihood (ML) phylogenetic tree inference from ABI trace and FASTA data. At its core, AB12PHYLO runs parallelized instances of RAxML-NG (Kozlov et al. 2019) and a BLAST search in a reference database. It enables visual, effortless sample identification based on phylogenetic position and sequence similarity, as well as population subset selection aided by metrics like Tajima's D for estimations of ongoing evolution.

./ab12phylo/files/screen_04.png

AB12PHYLO was developed to identify plant pathogen populations possibly under balancing selection with Solanum chilense, especially in the genus Alternaria. With multi-gene phylogenies still a widely-used method in spite of the rise of whole-genome sequencing, future application on fungal phytopathogens or host plants might be possible.

Installation

The recommended way to install AB12PHYLO is via conda, as all external tools are available from the Bioconda channel, but any of the following approaches should work:

a) install to an existing python3 conda environment

conda activate <env>
conda install -c lkndl ab12phylo

where <env> is your environment. Please do not install to base! Then install a combination of the external tools you would like to use via

conda install -c bioconda "blast>=2.9.0" raxml-ng "gblocks=0.91b" mafft clustalo muscle t-coffee

Mind that some of these tools are not available for Windows as of 12 March 2021.

If starting the graphical ab12phylo fails with something like ValueError: Namespace Gtk not available, ModuleNotFoundError: No module named 'gi' or nothing happens at all (on Windows) you are missing PyGObject, the python bindings for GTK3:

conda install -c conda-forge pygobject gtk3  

If all the icons in the GUI are missing, install some:

conda install -c conda-forge adwaita-icon-theme hicolor-icon-theme  

If you get an UnsatisfiableError in conda because of incompatible packages, please use the next approach:

b) install to a newly created python3 conda environment

You could run conda create -n <env> python=3.x with x==6|7|8 and proceed as in a) or download this file for Linux or that file for Windows, then open a terminal or Anaconda Powershell in your download folder and set up the environment specified inside via

conda env create -f gtk-env.yaml

and then install AB12PHYLO:

conda activate gtk-env
conda install -c lkndl ab12phylo

c) install via pip from PyPI to your system python

This is possibly the method with the most freedom for users on Ubuntu or any Linux with GTK installed per default, but make sure you are definitely not in a conda environment, i.e. there is no (<env>) to the left of your shell prompt like this:

(<env>) foo@bar:~$ conda deactivate
foo@bar:~$ 

Install an MSA tool of your choice and possibly BLAST+ by yourself, then AB12PHYLO and all its python3 dependencies via

pip install ab12phylo

If starting ab12phylo fails with No module named 'gi', here are instructions to install PyGObject. Please be careful, python is probably important for your system!

d) build AB12PHYLO from source using git and pip

Should work on any system and will always install the latest version:

git clone https://github.com/lkndl/ab12phylo
cd ab12phylo
pip install --upgrade pip
pip install .

Please see a) and c) if something goes wrong.

e) install via pip inside a conda environment

If you are on Linux and would like to install via pip inside your conda <env>, please check which python and which pip is active inside the environment (where you will see (<env>) to the left of your shell prompt, not shown above).

(<env>) foo@bar:~$ which python
/home/foo/anaconda3/envs/<env>/bin/python

(<env>) foo@bar:~$ which pip
/usr/bin/pip

(<env>) foo@bar:~$ which pip3
/home/foo/anaconda3/envs/<env>/bin/pip3

In the case outlined here, pip points to a version outside your conda installation, so use pip3. If neither points to your conda, re-start your shell and check the environment. Sometimes there is no pip at all, which can be fixed on Linux using your package manager (or sudo apt-get install python3-pip on Ubuntu), or in conda via conda install -c conda-forge pip.

If you see No module named 'gi' make sure to fix that with conda as described in a).

Getting Started

AB12PHYLO has both a graphical and a commandline interface: ab12phylo and ab12phylo-cmd. The graphical version has its own help, which you will see if you start it with ab12phylo from your terminal.

The following tutorial is intended for the commandline tool. First, take a look at the options by running ab12phylo-cmd -h.

Test run

The pipeline comes with its own test data set. If you pass -test, it will read options from an auxiliary (or backup) config file at <ab12phylo_root>/ab12phylo_cmd/config/test_config.yaml and run on these. The test run is set to --verbose and will run --no_remote BLAST search.

Basic options

A simple real-world invocation might look like this:

ab12phylo-cmd -abi <seq_dir> \
    -csv <wellsplates_dir> \
    -g <barcode_gene> \
    -rf <ref.fasta> \
    -bst 1000 \
    -dir <results>

where:

  • <seq_dir> contains all input ABI trace files, ending in .ab1
  • <wellsplates_dir> contains the .csv mappings of user-defined IDs to sequencer's isolate coordinates
  • <barcode_gene> was sequenced. more info
  • <ref.fasta> contains full GenBank reference records like this
  • 1000 -bst = --bootstrap trees will be generated
  • <results> is where results will be

Detailed settings

ab12phylo-cmd has a lot of defaults, but still allows fine-grained access:

ab12phylo-cmd -rf <ref.fasta> \
    -db <your_own> \
    -dbpath <your_dir> \
    -abiset <whitelist> \
    -regex_3 <"rx1" "rx2" "rx3"> \
    -algo <mafft-clustalo-muscle-tcoffee> \
    -gbl balanced \
    -local \
    -i \
    -p1

ab12phylo-cmd -p2 \
    -bst 1000 \
    -st [32,16]  \
    -s 4 \
    -v 
  • default: AB12PHYLO will search for .ab1 and .csv files in or below the current working directory
  • default: use the ./results subdirectory
  • default: use the GTR+Γ model of evolution
  • use <your_own> BLAST+ database; in <your_dir>
  • only trace files listed in the <whitelist> will be read
  • plate number, gene name and well will be parsed from the .ab1 filename using these three RegEx.
  • -algo will generate the MSA: mafft, clustalo, muscle or t_coffee
  • -gbl sets Gblocks MSA trimming mode: skip, relaxed, balanced or strict
  • -local skips online BLAST for sequences not in the local BLAST+ db and read why
  • -i or --info shows some more run details in the console
  • -p1 run only part one, up until BLAST; also generates an MView HTML of the MSA to find funny sequences

For the second invocation:

  • -p2 run part two of AB12PHYLO, starting with RAxML-NG
  • -st: ML tree searches from 32 random and 16 parsimony-based starting trees
  • -s fixes the random --seed to 4 for reproducibility
  • -v or --verbose shows all logged events in the console

Results + Motif Search

You can import results from ab12phylo-cmd into the graphical ab12phylo. It's faster, easier and more capable when it comes to viewing and modifying trees. Open a new window/project, press Ctrl+i or Import from commandline version and select the folder with your commandline results.

Once ML tree inference, bootstrapping and BLAST has finished from the commandline, the pipeline will display a result.html in your web browser. This page contains a form that allows Motif search across node attributes and calculates diversity metrics for the matching subset/subtree. Entering a space ' ' should match all samples, and entering multiple motifs separated by commas , will select all leaves that match at least one motif.

To select a subtree, enter a motif that matches several leaves or at least two separate specific leaf motifs, with one of them as far left as possible. Entering a single, specific motif for a subtree search can be used to find the index of a node for tree modifications like rooting.

Once you hit MATCH or SUBTREE, the CGI script in the package will highlight the selection in the tree, as well as compute and display diversity statistics for this 'population'. It will also write and link a file with all selected sample IDs or the paths of the original .ab1 files in the /query subdirectory, which can be used as a whitelist file for a subsequent subset analysis run. Pass it via -sampleset if using sample IDs, or via -abiset if using file paths. File paths are recommended to reliably exclude outlier versions with identical base ID.

If results are moved or sent, motif search will be possible by using ab12phylo-view or starting a CGI server in the directory via python3 -m http.server --cgi <port>. Find the port on the intro tab.

BLAST

If none of the smaller BLAST+ databases are sufficient for your search, you are better of with a web BLAST. Collate your data by running -p1 --no_BLAST, then upload the <gene>/<gene>.fasta for the gene you wish to use for species annotation. Pass the result via --BLAST_xml. You can pass more than one file; AB12PHYLO will use the last annotation for a sample and ignore everything but the first hit.

BLAST API

AB12PHYLO is perfectly capable of running online BLAST searches without a BLAST+ installation. However, this is not a suitable main BLAST strategy as BLAST API queries are de-prioritised after just a few attempts. Accordingly, if several runs are attempted on the same data set, passing the --no_remote flag will use data from an earlier run by default. You can also leave out species annotation by skipping BLAST entirely with --no_BLAST.

BLAST+ Database

Sometimes, this pipeline might run headless on a server. To keep it from running head-first in a firewall when it attempts to update its BLAST+ database via FTP, please pre-supply an unzipped, ready-to-use BLAST+ database via -dbpath (and -db name). Find databases on the NCBI website, but web BLAST and --BLAST_xml might be both easier and faster.

Visualization

This is easier in the graphical ab12phylo. If you'd like to stick to the commandline:

ab12phylo-visualize + ab12phylo-view

ab12phylo-visualize will re-plot phylogenies and render a new results.html. An end user may use this to switch support values or plot an MSA visualization with -msa-viz. This will take some rendering time for wider alignments. ab12phylo-view shows results of a previous run in a browser, with motif search enabled. Both commands accept a path to the AB12PHYLO results or default to ., and are equivalent to appending to the original ab12phylo-cmd call.

ab12phylo-view <result_folder>
# or
cd <results_folder>
ab12phylo-view
# or 
ab12phylo-cmd -c <my-config.yaml> -bst 1000 (...) -view

Support Values

For visualization, you can pick either Felsenstein Bootstrap Proportions FBP or Transfer Bootstrap Expectation TBE support values with -metric. Newick tree files for both types are generated anyway, so switching the support value metric only requires re-running ab12phylo-visualize.

MSA visualization

AB12PHYLO can plot an additional rectangular tree with an MSA visualization next to it by passing -msa_viz. This is single-threaded and takes some extra time for larger MSAs. If you run -p1, you can visually inspect the MView HTML of the alignment for outliers already without having to wait for the entire pipeline.

Tree modifications

You can -drop_nodes, -replace_nodes or subtrees with a placeholder, or -root the tree with an outgroup sequence. These options accept integer node indices from motif search.

Also see the VISUALIZATION section of --help.

Advanced use

Remote runs

tmux and --headless are highly recommended for remote runs, as well as pre-supplying a BLAST+ db, adapting or replacing the YAML config file and setting a fixed seed for reproducibility.

Genes and References

If data for several genes is supplied, AB12PHYLO will restrict the analysis to samples that are present for all genes. If this causes a lot of samples to be dropped, it might be worth leaving out a gene entirely by setting -g = --genes manually. If you somehow have No samples shared across all genes, there might be some trace files that seemingly belongs to another gene.

Passing references is closely inter-linked: If a directory of reference files is supplied via -rd = --ref_dir, the package will try to match the .FASTA files inside to genes by their filename. For example, ITS1F.fasta will be matched to trace data from the ITS1F gene. Alternatively, an ordered list of reference files can be passed via -rf = --ref, and file names will be ignored.

If you're feeling this neat and precise and set both the genes and individual references, be careful: The pipeline will then deliberately match references and genes by order. Therefore, this will make a mess:

-g ITS1F OPA10 -rf ../opa10.fasta ITS1F.phy

RegEx

If you provide wellsplates mappings, AB12PHYLO will parse plate number, gene name and the sequencer's isolate coordinates from the .ab1 filename with a RegEx and fetch the user-defined ID from the corresponding .csv look-up table. To use your own, please consult --help and try out your RegEx here or there. From a bash shell, enclose the terms with double quotes ". The wellsplate number or ID is also parsed from the .csv filename with a RegEx.

Provide a RegEx to --regex_rev and AB12PHYLO will look out for reverse reads, and add only their reverse complement to the data set.

Log File

If you're having trouble, look at the log file! It will be in your results directory and named like ab12phylo[|-p1|-p2][-view|-viz]?.log. Alternatively, you can set the --verbose flag and get the same information in real-time to your commandline. Your choice!

Dependencies

Biopython, NumPy, pandas, Toytree, Toyplot, matplotlib, PyYAML, lxml, xmltramp2, svgutils, Pillow, Requests and Jinja2

External Tools

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

ab12phylo-0.4.24b0.tar.gz (14.4 MB view hashes)

Uploaded Source

Built Distribution

ab12phylo-0.4.24b0-py3-none-any.whl (14.6 MB view hashes)

Uploaded Python 3

Supported by

AWS AWS Cloud computing and Security Sponsor Datadog Datadog Monitoring Fastly Fastly CDN Google Google Download Analytics Microsoft Microsoft PSF Sponsor Pingdom Pingdom Monitoring Sentry Sentry Error logging StatusPage StatusPage Status page