orthoflow

A phylogenomic workflow

Orthoflow

Orthoflow is a workflow for phylogenetic inference from genome-scale datasets of protein-coding genes. Our goal was to make it straightforward to combine input sources, including annotated contigs in GenBank format and FASTA files containing CDSs. It offers several state-of-the-art methods for orthology inference, based either on HMM profiles or on de novo inference of orthogroups. Through the use of OrthoSNAP, many additional ortholog alignments can be generated from multi-copy gene families. For phylogenetic inference, users can choose a supermatrix approach and/or gene tree inference followed by supertree reconstruction. Users can specify a range of alignment filtering settings to retain high-quality alignments for phylogenetic inference. The workflow produces a detailed report that, in addition to the phylogenetic results, includes a range of diagnostics to verify the quality of the results.

Figure: Orthoflow workflow diagram (docs/source/_static/images/orthoflow-workflow-diagram.svg)

Documentation

Detailed documentation can be found at https://rbturnbull.github.io/orthoflow/

Quick start guide

Installation

You can install orthoflow with pip:

pip install orthoflow

More information about installation is available here: https://rbturnbull.github.io/orthoflow/main/installation.html

Input data

Orthoflow works from an input CSV file with information about the data sources to be used. Preparing this file is central to setting up your run. The default filename for this is input_sources.csv.

It needs the columns file, taxon_string, data_type and translation_table.

The file column is the path to the file relative to the working directory.
The taxon_string is the name of the taxon from which the data was obtained.
The data_type column should be GenBank when providing a GenBank-formatted file with CDS annotations, or CDS or Protein when providing a FASTA file with coding sequences consisting of nucleotides or amino acids respectively.
The translation_table column should have the NCBI translation table (genetic code) number for the data.
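The column requirements above can be checked before starting a run. The following is a minimal sketch, not part of Orthoflow: the function name check_input_sources is hypothetical, the file is assumed to be comma-delimited, and the strict, case-sensitive match on data_type is an assumption (Orthoflow itself may be more lenient).

```python
import csv
import io

REQUIRED_COLUMNS = ("file", "taxon_string", "data_type", "translation_table")
ALLOWED_DATA_TYPES = {"GenBank", "CDS", "Protein"}


def check_input_sources(handle):
    """Return a list of problems found in an input_sources.csv-style table.

    Hypothetical helper for illustration; assumes comma-delimited input.
    """
    reader = csv.DictReader(handle)
    header = reader.fieldnames or []
    missing = [column for column in REQUIRED_COLUMNS if column not in header]
    if missing:
        return [f"missing column(s): {', '.join(missing)}"]

    problems = []
    # Data rows start on line 2 because line 1 holds the header.
    for line_number, row in enumerate(reader, start=2):
        data_type = row["data_type"] or ""
        table = (row["translation_table"] or "").strip()
        if data_type not in ALLOWED_DATA_TYPES:
            problems.append(f"line {line_number}: unexpected data_type {data_type!r}")
        if not table.isdigit():
            problems.append(f"line {line_number}: translation_table is not a number")
    return problems


# Small in-memory example; the second data row has a miscapitalized data_type.
example = io.StringIO(
    "file,taxon_string,data_type,translation_table\n"
    "KY509313.gb,Avrainvillea_mazei_HV02664,GenBank,11\n"
    "KX808497.fa,Derbesia_sp_WEST4838,Genbank,11\n"
)
problems = check_input_sources(example)
print(problems)
```

To check a real file, open it with open("input_sources.csv", newline="") and pass the handle to the function.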

Let’s look at the demonstration dataset distributed with the code: tests/test-data/input_sources.csv.

file	taxon_string	data_type	translation_table
KY509313.gb	Avrainvillea_mazei_HV02664	GenBank	11
NC_026795.txt	Bryopsis_plumosa_WEST4718	GenBank	11
KX808498.gb	Caulerpa_cliftonii_HV03798	GenBank	11
KY819064.cds.fasta	Chlorodesmis_fastigiata_HV03865	CDS	11
KX808497.fa	Derbesia_sp_WEST4838	CDS	11
MH591079.gb	Dichotomosiphon_tuberosus_HV03781	GenBank	11
MH591080.gbk	Dichotomosiphon_tuberosus_HV03781	GenBank	11
MH591081.gbk	Dichotomosiphon_tuberosus_HV03781	GenBank	11
MH591083.gb	Flabellia_petiolata_HV01202	GenBank	11
MH591084.gb	Flabellia_petiolata_HV01202	GenBank	11
MH591085.gb	Flabellia_petiolata_HV01202	GenBank	11
MH591086.gb	Flabellia_petiolata_HV01202	GenBank	11

We are using a dataset of algal chloroplast genomes, some as annotated GenBank files (data_type: GenBank) and some as FASTA files of the coding sequences (data_type: CDS). They all use the bacterial genetic code (translation_table: 11). Some of the genomes are in a single GenBank file (e.g. KY509313.gb at the top), while others are fragmented across multiple files (e.g. the last four files, which all belong to the same taxon).

The taxon_string column is perhaps the most important one: its values are the names that appear in the output tree, and they determine how input data are grouped (e.g. all CDSs in the final four GenBank files will be grouped into a single taxon). In this case, we have included specimen numbers as part of the taxon string, but that is optional.
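To make the grouping by taxon_string concrete, here is a small sketch using a few rows from the demonstration table. It only illustrates the idea and is not Orthoflow's internal code; the function name group_files_by_taxon is hypothetical.

```python
from collections import defaultdict

# (file, taxon_string) pairs taken from the demonstration table.
rows = [
    ("KY509313.gb", "Avrainvillea_mazei_HV02664"),
    ("MH591083.gb", "Flabellia_petiolata_HV01202"),
    ("MH591084.gb", "Flabellia_petiolata_HV01202"),
    ("MH591085.gb", "Flabellia_petiolata_HV01202"),
    ("MH591086.gb", "Flabellia_petiolata_HV01202"),
]


def group_files_by_taxon(rows):
    """Map each taxon_string to the list of input files that share it."""
    groups = defaultdict(list)
    for file_name, taxon_string in rows:
        groups[taxon_string].append(file_name)
    return dict(groups)


groups = group_files_by_taxon(rows)
for taxon, files in groups.items():
    print(f"{taxon}: {len(files)} file(s)")
```

Five input files yield two taxa here, because the four Flabellia files share one taxon_string.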

Simple run

We are using the small demonstration dataset distributed with Orthoflow in the tests/test-data subdirectory.

Go into the directory containing the input_sources.csv file and run orthoflow with default settings with these commands:

cd tests/test-data
orthoflow

By default, Orthoflow will extract the CDSs from the input files, run OrthoFinder followed by OrthoSNAP to determine orthologous genes, align them and infer a concatenated tree from the protein sequences. You can follow progress on the screen as the workflow executes and outputs are produced.

Note that the first run of the workflow will be slow because it needs to download and install the software it depends on. This only happens once, and subsequent runs should start much faster.
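If you prefer to launch the run from a script, the two commands above can be wrapped as follows. This is a sketch under stated assumptions: run_orthoflow is a hypothetical helper, it passes no options to the orthoflow command, and it only adds two sanity checks before starting.

```python
import shutil
import subprocess
from pathlib import Path


def run_orthoflow(workdir):
    """Run `orthoflow` with default settings inside workdir.

    Hypothetical wrapper for illustration; it adds no command-line options.
    """
    workdir = Path(workdir)
    # The default input file must be present in the working directory.
    if not (workdir / "input_sources.csv").is_file():
        raise FileNotFoundError(f"no input_sources.csv in {workdir}")
    # The command is only available after `pip install orthoflow`.
    if shutil.which("orthoflow") is None:
        raise RuntimeError("the orthoflow command is not on PATH")
    # Equivalent to `cd workdir` followed by `orthoflow`.
    return subprocess.run(["orthoflow"], cwd=workdir, check=True)
```

Calling run_orthoflow("tests/test-data") from the repository root would then correspond to the simple run described above.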

Examining the output

Inferred tree and intermediate files

All output files are saved in the results directory. Output files are organized by workflow module, each of which has its own subdirectory. For the demonstration analysis that we ran above, the inferred phylogeny is in the supermatrix subdirectory and is called supermatrix.protein.treefile. Open this with a tree viewer (e.g. FigTree). Also take some time to browse the intermediate files, including the orthogroups, gene alignments and the supermatrix constructed from them.
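For a quick check without a tree viewer, the taxon names in the tree file can be listed with a few lines of Python. This sketch assumes the tree file is in Newick format with unquoted labels; the example tree string and its branch lengths are made up for illustration, and tip_labels is a hypothetical helper.

```python
import re
from pathlib import Path

# A tip label directly follows "(" or "," and runs up to the next delimiter.
# Internal node labels (e.g. support values) follow ")" and are not matched.
TIP_LABEL = re.compile(r"[(,]\s*([^(),:;\s]+)")


def tip_labels(newick):
    """Return the tip labels of a Newick string with unquoted labels."""
    return TIP_LABEL.findall(newick)


# Made-up tree for illustration only.
example = (
    "((Avrainvillea_mazei_HV02664:0.1,Flabellia_petiolata_HV01202:0.2)95:0.05,"
    "Derbesia_sp_WEST4838:0.3);"
)
labels = tip_labels(example)
print(labels)

# Read the real tree if the demonstration run has been completed.
tree_path = Path("results/supermatrix/supermatrix.protein.treefile")
if tree_path.is_file():
    print(tip_labels(tree_path.read_text()))
```

The labels found should match the taxon_string values in input_sources.csv.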

Report and diagnostics

The report provides an overview of the results, the analysis settings used and citations for the software used to produce the results. It is saved as results/report.cds.html and/or results/report.protein.html, depending on the method used to infer the phylogeny.
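Since either or both report files may exist, a short sketch like the following can list the ones that were produced. The function name find_reports is hypothetical; the two file names are the ones given above.

```python
from pathlib import Path


def find_reports(results_dir="results"):
    """Return the report files that exist in the results directory."""
    results = Path(results_dir)
    candidates = [results / "report.cds.html", results / "report.protein.html"]
    return [path for path in candidates if path.is_file()]


for report in find_reports():
    print(report)
```

With the default settings of the demonstration run, which infer the tree from protein sequences, you would expect the protein report to be listed.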

Output logs

The output logs of all software used as part of the workflow can be found in the logs directory.

Credits and Attribution

Orthoflow was created by Robert Turnbull, Jacob Steenwyk, Simon Mutch, Vinícius Salazar, Pelle Scholten, Joanne L. Birch and Heroen Verbruggen.

The preprint for Orthoflow is here:

Robert Turnbull, Jacob L. Steenwyk, Simon J. Mutch, Pelle Scholten, Vinícius W. Salazar, Joanne L. Birch, and Heroen Verbruggen. Orthoflow: phylogenomic analysis and diagnostics with one command, 04 December 2023, PREPRINT available at Research Square [https://doi.org/10.21203/rs.3.rs-3699210/]

More details to come.

Project details

Release history

Version	Release date
0.3.4 (this version)	Mar 14, 2024
0.3.1	Jan 19, 2024
0.3.0	Dec 16, 2023
0.2.0	Oct 2, 2023

Download files

Source Distribution

orthoflow-0.3.4.tar.gz (187.6 kB)

Uploaded Mar 14, 2024 Source

Built Distribution

orthoflow-0.3.4-py3-none-any.whl (212.1 kB)

Uploaded Mar 14, 2024 Python 3

File details

Details for the file orthoflow-0.3.4.tar.gz.

File metadata

Download URL: orthoflow-0.3.4.tar.gz
Upload date: Mar 14, 2024
Size: 187.6 kB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: poetry/1.8.2 CPython/3.10.12 Linux/6.5.0-1016-azure

File hashes

Hashes for orthoflow-0.3.4.tar.gz
Algorithm	Hash digest
SHA256	`063bff7c0f5e4a62e637f472ce27b95a6ccd8709fcf381b8b735bfd7df3a55e6`
MD5	`c084a2ab196ff72a084083b4cddbaba3`
BLAKE2b-256	`7cbd17d1f13603081db76028e2499b21e7b1cb7c53e7826381c4eb7c2890c386`

File details

Details for the file orthoflow-0.3.4-py3-none-any.whl.

File metadata

Download URL: orthoflow-0.3.4-py3-none-any.whl
Upload date: Mar 14, 2024
Size: 212.1 kB
Tags: Python 3
Uploaded using Trusted Publishing? No
Uploaded via: poetry/1.8.2 CPython/3.10.12 Linux/6.5.0-1016-azure

File hashes

Hashes for orthoflow-0.3.4-py3-none-any.whl
Algorithm	Hash digest
SHA256	`a80596453e4680b0f9a04f60f964b80c3cf273b8dbb38751dc18666af9eb8b78`
MD5	`34d4dd43e2fddab77975bea2d0f3b054`
BLAKE2b-256	`f85fb40259c574b3c41a59fbe9a91fb9c404a6544bf299af28d382b3f03df76d`
