FastOMA - a package to infer orthology information among proteomes
Project description
FastOMA
FastOMA is a scalable software package to infer orthology relationship.
Want to learn more about FastOMA and try it online, check out FastOMA academy and FastOMA talk at ISMB 2023 on YouTube!
Input and Output:
Input:
-
Sets of protein sequences in FASTA format (with
.fa
extension) in the folderproteome
. The name of each fasta file is the name of species. Please make sure that the name of fasta records do not contain special characters including||
. -
Rooted Species tree in newick format. A rough species tree is enough, and it does not need to be binary (fully resolved). Besides, we do not need branch lengths. You could use the NCBI tree via ete3 package via
cat list_taxanomic_id.txt | ete3 ncbiquery --tree > species_tree.nwk
(see this). Note that the name of leaves of the tree (species name) should be the same as the file name of FASTAs (without.fa
extension) (item 1). And there shouldn't be any repeated names in leaves names and internal node names. The tree should not be with quotation. -
The omamer database which is available for download from the OMA browser. The FastOMA workflow will automatically download the omamer database for LUCA (7.7 GB) if the argument
--omamer_db
is not provided on the command line. The argument can be a local file (e.g. a previously downloaded omamer database file) or a URL to an alternative omamer database, e.g. a subset of the LUCA database which is smaller, like Primates with this link which is ~100MB. However, to have a broader reference gene families, we recommend to use the LUCA database if possible.
You can see an example in the testdata folder.
$ ls proteome
AQUAE.fa CHLTR.fa MYCGE.fa
$ cat species_tree.nwk
((AQUAE,CHLTR)inter1,MYCGE)inter2;
Besides, the internal node should not contain any special character (e.g. \
/
or space
).
The reason is that FastOMA write some files whose names contain the internal node's name.
If the species tree does not have label for some/all internal nodes, FastOMA labels them sequentially.
The updated tree will be stored in the output folder named as species_tree_checked.nwk
.
Main output:
Orthology information as HOG structure in OrthoXML format which can be used with PyHAM. Learn more about HOG here and here. The details of output are described below.
Additionally, FastOMA generates TSV files for rootlevel HOGs (deepest level) and marker genes groups (one gene per species maximum) together with dumps of fasta files (one per rootlevel HOG / marker gene).
How to run FastOMA
FastOMA is implemented as a nextflow-workflow. As such, FastOMA can be run without any installation steps given the system supports running either docker containers, singularity containers or has conda installed.
nextflow run dessimozlab/FastOMA -profile docker --input_folder /path/to/in_folder --output_folder /path/to/out_folder
You could also add specific version to be used by adding -r v0.3.1
to the command line.
Nextflow will automatically fetch the dessimozlab/FastOMA repository and starts
the FastOMA.nf
workflow. The -profile
argument must be used to specify the profile to use. We support docker
,
singularity
and conda
which then automatically set up the necessary tools by downloading the required containers or creating
a conda environment with the necessary dependencies.
See also How to install FastOMA for additional ways how to install and run FastOMA. Note also the section on the different profiles.
For more informaton on how Nextflow and Docker work together, see here.
More details on how to run
We provide for every commit of the repository a docker image for FastOMA on dockerhub. You can specify the container as
part of the nextflow command with the parameter container_version
. If you want to use the container of the current
git checkout version, you can specify this in the following way:
nextflow run FastOMA.nf -profile docker \
--container_version "sha-$(git rev-list --max-count=1 --abbrev-commit HEAD)" \
--input_folder testdata/in_folder \
--output_folder myresult/
How to install FastOMA
There are four ways to run/install FastOMA detailed below:
1. Running workflow directly
The FastOMA workflow can be run directly without any installation using nextflow's ability to fetch a workflow from github. A specific version can be selected by specifying the -r
option to nextflow to select a specific version of FastOMA:
nextflow run desimozlab/FastOMA -r v0.3.1 -profile conda
This will fetch version v0.3.1 from github and run the FastOMA workflow using the conda profile. See section How to run fastOMA.
2. Cloning the FastOMA repo and running from there
git clone https://github.com/DessimozLab/FastOMA.git
cd FastOMA
nextflow run FastOMA.nf -profile docker --container_version "sha-$(git rev-list --max-count=1 --abbrev-commit HEAD)" ...
3. Manual installation (for development) in python virtual environment
-
install mafft and FastTree and ensure the software is accessible on the PATH.
-
install python >= 3.9
-
create virtual environment, activate it and install FastOMA with additional extras inside it:
python3 -m venv .venv source .venv/bin/activate pip install FastOMA[report,nextflow]
You can also install FastOMA from a clone of the repository in editable mode with
pip install -e .[report,nextflow]
. -
run pipeline including with some testdata (For more details, see the section How to run FastOMA on the test data )
nextflow run FastOMA.nf -profile standard --input_folder testdata/in_folder --output_folder output -with-report
4. Manual installation in conda/mamba environment
In the FastOMA repository, we provide a conda environment file that can be used to generate a conda / mamba environment.
For Conda installation you need to first download the Miniconda installer from this link.
For MacOS:
curl https://repo.anaconda.com/miniconda/Miniconda3-latest-MacOSX-x86_64.sh -o Miniconda3.sh
bash Miniconda3.sh
For Linux
wget https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh -o Miniconda3.sh
bashMiniconda3.sh
Then follow the instruction on the terminal. Finally, close and re-open the terminal and run
conda create -n fastoma python=3.9 --file environment-conda.yml
conda activate fastoma
Then, clone and install fastOMA using
git clone https://github.com/DessimozLab/FastOMA.git
Alternatively, you could use Mamba instead of Conda (which needs its own installation). Note that it is not encouraged to have both Mamba and conda on one system, more info.
mamba env create -n FastOMA -f environment-conda.yml
mamba activate FastOMA
Afterwards, you can run the workflow using nextflow (which is installed as part of the conda environment)
nextflow run FastOMA.nf -profile standard|slurm --input_folder /path/to/input_folder --output_folder /path/to/output
Note that you should use either the profile standard
or slurm
such the nextflow executor will use the activated environment.
Using different nextflow profiles
Nextflow provides support to run a workflow on different infrastructures. Selection of this is done using the -profile
argument.
For FastOMA, we've implemented the following profiles below. Additional ones can also be created by specifying them in the nextflow.config
file.
Docker
With -profile docker
one can use docker as an execution platform. It requires docker to be installed on the system (see here). The pipeline
will automatically fetch missing containers from dockerhub (e.g. dessimozlab/fastoma) if not found locally. By default, the version
latest
is used by the pipeline, however we provide images for any branch and release as well; even for every recent commit.
One can select the desired container via the --container_version
argument
nextflow run FastOMA.nf -profile docker \
--container_version "sha-$(git rev-list --max-count=1 --abbrev-commit HEAD)" \
--input_folder testdata/in_folder \
--output_folder myresult/
This will use the container that is tagged with the current commit id. Similarly, one could also use
--container_version "0.3.1"
to use the container with version dessimozlab/fastoma:0.3.1
from dockerhub. Check the latest version on the DockerHub.
Singularity
Since Docker needs administrator privileges (root access), Singluarity (a.k.a Apptainer) is a good alternative. This can be installed using Conda with conda install conda-forge::singularity
. However, in most of the academic HPC cluster, singluarity is already installed and can be called with module load
.
With -profile singularity
singularity containers will be used to run the workflow. It requires singularity to
be installed on your system. The containers are automatically pulled from dockerhub and converted to singularity
containers. The same options as for Docker will be available.
Conda
with -profile conda
, the FastOMA workflow will create a conda environment which contains the necessary
dependencies and use this environment to run the workflow steps. Note that this environment does not need
to be activated manually. If you prefer to install the dependencies inside a conda or mamba environment
yourself, this can be achieved as described in .
Slurm (with singularity/conda)
On a HPC system you typically run processes using a scheduler system such as slurm or LSF. We provide
profiles -profile slurm
, -profile slurm_singularity
and -profile slurm_conda
to run FastOMA with
the respective engine using slurm as a scheduler system.
If you need a different scheduler, it is quite straight forward to
set it up in nextflow.config
based on the existing profiles and the documentation of
nextflow executors.
How to run FastOMA on the test data
Note : If you are using FastOMA with Docker or other profiles, check out the difference here.
First, cd to the testdata
folder and download the omamer database (optional) and change its name to omamerdb.h5
.
cd FastOMA/testdata
wget https://omabrowser.org/All/Primates-v2.0.0.h5 # 105MB
mv Primates-v2.0.0.h5 in_folder/omamerdb.h5
(This is for the test however, I would suggest downloading the LUCA-v2.0.0.h5
instead of Primates-v2.0.0.h5
for your real analysis.).
Check the item 2 in the input section for details.
Now we have such a structure in our testdata folder.
$ tree ../testdata/in_folder
├── omamerdb.h5
├── proteome
│ ├── AQUAE.fa
│ ├── CHLTR.fa
│ └── MYCGE.fa
└── species_tree.nwk
Finally, run the package using nextflow as below:
# cd FastOMA/testdata
nextflow run ../FastOMA.nf \
--input_folder in_folder \
--omamer_db in_folder/omamerdb.h5 \
--output_folder out_folder \
--report \
-profile standard
Note that to have a comprehensive test, we set the default value of needed cpus as 10.
If you face .command.sh: line 2: papermill: command not found
, note that the orthology inference is finished and you have them in output folder and you may want to install pip install -e .[report]
to have papermill
generating the report and run the last step.
Expected log for test data
After few minutes, the run for test data finishes.
[] process > check_input () [100%] 1 of 1 ✔
[] process > omamer_run () [100%] 3 of 3 ✔
[] process > infer_roothogs () [100%] 1 of 1 ✔
[] process > batch_roothogs () [100%] 1 of 1 ✔
[] process > hog_big () [100%] 1 of 1 ✔
[] process > hog_rest () [100%] 2 of 2 ✔
[] process > collect_subhogs () [100%] 1 of 1 ✔
The first step is to run OMAmer for finding the putative gene families (putative rootHOG) based on kmer similarity.
Next, we write them in FASTA files, which could be used to run next steps in parrallel on each FASTA gene family.
Then, to have similar size jobs, we batch these FASTA files either as one big roothog (per job hog_big
) or a few hundreds together as one job hog_rest
.
These are decided based on the FASTA file size. Finally, once all jobs of hog_big
and hog_rest
are done, we collect_subhog
and save all outputs.
If the run interrupted, by adding -resume
to the nextflow commond line, you might be able to continue your previous nextflow job.
Pro-tip. Nextflow creat a folder named work
for storing its temprorary files. The characters in the bracket of the nextflow log (not shown here) are the short form of the folder address in work/
where the last task of such job were done.
e.g [3f/2efg] process > check_input (1)
you can cd work/3f/2efg
then use tab to complete the folder name, then you can see the temporary files of check_input
task. In such folder there are some hidden files .command.log/sh/run
.f
Expected output structure for test data
The output of FastOMA includes several output files regarding orthology inference
(OrthologousGroups.tsv
, RootHOGs.tsv
, FastOMA_HOGs.orthoxml
, orthologs.tsv.gz
and species_tree_checked.nwk
),
a jupyter notebook based report about the dataset (report.ipynb
and report.html
) and four folders
(hogmap
, OrthologousGroupsFasta
, RootHOGsFasta
and stats
).
The hogmap
folder includes the output of OMAmer; each file corresponds to an input proteome.
The folder OrthologousGroupsFasta
includes FASTA files, and all proteins inside each FASTA file are orthologous to each other.
These could be used as gene markers for species tree inference with refined resolution, more info.
Note that OrthologousGroups are groups of strict orthologs, with at most 1 representative per species.
Hierarchical Orthologous Groups are groups of orthologs and paralogs, defined at each taxonomic level. The file
FastOMA_HOGs.orthoxml
contains all the nested groups in orthoxml format. The RootHOGs.tsv
and RootHOGsFasta/
files contains
the groups at the deepest level.
So, following files and folders should appear in the folder out_folder
which was the argument.
$tree out_folder
├── FastOMA_HOGs.orthoxml
├── hogmap
│ ├── AQUAE.fa.hogmap
│ ├── CHLTR.fa.hogmap
│ └── MYCGE.fa.hogmap
├── OrthologousGroupsFasta
│ ├── OG_0000001.fa.gz
│ ├── OG_0000002.fa.gz
│ ├── OG_0000003.fa.gz
│ ├ ...
├── OrthologousGroups.tsv
├── orthologs.tsv.gz
├── phylostratigraphy.html
├── report.html
├── report.ipynb
├── RootHOGsFasta
│ ├── HOG:0000001.fa.gz
│ ├── HOG:0000002.fa.gz
│ ├── HOG:0000003.fa.gz
│ ├ ...
├── RootHOGs.tsv
├── species_tree_checked.nwk
└── stats
├── pipeline_dag_<date>.html
├── report_<date>.html
├── timeline_<date>.html
└── trace_<date>.txt
among which FastOMA_HOGs.orthoxml
is the final output in orthoXML format. Its content looks like this
<?xml version='1.0' encoding='utf-8'?>
<orthoXML xmlns="http://orthoXML.org/2011/" origin="FastOMA 0.1.6" originVersion="2024-01-10 17:36:45" version="0.5">
<species name="MYCGE" taxonId="5" NCBITaxId="0">
<database name="database" version="2023">
<genes>
<gene id="1000000001" protId="sp|P47500|RF1_MYCGE" />
<gene id="1000000002" protId="sp|P13927|EFTU_MYCGE" />
<gene id="1000000003" protId="sp|P47639|ATPB_MYCGE" />
...
<orthologGroup id="HOG:0000001_1" taxonId="1">
<score id="CompletenessScore" value="1.0" />
<property name="OMAmerRootHOG" value="HOG:D0900115" />
<property name="TaxRange" value="inter2" />
<geneRef id="1000000005" />
<orthologGroup id="HOG:0000001_2" taxonId="2">
<score id="CompletenessScore" value="1.0" />
<property name="TaxRange" value="inter1" />
<geneRef id="1002000010" />
<geneRef id="1001000009" />
</orthologGroup>
</orthologGroup>
</groups>
</orthoXML>
If you are interested in specific gene in specific species, and wants to know
proteins that are in the gene family, you can find its protein ID in the file RootHOGs.tsv
using grep.
The first column of this file RootHOGs.tsv
shows the rootHOG ID which could be searched on the OMA browser.
Note that some of the input genes might not appear in this file.
To find list of genes that are orthologous to your gene of interest, you can search in the file OrthologousGroups.tsv
where each line is an orthologous group. Each line corresponds to a FASTA file in the folder OrthologousGroupsFasta
.
Note that some of the output files are symlink (a.k.a a symbolic link), linked to files in the folder work
created by nextflow pipeline.
This means that if you remove or rename the work
and its parents folder, you will not have access to the output files anymore.
If you are working on a large scale project, you may need to change the limitation on the number of files opened in linux using ulimit -n 271072
.
You can learn about OMA and FastOMA on OMA Academy.
Regarding temp folders:
The folder temp_output
includes gene_id_dic_xml.pickle
storing mapping between gene name and gene integer ID used for orthoxml format,
temp_omamer_rhogs
a folder that includes the fasta files of omamer-based gene families (described here).
The folder temp_pickles
includes the pickle file of orthoxml object which are final product of FastOMA for each gene family stored in temp_omamer_rhogs
.
These file can be empty when the gene family doesn't end up as a group (usually with size of 5 Byte). Gene trees and MSAs will be stored in temp_pickles
if activated (in _config.py
and fastOMA installed with pip -e
).
Using omamer's output
The first step of the FastOMA pipele is to run OMAmer. If you already have the hogmap files, you can put them in the in_folder/hogmap_in
.
Then your structure of files will be
$ tree ../testdata/
├── in_folder
│ ├── hogmap_in
│ │ ├── CHLTR.fa.hogmap
│ │ ├── MYCGE.fa.hogmap
│ ├── omamerdb.h5
│ ├── proteome
│ │ ├── AQUAE.fa
│ │ ├── CHLTR.fa
│ │ └── MYCGE.fa
│ └── species_tree.nwk
└── README.md
In this case, FastOMA uses two hogmap files and only run omamer for the AQUAE.fa
. Then continue the rest of pipeline.
Let's save the planet together with
green computational Biology.
Run on a cluster
For running on a SLURM cluster you can add -c ../nextflow_slurm.config
to the commond line.
# cd FastOMA/testdata
# rm -r out_folder work # You may remove stuff from previous run
# ls ../FastOMA.nf
nextflow ../FastOMA.nf -c ../nextflow_slurm.config --input_folder in_folder --output_folder out_folder
You may need to re-run nextflow command line by adding -resume
, if the allocated time is not enough for your dataset.
You may need to increase the number of opoened files in your system with ulimit -n 131072
or higher as nextflow generates hundreds of files depending on the size of your input dataset.
Handle splice files
You can put the splice files in the folder in_folder/splice
. They should be named as species_name.splice
for each species.
For each row of different isforoms of a preotien, FastOMA selects the best one (based on omamer family score and isoform length).
We also use those proteins that are not in splice file but present in the FASTA proteome file.
$ head HUMAN.splice
HUMAN00001;HUMAN00002;HUMAN00003;HUMAN00004;HUMAN00005;HUMAN00006
HUMAN00007;HUMAN00008;HUMAN00009;HUMAN00010;HUMAN00011;HUMAN00012;
HUMAN00022;HUMAN00023;HUMAN00024;
HUMAN00027;HUMAN00028;HUMAN00029;HUMAN00030;HUMAN00031;HUMAN00032;HUMAN00033
HUMAN00034;HUMAN00035
HUMAN00036
HUMAN00037
The selected isforoms will be added as a new column to the input splice files stored as tsv at out_folder/temp_output/selected_isoforms/
Under the hood: what are fastOMA gene families?
Firstly, those proteins that are mapped to the same OMAdb rootHOG (e.g. HOG:D0066142 for HOG:D0066142.1a.1a) by OMAmer are
grouped together to create query rootHOGs (no protein from OMAdb is stored), from now on called rootHOG.
Then, as OMAmer provide us with alternative mapping, we try to merge those rootHOGs (high chance of split HOGs) that have
many shared mappings. The query proteins of these rootHOGs will be stored in only one rootHOG.
These will be saved as fasta files in out_folder/temp_output/temp_omamer_rhogs
with file names format HOG_LXXXXX.fa
. L
is the release ID of OMADB.
Replacing _
with ':' gives the HOG ID which could be investigated in the OMA Browser.
There are some cases that only one protein is mapped to one rootHOG, called singleton (which is not good, we are hoping for orthologous groups/pairs). Using alternative OMAmer mapping, FastOMA tries to put these to other rootHOGs. Still some will be left.
FastOMA uses the linclust software to find new gene families on set of unmapped proteins and singletons.
These will be saved as fasta files in out_folder/temp_output/temp_omamer_rhogs
with file names format HOG_clustXXXXX.fa
.
These are initial gene families that are used in infer_subhogs
step, which could be split into a few smaller gene families.
Cite us
Majidian, Sina, Yannis Nevers, Ali Yazdizadeh Kharrazi, Alex Warwick Vesztrocy, Stefano Pascarelli, David Moi, Natasha Glover, Adrian M. Altenhoff, and Christophe Dessimoz. "Orthology inference at scale with FastOMA." bioRxiv (2024): 2024-01. https://www.biorxiv.org/content/10.1101/2024.01.29.577392v1.full
Change log
- Update v0.3.1: spliting HOG and sampling
- Update v0.1.6: adding dynamic resources, additional and improved output
- Update v0.1.5: docker, add help, clean nextflow
- Update v0.1.4: new gene families with linclust if mmseqs is installed, using quoted protein name to handle species chars, check input first
- Update v0.1.3: merge rootHOGs and handle singleton using omamer multi-hits
- Update v0.1.2: improve rootHOG inference, splice, OMAmerv2 with multi-hits
- Release v0.1.0: improve nextflow pipeline and outputs.
- prelease v.0.0.6: use
--fragment-detection
forinfer-subhogs
and--low-so-detection --fragment-detection
- prelease v.0.0.6: using input hogmpa
- prelease v.0.0.5: adding pip setup.py
- prelease v.0.0.4: simple nextflow
- prelease v.0.0.3: with dask
Project details
Download files
Download the file for your platform. If you're not sure which to choose, learn more about installing packages.
Source Distribution
Built Distribution
File details
Details for the file fastoma-0.3.3.tar.gz
.
File metadata
- Download URL: fastoma-0.3.3.tar.gz
- Upload date:
- Size: 122.0 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/5.1.0 CPython/3.12.5
File hashes
Algorithm | Hash digest | |
---|---|---|
SHA256 | 350cca5dfca89249a1670b6a1234d2fbe296c981109fb7d4e85a71d8e1ca4ea1 |
|
MD5 | ddb717219c0ab5e08032e51dd42bb079 |
|
BLAKE2b-256 | 8a2b6e946f4273224866cd4cb934b6d97cc466a2c1cd2a7cafadfb21134f0b9d |
File details
Details for the file fastoma-0.3.3-py3-none-any.whl
.
File metadata
- Download URL: fastoma-0.3.3-py3-none-any.whl
- Upload date:
- Size: 157.8 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/5.1.0 CPython/3.12.5
File hashes
Algorithm | Hash digest | |
---|---|---|
SHA256 | f148ff002bf526754404a7f70896c7a4f31430ab51be246ea79db58a2d4cfc33 |
|
MD5 | 433bb87e035d75b3386f420730842328 |
|
BLAKE2b-256 | 33cd391f1cd67370f80895c078d0ca0ea2ca525a4294a910fbf4e990c235fed6 |