EsMeCaTa: Estimating Metabolic Capabilities from Taxonomy
EsMeCaTa: Estimating Metabolic Capabilities from Taxonomic affiliations
EsMeCaTa is a method to estimate consensus proteomes and metabolic capabilities from taxonomic affiliations (obtained, for example, from 16S rRNA sequencing or given as a specific taxon name). It relies on the UniProt Proteomes database, the NCBI Taxonomy database, MMseqs2 and eggNOG-mapper. It can be used to (1) estimate protein sequences and functions for an organism with no sequenced genome, (2) explore the protein diversity of a taxon and (3) identify metabolic functions in the taxa. It is also used to search for key enzymes of biogeochemical cycles; for more information, see the Tabigecy GitHub page.
Table of contents
- EsMeCaTa: Estimating Metabolic Capabilities from Taxonomic affiliations
- Table of contents
- Requirements
- Installation
- Input
- EsMeCaTa commands
- Usage
- EsMeCaTa functions
  - esmecata check: Estimate knowledge associated with taxonomic affiliation
  - esmecata proteomes: Retrieve proteomes associated with taxonomic affiliation
  - esmecata clustering: Proteins clustering
  - esmecata annotation: Retrieve protein annotations with eggNOG-mapper
  - esmecata annotation_uniprot: Retrieve protein annotations with UniProt
  - esmecata workflow: Consecutive runs of the three steps by using eggNOG-mapper for the annotation
  - esmecata workflow_uniprot: Consecutive runs of the three steps
- EsMeCaTa outputs
- Post-processing analysis
- Tabigecy and bigecyhmm
- EsMeCaTa create_db
- Troubleshooting
- Citation
- License
Requirements
Requirements for EsMeCaTa vary depending on the command used.
EsMeCaTa is developed in Python and tested with Python 3.11. It needs the following Python packages:
- biopython: To create fasta files; also used by the option --annotation-files to index UniProt flat files.
- pandas: To read the input files.
- requests: For the REST queries on UniProt.
- ete4: To analyse the taxonomic affiliation and extract taxon_id, also used to deal with taxon associated with more than 100 proteomes.
- scipy: To compute openness with Heaps' law.
- SPARQLwrapper: Optionally, you can use SPARQL queries instead of REST queries. This can be done either with the UniProt SPARQL Endpoint (with the option --sparql uniprot) or with a UniProt SPARQL Endpoint that you created locally (this is expected to work but has not been tested; only SPARQL queries on the UniProt SPARQL endpoint have been tested). Warning: using SPARQL queries will lead to minor differences in functional annotations and metabolic reactions, because the results are not retrieved in the same way by REST and SPARQL queries.
EsMeCaTa also requires MMseqs2 for protein clustering with esmecata workflow or esmecata clustering:
- MMseqs2: To cluster proteins. Tests have been made with MMseqs2 Release 13-45111, more precisely with the version at commit f349118312919c4fcc448f4595ca3b3a387018e2, but EsMeCaTa should be compatible with more recent versions.
And eggNOG-mapper for the annotation of the proteins with esmecata workflow or esmecata annotation:
- eggNOG-mapper: To annotate protein clusters. It needs a path to the eggnog database. Depending on the options used, it needs around 60 GB of RAM, so it is recommended to run it on a cluster.
If you use the option --bioservices, EsMeCaTa will also require this package:
- bioservices: To query UniProt instead of using the query functions of EsMeCaTa (potentially more robust over time).
Installation
With the precomputed database
To query the precomputed database, it is only required to install EsMeCaTa with pip:
pip install esmecata
All the dependencies required for the estimation from the precomputed database are provided as Python packages.
The second requirement is the EsMeCaTa precomputed database (file size: 4 GB), available at the following Zenodo archive.
As this file is quite big, if you just want to test esmecata precomputed, you can try:
- the precomputed database (buchnera_database.zip) present in the test_data folder. You can use it on the buchnera_workflow.tsv input file present in the same test folder.
- one of the precomputed databases associated with the article and present in this other Zenodo archive. The associated input files are in this folder.
Core pipeline installation
For the whole workflow, the easiest way to install the dependencies of EsMeCaTa is by using conda (or mamba):
conda install mmseqs2 pandas sparqlwrapper requests biopython eggnog-mapper -c conda-forge -c bioconda
EsMeCaTa is available on PyPI and can be installed with pip:
pip install esmecata
It can also be installed from the esmecata GitHub repository:
git clone https://github.com/AuReMe/esmecata.git
cd esmecata
pip install -e .
To use eggNOG-mapper, you have to set it up and install its database; refer to the setup section of the eggNOG-mapper documentation.
Optional dependencies
esmecata_report requires:
- arakawa: fork of datapane to create the HTML report.
- plotly: to create most of the figures.
- ontosunburst: to create sunburst figures of Enzyme Commission numbers.
- kaleido: required to render the figures with plotly. Kaleido requires a compatible Chrome version, which can be installed with:
kaleido_get_chrome
This can be installed with pip:
pip install arakawa plotly kaleido ontosunburst
esmecata_gseapy requires:
- pronto: to get Gene Ontology names.
- gseapy: to perform enrichment analysis.
- orsum: to visualize the results of enrichment analysis.
These dependencies can be installed with conda:
conda install pronto orsum gseapy plotly -c conda-forge -c bioconda
Input
EsMeCaTa takes as input a tabulated or Excel file with two columns: the first contains the ID corresponding to the taxonomic affiliation (for example, the OTU ID from 16S rRNA sequencing) and the second contains the taxonomic classification, with taxa separated by ';'. In the following documentation, the first column (named observation_name) is used as the label identifying each taxonomic affiliation. Several examples are available (buchnera_workflow.tsv, toy_example.tsv, methanogenic_reactor.tsv or honeybee_esmecata_metdata.tsv).
For example:
| observation_name | taxonomic_affiliation |
|---|---|
| Cluster_1 | Bacteria;Spirochaetes;Spirochaetia;Spirochaetales;Spirochaetaceae;Sphaerochaeta;unknown species |
| Cluster_2 | Bacteria;Chloroflexi;Anaerolineae;Anaerolineales;Anaerolineaceae;ADurb.Bin120;unknown species |
| Cluster_3 | Bacteria;Cloacimonetes;Cloacimonadia;Cloacimonadales;Cloacimonadaceae;Candidatus Cloacimonas;unknown species |
| Cluster_4 | Bacteria;Bacteroidetes;Bacteroidia;Bacteroidales;Rikenellaceae;Rikenellaceae RC9 gut group;unknown species |
| Cluster_5 | Bacteria;Cloacimonetes;Cloacimonadia;Cloacimonadales;Cloacimonadaceae;W5;unknown species |
| Cluster_6 | Bacteria;Bacteroidetes;Bacteroidia;Bacteroidales;Dysgonomonadaceae;unknown genus;unknown species |
| Cluster_7 | Bacteria;Firmicutes;Clostridia;Clostridiales;Clostridiaceae;Clostridium;unknown species |
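To make the layout of such a file concrete, here is a minimal sketch that reads a tab-separated version of the table above using only the Python standard library. The function name `read_affiliations` is hypothetical and is not part of EsMeCaTa (which reads its input with pandas); the two rows are copied from the example.

```python
import csv
import io

# Two rows copied from the example table above, in tab-separated form.
EXAMPLE_TSV = (
    "observation_name\ttaxonomic_affiliation\n"
    "Cluster_1\tBacteria;Spirochaetes;Spirochaetia;Spirochaetales;"
    "Spirochaetaceae;Sphaerochaeta;unknown species\n"
    "Cluster_7\tBacteria;Firmicutes;Clostridia;Clostridiales;"
    "Clostridiaceae;Clostridium;unknown species\n"
)


def read_affiliations(handle):
    """Map each observation_name to its list of taxa (highest rank first).

    Hypothetical helper for illustration, not an EsMeCaTa function.
    """
    reader = csv.DictReader(handle, delimiter="\t")
    return {
        row["observation_name"]: [
            taxon.strip() for taxon in row["taxonomic_affiliation"].split(";")
        ]
        for row in reader
    }


affiliations = read_affiliations(io.StringIO(EXAMPLE_TSV))
# The second-to-last taxon of Cluster_7 is its genus.
print(affiliations["Cluster_7"][-2])
```

The same function works on a real file by passing `open("input.tsv", newline="")` instead of the in-memory string.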
It is possible to use EsMeCaTa with a taxonomic affiliation containing only one taxon:
| observation_name | taxonomic_affiliation |
|---|---|
| Cluster_1 | Sphaerochaeta |
| Cluster_2 | Yersinia |
But this can cause issues. For example, "Cluster_2" is associated with Yersinia, but two genera share this name (one mantid (taxId: 444888) and one bacterium (taxId: 629)). EsMeCaTa will not be able to differentiate them. But if you give more information by adding more taxa (for example: 'Bacteria;Gammaproteobacteria;Yersinia'), EsMeCaTa will compare all the taxa of the taxonomic affiliation (here: 2 (Bacteria) and 1236 (Gammaproteobacteria)) to the lineage associated with the two taxIDs (for the bacterial Yersinia: [1, 131567, 2, 1224, 1236, 91347, 1903411, 629] and for the mantid one: [1, 131567, 2759, 33154, 33208, 6072, 33213, 33317, 1206794, 88770, 6656, 197563, 197562, 6960, 50557, 85512, 7496, 33340, 33341, 6970, 7504, 7505, 267071, 444888]). In this example, there are 2 matches for the bacterial one (2 and 1236) and 0 for the mantid one, so EsMeCaTa will select the taxId associated with the bacterium (629).
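The comparison described above can be sketched as follows. This is only an illustration of the idea (count the taxon IDs shared between the affiliation and each candidate lineage, then keep the best candidate); `pick_taxid` is a hypothetical name, not EsMeCaTa's actual function, and how EsMeCaTa handles ties is not covered here. The lineages are the ones quoted in the text.

```python
def pick_taxid(affiliation_taxids, candidate_lineages):
    """Return the candidate taxon ID sharing the most IDs with the affiliation.

    Hypothetical helper: affiliation_taxids are the IDs of the other taxa of
    the affiliation, candidate_lineages maps each candidate ID to its lineage.
    """
    scores = {
        taxid: len(set(affiliation_taxids) & set(lineage))
        for taxid, lineage in candidate_lineages.items()
    }
    best = max(scores, key=scores.get)
    return best, scores


# Lineages of the two taxa named Yersinia, as given in the text.
candidates = {
    629: [1, 131567, 2, 1224, 1236, 91347, 1903411, 629],
    444888: [1, 131567, 2759, 33154, 33208, 6072, 33213, 33317, 1206794,
             88770, 6656, 197563, 197562, 6960, 50557, 85512, 7496, 33340,
             33341, 6970, 7504, 7505, 267071, 444888],
}

# 'Bacteria;Gammaproteobacteria;Yersinia' gives the IDs 2 and 1236.
best, scores = pick_taxid([2, 1236], candidates)
print(best, scores)
```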
It is also possible to give NCBI taxon ID as input with a column named ncbi_taxid:
| observation_name | ncbi_taxid |
|---|---|
| Cluster_1 | 399320 |
| Cluster_2 | 229919 |
| Cluster_3 | 456826 |
| Cluster_4 | 171550 |
| Cluster_5 | 456826 |
| Cluster_6 | 2005520 |
| Cluster_7 | 1425363 |
A Jupyter notebook explains how EsMeCaTa works.
EsMeCaTa commands
Several commands are available after the installation:
- esmecata: the main command to perform the esmecata workflow from an input file or with a precomputed database.
- esmecata_report: a command to create an HTML report showing different statistics on the predictions.
- esmecata_gseapy: to perform enrichment analysis using gseapy and orsum to identify functions specific to some taxa compared to the whole community.
- esmecata_create_db: to create precomputed databases from an esmecata run or merge different precomputed databases.
Here is the help for the main esmecata command:
usage: esmecata [-h] [--version] {check,proteomes,clustering,annotation_uniprot,annotation,workflow_uniprot,workflow,precomputed} ...
From taxonomic affiliation to metabolism using Uniprot. For specific help on each subcommand use: esmecata {cmd} --help
options:
-h, --help show this help message and exit
--version show program's version number and exit
subcommands:
valid subcommands:
{check,proteomes,clustering,annotation_uniprot,annotation,workflow_uniprot,workflow,precomputed}
check Check proteomes associated with taxon in Uniprot Proteomes database.
proteomes Download proteomes associated with taxon from Uniprot Proteomes.
clustering Cluster the proteins of the different proteomes of a taxon into a single set of representative shared proteins.
annotation_uniprot Retrieve protein annotations from Uniprot.
annotation Annotate protein clusters using eggnog-mapper.
workflow_uniprot Run all esmecata steps (proteomes, clustering and annotation).
workflow Run all esmecata steps (proteomes, clustering and annotation with eggnog-mapper).
precomputed Use precomputed database to create estimated data for the run.
The proteomes step and the annotation step with UniProt require an internet connection (for REST and SPARQL queries, unless you have a local UniProt SPARQL endpoint). The clustering step requires MMseqs2. Annotation can be performed with UniProt or eggNOG-mapper (which is then a requirement if this option is selected). The precomputed command requires the esmecata_database.zip file.
Usage
Use the precomputed database
The precomputed database of EsMeCaTa is available at this Zenodo repository. Warning: the size of this precomputed database is 4 GB.
Several smaller precomputed databases associated with the article datasets are available in the Zenodo archive of EsMeCaTa's article. There is also a small precomputed database for test purposes (on one organism, buchnera_database.zip) in the test folder (test_data folder).
With the precomputed database, esmecata searches for the input taxa inside the database to make predictions. It requires an input file containing the taxonomic affiliations and a precomputed esmecata database. For each observation name in the input file, it returns the associated annotations. It also outputs the protein sequences for each taxon associated with the observation name.
usage: esmecata precomputed [-h] -i INPUT_FILE -d INPUT_FILE [INPUT_FILE ...] -o OUPUT_DIR [-r RANK_LIMIT] [--update-affiliations] [-t THRESHOLD_CLUSTERING]
options:
-h, --help show this help message and exit
-i INPUT_FILE, --input INPUT_FILE
Input taxon file (excel, tsv or csv) containing a column associating ID to a taxonomic affiliation (separated by ;).
-d INPUT_FILE [INPUT_FILE ...], --database INPUT_FILE [INPUT_FILE ...]
EsMeCaTa precomputed database file path. Multiple precomputed databases can be given, separated by a " ", for example -d "esmecata_db1.zip esmecata_db2.zip".
-o OUPUT_DIR, --output OUPUT_DIR
Output directory path.
-r RANK_LIMIT, --rank-limit RANK_LIMIT
This option limits the rank used when searching for proteomes. All the ranks superior to the given rank will be ignored. For example, if 'family' is given, only taxon ranks inferior or equal to family will be kept. Look at the readme for more
information (and a list of rank names).
--update-affiliations
If the taxonomic affiliations were assigned from an outdated taxonomic database, this can lead to taxon not be found in ete4 database. This option tries to update the taxonomic affiliations using the lowest taxon name.
-t THRESHOLD_CLUSTERING, --threshold THRESHOLD_CLUSTERING
Proportion [0 to 1] of proteomes required to occur in a proteins cluster for that cluster to be kept in core proteome assembly. Default is 0.5.
Two options can be used to limit the rank used when searching for proteomes and to update the taxonomic affiliations from the input file.
Example of use:
esmecata precomputed -i input_taxonomic_affiliations.tsv -d esmecata_database.zip -o output_folder
It is also possible to give multiple precomputed databases, by separating them with a space:
esmecata precomputed -i input_taxonomic_affiliations.tsv -d "esmecata_database_1.zip esmecata_database_2.zip" -o output_folder
The order is important, because esmecata searches the first database and then the following ones. In particular, it will not search the next databases for taxa already found in a previous one.
For example, if esmecata begins its search in esmecata_database_1.zip and finds the taxon Escherichia, the organisms associated with this taxon will not be searched again in esmecata_database_2.zip.
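The precedence rule can be pictured with a small sketch. It assumes, purely for illustration, that each database is reduced to the set of taxon names it contains; `assign_databases` and the database contents below are hypothetical and do not reflect the internal format of the precomputed archives.

```python
def assign_databases(taxa, ordered_databases):
    """Assign each taxon to the first database (in the given order) holding it.

    Hypothetical helper: ordered_databases is a list of (name, set_of_taxa).
    """
    assigned = {}
    for db_name, db_taxa in ordered_databases:
        for taxon in taxa:
            # A taxon already found in an earlier database is not searched again.
            if taxon not in assigned and taxon in db_taxa:
                assigned[taxon] = db_name
    return assigned


# Made-up database contents, only to show the effect of the order.
databases = [
    ("esmecata_database_1.zip", {"Escherichia"}),
    ("esmecata_database_2.zip", {"Escherichia", "Yersinia"}),
]
print(assign_databases(["Escherichia", "Yersinia"], databases))
```

Here Escherichia is taken from the first database even though the second one also contains it, while Yersinia falls through to the second.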
Classical run of EsMeCaTa
Otherwise, it is possible to run the whole workflow of EsMeCaTa, but it will take time, as it searches for and downloads proteomes from UniProt, clusters protein sequences with MMseqs2 and then annotates them with eggNOG-mapper.
These different steps are presented in the following section.
EsMeCaTa functions
EsMeCaTa consists of three steps:
- proteomes: searches for proteomes associated with the taxonomic affiliations on UniProt (also done with the check subcommand) and downloads them.
- clustering: clusters the proteins of the proteomes and filters the clusters according to a threshold (option -t).
- annotation: annotates the protein clusters.
The annotation step can be performed with two methods, either by retrieving annotations from UniProt or by using eggNOG-mapper.
The eggNOG-mapper approach has been found to be the more accurate of the two when combined with a clustering threshold of 0.5 (-t 0.5).
These options are the default options of EsMeCaTa.
As these steps can take time, a precomputed database has been created, containing taxa at the species, genus, family, order, class and phylum ranks that have at least 5 proteomes in UniProt.
This precomputed database can be used with the command esmecata precomputed to look up the taxonomic affiliations of the input file in the database.
esmecata check: Estimate knowledge associated with taxonomic affiliation
usage: esmecata check [-h] -i INPUT_FILE -o OUPUT_DIR [-b BUSCO] [--ignore-taxadb-update] [--all-proteomes] [-s SPARQL] [-l LIMIT_MAXIMAL_NUMBER_PROTEOMES] [-r RANK_LIMIT] [--minimal-nb-proteomes MINIMAL_NUMBER_PROTEOMES] [--update-affiliations] [--bioservices]
options:
-h, --help show this help message and exit
-i INPUT_FILE, --input INPUT_FILE
Input taxon file (excel, tsv or csv) containing a column associating ID to a taxonomic affiliation (separated by ;).
-o OUPUT_DIR, --output OUPUT_DIR
Output directory path.
-b BUSCO, --busco BUSCO
BUSCO percentage between 0 and 1. This will remove all the proteomes without BUSCO score and the score before the selected ratio of completion.
--ignore-taxadb-update
If you have a not up-to-date version of the NCBI taxonomy database with ete4, use this option to bypass the warning message and use the old version.
--all-proteomes Download all proteomes associated with a taxon even if they are no reference proteomes.
-s SPARQL, --sparql SPARQL
Use sparql endpoint instead of REST queries on Uniprot.
-l LIMIT_MAXIMAL_NUMBER_PROTEOMES, --limit-proteomes LIMIT_MAXIMAL_NUMBER_PROTEOMES
Choose the maximal number of proteomes after which the tool will select a subset of proteomes instead of using all the available proteomes (default is 99).
-r RANK_LIMIT, --rank-limit RANK_LIMIT
This option limits the rank used when searching for proteomes. All the ranks superior to the given rank will be ignored. For example, if 'family' is given, only taxon ranks inferior or equal to family will be kept. Look at the readme for more
information (and a list of rank names).
--minimal-nb-proteomes MINIMAL_NUMBER_PROTEOMES
Choose the minimal number of proteomes to be selected by EsMeCaTa. If a taxon has less proteomes, it will be ignored and a higher taxonomic rank will be used. Default is 5.
--update-affiliations
If the taxonomic affiliations were assigned from an outdated taxonomic database, this can lead to taxon not be found in ete4 database. This option tries to update the taxonomic affiliations using the lowest taxon name.
--bioservices Use bioservices instead of esmecata functions for protein annotation.
For each taxon in each taxonomic affiliation, EsMeCaTa uses ete4 to find the corresponding taxon ID. Then it searches for proteomes associated with these taxon IDs in the UniProt Proteomes database.
If the number of proteomes exceeds the limit (99 by default), esmecata applies a subsampling procedure:
- (1) use the taxon ID associated with each proteome to create a taxonomic tree with ete4.
- (2) from the root of the tree (the input taxon), find the direct descendants (sub-taxa).
- (3) compute the number of proteomes associated with each sub-taxon.
- (4) use the corresponding proportions to randomly select, in each sub-taxon, a number of proteomes matching its proportion.
For example, for the taxon Clostridiales, 645 proteomes are found. Using the organism taxon IDs associated with the 645 proteomes, 17 direct sub-taxa are found. Then, for each sub-taxon, we compute the percentage of the proteomes of Clostridiales that belong to this sub-taxon.
There are 198 proteomes associated with the sub-taxon Clostridiaceae, so the percentage is computed as follows: 198 / 645 ≈ 30.7%, rounded down to 30% (a percentage above 1 is rounded down and a percentage below 1 is rounded up, so that sub-taxa with a low proportion are all kept). This 30% is used to randomly select 30 proteomes among the 198 proteomes of Clostridiaceae. The same is done for all the other sub-taxa, which gives a number of proteomes around 100 (here, 102). Because of the rounding (up or down), the total number of proteomes will not be exactly 100, but close to it. The number of proteomes triggering this behaviour is set to 99 by default, but the user can modify it with the -l/--limit-proteomes option.
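The rounding rule of this example can be written as a short sketch. It is not EsMeCaTa's implementation: the function names are hypothetical, the taxonomic tree is replaced by a plain dictionary of proteome IDs per sub-taxon, and only the Clostridiaceae count (198 out of 645) comes from the text, the two other sub-taxa being made up so that the total is 645.

```python
import math
import random


def subsample_counts(proteomes_per_subtaxon):
    """Number of proteomes to draw per sub-taxon, from its percentage of the total."""
    total = sum(len(ids) for ids in proteomes_per_subtaxon.values())
    counts = {}
    for subtaxon, ids in proteomes_per_subtaxon.items():
        percentage = len(ids) / total * 100
        # Above 1%: round down. Below 1%: round up, so small sub-taxa are kept.
        if percentage > 1:
            counts[subtaxon] = math.floor(percentage)
        else:
            counts[subtaxon] = math.ceil(percentage)
    return counts


def subsample(proteomes_per_subtaxon, seed=0):
    """Randomly draw the computed number of proteomes in each sub-taxon."""
    rng = random.Random(seed)
    counts = subsample_counts(proteomes_per_subtaxon)
    return {
        subtaxon: rng.sample(ids, min(counts[subtaxon], len(ids)))
        for subtaxon, ids in proteomes_per_subtaxon.items()
    }


proteomes = {
    "Clostridiaceae": [f"clostridiaceae_{i}" for i in range(198)],
    "subtaxon_B": [f"b_{i}" for i in range(444)],  # hypothetical sub-taxon
    "subtaxon_C": [f"c_{i}" for i in range(3)],    # hypothetical sub-taxon
}
counts = subsample_counts(proteomes)
selection = subsample(proteomes)
print(counts)
```

With these numbers, Clostridiaceae contributes 30 proteomes, the large hypothetical sub-taxon 68, and the small one is rounded up to 1.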
esmecata check options:
-s/--sparql: use SPARQL instead of REST requests
It is possible to avoid using REST queries for esmecata and instead use SPARQL queries. This option needs a link to a SPARQL endpoint containing UniProt. If you want to use the SPARQL endpoint of UniProt, you can use the argument: -s uniprot.
-b/--busco: filter proteomes using BUSCO score (default is 0.8)
It is possible to filter proteomes according to their BUSCO score (from UniProt documentation: The Benchmarking Universal Single-Copy Ortholog (BUSCO) assessment tool is used, for eukaryotic and bacterial proteomes, to provide quantitative measures of UniProt proteome data completeness in terms of expected gene content.). It is a ratio between 0 and 1 reflecting the completeness of the proteomes that esmecata will download. By default, esmecata uses a BUSCO score of 0.80: it will only download proteomes with a BUSCO score of at least 80%.
--ignore-taxadb-update: ignore need to update ete4 taxaDB
If you have an old version of the ete4 NCBI taxonomy database, you can use this option to use esmecata with it.
--all-proteomes: download all proteomes (reference and non-reference)
By default, esmecata will try to download the reference proteomes associated with a taxon. But if you want to download all the proteomes associated with a taxon (even if they are non-reference proteomes), you can use this option. Without this option, non-reference proteomes can also be used if no reference proteomes are found.
-l/--limit-proteomes: choose the number of proteomes above which a subset of proteomes is selected
To avoid working on too many proteomes, esmecata works on a subset of proteomes when there are too many of them (by default, this limit is set to 99 proteomes). With this option, the user can modify the limit.
--minimal-nb-proteomes: choose the minimal number of proteomes that taxon must have to be selected by esmecata (default 1).
To avoid working on too few proteomes, it is possible to give an integer to this option.
With this integer, esmecata will select only taxa associated with at least this number of proteomes.
For example, if you use --minimal-nb-proteomes 10 and the lowest taxon in the taxonomic affiliation is associated with 3 proteomes, it will be ignored and a taxon with a higher taxonomic rank will be used.
-r/--rank-limit: This option limits the rank used when searching for proteomes. All the ranks superior to the given rank will be ignored. For example, if 'family' is given, only taxon ranks inferior or equal to family will be kept.
To avoid working on ranks with too many proteomes (which can have a heavy impact on the number of proteomes downloaded and then on the clustering), it is possible to set a limit on the taxonomic rank used by the tool.
The selected rank is used to find the ranks to keep. For example, if the rank 'phylum' is given, this rank and all the ranks below it (from phylum to isolate) will be kept, and the ranks from superphylum to superkingdom will be ignored when searching for proteomes. The following ranks can be given to this option (from Supplementary Table S3 of PMC7408187):
| Level | Rank |
|---|---|
| 1 | superkingdom (renamed into domain in recent version of NCBI Taxonomy database) |
| 2 | kingdom |
| 3 | subkingdom |
| 4 | superphylum |
| 5 | phylum |
| 6 | subphylum |
| 7 | infraphylum |
| 8 | superclass |
| 9 | class |
| 10 | subclass |
| 11 | infraclass |
| 12 | cohort |
| 13 | subcohort |
| 14 | superorder |
| 15 | order |
| 16 | suborder |
| 17 | infraorder |
| 18 | parvorder |
| 19 | superfamily |
| 20 | family |
| 21 | subfamily |
| 22 | tribe |
| 23 | subtribe |
| 24 | genus |
| 25 | subgenus |
| 26 | section |
| 27 | subsection |
| 28 | series |
| 29 | subseries |
| 30 | species group |
| 31 | species subgroup |
| 32 | species |
| 33 | forma specialis |
| 34 | subspecies |
| 35 | varietas |
| 36 | subvariety |
| 37 | forma |
| 38 | serogroup |
| 39 | serotype |
| 40 | strain |
| 41 | isolate |
Some ranks (which are non-hierarchical) are not used for the moment with this method (so some taxa can be removed even though they are below a kept rank):
| Level | Rank | Note |
|---|---|---|
| | clade | newly introduced, can appear anywhere in the lineage w/o breaking the order |
| | environmental samples | no order below this rank is required |
| | incertae sedis | can appear anywhere in the lineage w/o breaking the order, implies taxa with uncertain placements |
| | unclassified | no order below this rank is required, includes undefined or unspecified names |
| | no rank | applied to nodes not categorized here yet, additional rank and groups names will be released |
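The effect of -r/--rank-limit on the hierarchical ranks can be sketched with a simple list slice. This is only an illustration: `ranks_to_keep` is a hypothetical helper, the list reproduces the 41 hierarchical ranks of the first table, and the non-hierarchical ranks listed just above are deliberately left out.

```python
# Hierarchical ranks from the first table above, from highest to lowest.
RANKS = [
    "superkingdom", "kingdom", "subkingdom", "superphylum", "phylum",
    "subphylum", "infraphylum", "superclass", "class", "subclass",
    "infraclass", "cohort", "subcohort", "superorder", "order", "suborder",
    "infraorder", "parvorder", "superfamily", "family", "subfamily", "tribe",
    "subtribe", "genus", "subgenus", "section", "subsection", "series",
    "subseries", "species group", "species subgroup", "species",
    "forma specialis", "subspecies", "varietas", "subvariety", "forma",
    "serogroup", "serotype", "strain", "isolate",
]


def ranks_to_keep(rank_limit):
    """Return the given rank and every rank below it (hypothetical helper)."""
    return RANKS[RANKS.index(rank_limit):]


kept = ranks_to_keep("phylum")
print(kept[0], kept[-1], len(kept))
```

With 'phylum', the four ranks from superkingdom to superphylum are dropped and the remaining ranks, from phylum to isolate, are kept.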
--bioservices: instead of using REST queries implemented in EsMeCaTa, relies on bioservices API to query UniProt. This requires the bioservices package.
esmecata proteomes: Retrieve proteomes associated with taxonomic affiliation
usage: esmecata proteomes [-h] -i INPUT_FILE -o OUPUT_DIR [-b BUSCO] [--ignore-taxadb-update] [--all-proteomes] [-s SPARQL] [-l LIMIT_MAXIMAL_NUMBER_PROTEOMES] [-r RANK_LIMIT] [--minimal-nb-proteomes MINIMAL_NUMBER_PROTEOMES] [--update-affiliations] [--bioservices]
optional arguments:
-h, --help show this help message and exit
-i INPUT_FILE, --input INPUT_FILE
Input taxon file (excel, tsv or csv) containing a column associating ID to a taxonomic affiliation (separated by ;).
-o OUPUT_DIR, --output OUPUT_DIR
Output directory path.
-b BUSCO, --busco BUSCO
BUSCO percentage between 0 and 1. This will remove all the proteomes without BUSCO score and the score before the selected ratio of completion.
--ignore-taxadb-update
If you have a not up-to-date version of the NCBI taxonomy database with ete4, use this option to bypass the warning message and use the old version.
--all-proteomes Download all proteomes associated with a taxon even if they are no reference proteomes.
-s SPARQL, --sparql SPARQL
Use sparql endpoint instead of REST queries on Uniprot.
-l LIMIT_MAXIMAL_NUMBER_PROTEOMES, --limit-proteomes LIMIT_MAXIMAL_NUMBER_PROTEOMES
Choose the maximal number of proteomes after which the tool will select a subset of proteomes instead of using all the available proteomes (default is 99).
-r RANK_LIMIT, --rank-limit RANK_LIMIT
This option limits the rank used when searching for proteomes. All the ranks superior to the given rank will be ignored. For example, if 'family' is given, only taxon ranks inferior or equal to family will be
kept. Look at the readme for more information (and a list of rank names).
--minimal-nb-proteomes MINIMAL_NUMBER_PROTEOMES
Choose the minimal number of proteomes to be selected by EsMeCaTa. If a taxon has less proteomes, it will be ignored and a higher taxonomic rank will be used. Default is 1.
--update-affiliations
If the taxonomic affiliations were assigned from an outdated taxonomic database, this can lead to taxon not be found in ete4 database. This option tries to update the taxonomic affiliations using the lowest taxon name.
--bioservices Use bioservices instead of esmecata functions for protein annotation.
esmecata proteomes performs the same actions as esmecata check and then downloads the proteomes. For proteins with isoforms, the canonical sequence is retrieved, except when the isoforms are stored in separate UniProt entries.
esmecata clustering: Proteins clustering
usage: esmecata clustering [-h] -i INPUT_DIR -o OUPUT_DIR [-c CPU] [-t THRESHOLD_CLUSTERING] [-m MMSEQS_OPTIONS] [--linclust] [--remove-tmp]
optional arguments:
-h, --help show this help message and exit
-i INPUT_DIR, --input INPUT_DIR
This input folder of clustering is the output folder of proteomes command.
-o OUPUT_DIR, --output OUPUT_DIR
Output directory path.
-c CPU, --cpu CPU CPU number for multiprocessing.
-t THRESHOLD_CLUSTERING, --threshold THRESHOLD_CLUSTERING
Proportion [0 to 1] of proteomes required to occur in a proteins cluster for that cluster to be kept in core proteome assembly.
-m MMSEQS_OPTIONS, --mmseqs MMSEQS_OPTIONS
String containing mmseqs options for cluster command (except --threads which is already set by --cpu command and -v). If nothing is given, esmecata will used the option "--min-seq-id 0.3 -c 0.8"
--linclust Use mmseqs linclust (clustering in linear time) to cluster proteins sequences. It is faster than mmseqs cluster (default behaviour) but less sensitive.
--remove-tmp Delete tmp files to limit the disk space used: files created by mmseqs (in mmseqs_tmp).
For each taxon (a row in the table), EsMeCaTa uses MMseqs2 to cluster the proteins (using an identity of 30% and a coverage of 80%; these values can be changed with the --mmseqs option). Then, if a cluster contains at least one protein from each proteome, it is kept (this threshold can be changed with the --threshold option). The representative proteins of the kept clusters are used: a fasta file of all the representative proteins is created for each taxon.
esmecata clustering options:
-t/--threshold: clustering threshold
It is possible to relax the requirement that a cluster contains at least one protein from each proteome to be kept. With the threshold option, one can give a float between 0 and 1 setting the proportion of proteomes that must be represented in a cluster for it to be kept.
For example, a threshold of 0.8 means that all the clusters in which at least 80% of the proteomes are represented will be kept (for a taxon associated with 10 proteomes, at least 8 proteomes must have a protein in the cluster for the cluster to be kept).
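The threshold test of this example can be expressed in a few lines. This is a minimal sketch under the assumption that each cluster is summarised by the set of proteomes contributing at least one protein to it; `keep_cluster` and the proteome identifiers are hypothetical and not taken from EsMeCaTa's code.

```python
def keep_cluster(proteomes_in_cluster, nb_proteomes, threshold):
    """True if enough proteomes of the taxon are represented in the cluster.

    Hypothetical helper: proteomes_in_cluster is the set of proteome IDs with
    at least one protein in the cluster, nb_proteomes the number of proteomes
    of the taxon, threshold the value given to -t/--threshold.
    """
    return len(proteomes_in_cluster) / nb_proteomes >= threshold


# A taxon with 10 proteomes and a threshold of 0.8, as in the example above.
cluster_with_8 = {f"proteome_{i}" for i in range(8)}
cluster_with_7 = {f"proteome_{i}" for i in range(7)}
print(keep_cluster(cluster_with_8, 10, 0.8))
print(keep_cluster(cluster_with_7, 10, 0.8))
```

A cluster covering 8 of the 10 proteomes passes the test, while one covering only 7 is discarded.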
-c/--cpu: number of CPU for MMseqs2.
You can give a number of CPUs to parallelise MMseqs2.
-m/--mmseqs: mmseqs option to be used for the clustering.
String containing mmseqs options for the cluster command (except --threads, which is already set by the --cpu option, and -v). If nothing is given, esmecata will use the options "--min-seq-id 0.3 -c 0.8". For example, you can give --mmseqs "--min-seq-id 0.8 --kmer-per-seq 80" to ask for a minimal identity between sequences of 80% and 80 k-mers per sequence.
--linclust: replace mmseqs cluster by mmseqs linclust (faster but less sensitive)
Use mmseqs linclust (clustering in linear time) to cluster protein sequences. It is faster than mmseqs cluster (default behaviour) but less sensitive.
--remove-tmp: remove mmseqs files stored in the mmseqs_tmp folder
esmecata annotation: Retrieve protein annotations with eggNOG-mapper
usage: esmecata annotation [-h] -i INPUT_DIR -o OUPUT_DIR -e EGGNOG_DATABASE [-c CPU] [--eggnog-tmp EGGNOG_TMP_DIR]
options:
-h, --help show this help message and exit
-i INPUT_DIR, --input INPUT_DIR
This input folder of annotation is the output folder of clustering command.
-o OUPUT_DIR, --output OUPUT_DIR
Output directory path.
-e EGGNOG_DATABASE, --eggnog EGGNOG_DATABASE
Path to eggnog database.
-c CPU, --cpu CPU CPU number for multiprocessing.
--eggnog-tmp EGGNOG_TMP_DIR
Path to eggnog tmp dir.
Requires eggNOG-mapper in the path and the path to the eggnog database.
This command takes as input the folder created by esmecata clustering and uses especially the reference_proteins_consensus_fasta folder.
This folder contains the consensus protein sequences associated with each protein clusters kept according to the clustering threshold.
These sequences are given as input to eggNOG-mapper.
The number of CPU used by eggNOG-mapper can be modified with -c.
By default, EsMeCaTa uses the option --dbmem of eggNOG-mapper to store the database in memory, this requires around 50G of RAM.
esmecata annotation options:
- -e: path to the eggnog database (required).
- -c: number of CPUs to be used by eggNOG-mapper.
- --eggnog-tmp: path to the folder to store eggnog temporary files (by default it is inside esmecata output folder).
esmecata annotation_uniprot: Retrieve protein annotations with UniProt
usage: esmecata annotation_uniprot [-h] -i INPUT_DIR -o OUPUT_DIR [-s SPARQL] [-p PROPAGATE_ANNOTATION] [--uniref] [--expression] [--annotation-files ANNOTATION_FILES] [--bioservices]
options:
-h, --help show this help message and exit
-i INPUT_DIR, --input INPUT_DIR
This input folder of annotation is the output folder of clustering command.
-o OUPUT_DIR, --output OUPUT_DIR
Output directory path.
-s SPARQL, --sparql SPARQL
Use sparql endpoint instead of REST queries on Uniprot.
-p PROPAGATE_ANNOTATION, --propagate PROPAGATE_ANNOTATION
Proportion [0 to 1] of the occurrence of an annotation to be propagated from the protein of a cluster to the reference protein of the cluster. 0 mean the annotations from all proteins are propagated to the reference and 1 only the annotation
occurring in all the proteins of the cluster (default).
--uniref Use uniref cluster to extract more annotations from the representative member of the cluster associated with the proteins. Needs the --sparql option.
--expression Extract expression information associated with the proteins. Needs the --sparql option.
--annotation-files ANNOTATION_FILES
Use UniProt annotation files (uniprot_trembl.txt and uniprot_sprot.txt) to avoid querying UniProt REST API. Need both paths to these files separated by a ",".
--bioservices Use bioservices instead of esmecata functions for protein annotation.
For each of the protein clusters kept after the clustering, esmecata will look for the annotation (GO terms, EC number, function, gene name, InterPro) in UniProt.
By default, EsMeCaTa looks at the annotations of each protein of a cluster and keeps only the annotations occurring in all the proteins of the cluster (threshold 1 of option -p).
This is like selecting the intersection of the annotations of the cluster. This can be changed with the option -p by giving a float between 0 and 1.
Then esmecata will create a tabulated file for each row of the input file and also a folder containing PathoLogic files that can be used as input for Pathway Tools.
esmecata annotation options:
-s/--sparql: use SPARQL instead of REST requests
It is possible to avoid using REST queries for esmecata and instead use SPARQL queries. This option needs a link to a SPARQL endpoint containing UniProt. If you want to use the UniProt SPARQL endpoint, you can just use: -s uniprot.
-p/--propagate: propagation of annotation
It is possible to modify how the annotations are retrieved. By default, esmecata will take the annotations occurring in all the proteins of the cluster (-p 1). But with the -p option it is possible to propagate annotations from the proteins of the cluster to the reference protein.
This option takes as input a float between 0 and 1 that is used to filter the retrieved annotations. This number is multiplied by the number of proteins in the cluster to obtain a threshold. To keep an annotation, the number of proteins having this annotation in the cluster must be at least equal to the threshold. For example, with a value of 0.5 and a cluster of 10 proteins, an annotation will be kept if 5 or more proteins of the cluster have this annotation.
If the option is set to 0, there is no filter: all the annotations of the proteins of the cluster are propagated to the reference protein (it corresponds to the union of the cluster annotations). This parameter gives the highest number of annotations for proteins. If the option is set to 1, only annotations that are present in all the proteins of a cluster are kept (it corresponds to the intersection of the cluster annotations). This parameter is the most stringent and limits the number of annotations associated with a protein.
For example, for the same taxon, the annotation with the parameter -p 0 leads to the reconstruction of a metabolic network of 1006 reactions whereas the parameter -p 1 creates a metabolic network with 940 reactions (in this example with no use of the -p option, so without annotation propagation, there were also 940 reactions inferred).
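The filtering described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the code used by esmecata: the function name propagate_annotations and the example annotation sets are hypothetical, and the comparison is taken as "at least the threshold" to match the "5 or more proteins out of 10" example.

```python
from collections import Counter


def propagate_annotations(cluster_annotations, propagate=1.0):
    """Return the annotations kept for the reference protein of one cluster.

    cluster_annotations: list of sets, one set of annotation IDs per protein.
    propagate: float in [0, 1], playing the role of the -p option.
    (Hypothetical helper, written for illustration only.)
    """
    if not 0 <= propagate <= 1:
        raise ValueError("propagate must be between 0 and 1")
    # Threshold = proportion multiplied by the number of proteins in the cluster.
    threshold = propagate * len(cluster_annotations)
    # Count in how many proteins of the cluster each annotation occurs.
    counts = Counter(
        annotation
        for annotations in cluster_annotations
        for annotation in annotations
    )
    return {annotation for annotation, count in counts.items() if count >= threshold}


# Hypothetical cluster of 4 proteins.
cluster = [
    {"GO:0016491", "1.1.1.1"},
    {"GO:0016491", "1.1.1.1"},
    {"GO:0016491", "2.7.1.1"},
    {"GO:0016491"},
]
print(sorted(propagate_annotations(cluster, 0)))    # union: ['1.1.1.1', '2.7.1.1', 'GO:0016491']
print(sorted(propagate_annotations(cluster, 0.5)))  # ['1.1.1.1', 'GO:0016491']
print(sorted(propagate_annotations(cluster, 1)))    # intersection: ['GO:0016491']
```

With 0 every annotation seen at least once is kept (union), and with 1 only the annotations shared by all the proteins remain (intersection).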
--uniref: use annotation from uniref
To add more annotations, esmecata can search the UniRef cluster associated with each protein of a taxon. The representative protein of the cluster is then extracted and, if its identity with the protein of interest is greater than 90%, esmecata retrieves its annotations (GO Terms and EC numbers) and propagates them to the protein. At the moment, this option is only usable with the --sparql option.
--expression: extract expression information
With this option, esmecata will extract the expression information associated with a protein. This contains 3 elements: Induction, Tissue specificity and Disruption Phenotype. At this moment, this option is only usable when using the --sparql option.
--annotation-files: use UniProt txt files instead of querying UniProt servers.
As the annotation step needs a high number of queries to UniProt servers when working with hundreds or thousands of taxonomic affiliations, it can fail due to issues with the queries.
A workaround (for example on a cluster) is to use the UniProt flat files containing the protein annotations.
Warning: the TrEMBL file takes a lot of space (around 150G compressed for the version 2022_05 and 700G uncompressed).
One of the downsides of this option is that it needs a lot of memory to index the TrEMBL file (around 32G using Biopython indexing) and it takes several hours to parse it.
But for datasets with thousands of taxonomic affiliations, this can be compensated by the fact that querying the indexed files is more stable than querying a server.
For this option, you should give the paths to the two annotation files (both the Swiss-Prot and the TrEMBL files) separated by a comma, such as: --annotation-files /db/uniprot/UniProt_2022_05/flat/uniprot_sprot.dat,/db/uniprot/UniProt_2022_05/flat/uniprot_trembl.dat.
The names of the files must contain uniprot_sprot and uniprot_trembl to be able to differentiate them.
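Before launching a long run, it can help to check that the value given to --annotation-files follows the naming rule above. The following sketch is a hypothetical helper (split_annotation_files is not part of esmecata) that splits the comma-separated value and identifies each file from its name.

```python
import os


def split_annotation_files(annotation_files):
    """Split a 'path1,path2' value into (swissprot_path, trembl_path).

    Hypothetical helper: it only mirrors the naming rule described in the
    text (file names containing 'uniprot_sprot' and 'uniprot_trembl').
    """
    paths = [path.strip() for path in annotation_files.split(",")]
    if len(paths) != 2:
        raise ValueError("Expected exactly two paths separated by ','")
    sprot = [p for p in paths if "uniprot_sprot" in os.path.basename(p)]
    trembl = [p for p in paths if "uniprot_trembl" in os.path.basename(p)]
    if len(sprot) != 1 or len(trembl) != 1:
        raise ValueError(
            "File names must contain 'uniprot_sprot' and 'uniprot_trembl'"
        )
    return sprot[0], trembl[0]


sprot_path, trembl_path = split_annotation_files(
    "/db/uniprot/UniProt_2022_05/flat/uniprot_sprot.dat,"
    "/db/uniprot/UniProt_2022_05/flat/uniprot_trembl.dat"
)
print(sprot_path)
print(trembl_path)
```

The order of the two paths does not matter in this sketch, as each file is recognised from its name only.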
--bioservices: instead of using REST queries implemented in EsMeCaTa, relies on bioservices API to query UniProt. This requires the bioservices package.
esmecata workflow: Consecutive runs of the three steps by using eggNOG-mapper for the annotation
usage: esmecata workflow [-h] -i INPUT_FILE -o OUPUT_DIR -e EGGNOG_DATABASE [-b BUSCO] [-c CPU] [--ignore-taxadb-update] [--all-proteomes] [-s SPARQL] [--remove-tmp] [-l LIMIT_MAXIMAL_NUMBER_PROTEOMES] [-t THRESHOLD_CLUSTERING] [-m MMSEQS_OPTIONS] [--linclust]
[-r RANK_LIMIT] [--minimal-nb-proteomes MINIMAL_NUMBER_PROTEOMES] [--update-affiliations] [--bioservices] [--eggnog-tmp EGGNOG_TMP_DIR]
options:
-h, --help show this help message and exit
-i INPUT_FILE, --input INPUT_FILE
Input taxon file (excel, tsv or csv) containing a column associating ID to a taxonomic affiliation (separated by ;).
-o OUPUT_DIR, --output OUPUT_DIR
Output directory path.
-e EGGNOG_DATABASE, --eggnog EGGNOG_DATABASE
Path to eggnog database.
-b BUSCO, --busco BUSCO
BUSCO percentage between 0 and 1. This will remove all the proteomes without BUSCO score and the score before the selected ratio of completion.
-c CPU, --cpu CPU CPU number for multiprocessing.
--ignore-taxadb-update
If you have a not up-to-date version of the NCBI taxonomy database with ete4, use this option to bypass the warning message and use the old version.
--all-proteomes Download all proteomes associated with a taxon even if they are no reference proteomes.
-s SPARQL, --sparql SPARQL
Use sparql endpoint instead of REST queries on Uniprot.
--remove-tmp Delete tmp files to limit the disk space used: files created by mmseqs (in mmseqs_tmp).
-l LIMIT_MAXIMAL_NUMBER_PROTEOMES, --limit-proteomes LIMIT_MAXIMAL_NUMBER_PROTEOMES
Choose the maximal number of proteomes after which the tool will select a subset of proteomes instead of using all the available proteomes (default is 99).
-t THRESHOLD_CLUSTERING, --threshold THRESHOLD_CLUSTERING
Proportion [0 to 1] of proteomes required to occur in a proteins cluster for that cluster to be kept in core proteome assembly. Default is 0.5.
-m MMSEQS_OPTIONS, --mmseqs MMSEQS_OPTIONS
String containing mmseqs options for cluster command (except --threads which is already set by --cpu command and -v). If nothing is given, esmecata will used the option "--min-seq-id 0.3 -c 0.8"
--linclust Use mmseqs linclust (clustering in linear time) to cluster proteins sequences. It is faster than mmseqs cluster (default behaviour) but less sensitive.
-r RANK_LIMIT, --rank-limit RANK_LIMIT
This option limits the rank used when searching for proteomes. All the ranks superior to the given rank will be ignored. For example, if 'family' is given, only taxon ranks inferior or equal to family will be kept. Look at the readme for more
information (and a list of rank names).
--minimal-nb-proteomes MINIMAL_NUMBER_PROTEOMES
Choose the minimal number of proteomes to be selected by EsMeCaTa. If a taxon has less proteomes, it will be ignored and a higher taxonomic rank will be used. Default is 5.
--update-affiliations
If the taxonomic affiliations were assigned from an outdated taxonomic database, this can lead to taxon not be found in ete4 database. This option tries to update the taxonomic affiliations using the lowest taxon name.
--bioservices Use bioservices instead of esmecata functions for protein annotation.
--eggnog-tmp EGGNOG_TMP_DIR
Path to eggnog tmp dir.
EsMeCaTa will perform the search for proteomes, the protein clustering and the annotation using eggNOG-mapper.
esmecata workflow_uniprot: Consecutive runs of the three steps
usage: esmecata workflow_uniprot [-h] -i INPUT_FILE -o OUPUT_DIR [-b BUSCO] [-c CPU] [--ignore-taxadb-update] [--all-proteomes] [-s SPARQL] [--remove-tmp] [-l LIMIT_MAXIMAL_NUMBER_PROTEOMES] [-t THRESHOLD_CLUSTERING] [-m MMSEQS_OPTIONS] [--linclust]
[-p PROPAGATE_ANNOTATION] [--uniref] [--expression] [-r RANK_LIMIT] [--minimal-nb-proteomes MINIMAL_NUMBER_PROTEOMES] [--annotation-files ANNOTATION_FILES] [--update-affiliations] [--bioservices]
options:
-h, --help show this help message and exit
-i INPUT_FILE, --input INPUT_FILE
Input taxon file (excel, tsv or csv) containing a column associating ID to a taxonomic affiliation (separated by ;).
-o OUPUT_DIR, --output OUPUT_DIR
Output directory path.
-b BUSCO, --busco BUSCO
BUSCO percentage between 0 and 1. This will remove all the proteomes without BUSCO score and the score before the selected ratio of completion.
-c CPU, --cpu CPU CPU number for multiprocessing.
--ignore-taxadb-update
If you have a not up-to-date version of the NCBI taxonomy database with ete4, use this option to bypass the warning message and use the old version.
--all-proteomes Download all proteomes associated with a taxon even if they are no reference proteomes.
-s SPARQL, --sparql SPARQL
Use sparql endpoint instead of REST queries on Uniprot.
--remove-tmp Delete tmp files to limit the disk space used: files created by mmseqs (in mmseqs_tmp).
-l LIMIT_MAXIMAL_NUMBER_PROTEOMES, --limit-proteomes LIMIT_MAXIMAL_NUMBER_PROTEOMES
Choose the maximal number of proteomes after which the tool will select a subset of proteomes instead of using all the available proteomes (default is 99).
-t THRESHOLD_CLUSTERING, --threshold THRESHOLD_CLUSTERING
Proportion [0 to 1] of proteomes required to occur in a proteins cluster for that cluster to be kept in core proteome assembly. Default is 0.5.
-m MMSEQS_OPTIONS, --mmseqs MMSEQS_OPTIONS
String containing mmseqs options for cluster command (except --threads which is already set by --cpu command and -v). If nothing is given, esmecata will used the option "--min-seq-id 0.3 -c 0.8"
--linclust Use mmseqs linclust (clustering in linear time) to cluster proteins sequences. It is faster than mmseqs cluster (default behaviour) but less sensitive.
-p PROPAGATE_ANNOTATION, --propagate PROPAGATE_ANNOTATION
Proportion [0 to 1] of the occurrence of an annotation to be propagated from the protein of a cluster to the reference protein of the cluster. 0 mean the annotations from all proteins are propagated to the reference and 1 only the annotation
occurring in all the proteins of the cluster (default).
--uniref Use uniref cluster to extract more annotations from the representative member of the cluster associated with the proteins. Needs the --sparql option.
--expression Extract expression information associated with the proteins. Needs the --sparql option.
-r RANK_LIMIT, --rank-limit RANK_LIMIT
This option limits the rank used when searching for proteomes. All the ranks superior to the given rank will be ignored. For example, if 'family' is given, only taxon ranks inferior or equal to family will be kept. Look at the readme for more
information (and a list of rank names).
--minimal-nb-proteomes MINIMAL_NUMBER_PROTEOMES
Choose the minimal number of proteomes to be selected by EsMeCaTa. If a taxon has less proteomes, it will be ignored and a higher taxonomic rank will be used. Default is 5.
--annotation-files ANNOTATION_FILES
Use UniProt annotation files (uniprot_trembl.txt and uniprot_sprot.txt) to avoid querying UniProt REST API. Need both paths to these files separated by a ",".
--update-affiliations
If the taxonomic affiliations were assigned from an outdated taxonomic database, this can lead to taxon not be found in ete4 database. This option tries to update the taxonomic affiliations using the lowest taxon name.
--bioservices Use bioservices instead of esmecata functions for protein annotation.
EsMeCaTa will perform the search for proteomes, the protein clustering and the annotation using UniProt.
EsMeCaTa outputs
EsMeCaTa proteomes
output_folder
├── proteomes_description
│ └── Cluster_1.tsv
│ └── Cluster_1.tsv
├── proteomes
│ └── Proteome_1.faa.gz
│ └── Proteome_2.faa.gz
│ └── Proteome_3.faa.gz
│ └── ...
├── association_taxon_taxID.json
├── empty_proteome.tsv
├── proteome_tax_id.tsv
├── esmecata_proteomes.log
├── esmecata_metadata_proteomes.json
├── stat_number_proteome.tsv
The proteomes_description folder contains the list of proteomes found by esmecata on UniProt and associated with the taxonomic affiliation.
The proteomes folder contains all the proteomes that have been found to be associated with one taxon. It will be used for the clustering step.
association_taxon_taxID.json contains for each observation_name the name of the taxon and the corresponding taxon_id found with ete4.
empty_proteome.tsv contains the UniProt proteome IDs that have been downloaded but are empty.
proteome_tax_id.tsv contains the name, the taxon_id and the proteomes associated with each observation_name.
The file esmecata_proteomes.log contains the log associated with the command.
esmecata_metadata_proteomes.json is a log about the UniProt release used and how the queries were made (REST or SPARQL). It also gets the metadata associated with the command used with esmecata and the dependencies.
stat_number_proteome.tsv is a tabulated file containing the number of proteomes found for each observation name.
EsMeCaTa clustering
output_folder
├── cluster_founds
│ └── Taxon_Name_1.tsv
│ └── ...
├── computed_threshold
│ └── Taxon_Name_1.tsv
│ └── ...
├── mmseqs_tmp (can be cleaned to spare disk space using --remove-tmp option)
│ └── Taxon_Name_1
│ └── mmseqs intermediary files
│ └── ...
│ └── ...
├── reference_proteins
│ └── Taxon_Name_1.tsv
│ └── ...
├── reference_proteins_consensus_fasta
│ └── Taxon_Name_1.faa
│ └── ...
├── reference_proteins_representative_fasta
│ └── Taxon_Name_1.faa
│ └── ...
├── proteome_tax_id.tsv
├── esmecata_clustering.log
├── esmecata_metadata_clustering.json
├── stat_number_clustering.tsv
├── stat_openness_proteomes.tsv
├── taxonomy_diff.tsv
The cluster_founds folder contains one tsv file per taxon name used by EsMeCaTa. Multiple observation_name can thus be represented by the same taxon name, which avoids redundancy and limits the disk space used. These files contain the clustered proteins. The first column contains the representative protein of a cluster and the following columns correspond to the other proteins of the same cluster. The first protein occurs twice: once as the representative member of the cluster and a second time as a member of the cluster.
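As an illustration of this layout, the sketch below reads such a file into a dictionary. It assumes a tab-separated file without header line, which is an assumption and not something stated here; the function name read_clusters and the protein IDs are hypothetical.

```python
import csv
import io


def read_clusters(handle):
    """Map each representative protein to the list of members of its cluster.

    Assumes one cluster per row: first column is the representative protein,
    following columns are the members (the representative is repeated there).
    """
    clusters = {}
    for row in csv.reader(handle, delimiter="\t"):
        if not row:
            continue  # skip empty lines
        representative = row[0]
        members = [protein for protein in row[1:] if protein]
        clusters[representative] = members
    return clusters


# Hypothetical content of a cluster file with two clusters.
example = "P1\tP1\tP2\tP3\nP4\tP4\n"
clusters = read_clusters(io.StringIO(example))
for representative, members in clusters.items():
    print(representative, len(members))
```

To read a real file, the io.StringIO object can be replaced by an opened file handle.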
The computed_threshold folder contains the ratio of proteomes represented in a cluster compared to the total number of proteomes associated with a taxon. If the ratio is equal to 1, it means that all the proteomes are represented by a protein in the cluster, 0.5 means that half of the proteomes are represented in the cluster. This score is used when giving the -t argument.
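The ratio described here can be sketched as follows. This is a simplified illustration, not the implementation of esmecata: the function names, the mapping from protein to proteome and the data are hypothetical, and a cluster is assumed to be kept when its ratio is greater than or equal to the value given to -t (default 0.5).

```python
def proteome_ratio(members, protein_to_proteome, nb_proteomes):
    """Ratio of proteomes having at least one protein in the cluster."""
    represented = {protein_to_proteome[protein] for protein in members}
    return len(represented) / nb_proteomes


def filter_clusters(clusters, protein_to_proteome, nb_proteomes, threshold=0.5):
    """Keep the clusters whose ratio reaches the threshold (role of -t)."""
    return {
        representative: members
        for representative, members in clusters.items()
        if proteome_ratio(members, protein_to_proteome, nb_proteomes) >= threshold
    }


# Hypothetical data: 4 proteomes (UP1 to UP4) and two clusters.
protein_to_proteome = {"P1": "UP1", "P2": "UP2", "P3": "UP3", "P4": "UP1", "P5": "UP4"}
clusters = {"P1": ["P1", "P2", "P3", "P4"], "P5": ["P5"]}

print(proteome_ratio(clusters["P1"], protein_to_proteome, 4))     # 0.75
print(sorted(filter_clusters(clusters, protein_to_proteome, 4)))  # ['P1']
```

Two proteins from the same proteome (P1 and P4 here) count only once, as the ratio is about proteomes and not proteins.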
The mmseqs_tmp folder contains the intermediary files of MMseqs2 for each taxon name. To save disk space, it is recommended to delete it with the option --remove-tmp.
The reference_proteins folder contains one tsv file per taxon name and these files contain the clustered proteins kept after the clustering process. It is similar to cluster_founds but it contains only the proteins kept after clustering and thresholding.
The reference_proteins_consensus_fasta folder contains the consensus proteins associated with a taxon name for the clusters kept after the clustering process.
The reference_proteins_representative_fasta folder contains the representative proteins associated with a taxon name for the clusters kept after the clustering process.
The proteome_tax_id.tsv file is the same as the one created in esmecata proteomes.
The file esmecata_clustering.log contains the log associated with the command.
esmecata_metadata_clustering.json is a log about the metadata associated with the command used with esmecata and the dependencies.
stat_number_clustering.tsv is a tabulated file containing the number of shared proteins found for each observation name.
stat_openness_proteomes.tsv is a tabulated file containing the openness of panproteomes for each observation name. The openness is shown with the alpha values from Tettelin et al. (2008). It is computed with:
$$\text{nb gene families} = k \cdot \text{nb proteomes}^{-\alpha}$$
Where nb gene families is the number of newly discovered protein clusters when adding proteomes and nb proteomes is the number of proteomes.
If alpha is greater than 1, the pangenome is closed: adding more proteomes does not increase the number of newly discovered gene families (here protein clusters). The consensus proteome should be a good estimator of the protein contents for the taxon. If alpha is lower than 1, the pangenome is open: adding more proteomes increases the number of newly discovered gene families. It is possible that genes specific to the taxon are missed and then the estimation is more uncertain.
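To make the relation above more concrete, here is a minimal sketch estimating k and alpha with a least-squares fit on the log-transformed values (log(nb gene families) = log(k) - alpha * log(nb proteomes)). The fitting procedure used by EsMeCaTa may differ; the function name fit_power_law and the synthetic data are assumptions made for this example.

```python
import math


def fit_power_law(nb_proteomes, nb_new_clusters):
    """Estimate (k, alpha) such that nb_new_clusters ~ k * nb_proteomes ** -alpha.

    Uses ordinary least squares on the log-log values, so all the values
    must be strictly positive and at least two different numbers of
    proteomes are needed.
    """
    xs = [math.log(n) for n in nb_proteomes]
    ys = [math.log(c) for c in nb_new_clusters]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    # The slope of the log-log line is -alpha and the intercept is log(k).
    return math.exp(intercept), -slope


# Synthetic data generated with k = 1000 and alpha = 1.5 (closed pangenome).
proteomes = list(range(1, 11))
new_clusters = [1000 * n ** -1.5 for n in proteomes]
k, alpha = fit_power_law(proteomes, new_clusters)
print(round(k), round(alpha, 2))
print("closed" if alpha > 1 else "open")
```

On real data the points do not follow the curve exactly, so the estimated alpha should be read as an indication of the trend and not as an exact value.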
taxonomy_diff.tsv is a tabulated file indicating the taxon selected by EsMeCaTa compared to the lowest taxon in the taxonomic affiliations.
EsMeCaTa annotation
output_folder
├── annotation_reference
│ └── Cluster_1.tsv
│ └── ...
├── eggnog_output
│ └── taxon_rank.emapper.annotations
│ └── taxon_rank.emapper.hits
│ └── taxon_rank.emapper.seed_orthologs
│ └── ...
├── merge_fasta
│ └── taxon_rank.faa
│ └── ...
├── pathologic
│ └── Cluster_1
│ └── Cluster_1.pf
│ └── ...
│ └── taxon_id.tsv
├── function_table.tsv
├── esmecata_annotation.log
├── esmecata_metadata_annotation.json
├── stat_number_annotation.tsv
The eggnog_output folder contains the resulting files of the eggNOG-mapper run, three files for each taxon name. In this way, eggNOG-mapper is run only once per taxon name, which avoids the redundancy of running it on all the different observation_name.
The annotation_reference folder contains the prediction of eggNOG-mapper for the consensus proteins of each observation_name. To create this file, EsMeCaTa finds the taxon name associated with the observation_name and extracts the annotation (EC numbers, GO terms, KEGG reaction).
The merge_fasta folder contains merged protein sequences of the clustering step to speed up the run of eggNOG-mapper.
The pathologic folder contains one sub-folder for each observation_name in which there is one PathoLogic file. There is also a taxon_id.tsv file which corresponds to a modified version of proteome_tax_id.tsv with only the observation_name and the taxon_id. This folder can be used as input to mpwt to reconstruct draft metabolic networks using Pathway Tools PathoLogic.
The file function_table.tsv contains the EC numbers and GO Terms present in each observation name.
The file esmecata_annotation.log contains the log associated with the command.
The esmecata_metadata_annotation.json serves the same purpose as the one used in esmecata proteomes to retrieve metadata about UniProt release at the time of the query. It also gets the metadata associated with the command used with esmecata and the dependencies.
stat_number_annotation.tsv is a tabulated file containing the number of GO Terms and EC numbers found for each observation name.
EsMeCaTa annotation_uniprot
output_folder
├── annotation
│ └── Taxon_name_1.tsv
│ └── ...
├── annotation_reference
│ └── Cluster_1.tsv
│ └── ...
├── expression_annotation (if --expression option)
│ └── Cluster_1.tsv
│ └── ...
├── pathologic
│ └── Cluster_1
│ └── Cluster_1.pf
│ └── ...
│ └── taxon_id.tsv
├── uniref_annotation (if --uniref option)
│ └── Cluster_1.tsv
│ └── ...
├── function_table.tsv
├── esmecata_annotation.log
├── esmecata_metadata_annotation.json
├── stat_number_annotation.tsv
The annotation folder contains a tabulated file for each taxon name (that can be associated with multiple observation_name). It contains the annotation retrieved with UniProt (protein_name, review, GO Terms, EC numbers, InterPros, Rhea IDs and gene name) associated with all the proteins in a proteome or associated with an observation_name.
The annotation_reference folder contains annotation only for the representative proteins, but the annotation of the other proteins of the same cluster can be propagated to the reference protein if the -p option was used.
The expression_annotation folder contains expression annotation for the proteins of a taxon (if the --expression option was used).
The pathologic folder contains one sub-folder for each observation_name in which there is one PathoLogic file. There is also a taxon_id.tsv file which corresponds to a modified version of proteome_tax_id.tsv with only the observation_name and the taxon_id. This folder can be used as input to mpwt to reconstruct draft metabolic networks using Pathway Tools PathoLogic.
The file esmecata_annotation.log contains the log associated with the command.
The esmecata_metadata_annotation.json serves the same purpose as the one used in esmecata proteomes to retrieve metadata about UniProt release at the time of the query. It also gets the metadata associated with the command used with esmecata and the dependencies.
The uniref_annotation contains the annotation from the representative protein of the UniRef cluster associated with the proteins of a taxon (if the --uniref option was used).
stat_number_annotation.tsv is a tabulated file containing the number of GO Terms and EC numbers found for each observation name.
EsMeCaTa workflow
output_folder
├── 0_proteomes
│   ├── proteomes_description
│   │ └── Cluster_1.tsv
│   │ └── Cluster_1.tsv
│   ├── proteomes
│   │ └── Proteome_1.faa.gz
│   │ └── Proteome_2.faa.gz
│   │ └── Proteome_3.faa.gz
│   │ └── ...
│   ├── association_taxon_taxID.json
│   ├── empty_proteome.tsv
│   ├── proteome_tax_id.tsv
│   ├── esmecata_metadata_proteomes.json
│   ├── stat_number_proteome.tsv
│   ├── taxonomy_diff.tsv
├── 1_clustering
│   ├── cluster_founds
│   │ └── Taxon_name_1.tsv
│   │ └── ...
│   ├── computed_threshold
│   │ └── Taxon_name_1.tsv
│   │ └── ...
│   ├── mmseqs_tmp (can be cleaned to spare disk space using --remove-tmp option)
│   │ └── Taxon_name_1
│   │ └── mmseqs intermediary files
│   │ └── ...
│   │ └── ...
│   ├── reference_proteins
│   │ └── Taxon_name_1.tsv
│   │ └── ...
│   ├── reference_proteins_consensus_fasta
│   │ └── Taxon_name_1.faa
│   │ └── ...
│   ├── reference_proteins_representative_fasta
│   │ └── Taxon_name_1.faa
│   │ └── ...
│   ├── proteome_tax_id.tsv
│   ├── esmecata_metadata_clustering.json
│   ├── stat_number_clustering.tsv
├── 2_annotation
│   ├── annotation_reference
│   │ └── Cluster_1.tsv
│   │ └── ...
│   ├── eggnog_output
│   │ └── Taxon_name_1.emapper.annotations
│   │ └── Taxon_name_1.emapper.hits
│   │ └── Taxon_name_1.emapper.seed_orthologs
│   │ └── ...
│   ├── merge_fasta
│   │ └── taxon_rank.faa
│   │ └── ...
│   ├── pathologic
│   │ └── Cluster_1
│   │ └── Cluster_1.pf
│   │ └── ...
│   │ └── taxon_id.tsv
│   ├── function_table.tsv
│   ├── esmecata_annotation.log
│   ├── esmecata_metadata_annotation.json
│   ├── stat_number_annotation.tsv
├── esmecata_workflow.log
├── esmecata_metadata_workflow.json
├── stat_number_workflow.tsv
The files in the folders 0_proteomes, 1_clustering and 2_annotation are the same as those presented in the previous steps.
The file esmecata_workflow.log contains the log associated with the command.
The esmecata_metadata_workflow.json retrieves metadata about UniProt release at the time of the query, the command used and its duration.
stat_number_workflow.tsv is a tabulated file containing the number of proteomes, shared proteins, GO Terms and EC numbers found for each observation name.
esmecata workflow_uniprot has the same output files except that the outputs of the annotation step correspond to the output of esmecata annotation_uniprot.
EsMeCaTa precomputed
The output of esmecata precomputed is similar to the output of esmecata workflow but with fewer results as the database does not contain all the files created by esmecata:
output_folder
├── 0_proteomes
│   ├── association_taxon_taxID.json
│   ├── empty_proteome.tsv
│   ├── proteome_tax_id.tsv
│   ├── esmecata_metadata_proteomes.json
│   ├── stat_number_proteome.tsv
│   ├── taxonomy_diff.tsv
├── 1_clustering
│   ├── computed_threshold
│   │ └── Taxon_name_1.tsv
│   │ └── ...
│   ├── reference_proteins_consensus_fasta
│   │ └── Taxon_name_1.faa
│   │ └── ...
│   ├── proteome_tax_id.tsv
│   ├── esmecata_metadata_clustering.json
│   ├── stat_number_clustering.tsv
├── 2_annotation
│   ├── annotation_reference
│   │ └── Cluster_1.tsv
│   │ └── ...
│   ├── pathologic
│   │ └── Cluster_1
│   │ └── Cluster_1.pf
│   │ └── ...
│   │ └── taxon_id.tsv
│   ├── function_table.tsv
│   ├── esmecata_metadata_annotation.json
│   ├── stat_number_annotation.tsv
├── esmecata_precomputed.log
├── esmecata_metadata_precomputed.json
├── stat_number_precomputed.tsv
Post-processing analysis
EsMeCaTa report
Using the command esmecata_report, it is possible to create an html report summarising the results from an esmecata run (either from workflow or precomputed).
usage: esmecata_report [-h] [--version] {create_report,create_report_proteomes,create_report_clustering,create_report_annotation} ...
Create report files from esmecata output folder. For specific help on each subcommand use: esmecata {cmd} --help
options:
-h, --help show this help message and exit
--version show program's version number and exit
subcommands:
valid subcommands:
{create_report,create_report_proteomes,create_report_clustering,create_report_annotation}
create_report Create report from esmecata output folder of workflow or precomputed subcommands.
create_report_proteomes
Create report from esmecata output folder of proteomes subcommand.
create_report_clustering
Create report from esmecata output folder of clustering subcommand.
create_report_annotation
Create report from esmecata output folder of annotation subcommand.
Requires: arakawa, plotly, kaleido, ontosunburst.
It can be used with this command:
esmecata_report create_report -i input_taxonomic_affiliations.tsv -f esmecata_precomputed_output_folder -o output_folder
It will create several files, especially an esmecata_summary.html showing several figures summarising the results of the EsMeCaTa run.
EsMeCaTa gseapy
An enrichment analysis can be performed to identify functions specific to a phylum compared to the whole community of the input files by using gseapy and orsum.
usage: esmecata_gseapy [-h] [--version] {gseapy_taxon} ...
Create enrichment analysis files from esmecata results. For specific help on each subcommand use: esmecata {cmd} --help
options:
-h, --help show this help message and exit
--version show program's version number and exit
subcommands:
valid subcommands:
{gseapy_enrichr}
gseapy_enrichr Extract enriched functions from taxon using gseapy and orsum.
Requires: bioservices, pronto, gseapy and orsum
usage: esmecata_gseapy gseapy_enrichr [-h] -f INPUT_FOLDER -o OUPUT_DIR --grouping STR [--taxon-rank INPUT_TAXON] [--taxa-list STR] [--function-list STR] [--annot-names INPUT_FILE] [--orsumMinTermSize INT] [--gseapyCutOff FLOAT] [--taxon-id INPUT_FILE]
options:
-h, --help show this help message and exit
-f INPUT_FOLDER, --folder INPUT_FOLDER
Annotation input folder or file. For folder, it corresponds to EsMeCaTa annotation output folder. For file, it corresponds to a function table containing annotation as column and organism as row.
-o OUPUT_DIR, --output OUPUT_DIR
Output directory path.
--grouping STR Grouping factor used for enrichment analysis (either "tax_rank", "selected" or "selected_function").
--taxon-rank INPUT_TAXON
Taxon rank to cluster observation names of EsMeCaTa together (default "phylum").
--taxa-list STR When using value "selected" for option "grouping", you have to give this file indicating the different groups of taxon to compare.
--function-list STR When using value "selected_function" for option "grouping", you have to give this file indicating the different groups of functions to compare.
--annot-names INPUT_FILE
Pathname to json file indicating annotation names.
--orsumMinTermSize INT
MinTermSize of orsum.
--gseapyCutOff FLOAT Adjust-Pval cutoff for gseapy enrichr, default: 0.05 (--cut-off argument of gseapy).
--taxon-id INPUT_FILE
Pathname to taxon ID file indicating taxonomic ID associated with organisms name. Required for tax_rank analysis if the annotation input is a file and not an esmecata annotation folder.
There are currently three ways to use gseapy_enrichr:
- by grouping observation names according to their taxonomic ranks (by default phylum) with the parameter --grouping tax_rank. If you give a function table as input to -f, you will have to provide a taxon ID file with the --taxon-id parameter.
- by grouping observation names into groups defined by the user with a tsv file with the parameter --grouping selected. The tabulated input file is given by the user with the parameter --taxa-list and should look like this:
| Group 1 | Group 2 | Group 3 |
|---|---|---|
| Cluster_1 | Cluster_2 | Cluster_6 |
| Cluster_5 | Cluster_3 | Cluster_7 |
| Cluster_10 | Cluster_4 | Cluster_8 |
- by grouping sets of functions to find in which observation names they are enriched with the parameter --grouping selected_function. If the annotation input corresponds to esmecata annotation results, it expects EC numbers and GO Terms. But by giving a function table instead, you can choose your own annotation type. The tabulated input file showing the groups is given by the user with the parameter --function-list and should look like this:
| Group 1 | Group 2 |
|---|---|
| 1.1.1.86 | 2.8.4.1 |
| GO:2000112 | GO:0009326 |
| 7.2.1.4 | |
| GO:2001141 | |
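Such a grouping file can be written by hand or generated. The sketch below builds a tab-separated file with one column per group from a Python dictionary, padding the shorter groups with empty cells. The layout (header row with the group names, empty cells for shorter groups) is inferred from the tables shown above and should be checked against the example files of the project; the function name write_groups is hypothetical.

```python
import csv
import io
from itertools import zip_longest


def write_groups(groups, handle):
    """Write groups as columns of a tab-separated file.

    groups: dict mapping a group name to the list of its elements
    (observation names for --taxa-list, function IDs for --function-list).
    """
    writer = csv.writer(handle, delimiter="\t", lineterminator="\n")
    writer.writerow(list(groups))
    # zip_longest pads the shorter groups with empty cells.
    for row in zip_longest(*groups.values(), fillvalue=""):
        writer.writerow(row)


function_groups = {
    "Group 1": ["1.1.1.86", "GO:2000112", "7.2.1.4", "GO:2001141"],
    "Group 2": ["2.8.4.1", "GO:0009326"],
}
buffer = io.StringIO()
write_groups(function_groups, buffer)
print(buffer.getvalue())
```

To create a file on disk, the io.StringIO buffer can be replaced by a file opened in write mode with newline="".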
There are two mandatory parameters for the different modes:
- the -f parameter takes as input the annotation folder of esmecata (either the output folder of esmecata annotation or the 2_annotation folder of esmecata workflow) or a tabulated file containing annotations as columns and organisms as rows (such as picrust2 predictions, for example by combining the files combined_EC_predicted.tsv and combined_KO_predicted.tsv; an example can be found in the test_data folder).
- the -o parameter corresponds to the path to the output folder.
Warning: if a tabulated file is given as input (to -f) and you want to use the tax_rank grouping, you should give a taxon ID file to the --taxon-id parameter (example of such file can be found here). Example of taxon ID file (tax_id column corresponds to NCBI taxonomic ID):
| observation_name | tax_id | tax_rank |
|---|---|---|
| org_01 | 2207 | genus |
| org_06 | 1508657 | genus |
So you can either call esmecata_gseapy gseapy_enrichr with tax_rank grouping parameter:
esmecata_gseapy gseapy_enrichr -f esmecata_annotation_output_folder -o output_folder --grouping tax_rank --taxon-rank phylum
Or by using a taxa list file with selected grouping parameter:
esmecata_gseapy gseapy_enrichr -f esmecata_annotation_output_folder -o output_folder --grouping selected --taxa-list manually_selected_groups.tsv
Or by using a function list file with selected_function grouping parameter:
esmecata_gseapy gseapy_enrichr -f esmecata_annotation_output_folder -o output_folder --grouping selected_function --function-list manually_selected_function_groups.tsv
It can also be used by giving an annotation file:
esmecata_gseapy gseapy_enrichr -f function_table.tsv -o output_folder --grouping selected_function --function-list manually_selected_function_groups.tsv --annot-names annotation_names.json
Additional arguments can be given to use gseapy or orsum options such as:
- --gseapyCutOff to set the adjusted p-value cut-off for gseapy enrichr terms (by default it is 0.05).
- --orsumMinTermSize to set the MinTermSize of orsum (the minimum size of the terms to be processed).
- --annot-names to give enriched element names (for example annotation names). It expects a json file, the default one is present here. But you can make your own by putting as key the ID of the annotation/element and as value the annotation/element name. By default, it downloads annotation names for EC numbers, GO Terms and KEGG Orthologs.
This generates several outputs, among them:
- enrichr_module: raw results from gseapy enrichr with subfolders associated with the different groups and the resulting files generated by gseapy. Gseapy enrichr is explained in the gseapy doc.
- enrich_matrix.tsv: resulting from the gseapy enrichr analysis, it shows enriched elements as rows, groups in which they are enriched as columns and the adjusted p-value (inferior to 0.05 or the gseapyCutOff parameter) as value. If an element is not enriched in a group, NA is shown.
- orsum_input_folder: input files generated by esmecata gseapy from gseapy enrichr results to provide input to orsum.
- orsum_output_folder: output files generated by orsum. Description of these outputs can be found on the orsum GitHub page.
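The enrich_matrix.tsv file can then be explored with a few lines of Python. The sketch below lists, for each group, the elements enriched under a given cut-off. It assumes that the first column holds the element IDs and that each other column is a group, as described above; the header name of the first column and the data used here are hypothetical.

```python
import csv
import io


def enriched_by_group(handle, cutoff=0.05):
    """Return a dict mapping each group to its enriched elements.

    Cells containing NA (or empty cells) are treated as 'not enriched'.
    """
    reader = csv.DictReader(handle, delimiter="\t")
    element_column = reader.fieldnames[0]
    enriched = {group: [] for group in reader.fieldnames[1:]}
    for row in reader:
        for group in enriched:
            value = row[group]
            if value in (None, "", "NA"):
                continue
            if float(value) < cutoff:
                enriched[group].append(row[element_column])
    return enriched


# Hypothetical matrix with two groups and two enriched elements.
example = (
    "element\tGroup 1\tGroup 2\n"
    "GO:0009326\t0.001\tNA\n"
    "1.1.1.86\tNA\t0.03\n"
)
print(enriched_by_group(io.StringIO(example)))
print(enriched_by_group(io.StringIO(example), cutoff=0.01))
```

Giving a lower cutoff than the one used during the analysis is a simple way to keep only the most significant elements.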
SPARTA
A computational pipeline has been developed to ease the interpretation of EsMeCaTa predictions when comparing samples or groups. This method relies on Random Forest classifiers to discern groups and outputs variables of importance (either taxa or EsMeCaTa predicted functions). The method has been published in PLOS Computational Biology and is available on GitHub.
Tabigecy and bigecyhmm
EsMeCaTa's consensus proteomes can be used as input to bigecyhmm to predict the impact of microbial communities on a coarse-grained representation of biogeochemical cycles using Hidden Markov Models. Then, with bigecyhmm_visualisation, it is possible to add an abundance file to weight functions by organism abundances.
EsMeCaTa create_db
Create a precomputed database from esmecata output folders or merge existing precomputed databases. This command is mainly intended for the developers of esmecata, to automate the creation of the precomputed database. It can also be used to create a precomputed database from your own esmecata run for reproducibility. For example, it was used to create the precomputed databases for the dataset of the EsMeCaTa article.
usage: esmecata_create_db [-h] [--version] {from_workflow,merge_db} ...
Create database file from esmecata run. For specific help on each subcommand use: esmecata {cmd} --help
options:
-h, --help show this help message and exit
--version show program's version number and exit
subcommands:
valid subcommands:
{from_workflow,merge_db}
from_workflow Create database from esmecata workflow output.
merge_db Merge multiple zip files corresponding to EsMeCaTa databases.
Requires: esmecata, pandas.
It can be used with this command:
esmecata_create_db from_workflow -i esmecata_workflow_output_folder -o output_folder -c 5
The precomputed database (in zip format) will be in the output_folder and named esmecata_database.zip.
To merge several precomputed databases, you can use the following command:
esmecata_create_db merge_db -i esmecata_database_1.zip,esmecata_database_2.zip,esmecata_database_3.zip -o output_folder
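When many databases have to be merged, the comma-separated list can be built from a script. The sketch below is a hypothetical helper, not part of EsMeCaTa. It assumes that the merge_db subcommand listed in the help output takes the comma-separated zip files with -i and the output folder with -o. It only builds the argument list, after checking that each input is a zip archive.

```python
import zipfile
from pathlib import Path


def build_merge_command(database_zips, output_folder):
    """Return the argument list to merge precomputed databases.

    Raises ValueError if an input path is missing or is not a zip archive.
    """
    paths = [Path(path) for path in database_zips]
    for path in paths:
        # is_zipfile returns False for missing files and for non-zip files.
        if not zipfile.is_zipfile(path):
            raise ValueError(f"Not a zip archive: {path}")
    return [
        "esmecata_create_db", "merge_db",
        "-i", ",".join(str(path) for path in paths),
        "-o", str(output_folder),
    ]
```

The returned list can be passed to subprocess.run(command, check=True) to launch the merge.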
Troubleshooting
Issue with incompatible versions of ete4 and UniProt NCBI Taxonomy databases
A common issue encountered when using EsMeCaTa is that the NCBI Taxonomy database present in the ete4 package (and used to parse the input taxonomic affiliations) is different from the one used by UniProt. This can lead to several issues at different steps of EsMeCaTa. A possible solution is to update the NCBI Taxonomy database of ete4 with the following command:
python3 -c "from ete4 import NCBITaxa; ncbi = NCBITaxa(); ncbi.update_taxonomy_database()"
Issue with ete4 trying to reach NCBI server on computer/HPC without internet access
This issue has been discussed in this GitHub issue. After a fresh install of ete4, the first call of EsMeCaTa makes ete4 install its database. To achieve this, ete4 needs an internet connection to download the NCBI Taxonomy database file. This can be an issue on systems without internet access or with limited internet access. A workaround is to generate the ete4 SQL database before using EsMeCaTa. To do this, install ete4, download the NCBI Taxonomy database file (taxdump.tar.gz) and use the following command:
python3 -c "from ete4 import NCBITaxa; ncbi = NCBITaxa(taxdump_file='taxdump.tar.gz')"
This command makes ete4 generate its SQL database from the local taxdump.tar.gz file instead of downloading it from the NCBI server. In this way, you can bypass the need for an internet connection. After this, ete4 should work without an internet connection, which can be useful if you try to work with the esmecata precomputed command (such as with Tabigecy).
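Before copying taxdump.tar.gz to a machine without internet access, it can be worth checking that the archive is complete. The following sketch uses only the Python standard library and does not call ete4. The helper name and the list of required files (names.dmp and nodes.dmp, the core tables of the NCBI taxonomy dump) are assumptions made for this example.

```python
import tarfile
from pathlib import Path

# Core tables of the NCBI taxonomy dump (assumed to be what ete4 needs).
REQUIRED_MEMBERS = ("names.dmp", "nodes.dmp")


def missing_taxdump_members(taxdump_path, required=REQUIRED_MEMBERS):
    """Return the required file names that are absent from the archive."""
    taxdump_path = Path(taxdump_path)
    if not taxdump_path.is_file():
        raise FileNotFoundError(f"taxdump archive not found: {taxdump_path}")
    with tarfile.open(taxdump_path, "r:gz") as archive:
        present = {Path(name).name for name in archive.getnames()}
    return [name for name in required if name not in present]


# Example call:
# missing = missing_taxdump_members("taxdump.tar.gz")
# if missing:
#     print("Incomplete archive, missing:", ", ".join(missing))
```

If the returned list is empty, the archive can be given to ete4 with the command shown above.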
Citation
If you have used EsMeCaTa, please cite:
Arnaud Belcour, Pauline Hamon-Giraud, Alice Mataigne, Baptiste Ruiz, Yann Le Cunff, Jeanne Got, Lorraine Awhangbo, Mégane Lebreton, Clémence Frioux, Simon Dittami, Patrick Dabert, Anne Siegel, Samuel Blanquart. Estimating consensus proteomes and metabolic functions from taxonomic affiliations. bioRxiv 2022.03.16.484574; doi: https://doi.org/10.1101/2022.03.16.484574
If you have used EsMeCaTa precomputed database, please cite:
Arnaud Belcour, Loris Megy, Sylvain Stephant, Caroline Michel, Sétareh Rad, Petra Bombach, Nicole Dopffel, Hidde de Jong and Delphine Ropers. Predicting coarse-grained representations of biogeochemical cycles from metabarcoding data. Bioinformatics, Volume 41, Issue Supplement_1, July 2025, Pages i49–i57, https://doi.org/10.1093/bioinformatics/btaf230
License
This software is licensed under the GNU GPL-3.0-or-later, see the LICENSE file for details.