Predict biogeochemical cycles from protein fasta files.

bigecyhmm: Biogeochemical cycle HMMs search

Bigecyhmm is a Python package to search for genes associated with biogeochemical cycles in protein sequence fasta files. It began as a self-contained, lightweight reimplementation of a subtask performed in METABOLIC but has since grown. By default, bigecyhmm searches for enzymes associated with the carbon, sulfur, nitrogen and phosphorus cycles using HMMs from the METABOLIC article, KEGG, PFAM and TIGR. It can also be used with a custom database, in which case it outputs a network representation of the cycle.

1 Dependencies

bigecyhmm is developed to be as minimalist as possible. It requires:

  • PyHMMER: to perform HMM search.
  • Pillow: to create biogeochemical cycle diagrams.

The HMMs used are stored inside the package as a zip file (hmm_files.zip). This makes the Python package a little heavy (around 19 MB), but this way you do not have to download other files and can use it directly.

For bigecyhmm_visualisation, you also need the following packages:

  • pandas: To read the input files.
  • seaborn: to create most of the figures.
  • plotly: to create most of the figures.
  • kaleido: required to create the figure.

For bigecyhmm_custom, you also need the following packages:

  • networkx: to handle a custom biogeochemical cycle as a graph.
  • matplotlib: to automatically create a (rough) visualisation of the cycle.

2 Installation

It can be installed from PyPI:

pip install bigecyhmm

Or it can be installed with pip by cloning the repository:

git clone https://github.com/ArnaudBelcour/bigecyhmm.git

cd bigecyhmm

pip install -e .

For bigecyhmm_visualisation, you also need to run:

pip install pandas seaborn plotly kaleido

For bigecyhmm_custom, you also need to run:

pip install networkx matplotlib

3 bigecyhmm

3.1 Usage

You can use the tool in two ways:

  • by giving as input a protein fasta file:
bigecyhmm -i protein_sequence.faa -o output_dir
  • by giving as input a folder containing multiple fasta files:
bigecyhmm -i protein_sequences_folder -o output_dir

There is one option:

  • -c to indicate the number of cores used. It is only useful if you have multiple protein fasta files, as each additional core is used to run the HMM search on a different protein fasta file.

3.2 Output

It gives as output:

  • a folder hmm_results: one tsv file per protein fasta file, showing the hits.
  • function_presence.tsv: a tsv file showing the presence/absence of generic functions associated with the HMMs that matched.
  • a folder diagram_input: the necessary input to create Carbon, Nitrogen, Sulfur and other cycles with the R script modified from the METABOLIC repository using the following command: Rscript draw_biogeochemical_cycles.R bigecyhmm_output_folder/diagram_input_folder/ diagram_output TRUE. This script requires the diagram package, which can be installed in R with install.packages('diagram').
  • a folder diagram_figures: biogeochemical diagram figures drawn from the templates located in bigecyhmm/templates.
  • bigecyhmm.log: log file.
  • bigecyhmm_metadata.json: bigecyhmm metadata (Python version used, package version used).
  • function_presence.tsv: occurrence of the functions in the different input protein files.
  • pathway_presence.tsv: occurrence of the major metabolic pathways in the different input files.
  • pathway_presence_hmms.tsv: HMMs with matches for the major metabolic pathways in the different input files.
  • Total.R_input.txt: ratio of the occurrence of major metabolic pathways in all the communities.

4 bigecyhmm_visualisation

There is a second command associated with bigecyhmm (bigecyhmm_visualisation), which creates visualisations of the results.

To create the associated figures, there are other dependencies:

  • pandas
  • seaborn
  • plotly
  • kaleido

Two subcommands are available for bigecyhmm_visualisation:

  • bigecyhmm_visualisation esmecata: to create visualisations from EsMeCaTa and bigecyhmm output folders (optionally with an abundance file).
  • bigecyhmm_visualisation genomes: to create visualisations from a bigecyhmm output folder (optionally with an abundance file).

There are four parameters:

  • --esmecata: EsMeCaTa output folder associated with the run (as the visualisation works on esmecata results). Only required for bigecyhmm_visualisation esmecata.
  • --bigecyhmm: bigecyhmm output folder associated with the run. Required for both bigecyhmm_visualisation esmecata and bigecyhmm_visualisation genomes.
  • --abundance-file: abundance file indicating the abundance for each organism selected by EsMeCaTa. Optional for both bigecyhmm_visualisation esmecata and bigecyhmm_visualisation genomes.
  • -o: an output folder. Required for both bigecyhmm_visualisation esmecata and bigecyhmm_visualisation genomes.

4.1 Function occurrence and abundance

For visualisation, two values are used to represent the functions. First, the occurrence, corresponding to the number of organisms having this function divided by the total number of organisms in the community. If you give an abundance file, a second value is used: the abundance (computed for each sample in the abundance file). The abundance of a function is the sum of the abundance of the organisms having it divided by the sum of the abundance of all organisms in the sample.

For example, consider the function Formate oxidation fdoG in a community. If 20 organisms have this function in a community with a total of 80 organisms, the occurrence of this function is 0.25 (20 / 80). Then, let's say that these 20 organisms have a summed abundance of 600 and the total abundance of all organisms in the community is 1200: the abundance of the function is 0.5 (600 / 1200).
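The two ratios can be sketched in a few lines of Python. This is only an illustration of the definitions given in this section, not the code used by bigecyhmm_visualisation; the function names, organism identifiers and individual abundance values are hypothetical, chosen so that the totals match the example.

```python
def function_occurrence(organisms_with_function, all_organisms):
    """Number of organisms having the function divided by the total number of organisms."""
    community = set(all_organisms)
    carriers = set(organisms_with_function) & community
    return len(carriers) / len(community)


def function_abundance(organisms_with_function, sample_abundance):
    """Summed abundance of organisms having the function divided by the
    summed abundance of all organisms in the sample."""
    total = sum(sample_abundance.values())
    carried = sum(sample_abundance.get(org, 0) for org in set(organisms_with_function))
    return carried / total


# Hypothetical community: 80 organisms, 20 of them having the function.
all_organisms = [f"org_{i}" for i in range(80)]
with_function = all_organisms[:20]

# Hypothetical abundances for one sample: the 20 carriers sum to 600,
# the 60 other organisms sum to 600, for a total of 1200.
sample_abundance = {org: 30 for org in with_function}
sample_abundance.update({org: 10 for org in all_organisms[20:]})

print(function_occurrence(with_function, all_organisms))    # 0.25
print(function_abundance(with_function, sample_abundance))  # 0.5
```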

4.2 Output of bigecyhmm_visualisation

Several outputs are created by bigecyhmm_visualisation.

output_folder
├── function_abundance
│   ├── cycle_diagrams_abundance
│   |   └── sample_1_carbon_cycle.png
│   |   └── sample_1_nitrogen_cycle.png
│   |   └── ...
│   ├── function_participation
│   |   └── sample_1.tsv
│   |   └── ...
│   ├── cycle_participation
│   |   └── sample_1.tsv
│   |   └── ...
│   └── barplot_esmecata_found_taxon_sample.png
│   └── barplot_esmecata_found_organism_sample.tsv
│   └── cycle_abundance_sample.tsv
│   └── cycle_abundance_sample_melted.tsv
│   └── cycle_abundance_sample_raw.tsv
│   └── function_abundance_sample.tsv
│   └── heatmap_abundance_samples.png
│   └── hmm_functional_profile.tsv
│   └── polar_plot_abundance_samples.png
├── function_occurrence
│   └── cycle_occurence.tsv
│   └── diagram_carbon_cycle.png
│   └── diagram_nitrogen_cycle.png
│   └── diagram_sulfur_cycle.png
│   └── diagram_other_cycle.png
│   └── function_occurrence.tsv
│   └── function_occurrence_in_organism.tsv
│   └── heatmap_occurrence.png
│   └── pathway_presence_in_organism.tsv
│   └── polar_plot_occurrence.png
├── bigecyhmm_visualisation.log
├── bigecyhmm_visualisation_metadata.json

function_abundance is a folder containing all visualisation associated with abundance values. It contains:

  • cycle_diagrams_abundance: a folder containing 4 cycle diagrams (carbon, sulfur, nitrogen and other) from METABOLIC per sample from the abundance file. For each sample, it gives the abundance and the relative abundance of the major functions.
  • function_participation: a folder containing one tabulated file per sample from the abundance file. For each sample, it gives the function abundance associated with each organism in the community.
  • cycle_participation: a folder containing one tabulated file per sample from the abundance file. For each sample, it gives the cycle abundance associated with each organism in the community.
  • barplot_esmecata_found_taxon_sample.png: a barplot displaying the coverage of EsMeCaTa according to the abundances from samples. Each bar corresponds to a sample, and the y-axis shows the relative abundances of the organisms in the sample. The color indicates which taxonomic rank has been used by EsMeCaTa to predict the consensus proteomes. If EsMeCaTa was not able to predict a consensus proteome, it is displayed in the category Not found. This figure gives an idea of whether there are enough predictions for the different samples in the dataset and at which taxonomic ranks these predictions have been made. It thus allows estimating the quality of the predictions: predictions are better if they are closer to lower taxonomic ranks (genus, family). barplot_esmecata_found_organism_sample.tsv is the input file used to create the figure.
  • function_abundance_sample.tsv: a tabulated file containing the relative abundance of each function according to the abundance of the associated organisms in the different samples. Rows correspond to the functions and columns correspond to the samples. It is used to create the heatmap_abundance_samples.png file. The file hmm_functional_profile.tsv contains the absolute abundance of the functions.
  • heatmap_abundance_samples.png: a heatmap showing the abundance for all the HMMs searched by bigecyhmm in the different samples.
  • cycle_abundance_sample.tsv: a tabulated file showing the relative abundance of major functions in biogeochemical cycles according to the organisms. Rows correspond to the major functions and columns correspond to the samples. The file cycle_abundance_sample_melted.tsv is a melted version of this file. The file cycle_abundance_sample_raw.tsv contains the absolute abundance of the functions.
  • polar_plot_abundance_samples.png: a polar plot showing the abundance of major functions in the samples.

function_occurrence is a folder containing all visualisation associated with occurrence values. It contains:

  • cycle_occurence.tsv: a tabulated file showing the occurrence of major functions in biogeochemical cycles. Rows correspond to the major function and the column corresponds to the community.
  • diagram_*.png: diagrams representing the biogeochemical cycles (carbon, nitrogen, sulfur, other) from METABOLIC. They show the number of organisms with predicted major functions and the relative occurrence of these functions.
  • function_occurrence.tsv: a tabulated file containing the ratio for each function. Rows correspond to the function and the column corresponds to the community. It is used to create the heatmap_occurrence.png file.
  • function_occurrence_in_organism.tsv: a tabulated file containing the occurrence of function in each organism of the samples.
  • heatmap_occurrence.png: a heatmap showing the occurrence for all the HMMs searched by bigecyhmm in the community (all the input protein files).
  • pathway_presence_in_organism.tsv: a tabulated file containing the occurrence of cycle functions in each organism of the samples.
  • polar_plot_occurrence.png: a polar plot showing the occurrence of major functions in the samples.
  • swarmplot_function_ratio_community.png: a swarmplot showing the occurrence of major functions in the samples.

bigecyhmm_visualisation.log is a log file.

bigecyhmm_visualisation_metadata.json is a metadata file giving information on the version of the package used.

5 Custom usage

5.1 Contribution to bigecyhmm internal database

If you are interested in specific functions associated with cycles present in bigecyhmm (carbon, sulfur, nitrogen, phosphorus) and want to propose an addition, you can create an issue or a Pull Request. Depending on the additions or modifications, it will be taken into account. Keep in mind that bigecyhmm's goal is to limit itself to a small internal database. If you want to completely add another cycle, please refer to the next subsection.

5.2 bigecyhmm_custom: using custom database

Warning: This is a prototype.

It is possible to create a completely custom database that is linked to a specific biogeochemical cycle (or metabolic network) using bigecyhmm_custom.

5.2.1 Requirements

This command requires three packages:

  • PyHMMER: to perform HMM search.
  • networkx: to handle the biogeochemical cycle as a graph.
  • matplotlib: to automatically create a (rough) visualisation of the cycle.

5.2.2 Inputs

This command line expects two arguments:

  • -i: an input protein sequence fasta file/folder.
  • -d: a file/folder containing the custom databases. bigecyhmm_custom will iterate over the file/folder to search for every .json file. If it finds one, it will search for the associated .tsv and .zip files (files with the same name and at the same location but with either a tsv or zip extension). The three expected files are listed below:
    • a json file representing the biogeochemical cycle as a bipartite graph with nodes representing metabolites and functions. Examples can be found in the test folder, such as the carbon cycle json file. The hmm field in the function node in the json is mandatory to indicate the HMMs associated with the functions of the cycle. The HMMs are represented as a string with , separating HMMs as an OR relation (meaning these HMMs are redundant) and ; as an AND relation (meaning that both HMMs are required).
    • a zip file containing the HMM profiles (.hmm files) such as the one used by bigecyhmm (hmm_files.zip). If no file is present in the folder, bigecyhmm will use its internal HMM database. You can search for HMMs in the KEGG Ortholog database, Protein Family Models from NIH, PFAM or EggNOG. It is also possible to build them; an example can be found with pyhmmer.
    • a tsv file containing the thresholds for the different HMMs. If no file is present in the folder, bigecyhmm will use its internal template file for thresholds. An example can be found in the bigecyhmm internal database (hmm_table_template.tsv) or in the test folder (hmm_table_template.tsv).
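To illustrate how the , (OR) and ; (AND) separators of the hmm field can be read, here is a minimal Python sketch. It is not the parser used by bigecyhmm_custom. In particular, the precedence between the two separators is an assumption made for this example (the string is treated as an AND of groups, each group being a list of OR alternatives), and the function name and HMM file names are hypothetical.

```python
def function_is_supported(hmm_rule, matched_hmms):
    """Return True if the set of matched HMMs satisfies the rule string.

    Assumption for this sketch: ';' separates groups that are ALL required,
    and ',' separates redundant alternatives inside a group (ANY is enough).
    The actual precedence used by bigecyhmm_custom may differ.
    """
    groups = [group.strip() for group in hmm_rule.split(";") if group.strip()]
    if not groups:
        # An empty rule cannot be satisfied.
        return False
    return all(
        any(alternative.strip() in matched_hmms for alternative in group.split(","))
        for group in groups
    )


# Hypothetical HMM names, for illustration only.
print(function_is_supported("hmm_a.hmm,hmm_b.hmm", {"hmm_b.hmm"}))  # True: one alternative matched
print(function_is_supported("hmm_a.hmm;hmm_b.hmm", {"hmm_b.hmm"}))  # False: both are required
```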

An example with a mini database is present in the test folder.

Here are several examples of inputs:

  • Only a json file, bigecyhmm will use its internal HMM database to search for HMM files from the json file (associated argument -d custom_db_cycle.json):
custom_db_cycle.json
  • A folder with one json file and tsv/zip files (associated argument -d custom_db_cycle):
custom_db_cycle
├── custom_db_cycle.json
├── custom_db_cycle.tsv
├── custom_db_cycle.zip
  • A folder with several json files (associated argument -d custom_db_cycle):
custom_db_cycle
├── carbon_cycle.json
├── carbon_cycle.tsv
├── carbon_cycle.zip
├── nitrogen_cycle.json
├── nitrogen_cycle.tsv
├── nitrogen_cycle.zip
├── sulfur_cycle.json
├── sulfur_cycle.tsv
├── sulfur_cycle.zip

Usage example:

bigecyhmm_custom -i protein_sequences.faa -d custom_db -o output_folder

It can take five optional arguments:

  • --abundance-file: an abundance file containing the abundance of the organisms associated with the protein sequences given as input in different samples.
  • --measure-file: a measurement file containing the measures of metabolites of the biogeochemical cycle in different samples.
  • --esmecata: by giving an esmecata output folder, bigecyhmm_custom maps taxon_id to organism names to associate organism abundance with esmecata predictions.
  • -m: JSON file containing genes associated with protein motifs to check for predictions. This verification comes from the METABOLIC article (you can find information about it in the section Motif validation). The protein motif corresponds to a regex made of amino-acids or X (the latter being any amino-acid). The idea of this verification is to check if an expected amino-acid motif is present in the sequence matching the associated HMM. You can see an example file in the test folder (motif.json). The name of the gene corresponds to the name of its HMM. If no file is given, it will use the default ones from METABOLIC (you can find them here as a dictionary).
  • -p: JSON file containing associations between two genes to check for predictions. This verification comes from the METABOLIC article (you can find information about it in the section Motif validation). It ensures that a sequence is properly associated with a specific HMM and not with another, similar HMM. An example file can be found in the test folder (motif_pair.json). It contains associations between two gene names. The HMM search results of the sequence against these two gene profiles are compared to find the one with the better score. The name of the gene corresponds to the name of its HMM. If no file is given, it will use the default ones from METABOLIC (you can find them here as a dictionary).
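The two checks described for -m and -p can be sketched as follows in Python. This is a simplified illustration, not the implementation of bigecyhmm_custom or METABOLIC: the function names, the motif, the sequences, the gene names and the scores are hypothetical. Replacing X by the regex wildcard "." and keeping a hit on a tie are both assumptions of this sketch.

```python
import re


def sequence_has_motif(sequence, motif):
    """Check that a motif is present in a protein sequence.

    Assumption: X stands for any amino-acid and is translated to the
    regex wildcard '.'; the rest of the motif is used as a regex as-is.
    """
    pattern = re.compile(motif.replace("X", "."))
    return pattern.search(sequence) is not None


def keep_hit(gene, paired_gene, scores):
    """Keep the hit for `gene` only if its HMM score is at least as good
    as the score obtained against the paired, similar gene profile.

    `scores` maps gene names to the score of the same sequence against each
    HMM; a missing score is treated as the worst possible one (assumption).
    """
    worst = float("-inf")
    return scores.get(gene, worst) >= scores.get(paired_gene, worst)


# Hypothetical motif and sequences.
print(sequence_has_motif("MKACAACHG", "CXXCH"))  # True: C, two residues, then CH
print(sequence_has_motif("MKACACHG", "CXXCH"))   # False: only one residue between the C's

# Hypothetical scores of one sequence against two similar HMM profiles.
print(keep_hit("gene_a", "gene_b", {"gene_a": 120.0, "gene_b": 80.0}))  # True
```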

5.2.3 Outputs

bigecyhmm_custom creates, inside the output folder, one folder per input custom json file. It outputs files similar to the classical bigecyhmm output, except for the cycle visualisation. As it is more difficult to provide a direct visualisation from a custom database, bigecyhmm_custom relies on a network representation to create these visualisations. To do so, it creates a network file (cycle_diagram_bipartite_occurrence.graphml) as output. This file can be used in network software (such as Cytoscape or pyvis) to generate visualisations. It also tries to create a visualisation with networkx and matplotlib, but the result is not very good.

If you have given abundance and measure files, a second network file (cycle_diagram_bipartite_abundance.graphml) is created, where function nodes are associated with the summed abundance of organisms in the different samples and metabolite nodes are associated with their measures in the different samples.
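Besides Cytoscape or pyvis, a GraphML file can be inspected with the Python standard library alone. The sketch below lists the nodes and edges of a GraphML document. The embedded example graph is hypothetical (node names invented for illustration) and does not reproduce the attributes actually written by bigecyhmm_custom. To read a real output, you would pass the content of the graphml file instead of the EXAMPLE string.

```python
import xml.etree.ElementTree as ET

# Standard XML namespace of GraphML documents.
GRAPHML_NS = {"g": "http://graphml.graphdrawing.org/xmlns"}


def read_graphml(text):
    """Return the node identifiers and (source, target) edges of a GraphML string."""
    root = ET.fromstring(text)
    nodes = [node.get("id") for node in root.findall(".//g:node", GRAPHML_NS)]
    edges = [
        (edge.get("source"), edge.get("target"))
        for edge in root.findall(".//g:edge", GRAPHML_NS)
    ]
    return nodes, edges


# Hypothetical bipartite graph: two metabolite nodes linked through one function node.
EXAMPLE = """<graphml xmlns="http://graphml.graphdrawing.org/xmlns">
  <graph edgedefault="directed">
    <node id="metabolite_1"/>
    <node id="function_1"/>
    <node id="metabolite_2"/>
    <edge source="metabolite_1" target="function_1"/>
    <edge source="function_1" target="metabolite_2"/>
  </graph>
</graphml>"""

nodes, edges = read_graphml(EXAMPLE)
print(nodes)
print(edges)
```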

6 Citation

If you have used bigecyhmm in an article, please cite:

Arnaud Belcour, Loris Megy, Sylvain Stephant, Caroline Michel, Sétareh Rad, Petra Bombach, Nicole Dopffel, Hidde de Jong and Delphine Ropers. Predicting coarse-grained representations of biogeochemical cycles from metabarcoding data. bioRxiv 2025.01.30.635649, 2025. https://doi.org/10.1101/2025.01.30.635649

  • PyHMMER for the search on the HMMs:

Martin Larralde and Georg Zeller. PyHMMER: a python library binding to HMMER for efficient sequence analysis. Bioinformatics, 39(5):btad214, 2023. https://doi.org/10.1093/bioinformatics/btad214

  • HMMer website for the search on the HMMs:

HMMER. http://hmmer.org. Accessed: 2022-10-19.

  • the following articles for the creation of the custom HMMs:

Zhou, Z., Tran, P.Q., Breister, A.M. et al. METABOLIC: high-throughput profiling of microbial genomes for functional traits, metabolism, biogeochemistry, and community-scale functional networks. Microbiome 10, 33, 2022. https://doi.org/10.1186/s40168-021-01213-8

Anantharaman, K., Brown, C., Hug, L. et al. Thousands of microbial genomes shed light on interconnected biogeochemical processes in an aquifer system. Nat Commun 7, 13219, 2016. https://doi.org/10.1038/ncomms13219

  • the following article for KOfam HMMs:

Takuya Aramaki, Romain Blanc-Mathieu, Hisashi Endo, Koichi Ohkubo, Minoru Kanehisa, Susumu Goto, Hiroyuki Ogata, KofamKOALA: KEGG Ortholog assignment based on profile HMM and adaptive score threshold, Bioinformatics, Volume 36, Issue 7, 2020, Pages 2251–2252, https://doi.org/10.1093/bioinformatics/btz859

  • the following article for TIGRfam HMMs:

Jeremy D. Selengut, Daniel H. Haft, Tanja Davidsen, Anurhada Ganapathy, Michelle Gwinn-Giglio, William C. Nelson, Alexander R. Richter, Owen White, TIGRFAMs and Genome Properties: tools for the assignment of molecular function and biological process in prokaryotic genomes, Nucleic Acids Research, Volume 35, Issue suppl_1, 2007, Pages D260–D264, https://doi.org/10.1093/nar/gkl1043

  • the following article for Pfam HMMs:

Robert D. Finn, Alex Bateman, Jody Clements, Penelope Coggill, Ruth Y. Eberhardt, Sean R. Eddy, Andreas Heger, Kirstie Hetherington, Liisa Holm, Jaina Mistry, Erik L. L. Sonnhammer, John Tate, Marco Punta, Pfam: the protein families database, Nucleic Acids Research, Volume 42, Issue D1, 2014, Pages D222–D230, https://doi.org/10.1093/nar/gkt1223

  • the following articles for phosphorus cycle:

Boden, J.S., Zhong, J., Anderson, R.E. et al. Timing the evolution of phosphorus-cycling enzymes through geological time using phylogenomics. Nature Communications, 15, 3703 (2024). https://doi.org/10.1038/s41467-024-47914-0

Siles, J. A., Starke, R., Martinovic, T., Fernandes, M. L. P., Orgiazzi, A., & Bastida, F. Distribution of phosphorus cycling genes across land uses and microbial taxonomic groups based on metagenome and genome mining. Soil Biology and Biochemistry, 174, 108826, 2022. https://doi.org/10.1016/j.soilbio.2022.108826
