Predict biogeochemical cycles from protein fasta files.

bigecyhmm: Biogeochemical cycle HMMs search

This package searches protein sequence fasta files for genes associated with biogeochemical cycles. The HMMs come from the METABOLIC article, KEGG (KOfam), Pfam and TIGRFAMs.

Dependencies

bigecyhmm is developed to be as minimalist as possible. It requires:

PyHMMER: to perform HMM search.
Pillow: to create biogeochemical cycle diagrams.

The HMMs used are stored inside the package as a zip file (hmm_files.zip). This makes the Python package a little heavy (around 15 MB), but you do not have to download any other files and can use it directly.

Installation

It can be installed from PyPI:

pip install bigecyhmm

Or it can be installed with pip by cloning the repository:

git clone https://github.com/ArnaudBelcour/bigecyhmm.git

cd bigecyhmm

pip install -e .

Run bigecyhmm

You can use the tool in two ways:

by giving as input a protein fasta file:

bigecyhmm -i protein_sequence.faa -o output_dir

by giving as input a folder containing multiple fasta files:

bigecyhmm -i protein_sequences_folder -o output_dir

There is one option:

-c: the number of cores to use. It is only useful when the input contains several protein fasta files, as each additional core runs the HMM search on a different protein fasta file.
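
When bigecyhmm is called from a script, the command line described above can be assembled programmatically. The following is a minimal sketch and not part of bigecyhmm: the helper names build_bigecyhmm_command and run_bigecyhmm are hypothetical, and only the -i, -o and -c options described in this section are used.

```python
import shutil
import subprocess


def build_bigecyhmm_command(input_path, output_dir, cores=None):
    """Return the bigecyhmm command line as a list of arguments.

    Hypothetical helper: it only uses the -i, -o and -c options
    described in this README.
    """
    command = ["bigecyhmm", "-i", str(input_path), "-o", str(output_dir)]
    if cores is not None:
        # -c is only useful when input_path is a folder of several fasta files.
        command += ["-c", str(cores)]
    return command


def run_bigecyhmm(command):
    """Run the command, failing early if bigecyhmm is not on the PATH."""
    if shutil.which(command[0]) is None:
        raise FileNotFoundError("bigecyhmm executable not found on the PATH")
    subprocess.run(command, check=True)


if __name__ == "__main__":
    command = build_bigecyhmm_command("protein_sequences_folder", "output_dir", cores=4)
    print(" ".join(command))
    # bigecyhmm -i protein_sequences_folder -o output_dir -c 4
```

Passing the arguments as a list avoids shell quoting issues when folder names contain spaces.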

Output of bigecyhmm

It gives as output:

a folder hmm_results: one tsv file per protein fasta file, listing the HMM hits.
function_presence.tsv: a tsv file showing the presence/absence of the generic functions associated with the HMMs that matched, for each input protein file.
a folder diagram_input: the input needed to draw the carbon, nitrogen, sulfur and other cycles with the R script modified from the METABOLIC repository, using the following command: Rscript draw_biogeochemical_cycles.R bigecyhmm_output_folder/diagram_input_folder/ diagram_output TRUE. This script requires the diagram package, which can be installed in R with install.packages('diagram').
a folder diagram_figures: biogeochemical cycle diagrams drawn from the templates located in bigecyhmm/templates.
bigecyhmm.log: log file.
bigecyhmm_metadata.json: bigecyhmm metadata (Python version used, package version used).
pathway_presence.tsv: occurrence of the major metabolic pathways in the different input files.
pathway_presence_hmms.tsv: HMMs with matches for the major metabolic pathways in the different input files.
Total.R_input.txt: ratio of the occurrence of the major metabolic pathways in all communities.
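
After a run, it can be handy to check that the entries listed above were created. The sketch below is only an illustration: the function name missing_outputs is hypothetical, and it assumes that every entry sits directly at the top level of the output folder, which this README does not state (some files may be stored in a subfolder).

```python
from pathlib import Path

# Folder and file names taken from the output list above.
EXPECTED_FOLDERS = ["hmm_results", "diagram_input", "diagram_figures"]
EXPECTED_FILES = [
    "bigecyhmm.log",
    "bigecyhmm_metadata.json",
    "function_presence.tsv",
    "pathway_presence.tsv",
    "pathway_presence_hmms.tsv",
    "Total.R_input.txt",
]


def missing_outputs(output_dir):
    """Return the sorted names of expected entries not found in output_dir.

    Assumption: all entries are looked up at the top level of output_dir.
    """
    output_dir = Path(output_dir)
    missing = [name for name in EXPECTED_FOLDERS if not (output_dir / name).is_dir()]
    missing += [name for name in EXPECTED_FILES if not (output_dir / name).is_file()]
    return sorted(missing)


if __name__ == "__main__":
    # "output_dir" is the folder given to the -o option of bigecyhmm.
    for name in missing_outputs("output_dir"):
        print(f"missing: {name}")
```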

bigecyhmm_visualisation

bigecyhmm comes with a second command, bigecyhmm_visualisation, which creates visualisations of the results.

Creating the figures requires additional dependencies:

pandas
seaborn
plotly
kaleido

Two subcommands are available for bigecyhmm_visualisation:

bigecyhmm_visualisation esmecata: to create visualisations from the EsMeCaTa and bigecyhmm output folders (optionally with an abundance file).
bigecyhmm_visualisation genomes: to create visualisations from the bigecyhmm output folder (optionally with an abundance file).

There are four parameters:

--esmecata: EsMeCaTa output folder associated with the run (as the visualisation works on esmecata results). Only required for bigecyhmm_visualisation esmecata.
--bigecyhmm: bigecyhmm output folder associated with the run. Required for both bigecyhmm_visualisation esmecata and bigecyhmm_visualisation genomes.
--abundance-file: abundance file indicating the abundance of each organism selected by EsMeCaTa. Optional for both bigecyhmm_visualisation esmecata and bigecyhmm_visualisation genomes.
-o: an output folder. Required for both bigecyhmm_visualisation esmecata and bigecyhmm_visualisation genomes.
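
The rules above (which parameters are required for which subcommand) can be summarised in a small helper. This is a sketch under stated assumptions and not part of bigecyhmm: the function build_visualisation_command is hypothetical, and it only uses the two subcommands and four parameters listed above.

```python
def build_visualisation_command(subcommand, bigecyhmm_dir, output_dir,
                                esmecata_dir=None, abundance_file=None):
    """Return the bigecyhmm_visualisation command line as a list of arguments.

    Hypothetical helper that encodes the required/optional rules listed above.
    """
    if subcommand not in ("esmecata", "genomes"):
        raise ValueError("subcommand must be 'esmecata' or 'genomes'")
    command = ["bigecyhmm_visualisation", subcommand]
    if subcommand == "esmecata":
        # --esmecata is only required for the esmecata subcommand.
        if esmecata_dir is None:
            raise ValueError("--esmecata is required for the esmecata subcommand")
        command += ["--esmecata", str(esmecata_dir)]
    # --bigecyhmm and -o are required for both subcommands.
    command += ["--bigecyhmm", str(bigecyhmm_dir)]
    if abundance_file is not None:
        # --abundance-file is optional for both subcommands.
        command += ["--abundance-file", str(abundance_file)]
    command += ["-o", str(output_dir)]
    return command


if __name__ == "__main__":
    # The folder and file names below are placeholders.
    command = build_visualisation_command(
        "genomes", "bigecyhmm_output_folder", "visualisation_output",
        abundance_file="abundance.tsv",
    )
    print(" ".join(command))
```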

Function occurrence and abundance

For visualisation, two values are used to represent the functions. The first is the occurrence: the number of organisms having the function divided by the total number of organisms in the community. If you give an abundance file, a second value is used, the abundance (computed for each sample in the abundance file). The abundance of a function is the sum of the abundances of the organisms having it divided by the sum of the abundances of all organisms in the sample.

For example, consider the function Formate oxidation fdoG in a community. If 20 organisms have this function in a community of 80 organisms, the occurrence of the function is 0.25 (20 / 80). If these 20 organisms have a summed abundance of 600 and the total abundance of all organisms in the community is 1200, the abundance of the function is 0.5 (600 / 1200).
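
The two definitions can be written as a short sketch. It is an illustration of the formulas above, not the code used by bigecyhmm_visualisation: the function names and the organism identifiers are hypothetical, and the per-organism abundances are chosen only to reproduce the totals of the example (600 and 1200).

```python
def function_occurrence(organisms_with_function, all_organisms):
    """Number of organisms having the function divided by the community size."""
    return len(set(organisms_with_function)) / len(set(all_organisms))


def function_abundance(organisms_with_function, abundances):
    """Summed abundance of organisms having the function divided by the total.

    abundances maps each organism of one sample to its abundance.
    """
    total = sum(abundances.values())
    carried = sum(abundances.get(org, 0) for org in set(organisms_with_function))
    return carried / total


if __name__ == "__main__":
    # 80 organisms, the first 20 of which have the function.
    organisms = [f"org_{index}" for index in range(80)]
    carriers = organisms[:20]
    # 20 carriers x 30 = 600 and 60 other organisms x 10 = 600, total 1200.
    abundances = {org: (30 if org in carriers else 10) for org in organisms}

    print(function_occurrence(carriers, organisms))   # 0.25
    print(function_abundance(carriers, abundances))   # 0.5
```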

Output of bigecyhmm_visualisation

Several outputs are created by bigecyhmm_visualisation.

output_folder
├── function_abundance
│   ├── cycle_diagrams_abundance
│   │   ├── sample_1_carbon_cycle.png
│   │   ├── sample_1_nitrogen_cycle.png
│   │   └── ...
│   ├── function_participation
│   │   ├── sample_1.tsv
│   │   └── ...
│   ├── cycle_participation
│   │   ├── sample_1.tsv
│   │   └── ...
│   ├── cycle_abundance_sample.tsv
│   ├── function_abundance_sample.tsv
│   ├── heatmap_abundance_samples.png
│   └── polar_plot_abundance_samples.png
├── function_occurrence
│   ├── cycle_occurence.tsv
│   ├── diagram_carbon_cycle.png
│   ├── diagram_nitrogen_cycle.png
│   ├── diagram_sulfur_cycle.png
│   ├── diagram_other_cycle.png
│   ├── function_occurrence.tsv
│   ├── heatmap_occurrence.png
│   └── polar_plot_occurrence.png
├── bigecyhmm_visualisation.log
└── bigecyhmm_visualisation_metadata.json

function_abundance is a folder containing all visualisations associated with abundance values. It contains:

cycle_diagrams_abundance: a folder containing four cycle diagrams (carbon, nitrogen, sulfur and other) from METABOLIC for each sample of the abundance file. For each sample, the diagrams give the abundance and the relative abundance of the major functions.
function_participation: a folder containing one tabulated file per sample from the abundance file. For each sample, it gives the function abundance associated with each organism in the community.
cycle_participation: a folder containing one tabulated file per sample from the abundance file. For each sample, it gives the cycle abundance associated with each organism in the community.
function_abundance_sample.tsv: a tabulated file containing the abundance ratio of each function in the different samples. Rows correspond to the functions and columns correspond to the samples. It is used to create the heatmap_abundance_samples.png file.
heatmap_abundance_samples.png: a heatmap showing the abundance for all the HMMs searched by bigecyhmm in the different samples.
cycle_abundance_sample.tsv: a tabulated file showing the abundance of major functions in biogeochemical cycles. Rows correspond to the major functions and columns correspond to the samples.
polar_plot_abundance_samples.png: a polar plot showing the abundance of major functions in the samples.
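
The tabulated files can be inspected without the plotting dependencies. The sketch below reads a table laid out like function_abundance_sample.tsv (rows are functions, columns are samples) with the standard csv module. It relies on assumptions not confirmed by this README: the first line is a header holding the sample names, the first column holds the function names, and all other cells are numeric. The helper names and the values in the demonstration are made up.

```python
import csv
import io


def read_function_abundance(tsv_file):
    """Return {sample: {function: abundance}} from an open tab-separated file.

    Assumes a header line with the sample names and function names in the
    first column.
    """
    reader = csv.reader(tsv_file, delimiter="\t")
    header = next(reader)
    samples = header[1:]
    table = {sample: {} for sample in samples}
    for row in reader:
        if not row:
            continue  # skip empty lines
        function_name, values = row[0], row[1:]
        for sample, value in zip(samples, values):
            table[sample][function_name] = float(value)
    return table


def top_functions(table, sample, n=3):
    """Return the n functions with the highest abundance in a sample."""
    ranked = sorted(table[sample].items(), key=lambda item: item[1], reverse=True)
    return ranked[:n]


if __name__ == "__main__":
    # Made-up content standing in for function_abundance_sample.tsv.
    example = io.StringIO(
        "function\tsample_1\tsample_2\n"
        "Formate oxidation fdoG\t0.5\t0.25\n"
        "hypothetical function B\t0.125\t0.75\n"
    )
    table = read_function_abundance(example)
    print(top_functions(table, "sample_2", n=1))
```

With a real run, the same function can be called on the file opened with open(path, newline="").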

function_occurrence is a folder containing all visualisations associated with occurrence values. It contains:

cycle_occurence.tsv: a tabulated file showing the occurrence of major functions in biogeochemical cycles. Rows correspond to the major functions and the column corresponds to the community.
diagram_*.png: diagrams representing the biogeochemical cycles (carbon, nitrogen, sulfur, other) from METABOLIC. They show the number of organisms with predicted major functions and the relative occurrence of these functions.
function_occurrence.tsv: a tabulated file containing the occurrence ratio of each function. Rows correspond to the functions and the column corresponds to the community. It is used to create the heatmap_occurrence.png file.
heatmap_occurrence.png: a heatmap showing the occurrence for all the HMMs searched by bigecyhmm in the community (all the input protein files).
polar_plot_occurrence.png: a polar plot showing the occurrence of major functions in the samples.
swarmplot_function_ratio_community.png: a swarmplot showing the occurrence of major functions in the samples.

bigecyhmm_visualisation.log is a log file.

bigecyhmm_visualisation_metadata.json is a metadata file giving information on the version of the package used.

Citation

If you have used bigecyhmm in an article, please cite:

this GitHub repository for bigecyhmm.
PyHMMER for the search on the HMMs:

Martin Larralde and Georg Zeller. PyHMMER: a python library binding to HMMER for efficient sequence analysis. Bioinformatics, 39(5):btad214, May 2023. https://doi.org/10.1093/bioinformatics/btad214

HMMER website for the search on the HMMs:

HMMER. http://hmmer.org. Accessed: 2022-10-19.

the following articles for the creation of the custom HMMs:

Zhou, Z., Tran, P.Q., Breister, A.M. et al. METABOLIC: high-throughput profiling of microbial genomes for functional traits, metabolism, biogeochemistry, and community-scale functional networks. Microbiome 10, 33 (2022). https://doi.org/10.1186/s40168-021-01213-8

Anantharaman, K., Brown, C., Hug, L. et al. Thousands of microbial genomes shed light on interconnected biogeochemical processes in an aquifer system. Nat Commun 7, 13219 (2016). https://doi.org/10.1038/ncomms13219

the following article for KOfam HMMs:

Takuya Aramaki, Romain Blanc-Mathieu, Hisashi Endo, Koichi Ohkubo, Minoru Kanehisa, Susumu Goto, Hiroyuki Ogata, KofamKOALA: KEGG Ortholog assignment based on profile HMM and adaptive score threshold, Bioinformatics, Volume 36, Issue 7, April 2020, Pages 2251–2252, https://doi.org/10.1093/bioinformatics/btz859

the following article for TIGRfam HMMs:

Jeremy D. Selengut, Daniel H. Haft, Tanja Davidsen, Anuradha Ganapathy, Michelle Gwinn-Giglio, William C. Nelson, Alexander R. Richter, Owen White, TIGRFAMs and Genome Properties: tools for the assignment of molecular function and biological process in prokaryotic genomes, Nucleic Acids Research, Volume 35, Issue suppl_1, 1 January 2007, Pages D260–D264, https://doi.org/10.1093/nar/gkl1043

the following article for Pfam HMMs:

Robert D. Finn, Alex Bateman, Jody Clements, Penelope Coggill, Ruth Y. Eberhardt, Sean R. Eddy, Andreas Heger, Kirstie Hetherington, Liisa Holm, Jaina Mistry, Erik L. L. Sonnhammer, John Tate, Marco Punta, Pfam: the protein families database, Nucleic Acids Research, Volume 42, Issue D1, 1 January 2014, Pages D222–D230, https://doi.org/10.1093/nar/gkt1223

Project details

Maintainers

ARNb

Release history

0.1.8: Sep 30, 2025
0.1.7: Jul 7, 2025
0.1.6: Apr 14, 2025
0.1.5: Jan 29, 2025 (this version)
0.1.4: Jan 8, 2025

