Microbial Genome Prospecting (MiGenPro) combines phenotype and genomic linked data. MiGenPro serves as a framework for generating machine learning models that predict microbial traits from genome sequences.

Maintainers

mikeloomanx

MiGenPro - Microbial Genome Processing Toolkit

MiGenPro: A flexible linked data framework for phenotype-genotype prediction of microbial traits using machine learning.

Functionalities

Genome annotation based on taxonomy identifiers.
Data formatting and cleaning for microbial genome datasets.
Conversion of raw query data into structured feature and phenotype matrices.
Advanced filtering options to remove low-frequency features or phenotypes.
Parallel processing support for efficient handling of large datasets.
Easy training and prediction with machine learning models on microbial characteristics.
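
As an illustration of the conversion step listed above, the following minimal sketch pivots long-form query rows into a genome-by-feature matrix. It is not MiGenPro's own code: the row layout (genome, protein domain, copy number), the identifiers, and the function name `build_feature_matrix` are all assumptions made for this example.

```python
from collections import defaultdict

# Hypothetical long-form query rows: (genome, protein domain, copy number).
# All identifiers below are made up for illustration.
rows = [
    ("GCA_1", "PF00001", 2),
    ("GCA_1", "PF00002", 1),
    ("GCA_2", "PF00001", 1),
    ("GCA_3", "PF00003", 4),
]


def build_feature_matrix(rows):
    """Pivot (genome, feature, count) rows into a genome x feature table."""
    counts = defaultdict(dict)
    for genome, feature, count in rows:
        counts[genome][feature] = counts[genome].get(feature, 0) + count
    features = sorted({feature for _, feature, _ in rows})
    # Missing genome/feature pairs become 0 so every row has the same width.
    matrix = {g: [counts[g].get(f, 0) for f in features] for g in sorted(counts)}
    return features, matrix


features, matrix = build_feature_matrix(rows)
print("\t".join(["Genome"] + features))
for genome, values in matrix.items():
    print("\t".join([genome] + [str(v) for v in values]))
```

Filling absent domains with 0 keeps every genome row the same length, which is what most machine learning libraries expect as input.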

Quickstart

Installation with pip in a dedicated conda environment

conda create -n migenpro -c bioconda python=3.12.5 pip
conda activate migenpro
pip install migenpro

Run the workflow from phenotype graph to phenotype prediction using the following command:

migenpro --df --gq --ml --annotation \
  --sapp_jar ./binaries/SAPP-2.0.jar \
  --phenotype_query_file sparql_phenotype:demo_gram.sparql \
  --phenotype_hdt_file ./data/bacdive.hdt \
  --genome_query_file sparql_genome:DomainCopyNumber.sparql \
  --abs_frequency 1 \
  --threads 20 \
  --sampling_type SMOTEN --train --predict --output ./demo_output \
  --cwl_file ./binaries/workflow-hub-cwl-runner.cwl

The --param_grids flag can be used to optimise the parameters of the machine learning models. An example JSON file is available at tests/resources/param_grid.json.
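
To give a feel for what a parameter grid expresses, here is a rough sketch that builds one in Python, serialises it to JSON, and counts how many parameter combinations each grid expands to. The layout (model name mapped to parameter name mapped to candidate values), the model names, and the parameter names are assumptions for illustration only; the schema that --param_grids actually expects is the one in tests/resources/param_grid.json.

```python
import itertools
import json

# Assumed layout, not taken from the MiGenPro repository:
# one grid per model, each mapping a parameter name to candidate values.
param_grids = {
    "RandomForestClassifier": {
        "n_estimators": [100, 500],
        "max_depth": [None, 10],
    },
    "DecisionTreeClassifier": {
        "max_depth": [None, 5, 10],
    },
}


def count_combinations(grid):
    """Number of parameter settings an exhaustive search over `grid` would try."""
    return len(list(itertools.product(*grid.values())))


grid_json = json.dumps(param_grids, indent=2)
print(grid_json)
for model_name, grid in param_grids.items():
    print(model_name, count_combinations(grid))
```

Counting combinations before a run is a cheap way to estimate how much longer training will take with a grid than with default parameters.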

The individual steps:

1. Querying phenotype graphs
2. Annotating genomes
3. Querying the annotated genomes
4. Training machine learning models
5. Predicting phenotypes with existing models
6. Feature importance analysis
7. Summarising the results

1. Querying phenotype graphs

migenpro --df \
    --phenotype_query_file sparql_phenotype:demo_gram.sparql \
    --phenotype_hdt_file ./output/bacdive.hdt \
    --abs_frequency 1 \
    --sapp_jar binaries/SAPP-2.0.jar \
    --output ./output/
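
The sketch below shows one way a low-frequency filter on phenotypes could work. It assumes that --abs_frequency is a minimum absolute count per phenotype class, which this README does not spell out; the observations and the function name `filter_by_abs_frequency` are hypothetical.

```python
from collections import Counter

# Hypothetical (genome, phenotype value) pairs, e.g. Gram stain results.
observations = [
    ("GCA_1", "positive"),
    ("GCA_2", "negative"),
    ("GCA_3", "positive"),
    ("GCA_4", "variable"),
]


def filter_by_abs_frequency(observations, min_count):
    """Keep observations whose phenotype class occurs at least `min_count` times."""
    counts = Counter(value for _, value in observations)
    return [(g, v) for g, v in observations if counts[v] >= min_count]


# Assumption: a threshold of 1 keeps every class; 2 drops classes seen once.
kept_all = filter_by_abs_frequency(observations, min_count=1)
kept_frequent = filter_by_abs_frequency(observations, min_count=2)
print(kept_frequent)
```

Under this reading, the value 1 used in the commands here would keep every observed class, and higher values would remove classes too rare to learn from.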

2. Annotating genomes

Genome annotation uses the workflow at https://workflowhub.eu/workflows/1170/ by default. You can change this with the --cwl_file flag and a workflow of your choice, provided that it takes a FASTA file as input. The --threads flag can speed up this step.

migenpro --annotation \
    --genome_query_file sparql_genome:DomainCopyNumber.sparql \
    --sapp_jar ./binaries/SAPP-2.0.jar

3. Querying the annotated genomes

migenpro --gq \
    --genome_query_file path/to/genome_query_file.sparql \
    --sapp_jar binaries/SAPP-2.0.jar \
    --output ./output/

4. Training machine learning models

This step uses the default parameters for training the models. If you wish to optimise the parameters, you can do so with the --param_grids flag. To modify training settings you can use the

migenpro --ml \
      --feature_matrix ./output/feature_matrix.tsv \
      --phenotype_matrix ./output/phenotype_matrix.tsv \
      --output ./output/
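
Training needs the feature and phenotype matrices matched on genome identifier. The following sketch shows that alignment plus a simple stratified train/test split, using only the standard library. The TSV layout (genome identifier in the first column), the example values, and the function names are assumptions; MiGenPro's internal handling may differ.

```python
import csv
import io
import random
from collections import defaultdict

# Hypothetical matrix contents; the real files are feature_matrix.tsv and phenotype_matrix.tsv.
FEATURE_TSV = "Genome\tPF00001\tPF00002\nGCA_1\t2\t0\nGCA_2\t0\t1\nGCA_3\t1\t0\nGCA_4\t0\t3\nGCA_5\t4\t1\n"
PHENOTYPE_TSV = "Genome\tGram\nGCA_1\tpositive\nGCA_2\tnegative\nGCA_3\tpositive\nGCA_4\tnegative\nGCA_6\tpositive\n"


def read_tsv(text):
    """Return (column names, {genome: values}) for a TSV keyed on its first column."""
    reader = csv.reader(io.StringIO(text), delimiter="\t")
    header = next(reader)
    return header[1:], {row[0]: row[1:] for row in reader}


def align(features, phenotypes):
    """Keep only genomes present in both tables, in a fixed order."""
    shared = sorted(features.keys() & phenotypes.keys())
    X = [[int(v) for v in features[g]] for g in shared]
    y = [phenotypes[g][0] for g in shared]
    return shared, X, y


def stratified_split(genomes, y, test_fraction, seed):
    """Split genomes so each phenotype class appears in the test set."""
    by_label = defaultdict(list)
    for genome, label in zip(genomes, y):
        by_label[label].append(genome)
    rng = random.Random(seed)
    train, test = [], []
    for label in sorted(by_label):
        members = by_label[label]
        rng.shuffle(members)
        n_test = max(1, round(len(members) * test_fraction))
        test.extend(members[:n_test])
        train.extend(members[n_test:])
    return sorted(train), sorted(test)


_, feature_rows = read_tsv(FEATURE_TSV)
_, phenotype_rows = read_tsv(PHENOTYPE_TSV)
genomes, X, y = align(feature_rows, phenotype_rows)
train, test = stratified_split(genomes, y, test_fraction=0.5, seed=0)
print(train, test)
```

Genomes that appear in only one of the two tables (GCA_5 and GCA_6 here) are dropped, since a model needs both features and a label for each training example.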

5. Predicting phenotypes with existing models

You can do this through the Docker container or from the source code.

You will need a protein domain matrix of the desired genomes, which you can obtain using the Java code.
For ease of use, the command below uses the Python scripts. The default output directory is "output/mloutput"; to change it, use --output [output_directory_location].

migenpro --ml \
      --feature_matrix ./output/feature_matrix.tsv \
      --phenotype_matrix ./output/phenotype_matrix.tsv \
      --output ./output/

6. Feature importance analysis

migenpro --fi \
        --models path/to/models \
        --feature_matrix path/to/features.tsv \
        --phenotype_matrix path/to/phenotype.tsv \
        --output ./output/
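
This README does not state which importance measure --fi computes. As background, the sketch below implements permutation importance, one common model-agnostic measure: shuffle one feature column and record how much accuracy drops. The toy model, data, and function names are hypothetical stand-ins, not MiGenPro code.

```python
import random


def accuracy(model, X, y):
    """Fraction of rows for which the model predicts the true label."""
    return sum(model(row) == label for row, label in zip(X, y)) / len(y)


def permutation_importance(model, X, y, n_features, seed=0):
    """Accuracy drop per feature after shuffling that feature's column."""
    rng = random.Random(seed)
    baseline = accuracy(model, X, y)
    importances = []
    for j in range(n_features):
        column = [row[j] for row in X]
        rng.shuffle(column)
        permuted = [row[:j] + [value] + row[j + 1:] for row, value in zip(X, column)]
        importances.append(baseline - accuracy(model, permuted, y))
    return importances


def toy_model(row):
    """Stand-in for a trained classifier; it only looks at feature 0."""
    return "positive" if row[0] > 0 else "negative"


X = [[2, 5], [0, 1], [1, 0], [0, 3], [3, 2], [0, 0]]
y = ["positive", "negative", "positive", "negative", "positive", "negative"]
importances = permutation_importance(toy_model, X, y, n_features=2)
print(importances)
```

Because the toy model ignores feature 1, shuffling that column cannot change any prediction, so its importance is exactly zero.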

7. Summarising the results

Wait for the script to finish and retrieve the results of your prediction from the output directory. There the predictions are given in the following format:

Genome	Phenotype	Prediction	Confidence
GCA123	Temperature	mesophilic	0.96

migenpro --summarise \
        --output ./output/
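
The tab-separated prediction format shown above can be read with Python's csv module. This is a minimal sketch: the second data row and the function name `confident_predictions` are made up for illustration, and the text is held in a string because the output file name is not given here.

```python
import csv
import io

# First data row is the example from this README; the second is hypothetical.
PREDICTIONS_TSV = (
    "Genome\tPhenotype\tPrediction\tConfidence\n"
    "GCA123\tTemperature\tmesophilic\t0.96\n"
    "GCA456\tTemperature\tthermophilic\t0.55\n"
)


def confident_predictions(text, min_confidence):
    """Return prediction rows whose confidence is at least `min_confidence`."""
    reader = csv.DictReader(io.StringIO(text), delimiter="\t")
    return [row for row in reader if float(row["Confidence"]) >= min_confidence]


confident = confident_predictions(PREDICTIONS_TSV, min_confidence=0.9)
for row in confident:
    print(row["Genome"], row["Phenotype"], row["Prediction"], row["Confidence"])
```

To use it on real output, read the prediction file from the output directory into a string, or pass an open file handle to csv.DictReader directly.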

Contributing

Clone the git repo:

git clone git@gitlab.com:pig-paradigm/migenpro.git
cd migenpro

Installing the needed dependencies.

A requirements.txt file is located in the installation directory. You can install the dependencies it lists with the following command.

conda create -n migenpro python=3.12.5 --file installation/requirements.txt

Recreating the results from the study

The files needed to recreate our results are available at https://zenodo.org/records/16995284. To recreate the graphs, follow the steps in the notebook /data_visualisation/construct_all_graphs_from_summaries.ipynb.

Maintainers

Jasper J. Koehorst (@jjkoehorst) and Mike Loomans (@MikeLoomans1999)

Release history

0.1.4

Dec 17, 2025

This version

0.1.3

Dec 15, 2025

0.1.2

Sep 4, 2025

0.1.0

Sep 4, 2025

Download files

Source Distribution

migenpro-0.1.3.tar.gz (55.7 kB)

Uploaded Dec 15, 2025 Source

Built Distribution

migenpro-0.1.3-py3-none-any.whl (57.8 kB)

Uploaded Dec 15, 2025 Python 3

File details

Details for the file migenpro-0.1.3.tar.gz.

File metadata

Download URL: migenpro-0.1.3.tar.gz
Upload date: Dec 15, 2025
Size: 55.7 kB
Tags: Source
Uploaded using Trusted Publishing? Yes
Uploaded via: twine/6.2.0 CPython/3.14.2

File hashes

Hashes for migenpro-0.1.3.tar.gz
Algorithm	Hash digest
SHA256	`96196fdd1ba8a9f9207bd202657e8445fac44991cb60a4215eda094e1ec2c795`
MD5	`2fc18e959f3bf658e7a2695b80da318e`
BLAKE2b-256	`aa3715248d08220f34f60a95501903032f5c14e06e5f434e0eddaa87ecbda2cf`
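
A downloaded file can be checked against the SHA256 digest listed above with Python's hashlib. The sketch below is one way to do it; the function names are hypothetical, and the self-check at the end runs on a temporary file so the snippet works without a download.

```python
import hashlib
import os
import tempfile


def sha256_of_file(path, chunk_size=65536):
    """Compute the SHA256 hex digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_published_hash(path, expected_hex):
    """True if the file's SHA256 digest equals the published hex digest."""
    return sha256_of_file(path) == expected_hex.strip().lower()


# Self-check on a throwaway file so the sketch runs without a download.
with tempfile.NamedTemporaryFile(delete=False) as handle:
    handle.write(b"example bytes")
    demo_path = handle.name
try:
    demo_ok = matches_published_hash(demo_path, hashlib.sha256(b"example bytes").hexdigest())
    demo_bad = matches_published_hash(demo_path, "0" * 64)
finally:
    os.remove(demo_path)
print(demo_ok, demo_bad)
```

For the source distribution, call `matches_published_hash("migenpro-0.1.3.tar.gz", ...)` with the SHA256 value from the table above as the second argument.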

File details

Details for the file migenpro-0.1.3-py3-none-any.whl.

File metadata

Download URL: migenpro-0.1.3-py3-none-any.whl
Upload date: Dec 15, 2025
Size: 57.8 kB
Tags: Python 3
Uploaded using Trusted Publishing? Yes
Uploaded via: twine/6.2.0 CPython/3.14.2

File hashes

Hashes for migenpro-0.1.3-py3-none-any.whl
Algorithm	Hash digest
SHA256	`da7267995e9a6300dae2594ad5460a4d37deaa1690bc08598357eb8c8213d3d3`
MD5	`23f28565113755cdb04efaae1bec2af4`
BLAKE2b-256	`e01402ba282445542e9f1de2f2ae906c65570e2097b0f0f07d4d0972c51167fd`
