BGC-Prophet

BGC-Prophet is a deep learning approach that uses a language-processing neural network model to accurately identify known biosynthetic gene clusters (BGCs) and extrapolate novel ones.

Installation

Install BGC-Prophet using pip:

pip install bgc-prophet

Alternatively, download the offline installation package from the GitHub release page and install BGC-Prophet with the following command:

pip install bgc_prophet-0.1.0-py3-none-any.whl

BGC-Prophet is developed for Python 3 and uses PyTorch to build the model. GPU devices are recommended to accelerate model inference.
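Before running inference, it can help to check whether PyTorch can see a GPU. The sketch below is a minimal illustration, not part of BGC-Prophet: the helper name pick_device is hypothetical, and whether the command-line tool accepts "cpu" as a device value is an assumption that this README does not confirm.

```python
def pick_device() -> str:
    """Return "cuda" if PyTorch is installed and sees a GPU, else "cpu"."""
    try:
        import torch  # imported lazily so the check also runs without PyTorch
    except ImportError:
        return "cpu"
    return "cuda" if torch.cuda.is_available() else "cpu"


device = pick_device()
print(device)
```

If this prints "cpu" on a machine that has a GPU, the installed PyTorch build is probably a CPU-only build.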

Usage

BGC-Prophet Pipeline

BGC-Prophet can detect and classify BGCs across multiple genomic sequences. The input must be a FASTA file of amino acid sequences, as required by the ESM model. If your FASTA files contain nucleotide sequences, first use Prodigal or a similar gene-prediction tool to obtain the protein-coding sequences. The pipeline runs four steps:

Utilizing the ESM2 model to extract word embeddings for each protein-coding gene in the sequences.
Organizing multiple genomes and splitting them into gene sequences of length 128.
Using a trained detection model to identify BGC genes at a given threshold.
Finally, applying a classification model to categorize the detected BGCs and outputting the results in a CSV file.

bgc_prophet pipeline --genomesDir ./pathtogenomesdirectory/ --modelPath ./pathto/annotator.pt --saveIntermediate --name nameoftask --threshold 0.5 --max_gap 3 --min_count 2 --classifierPath ./pathto/classifier.pt  --classify_t 0.5

Run bgc_prophet pipeline --help for more details on the parameters.
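Because the pipeline expects amino acid sequences, a quick sanity check on the input can catch nucleotide FASTA files before they reach the ESM model. The following is a rough heuristic sketch, not part of BGC-Prophet: the function names, the 0.9 cutoff, and the example records are all illustrative assumptions.

```python
NUCLEOTIDE_LETTERS = set("ACGTUN")


def read_fasta(text):
    """Parse FASTA text into a list of (header, sequence) tuples."""
    records, header, chunks = [], None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:], []
        else:
            chunks.append(line)
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


def looks_like_nucleotide(sequence, cutoff=0.9):
    """Heuristic: True if at least `cutoff` of the letters are A/C/G/T/U/N."""
    letters = [c for c in sequence.upper() if c.isalpha()]
    if not letters:
        return False
    hits = sum(c in NUCLEOTIDE_LETTERS for c in letters)
    return hits / len(letters) >= cutoff


# Made-up records: one protein sequence and one nucleotide sequence.
example = """>gene_1
MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQ
>contig_1
ATGGCGTACGTTAGCCGATAGCTAGCTAACGT
"""
records = read_fasta(example)
for header, seq in records:
    kind = "nucleotide" if looks_like_nucleotide(seq) else "protein"
    print(header, kind)
```

Records flagged as nucleotide would need gene prediction and translation (for example with Prodigal) before being passed to the pipeline.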

Download models

You can download the trained models from the GitHub releases page:

wget https://github.com/HUST-NingKang-Lab/BGC-Prophet/files/12733164/model.tar.gz

The annotator.pt model is used to detect BGC genes, and the classifier.pt model is used to classify BGCs.

Step by Step Operation

Get Embedding

We use the ESM2-8M model to compute an embedding vector for each gene, so the input must be amino acid sequences. The output of the last layer of the ESM model is used as the final word embedding vector for each amino acid sequence. You can use the following command:

bgc_prophet extract esm2_t6_8M_UR50D ./genome.fasta ./lmdb_genomes --toks_per_batch 40960 --include mean

This operation takes the gene context to be explored as input, with each gene represented by its amino acid sequence, and outputs a folder in LMDB format that stores the word embedding vector of each gene.
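To make the idea of one embedding vector per gene concrete, the sketch below shows mean pooling, which averages the per-residue vectors of a protein into a single fixed-length vector. It assumes that "--include mean" requests this kind of pooling, as in the ESM extraction script. The function name and the toy numbers are illustrative only, and the real model produces much longer vectors.

```python
def mean_pool(token_vectors):
    """Average per-residue vectors into one fixed-length gene vector."""
    if not token_vectors:
        raise ValueError("need at least one residue vector")
    dim = len(token_vectors[0])
    if any(len(v) != dim for v in token_vectors):
        raise ValueError("all residue vectors must have the same length")
    n = len(token_vectors)
    return [sum(v[j] for v in token_vectors) / n for j in range(dim)]


# Toy example: a "protein" of 3 residues, each with a 4-dimensional vector.
residues = [
    [1.0, 0.0, 2.0, 4.0],
    [3.0, 0.0, 2.0, 0.0],
    [2.0, 3.0, 2.0, 2.0],
]
gene_vector = mean_pool(residues)
print(gene_vector)  # [2.0, 1.0, 2.0, 2.0]
```

The pooled vector has the same length regardless of protein length, which is what allows each gene to be treated as one "word" in a gene sequence.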

If you need to process multiple FASTA files, specify the "--directory" or "-d" parameter and pass a folder as the FASTA location.

Nonstandard amino acid symbols such as "J" (which denotes either leucine or isoleucine) should be replaced manually with "L" or "I". This replacement has only a minor impact on the resulting gene embeddings.
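The replacement can be scripted instead of done by hand. The sketch below is one possible approach, not part of BGC-Prophet: the function name and the example record are hypothetical. It rewrites "J" only in sequence lines so that FASTA headers containing the letter are left intact.

```python
def replace_ambiguous_j(fasta_text, replacement="L"):
    """Replace 'J' in sequence lines only; header lines are left untouched."""
    if replacement not in ("L", "I"):
        raise ValueError("replacement must be 'L' or 'I'")
    out = []
    for line in fasta_text.splitlines():
        if line.startswith(">"):
            out.append(line)  # keep headers exactly as they are
        else:
            fixed = line.replace("J", replacement)
            fixed = fixed.replace("j", replacement.lower())
            out.append(fixed)
    return "\n".join(out) + "\n"


# Made-up record whose header also contains the letter J.
example = ">geneJ_1 hypothetical protein\nMKJAVLJ\nGGJ\n"
cleaned = replace_ambiguous_j(example)
print(cleaned)
```

Choosing "L" as the default is arbitrary. Pass replacement="I" to substitute isoleucine instead.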

Organize Genomes

Organize multiple genomes into a CSV file, which can then be used to split sequences.

bgc_prophet organize --genomesDir ./genomesFastaDirectory/ --outputPath ./output/ --name organize --threads 10

This operation generates a CSV file (organize.csv) that lists all genomes and their sequence IDs.

Split Sequences

Split the genome into gene sequences of length 128.

bgc_prophet split --genomesPath ./output/organize.csv --outputPath ./output/ --name split --threads 10

This operation produces a CSV file (split.csv) in which every genome is split into gene ID sequences of length 128.
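The splitting step can be pictured as cutting an ordered list of gene IDs into fixed-size windows. The sketch below assumes consecutive, non-overlapping windows with a shorter final window. How BGC-Prophet actually handles the tail (for example by padding) is not stated here, so treat this as an illustration rather than the tool's implementation.

```python
def split_into_windows(gene_ids, window=128):
    """Cut an ordered list of gene IDs into consecutive, non-overlapping windows."""
    if window <= 0:
        raise ValueError("window must be positive")
    return [gene_ids[i:i + window] for i in range(0, len(gene_ids), window)]


# Hypothetical genome with 300 genes.
genome = [f"gene_{i}" for i in range(300)]
windows = split_into_windows(genome)
print([len(w) for w in windows])  # [128, 128, 44]
```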

Gene Prediction

Use a trained detection model to identify BGC genes at a given threshold.

bgc_prophet predict --datasetPath ./output/split.csv --modelPath ./annotator.pt --outputPath ./output/ --lmdbPath ./lmdb_genomes --name prediction --device cuda --saveIntermediate

This command uses the GPU to detect BGC genes. If the 'saveIntermediate' parameter is specified, the prediction results are saved as a NumPy file.

Output format

Merge predicted BGC genes that lie within a distance of 'max_gap' of each other into a single BGC, and filter out BGCs composed of fewer than 'min_count' genes. The command then outputs the BGC with the highest confidence and broadest coverage.

bgc_prophet output --datasetPath ./output/split.csv \
--outputPath ./output/ --loadIntermediate ./output/intermediate_prediction.npy \
--name output --threshold 0.5 --max_gap 3 --min_count 2

The 'TDlabels' column of the dataframe loaded from split.csv is updated, and the result is written to a new CSV file named output.csv.
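The merge-and-filter logic controlled by 'threshold', 'max_gap' and 'min_count' can be sketched as follows. This is an illustration under stated assumptions, not the tool's actual code. It assumes that a gene is positive when its score is at least the threshold, that two positive genes belong to the same BGC when at most 'max_gap' non-positive genes separate them, and that 'min_count' counts positive genes only. The function name and the scores are made up.

```python
def call_clusters(scores, threshold=0.5, max_gap=3, min_count=2):
    """Return (start, end) gene index pairs, end inclusive, for candidate BGCs."""
    # Indices of genes whose score passes the threshold.
    positives = [i for i, s in enumerate(scores) if s >= threshold]
    clusters, current = [], []
    for idx in positives:
        # Start a new cluster when too many non-positive genes lie in between.
        if current and idx - current[-1] - 1 > max_gap:
            clusters.append(current)
            current = []
        current.append(idx)
    if current:
        clusters.append(current)
    # Drop clusters with too few positive genes.
    return [(c[0], c[-1]) for c in clusters if len(c) >= min_count]


# Made-up per-gene scores for a 14-gene window.
scores = [0.1, 0.9, 0.8, 0.2, 0.7, 0.1, 0.1, 0.1, 0.1, 0.1, 0.6, 0.1, 0.95, 0.9]
print(call_clusters(scores))  # [(1, 4), (10, 13)]
```

In this example the five low-scoring genes between index 4 and index 10 exceed max_gap=3, so the positives are reported as two separate clusters. With a larger max_gap they would merge into one.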

Biosynthetic Classify

Apply a trained classifier to categorize the detected BGCs.

bgc_prophet classify --datasetPath ./output.csv \
--classifierPath ./pathto/classifier.pt \
--outputPath ./output/ --lmdbPath ./lmdb_genomes \
--name classify --device cuda

The final output is saved as a CSV file containing the detection and classification results.

Publications

Deciphering the Biosynthetic Potential of Microbial Genomes Using a BGC Language Processing Neural Network Model [bioRxiv] [Nucleic Acids Research]

Maintainers

Name	Email	Organization
Qilong Lai	laiql@connect.hku.hk	PhD student, Department of Computer Science, The University of Hong Kong
Shuai Yao	yaoshuai@stu.pku.edu.cn	PhD student, Academy for Advanced Interdisciplinary Studies, Peking University
Yuguo Zha	hugozha@hust.edu.cn	PhD student, School of Life Science and Technology, Huazhong University of Science & Technology
Haohong Zhang	haohongzh@gmail.com	PhD student, School of Life Science and Technology, Huazhong University of Science & Technology
Kang Ning	ningkang@hust.edu.cn	Professor, School of Life Science and Technology, Huazhong University of Science & Technology

Release history

Version	Release date
0.1.2 (this version)	Jan 9, 2026
0.1.1	Oct 31, 2025

Download files

Source Distributions

No source distribution files are available for this release.

Built Distribution

bgc_prophet-0.1.2-py3-none-any.whl (43.1 kB view details)

Uploaded Jan 9, 2026 Python 3

File details

Details for the file bgc_prophet-0.1.2-py3-none-any.whl.

File metadata

Download URL: bgc_prophet-0.1.2-py3-none-any.whl
Upload date: Jan 9, 2026
Size: 43.1 kB
Tags: Python 3
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.2.0 CPython/3.12.4

File hashes

Hashes for bgc_prophet-0.1.2-py3-none-any.whl
Algorithm	Hash digest
SHA256	`0eae2cd64842bec98b93c3f5ea3b7aa0d45fa34d8f4b449e72c085a1864fb98a`
MD5	`e01bc0878ba25dc288b2869f7903deed`
BLAKE2b-256	`2c373a2989e6e6482d5a4f82ac4f668d2a32f0a997cce6d135f13d62320fa0f0`
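
A downloaded wheel can be checked against the SHA256 digest listed above using Python's standard hashlib module. The sketch below is a minimal example: the helper names are illustrative, and the wheel is assumed to sit in the current working directory under its published file name.

```python
import hashlib
import os

# SHA256 digest copied from the table above.
EXPECTED_SHA256 = "0eae2cd64842bec98b93c3f5ea3b7aa0d45fa34d8f4b449e72c085a1864fb98a"


def sha256_of_file(path, chunk_size=1 << 20):
    """Compute the SHA256 hex digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify(path, expected=EXPECTED_SHA256):
    """Return True if the file's SHA256 digest matches the expected value."""
    return sha256_of_file(path) == expected.lower()


wheel = "bgc_prophet-0.1.2-py3-none-any.whl"
if os.path.exists(wheel):  # only check when the file has been downloaded
    print("hash OK" if verify(wheel) else "hash MISMATCH")
```

A mismatch usually means the download is incomplete or the file is not the one published for this release.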
