Detection of biosynthetic sub-clusters

Project description


iPRESTO (integrated Prediction and Rigorous Exploration of biosynthetic Sub-clusters Tool) is a collection of Python scripts for the detection of gene sub-clusters in a set of Biosynthetic Gene Clusters (BGCs) in GenBank format. BGCs are tokenised by representing each gene as a combination of its Pfam domains, where subPfams are used to increase resolution. Tokenised BGCs are filtered for redundancy using a similarity network with an Adjacency Index of domains as the distance metric. Two methods are used to detect sub-clusters: PRESTO-STAT, which is based on the statistical algorithm from Del Carratore et al. (2019), and the novel method PRESTO-TOP, which uses topic modelling with Latent Dirichlet Allocation (LDA). The sub-clusters found with iPRESTO can then be linked to Natural Product substructures.
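As a rough illustration of the tokenisation and redundancy-filtering idea described above, the sketch below joins the domains of each gene into one token and compares two tokenised BGCs with an adjacency index. The exact definition that iPRESTO uses is not given here, so this assumes one common formulation: the Jaccard index over the sets of unordered pairs of neighbouring domains. The function names and domain lists are hypothetical and are not part of the iPRESTO code base.

```python
def tokenise_gene(domains):
    """Join the domains of one gene into a token; '-' marks a gene without domains."""
    return ";".join(domains) if domains else "-"


def adjacent_pairs(tokenised_bgc):
    """Return the set of unordered pairs of neighbouring domains in a tokenised BGC."""
    domains = [d for gene in tokenised_bgc if gene != "-" for d in gene.split(";")]
    return {tuple(sorted(pair)) for pair in zip(domains, domains[1:])}


def adjacency_index(bgc_a, bgc_b):
    """Assumed formulation: shared adjacent domain pairs divided by all adjacent pairs."""
    pairs_a, pairs_b = adjacent_pairs(bgc_a), adjacent_pairs(bgc_b)
    union = pairs_a | pairs_b
    if not union:
        return 0.0
    return len(pairs_a & pairs_b) / len(union)


# Hypothetical BGCs, each given as a list of genes with their domains.
bgc_a = [tokenise_gene(g) for g in [["PKS_KS", "PKS_AT"], [], ["p450"]]]
bgc_b = [tokenise_gene(g) for g in [["PKS_KS", "PKS_AT"], ["Methyltransf_2"]]]

print(bgc_a)
print(adjacency_index(bgc_a, bgc_b))
```

Under this assumed definition, two BGCs with an index close to 1 share nearly all of their neighbouring domain pairs, so one of them could be treated as redundant.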

Developed by Joris Louwen. Supervisors: Marnix Medema (PI), Justin van der Hooft and Satria Kautsar. All from the Bioinformatics group at Wageningen University.



iPRESTO is used through a few main scripts, which are explained with example commands below. All main scripts have a -h or --help option that lists additional command-line arguments and default values. Generally, the input for an iPRESTO analysis is a directory with BGCs in GenBank format and an hmmpressed pHMM database.

The first script turns the input directory into a csv file with tokenised BGCs (called clusterfile.csv) and filters out redundant BGCs.

python3 -i my_gbk_dir -o output_dir --hmm_path Pfam_A.hmm
        --exclude final -c 12 -e True

The next script performs the PRESTO-STAT method. It can start from the same input as the script above, but it is also possible to start from a clusterfile.csv with the flag --start_from_clusterfile. Redundancy filtering is on by default but can be turned off with --no_redundancy_filtering.

#presto-stat with GBK folder input
python3 -i my_gbk_dir -o output_dir --hmm_path Pfam_A.hmm
        --exclude final -c 12 -e True -p 0.1 --include_list biosynthetic_domains.txt

#presto_stat with clusterfile input
# -i -o and --hmm_path have to be supplied symbolically
python3 my_clusterfile.csv -c 12
        --no_redundancy_filtering -i symbolic -o symbolic --hmm_path symbolic

The next script allows querying a list of statistical sub-clusters, as produced by PRESTO-STAT. The input should be a clusterfile.

python3 -i my_clusterfile.csv -m my_modules.txt
        -c 12 -o my_clusterfile

The next script performs the PRESTO-TOP method. It takes a clusterfile as input and has many command-line options to modify its behaviour, for example in the construction of the LDA model. With the -r option one can query an existing LDA model.

#Creating an LDA model and querying it at the same time.
python3 -i my_clusterfile.csv -o my_output_folder -c 10 -t 1000 -C 3000
        -I 2000 --min_genes 2 -f 0.95 -n 75 --classes my_bgc_classes.txt
        --known_subclusters known_subcl.txt
#Querying an existing model with -r
python3 -i my_clusterfile.csv -o my_output_folder -c 10 -t 1000
        --min_genes 2 -f 0.95 -n 75 --classes my_bgc_classes.txt
        --known_subclusters known_subcl.txt -r my_lda_model_location

The last main script creates visualisations of the sub-cluster output. One can provide one or more BGCs in GenBank format.

#one BGC
python3 --one -f BGC0000052.gbk -c domains_colour_file.tsv
        -d preprocessing_domhits_file.txt -o BGC0000052.html
        -s bgcs_queried_to_presto_stat_modules_list.txt -l bgc_topics.txt
        --include_list biosynthetic_domains.txt
#multiple BGCs
python3 -f file_with_gbk_locations.txt
        -c domains_colour_file.tsv -d preprocessing_domhits_file.txt
        -o BGC0000052.html -s bgcs_queried_to_presto_stat_modules_list.txt
        -l bgc_topics.txt --include_list biosynthetic_domains.txt

In a clusterfile, genes (and the BGC name) are separated by commas, domains in the same gene by semicolons, and genes without domains are represented by a dash.
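The clusterfile layout described above can be read with a few lines of Python. This is a minimal sketch, not the parser used by iPRESTO: it assumes that each line holds one BGC and that the BGC name is the first comma-separated field. The function name parse_clusterfile and the two example lines are hypothetical.

```python
import csv
import io


def parse_clusterfile(handle):
    """Map each BGC name to a list of genes, each gene being a tuple of domains."""
    bgcs = {}
    for row in csv.reader(handle):
        if not row:
            continue  # skip empty lines
        name, genes = row[0], row[1:]
        # A dash marks a gene without domains; it becomes an empty tuple.
        bgcs[name] = [tuple(g.split(";")) if g != "-" else () for g in genes]
    return bgcs


# Hypothetical clusterfile content, for illustration only.
example = io.StringIO(
    "example_bgc_1,PKS_KS;PKS_AT,-,p450\n"
    "example_bgc_2,Methyltransf_2,-\n"
)
clusters = parse_clusterfile(example)
for name, genes in clusters.items():
    print(name, genes)
```

For a real file, the StringIO object could be replaced by an open file handle, for example open("clusterfile.csv", newline="").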


Other scripts fulfil additional roles. subPfams can be created with


iPRESTO is built in Python 3.6. It requires the HMMER suite, as well as some Python packages. The required Python packages are automatically installed when using pip or

#example install with pip
python3 -m pip install --user iPRESTO

#installing without dependencies
python3 -m pip install --user --no-deps iPRESTO

Download files

Files for iPRESTO, version 1.0.3
Filename, size: iPRESTO-1.0.3-py3-none-any.whl (185.8 kB)
File type: Wheel
Python version: py3
