Skip to main content

Detection of biosynthetic sub-clusters

Project description

iPRESTO

iPRESTO (integrated Prediction and Rigorous Exploration of biosynthetic Sub-clusters Tool) is a command line tool for the detection of gene sub-clusters in a set of Biosynthetic Gene Clusters (BGCs) in GenBank format. BGCs are tokenised by representing each gene as a combination of its Pfam domains, where subPfams are used to increase resolution. Tokenised BGCs are filtered for redundancy using similarity network with an Adjacency Index of domains as a distance metric. For the detection of sub-clusters two methods are used: PRESTO-STAT, which is based on the statistical algorithm from Del Carratore et al. (2019), and the novel method PRESTO-TOP, which uses topic modelling with Latent Dirichlet Allocation. The sub-clusters found with iPRESTO can then be linked to Natural Product substructures.

Developed by:

Joris J.R. Louwen1, Satria A. Kautsar1, Sven van der Burg2, Marnix H. Medema1*, Justin J.J. van der Hooft1,3*

  1. Bioinformatics Group, Wageningen University, Wageningen, the Netherlands
  2. Netherlands eScience Center, Amsterdam, the Netherlands
  3. Department of Biochemistry, University of Johannesburg, Johannesburg, South Africa

*Corresponding authors

E-mail: marnix.medema@wur.nl, justin.vanderhooft@wur.nl

Workflow

Dependencies

iPRESTO is build and tested in python3.6. The required python packages are automatically installed when using pip or setup.py. We recommend installing iPRESTO in a conda environment like so:

# create new environment
conda create -n ipresto python=3.6

# activate new environment
conda activate ipresto

# install ipresto and dependencies
python -m pip install iPRESTO

iPRESTO also requires the HMMER suite. If HMMER is not installed on your system, it can be installed with conda in your ipresto environment:

conda install -c bioconda hmmer

Querying existing sub-cluster models

The sub-cluster models that were created in the publication can be downloaded from Zenodo at https://doi.org/10.5281/zenodo.6953657. They can then be used to query your own BGCs for sub-clusters. At that link you can also find the HMMs used in the publication (with the subPfam HMMs) and an example clusterfile with tokenised BGCs from the antiSMASH-DB dataset.

Usage

ipresto.py executes the main functionalities of iPRESTO. See below for example commands. Toggle -h or --help for additional command line arguments and default values (also available for many other additional scripts). Generally, the input for iPRESTO analysis is a directory with BGCs in GenBank format, and a hmmpressed pHMM database.

ipresto.py performs pre-processing, the PRESTO-STAT and PRESTO-TOP methods, and visualisation. It takes a directory of BGCs in GenBank format and a hmmpressed pHMM database as input. Redundancy filtering is on by default but can be turned of by toggling --no_redundancy_filtering. To query existing sub-cluster models based on the publication:

#ipresto with GBK folder input querying existing models
python ipresto.py -i my_gbk_dir -o output_dir --hmm_path Pfam_100subs_tc.hmm -c 12
        --include_list files/biosynthetic_domains.txt  # only include biosyntetic domains (see publication)
        --stat_subclusters PRESTO-STAT_subclusters.txt
        --top_motifs_model lda_model
        --no_redundancy_filtering  # probably you want to query all BGCs

It is also possible to start from a clusterfile.csv with the flag --start_from_clusterfile to not have to read the domtables everytime.

#ipresto with clusterfile input querying existing models
# -i and --hmm_path have to be supplied symbolically
python ipresto.py --start_from_clusterfile my_clusterfile.csv -o output_dir -c 12
        --include_list files/biosynthetic_domains.txt  # only include biosyntetic domains (see publication)
        --stat_subclusters PRESTO-STAT_subclusters.txt
        --top_motifs_model lda_model
        --no_redundancy_filtering  # probably you want to query all BGCs
        -i bla --hmm_path bla  # to prevent ipresto from crashing (will get updated in the future)

Omitting --stat_subclusters or --top_motifs_model will result in PRESTO-STAT and PRESTO-TOP detecting new sub-cluster models in your set of input BGCs. When building new models you probably want to filter out redundant BGCs (so omit the --no_redundancy_filtering flag).

#ipresto with GBK folder input creating new sub-cluster models
python ipresto.py -i my_gbk_dir -o output_dir --hmm_path Pfam_100subs_tc.hmm -c 12
        --include_list files/biosynthetic_domains.txt  # only include biosyntetic domains (see publication)
        -p 0.2  # you can override the defualt p-value for PRESTO-STAT
        -t 500  # you can override the defualt number of topics for PRESTO-TOP

Visualisations of the sub-cluster output are made by ipresto.py, but visualisations can also be made separately using subcluster_arrower.py. One can provide one or more BGCs in GenBank format.

#one BGC
python3 ipresto/subcluster_arrower.py --one -f BGC0000052.gbk
        -c files/domains_colour_file.tsv -d preprocessing_domhits_file.txt
        -o BGC0000052.html -s input_presto_stat_subclusters.txt
        -l bgc_topics_filtered.txt --include_list files/biosynthetic_domains.txt
#multiple BGCs
python3 ipresto/subcluster_arrower.py -f file_with_gbk_locations.txt
        -c files/domains_colour_file.tsv -d preprocessing_domhits_file.txt
        -o BGC0000052.html -s input_presto_stat_subclusters.txt
        -l bgc_topics_filtered.txt --include_list files/biosynthetic_domains.txt

See below for an example clusterfile. Genes (and BGC names) are separated by commas, domains in the same gene by semi-colons and genes without domains are represented by a dash.

BGC_name1,Lactamase_B,adh_short,ketoacyl-synt;Ketoacyl-synt_C,-\n
BGC_name2,-,Lant_dehydr_N;Lant_dehydr_C,LANC_like\n

Other scripts fullfill additional roles for more functionality. subPfams can be created with https://github.com/satriaphd/build_subpfam. subPfams used in the publication can be downloaded from the Zenodo https://doi.org/10.5281/zenodo.6953657.

Output

Most important outputs are *_presto_stat_subclusters.txt, presto_top/bgc_topics_filtered.txt and *_ipresto_output_visualisation.html. They contain the input BGCs queried to PRESTO-STAT and the PRESTO-TOP sub-clusers, and visualisation of the sub-cluster results, respectively.

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

iPRESTO-1.1.0.tar.gz (821.6 kB view hashes)

Uploaded Source

Supported by

AWS AWS Cloud computing and Security Sponsor Datadog Datadog Monitoring Fastly Fastly CDN Google Google Download Analytics Microsoft Microsoft PSF Sponsor Pingdom Pingdom Monitoring Sentry Sentry Error logging StatusPage StatusPage Status page