Metagenomic binning with semi-supervised siamese neural network


SemiBin: Semi-supervised Metagenomic Binning Using Siamese Neural Networks

A command-line tool for metagenomic binning with semi-supervised deep learning, using information from reference genomes, on Linux and macOS.


CONTACT US: Please use GitHub issues for bug reports and the SemiBin users mailing-list for more open-ended discussions or questions.

If you use this software in a publication please cite:

Pan, S.; Zhu, C.; Zhao, XM.; Coelho, LP. A deep siamese neural network improves metagenome-assembled genomes in microbiome datasets across different environments. Nat Commun 13, 2326 (2022). https://doi.org/10.1038/s41467-022-29843-y

Basic usage of SemiBin

A tutorial on running SemiBin from scratch can be found here: SemiBin tutorial.

Installation:

conda create -n SemiBin
conda activate SemiBin
conda install -c conda-forge -c bioconda semibin

The inputs to SemiBin are contigs (assembled from the reads) and BAM files (from mapping the reads to the contigs). The docs describe how to generate these inputs starting from a metagenome.
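As a minimal sketch of how these inputs are typically produced (assuming bowtie2 and samtools are available and paired reads are in reads_1.fq.gz/reads_2.fq.gz; any short-read mapper works):

```shell
# Index the assembled contigs, map the reads, and sort the alignments
bowtie2-build contig.fa contig_index
bowtie2 -x contig_index -1 reads_1.fq.gz -2 reads_2.fq.gz -p 8 \
    | samtools sort -@ 4 -o S1.sorted.bam -
samtools index S1.sorted.bam
```

The sorted, indexed BAM file (S1.sorted.bam) and the contigs (contig.fa) are then passed to SemiBin.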

Running with single-sample binning (for example: human gut samples):

SemiBin single_easy_bin -i contig.fa -b S1.sorted.bam -o output --environment human_gut

Running with multi-sample binning:

SemiBin multi_easy_bin -i contig_whole.fa -b *.sorted.bam -o output

The output includes the bins in the output_recluster_bins directory (the bin.*.fa and recluster.*.fa files).

Please find more options and details below and read the docs.

Advanced Installation

SemiBin runs on Python 3.7-3.10.

Bioconda

The simplest mode is shown above. However, if you want to use SemiBin with GPU (which is faster if you have one available), you need to install PyTorch with GPU support:

conda create -n SemiBin
conda activate SemiBin
conda install -c conda-forge -c bioconda semibin
conda install -c pytorch-lts pytorch torchvision torchaudio cudatoolkit=10.2 -c pytorch-lts

macOS note: with conda, you can only install the CPU version of PyTorch on macOS; to take advantage of a GPU you need to install PyTorch from source (see #72). For more information on how to install PyTorch, see their documentation.

Source

You will need the following dependencies: MMseqs2, Bedtools, HMMER, Prodigal, Samtools, and FragGeneScan.

The easiest way to install them is with conda:

conda install -c conda-forge -c bioconda mmseqs2=13.45111 # (for GTDB support)
conda install -c bioconda bedtools hmmer prodigal samtools
conda install -c bioconda fraggenescan

Once the dependencies are installed, you can install SemiBin by running:

python setup.py install

Examples of binning

SemiBin supports single-sample, co-assembly, and multi-sample binning. Here we show the simplest modes as examples. For details and examples of every SemiBin subcommand, please read the docs.

Binning assemblies from long reads

Since version 1.4, SemiBin includes a new algorithm (an ensemble-based DBSCAN algorithm) for binning assemblies from long reads. To use it, run the bin_long subcommand or pass the option --sequencing-type=long_read to the single_easy_bin or multi_easy_bin subcommands.
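For example, with the same single-sample inputs as above:

```shell
SemiBin single_easy_bin -i contig.fa -b S1.sorted.bam -o output --sequencing-type=long_read
```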

Self-supervised mode

Since version 1.3, SemiBin supports completely self-supervised learning, which bypasses the need to annotate contigs with MMseqs2. In benchmarks, self-supervised learning is both faster (4x faster, using only 11% of RAM at peak) and generates 8.3-21.5% more high-quality bins than the version tested in the manuscript. To use it, pass the option --training-mode=self to the single_easy_bin or multi_easy_bin subcommands.
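For example, with the single-sample inputs from above:

```shell
SemiBin single_easy_bin -i contig.fa -b S1.sorted.bam -o output --training-mode=self
```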

Easy single/co-assembly binning mode

Single sample and co-assembly are handled the same way by SemiBin.

You will need the following inputs:

  1. A contig file (contig.fa in the example below)
  2. BAM file(s) from mapping short reads to the contigs, sorted (mapped_reads.sorted.bam in the example below)

The single_easy_bin command can be used to produce results in a single step.

For example:

SemiBin \
    single_easy_bin \
    --input-fasta contig.fa \
    --input-bam mapped_reads.sorted.bam \
    --environment human_gut \
    --output output

Alternatively, you can train a new model for that sample, by not passing in the --environment flag:

SemiBin \
    single_easy_bin \
    --input-fasta contig.fa \
    --input-bam mapped_reads.sorted.bam \
    --output output

The following environments are supported:

  • human_gut
  • dog_gut
  • ocean
  • soil
  • cat_gut
  • human_oral
  • mouse_gut
  • pig_gut
  • built_environment
  • wastewater
  • chicken_caecum (Contributed by Florian Plaza Oñate)
  • global

The global environment can be used if none of the others is appropriate. Note that training a new model can take a lot of time and disk space. Some patience will be required. If you have a lot of samples from the same environment, you can also train a new model from them and reuse it.

Easy multi-sample binning mode

The multi_easy_bin command can be used for multi-sample binning.

You will need the following inputs:

  1. A combined contig file
  2. BAM files from mapping

For every contig, the name format is <sample_name>:<contig_name>, where : is the default separator (it can be changed with the --separator argument). NOTE: make sure the sample names are unique and that the separator does not introduce ambiguity when splitting. For example:

>S1:Contig_1
AGATAATAAAGATAATAATA
>S1:Contig_2
CGAATTTATCTCAAGAACAAGAAAA
>S1:Contig_3
AAAAAGAGAAAATTCAGAATTAGCCAATAAAATA
>S2:Contig_1
AATGATATAATACTTAATA
>S2:Contig_2
AAAATATTAAAGAAATAATGAAAGAAA
>S3:Contig_1
ATAAAGACGATAAAATAATAAAAGCCAAATCCGACAAAGAAAGAACGG
>S3:Contig_2
AATATTTTAGAGAAAGACATAAACAATAAGAAAAGTATT
>S3:Contig_3
CAAATACGAATGATTCTTTATTAGATTATCTTAATAAGAATATC

You can use this command to generate the combined contig file:

SemiBin concatenate_fasta -i contig*.fa -o output

If either the sample or the contig names use the default separator (:), you will need to change it with the --separator (-s) argument.
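To make the naming convention concrete, here is a minimal Python sketch of what such a concatenation step does (an illustration of the <sample>:<contig> renaming, not SemiBin's actual implementation; sample names are assumed to come from the file names):

```python
from pathlib import Path

def concatenate_fasta(sample_fastas, output_path, separator=":"):
    """Combine per-sample FASTA files, renaming each contig header to
    <sample_name><separator><contig_name>."""
    with open(output_path, "w") as out:
        for fasta in sample_fastas:
            sample = Path(fasta).stem  # e.g. "S1" from "S1.fa"
            if separator in sample:
                raise ValueError(
                    f"sample name {sample!r} contains the separator "
                    f"{separator!r}; choose another one with -s")
            with open(fasta) as fh:
                for line in fh:
                    if line.startswith(">"):
                        # Keep only the contig ID, drop any description
                        contig = line[1:].split()[0]
                        out.write(f">{sample}{separator}{contig}\n")
                    else:
                        out.write(line)
```

The separator check mirrors the warning above: a sample name that already contains the separator would make the combined headers ambiguous to split.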

After mapping samples (individually) to the combined FASTA file, you can get the results with one line of code:

SemiBin multi_easy_bin -i concatenated.fa -b *.sorted.bam -o output

Output

The output folder will contain:

  1. Features computed from the data and used for training and clustering
  2. Saved semi-supervised deep learning model
  3. Output bins
  4. Table with basic information about each bin
  5. Some intermediate files

By default, reconstructed bins are in output_recluster_bins directory.
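As a quick sanity check on the results, a short Python sketch (a hypothetical helper, not part of SemiBin) can tally the contig count and total length of each bin:

```python
import glob
import os

def summarize_bins(bin_dir):
    """Return {bin_file: (n_contigs, total_bp)} for every *.fa file in
    bin_dir (e.g. output/output_recluster_bins)."""
    summary = {}
    for path in sorted(glob.glob(os.path.join(bin_dir, "*.fa"))):
        n_contigs, total_bp = 0, 0
        with open(path) as fh:
            for line in fh:
                if line.startswith(">"):
                    n_contigs += 1
                else:
                    total_bp += len(line.strip())
        summary[os.path.basename(path)] = (n_contigs, total_bp)
    return summary
```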

For more details about the output, read the docs.
