metapi

a pipeline to construct a genome catalogue from metagenomics data

These details have not been verified by PyPI

Project links

Homepage

Development Status
- 3 - Alpha
Environment
- Console
Intended Audience
- Developers
- Science/Research
License
- OSI Approved :: GNU General Public License v3 (GPLv3)
Natural Language
- English
Operating System
- OS Independent
Programming Language
Topic
- Scientific/Engineering :: Bio-Informatics

Project description

A general metagenomics data mining system focus on robust microbiome research.

Overview

MAG workflow (WIP figure)

Installation

metapi works with Python 3.7+. You can install it via bioconda:

# It is recommended to install the latest version
➤ mamba install -c conda-forge -c bioconda metapi=2.4.0

Run

Help

➤ metapi --help

  .___  ___.  _______ .___________.    ___      .______    __
  |   \/   | |   ____||           |   /   \     |   _  \  |  |
  |  \  /  | |  |__   `---|  |----`  /  ^  \    |  |_)  | |  |
  |  |\/|  | |   __|      |  |      /  /_\  \   |   ___/  |  |
  |  |  |  | |  |____     |  |     /  _____  \  |  |      |  |
  |__|  |__| |_______|    |__|    /__/     \__\ | _|      |__|

            Omics for All, Open Source for All

 A general metagenomics data mining system focus on robust microbiome research.

    optional arguments:
    -h, --help     show this help message and exit
    -v, --version  print software version and exit

    available subcommands:

    init         init project
    mag_wf       metagenome-assembly-genome pipeline
    gene_wf      metagenome-assembly-gene pipeline
    sync         metapi sync project

Init

➤ metapi init --help
usage: metapi init [-h] [-d WORKDIR] [--check-samples] [-s SAMPLES]
                        [-b {simulate,trimming,rmhost,assembly,binning,checkm}]
                        [--trimmer {sickle,fastp}]
                        [--rmhoster {bwa,bowtie2,minimap2,kraken2,kneaddata}]
                        [--assembler {idba-ud,megahit,metaspades,spades,opera-ms} [{idba-ud,megahit,metaspades,spades,opera-ms} ...]]
                        [--binner BINNER [BINNER ...]]

optional arguments:
  -h, --help            show this help message and exit
  -d, --workdir WORKDIR
                        project workdir (default: ./)
  --check-samples       check samples, default: False
  -s, --samples SAMPLES
                        desired input:
                        samples list, tsv format required.
                        
                        if begin from trimming, rmhost, or assembly:
                            if it is fastq:
                                the header is: [sample_id, assembly_group, binning_group, fq1, fq2]
                            if it is sra:
                                the header is: [sample_id, assembly_group, binning_group, sra]
                        
                        if begin from simulate:
                                the header is: [id, genome, abundance, reads_num, model]
                        
  -b, --begin {simulate,trimming,rmhost,assembly,binning,checkm}
                        pipeline starting point (default: trimming)
  --trimmer {sickle,fastp}
                        which trimmer used (default: fastp)
  --rmhoster {bwa,bowtie2,minimap2,kraken2,kneaddata}
                        which rmhoster used (default: bowtie2)
  --assembler {idba-ud,megahit,metaspades,spades,opera-ms} [{idba-ud,megahit,metaspades,spades,opera-ms} ...]
                        which assembler used, required when begin with binning, can be changed in config.yaml (default: ['metaspades'])
  --binner BINNER [BINNER ...]
                        wchich binner used (default: ['metabat2', 'concoct', 'maxbin2', 'vamb', 'dastools'])

Note
- When we do metapi init, metapi will help us to create project structure, include config.yaml, profiles/, envs/, logs/ and results/.
  - config.yaml: workflow configuration, can be edited
  - profiles/: used when run pipeline on cluster, can be edited
    - For SGE user, please see profiles/sge/cluster.yaml
    - For SLURM user, please see profiles/slurm/cluster.yaml
  - envs/: conda environment file for some rules, can be edited to update software version by yourself
- We recommand you to create a directory, and cd it and put samples.tsv in the directory, then do metapi init -s samples.tsv -d ./.
- Samples file is required when init project. Samples input format requirement can see here.
- Begin point can be simulate, trimming, rmhost, assembly and binning, it means:
  - simulate: simulate reads
  - trimming: samples need trimming, do quality control
  - rmhost: samples is quality control well, only need to remove host sequence, will not do trimming
  - assembly: samples is clean, just do assembly, will not do trimming and rmhost
  - (WIP) binning: supply samples and assembly, then do binning, will no do trimming, rmhost and assembly
- When metapi init executed, then corresponding configuration will be writen into config.yaml.
  Of course, you can edit config.yaml to update config, then when run metapi mag_wf or metapi gene_wf,
  metapi will understand it. Just edit it, see what will happen.

mag_wf

➤ metapi mag_wf --help
          [--cluster-engine {slurm,sge,lsf,pbs-torque}]
          [--wait WAIT] [--use-conda] [--conda-prefix CONDA_PREFIX]
          [--conda-create-envs-only]
          [TASK]

positional arguments:
  TASK                  pipeline end point. Allowed values are
  
  simulate_all,
  prepare_short_reads_all,
  prepare_long_reads_all,
  prepare_reads_all,
  raw_fastqc_all,
  raw_report_all, raw_all,
  trimming_sickle_all,
  trimming_fastp_all,
  trimming_report_all,
  trimming_all,
  rmhost_bwa_all,
  rmhost_bowtie2_all,
  rmhost_minimap2_all,
  rmhost_kraken2_all,
  rmhost_kneaddata_all,
  rmhost_report_all,
  rmhost_all,
  qcreport_all,
  assembly_megahit_all,
  assembly_idba_ud_all,
  assembly_metaspades_all,
  assembly_spades_all,
  assembly_plass_all,
  assembly_opera_ms_all, 
  assembly_metaquast_all,
  assembly_report_all,
  assembly_all,
  alignment_base_depth_all,
  alignment_report_all,
  alignment_all,
  binning_metabat2_coverage_all,
  binning_metabat2_all,
  binning_maxbin2_all, 
  binning_concoct_all,
  binning_graphbin2_all,
  binning_dastools_all,
  binning_vamb_prepare_all, 
  binning_vamb_all,
  binning_report_all,
  binning_all,
  identify_virsorter2_all,
  identify_deepvirfinder_all,
  identify_single_all,
  identify_phamb_all,
  identify_multi_all,
  identify_all,
  predict_scaftigs_gene_prodigal_all,
  predict_scaftigs_gene_prokka_all,
  predict_bins_gene_prodigal_all,
  predict_bins_gene_prokka_all,
  predict_scaftigs_gene_all,
  predict_bins_gene_all,
  predict_all,
  checkm_all,
  dereplicate_mags_drep_all,
  dereplicate_mags_all,
  taxonomic_all,
  upload_sequencing_all,
  upload_assembly_all,
  upload_all,
  all (default: all)

optional arguments:
  -h, --help            show this help message and exit
  -d, --workdir WORKDIR
                        project workdir (default: ./)
  --check-samples       check samples, default: False
  --config CONFIG       config.yaml (default: ./config.yaml)
  --profile PROFILE     cluster profile name (default: ./profiles/slurm)
  --cores CORES         all job cores, available on '--run-local' (default: 240)
  --local-cores LOCAL_CORES
                        local job cores, available on '--run-remote' (default: 8)
  --jobs JOBS           cluster job numbers, available on '--run-remote' (default: 30)
  --list                list pipeline rules
  --debug               debug pipeline
  --dry-run             dry run pipeline
  --run-local           run pipeline on local computer
  --run-remote          run pipeline on remote cluster
  --cluster-engine {slurm,sge,lsf,pbs-torque}
                        cluster workflow manager engine, support slurm(sbatch) and sge(qsub) (default: slurm)
  --wait WAIT           wait given seconds (default: 60)
  --use-conda           use conda environment
  --conda-prefix CONDA_PREFIX
                        conda environment prefix (default: ~/.conda/envs)
  --conda-create-envs-only
                        conda create environments only

mag_wf example

# init project
➤ metapi init -d . -s samples.tsv -b trimming

# create conda environments, which will create envs at ~/.conda/envs
# run it firstly, recommend
➤ metapi mag_wf --conda-create-envs-only

# run pipeline with conda
➤ metapi mag_wf all --use-conda --run-local

# run raw_fastqc
➤ metapi mag_wf raw_fastqc_all --use-conda --run-local

# run trimming
➤ metapi mag_wf trimming_all --use-conda --run-local

# run rmhost
➤ metapi mag_wf rmhost_all --use-conda --run-local

# run qc report
➤ metapi mag_wf qcreport_all --use-conda --run-local

# run assembly
➤ metapi mag_wf assembly_all --use-conda --run-local

# run binning (microbial MAG)
➤ metapi mag_wf binning_all --use-conda --run-local

# run identify (virus MAG)
➤ metapi mag_wf identify_all --use-conda --run-local

# run gene predict
➤ metapi mag_wf predict_all --use-conda --run-local

# run checkm
➤ metapi mag_wf checkm_all --use-conda --run-local

#run taxonomic 
➤ metapi mag_wf taxonomic_all --use-conda --run-local

# run mag_wf all
➤ metapi mag_wf --use-conda --run-local

# run mag_wf all on SGE/SLURM cluster
➤ metapi mag_wf all --use-conda --run-remote

Input details

The input fastq need gzip compress.
If sample_id col contain same id, then the reads of each sample will be merged.
For the same assembly_group, the reads will be combined together to do co-assembly, then the reads of each sample will map to the co-assembled results to calculate co-abundance for each assembly_group, then using MetaBAT2, MaxBin2, CONCOCT (You can specific which binner to use) to do binning.
For the same binning_group, the assembled results will be combined together to build minimap2 index, then the reads of each sample will map to the combined index to calculate co-abundance, then using vamb to do binning
assembly_group control the assembly and binning strategy; binning_group control the multisplit binning strategy developed by vamb.
If all assembly_group are unique, for MetaBAT2, MaxBin2, CONCOCT, etc., that is the traditional single-assembly and single-binning strategy.
If binning_group has multiple groups, then vamb will perform multisplit binning in each binning_group separately, which is useful for very large-scale data. For example, if there are 10,000 samples, and multisplit binning is performed once for every thousand samples, the binning group can be set to 10 different groups.

Input format

begin from trimming, rmhost or assembly:

Paired-end reads

sample_id	assembly_group	binning_group	fq1	fq2
s1	ag1	bg1	s1.1.fq.gz	s1.2.fq.gz
s2	ag1	bg1	s2.1.fq.gz	s2.2.fq.gz
s3	ag2	bg2	s3.1.fq.gz	s3.2.fq.gz
s4	ag2	bg2	s4.1.fq.gz	s4.2.fq.gz

Paired-end reads(interleaved), update config.yaml::reads_interleaved=true

sample_id	assembly_group	binning_group	fq1
s1	ag1	bg1	s1.fq.gz
s2	ag1	bg1	s2.fq.gz
s3	ag2	bg2	s3.fq.gz
s4	ag2	bg2	s4.fq.gz

(WIP) Paired-end reads with long reads, update config.yaml::have_long=true

sample_id	assembly_group	binning_group	fq1	fq2	fq_long
s1	ag1	bg1	s1.1.fq.gz	s1.2.fq.gz	s1.long.fq.gz
s2	ag1	bg1	s2.1.fq.gz	s2.2.fq.gz	s2.long.fq.gz
s3	ag2	bg2	s3.1.fq.gz	s3.2.fq.gz	s3.long.fq.gz
s4	ag2	bg2	s4.1.fq.gz	s4.2.fq.gz	s4.long.fq.gz

(WIP)Paired-end reads(interleaved) with long reads, update config.yaml::reads_interleaved=true and config.yaml::have_long=true

sample_id	assembly_group	binning_group	fq1	fq_long
s1	ag1	bg1	s1.fq.gz	s1.long.fq.gz
s2	ag1	bg1	s2.fq.gz	s2.long.fq.gz
s3	ag2	bg2	s3.fq.gz	s3.long.fq.gz
s4	ag2	bg2	s4.fq.gz	s4.long.fq.gz

Single-end reads, update config.yaml::reads_layout=se

sample_id	assembly_group	binning_group	fq1
s1	ag1	bg1	s1.fq.gz
s2	ag1	bg1	s2.fq.gz
s3	ag2	bg2	s3.fq.gz
s4	ag2	bg2	s4.fq.gz

SRA (only support paired-end reads) : SRA can be dumpped to Paired-end fastq reads

sample_id	assembly_group	binning_group	sra
s1	ag1	bg1	s1.sra
s2	ag1	bg1	s2.sra
s3	ag2	bg2	s3.sra
s4	ag2	bg2	s4.sra

begin from simulate, only support paired-end reads

id genome abundance reads_num model

s1 g1.fa 1.0 1M hiseq

s2 g1.fa 0.5 2M hiseq

s2 g2.fa 0.5 2M hiseq

s3 g1.fa 0.2 3M hiseq

s3 g2.fa 0.3 3M hiseq

s3 g3.fa 0.5 3M hiseq

id	genome	abundance	reads_num	model
s1	g1.fa	1.0	1M	hiseq
s2	g1.fa	0.5	2M	hiseq
s2	g2.fa	0.5	2M	hiseq
s3	g1.fa	0.2	3M	hiseq
s3	g2.fa	0.3	3M	hiseq
s3	g3.fa	0.5	3M	hiseq

It means:

The sample s1 contain 1M reads which come from g1, the relative abundance of species g1 is 1.0.

The sample s2 contain 2M reads, 1M reads come from g1 and 1M reads come from g2. the relative abundance of species g1 is 0.5, the relative abundance of species g2 is 0.5.

The sample s3 contain 3M reads, 0.6M reads come from g1, 0.9M reads come from g2 and 1.5M reads come from g3, the relative abundance of species g1 is 0.2, the relative abundance of species g2 is 0.3, the relative abundance of species g3 is 0.5.

Then metapi will use InSilicoSeq to generate metagenomics shotgun reads.

(WIP) begin from binning

sample_id	assembly_group	binning_group	fq1	fq2	scaftigs
s1	ag1	bg1	s1.1.fq.gz	s1.2.fq.gz	ag1.scaftigs.fa.gz
s2	ag1	bg1	s2.1.fq.gz	s2.2.fq.gz	ag1.scaftigs.fa.gz
s3	ag2	bg2	s3.1.fq.gz	s3.2.fq.gz	ag2.scaftigs.fa.gz
s4	ag2	bg2	s4.1.fq.gz	s4.2.fq.gz	ag2.scaftigs.fa.gz

(WIP) begin from checkm, or dereplicate, or taxonomic assignment (fq1 and fq2 can be empty)

sample_id bins

s1 s1.bin.1.fa

s1 s1.bin.2.fa

s1 s1.bin.3.fa

s2 s2.bin.1.fa

s2 s2.bin.2.fa

s2 s3.bin.3.fa

sample_id	bins
s1	s1.bin.1.fa
s1	s1.bin.2.fa
s1	s1.bin.3.fa
s2	s2.bin.1.fa
s2	s2.bin.2.fa
s2	s3.bin.3.fa

Default output structure (begin from trimming)

- config.yaml
- logs/
- profiles/
- envs/
- results/
    00.raw/
    01.trimming/
    02.rmhost/
    03.qcreport/
    04.assembly/
    05.alignment/
    06.binning/
    06.identify/
    07.predict/
    08.checkm/
    09.dereplicate/
    10.taxonomic/
    99.upload/

We will try our best to keep the directory structure uniform. Sequence files are generally placed in the reads directory, and report files are generally placed in the report directory.
If you are not very clear about the output of the whole process, it is recommended to directly raise an issue or look at the code.

Getting help

If you want to report a bug or issue, or have problems with installing or running the software, please create a new issue. This is the preferred way of getting support. Alternatively, you can mail me.

Contributing

Contributions welcome! Send me a pull request or get in touch.

When contributing a PR, please use the dev branch. For style, code will be checked using flake8, black, and snakefmt. These modules can be installed via conda, conda install black flake8 flake8-bugbear snakefmt or via pip pip install black flake8 flake8-bugbear snakefmt.

Contributors

Jie Zhu - @alienzj
Fangming Yang - @yangfangming
Yanmei Ju - @juyanmei
Weiting Liang - @weiting-liang

Citation

Over 50,000 Metagenomically Assembled Draft Genomes for the Human Oral Microbiome Reveal New Taxa
Genomics, Proteomics & Bioinformatics, 2021, https://doi.org/10.1016/j.gpb.2021.05.001

License

This module is licensed under the terms of the GPLv3 license.

Project details

These details have not been verified by PyPI

Project links

Homepage

Development Status
- 3 - Alpha
Environment
- Console
Intended Audience
- Developers
- Science/Research
License
- OSI Approved :: GNU General Public License v3 (GPLv3)
Natural Language
- English
Operating System
- OS Independent
Programming Language
Topic
- Scientific/Engineering :: Bio-Informatics

Release history Release notifications | RSS feed

This version

2.4.0

Aug 19, 2022

2.3.0

Aug 17, 2022

2.2.0

Jun 25, 2022

2.1.4

Jun 5, 2022

2.1.3

Jun 3, 2022

2.1.2

Jun 2, 2022

2.1.1

May 24, 2022

2.1.0

May 24, 2022

2.0.0

Apr 19, 2022

1.1.0

Apr 23, 2021

1.0.5

Nov 20, 2020

1.0.4

Nov 11, 2020

1.0.3

Nov 9, 2020

1.0.2

Oct 28, 2020

1.0.1

Oct 16, 2020

0.9.3

Jul 4, 2020

0.9.2

Jul 3, 2020

0.9.1

Jun 30, 2020

0.9.0

Jun 22, 2020

0.8.0

Jun 7, 2020

0.7.10

Jun 2, 2020

0.7.9

May 28, 2020

0.7.8

May 13, 2020

0.7.7

May 13, 2020

0.7.6

Apr 28, 2020

0.7.5

Apr 28, 2020

0.7.4

Apr 27, 2020

0.7.3

Apr 25, 2020

0.7.2

Apr 25, 2020

0.7.1

Apr 25, 2020

0.7.0

Apr 25, 2020

0.6.4

Apr 17, 2019

0.6.3

Apr 16, 2019

0.6.2

Apr 16, 2019

0.6.1

Apr 15, 2019

0.6.0

Apr 15, 2019

0.5.2

Apr 15, 2019

0.5.1

Apr 15, 2019

0.5.0

Apr 15, 2019

0.1.3

Aug 5, 2018

Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

metapi-2.4.0.tar.gz (110.2 kB view details)

Uploaded Aug 19, 2022 Source

Built Distribution

If you're not sure about the file name format, learn more about wheel file names.

The dropdown lists show the available interpreters, ABIs, and platforms. Enable javascript to be able to filter the list of wheel files.

metapi-2.4.0-py3-none-any.whl (151.2 kB view details)

Uploaded Aug 19, 2022 Python 3

File details

Details for the file metapi-2.4.0.tar.gz.

File metadata

Download URL: metapi-2.4.0.tar.gz
Upload date: Aug 19, 2022
Size: 110.2 kB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: twine/4.0.1 CPython/3.9.13

File hashes

Hashes for metapi-2.4.0.tar.gz
Algorithm	Hash digest
SHA256	`db1b9f325f5a447904e348c5ddb0754bec74b5e2ae40803f05d2992247472a2f`
MD5	`131241fdee87806998d50cf120ef05ee`
BLAKE2b-256	`17645285643be8b8ca209a4d688a0250c68bff025cca0d0b0d466a039884e048`

See more details on using hashes here.

File details

Details for the file metapi-2.4.0-py3-none-any.whl.

File metadata

Download URL: metapi-2.4.0-py3-none-any.whl
Upload date: Aug 19, 2022
Size: 151.2 kB
Tags: Python 3
Uploaded using Trusted Publishing? No
Uploaded via: twine/4.0.1 CPython/3.9.13

File hashes

Hashes for metapi-2.4.0-py3-none-any.whl
Algorithm	Hash digest
SHA256	`c878fdd0d5f90c757d07040327b097b80d47a4195653868d13fdea25673225fe`
MD5	`9618cabf61a13cbc32be13ce9aeea3fa`
BLAKE2b-256	`b11315385ff861a6993cf27608e638d321004d764f2eae5d07494a92f70b1ee9`

See more details on using hashes here.

metapi 2.4.0

Navigation

Verified details

Maintainers

Unverified details

Project links

Meta

Classifiers

Project description

Overview

MAG workflow (WIP figure)

Installation

Run

Help

Init

mag_wf

mag_wf example

Input details

Input format

Default output structure (begin from trimming)

Getting help

Contributing

Contributors

Citation

License

Project details

Verified details

Maintainers

Unverified details

Project links

Meta

Classifiers

Release history Release notifications | RSS feed

Download files

Source Distribution

Built Distribution

File details

File metadata

File hashes

File details

File metadata

File hashes