Metagenomics pipeline for BIOF501
Simple Metagenomics
A BIOF501 term project for inferring protein annotations of metagenome-assembled genomes (MAGs) from metagenomic reads hosted on NCBI's Sequence Read Archive (SRA).
For the Impatient
Setup:
pip install simple-metagenomics
smg setup -r ./ref
To run with default subsampling (to 1% of the original for improved runtime):
smg run -r ./ref -i SRR19573024 -o ./out
To run with no subsampling:
smg run -r ./ref -s 1 -i SRR19573024 -o ./out
Background and Rationale
Throughout the various biomes of Earth, complex consortia of microorganisms thrive and cycle nutrients at scales ranging from symbiosis to global biogeochemical cycles. The study of these consortia has contributed to advances in many fields, including health in the context of host microbiomes [1], renewable energy in the context of biofuels [2], and ecology in the context of distributed metabolisms [3]. Since only a select few microbes have been successfully cultured in laboratory conditions, the typical approach is to interrogate the microbial gene content of a sample directly using metagenomics.
The aim of this pipeline is to provide the simplest possible method for downloading and then converting raw metagenomic sequences into meaningful annotations. For additional details, please refer to the implementation section.
Usage
Manual Dependencies
- Linux OS/amd64
- Python, version>=3.4 (so that you also have pip)
- Singularity
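As a quick preflight for the dependencies listed above, the following is a minimal sketch that checks each one from Python. The function name `check_dependencies` is hypothetical and not part of simple-metagenomics, and finding an executable on `PATH` only suggests, rather than proves, that the tool works.

```python
import platform
import shutil
import sys


def check_dependencies():
    """Return a mapping of dependency name -> whether it appears satisfied.

    Hypothetical helper for illustration; not part of simple-metagenomics.
    """
    return {
        "linux": platform.system() == "Linux",
        "amd64": platform.machine().lower() in ("x86_64", "amd64"),
        "python>=3.4": sys.version_info >= (3, 4),
        # pip may be exposed under either name depending on the install
        "pip": shutil.which("pip") is not None or shutil.which("pip3") is not None,
        "singularity": shutil.which("singularity") is not None,
    }


for name, ok in check_dependencies().items():
    print("{:<12} {}".format(name, "ok" if ok else "MISSING"))
```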
Installation
We recommend that you use a virtual environment
via conda...
conda create --no-default-packages --name smg python
conda activate smg
or via venv (part of the Python standard library since 3.3, so no separate install is needed)
python -m venv ./smg
source ./smg/bin/activate
In the environment, install simple metagenomics
pip install simple-metagenomics
Select a folder to save additional reference resources (./ref).
smg setup -r ./ref
Execution
Obtain the SRA run ID for a whole-genome metagenomic sequencing entry. For example, we use SRR19573024, which points to reads for a cyanobacteria bioreactor community [4]. Here, ./ref refers to the same folder used in the last installation step.
Example search
smg run -r ./ref -i SRR19573024 -o ./out
Once complete, look for annotation tables under ./out/SRR19573024/diamond/.
Expected runtime: ~30 minutes with 16 threads and subsampled to 1%.
Expected output:
./out                       # base output path specified with "-o"
├── .snakemake              # snakemake generated files, including logs
├── snakemake               # snakemake cache
└── SRR19573024
    ├── sra_raw             # original fastqs from SRA
    ├── input               # subsampled fastqs
    ├── megahit             # intermediate metagenomic assembly
    ├── maxbin2             # intermediate bins
    ├── prodigal            # intermediate ORFs per bin
    └── diamond
        ├── 001.fasta.tsv   # annotation table for 1 bin
        └── 002.fasta.tsv   # 2 bins should be resolved from SRR19573024 by default
Columns: Query ID (ORF), Subject title (annotation), Percentage of identical matches, Expected value
Interestingly, photosynthesis genes were found in both bins, including photosystems I and II. Bin 001, however, showed a greater potential to fix nitrogen, since nifB, nifS, and nifU were identified, accounting for 3 of the 4 genes of a known nitrogen fixation operon [5]. Although the remaining gene, fdxN, was not explicitly identified, a ferredoxin nitrite reductase was found in its stead. Only nifB was found in bin 002.
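The kind of gene lookup described above can be sketched in a few lines of Python. This is a minimal illustration, not the project's own tooling: it assumes the annotation tables are tab-separated with no header row and with the four columns in the order listed under "Columns". The function names, the e-value cutoff, and the example rows are all made up for illustration.

```python
import csv
import io

# Column order follows the "Columns:" description; the tab delimiter and the
# absence of a header row are assumptions about the file layout.
COLUMNS = ("orf_id", "annotation", "percent_identity", "e_value")


def read_annotations(handle):
    """Yield one dict per row of a four-column, tab-separated annotation table."""
    for row in csv.reader(handle, delimiter="\t", quoting=csv.QUOTE_NONE):
        if len(row) != len(COLUMNS):
            continue  # skip blank or malformed lines
        record = dict(zip(COLUMNS, row))
        record["percent_identity"] = float(record["percent_identity"])
        record["e_value"] = float(record["e_value"])
        yield record


def find_annotations(records, keyword, max_e_value=1e-5):
    """Return records whose annotation mentions keyword (case-insensitive)."""
    keyword = keyword.lower()
    return [
        r for r in records
        if keyword in r["annotation"].lower() and r["e_value"] <= max_e_value
    ]


# Made-up rows for illustration only; these are not real pipeline output.
EXAMPLE_TABLE = (
    "orf_1\tNitrogenase cofactor biosynthesis protein NifB\t71.2\t1e-120\n"
    "orf_2\tPhotosystem II reaction center protein\t88.0\t3e-90\n"
    "orf_3\tCysteine desulfurase NifS\t40.1\t0.5\n"
)

records = list(read_annotations(io.StringIO(EXAMPLE_TABLE)))
nif_hits = find_annotations(records, "nif")
print([hit["orf_id"] for hit in nif_hits])
```

To apply this to real output, replace the `io.StringIO` handle with an opened file such as ./out/SRR19573024/diamond/001.fasta.tsv.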
Implementation
The workflow is managed by snakemake [6], with all workflow-related dependencies packaged into a Docker container to maximize reproducibility. Due to its rising popularity, especially in the research community, Singularity [7] may be used as an alternative to Docker. The container image is hosted on Quay.io and automatically pulled during setup. The workflow steps are:
- sra_download: Using the SRA Toolkit, we download the paired-end fastqs pointed to by the given SRA run ID.
- subsample: A Python script randomly subsamples the fastq reads to the given fraction using numpy.
- Megahit [8]: The subsampled reads are assembled into longer segments (contigs).
- Maxbin2 [9]: These contigs are then clustered into bins based on tetranucleotide frequency and read coverage.
- Prodigal [10]: The contigs of each bin are then scanned for open reading frames (ORFs) using a dynamic programming algorithm that takes into account ribosomal binding sites, start and stop codons, and ORF length.
- Diamond [11]: Predicted ORFs are annotated based on the degree of homology with known reference sequences in the Clusters of Orthologous Genes (COG) [12] database.
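To make the subsample step concrete, here is a minimal sketch of randomly subsampling paired-end FASTQ reads while keeping mates in sync. It is not the pipeline's own script: it uses the standard library's `random` module rather than numpy to stay dependency-free, and it assumes each pair is kept independently with probability equal to the fraction, which may differ from how the actual script samples. All function names are hypothetical.

```python
import random


def read_fastq(lines):
    """Group an iterable of FASTQ lines into 4-line records (tuples)."""
    record = []
    for line in lines:
        record.append(line.rstrip("\n"))
        if len(record) == 4:
            yield tuple(record)
            record = []


def subsample_pairs(forward, reverse, fraction, seed=0):
    """Keep each read pair with probability `fraction`, so mates stay in sync."""
    if not 0 < fraction <= 1:
        raise ValueError("fraction must be in (0, 1]")
    rng = random.Random(seed)  # seeded for reproducible subsamples
    kept_fwd, kept_rev = [], []
    for fwd, rev in zip(read_fastq(forward), read_fastq(reverse)):
        # One draw per pair: both mates are kept or both are dropped.
        if rng.random() < fraction:
            kept_fwd.append(fwd)
            kept_rev.append(rev)
    return kept_fwd, kept_rev


# Synthetic reads for illustration only.
fwd_lines = []
rev_lines = []
for i in range(1000):
    fwd_lines += ["@read{}/1".format(i), "ACGT", "+", "IIII"]
    rev_lines += ["@read{}/2".format(i), "TGCA", "+", "IIII"]

kept_fwd, kept_rev = subsample_pairs(fwd_lines, rev_lines, fraction=0.1, seed=42)
print(len(kept_fwd), "of 1000 pairs kept")
```

Drawing once per pair, rather than once per file, is what keeps the forward and reverse files aligned, which downstream assemblers generally expect for paired-end input.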
Command Line Interface
$ smg
simple-metagenomics v1.0
https://github.com/Tony-xy-Liu/simple-metagenomics
Syntax: smg COMMAND [OPTIONS]
Where COMMAND is one of:
setup
run
for additional help, use:
smg COMMAND -h
$ smg setup
usage: smg setup [-h] -r PATH [-c TYPE]

optional arguments:
  -h, --help  show this help message and exit
  -r PATH     where to save required resources
  -c TYPE     the resource container type, choose from: "singularity"
              (default) or "docker"

the following arguments are required: -r
$ smg run
usage: smg run [-h] -r PATH -i SRA_ID -o PATH [-s DECIMAL] [-t INT] [--mock]

optional arguments:
  -h, --help  show this help message and exit
  -r PATH     path to saved required resources from running: smg setup
  -i SRA_ID   example: SRR19573024
  -o PATH     output folder
  -s DECIMAL  subsample fraction for raw reads, set to 1 for no subsampling,
              default:0.01
  -t INT      threads, default:16
  --mock      dry run snakemake

the following arguments are required: -r, -i, -o
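For scripted use, the flags shown in the help text above can be assembled programmatically. The following is a minimal sketch that uses only the documented options; the helper names `build_smg_run` and `launch` are hypothetical and not part of simple-metagenomics.

```python
import subprocess


def build_smg_run(ref_dir, sra_id, out_dir, subsample=None, threads=None, dry_run=False):
    """Assemble an `smg run` argument list from the flags in the help text."""
    cmd = ["smg", "run", "-r", ref_dir, "-i", sra_id, "-o", out_dir]
    if subsample is not None:
        cmd += ["-s", str(subsample)]  # fraction of reads to keep, 1 = all
    if threads is not None:
        cmd += ["-t", str(threads)]
    if dry_run:
        cmd.append("--mock")  # dry run snakemake
    return cmd


def launch(cmd):
    """Run the command, raising CalledProcessError on a non-zero exit."""
    return subprocess.run(cmd, check=True)


cmd = build_smg_run("./ref", "SRR19573024", "./out", subsample=1, threads=8, dry_run=True)
print(" ".join(cmd))
```

Calling `launch(cmd)` would then start the run; it requires that smg is installed and that `smg setup` has already populated ./ref.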
References