ConDiGA: Contigs directed gene annotation for accurate protein sequence database construction in metaproteomics

ConDiGA

ConDiGA (Contigs Directed Gene Annotation) is a taxonomic annotation pipeline that uses metagenomic data to construct accurate protein sequence databases for deep metaproteomic coverage.

Setting up ConDiGA

Downloading ConDiGA

You can clone the ConDiGA repository to your machine.

git clone https://github.com/metagentools/ConDiGA.git

Now go into the ConDiGA folder using the command

cd ConDiGA/

Using conda

Once you have installed conda, make sure you are in the ConDiGA folder. Now run the following commands to create a conda environment and activate it to run condiga.

conda env create -f environment.yml
conda activate condiga

Using pip

You can run the following command to install condiga using pip. Make sure you are in the ConDiGA folder.

pip install .

Note: If you use pip to set up ConDiGA, you will have to install Minimap2 and TaxonKit manually and add them to your system path. Irrespective of the package manager, if you want to use Kaiju results, you have to download and set up the NCBI taxdump database for TaxonKit.
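
A quick way to confirm that these manual steps took effect is sketched below. It is only an illustration and not part of ConDiGA: the helper names missing_tools, taxdump_ready and setup_taxdump are hypothetical, the executable names minimap2 and taxonkit are the usual ones for those tools, and the ~/.taxonkit location and the four .dmp file names are assumed from TaxonKit's documented defaults.

```shell
# Print each tool that is NOT found on PATH, one per line.
# Returns 1 if at least one tool is missing, 0 otherwise.
missing_tools() {
    local tool
    local rc=0
    for tool in "$@"; do
        if ! command -v "$tool" >/dev/null 2>&1; then
            echo "$tool"
            rc=1
        fi
    done
    return "$rc"
}

# Return 0 if the directory holds non-empty copies of the taxdump files
# that TaxonKit reads (assumed: names.dmp, nodes.dmp, delnodes.dmp, merged.dmp).
taxdump_ready() {
    local dir="$1"
    local f
    for f in names.dmp nodes.dmp delnodes.dmp merged.dmp; do
        [ -s "$dir/$f" ] || return 1
    done
    return 0
}

# Download the NCBI taxdump archive and unpack it into the given directory
# (assumed default: ~/.taxonkit). Needs network access, so it is only
# defined here and never called automatically.
setup_taxdump() {
    local dir="${1:-$HOME/.taxonkit}"
    mkdir -p "$dir" &&
    wget -c https://ftp.ncbi.nlm.nih.gov/pub/taxonomy/taxdump.tar.gz &&
    tar -xzf taxdump.tar.gz -C "$dir"
}
```

For example, `missing_tools minimap2 taxonkit` prints nothing when both tools are on your path, and `taxdump_ready ~/.taxonkit || setup_taxdump` downloads the database only when it is not already in place.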

Test the setup

After setting up, run the following command to ensure that condiga is working.

condiga --help

Usage

Usage: condiga [OPTIONS]

  ConDiGA: Contigs directed gene annotation for accurate protein sequence
  database construction in metaproteomics.

Options:
  -c, --contigs PATH              path to the contigs file  [required]
  -ta, --taxa PATH                path to the taxanomic classification results
                                  file  [required]
  -g, --genes PATH                path to the genes file  [required]
  -cov, --coverages PATH          path to the contig coverages file
                                  [required]
  -as, --assembly-summary PATH    path to the assembly_summary.txt file
                                  [required]
  -ra, --rel-abundance FLOAT RANGE
                                  minimum relative abundance cut-off
                                  [default: 0.0001; 0<=x<=1]
  -gc, --genome-coverage FLOAT RANGE
                                  minimum genome coverage cut-off  [default:
                                  0.001; 0<=x<=1]
  -mt, --map-threshold FLOAT RANGE
                                  minimum mapping length threshold cut-off
                                  [default: 0.5; 0<=x<=1]
  -t, --nthreads INTEGER          number of threads to use  [default: 8]
  -o, --output PATH               path to the output folder  [required]
  --help                          Show this message and exit.

Preprocessing

Before running ConDiGA, you have to process your data as follows.

Step 1: Assemble reads into contigs

You have to assemble your reads into contigs using MEGAHIT as follows. Currently, ConDiGA only supports MEGAHIT assemblies.

megahit -1 Reads/reads_1.fq.gz -2 Reads/reads_2.fq.gz -o MEGAHIT_output -t 16

Step 2: Taxonomically annotate contigs

Next, you have to perform taxonomic annotation on your contigs. You can use any tool such as Kraken2, Kaiju or even BLAST.

As an example, let's run Kraken2 as follows. $DBNAME is the path to your Kraken database.

kraken2 --threads 16 --db $DBNAME --use-names --output kraken_res_0.1.txt --confidence 0.1 --report kraken_report_0.1.txt MEGAHIT_output/final.contigs.fa

Now, you can run the convert command to convert your result to a form that can be used as input to condiga. The result will be saved to the Kraken folder. Currently, convert supports results from Kraken2, Kaiju and BLAST.

mkdir Kraken
convert -i kraken_res_0.1.txt -t kraken -o Kraken

NOTE: Since different annotation tools output results in different formats, you have to format the annotation results using convert, which will output the result in a standard format readable by ConDiGA.

Step 3: Obtain coverage of contigs

You can use CoverM to get the coverage values of contigs as follows.

coverm contig -1 Reads/reads_1.fq.gz -2 Reads/reads_2.fq.gz -r MEGAHIT_output/final.contigs.fa -o contig_coverage.tsv -t 16
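
As a sanity check, you can compare the number of contigs in the assembly with the number of rows in the coverage table. The sketch below assumes that the CoverM output is a tab-separated file with one header line followed by one row per contig. The helper names count_contigs and count_coverage_rows are hypothetical, and this check is not part of ConDiGA.

```shell
# Count FASTA records (header lines starting with '>').
count_contigs() {
    grep -c '^>' "$1"
}

# Count non-empty data rows in a TSV file, skipping the first (header) line.
count_coverage_rows() {
    awk 'NR > 1 && NF { n++ } END { print n + 0 }' "$1"
}
```

If `count_contigs` on your contigs file and `count_coverage_rows` on your coverage file print different numbers, the coverage file was probably produced from a different assembly or was truncated.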

Step 4: Predict genes in contigs

You can predict the genes in the contigs using MetaGeneMark as follows. You will find the nucleotide and amino acid sequences of the predicted genes in a file named final.contigs.fa.lst.

gmhmmp -m MetaGeneMark_v1.mod final.contigs.fa

Step 5: Download the assembly_summary.txt file

You can download the assembly summary file for bacteria from NCBI as follows.

wget https://ftp.ncbi.nlm.nih.gov/genomes/genbank/bacteria/assembly_summary.txt
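
The file is large, so it can help to confirm that the download completed before moving on. A minimal sketch, assuming the NCBI file keeps its layout of comment lines starting with # followed by tab-separated rows, one per assembly; count_assemblies is a hypothetical helper name.

```shell
# Count assembly rows, i.e. non-empty lines that do not start with '#'.
count_assemblies() {
    awk '!/^#/ && NF { n++ } END { print n + 0 }' "$1"
}
```

Running `count_assemblies assembly_summary.txt` should print a large number; a result of 0 suggests the download failed or returned something other than the summary table.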

Running ConDiGA

Once you have preprocessed your data and obtained all the necessary files, you can run condiga as follows.

condiga -c final.contigs.fa -ta Kraken/kraken_result.txt -g final.contigs.fa.lst -cov contig_coverage.tsv -as assembly_summary.txt -o <output_folder>
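
Since all five input files are required, a small pre-flight check can catch a wrong path before the run starts. The following is a sketch only: missing_inputs is a hypothetical helper, and the shell variables in the commented example are placeholders for your own paths. The optional cut-offs in that example are spelled out with the default values listed under Usage.

```shell
# Print every path that is missing or empty, one per line.
# Returns 1 if any path was reported, 0 otherwise.
missing_inputs() {
    local f
    local rc=0
    for f in "$@"; do
        if [ ! -s "$f" ]; then
            echo "$f"
            rc=1
        fi
    done
    return "$rc"
}

# Example (commented out so that nothing runs on its own):
# missing_inputs "$contigs" "$taxa" "$genes" "$coverages" "$summary" &&
#     condiga -c "$contigs" -ta "$taxa" -g "$genes" -cov "$coverages" \
#         -as "$summary" -ra 0.0001 -gc 0.001 -mt 0.5 -t 8 -o "$outdir"
```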

Citation

If you use ConDiGA in your work, please cite the bioRxiv preprint as follows.

@article {Wu2023.04.19.537311,
	author = {Enhui Wu and Vijini Mallawaarachchi and Jinzhi Zhao and Yi Yang and Hebin Liu and Xiaoqing Wang and Chengpin Shen and Yu Lin and Liang Qiao},
	title = {Contigs directed gene annotation (ConDiGA) for accurate protein sequence database construction in metaproteomics},
	elocation-id = {2023.04.19.537311},
	year = {2023},
	doi = {10.1101/2023.04.19.537311},
	publisher = {Cold Spring Harbor Laboratory},
	abstract = {Microbiota are closely associated to human health and disease. Metaproteomics can provide a direct means to identify microbial proteins in microbiota for compositional and functional characterization. However, in-depth and accurate metaproteomics is still limited due to the extreme complexity and high diversity of microbiota samples. One of the main challenges is constructing a protein sequence database that best fits the microbiota sample. Herein, we proposed an accurate taxonomic annotation pipeline from metagenomic data for deep metaproteomic coverage, namely contigs directed gene annotation (ConDiGA). We mixed 12 known bacterial species to derive a synthetic microbial community to benchmark metagenomic and metaproteomic pipelines. With the optimized taxonomic annotation strategy by ConDiGA, we built a protein sequence database from the metagenomic data for metaproteomic analysis and identified about 12,000 protein groups, which was very close to the result obtained with the reference proteome protein sequence database of the 12 species. We also demonstrated the practicability of the method in real fecal samples, achieved deep proteome coverage of human gut microbiome, and compared the function and taxonomy of gut microbiota at metagenomic level and metaproteomic level. Our study can tackle the current taxonomic annotation reliability problem in metagenomics-derived protein sequence database for metaproteomics. The unique dataset of metagenomic and the metaproteomic data of the 12 bacterial species is publicly available as a standard benchmarking sample for evaluating various analysis pipelines. The code of ConDiGA is open access at GitHub for the analysis of real microbiota samples. Competing Interest Statement: The authors have declared no competing interest.},
	URL = {https://www.biorxiv.org/content/early/2023/04/20/2023.04.19.537311},
	eprint = {https://www.biorxiv.org/content/early/2023/04/20/2023.04.19.537311.full.pdf},
	journal = {bioRxiv}
}
