tools to support genome and metagenome analysis
genome-grist - reference-based exploration of Illumina metagenomes
In brief
genome-grist automates a number of tasks around genome-based metagenome interpretation.
One key point of genome-grist is this: we can take advantage of sourmash gather to find the smallest set of genomes to which to map metagenome reads. genome-grist automates all the stuff AROUND doing that!
So, genome-grist is a toolkit to do the following:
- download a metagenome
- process it into trimmed reads, and make a sourmash signature
- search the sourmash signature with 'gather' against sourmash databases, e.g. all of genbank
- download the matching genomes from genbank
- map all metagenome reads to genomes using minimap2
- extract matching reads iteratively based on gather, successively eliminating reads that matched to previous gather matches
- run mapping on “leftover” reads to genomes
- summarize all mapping results
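The "smallest set of genomes" idea and the successive elimination of already-matched content can be illustrated with a toy greedy set cover. This is only a sketch under loud assumptions: plain Python sets of integers stand in for k-mer sketches, and the function and genome names are hypothetical. It is not sourmash's implementation, just the general shape of the approach.

```python
def greedy_min_set_cover(query, references):
    """Greedily pick references that explain the most not-yet-covered query items.

    `query` is a set of items (stand-ins for metagenome k-mers);
    `references` maps a name to a set of items (stand-ins for genome k-mers).
    Returns (picks, leftover): picks is a list of (name, newly_covered_count).
    """
    remaining = set(query)
    picks = []
    while remaining:
        # Find the reference that overlaps the most still-unexplained items.
        best_name, best_overlap = None, set()
        for name, items in references.items():
            overlap = remaining & items
            if len(overlap) > len(best_overlap):
                best_name, best_overlap = name, overlap
        if best_name is None:
            break  # nothing left in the references matches
        picks.append((best_name, len(best_overlap)))
        # Eliminate what this match explains before looking for the next one.
        remaining -= best_overlap
    return picks, remaining


# Toy data: hypothetical genomes, integers instead of k-mer hashes.
metagenome = set(range(1, 11))
genomes = {
    "genome_A": {1, 2, 3, 4, 5},
    "genome_B": {4, 5, 6, 7},
    "genome_C": {1, 2},  # fully redundant with genome_A
}

picks, leftover = greedy_min_set_cover(metagenome, genomes)
print(picks)     # [('genome_A', 5), ('genome_B', 2)]
print(leftover)  # {8, 9, 10}
```

In this toy run, genome_C is never selected because genome_A already explains everything it contains, and the items that no genome explains play the role of the "leftover" reads.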
Installation
The command:
python -m pip install genome-grist
will install the latest version. Please use Python 3.7 or later. We suggest using an isolated conda environment; the following commands should work for conda:
conda create -n grist python=3.7 pip
conda activate grist
python -m pip install genome-grist
Quick start:
Run the following three commands.
First, download SRA sample HSMA33MX, trim reads, and build a sourmash signature:
genome-grist process HSMA33MX smash_reads
Next, search the sourmash signature against genbank with gather:
genome-grist process HSMA33MX gather_genbank
(NOTE: this step depends on the latest genbank genomes and won't work for most people just yet. For now, use the cached results from the repo:
cp tests/test-data/HSMA33MX.x.genbank.gather.csv outputs/genbank/
touch outputs/genbank/HSMA33MX.x.genbank.gather.out
)
Finally, download the reference genomes, map reads and produce a summary report:
genome-grist process HSMA33MX summarize -j 8
(You can run all of the above with make test in the repo.)
The summary report will be in outputs/reports/report-HSMA33MX.html.
You can see some example reports for this and other data sets online:
- HSMA33MX report
- Illumina metagenome from Shakya et al., 2014 (ref)
- sample 1 from Hu et al., 2016 (oil well metagenome) (ref)
Compute requirements
You'll need enough disk space to store about 5 copies of your raw metagenome.
The peak memory requirement is in the k-mer trimming and sourmash gather steps. You'll probably want between 30 and 60 GB of RAM for those, although for smaller or less diverse metagenomes, you will use a lot less.
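As a rough pre-flight check, the "about 5 copies" rule of thumb can be turned into a quick calculation. This is a minimal sketch: the function names and the 12 GiB example size are made up for illustration, and the factor of 5 is just the estimate stated above, not a guarantee.

```python
import shutil


def estimated_disk_bytes(raw_metagenome_bytes, copies=5):
    """Rule-of-thumb disk estimate: about `copies` times the raw metagenome size."""
    return raw_metagenome_bytes * copies


def has_enough_disk(raw_metagenome_bytes, path=".", copies=5):
    """Compare the estimate against free space on the filesystem holding `path`."""
    free = shutil.disk_usage(path).free
    return free >= estimated_disk_bytes(raw_metagenome_bytes, copies)


# Hypothetical example: a 12 GiB raw metagenome.
raw_size = 12 * 1024**3
needed = estimated_disk_bytes(raw_size)
print(f"need about {needed / 1024**3:.0f} GiB free")  # need about 60 GiB free
print("enough space here:", has_enough_disk(raw_size))
```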
Full set of top-level process targets
- download_reads
- trim_reads
- smash_reads
- gather_genbank
- download_matching_genomes
- map_reads
- summarize
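To step through targets one at a time, a small driver can build the same `genome-grist process <sample> <target>` commands used in the Quick start. This is a sketch with assumptions: the helper names are hypothetical, it assumes every target accepts the same invocation shape as the three shown above, and it assumes `-j` can be passed to any target (the Quick start only shows it with summarize). It defaults to a dry run that only prints the commands.

```python
import subprocess

# Targets in the order listed above.
TARGETS = [
    "download_reads",
    "trim_reads",
    "smash_reads",
    "gather_genbank",
    "download_matching_genomes",
    "map_reads",
    "summarize",
]


def build_commands(sample, targets=TARGETS, jobs=None):
    """Build one `genome-grist process <sample> <target>` command per target."""
    commands = []
    for target in targets:
        cmd = ["genome-grist", "process", sample, target]
        if jobs is not None:
            # Assumption: -j is accepted by every target, as it is for summarize.
            cmd += ["-j", str(jobs)]
        commands.append(cmd)
    return commands


def run_pipeline(sample, targets=TARGETS, jobs=None, dry_run=True):
    """Print each command; execute it only when dry_run is False."""
    for cmd in build_commands(sample, targets, jobs):
        print(" ".join(cmd))
        if not dry_run:
            # check=True stops at the first failing step.
            subprocess.run(cmd, check=True)


# Dry run of the three Quick start steps for the example sample.
run_pipeline("HSMA33MX", targets=["smash_reads", "gather_genbank", "summarize"], jobs=8)
```

Pass `dry_run=False` to actually execute the commands; that requires genome-grist to be installed and on the PATH.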
Support
genome-grist is alpha-level software. Please be patient and kind :).
Please ask questions and add comments by filing github issues.
Why the name grist?
'grist' is in the sourmash family of names (sourmash, wort, distillerycats, etc.). See Grist.
(It is not the computing grist!)
CTB Nov 8, 2020