Quality control of assembled genomes
Project description
Eric Hugoson (eric@hugoson.org / hugoson@evolbio.mpg.de / @EricHugo)
Lionel Guy (guy.lionel@gmail.com / lionel.guy@imbim.uu.se / @LionelGuy)
Introduction
With the increasing rate of the production of genomics data, particularly metagenomic data, there is need for more and faster quality control of resulting assembled genomes. An increasing usage of metagenome assembled genomes (MAGs) means often working with incomplete genomes, which can be acceptable provided the researcher is aware of this. Therefore there is a clear use for software able to rapidly and accurately provide the stats that describe the quality and completeness of the genomes or genomic bins of interest.
miComplete allows a user to provide a list of genomes or genomic bins to retrieve some basic statistics regarding the given genomes (size, GC-content, N- and L50, N- and L90). Further a set of marker genes in HMM format can be provided to also retrieve completeness and redundance of those markers in each genome. Additionally, a set of weights for the marker genes can be provided to also retrieve the weighted versions of completeness and redundancy which can inform the user a bit more of the actual state completeness (see description). Alternatively, the user can calculate new weights for any given set of marker genes provided.
miComplete is still in a state of development, bugs may be encountered. Feedback, bug reports, and feature requests are welcome through Bitbucket’s issue system.
Description
miComplete is a compact software aimed at rapidly and accurately determining of the quality of assembled genomes, often metagenome assembled bins. miComplete also aims at providing a more reliable completeness and redundancy metric via a system of weighting the impact of different marker genes presence or absence differently.
Completeness
In miComplete completeness is calculated based on the presence/absence a set of marker genes provided as a set of HMMs. The presence or absence of the marker genes is determined by HMMER3 (see dependencies) if the hit reported is below the cutoff e-value provided as well as passing the bias check. Duplicated marker genes are also gathered, and found duplications are reported as redundancy.
Weights
Because not all marker genes are equal in determining the completeness of a genome. Some genes associate closely together within a genome (particularly genes organised into operons), and thus viewing two genes that typically associate together within an operon as providing the same completeness information as two unrelated genes could be profoundly misleading. miComplete is able to calculate a weighted version of completeness and redundancy that attempts to factor in how closely the provided marker genes typically associate with other provided marker genes.
Linkage
Weights can be calculated for any given set of marker genes in miComplete. This is can be done by a user by providing a set of reference genomes (note that these need to be single contig chromosomes). The reference genomes can be any set that the user wishes, but as general rule the larger and more diverse number of genomes the better weights. At the end of the run a boxplot of the distribution of weights for all markers is produced along with a file of the relative weights to be used in future runs.
Dependencies
Python (>=3.5)
External software
Executables should be available in the user’s $PATH.
HMMER3
HMMER: biosequence analysis using profile hidden Markov models, by Sean Eddy and coworkers. Tested with v. 3.1b2. Available from <http://hmmer.org/>.
prodigal
A gene prediction software by Doug Hyatt. Tested with v. 2.6.3. Download at: <https://github.com/hyattpd/Prodigal>
Python libraries
If built from the package these will be installed automatically, otherwise can easily be installed using pip (Install pip).
Required
Biopython (>= 1.70) ($ pip install biopython)
Numpy (>= 1.13.1) ($ pip install numpy)
Matplotlib (>= 2.0.2) ($ pip install matplotlib)
Termcolor (>= 1.1.0) ($ pip install termcolor)
Note: matplotlib as implemented requires a user interface. By default it uses Tkinter, which can be installed via your systems package manager. To alter backend used follow these instructions.
Installation
Python package
miComplete can easily be installed along with all python dependencies:
$ pip install micomplete
Assuming that the python bin is in your $PATH, can then be run as:
$ miComplete
Git
Choose an appropriate location, e.g. your home:
$ cd $HOME
Clone the latest version of the repository:
$ git clone http://bitbucket.org/evolegiolab/micomplete.git
Create symlink to some directory in your $PATH (in this example $HOME/bin):
$ cd micomplete $ ls micomplete $ ln -s $(realpath micomplete/micomplete.py) $HOME/bin/miComplete
Optionally, add the folder micomplete in your PATH. The scripts should be kept at their original location.
Usage
Positional arguments
A file of sequence(s) along with type (fna, faa, gbk) provided in a tabular format
The file has to contain per line both a path (relative or absolute) to a genomic file as well as the type separated by a tab. Optionally it can also be given a custom name separately from the filename in a third column:
/seq/genomic_sources/legionella_pneumophila.gbk gbk /seq/genomic_sources/coxiella_burnetii.fna fna /seq/genomic_sources/e_coli.fna fna MG1655_reference (...)
Optional arguments
- -h, --help
show help message and exit
- -c, --completeness
Perform completeness check (also requires a set of HMMs to have been provided)
- --hlist
Write list of Present, Absent and Duplicated markers for each organism to file
- --hmms HMMS
Specifies a set of HMMs to be used for completeness check or linkage analysis
- --weights WEIGHTS
Specify a set of weights for the HMMs specified, (optional)
- --linkage
Specifies that the provided sequences should be used to calculate the weights of the provided HMMs
- --lenient
By default miComplete drops hits with too high bias or too low best domain score. This argument disables that behaivor, permitting any hit that meets the evalue requirements.
- --no-linkage-cutoff
Disable cutoff fraction of the entire fasta which needs to be contained in a single contig in order to be included in linkage calculations. Disable this is likely to result in some erroneos calculations.
- --evalue EVALUE
Specify e-value cutoff to be used for completeness check, default=4e-10
- --bias BIAS
Specify the bias cutoff as a fraction of score defined by hammer.
- --domain-cutoff
Specify the largest allowed difference between best domain evalue and protein evalue.
- --cutoff CUTOFF
Specify cutoff percentage of markers required to be present in genome for it be included in linkage calculat. Default = 0.9
- --threads THREADS
Specify number of threads to be used in parallel
- --log LOG
Log name (default=miComplete.log)
- -v, --verbose
Enable verbose logging
- --debug
Debug logging
- -o, --outfile OUTFILE
Name of outfile can be specified with this argument. By default prints to stdout.
Examples
Create a sequence tab file. Here it is best to avoid relative paths unless you know you will be running miComplete from the same relative directory. A correctly formatted input tab file can be done by hand or using a small utility script included with miComplete:
find $(realpath .) -maxdepth 1 -type f -name "*.fna" | miCompletelist.sh > test_set.tab
Sequence tab file, test_set.tab:
/seq/genomic_sources/legionella_longbeachae.fna fna /seq/genomic_sources/coxiella_burnetii.fna fna /seq/genomic_sources/coxiella-like_endosymbiont.fna fna
Example 1 - Basic stats
This example merely produces basic information about the given sequences:
$ miComplete test_set.tab Name Length GC-content N50 L50 N90 L90 legionella_longbeachae 4149158 37.13 4077332 1 4077332 1 coxiella_burnetii 2032807 42.6 1995488 1 1995488 1 coxiella-like_endosymbiont 1733840 38.17 1733840 1 1733840 1
miComplete prints result to stdout in tabular format, this can favourably be redirected towards a file with a pipe and examined with spreadsheet reader.
$ miComplete test_set.tab > results.tab
Example 2 - Completeness
This example will produce the same basic statistics, but also completeness and redundance:
$ miComplete test_set.tab -c --hmms ~/git/micomplete/share/Bact105.hmm Name Length GC-content Present Markers Completeness Redundancy N50 L50 N90 L90 legionella_longbeachae 4149158 37.13 105 1.0000 1.0095 4077332 1 4077332 1 coxiella_burnetii 2032807 42.6 105 1.0000 1.0000 1995488 1 1995488 1 coxiella-like_endosymbiont 1733840 38.17 102 0.9714 1.0686 1733840 1 1733840 1
That is great, but the run time is starting to increase significantly primarily due to needing to translate four genomes to proteomes. We can speed up the process by running all four parallel with --threads 4:
$ miComplete test_set.tab -c --hmms share/Bact105.hmm --threads 4 > results.tab
Example 3 - Weighted completeness
This example will also produce the weighted completeness:
$ miComplete test_set.tab -c --hmms ~/git/micomplete/share/Bact105.hmm --weights ~/git/micomplete/share/Bact105.weights Name Length GC-content Present Markers Completeness Redundancy Weighted completeness Weighted redundancy N50 L50 N90 L90 legionella_longbeachae 4149158 37.13 105 1.0000 1.0095 1.0 1.0151 4077332 1 4077332 1 coxiella_burnetii 2032807 42.6 105 1.0000 1.0000 1.0 1.0 1995488 1 1995488 1 coxiella-like_endosymbiont 1733840 38.17 102 0.9714 1.0686 0.9476 1.0855 1733840 1 1733840 1
Example 4 - Creating weights
Finally we will create our own set of weights given a set of marker genes for which we do not already have weights. In this example only three bacteria from the same order are used to create weights. Generally one should create weights with as a large number of well distributed (or at least as widely distributed as the data you intend to use the weights for) genomes:
$ miComplete test_set.tab -c --hmms share/Bact105.hmm --linkage --threads 4 > Bact105.weights
Also produces a box plot (distplot.png) of the distribution of weights for each marker gene.
Project details
Release history Release notifications | RSS feed
Download files
Download the file for your platform. If you're not sure which to choose, learn more about installing packages.