unitig-caller: wrapper around Bifrost to detect presence of sequence elements
unitig-caller
Determines presence/absence of sequence elements in bacterial sequence data using Bifrost Build and Query functions. Uses assemblies and/or reads as inputs.
unitig-caller wraps Bifrost and formats its output files for use with pyseer. It also includes a separate implementation which calls sequences using an FM-index.
Build mode creates a compact de Bruijn graph using Bifrost. Query mode converts the .gfa file produced by Build mode to a .fasta, using an associated colours file to query the presence of unitigs in the source genomes used to build the original de Bruijn graph.
Simple mode finds presence of unitigs in a new population using an FM-index.
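As a rough illustration of the .gfa to .fasta conversion mentioned above, the sketch below pulls the segment (S) lines out of GFA text and writes each as a FASTA record. The function name gfa_to_fasta and the example graph are hypothetical, and this is not necessarily how unitig-caller or Bifrost performs the conversion; it only assumes the standard GFA layout in which a segment line is "S", a name and a sequence separated by tabs.

```python
def gfa_to_fasta(gfa_text):
    """Convert GFA segment (S) lines into FASTA records (illustrative only)."""
    records = []
    for line in gfa_text.splitlines():
        fields = line.split("\t")
        # Segment lines have the form: S <name> <sequence> [optional tags]
        if len(fields) >= 3 and fields[0] == "S":
            records.append(f">{fields[1]}\n{fields[2]}")
    return "\n".join(records) + ("\n" if records else "")


# Made-up graph: a header line, two segments (unitigs) and one link.
example_gfa = (
    "H\tVN:Z:1.0\n"
    "S\t1\tACGTACGT\tDA:Z:1\n"
    "S\t2\tTTGCA\n"
    "L\t1\t+\t2\t+\t3M\n"
)
fasta = gfa_to_fasta(example_gfa)
print(fasta, end="")
```

Header (H) and link (L) lines are skipped, so only the unitig sequences end up in the FASTA output.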
Install
Use unitig-caller if installed through pip/conda, or python unitig_caller-runner.py if using a clone of the code.
With conda (recommended)
Get it from bioconda:
conda install unitig-caller
If you haven't set up conda and bioconda before, first install miniconda. Then add the correct channels:
conda config --add channels defaults
conda config --add channels bioconda
conda config --add channels conda-forge
With pip
Get it from PyPI:
pip install unitig-caller
Requires Bifrost version 1.0.3 to be installed and accessible via PATH (see the installation steps on the Bifrost GitHub page).
From source
Requires cmake, pthreads, pybind11 and a C++17 compiler (e.g. gcc >=7.3), in addition to the pip requirements.
git clone https://github.com/johnlees/unitig-caller --recursive
cd unitig-caller
python setup.py install
Usage
There are three ways to use this package:
1. Build a population graph to extract unitigs for GWAS with pyseer, like unitig-counter (--build).
2. Find these unitigs in a new population using a graph (--build and --query).
3. Find these unitigs in a new population using an index (--simple).
For 1), run --build mode followed by --query mode.
Both 2) and 3) give the same results using different index tools. Each finds the original unitigs in a new population so that pyseer models can be applied to it.
For 2), first run --build mode to make a graph for the new population. Then run --query mode with this graph, but the --unitigs from the original population.
For 3), run --simple mode giving the new genomes as --refs and the --unitigs from the original population.
These modes are detailed below.
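The workflows above can be sketched as argument lists, which is one way to script them from Python. The helper names (build_cmd, query_cmd, simple_cmd) and all file names are hypothetical; only the flags come from this README. The sketch also assumes that the --output prefix given to --build is the prefix later passed as --graph-prefix.

```python
import shlex


def build_cmd(refs, output, reads=None):
    """Argument list for --build mode (reads list is optional)."""
    cmd = ["unitig-caller", "--build", "--refs", refs]
    if reads is not None:
        cmd += ["--reads", reads]
    return cmd + ["--output", output]


def query_cmd(graph_prefix, output, unitigs=None):
    """Argument list for --query mode (separate unitigs file is optional)."""
    cmd = ["unitig-caller", "--query", "--graph-prefix", graph_prefix]
    if unitigs is not None:
        cmd += ["--unitigs", unitigs]
    return cmd + ["--output", output]


def simple_cmd(refs, unitigs, output):
    """Argument list for --simple mode."""
    return ["unitig-caller", "--simple", "--refs", refs,
            "--unitigs", unitigs, "--output", output]


# Workflow 2: build a graph for the new population, then query it with the
# unitigs from the original population.
workflow_2 = [
    build_cmd("refs.txt", "new_pop"),
    query_cmd("new_pop", "new_calls", unitigs="original_unitigs.fasta"),
]
# Workflow 3: call the original unitigs in the new genomes with an index.
workflow_3 = [simple_cmd("strain_list.txt", "original_unitigs.txt", "new_calls")]

for cmd in workflow_2 + workflow_3:
    print(shlex.join(cmd))  # each list could instead be passed to subprocess.run
```

Keeping the commands as lists avoids shell quoting problems if paths contain spaces.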
Running Build mode
This uses Bifrost Build to generate a compact de Bruijn graph. By default this is a coloured compact de Bruijn graph.
unitig-caller --build --refs refs.txt --reads reads.txt --output out_prefix
--refs is a required .txt file listing paths of input assemblies or read files (.fasta or .fastq), each on a new line. Must be specified as either 'refs.txt' for assemblies or 'reads.txt' for read files. No header row.
--reads is an optional .txt file listing paths to additional sequence files of a different type to those specified in --refs (e.g. if 'refs.txt' is given in --refs, then 'reads.txt' will be given in --reads, and vice versa), each on a new line. No header row.
--output is the prefix for output files.
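One way to prepare the two list files is to sort input paths by extension, as in the sketch below. The function names and example paths are hypothetical, and the extension sets (including the optional .gz suffix) are an assumption about typical file naming rather than something this README specifies.

```python
# Assumed extensions; adjust to match how your files are actually named.
FASTA_EXT = (".fasta", ".fa", ".fna")
FASTQ_EXT = (".fastq", ".fq")


def split_inputs(paths):
    """Separate assembly paths from read paths by file extension."""
    assemblies, reads = [], []
    for path in paths:
        name = path.lower()
        if name.endswith(".gz"):
            name = name[:-3]  # look at the extension underneath the .gz
        if name.endswith(FASTA_EXT):
            assemblies.append(path)
        elif name.endswith(FASTQ_EXT):
            reads.append(path)
    return assemblies, reads


def list_file_text(paths):
    """One path per line, no header row."""
    return "".join(path + "\n" for path in paths)


inputs = ["asm/s1.fasta", "asm/s2.fa", "raw/s3_1.fastq.gz", "notes.txt"]
assemblies, reads = split_inputs(inputs)
refs_txt = list_file_text(assemblies)    # contents for refs.txt
reads_txt = list_file_text(reads)        # contents for reads.txt
print(refs_txt, end="")
print(reads_txt, end="")
```

Files with an unrecognised extension (notes.txt here) are left out of both lists.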
By default de Bruijn graphs are coloured, with an accompanying .bfg_colors file being generated alongside the .gfa file. To turn this off, use --no_colour. Note that Query mode cannot be run without a .bfg_colors file.
To generate a clean de Bruijn graph (clip tips and delete isolated contigs shorter than k k-mers in length), specify --clean.
Build mode automatically generates a .fasta file containing unitigs found within the graph.
Running Query mode
Before running Query mode, generate a coloured compact de Bruijn graph using Build mode. Then run the Query command as below.
unitig-caller --query --graph-prefix in_prefix --unitigs query_unitigs.fasta --output out_prefix
--graph-prefix is the required prefix for the .gfa, .bfg_colors and unitigs .fasta files generated from --build mode applied to the new population.
--unitigs is an optional .fasta file, specifying a separate unitigs .fasta file that was generated by --build mode on another graph. If not specified, unitigs from the graph will be used, generating calls for this population.
--output is the prefix for output files.
The sensitivity of querying can be altered by passing a float argument to --ratiok (between 0 and 1, default 1.0), which determines the threshold proportion of k-mers of a specific colour present in a unitig for colour classification. Specifying --inexact will search the graph for both exact and inexact k-mers (1 substitution or indel) from queries. Lowering --ratiok and/or specifying --inexact will result in more colour hits per unitig, but will increase the probability of false positives and the run-time.
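The effect of the --ratiok threshold can be illustrated with a toy calculation. This is a simplified sketch, not Bifrost's implementation: the function names are hypothetical, a colour is represented as a plain set of k-mers, and reverse complements and inexact matches are ignored.

```python
def kmers(seq, k):
    """All overlapping k-mers of a sequence, in order."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def has_colour(unitig, colour_kmers, k, ratiok=1.0):
    """True if the proportion of the unitig's k-mers found in this colour
    is at least ratiok."""
    query = kmers(unitig, k)
    if not query:
        return False  # unitig shorter than k: nothing to match
    present = sum(1 for kmer in query if kmer in colour_kmers)
    return present / len(query) >= ratiok


k = 3
genome_kmers = set(kmers("ACGTT", k))   # ACG, CGT, GTT
unitig = "ACGTA"                        # k-mers: ACG, CGT, GTA (2 of 3 present)

strict = has_colour(unitig, genome_kmers, k, ratiok=1.0)
relaxed = has_colour(unitig, genome_kmers, k, ratiok=0.6)
print(strict, relaxed)
```

With the default threshold of 1.0 a single missing k-mer rejects the unitig, while a threshold of 0.6 accepts it, which is why lower values give more hits and more false positives.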
Running simple mode
This uses suffix arrays (FM-index) provided by SeqAn3 to perform string matches:
unitig-caller --simple --refs strain_list.txt --unitigs queries.txt --output calls
--refs is a required file listing input assemblies, name followed by location of fasta file (tab separated), each on a new line. No header row.
--unitigs is a required list of the unitig sequences to call. The unitigs need to be in the first column (tab separated). A header row is assumed, so output from pyseer etc can be directly used.
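A --refs file in the layout described above (name, tab, path, no header) could be generated along these lines. The function name and paths are hypothetical, and taking the sample name from the file name without its extension is an assumption; any unique name per assembly should serve.

```python
from pathlib import PurePosixPath


def strain_list_text(fasta_paths):
    """Build 'name<TAB>path' lines, one per assembly, with no header row."""
    lines = []
    for path in fasta_paths:
        name = PurePosixPath(path).stem  # e.g. 'sampleA' from 'sampleA.fasta'
        lines.append(f"{name}\t{path}\n")
    return "".join(lines)


listing = strain_list_text(["genomes/sampleA.fasta", "genomes/sampleB.fasta"])
print(listing, end="")
```

The returned text can be written to the file passed as --refs, for example strain_list.txt.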
calls_pyseer.txt will contain unitig calls in seer/pyseer k-mer format.
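For downstream checks it can help to read the calls back into Python. The sketch below assumes each line has the form "sequence | sample1:1 sample2:1", which is my understanding of the seer/pyseer k-mer layout; verify this against your own output before relying on it. The function name and example data are hypothetical.

```python
def parse_calls(text):
    """Map each unitig sequence to the list of samples it was called in.

    Assumes lines of the form: SEQUENCE | sample1:1 sample2:1
    """
    calls = {}
    for line in text.splitlines():
        if not line.strip():
            continue  # skip blank lines
        sequence, _, samples = line.partition("|")
        # Drop the ':count' suffix from each sample entry.
        calls[sequence.strip()] = [s.rsplit(":", 1)[0] for s in samples.split()]
    return calls


example = "ACGTAC | s1:1 s2:1\nTTGACA | s2:1\n"
calls = parse_calls(example)
print(calls)
```

Splitting on the last colon keeps sample names intact even if they contain a colon themselves.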
By default FM-indexes are saved in the same location as the assembly files so that they can be quickly loaded by subsequent runs. To turn this off use --no-save-idx.
Option reference
usage: unitig-caller [-h] (--build | --query | --simple) [--refs REFS]
                     [--reads READS] [--graph-prefix GRAPH_PREFIX]
                     [--unitigs UNITIGS] [--output OUTPUT] [--no_colour]
                     [--clean] [--ratiok RATIOK] [--inexact]
                     [--kmer_size KMER_SIZE] [--minimizer_size MINIMIZER_SIZE]
                     [--no-save-idx] [--threads THREADS] [--bifrost BIFROST]
                     [--version]

Call unitigs in a population dataset

optional arguments:
  -h, --help            show this help message and exit

Mode of operation:
  --build               Build coloured/uncoloured de Bruijn graph using
                        Bifrost
  --query               Query unitig presence/absence across input genomes
  --simple              Use FM-index to make calls

Unitig-caller input/output:
  --refs REFS           Ref file to use to --build bifrost graph (or with
                        --simple)
  --reads READS         Read file to use to --build bifrost graph
  --graph-prefix GRAPH_PREFIX
                        Prefix of bifrost graph to --query
  --unitigs UNITIGS     fasta file of unitigs to query (--query or --simple)
  --output OUTPUT       Prefix for output [default = 'unitig_caller']

Build Input/output:
  --no_colour           Specify for uncoloured de Bruijn Graph [default =
                        False]
  --clean               Clean DBG (clip tips and delete isolated contigs
                        shorter than k k-mers in length) [default = False]

Query Input/output:
  --ratiok RATIOK       ratio of k-mers from queries that must occur in the
                        graph to be considered as belonging to colour [default
                        = 1.0]
  --inexact             Graph is searched with exact and inexact k-mers (1
                        substitution or indel) from queries [default = False]

Bifrost options:
  --kmer_size KMER_SIZE
                        K-mer size for graph building/querying [default = 31]
  --minimizer_size MINIMIZER_SIZE
                        Minimizer size to be used for k-mer hashing [default =
                        23]

Simple mode options:
  --no-save-idx         Do not save FM-indexes for reuse

Other:
  --threads THREADS     Number of threads to use [default = 1]
  --bifrost BIFROST     Location of bifrost executable [default = Bifrost]
  --version             show program's version number and exit
Citation
If you use this, please cite the Bifrost paper:
Holley G., Melsted, P. Bifrost – Highly parallel construction and indexing of colored and compacted de Bruijn graphs. bioRxiv 695338 (2019). doi: https://doi.org/10.1101/695338