unitig-caller: wrapper around Bifrost to detect presence of sequence elements

unitig-caller

Determines presence/absence of sequence elements in bacterial sequence data using Bifrost Build and Query functions. Uses assemblies and/or reads as inputs.

unitig-caller is a wrapper around Bifrost that formats files for use with pyseer. It also includes an implementation that calls sequences using an FM-index.

Build mode creates a compact de Bruijn graph using Bifrost. Query mode converts the .gfa file produced by Build mode to a .fasta, using an associated colours file to query the presence of unitigs in the source genomes used to build the original de Bruijn graph.

Simple mode finds presence of unitigs in a new population using an FM-index.

Install

Use unitig-caller if installed through pip/conda, or python unitig_caller-runner.py if using a clone of the code.

With conda (recommended)

Get it from bioconda:

conda install unitig-caller

If you haven't set this up, first install miniconda. Then add the correct channels:

conda config --add channels defaults
conda config --add channels bioconda
conda config --add channels conda-forge

With pip

Get it from PyPI:

pip install unitig-caller

Requires Bifrost version 1.0.3 to be installed and accessible via PATH (see the installation steps on the Bifrost GitHub page).

From source

Requires cmake, pthreads, pybind11 and a C++17 compiler (e.g. gcc >=7.3), in addition to the pip requirements.

git clone https://github.com/johnlees/unitig-caller --recursive
cd unitig-caller
python setup.py install

Usage

There are three ways to use this package:

  1. Build a population graph to extract unitigs for GWAS with pyseer like unitig-counter (--build).
  2. Find these unitigs in a new population using a graph (--build and --query).
  3. Find these unitigs in a new population using an index (--simple).

For 1), run --build mode followed by --query mode.

Both 2) and 3) give the same results with different index tools, both finding unitigs so pyseer models can be applied to a new population.

For 2), first run --build mode to make a graph for the new population. Then run --query mode with this graph, but the --unitigs from the original population.

For 3), run --simple mode giving the new genomes as --refs and the --unitigs from the original population.
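As a rough sketch of workflow 2), the two commands can be assembled in Python from the flags documented here. The file names (new_refs.txt, new_pop, original_unitigs.fasta, new_pop_calls) are hypothetical, and the sketch assumes that the --output prefix used in --build is the prefix later passed to --graph-prefix.

```python
def build_command(refs, output, reads=None):
    """Return the argument list for a --build run."""
    cmd = ["unitig-caller", "--build", "--refs", refs, "--output", output]
    if reads is not None:
        cmd += ["--reads", reads]
    return cmd


def query_command(graph_prefix, unitigs, output):
    """Return the argument list for a --query run against an existing graph."""
    return [
        "unitig-caller", "--query",
        "--graph-prefix", graph_prefix,
        "--unitigs", unitigs,
        "--output", output,
    ]


# Workflow 2: build a graph for the new population, then query it with the
# unitigs from the original population. All file names are hypothetical.
steps = [
    build_command("new_refs.txt", "new_pop"),
    query_command("new_pop", "original_unitigs.fasta", "new_pop_calls"),
]

for step in steps:
    print(" ".join(step))
    # To actually run each step: import subprocess; subprocess.run(step, check=True)
```

Building the commands as lists rather than strings avoids shell quoting problems if a path contains spaces.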

These modes are detailed below.

Running Build mode

This uses Bifrost Build to generate a compact de Bruijn graph. By default this is a coloured compact de Bruijn graph.

unitig-caller --build --refs refs.txt --reads reads.txt --output out_prefix

--refs is a required .txt file listing paths of input assemblies or read files (.fasta or .fastq), each on a new line. Must be specified as either 'refs.txt' for assemblies or 'reads.txt' for read files. No header row.
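A minimal sketch of producing such a list file in Python is shown here. The helper name format_path_list and the assemblies/ paths are hypothetical; the only format assumed is the one described above (one path per line, no header row).

```python
from pathlib import Path


def format_path_list(sequence_files):
    """Return the list-file contents: one path per line, no header row."""
    return "".join(f"{Path(p)}\n" for p in sequence_files)


# Hypothetical assembly paths used purely for illustration.
refs_text = format_path_list([
    "assemblies/sample_1.fasta",
    "assemblies/sample_2.fasta",
])
print(refs_text, end="")

# To write the list for --refs:
# Path("refs.txt").write_text(refs_text)
```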

--reads is an optional .txt file listing paths to additional sequence files of a different type to those specified in --refs (e.g. if 'refs.txt' is given in --refs, then 'reads.txt' will be given in --reads and vice versa), each on a new line. No header row.

--output is the prefix for output files.

By default de Bruijn graphs are coloured, with an accompanying .bfg_colors being generated alongside the .gfa file. To turn this off, use --no_colour. Note, Query mode cannot be run without a .bfg_colors file.

To generate a clean de Bruijn graph (clip tips and delete isolated contigs shorter than k k-mers in length), specify --clean.

Build mode automatically generates a .fasta file containing unitigs found within the graph.

Running Query mode

Before running Query mode, generate a coloured compact de Bruijn graph using Build mode. Then run the Query command as below.

unitig-caller --query --graph-prefix in_prefix --unitigs query_unitigs.fasta --output out_prefix

--graph-prefix is the required prefix for the .gfa, .bfg_colors and unitigs .fasta files generated from --build mode applied to the new population.
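Since Query mode needs both the graph and its colours file, a quick pre-flight check can save a failed run. The sketch below assumes the files are named by appending .gfa and .bfg_colors directly to the prefix; the helper missing_graph_files is hypothetical and is not part of unitig-caller. The unitigs .fasta is not checked because its exact file name is not given here.

```python
from pathlib import Path


def missing_graph_files(graph_prefix, suffixes=(".gfa", ".bfg_colors")):
    """Return the expected graph files that do not exist for this prefix."""
    expected = [graph_prefix + suffix for suffix in suffixes]
    return [name for name in expected if not Path(name).is_file()]


missing = missing_graph_files("in_prefix")
if missing:
    print("Run --build first; missing:", ", ".join(missing))
```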

--unitigs is an optional .fasta file, specifying a separate unitigs .fasta file that was generated by --build mode on another graph. If not specified, unitigs from the graph will be used, generating calls for this population.

--output is the prefix for output files.

The sensitivity of querying can be altered by passing a float argument to --ratiok (between 0 and 1, default 1.0), which determines the threshold proportion of k-mers of a specific colour present in a unitig for colour classification. Specifying --inexact will search the graph for both exact and inexact k-mers (1 substitution or indel) from queries. Lowering --ratiok and/or specifying --inexact will result in more colour hits per unitig, but will increase both the probability of false positives and the run-time.
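The effect of the --ratiok threshold can be illustrated with a toy model. This is not Bifrost's implementation: it uses plain Python sets, ignores reverse complements and inexact matching, and the sequences and k = 4 are made up for the example.

```python
def kmers(sequence, k):
    """Return all overlapping k-mers of a sequence, in order."""
    return [sequence[i:i + k] for i in range(len(sequence) - k + 1)]


def colour_called(unitig, genome_kmers, k, ratiok=1.0):
    """Call the unitig present if enough of its k-mers occur in the genome."""
    query = kmers(unitig, k)
    if not query:
        return False
    found = sum(1 for kmer in query if kmer in genome_kmers)
    return found / len(query) >= ratiok


k = 4
genome_kmers = set(kmers("ACGTACGTTAGC", k))
unitig = "ACGTTAGG"  # 4 of its 5 k-mers occur in the toy genome

print(colour_called(unitig, genome_kmers, k, ratiok=1.0))   # False
print(colour_called(unitig, genome_kmers, k, ratiok=0.75))  # True
```

With the default of 1.0 every k-mer must match, so a single mismatching k-mer at the end of the unitig is enough to drop the call; relaxing the threshold recovers it at the cost of stringency.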

Running simple mode

This uses suffix arrays (FM-index) provided by SeqAn3 to perform string matches:

unitig-caller --simple --refs strain_list.txt --unitigs queries.txt --output calls

--refs is a required file listing input assemblies, name followed by location of fasta file (tab separated), each on a new line. No header row.
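One possible way to generate this tab-separated list is sketched below. Using the file stem as the sample name is an assumption made for illustration, as are the helper name format_strain_list and the example paths.

```python
from pathlib import Path


def format_strain_list(fasta_files):
    """Return 'name<TAB>path' lines, one per assembly, with no header row."""
    return "".join(f"{Path(p).stem}\t{p}\n" for p in fasta_files)


# Hypothetical assembly paths; the sample name is taken from the file stem.
strain_text = format_strain_list([
    "assemblies/sample_1.fasta",
    "assemblies/sample_2.fasta",
])
print(strain_text, end="")

# To write the list for --refs in simple mode:
# Path("strain_list.txt").write_text(strain_text)
```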

--unitigs is a required list of the unitig sequences to call. The unitigs need to be in the first column (tab separated). A header row is assumed, so output from pyseer etc can be directly used.

calls_pyseer.txt will contain unitig calls in seer/pyseer k-mer format.
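The calls can be read back with a few lines of Python. This sketch assumes the seer/pyseer k-mer layout of a sequence, a "|" separator, then space-separated sample:count entries; the example line and sample names are invented.

```python
def parse_pyseer_line(line):
    """Split one k-mer format line into (sequence, list of sample names)."""
    sequence, _, samples_field = line.partition("|")
    # Each entry is assumed to look like 'sample:count'; keep the sample name.
    samples = [entry.rsplit(":", 1)[0] for entry in samples_field.split()]
    return sequence.strip(), samples


# Invented example line in the assumed format.
example = "ACGTTAGC | sample_1:1 sample_3:1"
unitig, present_in = parse_pyseer_line(example)
print(unitig, present_in)

# To read a whole file:
# with open("calls_pyseer.txt") as handle:
#     calls = dict(parse_pyseer_line(line) for line in handle if line.strip())
```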

By default FM-indexes are saved in the same location as the assembly files so that they can be quickly loaded by subsequent runs. To turn this off use --no-save-idx.

Option reference

usage: unitig-caller [-h] (--build | --query | --simple) [--refs REFS]
                     [--reads READS] [--graph-prefix GRAPH_PREFIX]
                     [--unitigs UNITIGS] [--output OUTPUT] [--no_colour]
                     [--clean] [--ratiok RATIOK] [--inexact]
                     [--kmer_size KMER_SIZE] [--minimizer_size MINIMIZER_SIZE]
                     [--no-save-idx] [--threads THREADS] [--bifrost BIFROST]
                     [--version]

Call unitigs in a population dataset

optional arguments:
  -h, --help            show this help message and exit

Mode of operation:
  --build               Build coloured/uncoloured de Bruijn graph using
                        Bifrost
  --query               Query unitig presence/absence across input genomes
  --simple              Use FM-index to make calls

Unitig-caller input/output:
  --refs REFS           Ref file to use to --build bifrost graph (or with
                        --simple)
  --reads READS         Read file to use to --build bifrost graph
  --graph-prefix GRAPH_PREFIX
                        Prefix of bifrost graph to --query
  --unitigs UNITIGS     fasta file of unitigs to query (--query or --simple)
  --output OUTPUT       Prefix for output [default = 'unitig_caller']

Build Input/output:
  --no_colour           Specify for uncoloured de Bruijn Graph [default =
                        False]
  --clean               Clean DBG (clip tips and delete isolated contigs
                        shorter than k k-mers in length) [default = False]

Query Input/output:
  --ratiok RATIOK       ratio of k-mers from queries that must occur in the
                        graph to be considered as belonging to colour [default
                        = 1.0]
  --inexact             Graph is searched with exact and inexact k-mers (1
                        substitution or indel) from queries [default = False]

Bifrost options:
  --kmer_size KMER_SIZE
                        K-mer size for graph building/querying [default = 31]
  --minimizer_size MINIMIZER_SIZE
                        Minimizer size to be used for k-mer hashing [default =
                        23]

Simple mode options:
  --no-save-idx         Do not save FM-indexes for reuse

Other:
  --threads THREADS     Number of threads to use [default = 1]
  --bifrost BIFROST     Location of bifrost executable [default = Bifrost]
  --version             show program's version number and exit

Citation

If you use this, please cite the Bifrost paper:

Holley, G., Melsted, P. Bifrost – Highly parallel construction and indexing of colored and compacted de Bruijn graphs. bioRxiv 695338 (2019). doi: https://doi.org/10.1101/695338
