Skip to main content

A tool for sequence, annotation, and methylome comparison between multiple bacterial genomes

Project description

Kable

A tool for sequence, annotation, and methylome comparison between multiple bacterial genomes

Introduction

Various bioinformatics tools exist to annotate gene sequences in the bacterial genome and to find variants; however, linking variants to specific genes or identifying changes to gene boundaries require additional steps. Further, there are currently no tools to compare the location of base modifications between genomes.

To address the shortcomings in current analysis tools, I present Kable. Kable utilizes a colored de Bruijn Graph to align multiple genomic sequences while simultaneously tracking associated feature information. Kable can handle gene annotations, base modification or any other data that can be described in a GFF or VCF format, including potential assembly errors identified by al2var. Kable provides three key features to the user. First, the user can search for a specific sequence or find a segment between two sequences and Kable will return any and all annotation data occurring within those sequences and display their exact locations in an alignment file. Second, Kable can identify variants between any number of genomes, outputting the locations of such variants to a VCF and displaying overlaying annotations in an alignment file. Finally, Kable offers the first method to directly compare the location of base modifications and record differences in the methylome to a specialized VCF file.

Getting Started

Requirements

  • Python v3+
  • BioSeq
  • Levenshtein python package

Installation

kable is available on PyPI and can be installed using pip.

pip install kable

OR Clone this repository and run python kable.py

Usage

Build graph

As input, kable build requires a tab deliminated manifest file. Each line of the manifest file represents one sample and contains a list of files including one FASTA file and any number of feature files in GFF or TSV format that pertain to that sample. The line can also include an optional orientation indicator (‘-‘ or ‘+’). By default, Kable will input the FASTA file in the forward direcotry, but if the orientation indicator is equal to '-', then Kable will input the reverse complement of the sample's sequence.

Kable build parses incoming genome sequences using a set k-mer length. By default, Kable uses a k size of 23, but the user can set a specific k-mer length using the ‘-k’ or ‘—kmer_size’ flag. Values as low as 3 are accepted, though the results may not be meaningful depending on the sequence length. If the user specifies a k-mer length below 3, Kable will automatically adjust the k-mer size to 3.

# Basic usage
python kable.py build -m manifest.tsv 

# Use a k-mer length of 33
python kable.py build -m manifest.tsv -k 33

Add to graph

The add module requires a pre-existing kable-generated graph and feature file along with a new manifest file. The manifest should follow the same formatting pattern used by the build module.

The add module will imoprt the graph, determine the k-mer size by the existing node length, then add any new samples included in the manifest file. New graph and feature files will be exported and labeled with 'add' in the savename.

# Basic usage
python kable.py add -g kable_output/kable_graph.gfa -f kable_output/kable_feats.tsv -m new_manifest.tsv

Find sequence variants

The find_vars module identifies sequence variants between the genome sequences. This module exports a VCF file for each genome sequence. It also aligns the variable sequences between genomes and shows the location of gene annotations in relation to the variants.

# Basic usage
python kable.py find_vars -g kable_output/kable_graph.gfa -f kable_output/kable_feats.tsv

# Search for reverse complement and set maximum search depth to 100
python kable.py find_vars -g kable_output/kable_graph.gfa -f kable_output/kable_feats.tsv -r -d 100

# Save sub graph and exclude annotaiton data
python kable.py find_vars -g kable_output/kable_graph.gfa -f kable_output/kable_feats.tsv -w -e

Find base modifications

The kable mod_search module shows where base modifications occur in relation to gene annotations, summarizes the number and type of modifications in sequences, and counts the number and type of modifications in each genome.

# Basic usage
python kable.py mod_search -g kable_output/kable_graph.gfa -f kable_output/kable_feats.tsv

# Search for reverse complement and set maximum search depth to 100
python kable.py mod_search -g kable_output/kable_graph.gfa -f kable_output/kable_feats.tsv -r -d 100

# Save sub graph and exclude annotaiton data
python kable.py mod_search -g kable_output/kable_graph.gfa -f kable_output/kable_feats.tsv -w -e

Query for a specific sequence

The Kable query module searches for a provided sequence in the graph and returns a list of which genones and contigs contain the sequence. The module will also export a pseudo alignemnt file that shows the location of genome features such as gene annotations and base modifications in relation to the sequence.

# Basic usage
python kable.py query -g kable_output/kable_graph.gfa -f kable_output/kable_feats.tsv -q sequence

# Include base modifications in pseudo alignment file
python kable.py query -g kable_output/kable_graph.gfa -f kable_output/kable_feats.tsv -q sequence -m

# Search for the reverse complement sequence
python kable.py query -g kable_output/kable_graph.gfa -f kable_output/kable_feats.tsv -q sequence -r

Find sequence between two points

The Kable in situ module idenfies and extracts all sequences present between two subsequnces. The sequences are named after the genome and contig they originated from and exported to a new FASTA file. If performing multiple queries, it is highly recommended to set the savename with the -n flag.

NOTE: the length of the search sequences should be the same length of the k-mer size or longer.

# Basic usage
python kable.py in_situ -g kable_output/kable_graph.gfa -f kable_output/kable_feats.tsv -1 seq1 -2 seq2

# Define output name
python kable.py in_situ -g kable_output/kable_graph.gfa -f kable_output/kable_feats.tsv -1 seq1 -2 seq2 -n search1

# Include base modifications in pseudo alignment file
python kable.py in_situ -g kable_output/kable_graph.gfa -f kable_output/kable_feats.tsv -1 seq1 -2 seq2 -m

# Search for the reverse complement sequence
python kable.py in_situ -g kable_output/kable_graph.gfa -f kable_output/kable_feats.tsv -1 seq1 -2 seq2 -r

General usage

Regardless of the speicific module that is run, Kable will create an output direcotry within the current working directory called kable_output/. All log files and generated output will be placed within this directory and given the prefix kable along with a module dependent suffix.

The location of this output directory can be changed for any module using the -p or --output-path flag as long as the path already exists. The name of the output directory can be changed for any module using the -o or --output_directory flag. Further, the prefix of the output files can be set using the -n or --savename flag.

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

kable-0.0.1.tar.gz (26.9 kB view details)

Uploaded Source

Built Distribution

If you're not sure about the file name format, learn more about wheel file names.

kable-0.0.1-py3-none-any.whl (24.7 kB view details)

Uploaded Python 3

File details

Details for the file kable-0.0.1.tar.gz.

File metadata

  • Download URL: kable-0.0.1.tar.gz
  • Upload date:
  • Size: 26.9 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/4.0.2 CPython/3.8.5

File hashes

Hashes for kable-0.0.1.tar.gz
Algorithm Hash digest
SHA256 9a4eb3f8fd2678b2687fcca4feedc4cd06725f323c9e28b17754453a84f6df8b
MD5 67c1c605551a5724d0551ae320174122
BLAKE2b-256 3d6e4c63369e02f6aa1b2911a40fa26f6c8a6cd1c97d728915222ff35e610923

See more details on using hashes here.

File details

Details for the file kable-0.0.1-py3-none-any.whl.

File metadata

  • Download URL: kable-0.0.1-py3-none-any.whl
  • Upload date:
  • Size: 24.7 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/4.0.2 CPython/3.8.5

File hashes

Hashes for kable-0.0.1-py3-none-any.whl
Algorithm Hash digest
SHA256 daa2c41f04c28fde5fbf1acc9f5d0104e6f8578749610af4bf14b44e5c2ef811
MD5 6ce5d76195c935231441cd30b855b285
BLAKE2b-256 804fc23d4ec692061bb96fc78dac6254105916b2cf8a1056d8cb9bd2969e0c15

See more details on using hashes here.

Supported by

AWS Cloud computing and Security Sponsor Datadog Monitoring Depot Continuous Integration Fastly CDN Google Download Analytics Pingdom Monitoring Sentry Error logging StatusPage Status page