A toolkit for evaluating k-mer length in a given genome dataset for alignment-free phylogenomic analysis

Project description

KITSUNE is a toolkit for evaluating k-mer length in a given genome dataset for alignment-free phylogenomic analysis.

K-mer based approaches are simple and fast, and are widely used in many applications, including biological sequence comparison. However, selecting a k-mer length that captures enough information for comparison is often overlooked. We therefore developed KITSUNE to aid k-mer length selection, based on the three-step approach described in Viral Phylogenomics Using an Alignment-Free Method: A Three-Step Approach to Determine Optimal Length of k-mer.

KITSUNE uses the Jellyfish software for k-mer counting. Thanks to the Jellyfish developers. Citation

KITSUNE calculates three metrics across the considered k-mer range:

  1. Cumulative Relative Entropy (CRE)
  2. Average number of Common Features (ACF)
  3. Observed Common Features (OCF)
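
To make the idea of counting features shared between genomes concrete, here is a minimal Python sketch. It is not KITSUNE's implementation (KITSUNE delegates counting to Jellyfish), and the function names `count_kmers` and `average_common_features` are hypothetical. It assumes one simple reading of "common features": the number of distinct k-mers shared by a pair of sequences, averaged over all pairs. The exact definitions are in the paper cited above.

```python
from collections import Counter
from itertools import combinations


def count_kmers(seq, k):
    """Count every overlapping k-mer in seq, skipping k-mers with non-ACGT symbols."""
    seq = seq.upper()
    counts = Counter()
    for i in range(len(seq) - k + 1):
        kmer = seq[i:i + k]
        if set(kmer) <= set("ACGT"):
            counts[kmer] += 1
    return counts


def average_common_features(seqs, k):
    """Mean number of distinct k-mers shared by each pair of sequences.

    Assumption: a simplified stand-in for ACF, for illustration only.
    """
    kmer_sets = [set(count_kmers(s, k)) for s in seqs]
    pairs = list(combinations(kmer_sets, 2))
    if not pairs:
        return 0.0
    return sum(len(a & b) for a, b in pairs) / len(pairs)


if __name__ == "__main__":
    # Toy sequences, not real genomes.
    genomes = ["ACGTACGTGA", "ACGTTCGTGA", "TTGTACGAGA"]
    for k in range(2, 6):
        print(k, average_common_features(genomes, k))
```

As k grows, fewer k-mers are shared by chance, which is why the shared-feature count is tracked across a range of k rather than at a single value.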

Moreover, KITSUNE provides various genomic distance calculations from k-mer frequency vectors, which can be used for species identification or phylogenomic tree construction.

If you use KITSUNE in your research, please cite: Reference

Installation

Install through pip:

pip install kitsune

Usage

Calculate CRE, ACF, and OCF values for a specific k-mer length

Kitsune provides three commands to help choose an appropriate k-mer length using CRE, ACF, and OCF (note that the OCF subcommand is spelled ofc):

kitsune cre genome_fasta/* -kf 5 -ke 10
kitsune acf genome_fasta/* -k 5
kitsune ofc genome_fasta/* -k 5

Calculate genomic distance at a specific k-mer length from the k-mer frequency vectors of two genomes

Kitsune provides a command to calculate genomic distance using different distance estimation methods.

distance option   name
braycurtis        Bray-Curtis distance
canberra          Canberra distance
chebyshev         Chebyshev distance
cityblock         City Block (Manhattan) distance
correlation       Correlation distance
cosine            Cosine distance
euclidean         Euclidean distance
jensenshannon     Jensen-Shannon distance
sqeuclidean       Squared Euclidean distance
dice              Dice dissimilarity
hamming           Hamming distance
jaccard           Jaccard-Needham dissimilarity
kulsinski         Kulsinski dissimilarity
rogerstanimoto    Rogers-Tanimoto dissimilarity
russellrao        Russell-Rao dissimilarity
sokalmichener     Sokal-Michener dissimilarity
sokalsneath       Sokal-Sneath dissimilarity
yule              Yule dissimilarity
mash              MASH distance
jsmash            MASH Jensen-Shannon distance
jaccarddistp      Jaccard-Needham dissimilarity Probability
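
As an illustration of what a few of these options measure, the sketch below computes three of the distances from the k-mer count vectors of two short sequences in plain Python. It is a hedged sketch, not KITSUNE's code: the helper names are hypothetical, the Jaccard variant is assumed to work on k-mer presence/absence, and the Mash-style distance is derived here from the exact Jaccard index using the published formula D = -ln(2j / (1 + j)) / k, whereas Mash itself estimates j from MinHash sketches.

```python
import math
from collections import Counter


def kmer_counts(seq, k):
    """Return a Counter of all overlapping k-mers in seq."""
    seq = seq.upper()
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


def jaccard_distance(a, b):
    """1 - |A & B| / |A | B| on k-mer presence/absence (assumed variant)."""
    sa, sb = set(a), set(b)
    union = sa | sb
    if not union:
        return 0.0
    return 1.0 - len(sa & sb) / len(union)


def cosine_distance(a, b):
    """1 - cosine similarity of the two k-mer count vectors."""
    dot = sum(a[x] * b[x] for x in set(a) | set(b))
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 1.0
    return 1.0 - dot / (norm_a * norm_b)


def mash_style_distance(a, b, k):
    """Mash formula applied to the exact Jaccard index j (no MinHash sketching)."""
    j = 1.0 - jaccard_distance(a, b)
    if j == 0:
        return 1.0
    return -math.log(2 * j / (1 + j)) / k


if __name__ == "__main__":
    k = 3
    x = kmer_counts("ACGTACGTGA", k)
    y = kmer_counts("ACGTTCGTGA", k)
    print("jaccard", jaccard_distance(x, y))
    print("cosine", cosine_distance(x, y))
    print("mash-style", mash_style_distance(x, y, k))
```

Presence/absence measures such as Jaccard ignore how often a k-mer occurs, while count-based measures such as cosine or Euclidean are sensitive to k-mer multiplicity, so the choice of option depends on whether abundance should influence the comparison.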

Example of choosing distance option:

kitsune dmatrix genome1.fna genome2.fna -k 17 -d jaccard --canonical --fast -o output.txt
kitsune dmatrix genome1.fna genome2.fna -k 17 -d jensenshannon --canonical --fast -o output.txt
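
The --canonical flag is not explained on this page. Assuming it has the usual meaning from k-mer counters such as Jellyfish, a k-mer and its reverse complement are counted as the same feature, so the result does not depend on which strand was sequenced. A minimal sketch of that convention, with hypothetical helper names:

```python
# Translation table mapping each base to its complement.
_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(kmer):
    """Return the reverse complement of an uppercase DNA k-mer."""
    return kmer.translate(_COMPLEMENT)[::-1]


def canonical(kmer):
    """Return the lexicographically smaller of a k-mer and its reverse complement.

    Assumption: this is the common convention, not confirmed KITSUNE behavior.
    """
    return min(kmer, reverse_complement(kmer))


if __name__ == "__main__":
    for kmer in ["TTTT", "GATTACA", "ACGT"]:
        print(kmer, "->", canonical(kmer))
```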

Find the optimum k-mer length for a given set of genomes

Kitsune provides a command to find the optimum k-mer length for a given set of genomes.

First, download the example files. Download

Then use the kitsune kopt command:

-i : path to list of genome files

-kl: the largest k-mer length to consider

-o: output file

Please be aware that this command uses substantial computational resources when the input contains a large number of genomes and/or large genomes.

kitsune kopt -i genome_list -kl 15 --canonical --fast -o output.txt

Download files

Files for kitsune, version 1.2.11
Filename                              Size     File type   Python version
kitsune-1.2.11-py2.py3-none-any.whl   3.0 MB   Wheel       py2.py3
kitsune-1.2.11.tar.gz                 3.0 MB   Source      None