A toolkit for evaluating the length of k-mer in a given genome dataset for alignment-free phylogenomic analysis
KITSUNE is a toolkit for evaluating the length of k-mer in a given genome dataset for alignment-free phylogenomic analysis.
The k-mer based approach is simple and fast, and is widely used in many applications, including biological sequence comparison. However, selecting an appropriate k-mer length that yields good information content for comparison is often overlooked. We therefore developed KITSUNE to aid k-mer length selection, based on the three-step approach described in Viral Phylogenomics Using an Alignment-Free Method: A Three-Step Approach to Determine Optimal Length of k-mer.
KITSUNE uses the Jellyfish software for k-mer counting. Thanks to the Jellyfish developers.
KITSUNE calculates three metrics across the considered k-mer range:
Cumulative Relative Entropy (CRE)
Average number of Common Features (ACF)
Observed Common Features (OCF)
Moreover, KITSUNE provides various genomic distance calculations from the k-mer frequency vectors, which can be used for species identification or phylogenomic tree construction.
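As a rough illustration of the raw material these metrics work with, the sketch below counts k-mers in plain Python and reports how many distinct k-mers two sequences share across a small k range. It is not KITSUNE's implementation (KITSUNE delegates counting to Jellyfish), the function names and toy sequences are hypothetical, and the exact definitions of CRE, ACF, and OCF are those of the cited paper rather than anything shown here.

```python
from collections import Counter

_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of an upper-case DNA string."""
    return seq.translate(_COMPLEMENT)[::-1]


def count_kmers(seq, k, canonical=False):
    """Count k-mers in seq, skipping windows with non-ACGT characters.

    With canonical=True, a k-mer and its reverse complement are counted
    under whichever of the two sorts first (an assumption about what
    "canonical" means, following common usage).
    """
    seq = seq.upper()
    counts = Counter()
    for i in range(len(seq) - k + 1):
        kmer = seq[i:i + k]
        if set(kmer) - set("ACGT"):
            continue
        if canonical:
            kmer = min(kmer, reverse_complement(kmer))
        counts[kmer] += 1
    return counts


def shared_kmers(counts_a, counts_b):
    """Return the set of distinct k-mers present in both count tables."""
    return set(counts_a) & set(counts_b)


# Toy sequences, made up for illustration only.
genome_a = "ACGTACGTGA"
genome_b = "TTACGTGACC"
for k in range(3, 6):
    a = count_kmers(genome_a, k)
    b = count_kmers(genome_b, k)
    print(k, len(shared_kmers(a, b)))
```

Short k-mers tend to be shared by almost any pair of genomes, while long k-mers are rarely shared at all, which is why the choice of k matters.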
If you use KITSUNE in your research, please cite: Reference
Installation
Clone the repository and install it through pip
pip install kitsune
Usage
Calculate CRE, ACF, and OCF values for specific k-mer lengths
Kitsune provides three commands to find an appropriate k-mer length using CRE, ACF, and OCF.
kitsune cre genome_fasta/* -ks 5 -ke 10
kitsune acf genome_fasta/* -ks 5 -ke 10
kitsune ocf genome_fasta/* -ks 5 -ke 10
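To give a feel for what an entropy-based score over a k range measures, here is a minimal sketch of a relative entropy computed at a single k. It assumes the formulation used in the feature frequency profile literature, where the expected frequency of a k-mer is estimated from its (k-1)-mer prefix and suffix and its (k-2)-mer core. Whether `kitsune cre` computes its values in exactly this way is not stated here, and all function names are hypothetical.

```python
import math
from collections import Counter


def kmer_frequencies(seq, k):
    """Return relative frequencies of all k-mers in seq."""
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    total = sum(counts.values())
    return {kmer: n / total for kmer, n in counts.items()}


def relative_entropy(seq, k):
    """Relative entropy between observed and expected k-mer frequencies.

    Assumed model: expected(w) = f(prefix) * f(suffix) / f(core), where
    prefix and suffix are the (k-1)-mers of w and core is its inner (k-2)-mer.
    """
    if k < 3:
        raise ValueError("this sketch needs k >= 3")
    f_k = kmer_frequencies(seq, k)
    f_k1 = kmer_frequencies(seq, k - 1)
    f_k2 = kmer_frequencies(seq, k - 2)
    total = 0.0
    for kmer, observed in f_k.items():
        expected = f_k1[kmer[:-1]] * f_k1[kmer[1:]] / f_k2[kmer[1:-1]]
        total += observed * math.log2(observed / expected)
    return total


# Toy sequence, made up for illustration only.
sequence = "ACGTACGTTGCAACGGTCA"
for k in range(3, 7):
    print(k, relative_entropy(sequence, k))
```

A cumulative score would then sum such values over a range of k. The exact summation limits used by KITSUNE are defined in the cited paper.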
Calculate genomic distance at a specific k-mer length from the k-mer frequency vectors of two genomes
Kitsune provides a command to calculate genomic distance using different distance estimation methods.
distance option | name |
---|---|
braycurtis | Bray-Curtis distance |
canberra | Canberra distance |
chebyshev | Chebyshev distance |
cityblock | City Block (Manhattan) distance |
correlation | Correlation distance |
cosine | Cosine distance |
euclidean | Euclidean distance |
jensenshannon | Jensen-Shannon distance |
sqeuclidean | Squared Euclidean distance |
dice | Dice dissimilarity |
hamming | Hamming distance |
jaccard | Jaccard-Needham dissimilarity |
kulsinski | Kulsinski dissimilarity |
rogerstanimoto | Rogers-Tanimoto dissimilarity |
russellrao | Russell-Rao dissimilarity |
sokalmichener | Sokal-Michener dissimilarity |
sokalsneath | Sokal-Sneath dissimilarity |
yule | Yule dissimilarity |
mash | MASH distance |
jsmash | MASH Jensen-Shannon distance |
jaccarddistp | Jaccard-Needham dissimilarity Probability |
kitsune dmatrix genome1.fna genome2.fna -k 17 -d jaccard --canonical --fast -o output.txt
kitsune dmatrix genome1.fna genome2.fna -k 17 -d jensenshannon --canonical --fast -o output.txt
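For intuition about what two of these distances measure, the following sketch computes a Jaccard dissimilarity on k-mer presence/absence and a Jensen-Shannon distance on k-mer frequencies, using only the Python standard library. It is an illustration under stated assumptions rather than KITSUNE's code: the function names are hypothetical, base-2 logarithms are a choice made for this sketch, and KITSUNE's output for the same inputs may differ (for example because of canonical counting or a different logarithm base).

```python
import math
from collections import Counter


def kmer_counts(seq, k):
    """Count every k-mer in seq (no canonical handling in this sketch)."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


def jaccard_distance(counts_a, counts_b):
    """1 - |A and B| / |A or B| over the sets of distinct k-mers."""
    a, b = set(counts_a), set(counts_b)
    union = a | b
    if not union:
        return 0.0
    return 1.0 - len(a & b) / len(union)


def jensen_shannon_distance(counts_a, counts_b):
    """Square root of the Jensen-Shannon divergence (base-2 logs).

    Assumes both count tables are non-empty.
    """
    keys = sorted(set(counts_a) | set(counts_b))
    total_a = sum(counts_a.values())
    total_b = sum(counts_b.values())
    p = [counts_a.get(key, 0) / total_a for key in keys]
    q = [counts_b.get(key, 0) / total_b for key in keys]
    m = [(x + y) / 2 for x, y in zip(p, q)]

    def kl(u, v):
        # Kullback-Leibler divergence; terms with u == 0 contribute nothing.
        return sum(x * math.log2(x / y) for x, y in zip(u, v) if x > 0)

    divergence = (kl(p, m) + kl(q, m)) / 2
    return math.sqrt(max(divergence, 0.0))


# Toy sequences, made up for illustration only.
a = kmer_counts("ACGTACGTGA", 3)
b = kmer_counts("TTACGTGACC", 3)
print(jaccard_distance(a, b), jensen_shannon_distance(a, b))
```

Both functions return 0 for identical k-mer profiles and 1 for profiles with no k-mer in common, so the values are directly comparable across genome pairs at a fixed k.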
Find the optimum k-mer length for a given set of genomes
Kitsune provides a command to find the optimum k-mer length for a given set of genomes.
First, download the example files. Download
Then use the kitsune kopt command
-i: path to a list of genome files
-ks: the smallest k-mer length to consider
-kl: the largest k-mer length to consider
-o: output file
**Please be aware that this command requires substantial computational resources when the input contains a large number of genomes and/or large genomes.**
kitsune kopt -i genome_list -ks 7 -kl 15 --canonical --fast -o output.txt