a toolkit for evaluation of the lenght of k-mer in a given genome dataset for alignment-free phylogenimic analysis
Project description
KITSUNE is a toolkit for evaluation of the length of k-mer in a given genome dataset for alignment-free phylogenimic analysis.
K-mer based approach is simple and fast yet has been widely used in many applications including biological sequence comparison. However, selection of an appropriate k-mer length to obtain a good information content for comparison is normally overlooked. The optimum k-mer length is a prerequsite to obtain biological meaningful genomic distance for assesment of phylogenetic relationships. Therefore, we have developed KITSUNE to aid k-mer length selection process in a systematic way, based on a three-steps aproach described in Viral Phylogenomics Using an Alignment-Free Method: A Three-Step Approach to Determine Optimal Length of k-mer.
KITSUNE uses Jellyfish software for k-mer counting. Thanks to Jellyfish developer. Citation
KITSUNE will calculte the three matrices across considered k-mer range:
Cumulative Relative Entropy (CRE)
Averrage number of Common Feature (ACF)
Obserbed Common Feature (OCF)
Moreover, KITSUNE also provides various genomic distance calculations from the k-mer frequency vectors that can be used for species identification or phylogenomic tree construction.
If you use KITSUNE in your research, please cite: KITSUNE: A Tool for Identifying Optimal K-mer Length for Alignment-free Phylogenomic Analysis Reference
Installation
Kitsune is developed under python version 3 environment. We recommend users use python >= v3.5.
Requirement packages:
biopython >= 1.68, scipy >= 0.18.1, numpy >= 1.1.0, tqdm >= 4.32
pip
pip install kitsune
Clone from github
git clone https://github.com/natapol/kitsune
cd kitsune/
python nstall setup.py
Usage
Overview of kitsune
command for listing help
$ kitsune --help
usage: kitsune <command> [<args>]
Commands can be:
cre <filename> Compute cumulative relative entropy.
acf <filenames> Compute average number of common feature between signatures.
ofc <filenames> Compute observed feature frequencies.
kopt <filenames> Compute recommended choice (optimal) of kmer within a given kmer interval for a set of genomes using the cre, acf and ofc.
dmatrix <filenames> Compute distance matrix.
Calculate CRE, ACF, and OFC value for specific kmer
Kitsune provides three commands to calculate an appropiate k-mer using CRE, ACF, and OCF:
Calculate CRE
$ kitsune cre -h
usage: kitsune [-h] [--fast] [--canonical] -ke KEND [-kf KFROM] [-t THREAD]
[-o OUTPUT]
filename
Calculate k-mer from cumulative relative entropy of all genomes
positional arguments:
filename a genome file in fasta format
optional arguments:
-h, --help show this help message and exit
--fast Jellyfish one-pass calculation (faster)
--canonical Jellyfish count only canonical mer
-ke KEND, --kend KEND
last k-mer
-kf KFROM, --kfrom KFROM
Calculate from k-mer
-t THREAD, --thread THREAD
-o OUTPUT, --output OUTPUT
output filename
Calculate ACF
$ kitsune acf -h
usage: kitsune [-h] [--fast] [--canonical] -k KMERS [KMERS ...] [-t THREAD]
[-o OUTPUT]
filenames [filenames ...]
Calculate average number of common feature
positional arguments:
filenames genome files in fasta format
optional arguments:
-h, --help show this help message and exit
--fast Jellyfish one-pass calculation (faster)
--canonical Jellyfish count only canonical mer
-k KMERS [KMERS ...], --kmers KMERS [KMERS ...]
have to state before
-t THREAD, --thread THREAD
-o OUTPUT, --output OUTPUT
output filename
Calculate OFC
$ kitsune ofc -h
usage: kitsune [-h] [--fast] [--canonical] -k KMERS [KMERS ...] [-t THREAD]
[-o OUTPUT]
filenames [filenames ...]
Calculate observe feature occurrence
positional arguments:
filenames genome files in fasta format
optional arguments:
-h, --help show this help message and exit
--fast Jellyfish one-pass calculation (faster)
--canonical Jellyfish count only canonical mer
-k KMERS [KMERS ...], --kmers KMERS [KMERS ...]
-t THREAD, --thread THREAD
-o OUTPUT, --output OUTPUT
output filename
General Example
kitsune cre genome1.fna -kf 5 -ke 10
kitsune acf genome1.fna genome2.fna -k 5
kitsune ofc genome_fasta/* -k 5
Calculate genomic distance at specific k-mer from kmer frequency vectors of two of genomes
Kitsune provides a commands to calculate genomic distance using different distance estimation method. Users can assess the impact of a selected k-mer length on the genomic distnace of choice below.
distance option |
name |
---|---|
braycurtis |
Bray-Curtis distance |
canberra |
Canberra distance |
chebyshev |
Chebyshev distance |
cityblock |
City Block (Manhattan) distance |
correlation |
Correlation distance |
cosine |
Cosine distance |
euclidean |
Euclidean distance |
jensenshannon |
Jensen-Shannon distance |
sqeuclidean |
Squared Euclidean distance |
dice |
Dice dissimilarity |
hamming |
Hamming distance |
jaccard |
Jaccard-Needham dissimilarity |
kulsinski |
Kulsinski dissimilarity |
rogerstanimoto |
Rogers-Tanimoto dissimilarity |
russellrao |
Russell-Rao dissimilarity |
sokalmichener |
Sokal-Michener dissimilarity |
sokalsneath |
Sokal-Sneath dissimilarity |
yule |
Yule dissimilarity |
mash |
MASH distance |
jsmash |
MASH Jensen-Shannon distance |
jaccarddistp |
Jaccard-Needham dissimilarity Probability |
euclidean_of_frequency |
Euclidean distance of Frequency |
Kitsune provides a choice of distance transformation proposed by Fan et.al.
Calculate a distance matrix
$ kitsune dmatrix -h
usage: kitsune [-h] [--fast] [--canonical] -k KMER [-i INPUT] [-o OUTPUT]
[-t THREAD] [--transformed] [-d DISTANCE] [-f FORMAT]
[filenames [filenames ...]]
Calculate a distance matrix
positional arguments:
filenames genome files in fasta format
optional arguments:
-h, --help show this help message and exit
--fast Jellyfish one-pass calculation (faster)
--canonical Jellyfish count only canonical mer
-k KMER, --kmer KMER
-i INPUT, --input INPUT
list of genome files in txt
-o OUTPUT, --output OUTPUT
output filename
-t THREAD, --thread THREAD
--transformed
-d DISTANCE, --distance DISTANCE
braycurtis, canberra, jsmash, chebyshev, cityblock,
correlation, cosine (default), dice, euclidean,
hamming, jaccard, kulsinsk, matching, rogerstanimoto,
russellrao, sokalmichener, sokalsneath, sqeuclidean,
yule, mash, jaccarddistp
-f FORMAT, --format FORMAT
Example of choosing distance option:
kitsune dmatrix genome1.fna genome2.fna -k 11 -d jaccard --canonical --fast -o output.txt
kitsune dmatrix genome1.fna genome2.fna -k 11 -d hensenshannon --canonical --fast -o output.txt
Find optimum k-mer from a given set of genomes
Kitsune provides a wrap-up comand to find optimum k-mer length for a given set of genome within a given kmer interval.
$ kitsune kopt -h
usage: kitsune [-h] [--fast] [--canonical] -kl KLARGE [-o OUTPUT]
[--closely_related] [-x CRE_CUTOFF] [-y ACF_CUTOFF] [-t THREAD]
filenames
Example: kitsune kopt genomeList.txt -kl 15 --canonical --fast -t 4 -o out.txt
positional arguments:
filenames A file that list the path to all genomes(fasta format)
with extension as (.txt,.csv,.tab) or no extension
optional arguments:
-h, --help show this help message and exit
--fast Jellyfish one-pass calculation (faster)
--canonical Jellyfish count only canonical mer
-kl KLARGE, --klarge KLARGE
largest k-mer length to consider, note: the smallest
kmer length is 4
-o OUTPUT, --output OUTPUT
output filename
--closely_related For closely related set of genomes, use this option
-x CRE_CUTOFF, --cre_cutoff CRE_CUTOFF
cutoff to use in selecting kmers whose cre's are <=
(cutoff * max(cre)), Default = 10 percent, ie x=0.1
-y ACF_CUTOFF, --acf_cutoff ACF_CUTOFF
cutoff to use in selecting kmers whose acf's are >=
(cutoff * max(acf)), Default = 10 percent, ie y=0.1
-t THREAD, --thread THREAD
Number of threads (integer)
Example dataset
First download the example files. Download
kitsune kopt genomeList.txt -kl 15 --canonical --fast -t 4 -o out.txt
**Please be aware that this command will use big computational resources when large number of genomes and/or large genome size are used as the input.
Project details
Release history Release notifications | RSS feed
Download files
Download the file for your platform. If you're not sure which to choose, learn more about installing packages.
Source Distribution
Built Distribution
Hashes for kitsune-1.3.1-py2.py3-none-any.whl
Algorithm | Hash digest | |
---|---|---|
SHA256 | 92b0bc8413c07dbfaf02def52cc4032c90adc37dcba2734c81a2d7631a3e51fc |
|
MD5 | d8027b9e7d11b9c4e4368d2c865e0e9c |
|
BLAKE2b-256 | aa3eadf17a05b427b44a74d45fb48389209751d59219ed24929f7a6b0a3ada0b |