A toolkit for evaluating k-mer length in a given genome dataset for alignment-free phylogenomic analysis
KITSUNE: K-mer-length Iterative Selection for UNbiased Ecophylogenomics
KITSUNE is a toolkit for evaluating k-mer length in a given genome dataset for alignment-free phylogenomic analysis.
K-mer based approaches are simple and fast, and are widely used in many applications, including biological sequence comparison. However, selecting a k-mer length that yields good information content for comparison is often overlooked. An optimal k-mer length is a prerequisite for obtaining biologically meaningful genomic distances for the assessment of phylogenetic relationships. We therefore developed KITSUNE to aid the k-mer length selection process in a systematic way, based on the three-step approach described in Viral Phylogenomics Using an Alignment-Free Method: A Three-Step Approach to Determine Optimal Length of k-mer.
KITSUNE calculates three metrics across the considered k-mer range:
- Cumulative Relative Entropy (CRE)
- Average number of Common Features (ACF)
- Observed Common Features (OCF)
Moreover, KITSUNE also provides various genomic distance calculations from the k-mer frequency vectors that can be used for species identification or phylogenomic tree construction.
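To make the idea of "common features" concrete, here is a minimal sketch in plain Python. It is not KITSUNE's implementation (KITSUNE counts k-mers with Jellyfish); the helper names `kmer_counts` and `average_common_features` are hypothetical, and the sketch assumes that a "feature" is a distinct k-mer and that the per-genome value is the mean number of k-mers shared with each other genome.

```python
from collections import Counter


def kmer_counts(sequence, k):
    """Count every overlapping k-mer in a sequence (illustrative helper)."""
    sequence = sequence.upper()
    return Counter(sequence[i:i + k] for i in range(len(sequence) - k + 1))


def average_common_features(genomes, k):
    """For each genome, average the number of distinct k-mers shared with each other genome.

    Assumption: this mirrors the description of ACF in the text, not KITSUNE's exact code.
    """
    if len(genomes) < 2:
        raise ValueError("at least two genomes are needed")
    features = {name: set(kmer_counts(seq, k)) for name, seq in genomes.items()}
    acf = {}
    for name, own in features.items():
        others = [f for other, f in features.items() if other != name]
        acf[name] = sum(len(own & f) for f in others) / len(others)
    return acf


# Toy sequences, made up purely for illustration.
genomes = {"g1": "ACGTACGTAC", "g2": "ACGTTTGTAC", "g3": "GGGGACGTGG"}
print(average_common_features(genomes, 3))
```

As k grows, fewer k-mers are shared between genomes, which is why such a value is tracked across a range of k.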
Note: If you use KITSUNE in your research, please cite: KITSUNE: A Tool for Identifying Empirically Optimal K-mer Length for Alignment-free Phylogenomic Analysis
Installation
Kitsune is developed for Python 3. We recommend Python >= 3.5.
Required packages: scipy >= 0.18.1, numpy >= 1.1.0, tqdm >= 4.32
Kitsune also requires Jellyfish for k-mer counting as an external software dependency, so you need to install it before running the tool: https://github.com/gmarcais/Jellyfish
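Before installing, it can help to check which of the requirements above are already available. The following sketch uses only the Python standard library; the function name `missing_dependencies` is hypothetical, and it assumes the Jellyfish executable is called `jellyfish` and is expected on your `PATH`. It does not check version numbers.

```python
import importlib.util
import shutil


def missing_dependencies(modules=("scipy", "numpy", "tqdm"), executable="jellyfish"):
    """Return the names of Python modules and executables that cannot be found."""
    # find_spec returns None when a top-level module is not importable.
    missing = [m for m in modules if importlib.util.find_spec(m) is None]
    # shutil.which returns None when the executable is not on PATH.
    if shutil.which(executable) is None:
        missing.append(executable)
    return missing


print(missing_dependencies() or "all dependencies found")
```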
Install with pip
pip install kitsune
Install from source
# Clone the GitHub repository
git clone https://github.com/natapol/kitsune
# Move to the kitsune folder
cd kitsune/
# Install
python setup.py install
Usage
Overview of kitsune
Command for listing help:
$ kitsune --help
usage: kitsune <command> [<args>]
Available commands:
acf Compute average number of common features between signatures
cre Compute cumulative relative entropy
dmatrix Compute distance matrix
kopt Compute recommended choice (optimal) of kmer within a given kmer interval for a set of genomes using the cre, acf and ofc
ofc Compute observed feature frequencies
Use --help in conjunction with one of the commands above for a list of available options (e.g. kitsune acf --help)
Calculate CRE, ACF, and OFC values for a specific k-mer
Kitsune provides three commands to calculate an appropriate k-mer length using CRE, ACF, and OCF:
Calculate CRE
$ kitsune cre -h
usage: kitsune (cre) [-h] --filename FILENAME [--fast] [--canonical] -ke KEND
[-kf KFROM] [-t THREAD] [-o OUTPUT]
Calculate k-mer from cumulative relative entropy of all genomes
optional arguments:
-h, --help show this help message and exit
--filename FILENAME A genome file in fasta format (default: None)
--fast Jellyfish one-pass calculation (faster) (default:
False)
--canonical Jellyfish count only canonical mer (default: False)
-ke KEND, --kend KEND
Last k-mer (default: None)
-kf KFROM, --kfrom KFROM
Calculate from k-mer (default: 4)
-t THREAD, --thread THREAD
-o OUTPUT, --output OUTPUT
Output filename (default: None)
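The `--canonical` option passes canonical counting on to Jellyfish. As a rough illustration, and assuming the usual definition in which a k-mer and its reverse complement are represented by whichever of the two sorts first lexicographically, a canonical k-mer can be sketched as follows. The helper names are hypothetical and only the unambiguous bases A, C, G, and T are handled.

```python
# Translation table mapping each base to its complement.
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(kmer):
    """Complement every base, then reverse the string."""
    return kmer.translate(COMPLEMENT)[::-1]


def canonical(kmer):
    """Return the lexicographically smaller of a k-mer and its reverse complement."""
    return min(kmer, reverse_complement(kmer))


print(canonical("TTGCA"))
```

With canonical counting, a k-mer read from either strand contributes to the same count, which matters when the strand orientation of the input sequences is arbitrary.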
Calculate ACF
$ kitsune acf -h
usage: kitsune (acf) [-h] --filenames FILENAMES [FILENAMES ...] [--fast]
[--canonical] -k KMERS [KMERS ...] [-t THREAD]
[-o OUTPUT]
Calculate an average number of common features pairwise between one genome
against others
optional arguments:
-h, --help show this help message and exit
--filenames FILENAMES [FILENAMES ...]
Genome files in fasta format (default: None)
--fast Jellyfish one-pass calculation (faster) (default:
False)
--canonical Jellyfish count only canonical mer (default: False)
-k KMERS [KMERS ...], --kmers KMERS [KMERS ...]
Have to state before (default: None)
-t THREAD, --thread THREAD
-o OUTPUT, --output OUTPUT
Output filename (default: None)
Calculate OFC
$ kitsune ofc -h
usage: kitsune (ofc) [-h] --filenames FILENAMES [FILENAMES ...] [--fast]
[--canonical] -k KMERS [KMERS ...] [-t THREAD]
[-o OUTPUT]
Calculate an observe feature frequency
optional arguments:
-h, --help show this help message and exit
--filenames FILENAMES [FILENAMES ...]
Genome files in fasta format (default: None)
--fast Jellyfish one-pass calculation (faster) (default:
False)
--canonical Jellyfish count only canonical mer (default: False)
-k KMERS [KMERS ...], --kmers KMERS [KMERS ...]
-t THREAD, --thread THREAD
-o OUTPUT, --output OUTPUT
Output filename (default: None)
General Example
kitsune cre --filename genome1.fna -kf 5 -ke 10
kitsune acf --filenames genome1.fna genome2.fna -k 5
kitsune ofc --filenames genome_fasta/* -k 5
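When the same command has to be run for many genomes, it can be convenient to build the command line from Python. The sketch below only assembles the argument list for `kitsune cre`, using flags that appear in the help output above; `cre_command` is a hypothetical helper, not part of KITSUNE. Actually running the command (for example with `subprocess.run(cmd, check=True)`) requires kitsune and Jellyfish to be installed.

```python
import shlex


def cre_command(filename, kfrom, kend, canonical=False, fast=False, output=None):
    """Build the argument list for `kitsune cre` (hypothetical convenience wrapper)."""
    if kfrom > kend:
        raise ValueError("kfrom must not exceed kend")
    cmd = ["kitsune", "cre", "--filename", filename, "-kf", str(kfrom), "-ke", str(kend)]
    if canonical:
        cmd.append("--canonical")
    if fast:
        cmd.append("--fast")
    if output is not None:
        cmd += ["-o", output]
    return cmd


cmd = cre_command("genome1.fna", 5, 10)
# shlex.join (Python 3.8+) renders the list as a shell-safe string for logging.
print(shlex.join(cmd))
```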
Calculate genomic distance at a specific k-mer length from the k-mer frequency vectors of two genomes
Kitsune provides a command to calculate genomic distance using different distance estimation methods. Users can assess the impact of a selected k-mer length on the genomic distance of their choice from the options below.
distance option | name |
---|---|
braycurtis | Bray-Curtis distance |
canberra | Canberra distance |
chebyshev | Chebyshev distance |
cityblock | City Block (Manhattan) distance |
correlation | Correlation distance |
cosine | Cosine distance |
euclidean | Euclidean distance |
jensenshannon | Jensen-Shannon distance |
sqeuclidean | Squared Euclidean distance |
dice | Dice dissimilarity |
hamming | Hamming distance |
jaccard | Jaccard-Needham dissimilarity |
kulsinski | Kulsinski dissimilarity |
rogerstanimoto | Rogers-Tanimoto dissimilarity |
russellrao | Russell-Rao dissimilarity |
sokalmichener | Sokal-Michener dissimilarity |
sokalsneath | Sokal-Sneath dissimilarity |
yule | Yule dissimilarity |
mash | MASH distance |
jsmash | MASH Jensen-Shannon distance |
jaccarddistp | Jaccard-Needham dissimilarity Probability |
euclidean_of_frequency | Euclidean distance of Frequency |
Kitsune provides a choice of distance transformation proposed by Fan et al.
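To illustrate what a distance between k-mer frequency vectors means, here is a small sketch in plain Python for three of the options in the table above. It is not KITSUNE's code: the helper names are hypothetical, the formulas are the standard textbook definitions, and treating Jaccard as a presence/absence measure is an assumption.

```python
import math
from collections import Counter


def kmer_frequencies(sequence, k):
    """Relative frequency of each overlapping k-mer in a sequence."""
    if len(sequence) < k:
        raise ValueError("sequence is shorter than k")
    counts = Counter(sequence[i:i + k] for i in range(len(sequence) - k + 1))
    total = sum(counts.values())
    return {kmer: n / total for kmer, n in counts.items()}


def aligned_vectors(freq_a, freq_b):
    """Put two frequency dicts on the same k-mer axis, filling gaps with 0."""
    keys = sorted(set(freq_a) | set(freq_b))
    return [freq_a.get(x, 0.0) for x in keys], [freq_b.get(x, 0.0) for x in keys]


def braycurtis(u, v):
    return sum(abs(a - b) for a, b in zip(u, v)) / sum(abs(a + b) for a, b in zip(u, v))


def euclidean(u, v):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def jaccard(u, v):
    """Jaccard dissimilarity on k-mer presence/absence (an assumption of this sketch)."""
    both = sum(1 for a, b in zip(u, v) if a > 0 and b > 0)
    either = sum(1 for a, b in zip(u, v) if a > 0 or b > 0)
    return 1 - both / either


# Toy sequences, made up purely for illustration.
u, v = aligned_vectors(kmer_frequencies("ACGTACGT", 2), kmer_frequencies("ACGTTTTT", 2))
print(braycurtis(u, v), euclidean(u, v), jaccard(u, v))
```

Recomputing such distances at several values of k shows how strongly the chosen k-mer length affects the result.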
Calculate a distance matrix
$ kitsune dmatrix -h
usage: kitsune (dmatrix) [-h] [--filenames [FILENAMES [FILENAMES ...]]]
[--fast] [--canonical] -k KMER [-i INPUT] [-o OUTPUT]
[-t THREAD] [--transformed]
[-d {braycurtis,canberra,jsmash,chebyshev,cityblock,correlation,cosine,dice,euclidean,hamming,jaccard,kulsinsk,matching,rogerstanimoto,russellrao,sokalmichener,sokalsneath,sqeuclidean,yule,mash,jaccarddistp}]
[-f FORMAT]
Calculate a distance matrix
optional arguments:
-h, --help show this help message and exit
--filenames [FILENAMES [FILENAMES ...]]
Genome files in fasta format (default: None)
--fast Jellyfish one-pass calculation (faster) (default:
False)
--canonical Jellyfish count only canonical mer (default: False)
-k KMER, --kmer KMER
-i INPUT, --input INPUT
List of genome files in txt (default: None)
-o OUTPUT, --output OUTPUT
Output filename (default: None)
-t THREAD, --thread THREAD
--transformed
-d {braycurtis,canberra,jsmash,chebyshev,cityblock,correlation,cosine,dice,euclidean,hamming,jaccard,kulsinsk,matching,rogerstanimoto,russellrao,sokalmichener,sokalsneath,sqeuclidean,yule,mash,jaccarddistp}, --distance {braycurtis,canberra,jsmash,chebyshev,cityblock,correlation,cosine,dice,euclidean,hamming,jaccard,kulsinsk,matching,rogerstanimoto,russellrao,sokalmichener,sokalsneath,sqeuclidean,yule,mash,jaccarddistp}
-f FORMAT, --format FORMAT
Example of choosing distance option:
kitsune dmatrix --filenames genome1.fna genome2.fna -k 11 -d jaccard --canonical --fast -o output.txt
kitsune dmatrix --filenames genome1.fna genome2.fna -k 11 -d jensenshannon --canonical --fast -o output.txt
Find optimum k-mer from a given set of genomes
Kitsune provides a wrap-up command to find the optimum k-mer length for a given set of genomes within a given k-mer interval.
$ kitsune kopt -h
usage: kitsune (kopt) [-h] [--acf-cutoff ACF_CUTOFF] [--canonical]
[--closely-related] [--cre-cutoff CRE_CUTOFF] [--fast]
--filenames FILENAMES [--hashsize HASHSIZE]
[--in-memory] [--k-min K_MIN] --k-max K_MAX
[--lower LOWER] [--nproc NPROC] [--output OUTPUT]
[--threads THREADS]
Optimal kmer size selection for a set of genomes using Average number of
Common Features (ACF), Cumulative Relative Entropy (CRE), and Observed Common
Features (OCF). Example: kitsune kopt --filenames genomeList.txt --k-min 4
--k-max 12 --canonical --fast
optional arguments:
-h, --help show this help message and exit
--acf-cutoff ACF_CUTOFF
Cutoff to use in selecting kmers whose ACFs are >=
(cutoff * max(ACF)) (default: 0.1)
--canonical Jellyfish count only canonical kmers (default: False)
--closely-related Use in case of closely related genomes (default:
False)
--cre-cutoff CRE_CUTOFF
Cutoff to use in selecting kmers whose CREs are <=
(cutoff * max(CRE)) (default: 0.1)
--fast Jellyfish one-pass calculation (faster) (default:
False)
--filenames FILENAMES
Path to the file with the list of genome files paths.
There should be at list 2 input genomes (default:
None)
--hashsize HASHSIZE Jellyfish initial hash size (default: 100M)
--in-memory Keep Jellyfish counts in memory (default: False)
--k-min K_MIN Minimum kmer size (default: 4)
--k-max K_MAX Maximum kmer size (default: None)
--lower LOWER Do not let Jellyfish output kmers with count < --lower
(default: 1)
--nproc NPROC Maximum number of CPUs to make it parallel (default:
1)
--output OUTPUT Path to the output file (default: None)
--threads THREADS Maximum number of threads for Jellyfish (default: 1)
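The two cutoff options above describe threshold rules: keep k-mer lengths whose CRE is at most `cutoff * max(CRE)` and whose ACF is at least `cutoff * max(ACF)`. The sketch below applies both rules to per-k values. It is only an illustration: the function name is hypothetical, the input numbers are invented, and combining the two rules by intersection is an assumption rather than a description of how `kopt` arrives at its final recommendation.

```python
def kmers_passing_cutoffs(cre, acf, cre_cutoff=0.1, acf_cutoff=0.1):
    """Return the k values that satisfy both threshold rules from the help text.

    cre and acf map each k to its CRE or ACF value.
    """
    max_cre = max(cre.values())
    max_acf = max(acf.values())
    return sorted(
        k
        for k in cre.keys() & acf.keys()
        if cre[k] <= cre_cutoff * max_cre and acf[k] >= acf_cutoff * max_acf
    )


# Invented values for illustration only; real values come from the cre and acf commands.
cre = {4: 2.0, 5: 1.0, 6: 0.15, 7: 0.05, 8: 0.01}
acf = {4: 256.0, 5: 900.0, 6: 400.0, 7: 120.0, 8: 30.0}
print(kmers_passing_cutoffs(cre, acf))
```

The intuition is that CRE falls as k grows (little information is gained from longer k-mers) while ACF also falls (genomes share fewer features), so acceptable lengths sit where CRE is already low but ACF is not yet negligible.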
Example dataset
First download the example files, then run:
kitsune kopt --filenames genome_list --k-min 6 --k-max 21 --canonical --fast --threads 4 --nproc 2 --output out.txt
:warning: Please be aware that this command can require substantial computational resources when the input contains a large number of genomes and/or large genomes.
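The `--filenames` option of `kopt` expects a file that lists the genome file paths. A minimal sketch for generating such a file from a directory of FASTA files is shown below. The helper name `write_genome_list` and the accepted suffixes are hypothetical, and the sketch assumes a one-path-per-line layout, which the help text does not spell out.

```python
import tempfile
from pathlib import Path


def write_genome_list(genome_dir, list_path, suffixes=(".fna", ".fa", ".fasta")):
    """Write the paths of all FASTA files in genome_dir to list_path, one per line."""
    paths = sorted(
        p for p in Path(genome_dir).iterdir() if p.is_file() and p.suffix in suffixes
    )
    if len(paths) < 2:
        raise ValueError("kopt needs at least 2 input genomes")
    Path(list_path).write_text("\n".join(str(p) for p in paths) + "\n")
    return paths


# Demonstration in a temporary directory with placeholder files.
with tempfile.TemporaryDirectory() as tmp:
    for name in ("a.fna", "b.fna", "notes.txt"):
        Path(tmp, name).write_text(">seq\nACGT\n")
    listed = write_genome_list(tmp, Path(tmp, "genome_list"))
    names = [p.name for p in listed]
    content = Path(tmp, "genome_list").read_text()

print(names)
```

The resulting file can then be passed to `kitsune kopt --filenames genome_list` together with the other options shown in the example command.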