kitsune

a toolkit for evaluation of the length of k-mer in a given genome dataset for alignment-free phylogenomic analysis

KITSUNE is a toolkit for evaluation of the length of k-mer in a given genome dataset for alignment-free phylogenomic analysis.

The k-mer based approach is simple and fast, and has been widely used in many applications, including biological sequence comparison. However, selecting an appropriate k-mer length to obtain good information content for comparison is often overlooked. An optimal k-mer length is a prerequisite for obtaining biologically meaningful genomic distances for the assessment of phylogenetic relationships. We therefore developed KITSUNE to aid k-mer length selection in a systematic way, based on the three-step approach described in "Viral Phylogenomics Using an Alignment-Free Method: A Three-Step Approach to Determine Optimal Length of k-mer".

KITSUNE uses the Jellyfish software for k-mer counting. Thanks to the Jellyfish developers. Citation

KITSUNE calculates three metrics across the considered k-mer range:

Cumulative Relative Entropy (CRE)
Average number of Common Features (ACF)
Observed Feature Occurrence (OFC)
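
To make the feature-counting idea concrete, here is a minimal sketch in plain Python. It is not KITSUNE's implementation (KITSUNE delegates counting to Jellyfish). The function names, the toy sequences, and the choice to average shared k-mers over all pairs of sequences are assumptions made for illustration only.

```python
from collections import Counter
from itertools import combinations


def count_kmers(sequence, k):
    """Count every overlapping k-mer in a DNA string."""
    sequence = sequence.upper()
    return Counter(sequence[i:i + k] for i in range(len(sequence) - k + 1))


def average_common_features(sequences, k):
    """Mean number of distinct k-mers shared by each pair of sequences.

    Assumption: a simple pairwise average; the tool's own definition may differ.
    """
    kmer_sets = [set(count_kmers(seq, k)) for seq in sequences]
    pairs = list(combinations(kmer_sets, 2))
    return sum(len(a & b) for a, b in pairs) / len(pairs)


def observed_features(sequences, k):
    """Number of distinct k-mers seen in at least one sequence."""
    seen = set()
    for seq in sequences:
        seen.update(count_kmers(seq, k))
    return len(seen)


# Toy sequences, made up for illustration only.
genomes = ["ACGTACGTGA", "ACGTTCGTGA", "TTGTACGAGA"]
for k in (2, 3, 4):
    print(k, average_common_features(genomes, k), observed_features(genomes, k))
```

As k grows, fewer k-mers tend to be shared between sequences, which is why the number of common features is informative when choosing a k-mer length.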

Moreover, KITSUNE also provides various genomic distance calculations from the k-mer frequency vectors that can be used for species identification or phylogenomic tree construction.

If you use KITSUNE in your research, please cite: KITSUNE: A Tool for Identifying Optimal K-mer Length for Alignment-free Phylogenomic Analysis Reference

Installation

Kitsune is developed for a Python 3 environment. We recommend Python >= 3.5.

Required packages:

biopython >= 1.68, scipy >= 0.18.1, numpy >= 1.1.0, tqdm >= 4.32

pip

pip install kitsune

Clone from github

git clone https://github.com/natapol/kitsune
cd kitsune/
python setup.py install

Usage

Overview of kitsune

command for listing help

$ kitsune --help

usage: kitsune <command> [<args>]

Commands can be:
cre <filename>                    Compute cumulative relative entropy.
acf <filenames>                   Compute average number of common feature between signatures.
ofc <filenames>                   Compute observed feature frequencies.
kopt <filenames>                  Compute recommended choice (optimal) of kmer within a given kmer interval for a set of genomes using the cre, acf and ofc.
dmatrix <filenames>               Compute distance matrix.

Calculate CRE, ACF, and OFC value for specific kmer

Kitsune provides three commands to help find an appropriate k-mer length using CRE, ACF, and OFC:

Calculate CRE

$ kitsune cre -h
usage: kitsune [-h] [--fast] [--canonical] -ke KEND [-kf KFROM] [-t THREAD]
               [-o OUTPUT]
               filename

Calculate k-mer from cumulative relative entropy of all genomes

positional arguments:
  filename              a genome file in fasta format

optional arguments:
  -h, --help            show this help message and exit
  --fast                Jellyfish one-pass calculation (faster)
  --canonical           Jellyfish count only canonical mer
  -ke KEND, --kend KEND
                        last k-mer
  -kf KFROM, --kfrom KFROM
                        Calculate from k-mer
  -t THREAD, --thread THREAD
  -o OUTPUT, --output OUTPUT
                        output filename
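
For readers who want intuition for what a cumulative relative entropy measures, the following is a rough sketch in plain Python. It assumes one common formulation from the alignment-free literature: the observed frequency of each k-mer is compared with a frequency predicted from its (k-1)-mer prefix and suffix and its (k-2)-mer core, and the per-k relative entropies are summed from k up to the last k considered. The function names are hypothetical, and KITSUNE's exact computation (including any off-by-one convention in the sum) may differ.

```python
import math
from collections import Counter


def kmer_frequencies(sequence, k):
    """Relative frequency of each overlapping k-mer in the sequence."""
    counts = Counter(sequence[i:i + k] for i in range(len(sequence) - k + 1))
    total = sum(counts.values())
    return {kmer: n / total for kmer, n in counts.items()}


def relative_entropy(sequence, k):
    """Relative entropy (in bits) of observed k-mer frequencies against
    frequencies predicted from shorter k-mers. Requires k >= 3."""
    if k < 3:
        raise ValueError("k must be at least 3")
    f_k = kmer_frequencies(sequence, k)
    f_k1 = kmer_frequencies(sequence, k - 1)
    f_k2 = kmer_frequencies(sequence, k - 2)
    entropy = 0.0
    for kmer, observed in f_k.items():
        # Predicted frequency: prefix * suffix / shared core.
        expected = f_k1[kmer[:-1]] * f_k1[kmer[1:]] / f_k2[kmer[1:-1]]
        entropy += observed * math.log2(observed / expected)
    return entropy


def cumulative_relative_entropy(sequence, k_from, k_end):
    """For each k in [k_from, k_end], sum the relative entropy from k to k_end."""
    cre = {}
    running = 0.0
    for k in range(k_end, k_from - 1, -1):
        running += relative_entropy(sequence, k)
        cre[k] = running
    return cre


# Toy sequence, made up for illustration only.
toy = "ACGTACGTTAGCATCGATCGATTACG"
for k, value in sorted(cumulative_relative_entropy(toy, 3, 6).items()):
    print(k, round(value, 4))
```

The two range arguments play the same role as `-kf` and `-ke` in the help output above: the first k to report and the last k included in the sum.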

Calculate ACF

$ kitsune acf -h
usage: kitsune [-h] [--fast] [--canonical] -k KMERS [KMERS ...] [-t THREAD]
               [-o OUTPUT]
               filenames [filenames ...]

Calculate average number of common feature

positional arguments:
  filenames             genome files in fasta format

optional arguments:
  -h, --help            show this help message and exit
  --fast                Jellyfish one-pass calculation (faster)
  --canonical           Jellyfish count only canonical mer
  -k KMERS [KMERS ...], --kmers KMERS [KMERS ...]
                        have to state before
  -t THREAD, --thread THREAD
  -o OUTPUT, --output OUTPUT
                        output filename

Calculate OFC

$ kitsune ofc -h
usage: kitsune [-h] [--fast] [--canonical] -k KMERS [KMERS ...] [-t THREAD]
               [-o OUTPUT]
               filenames [filenames ...]

Calculate observe feature occurrence

positional arguments:
  filenames             genome files in fasta format

optional arguments:
  -h, --help            show this help message and exit
  --fast                Jellyfish one-pass calculation (faster)
  --canonical           Jellyfish count only canonical mer
  -k KMERS [KMERS ...], --kmers KMERS [KMERS ...]
  -t THREAD, --thread THREAD
  -o OUTPUT, --output OUTPUT
                        output filename

General Example

kitsune cre genome1.fna -kf 5 -ke 10
kitsune acf genome1.fna genome2.fna -k 5
kitsune ofc genome_fasta/* -k 5
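
When these commands are driven from a Python script, the argument lists can be assembled programmatically. The sketch below uses only flags that appear in the help output above; the helper names are hypothetical, and the code only builds and prints the commands rather than running them.

```python
import shlex


def cre_command(genome, k_from, k_end, output=None):
    """Build the argument list for `kitsune cre` on one genome file."""
    cmd = ["kitsune", "cre", genome, "-kf", str(k_from), "-ke", str(k_end)]
    if output is not None:
        cmd += ["-o", output]
    return cmd


def multi_genome_command(subcommand, genomes, kmers, output=None):
    """Build the argument list for `kitsune acf` or `kitsune ofc`."""
    if subcommand not in ("acf", "ofc"):
        raise ValueError("subcommand must be 'acf' or 'ofc'")
    # Genome files come first, then -k with one or more k-mer lengths,
    # mirroring the order used in the examples above.
    cmd = ["kitsune", subcommand, *genomes, "-k", *(str(k) for k in kmers)]
    if output is not None:
        cmd += ["-o", output]
    return cmd


commands = [
    cre_command("genome1.fna", 5, 10),
    multi_genome_command("acf", ["genome1.fna", "genome2.fna"], [5]),
]
for cmd in commands:
    # To execute a command, pass the list to subprocess.run(cmd, check=True).
    print(shlex.join(cmd))
```

Keeping each command as a list avoids shell-quoting problems with file names that contain spaces.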

Calculate genomic distance at a specific k-mer length from the k-mer frequency vectors of two genomes

Kitsune provides a command to calculate genomic distance using different distance estimation methods. Users can assess the impact of a selected k-mer length on the genomic distance of their choice from the list below.

distance option	name
braycurtis	Bray-Curtis distance
canberra	Canberra distance
chebyshev	Chebyshev distance
cityblock	City Block (Manhattan) distance
correlation	Correlation distance
cosine	Cosine distance
euclidean	Euclidean distance
jensenshannon	Jensen-Shannon distance
sqeuclidean	Squared Euclidean distance
dice	Dice dissimilarity
hamming	Hamming distance
jaccard	Jaccard-Needham dissimilarity
kulsinski	Kulsinski dissimilarity
rogerstanimoto	Rogers-Tanimoto dissimilarity
russellrao	Russell-Rao dissimilarity
sokalmichener	Sokal-Michener dissimilarity
sokalsneath	Sokal-Sneath dissimilarity
yule	Yule dissimilarity
mash	MASH distance
jsmash	MASH Jensen-Shannon distance
jaccarddistp	Jaccard-Needham dissimilarity Probability
euclidean_of_frequency	Euclidean distance of Frequency
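
To illustrate how a few of the distances in the table behave on k-mer frequency vectors, here is a small sketch in plain Python. The formulas are the standard textbook definitions of these distances; the helper names and toy sequences are made up, and normalizing counts to frequencies before comparison is an assumption, not a statement about what KITSUNE does internally.

```python
import math
from collections import Counter


def frequency_vectors(seq_a, seq_b, k):
    """Return k-mer frequency vectors for two sequences over a shared k-mer order."""
    counts = [Counter(s[i:i + k] for i in range(len(s) - k + 1)) for s in (seq_a, seq_b)]
    vocabulary = sorted(set(counts[0]) | set(counts[1]))
    vectors = []
    for c in counts:
        total = sum(c.values())
        vectors.append([c[kmer] / total for kmer in vocabulary])
    return vectors


def bray_curtis(u, v):
    """Sum of absolute differences divided by sum of absolute sums."""
    return sum(abs(a - b) for a, b in zip(u, v)) / sum(abs(a + b) for a, b in zip(u, v))


def cosine(u, v):
    """One minus the cosine of the angle between the two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)


def euclidean(u, v):
    """Straight-line distance between the two vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


# Toy sequences, made up for illustration only.
u, v = frequency_vectors("ACGTACGTGA", "ACGTTCGTGA", 3)
for name, func in (("braycurtis", bray_curtis), ("cosine", cosine), ("euclidean", euclidean)):
    print(name, round(func(u, v), 4))
```

Running the same pair of sequences at several values of k shows how strongly each distance depends on the chosen k-mer length.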

Kitsune also provides the option of a distance transformation proposed by Fan et al.

Calculate a distance matrix

$ kitsune dmatrix -h
usage: kitsune [-h] [--fast] [--canonical] -k KMER [-i INPUT] [-o OUTPUT]
               [-t THREAD] [--transformed] [-d DISTANCE] [-f FORMAT]
               [filenames [filenames ...]]

Calculate a distance matrix

positional arguments:
  filenames             genome files in fasta format

optional arguments:
  -h, --help            show this help message and exit
  --fast                Jellyfish one-pass calculation (faster)
  --canonical           Jellyfish count only canonical mer
  -k KMER, --kmer KMER
  -i INPUT, --input INPUT
                        list of genome files in txt
  -o OUTPUT, --output OUTPUT
                        output filename
  -t THREAD, --thread THREAD
  --transformed
  -d DISTANCE, --distance DISTANCE
                        braycurtis, canberra, jsmash, chebyshev, cityblock,
                        correlation, cosine (default), dice, euclidean,
                        hamming, jaccard, kulsinsk, matching, rogerstanimoto,
                        russellrao, sokalmichener, sokalsneath, sqeuclidean,
                        yule, mash, jaccarddistp
  -f FORMAT, --format FORMAT

Example of choosing distance option:

kitsune dmatrix genome1.fna genome2.fna -k 11 -d jaccard --canonical --fast -o output.txt
kitsune dmatrix genome1.fna genome2.fna -k 11 -d jensenshannon --canonical --fast -o output.txt

Find optimum k-mer from a given set of genomes

Kitsune provides a wrap-up command to find the optimum k-mer length for a given set of genomes within a given k-mer interval.

$ kitsune kopt -h
usage: kitsune [-h] [--fast] [--canonical] -kl KLARGE [-o OUTPUT]
               [--closely_related] [-x CRE_CUTOFF] [-y ACF_CUTOFF] [-t THREAD]
               filenames

Example: kitsune kopt genomeList.txt -kl 15 --canonical --fast -t 4 -o out.txt

positional arguments:
  filenames             A file that list the path to all genomes(fasta format)
                        with extension as (.txt,.csv,.tab) or no extension

optional arguments:
  -h, --help            show this help message and exit
  --fast                Jellyfish one-pass calculation (faster)
  --canonical           Jellyfish count only canonical mer
  -kl KLARGE, --klarge KLARGE
                        largest k-mer length to consider, note: the smallest
                        kmer length is 4
  -o OUTPUT, --output OUTPUT
                        output filename
  --closely_related     For closely related set of genomes, use this option
  -x CRE_CUTOFF, --cre_cutoff CRE_CUTOFF
                        cutoff to use in selecting kmers whose cre's are <=
                        (cutoff * max(cre)), Default = 10 percent, ie x=0.1
  -y ACF_CUTOFF, --acf_cutoff ACF_CUTOFF
                        cutoff to use in selecting kmers whose acf's are >=
                        (cutoff * max(acf)), Default = 10 percent, ie y=0.1
  -t THREAD, --thread THREAD
                        Number of threads (integer)
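
The two cutoff options can be read as simple filters over per-k values. The sketch below applies the thresholds exactly as worded in the help text (CRE at most cutoff * max(cre), ACF at least cutoff * max(acf)). The function name and the numbers are made up for illustration, and how KITSUNE combines these filters with OFC or with `--closely_related` is not covered here.

```python
def candidate_kmers(cre, acf, cre_cutoff=0.1, acf_cutoff=0.1):
    """Return the k values that pass both cutoffs.

    cre and acf map k-mer length to the metric value at that length.
    """
    cre_limit = cre_cutoff * max(cre.values())
    acf_limit = acf_cutoff * max(acf.values())
    return sorted(
        k for k in cre
        if k in acf and cre[k] <= cre_limit and acf[k] >= acf_limit
    )


# Hypothetical metric values, not output from a real run.
cre = {4: 2.0, 5: 1.2, 6: 0.5, 7: 0.15, 8: 0.05, 9: 0.01}
acf = {4: 256, 5: 900, 6: 1500, 7: 800, 8: 200, 9: 40}
print(candidate_kmers(cre, acf))  # [7, 8]
```

With these toy numbers, lengths 7 to 9 pass the CRE filter and lengths 4 to 8 pass the ACF filter, so only 7 and 8 satisfy both.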

Example dataset

First download the example files. Download

kitsune kopt genomeList.txt -kl 15 --canonical --fast -t 4 -o out.txt

Note: this command can require substantial computational resources when the input contains a large number of genomes and/or large genomes.
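
The `kopt` command expects a file that lists the paths to the genome files. A list like that can be generated with a few lines of Python, as sketched below. The one-path-per-line layout, the set of FASTA suffixes, and the function name are assumptions; adjust them to match your own data.

```python
from pathlib import Path

# Suffixes treated as FASTA files in this sketch (an assumption).
FASTA_SUFFIXES = {".fna", ".fa", ".fasta"}


def write_genome_list(genome_dir, list_path):
    """Write the path of every FASTA file in genome_dir to list_path, one per line."""
    genome_dir = Path(genome_dir)
    paths = sorted(
        p for p in genome_dir.iterdir()
        if p.is_file() and p.suffix.lower() in FASTA_SUFFIXES
    )
    Path(list_path).write_text("".join(f"{p}\n" for p in paths))
    return paths
```

For example, `write_genome_list("genome_fasta", "genomeList.txt")` would produce a list file that can then be passed to `kitsune kopt`.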
