Alignment trimming software for phylogenetics.

ClipKIT is a fast and flexible alignment trimming tool that keeps phylogenetically informative sites and removes others.

If you found clipkit useful, please cite ClipKIT: a multiple sequence alignment-trimming algorithm for accurate phylogenomic inference. bioRxiv. doi: 10.1101/2020.06.08.140384.


Guide

Quick Start
Advanced Usage
Performance Assessment
FAQ


Quick Start

1) Installation

If you are having trouble installing ClipKIT, please contact the lead developer, Jacob L. Steenwyk, via email or Twitter to get help.

To install using pip, we strongly recommend building a virtual environment to avoid software dependency issues. To do so, execute the following commands:

# create virtual environment
python -m venv .venv
# activate virtual environment
source .venv/bin/activate
# install clipkit
pip install clipkit

Note, the virtual environment must be activated to use clipkit.

After using ClipKIT, you may wish to deactivate your virtual environment and can do so using the following command:

# deactivate virtual environment
deactivate

Similarly, to install from source, we strongly recommend using a virtual environment. To do so, use the following commands:

# download
git clone https://github.com/JLSteenwyk/ClipKIT.git
cd ClipKIT/
# create virtual environment
python -m venv .venv
# activate virtual environment
source .venv/bin/activate
# install
make install

To deactivate your virtual environment, use the following command:

# deactivate virtual environment
deactivate

Note, the virtual environment must be activated to use clipkit.

2) Usage

To use ClipKIT in its simplest form, execute the following command:

clipkit <input>

Output file with the suffix ".clipkit"



Advanced Usage

This section describes the various features and options of ClipKIT.
- Modes
- Output
- Log
- Complementary
- All options


Modes

ClipKIT can be run with five different modes (gappy, kpic, kpic-gappy, kpi, and kpi-gappy), which are specified with the -m/--mode argument.
Default: 'gappy'

  • gappy: trim all sites that are above a threshold of gappyness (default: 0.9)
  • kpic (alias: medium): keep only parsimony informative and constant sites
  • kpic-gappy (alias: medium-gappy): a combination of kpic- and gappy-based trimming
  • kpi (alias: heavy): keep only parsimony informative sites
  • kpi-gappy (alias: heavy-gappy): a combination of kpi- and gappy-based trimming
# gappy-based trimming
clipkit <input>
clipkit <input> -m gappy

# kpic-based trimming
clipkit <input> -m kpic
clipkit <input> -m medium

# kpic- and gappy-based trimming
clipkit <input> -m kpic-gappy
clipkit <input> -m medium-gappy

# kpi-based trimming
clipkit <input> -m kpi
clipkit <input> -m heavy

# kpi- and gappy-based trimming
clipkit <input> -m kpi-gappy 
clipkit <input> -m heavy-gappy
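
The site categories behind these modes can be sketched in a few lines of Python. This is an illustrative sketch, not ClipKIT's implementation. The function names are hypothetical, and the sketch assumes that "-" and "?" count as gaps, that a constant site has exactly one non-gap character state, that a parsimony informative site has at least two character states that each occur at least twice, and that "above a threshold" means strictly greater than the threshold.

```python
from collections import Counter

GAP_CHARS = frozenset("-?")  # assumption: characters counted as gaps

# aliases as listed in the mode descriptions above
ALIASES = {
    "medium": "kpic",
    "medium-gappy": "kpic-gappy",
    "heavy": "kpi",
    "heavy-gappy": "kpi-gappy",
}
MODES = {"gappy", "kpic", "kpic-gappy", "kpi", "kpi-gappy"}


def gappyness(column):
    """Fraction of entries in an alignment column that are gaps."""
    return sum(ch in GAP_CHARS for ch in column) / len(column)


def is_constant(column):
    """True if all non-gap entries share a single character state (assumed definition)."""
    states = {ch.upper() for ch in column if ch not in GAP_CHARS}
    return len(states) == 1


def is_parsimony_informative(column):
    """True if at least two character states each occur at least twice."""
    counts = Counter(ch.upper() for ch in column if ch not in GAP_CHARS)
    return sum(n >= 2 for n in counts.values()) >= 2


def keep_site(column, mode="gappy", gaps=0.9):
    """Decide whether a column would be kept under the given mode (sketch only)."""
    mode = ALIASES.get(mode, mode)
    if mode not in MODES:
        raise ValueError(f"unknown mode: {mode}")
    passes_gap_filter = gappyness(column) <= gaps
    if mode == "gappy":
        return passes_gap_filter
    informative = is_parsimony_informative(column)
    if mode.startswith("kpic"):
        # kpic modes also keep constant sites
        informative = informative or is_constant(column)
    if mode.endswith("-gappy"):
        return informative and passes_gap_filter
    return informative


# toy alignment (made-up data): one string per sequence
alignment = ["ACG-T", "ACGAT", "ATG-A", "ATC-A"]
columns = ["".join(site) for site in zip(*alignment)]
for mode in ("gappy", "kpic", "kpi"):
    kept = [i + 1 for i, col in enumerate(columns) if keep_site(col, mode)]
    print(mode, kept)
```

The loop prints, for each mode, the 1-based positions that this sketch would keep. The results of the real tool may differ wherever its definitions differ from the assumptions stated above.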

Output

By default, output files will have the same name as the input file with the suffix ".clipkit" appended to the name. Users can specify output file names with the -o option.

# specify output
clipkit <input> -o <output>

Log

It can be very useful to have information about each position in an alignment. For example, this information could be used in alignment diagnostics, fine-tuning of trimming parameters, etc. To create the log file, use the -l/--log option. Using this option will create a four-column file with the suffix '.clipkit.log'. Default: off

  • col1: position in the alignment (starting at 1)
  • col2: reports if site was trimmed or kept (trim or keep, respectively)
  • col3: reports whether the site is constant (Const), parsimony informative (PI), or neither (nConst, nPI)
  • col4: reports the gappyness of the position (number of gaps / entries in alignment)

clipkit <input> -l

Output file with the suffix ".clipkit.log"
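
A log file with the four columns described above could be read with a short script such as the following. This is a sketch under assumptions: the columns are taken to be whitespace-delimited, the names SiteRecord, parse_log_line, and trimmed_positions are hypothetical, and the sample lines are made up for illustration rather than copied from real ClipKIT output.

```python
from typing import NamedTuple


class SiteRecord(NamedTuple):
    position: int      # col1: position in the alignment (starting at 1)
    status: str        # col2: "trim" or "keep"
    site_class: str    # col3: e.g. "Const", "PI", or "nConst, nPI"
    gappyness: float   # col4: number of gaps / entries in alignment


def parse_log_line(line):
    """Parse one log line, assuming whitespace-delimited columns."""
    fields = line.split()
    if len(fields) < 4:
        raise ValueError(f"expected at least 4 fields, got {len(fields)}: {line!r}")
    # col3 may itself contain a space (e.g. "nConst, nPI"), so take the first
    # two fields from the left, the last from the right, and join the rest
    return SiteRecord(
        position=int(fields[0]),
        status=fields[1],
        site_class=" ".join(fields[2:-1]),
        gappyness=float(fields[-1]),
    )


def trimmed_positions(lines):
    """Return the positions marked as trimmed, skipping blank lines."""
    records = (parse_log_line(line) for line in lines if line.strip())
    return [rec.position for rec in records if rec.status == "trim"]


# made-up sample lines in the assumed layout
sample = [
    "1 keep Const 0.0",
    "2 keep PI 0.25",
    "3 trim nConst, nPI 0.95",
]
print(trimmed_positions(sample))
```

To apply this to a real log, pass an open file handle in place of the sample list, and adjust the splitting if the actual delimiter differs.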


Complementary

Having an alignment of the sequences that were trimmed can be useful for other analyses. To obtain an alignment of the sequences that were trimmed, use the -c/--complementary option. Default: off

clipkit <input> -c

Output file with the suffix ".clipkit.complementary"
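
The relationship between a trimmed alignment and its complement can be illustrated with a small Python sketch using gappy-style trimming. This is not ClipKIT's code. The function name split_by_gappyness is hypothetical, "-" and "?" are assumed to be the gap characters, and a site is assumed to be trimmed when its gappyness is strictly greater than the threshold.

```python
def split_by_gappyness(alignment, gaps=0.9, gap_chars="-?"):
    """Split an alignment into kept sites and trimmed (complementary) sites.

    alignment: dict mapping sequence name -> aligned sequence (equal lengths).
    Returns (kept, complementary), two dicts with the same keys.
    """
    sequences = list(alignment.values())
    length = len(sequences[0])
    if any(len(seq) != length for seq in sequences):
        raise ValueError("all sequences must have the same length")

    # one keep/trim decision per column
    keep = []
    for i in range(length):
        gap_count = sum(seq[i] in gap_chars for seq in sequences)
        keep.append(gap_count / len(sequences) <= gaps)

    kept = {
        name: "".join(ch for ch, k in zip(seq, keep) if k)
        for name, seq in alignment.items()
    }
    complementary = {
        name: "".join(ch for ch, k in zip(seq, keep) if not k)
        for name, seq in alignment.items()
    }
    return kept, complementary


# toy alignment (made-up data); the third column is 75% gaps
toy = {"a": "AC-T", "b": "AC-T", "c": "AG-A", "d": "AGCA"}
kept, complementary = split_by_gappyness(toy, gaps=0.5)
print(kept)
print(complementary)
```

Every column of the input ends up in exactly one of the two outputs, so the kept and complementary sequences for a given name together contain all of the original characters.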


All options

Option                      Usage and meaning
-h/--help                   Print help message
-v/--version                Print software version
-o/--output                 Specify output file name
-m/--mode                   Specify trimming mode. Default: gappy
-g/--gaps                   Specify gappyness threshold (between 0 and 1). Default: 0.9
-if/--input_file_format     Specify input file format*. Default: auto-detect
-of/--output_file_format    Specify output file format*. Default: input file type
-l/--log                    Create a log file. Default: off
-c/--complementary          Create a complementary alignment file. Default: off

*Acceptable file formats include: fasta, clustal, maf, mauve, phylip, phylip-sequential, phylip-relaxed, stockholm
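
When ClipKIT is called from a script, the options can be assembled programmatically. The helper below is a hypothetical sketch: build_clipkit_command is not part of ClipKIT, and it only builds an argument list from the short flags documented here (-m, -g, -o, -l, -c) without running anything.

```python
import shlex

# modes and aliases as listed in the Modes section
KNOWN_MODES = {
    "gappy", "kpic", "kpic-gappy", "kpi", "kpi-gappy",
    "medium", "medium-gappy", "heavy", "heavy-gappy",
}


def build_clipkit_command(input_path, mode=None, gaps=None, output=None,
                          log=False, complementary=False):
    """Build an argument list for the clipkit command line (hypothetical helper)."""
    cmd = ["clipkit", str(input_path)]
    if mode is not None:
        if mode not in KNOWN_MODES:
            raise ValueError(f"unknown mode: {mode}")
        cmd += ["-m", mode]
    if gaps is not None:
        # the gappyness threshold is documented as lying between 0 and 1
        if not 0 <= gaps <= 1:
            raise ValueError("gaps must be between 0 and 1")
        cmd += ["-g", str(gaps)]
    if output is not None:
        cmd += ["-o", str(output)]
    if log:
        cmd.append("-l")
    if complementary:
        cmd.append("-c")
    return cmd


# "alignment.fa" is a placeholder file name
cmd = build_clipkit_command("alignment.fa", mode="kpic-gappy", gaps=0.8, log=True)
print(shlex.join(cmd))
```

In an environment where clipkit is installed, the resulting list could be passed to subprocess.run(cmd, check=True). Options left at None or False are omitted, so ClipKIT's own defaults apply.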


Performance Assessment

In brief, performance assessment and comparison of multiple alignment trimming software revealed that ClipKIT with nearly any mode is a top-performing software. Here, we provide greater detail on the empirical datasets used to assess alignment trimming performance.

Performance Summary

ClipKIT is a top-performing software for trimming multiple sequence alignments. Across a total of 138,152 multiple sequence alignments (MSAs) from empirical (left) and simulated (right) datasets, desirability-based integration of accuracy and support metrics per MSA allowed relative software performance to be compared. MSA trimming approaches are ordered along the x-axis from the highest-performing software (left) to the lowest-performing software (right) according to average desirability-based rank, which is derived from measures of tree accuracy (i.e., normalized Robinson-Foulds distance) and tree support (i.e., average bipartition support).

Abbreviations of trimmers and parameters are as follows: ClipKIT: g = gappy mode; ClipKIT: kc = kpic; ClipKIT: kcg = kpic-gappy; ClipKIT: k = kpi mode; ClipKIT: kg = kpi-gappy mode; BMGE = BMGE default; BMGE 0.3 = 0.3 entropy threshold; BMGE 0.7 = 0.7 entropy threshold; trimAl: s = strict; trimAl: sp = strictplus; Noisy = default; Gblocks = default; No trim = no trimming.

For additional performance details, please see the manuscript ClipKIT: a multiple sequence alignment-trimming algorithm for accurate phylogenomic inference. bioRxiv. doi: 10.1101/2020.06.08.140384.





FAQ

If tree inference with no trim works well, why even trim?

Tree inference with trimmed multiple sequence alignments is computationally efficient. In other words, shorter alignments require less computational time and memory during tree search. We found that ClipKIT reduced computation time by an average of 20%. As datasets continuously become bigger, an alignment trimming algorithm that can reduce computational time will be of great value.


Does ClipKIT trim amino acids, nucleotides, or codons?

ClipKIT trims amino acid and nucleotide alignments. Currently, ClipKIT does not trim codons.


Is there a website version of ClipKIT?

Currently, there is no website version of ClipKIT.

I am having trouble installing ClipKIT, what should I do?

Please install ClipKIT using a virtual environment as directed in the installation instructions. If you are still running into issues after installing in a virtual environment, please contact the main software developer via email or Twitter.



