A library for fast gkm-SVM string kernel computation

Project description

FastSK: fast sequence analysis with gapped string kernels (Fast-GKM-SVM)

This GitHub repo provides improved algorithms for gkm-SVM string kernel calculations. We provide a C++ implementation of the algorithm and a Python wrapper around it, distributed as a Python package. The package supports fast and accurate training of SVM classifiers and regressors for gkm string kernel based sequence analysis.

This repo is built on a novel, fast algorithm design for the gapped k-mer kernel, together with pybind11 and LIBSVM.

More details on the algorithms and results are in: Bioinformatics 2020

Prerequisites

On Windows

  • Visual Studio 2015 (required for all Python versions, see notes below)
  • CMake >= 3.1

Installation via Pip Install (Linux and MacOS)

Clone this repository and run:

git clone --recursive https://github.com/QData/FastSK.git
cd FastSK
pip install -r requirements.txt
pip install .

The pip installation of FastSK has been tested successfully on CentOS, Red Hat, MacOS and WindowsXP.

Python Version Tutorial

Example Jupyter notebook

  • 'docs/2demo/fastDemo.ipynb'

Example python usage script:

cd test
python run_check.py 

You can check whether the fastsk library is installed correctly from a Python shell:

from fastsk import FastSK

## Construct the FastSK kernel object (g: feature length, m: number of mismatch positions)
fastsk = FastSK(g=10, m=6, t=1, approx=True)
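
If you want to run this check from a script rather than an interactive shell, a small guard like the following may help. It is a minimal sketch that uses only the standard library; the helper name `is_installed` is hypothetical and not part of FastSK.

```python
import importlib.util


def is_installed(package_name):
    """Return True if a top-level package can be found on the import path."""
    # find_spec returns None when no top-level module with this name exists.
    return importlib.util.find_spec(package_name) is not None


if __name__ == "__main__":
    if is_installed("fastsk"):
        print("fastsk is importable")
    else:
        print("fastsk was not found; try re-running 'pip install .'")
```

The guard only confirms that the package can be located; it does not exercise the compiled C++ extension, so the import shown above is still the more thorough check.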

Experimental Results, Baselines, Utility Codes and Setup

  • We have provided all datasets we used in the subfolder "data"
  • We have provided all scripts we used to generate results under the subfolder "results"

Grid Search for FastSK and gkm-svm baseline

To run a grid search over the hyperparameter space (g, m, and C) and find the optimal parameters, use, for example, this utility script:

cd results/
python run_gridsearch.py
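
For readers unfamiliar with the idea, the search can be pictured roughly as follows. This is a generic sketch of a grid search over (g, m, C), not the contents of run_gridsearch.py; the function names, the `m < g` filter, and the toy scoring function are assumptions made for illustration.

```python
from itertools import product


def parameter_grid(g_values, m_values, c_values):
    """Yield (g, m, C) triples, skipping combinations with m >= g.

    The m < g filter is an assumption: a feature of length g needs at
    least one position that is not a mismatch position.
    """
    for g, m, c in product(g_values, m_values, c_values):
        if m < g:
            yield g, m, c


def grid_search(score_fn, g_values, m_values, c_values):
    """Return the (g, m, C) triple with the highest score, and that score."""
    best_params, best_score = None, float("-inf")
    for params in parameter_grid(g_values, m_values, c_values):
        score = score_fn(*params)
        if score > best_score:
            best_params, best_score = params, score
    return best_params, best_score


def toy_score(g, m, c):
    """Hypothetical stand-in for a validation metric such as AUC."""
    return 1.0 if (g, m, c) == (3, 2, 1.0) else 0.5


if __name__ == "__main__":
    print(grid_search(toy_score, [2, 3], [1, 2], [0.1, 1.0]))
```

In practice `score_fn` would train a model with the given parameters and return a held-out metric; the toy function here only makes the sketch runnable.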

When comparing with Deep Learning baselines

  • You need PyTorch installed:
pip install torch torchvision
  • One utility script runs hyperparameter tuning of CharCNN on all datasets, repeating each configuration with 5 random seeds:
cd results/neural_nets
python run_cnn_hyperTrTune.py 
  • We provide many other utility scripts to help users run the CNN and RNN baselines
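
Repeating a configuration over several random seeds is what lets you report a mean and a spread instead of a single number. The sketch below shows that pattern with the standard library only; `repeat_with_seeds` and `toy_evaluate` are hypothetical names, and the toy function stands in for a real train-and-evaluate run.

```python
import random
from statistics import mean, stdev


def repeat_with_seeds(evaluate, seeds):
    """Run evaluate(seed) once per seed and return (mean, sample std dev)."""
    scores = [evaluate(seed) for seed in seeds]
    return mean(scores), stdev(scores)


def toy_evaluate(seed):
    """Hypothetical stand-in for training a model and returning its score."""
    rng = random.Random(seed)  # seeding makes each repeat reproducible
    return 0.9 + rng.uniform(-0.01, 0.01)


if __name__ == "__main__":
    avg, spread = repeat_with_seeds(toy_evaluate, seeds=range(5))
    print(f"mean={avg:.4f} std={spread:.4f}")
```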

Some of our experimental results comparing FastSK with baselines with respect to performance and speed

Some of our experimental results comparing FastSK with character-based convolutional neural nets (CharCNN) when varying the training size.

To Do:

  • a detailed user document, with example input files, output files, code, and perhaps a user group where people can post their questions

Citations

If you find this tool useful, please cite us!

@article{fast-gkm-svm,
    author = {Blakely, Derrick and Collins, Eamon and Singh, Ritambhara and Norton, Andrew and Lanchantin, Jack and Qi, Yanjun},
    title = "{FastSK: fast sequence analysis with gapped string kernels}",
    journal = {Bioinformatics},
    volume = {36},
    number = {Supplement_2},
    pages = {i857-i865},
    year = {2020},
    month = {12},
    abstract = "{Gapped k-mer kernels with support vector machines (gkm-SVMs) have achieved strong predictive performance on regulatory DNA sequences on modestly sized training sets. However, existing gkm-SVM algorithms suffer from slow kernel computation time, as they depend exponentially on the sub-sequence feature length, number of mismatch positions, and the task’s alphabet size. In this work, we introduce a fast and scalable algorithm for calculating gapped k-mer string kernels. Our method, named FastSK, uses a simplified kernel formulation that decomposes the kernel calculation into a set of independent counting operations over the possible mismatch positions. This simplified decomposition allows us to devise a fast Monte Carlo approximation that rapidly converges. FastSK can scale to much greater feature lengths, allows us to consider more mismatches, and is performant on a variety of sequence analysis tasks. On multiple DNA transcription factor binding site prediction datasets, FastSK consistently matches or outperforms the state-of-the-art gkmSVM-2.0 algorithms in area under the ROC curve, while achieving average speedups in kernel computation of ∼100× and speedups of ∼800× for large feature lengths. We further show that FastSK outperforms character-level recurrent and convolutional neural networks while achieving low variance. We then extend FastSK to 7 English-language medical named entity recognition datasets and 10 protein remote homology detection datasets. FastSK consistently matches or outperforms these baselines. Our algorithm is available as a Python package and as C++ source code at https://github.com/QData/FastSK. Supplementary data are available at Bioinformatics online.}",
    issn = {1367-4803},
    doi = {10.1093/bioinformatics/btaa817},
    url = {https://doi.org/10.1093/bioinformatics/btaa817},
    eprint = {https://academic.oup.com/bioinformatics/article-pdf/36/Supplement\_2/i857/35337038/btaa817.pdf},
}

Legacy: if you prefer to use the executable built from the pure C++ source code (without the Python or R wrapper)

  • you can clone this repository:
git clone --recursive https://github.com/QData/FastSK.git

then run

cd FastSK
make

A fastsk executable will be installed to the bin directory, which you can use for kernel computation and inference. For example:

./bin/fastsk -g 10 -m 6 -C 1 -t 1 -a data/EP300.train.fasta data/EP300.test.fasta

This will run the approximate kernel algorithm on the EP300 TFBS dataset using a feature length of g = 10 with up to m = 6 mismatches. It will then train and evaluate an SVM classifier with the SVM parameter C = 1.
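
To make the roles of g and m concrete, here is a naive pure-Python sketch of a gapped k-mer kernel between two sequences. It follows the idea stated in the paper abstract (one independent counting step per choice of mismatch positions, plus a Monte Carlo estimate that samples those choices), but it is an unoptimized illustration written for this README, not the FastSK implementation. All function names are hypothetical, and `math.comb` requires Python 3.8 or newer.

```python
import random
from collections import Counter
from itertools import combinations
from math import comb


def gapped_counts(seq, g, dropped):
    """Count the strings left after deleting the `dropped` positions from each g-mer."""
    keep = [i for i in range(g) if i not in dropped]
    windows = (seq[i:i + g] for i in range(len(seq) - g + 1))
    return Counter("".join(w[i] for i in keep) for w in windows)


def position_set_term(x, y, g, dropped):
    """Kernel contribution of one fixed set of mismatch positions."""
    cx = gapped_counts(x, g, dropped)
    cy = gapped_counts(y, g, dropped)
    return sum(n * cy[kmer] for kmer, n in cx.items())


def naive_gapped_kernel(x, y, g, m):
    """Exact value: sum the counting term over every choice of m positions out of g."""
    return sum(position_set_term(x, y, g, set(d))
               for d in combinations(range(g), m))


def sampled_gapped_kernel(x, y, g, m, n_samples, seed=0):
    """Monte Carlo estimate: average over random position sets, rescaled by C(g, m)."""
    rng = random.Random(seed)
    total = 0
    for _ in range(n_samples):
        dropped = set(rng.sample(range(g), m))
        total += position_set_term(x, y, g, dropped)
    return comb(g, m) * total / n_samples


if __name__ == "__main__":
    print(naive_gapped_kernel("ACGT", "ACGA", g=3, m=1))
    print(sampled_gapped_kernel("ACGT", "ACGA", g=3, m=1, n_samples=100))
```

The exact version visits all C(g, m) position sets, which grows quickly with g; sampling only some of them trades a little accuracy for a fixed amount of work, which is the intuition behind the approximate mode.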
