SHARK (Similarity/Homology Assessment by Relating K-mers)


To accurately assess homology between unalignable sequences, we developed an alignment-free sequence comparison algorithm, SHARK (Similarity/Homology Assessment by Relating K-mers).
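
To make "alignment-free" and "k-mer" concrete, here is a minimal sketch of k-mer decomposition followed by a generic set-overlap comparison. It is only an illustration of the general idea: the Jaccard index used here is a stand-in and is not the SHARK scoring scheme, and the function names `kmers` and `jaccard_kmer_similarity` are hypothetical.

```python
def kmers(sequence: str, k: int) -> list[str]:
    """Return all overlapping k-mers of a sequence, in order of occurrence."""
    if k < 1 or k > len(sequence):
        raise ValueError("k must be between 1 and len(sequence)")
    return [sequence[i:i + k] for i in range(len(sequence) - k + 1)]


def jaccard_kmer_similarity(seq1: str, seq2: str, k: int) -> float:
    """Jaccard index of the two k-mer sets (generic measure, NOT the SHARK score)."""
    set1, set2 = set(kmers(seq1, k)), set(kmers(seq2, k))
    return len(set1 & set2) / len(set1 | set2)


# No alignment is computed: the sequences are compared only through their k-mer content.
print(kmers("LASIDPTFKAN", 3)[:3])  # ['LAS', 'ASI', 'SID']
```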

SHARK-dive

We trained SHARK-dive, a machine learning homology classifier, which achieved superior performance to standard alignment in assessing homology in unalignable sequences, and correctly identified dissimilar IDRs capable of functional rescue in IDR-replacement experiments reported in the literature.

1. Dive-Score

Scoring the similarity between a pair of sequences

Variants:

  1. Normal (SHARK-score (T))
  2. Sparse (SHARK-score (best))

2. Dive-Predict

Find sequences similar to a given query from a target set

User Section

Installation

SHARK officially supports Python versions >=3.9,<3.12.
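
If you want to check the interpreter programmatically before installing, a small sketch such as the following may help. The helper name `is_supported_python` is hypothetical; only the version bounds come from the statement above.

```python
import sys


def is_supported_python(version_info=sys.version_info) -> bool:
    """Return True if the (major, minor) version satisfies >=3.9,<3.12."""
    return (3, 9) <= tuple(version_info[:2]) < (3, 12)


if not is_supported_python():
    print("Warning: this Python version is outside the officially supported range")
```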

Recommended: use SHARK within a local Python virtual environment

python3 -m venv /path/to/new/virtual/environment

SHARK is installable from PyPI

$ pip install bio-shark

SHARK is also installable from source

  • This allows users to import functionalities as a Python package
  • This also allows users to run the functionalities as a command-line utility

$ git clone git@git.mpi-cbg.de:tothpetroczylab/shark.git

Once you have a copy of the source, you can embed it in your own Python package, or install it into your site-packages easily.

# Make sure you have the required Python version installed
$ python3 --version
Python 3.11.5

$ cd shark
$ python3 -m venv shark-env
$ source shark-env/bin/activate
(shark-env) $ python -m pip install .

SHARK is also installable from GitLab source directly

$ pip install git+https://git.mpi-cbg.de/tothpetroczylab/shark.git

How to use?

1. Dive

1.1. Scoring: Given two protein sequences and a k-mer length (1 to 20), score the similarity between them

Inputs
  1. Protein Sequence 1
  2. Protein Sequence 2
  3. Scoring-variant: Normal (SHARK-score (T))/ Sparse (SHARK-score (best))
    1. Threshold (for "Normal")
  4. K-mer length (must not exceed the length of the shorter sequence)
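
The input constraints listed above can be checked up front. The following is a rough sketch under stated assumptions: the helper `validate_scoring_inputs` and the variant labels `"normal"` and `"sparse"` are hypothetical and are not part of the bio-shark API.

```python
from typing import Optional

VARIANTS = {"normal", "sparse"}  # hypothetical labels for SHARK-score (T) / SHARK-score (best)


def validate_scoring_inputs(
    sequence1: str,
    sequence2: str,
    k: int,
    variant: str,
    threshold: Optional[float] = None,
) -> None:
    """Raise ValueError if the inputs violate the constraints listed above."""
    if variant not in VARIANTS:
        raise ValueError(f"variant must be one of {sorted(VARIANTS)}")
    shortest = min(len(sequence1), len(sequence2))
    if k < 1 or k > shortest:
        raise ValueError(f"k must be between 1 and {shortest} for these sequences")
    if variant == "normal" and threshold is None:
        raise ValueError("the normal variant requires a threshold")


# Passes silently: k=3 fits the shorter sequence and a threshold is given.
validate_scoring_inputs("LASIDPTFKAN", "ERQKNGGKSDSDDDEPAAKKKVEYPIAAAPPMMMP", 3, "normal", 0.8)
```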

1.1.1. As a command-line utility

  • Run the command shark-score
  • Enter the sequences when the command prompts
  • Enter the variant (1/2) when the command prompts

% shark-score 
Enter Sequence 1:
> SSSSPINTHGVSTTVPSSNNTIIPSSDGVSLSQTDYFDTVHNRQSPSRRESPVTVFRQPSLSHSKSLHKDSKNKVPQISTNQSHPSAVSTANTPGPSPN
Enter Sequence 2:
> VAEREFNGRSNSLHANFTSPVPRTVLDHHRHELTFCNPNNTTGFKTITPSPPTQHQSILPTAVDNVPRSKSVSSLPVSGFPPLIVKQQQQQQLNSSSSASALPSIHSPLTNEH
Enter k-mer length (integer 1 - 10): > 5
Press: 1. Normal; 2. Sparse
> 1
Enter threshold:
>0.8
Similarity Score: 0.6552442773

1.1.2. As an imported Python package

from bio_shark.core import utils
from bio_shark.dive.run import run_normal, run_sparse

dive_t_score = run_normal(
    sequence1="LASIDPTFKAN",
    sequence2="ERQKNGGKSDSDDDEPAAKKKVEYPIAAAPPMMMP",
    k=3,
    threshold=0.8
)   # Compute SHARK-score (T)  

dive_best_score = run_sparse(
    sequence1="LASIDPTFKAN",
    sequence2="ERQKNGGKSDSDDDEPAAKKKVEYPIAAAPPMMMP",
    k=3,
)   # Compute SHARK-score (best)
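
To collect scores for several k-mer lengths in one go, the scoring function can be passed in as a parameter. This is a sketch only: `scores_by_k` and `placeholder_scorer` are hypothetical names, the placeholder does not compute a real similarity, and the sketch assumes the scorer accepts the keyword arguments `sequence1`, `sequence2` and `k` as in the calls above.

```python
from typing import Callable, Dict, Iterable


def scores_by_k(
    sequence1: str,
    sequence2: str,
    scorer: Callable[..., float],
    ks: Iterable[int] = range(1, 11),
) -> Dict[int, float]:
    """Call `scorer` once per k-mer length, skipping k longer than the shorter sequence."""
    shortest = min(len(sequence1), len(sequence2))
    return {
        k: scorer(sequence1=sequence1, sequence2=sequence2, k=k)
        for k in ks
        if 1 <= k <= shortest
    }


def placeholder_scorer(sequence1: str, sequence2: str, k: int) -> float:
    """Stand-in so the sketch runs without bio-shark; NOT a real similarity score."""
    return 1.0 / k


print(scores_by_k("LASIDPTFKAN", "ERQKNGGKSDSDDDEPAAKKKVEYPIAAAPPMMMP", placeholder_scorer, ks=[1, 2, 4]))
```

With bio-shark installed, `run_sparse` from `bio_shark.dive.run` could presumably be passed in place of `placeholder_scorer`.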

1.2. Similarity Prediction

1.2.1. As an imported Python package

from bio_shark.dive.prediction import Prediction

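# Both arguments are dicts mapping FASTA IDs to sequences; replace the placeholders with your own data
predictor = Prediction(q_sequence_id_map=<dict-fasta-id-seq>, t_sequence_id_map=<dict-fasta-id-seq>)

# Keys present in each output object
expected_out_keys = ['seq_id1', 'sequence1', 'seq_id2', 'sequence2', 'similarity_scores_k', 'pred_label', 'pred_proba']
output = predictor.predict()    # List of output objects; each element covers one pair of sequences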
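
The two dictionaries map FASTA IDs to sequences. One way to build such a mapping from FASTA text is sketched below. The helper `parse_fasta` is hypothetical and not part of bio-shark. It assumes the ID is the first whitespace-delimited token of the header line, and it mirrors two behaviours mentioned for the command-line tool in this README (skipping sequences that contain X, requiring length > 10); whether `Prediction` applies the same filters itself is not stated here.

```python
def parse_fasta(text: str, min_length: int = 11) -> dict[str, str]:
    """Parse FASTA text into {id: sequence}, skipping sequences with 'X' or shorter than min_length."""
    records: dict[str, str] = {}
    seq_id = None
    chunks: list[str] = []

    def flush() -> None:
        # Store the record collected so far, if it passes the filters.
        if seq_id is None:
            return
        sequence = "".join(chunks).upper()
        if "X" not in sequence and len(sequence) >= min_length:
            records[seq_id] = sequence

    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            flush()
            header = line[1:].split()
            seq_id = header[0] if header else ""
            chunks = []
        else:
            chunks.append(line)
    flush()  # do not forget the last record
    return records


example = ">seq_a example\nSSSSPINTHGVS\nTTVP\n>seq_b\nAAXAAAAAAAAAA\n"
print(parse_fasta(example))  # {'seq_a': 'SSSSPINTHGVSTTVP'}
```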

1.2.2. As a command-line utility

  • Run the command shark-dive with the absolute paths of the query and target FASTA files as arguments
  • Sequences must be longer than 10 residues, since prediction always uses the scores for k = 1 to 10
  • You may use the sample_fasta_file.fasta from the data folder (Owncloud link)

usage: shark-dive [-h] [--output_dir OUTPUT_DIR] query target

DIVE-Predict: Given some query sequences, compute their similarity from the list of target sequences;Target is
supposed to be major database of protein sequences

positional arguments:
  query       Absolute path to fasta file for the query set of input sequences
  target      Absolute path to fasta file for the target set of input sequences

options:
  -h, --help  show this help message and exit
  --output_dir OUTPUT_DIR
                        Output folder (default: current working directory)
  
$ shark-dive "<query-fasta-file>.fasta" "<target-fasta-file>.fasta"
Read fasta file from path <query-fasta-file>.fasta; Found 4 sequences; Skipped 0 sequences for having X
Read fasta file from path <target-fasta-file>.fasta; Found 6 sequences; Skipped 0 sequences for having X
Output stored at <OUTPUT_DIR>/<path-to-sequence-fasta-file>.fasta.csv
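
When calling shark-dive from a script, the argument list can be assembled with the paths made absolute first. This is a minimal sketch based only on the usage text above; the helper name `build_shark_dive_command` is hypothetical.

```python
from pathlib import Path
from typing import List, Optional


def build_shark_dive_command(query: str, target: str, output_dir: Optional[str] = None) -> List[str]:
    """Build the argument list for: shark-dive [--output_dir OUTPUT_DIR] query target."""
    command = ["shark-dive"]
    if output_dir is not None:
        command += ["--output_dir", str(Path(output_dir).resolve())]
    # The tool expects absolute paths for both FASTA files.
    command += [str(Path(query).resolve()), str(Path(target).resolve())]
    return command


print(build_shark_dive_command("query.fasta", "target.fasta"))
```

The resulting list can be handed to `subprocess.run(command, check=True)` once bio-shark is installed and `shark-dive` is on the PATH.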

  • Output CSV has the following column headers:
    • (1) "Query": Fasta ID of sequence from Query list
    • (2) "Target": Fasta ID of sequence from Target list
    • (3..12) "SHARK-Score (k=*)": Similarity score between the two sequences for a specific k-value
    • (13) "SHARK-Dive": Aggregated similarity score over all k-mer lengths
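
The output CSV can be post-processed with the standard library. The sketch below ranks pairs by the "SHARK-Dive" column. It relies only on the "Query", "Target" and "SHARK-Dive" headers listed above; the helper `top_hits`, the 0.5 cut-off and the sample rows are made up for illustration.

```python
import csv
import io


def top_hits(csv_text: str, min_score: float = 0.5) -> list[tuple[str, str, float]]:
    """Return (query, target, score) tuples with SHARK-Dive >= min_score, best first."""
    reader = csv.DictReader(io.StringIO(csv_text))
    hits = []
    for row in reader:
        score = float(row["SHARK-Dive"])
        if score >= min_score:
            hits.append((row["Query"], row["Target"], score))
    return sorted(hits, key=lambda hit: hit[2], reverse=True)


# Made-up rows; a real file also contains the per-k score columns.
sample = "Query,Target,SHARK-Dive\nq1,t1,0.2\nq1,t2,0.9\nq2,t1,0.7\n"
print(top_hits(sample))  # [('q1', 't2', 0.9), ('q2', 't1', 0.7)]
```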

Publication

SHARK enables homology assessment in unalignable and disordered sequences

Chi Fung Willis Chow, Soumyadeep Ghosh, Anna Hadarovich, Agnes Toth-Petroczy*

Accepted

bioRxiv link: https://www.biorxiv.org/content/10.1101/2023.06.26.546490v1

Creative Commons License
This work is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.
