SHARK (Similarity/Homology Assessment by Relating K-mers)


To accurately assess homology between unalignable sequences, we developed an alignment-free sequence comparison algorithm, SHARK (Similarity/Homology Assessment by Relating K-mers).

SHARK-dive

We trained SHARK-dive, a machine learning homology classifier, which achieved superior performance to standard alignment in assessing homology in unalignable sequences, and correctly identified dissimilar IDRs capable of functional rescue in IDR-replacement experiments reported in the literature.

1. Dive-Score

Scores the similarity between a pair of sequences.

Variants:

  1. Normal (SHARK-score (T))
  2. Sparse (SHARK-score (best))

2. Dive-Predict

Find sequences similar to a given query from a target set

User Section

Installation

SHARK officially supports Python versions >=3.9,<3.12.

Recommended: use SHARK within a local Python virtual environment

python3 -m venv /path/to/new/virtual/environment

SHARK is installable from PyPI

$ pip install bio-shark

SHARK is also installable from source

  • This allows users to import the functionality as a Python package
  • This also allows users to run the functionality as a command-line utility
$ git clone git@git.mpi-cbg.de:tothpetroczylab/shark.git

Once you have a copy of the source, you can embed it in your own Python package, or install it into your site-packages easily.

# Make sure you have the required Python version installed
$ python3 --version
Python 3.11.5

$ cd shark
$ python3 -m venv shark-env
$ source shark-env/bin/activate
(shark-env) $ python -m pip install .

SHARK is also installable from GitLab source directly

$ pip install git+https://git.mpi-cbg.de/tothpetroczylab/shark.git

How to use?

1. SHARK-scores: Given two protein sequences and a k-mer length (1 to 20), score the similarity between them

Inputs
  1. Protein Sequence 1
  2. Protein Sequence 2
  3. Scoring variant: Normal (SHARK-score (T)) / Sparse (SHARK-score (best))
    1. Threshold (for "Normal")
  4. K-mer length (must be <= the length of the shortest input sequence)
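The constraints on the k-mer length listed above (1 to 20, and no longer than the shortest sequence) can be checked before scoring. The following is only a minimal sketch; check_kmer_length is a hypothetical helper and is not part of the bio_shark package.

```python
def check_kmer_length(k: int, sequence1: str, sequence2: str) -> None:
    """Raise ValueError if k violates the documented input constraints.

    Hypothetical helper for illustration; not part of bio_shark.
    """
    shortest = min(len(sequence1), len(sequence2))
    if not 1 <= k <= 20:
        raise ValueError(f"k must be between 1 and 20, got {k}")
    if k > shortest:
        raise ValueError(
            f"k={k} exceeds the shortest sequence length ({shortest})"
        )


# Passes silently: k=3 fits both example sequences.
check_kmer_length(3, "LASIDPTFKAN", "ERQKNGGKSDSDDDEPAAKKKVEYPIAAAPPMMMP")
```

Calling the helper before a scoring run gives an early, readable error instead of relying on how the scoring code itself reacts to an oversized k.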
1.1. As a command-line utility
  • Run the command shark-score along with input fasta files and scoring parameters
  • Instead of input fasta files (--infile or --dbfile), a pair of query-target sequences can also be provided, e.g.:
% shark-score QUERYSEQUENCE TARGETSEQUENCE -k 5 -t 0.95 -s threshold -o results.tsv
  • Note that if a FASTA file is provided, it will be used instead.
  • The overall usage is as follows:
% shark-score --infile <path/to/query/fasta/file> --dbfile <path/to/target/fasta/file> --outfile <path/to/result/file> --length <k-mer length> --threshold <shark-score threshold> 
usage: shark-score [-h] [--infile INFILE] [--dbfile DBFILE] [--outfile OUTFILE] [--scoretype {best,threshold,NGD}] [--length LENGTH] [--threshold THRESHOLD] [query] [target]

Run SHARK-Scores (best or T=x variants) or Normalised Google Distance Scores. Note that if a FASTA file is provided, it will be used instead.

positional arguments:
  query                 Query sequence
  target                Target sequence

optional arguments:
  -h, --help            show this help message and exit
  --infile INFILE, -i INFILE
                        Query FASTA file
  --dbfile DBFILE, -d DBFILE
                        Target FASTA file
  --outfile OUTFILE, -o OUTFILE
                        Result file
  --scoretype {best,threshold,NGD}, -s {best,threshold,NGD}
                        Score type: best or threshold or NGD. Default is threshold.
  --length LENGTH, -k LENGTH
                        k-mer length
  --threshold THRESHOLD, -t THRESHOLD
                        threshold for SHARK-Score (T=x) variant
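When shark-score is driven from a script, the argument list can be assembled from the flags shown in the help text above. This is a sketch only: build_shark_score_command is a hypothetical helper, the file names are placeholders, and passing --threshold only when a value is given is an assumption about how the option is meant to be used.

```python
import shlex


def build_shark_score_command(infile, dbfile, outfile, k,
                              scoretype="threshold", threshold=None):
    """Assemble a shark-score argument list from the documented flags.

    Hypothetical helper for illustration; not part of bio_shark.
    """
    cmd = [
        "shark-score",
        "--infile", infile,
        "--dbfile", dbfile,
        "--outfile", outfile,
        "--scoretype", scoretype,
        "--length", str(k),
    ]
    # Assumption: the threshold is only passed when the caller supplies one.
    if threshold is not None:
        cmd += ["--threshold", str(threshold)]
    return cmd


# Placeholder file names; replace with real paths.
cmd = build_shark_score_command(
    "query.fasta", "target.fasta", "results.tsv", k=5, threshold=0.95
)
print(shlex.join(cmd))
```

The resulting list can be handed to subprocess.run(cmd, check=True) once SHARK is installed in the active environment.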
1.2. As an imported python package
from bio_shark.core import utils
from bio_shark.dive.run import run_normal, run_sparse

dive_t_score = run_normal(
    sequence1="LASIDPTFKAN",
    sequence2="ERQKNGGKSDSDDDEPAAKKKVEYPIAAAPPMMMP",
    k=3,
    threshold=0.8
)   # Compute SHARK-score (T)  

dive_best_score = run_sparse(
    sequence1="LASIDPTFKAN",
    sequence2="ERQKNGGKSDSDDDEPAAKKKVEYPIAAAPPMMMP",
    k=3,
)   # Compute SHARK-score (best)

2. SHARK-Dive: Homology Assessment between two sequences

2.1. As an imported python package
from bio_shark.dive.prediction import Prediction

predictor = Prediction(q_sequence_id_map=<dict-fasta-id-seq>, t_sequence_id_map=<dict-fasta-id-seq>)

expected_out_keys = ['seq_id1', 'sequence1', 'seq_id2', 'sequence2', 'similarity_scores_k', 'pred_label', 'pred_proba']
output = predictor.predict()    # List of output objects; Each element is for one pair
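The <dict-fasta-id-seq> placeholders above stand for dictionaries that map a FASTA ID to its sequence. One way to build such a dictionary with the standard library is sketched below. read_fasta_as_dict is a hypothetical helper, and taking the first whitespace-delimited token of the header as the ID is an assumption; SHARK's own FASTA reader may behave differently.

```python
def read_fasta_as_dict(lines):
    """Parse FASTA-formatted lines into a {sequence_id: sequence} dict.

    Hypothetical helper for illustration; not part of bio_shark.
    Assumption: the ID is the first token after '>' in the header line.
    """
    sequences = {}
    current_id = None
    for raw in lines:
        line = raw.strip()
        if not line:
            continue  # ignore blank lines
        if line.startswith(">"):
            tokens = line[1:].split()
            current_id = tokens[0] if tokens else ""
            sequences[current_id] = ""
        elif current_id is not None:
            # Sequence data may be wrapped over several lines.
            sequences[current_id] += line
    return sequences


# Illustrative records only, reusing the example sequences from section 1.2.
example = """>seq1 demo record
LASIDPTFKAN
>seq2
ERQKNGGKSDSDDDE
PAAKKKVEYPIAAAPPMMMP
""".splitlines()

id_map = read_fasta_as_dict(example)
print(id_map)
```

For a file on disk, the same function can be called with an open file handle, and the two resulting dictionaries passed as q_sequence_id_map and t_sequence_id_map.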
2.2. As a command-line utility
  • Run the command shark-dive with the absolute paths of the query and target FASTA files as arguments
  • Sequences should be of length > 10, since prediction is always based on scores of k = [1..10]
  • You may use the sample_fasta_file.fasta from the data folder (Owncloud link)
usage: shark-dive [-h] [--output_dir OUTPUT_DIR] query target

DIVE-Predict: Given some query sequences, compute their similarity from the list of target sequences;Target is
supposed to be major database of protein sequences

positional arguments:
  query       Absolute path to fasta file for the query set of input sequences
  target      Absolute path to fasta file for the target set of input sequences

options:
  -h, --help  show this help message and exit
  --output_dir OUTPUT_DIR
                        Output folder (default: current working directory)
  
$ shark-dive "<query-fasta-file>.fasta" "<target-fasta-file>.fasta"
Read fasta file from path <query-fasta-file>.fasta; Found 4 sequences; Skipped 0 sequences for having X
Read fasta file from path <target-fasta-file>.fasta; Found 6 sequences; Skipped 0 sequences for having X
Output stored at <OUTPUT_DIR>/<path-to-sequence-fasta-file>.fasta.csv
  • Output CSV has the following column headers:
    • (1) "Query": Fasta ID of sequence from Query list
    • (2) "Target": Fasta ID of sequence from Target list
    • (3..12) "SHARK-Score (k=*)": Similarity score between the two sequences for specific k-value
    • (13) "SHARK-Dive": Aggregated similarity score over all lengths of k-mer
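The output CSV described above can be post-processed with the standard csv module. The sketch below assumes the header strings are exactly "Query", "Target" and "SHARK-Dive" as listed; the demo rows and the 0.5 cutoff are made-up placeholders (not SHARK results), the per-k score columns are omitted for brevity, and top_hits is a hypothetical helper.

```python
import csv
import io


def top_hits(csv_file, min_score):
    """Return (query, target, score) tuples with SHARK-Dive >= min_score,
    sorted from highest to lowest score.

    Hypothetical helper for illustration; not part of bio_shark.
    """
    reader = csv.DictReader(csv_file)
    hits = []
    for row in reader:
        score = float(row["SHARK-Dive"])
        if score >= min_score:
            hits.append((row["Query"], row["Target"], score))
    return sorted(hits, key=lambda hit: hit[2], reverse=True)


# Placeholder data standing in for a real output file.
demo = io.StringIO(
    "Query,Target,SHARK-Dive\n"
    "q1,t1,0.91\n"
    "q1,t2,0.12\n"
    "q2,t1,0.55\n"
)
hits = top_hits(demo, min_score=0.5)
print(hits)
```

With a real run, open the CSV written by shark-dive and pass the file handle in place of the in-memory demo.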
2.3. Parallelised Runs of SHARK-Dive
  • Each k-mer score is run in parallel, with a final aggregation step of the 10 k-mer scores, whereupon SHARK-Dive is run.
  • Change the environment variables in parallel_run_example_environment.env (or create your own!)
  • Navigate to the parallel_run folder
  • Run parallel_run.sh
$ bash parallel_run.sh
...
Read fasta file from path ../data/IDR_Segments.fasta; Found 6 sequences; Skipped 0 sequences for having non-canonical AAs
All sequences are present! Proceeding with SHARK-dive prediction...
Finished in 0.10163092613220215 seconds
121307136
SHARK-dive prediction complete!
Elapsed Time: 3 seconds

Publication

SHARK enables sensitive detection of evolutionary homologs and functional analogs in unalignable and disordered sequences. Chi Fung Willis Chow, Soumyadeep Ghosh, Anna Hadarovich, and Agnes Toth-Petroczy. Proc Natl Acad Sci U S A. 2024 Oct 15;121(42):e2401622121. doi: 10.1073/pnas.2401622121. Epub 2024 Oct 9. PMID: 39383002.

Creative Commons License
This work is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.
