SHARK (Similarity/Homology Assessment by Relating K-mers)
To accurately assess homology between unalignable sequences, we developed an alignment-free sequence comparison algorithm, SHARK (Similarity/Homology Assessment by Relating K-mers).
SHARK-dive
We trained SHARK-dive, a machine learning homology classifier, which achieved superior performance to standard alignment in assessing homology in unalignable sequences, and correctly identified dissimilar IDRs capable of functional rescue in IDR-replacement experiments reported in the literature.
1. Dive-Score
Scoring the similarity between a pair of sequences
Variants:
- Normal (SHARK-score (T))
- Sparse (SHARK-score (best))
2. Dive-Predict
Find sequences similar to a given query from a target set
User Section
Installation
SHARK officially supports Python versions >=3.9,<3.12.
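The supported range can be checked before installing. The snippet below is a minimal sketch using only the standard library; the function name `is_supported_python` is hypothetical and not part of SHARK.

```python
import sys

# Supported range as stated above: >=3.9,<3.12
MIN_SUPPORTED = (3, 9)
MAX_EXCLUSIVE = (3, 12)


def is_supported_python(version=None):
    """Return True if the (major, minor) version lies in the supported range.

    Hypothetical helper for illustration; not part of the SHARK package.
    """
    if version is None:
        version = sys.version_info
    major_minor = tuple(version[:2])
    return MIN_SUPPORTED <= major_minor < MAX_EXCLUSIVE


if __name__ == "__main__":
    current = ".".join(str(part) for part in sys.version_info[:3])
    status = "supported" if is_supported_python() else "not supported"
    print(f"Python {current} is {status} by SHARK")
```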
Recommended: use SHARK within a local Python virtual environment
python3 -m venv /path/to/new/virtual/environment
SHARK is installable from PyPI
$ pip install bio-shark
SHARK is also installable from source
- This allows users to import functionalities as a Python package
- This also allows users to run the functionalities as a command line utility
$ git clone git@git.mpi-cbg.de:tothpetroczylab/shark.git
Once you have a copy of the source, you can embed it in your own Python package, or install it into your site-packages easily.
# Make sure you have the required Python version installed
$ python3 --version
Python 3.11.5
$ cd shark
$ python3 -m venv shark-env
$ source shark-env/bin/activate
(shark-env) $ python -m pip install .
SHARK is also installable from GitLab source directly
$ pip install git+https://git.mpi-cbg.de/tothpetroczylab/shark.git
How to use?
1. SHARK-scores: Given two protein sequences and a k-mer length (1 to 20), score the similarity between them
Inputs
- Protein Sequence 1
- Protein Sequence 2
- Scoring-variant: Normal (SHARK-score (T)) / Sparse (SHARK-score (best))
- Threshold (for "Normal")
- K-mer length (must be <= the length of the shorter sequence)
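The constraints on the k-mer length can be illustrated with a short sketch. It assumes the usual definition of k-mers as overlapping substrings of length k; the helper names `validate_k` and `kmers` are hypothetical, and SHARK's own k-mer handling may differ.

```python
def validate_k(sequence1, sequence2, k):
    """Check k against the constraints listed above (hypothetical helper).

    k must lie between 1 and 20 and must not exceed the shorter sequence.
    """
    shortest = min(len(sequence1), len(sequence2))
    if not 1 <= k <= 20:
        raise ValueError(f"k must be between 1 and 20, got {k}")
    if k > shortest:
        raise ValueError(
            f"k={k} exceeds the length of the shorter sequence ({shortest})"
        )
    return k


def kmers(sequence, k):
    """Return all overlapping k-mers of a sequence, in order of occurrence."""
    return [sequence[i:i + k] for i in range(len(sequence) - k + 1)]


if __name__ == "__main__":
    seq1 = "LASIDPTFKAN"
    seq2 = "ERQKNGGKSDSDDDEPAAKKKVEYPIAAAPPMMMP"
    k = validate_k(seq1, seq2, 3)
    print(kmers(seq1, k))
```

A sequence of length n yields n - k + 1 overlapping k-mers, which is why k cannot exceed the shorter of the two sequences.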
1.1. As a command-line utility
- Run the command shark-score along with input fasta files and scoring parameters
- Instead of input fasta files (--infile or --dbfile), a pair of query-target sequences can also be provided, e.g.:
% shark-score QUERYSEQUENCE TARGETSEQUENCE -k 5 -t 0.95 -s threshold -o results.tsv
- Note that if a FASTA file is provided, it will be used instead.
- The overall usage is as follows:
% shark-score --infile <path/to/query/fasta/file> --dbfile <path/to/target/fasta/file> --outfile <path/to/result/file> --length <k-mer length> --threshold <shark-score threshold>
usage: shark-score [-h] [--infile INFILE] [--dbfile DBFILE] [--outfile OUTFILE] [--scoretype {best,threshold,NGD}] [--length LENGTH] [--threshold THRESHOLD] [query] [target]
Run SHARK-Scores (best or T=x variants) or Normalised Google Distance Scores. Note that if a FASTA file is provided, it will be used instead.
positional arguments:
query Query sequence
target Target sequence
optional arguments:
-h, --help show this help message and exit
--infile INFILE, -i INFILE
Query FASTA file
--dbfile DBFILE, -d DBFILE
Target FASTA file
--outfile OUTFILE, -o OUTFILE
Result file
--scoretype {best,threshold,NGD}, -s {best,threshold,NGD}
Score type: best or threshold or NGD. Default is threshold.
--length LENGTH, -k LENGTH
k-mer length
--threshold THRESHOLD, -t THRESHOLD
threshold for SHARK-Score (T=x) variant
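When shark-score is driven from a script, the command line can be assembled from the flags shown in the help text above. This is a minimal sketch: the function `build_shark_score_command` is hypothetical, it only uses the documented long options, and it only builds the argument list (actually running it requires shark-score to be installed).

```python
import shlex

VALID_SCORE_TYPES = {"best", "threshold", "NGD"}


def build_shark_score_command(infile, dbfile, outfile, k,
                              scoretype="threshold", threshold=None):
    """Build an argument list for shark-score (hypothetical helper).

    Only the options documented in the usage text above are emitted.
    """
    if scoretype not in VALID_SCORE_TYPES:
        raise ValueError(f"scoretype must be one of {sorted(VALID_SCORE_TYPES)}")
    argv = [
        "shark-score",
        "--infile", str(infile),
        "--dbfile", str(dbfile),
        "--outfile", str(outfile),
        "--scoretype", scoretype,
        "--length", str(k),
    ]
    if threshold is not None:
        argv += ["--threshold", str(threshold)]
    return argv


if __name__ == "__main__":
    argv = build_shark_score_command(
        "query.fasta", "target.fasta", "results.tsv", k=5, threshold=0.95
    )
    # Print a shell-safe version of the command; to execute it, pass argv
    # to subprocess.run(argv, check=True) with shark-score installed.
    print(shlex.join(argv))
```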
1.2. As an imported python package
from bio_shark.core import utils
from bio_shark.dive.run import run_normal, run_sparse
dive_t_score = run_normal(
    sequence1="LASIDPTFKAN",
    sequence2="ERQKNGGKSDSDDDEPAAKKKVEYPIAAAPPMMMP",
    k=3,
    threshold=0.8
)  # Compute SHARK-score (T)
dive_best_score = run_sparse(
    sequence1="LASIDPTFKAN",
    sequence2="ERQKNGGKSDSDDDEPAAKKKVEYPIAAAPPMMMP",
    k=3,
)  # Compute SHARK-score (best)
2. SHARK-Dive: Homology Assessment between two sequences
2.1. As an imported python package
from bio_shark.dive.prediction import Prediction
predictor = Prediction(q_sequence_id_map=<dict-fasta-id-seq>, t_sequence_id_map=<dict-fasta-id-seq>)
expected_out_keys = ['seq_id1', 'sequence1', 'seq_id2', 'sequence2', 'similarity_scores_k', 'pred_label', 'pred_proba']
output = predictor.predict() # List of output objects; Each element is for one pair
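The two sequence maps passed to Prediction are dictionaries from FASTA ID to sequence. One way to build such a dictionary from a FASTA file is sketched below with the standard library only. The helper `read_fasta_as_dict` is hypothetical, and taking the first whitespace-delimited token of the header as the ID is an assumption; SHARK may ship its own reader with different conventions.

```python
def read_fasta_as_dict(lines):
    """Parse FASTA-formatted lines into {sequence_id: sequence}.

    Hypothetical helper for illustration. The ID is assumed to be the first
    whitespace-delimited token after '>'. Multi-line sequences are joined.
    """
    sequences = {}
    current_id = None
    for raw_line in lines:
        line = raw_line.strip()
        if not line:
            continue  # ignore blank lines
        if line.startswith(">"):
            header_tokens = line[1:].split()
            current_id = header_tokens[0] if header_tokens else ""
            sequences[current_id] = ""
        elif current_id is not None:
            sequences[current_id] += line
    return sequences


if __name__ == "__main__":
    example = [
        ">seq1 example header",
        "LASID",
        "PTFKAN",
        ">seq2",
        "ERQKNGGKSDSDDDEPAAKKKVEYPIAAAPPMMMP",
    ]
    print(read_fasta_as_dict(example))
```

For a file on disk, the same function can be called as `read_fasta_as_dict(open(path))`, since it accepts any iterable of lines; the resulting dictionaries can then be supplied as `q_sequence_id_map` and `t_sequence_id_map`.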
2.2. As a command-line utility
- Run the command shark-dive with the absolute paths of the query and target fasta files as the only arguments
- Sequences should be of length > 10, since prediction is always based on scores of k = [1..10]
- You may use the sample_fasta_file.fasta from the data folder (Owncloud link)
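Input sequences can be screened before running shark-dive. The sketch below applies the length requirement stated above and, as an assumption based on the "Skipped ... sequences" messages in the tool's log output, also sets aside sequences containing residues outside the 20 canonical amino acids. The helper `split_usable_sequences` is hypothetical and does not reproduce SHARK's own filtering.

```python
CANONICAL_AMINO_ACIDS = set("ACDEFGHIKLMNPQRSTVWY")
MAX_K = 10  # prediction is based on scores for k = 1..10


def split_usable_sequences(id_to_sequence):
    """Split sequences into usable and skipped ones (hypothetical helper).

    A sequence is kept if it is longer than MAX_K and contains only
    canonical amino acids (the latter check is an assumption).
    Returns (usable, skipped), where skipped maps ID -> reason.
    """
    usable, skipped = {}, {}
    for seq_id, sequence in id_to_sequence.items():
        sequence = sequence.upper()
        if len(sequence) <= MAX_K:
            skipped[seq_id] = "too short"
        elif not set(sequence) <= CANONICAL_AMINO_ACIDS:
            skipped[seq_id] = "non-canonical residue"
        else:
            usable[seq_id] = sequence
    return usable, skipped


if __name__ == "__main__":
    usable, skipped = split_usable_sequences({
        "ok": "LASIDPTFKAN",
        "short": "LASIDPTFKA",
        "has_x": "LASIDPTFKANX",
    })
    print(usable)
    print(skipped)
```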
usage: shark-dive [-h] [--output_dir OUTPUT_DIR] query target
DIVE-Predict: Given some query sequences, compute their similarity from the list of target sequences;Target is
supposed to be major database of protein sequences
positional arguments:
query Absolute path to fasta file for the query set of input sequences
target Absolute path to fasta file for the target set of input sequences
options:
-h, --help show this help message and exit
--output_dir OUTPUT_DIR
Output folder (default: current working directory)
$ shark-dive "<query-fasta-file>.fasta" "<target-fasta-file>.fasta"
Read fasta file from path <query-fasta-file>.fasta; Found 4 sequences; Skipped 0 sequences for having X
Read fasta file from path <target-fasta-file>.fasta; Found 6 sequences; Skipped 0 sequences for having X
Output stored at <OUTPUT_DIR>/<path-to-sequence-fasta-file>.fasta.csv
- Output CSV has the following column headers:
  - (1) "Query": Fasta ID of sequence from Query list
  - (2) "Target": Fasta ID of sequence from Target list
  - (3..12) "SHARK-Score (k=*)": Similarity score between the two sequences for specific k-value
  - (13) "SHARK-Dive": Aggregated similarity score over all lengths of k-mer
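The output CSV can be post-processed with the standard library, for example to rank pairs by the aggregated score. This is a minimal sketch: the helper `top_hits` is hypothetical, only the "Query", "Target" and "SHARK-Dive" columns named above are used, and the sample rows are made up purely for illustration (the per-k columns are omitted for brevity).

```python
import csv
import io


def top_hits(csv_text, n=5):
    """Return the n pairs with the highest "SHARK-Dive" value.

    Hypothetical helper; expects the column headers described above.
    """
    rows = list(csv.DictReader(io.StringIO(csv_text)))
    rows.sort(key=lambda row: float(row["SHARK-Dive"]), reverse=True)
    return [
        (row["Query"], row["Target"], float(row["SHARK-Dive"]))
        for row in rows[:n]
    ]


if __name__ == "__main__":
    # Made-up values for illustration only; not real SHARK output.
    sample = (
        "Query,Target,SHARK-Dive\n"
        "q1,t1,0.12\n"
        "q1,t2,0.91\n"
        "q2,t1,0.55\n"
    )
    for query, target, score in top_hits(sample, n=2):
        print(query, target, score)
```

For a real result file, the text can be read with `open(path, newline="").read()` before calling the function.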
2.3. Parallelised Runs of SHARK-Dive
- The score for each k-mer length is computed in parallel; the 10 scores are then aggregated and SHARK-Dive is run on the result.
- Change the environment variables in parallel_run_example_environment.env (or create your own!)
- Navigate to the parallel_run folder
- Run parallel_run.sh
$ bash parallel_run.sh
...
Read fasta file from path ../data/IDR_Segments.fasta; Found 6 sequences; Skipped 0 sequences for having non-canonical AAs
All sequences are present! Proceeding with SHARK-dive prediction...
Finished in 0.10163092613220215 seconds
121307136
SHARK-dive prediction complete!
Elapsed Time: 3 seconds
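The pattern of one task per k-mer length followed by an aggregation step can be sketched in Python as follows. This is not how parallel_run.sh is implemented and it does not compute SHARK-scores: `placeholder_score` is a stand-in (the Jaccard index of k-mer sets), threads are used only to keep the example short, and the final SHARK-dive classifier is not reproduced. All function names are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor

K_VALUES = range(1, 11)  # k = 1..10, as used by SHARK-dive


def kmer_set(sequence, k):
    """Return the set of overlapping k-mers of a sequence."""
    return {sequence[i:i + k] for i in range(len(sequence) - k + 1)}


def placeholder_score(sequence1, sequence2, k):
    """Stand-in similarity: Jaccard index of k-mer sets. NOT the SHARK-score."""
    kmers1, kmers2 = kmer_set(sequence1, k), kmer_set(sequence2, k)
    union = kmers1 | kmers2
    return len(kmers1 & kmers2) / len(union) if union else 0.0


def scores_for_all_k(sequence1, sequence2):
    """Compute one score per k concurrently, then collect them in order of k."""
    with ThreadPoolExecutor() as pool:
        futures = {
            k: pool.submit(placeholder_score, sequence1, sequence2, k)
            for k in K_VALUES
        }
        # Aggregation step: gather the 10 per-k results into one mapping,
        # which is the kind of input a downstream classifier would consume.
        return {k: futures[k].result() for k in K_VALUES}


if __name__ == "__main__":
    scores = scores_for_all_k("LASIDPTFKAN", "LASIDPTFKANLASID")
    for k, score in scores.items():
        print(k, round(score, 3))
```

For CPU-bound scoring, separate processes (as in a shell-driven parallel run) would usually be a better fit than threads.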
Publication
SHARK enables sensitive detection of evolutionary homologs and functional analogs in unalignable and disordered sequences. Chi Fung Willis Chow, Soumyadeep Ghosh, Anna Hadarovich, and Agnes Toth-Petroczy. Proc Natl Acad Sci U S A. 2024 Oct 15;121(42):e2401622121. doi: 10.1073/pnas.2401622121. Epub 2024 Oct 9. PMID: 39383002.
This work is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.