tools for predicting kinome specificities
KATLAS
KATLAS is a repository of Python tools for predicting kinases from a substrate sequence. It also contains datasets of kinase substrate specificities and human phosphoproteomics.
References: Please cite the appropriate papers if KATLAS is helpful to your research.
- KATLAS was described in the paper [Decoding Human Kinome Specificities through a Computational Data-Driven Approach (manuscript)]
- The positional scanning peptide array (PSPA) data are from the papers "An atlas of substrate specificities for the human serine/threonine kinome" and "The intrinsic substrate specificity of the human tyrosine kinome"
- The kinase substrate datasets used for generating PSSMs are derived from PhosphoSitePlus and the paper "Large-scale Discovery of Substrates of the Human Kinome"
- Phosphorylation sites are acquired from PhosphoSitePlus, the paper "The functional landscape of the human phosphoproteome", and CPTAC / LinkedOmics
Tutorials on Colab
Install
Install the latest version from GitHub with pip (the leading ! runs the command from a notebook cell, such as on Colab):
!pip install git+https://github.com/sky1ove/katlas.git -Uqq
Import
from katlas.core import *
Quick start
We provide two methods to score a substrate sequence against kinases:
- Computational Data-Driven Method (CDDM)
- Positional Scanning Peptide Array (PSPA)
Input can be given in two formats:
- a single input string (phosphorylation site)
- a csv/dataframe that contains a column of phosphorylation sites
Input sequences can be written in two ways:
- all capital letters
- with lowercase letters indicating phosphorylation status
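The "considering string" lines printed in the examples below label every residue with its offset from the central phospho-acceptor. A minimal sketch of that labeling follows; `label_positions` is a hypothetical helper written for illustration, not part of katlas, and it assumes that position 0 sits at index `len(seq) // 2`, which matches the printed output and the (-5 to +4) convention for length-10 sequences.

```python
def label_positions(site_seq):
    """Label each residue with its offset from the central phospho-acceptor.

    Assumption: position 0 is at index len(site_seq) // 2.
    """
    center = len(site_seq) // 2
    return [f"{i - center}{aa}" for i, aa in enumerate(site_seq)]


# All capital, length 15 (-7 to +7)
print(label_positions("AAAAAAASGGAGSDN"))

# Lowercase letters are kept as-is, so phosphorylation status is preserved
print(label_positions("AEEKEyHsEGG"))
```

Case is left untouched here, so the choice between the all-capital and lowercase conventions stays visible in the labels.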
Single sequence as input
CDDM, all capital
predict_kinase('AAAAAAASGGAGSDN',**param_CDDM_upper)
considering string: ['-7A', '-6A', '-5A', '-4A', '-3A', '-2A', '-1A', '0S', '1G', '2G', '3A', '4G', '5S', '6D', '7N']
kinase
PAK6 2.032
ULK3 2.032
PRKX 2.012
ATR 1.991
PRKD1 1.988
...
DDR2 0.928
EPHA4 0.928
TEK 0.921
KIT 0.915
FGFR3 0.910
Length: 289, dtype: float64
CDDM, with lower case indicating phosphorylation status
predict_kinase('AAAAAAAsGGAGsDN',**param_CDDM)
considering string: ['-7A', '-6A', '-5A', '-4A', '-3A', '-2A', '-1A', '0s', '1G', '2G', '3A', '4G', '5s', '6D', '7N']
kinase
ULK3 1.987
PAK6 1.981
PRKD1 1.946
PIM3 1.944
PRKX 1.939
...
EPHA4 0.905
EGFR 0.900
TEK 0.898
FGFR3 0.894
KIT 0.882
Length: 289, dtype: float64
PSPA, with lower case indicating phosphorylation status
predict_kinase('AEEKEyHsEGG',**param_PSPA).head()
considering string: ['-5A', '-4E', '-3E', '-2K', '-1E', '0y', '1H', '2s', '3E', '4G', '5G']
kinase
EGFR 4.013
FGFR4 3.568
ZAP70 3.412
CSK 3.241
SYK 3.209
dtype: float64
To replicate the results from The Kinase Library (PSPA)
Check The Kinase Library and rank by log2(score); it gives the same results as shown below (with slight differences due to rounding).
predict_kinase('AEEKEyHSEGG',**param_PSPA).head(10)
considering string: ['-5A', '-4E', '-3E', '-2K', '-1E', '0y', '1H', '2S', '3E', '4G', '5G']
kinase
EGFR 3.181
FGFR4 2.390
CSK 2.308
ZAP70 2.068
SYK 1.998
PDHK1_TYR 1.922
RET 1.732
MATK 1.688
FLT1 1.627
BMPR2_TYR 1.456
dtype: float64
- So far, The Kinase Library treats all tyrosine sequences as all capital, regardless of whether they contain lowercase letters; this is a small bug and should be fixed soon.
- A kinase with “_TYR” is a dual-specificity kinase tested in the PSPA tyrosine setting; these have not been included in kinase-library yet.
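The comparison above depends on ranking kinases by log2(score). The sketch below shows that step on its own, assuming the raw scores are positive numbers keyed by kinase name. The helper `rank_by_log2`, the kinase names, and the values are all made up for illustration and are not taken from katlas or The Kinase Library.

```python
import math


def rank_by_log2(raw_scores):
    """Return (kinase, log2(score)) pairs sorted from highest to lowest."""
    logged = {kinase: math.log2(score) for kinase, score in raw_scores.items()}
    return sorted(logged.items(), key=lambda kv: kv[1], reverse=True)


# Made-up raw scores, for illustration only
toy_scores = {"KINASE_A": 8.0, "KINASE_B": 0.5, "KINASE_C": 2.0}

for kinase, value in rank_by_log2(toy_scores):
    print(kinase, value)
```

Because log2 is monotonic, the order is the same as ranking by the raw score; taking the log only puts the values on the scale shown in the output above.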
We can also calculate a percentile score using a reference score sheet.
# Percentile reference sheet
y_pct = Data.get_pspa_tyr_pct()
get_pct('AEEKEyHSEGG',**param_PSPA_y, pct_ref = y_pct)
considering string: ['-5A', '-4E', '-3E', '-2K', '-1E', '0Y', '1H', '2S', '3E', '4G', '5G']
| | log2(score) | percentile |
---|---|---|
EGFR | 3.181 | 96.787423 |
FGFR4 | 2.390 | 94.012303 |
CSK | 2.308 | 95.201640 |
ZAP70 | 2.068 | 88.380041 |
SYK | 1.998 | 85.522898 |
... | ... | ... |
EPHA1 | -3.501 | 12.139440 |
FES | -3.699 | 21.216678 |
TNK1 | -4.269 | 5.481887 |
TNK2 | -4.577 | 2.050581 |
DDR2 | -4.920 | 10.403281 |
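In the table above, percentiles do not simply follow the log2(score) order (for example, FES has a lower score than EPHA1 but a higher percentile), which suggests that each kinase is compared against its own reference distribution. How `get_pct` defines the percentile is not stated here; the sketch below uses one common definition, the share of reference scores less than or equal to the query score. The helper `percentile_of` and the reference values are hypothetical.

```python
from bisect import bisect_right


def percentile_of(score, reference_scores):
    """Percentage of reference scores that are <= `score` (one common definition)."""
    ordered = sorted(reference_scores)
    return 100.0 * bisect_right(ordered, score) / len(ordered)


# Made-up reference distribution for a single kinase, for illustration only
toy_reference = [-4.0, -2.5, -1.0, 0.0, 0.5, 1.2, 2.0, 2.8, 3.5, 4.1]

print(percentile_of(3.181, toy_reference))
```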
High-throughput substrate scoring on a dataframe
Load your csv
# df = pd.read_csv('your_file.csv')
Load a demo df
# Load a demo df with phosphorylation sites
df = Data.get_ochoa_site().head()
df.iloc[:,-2:]
| | site_seq | gene_site |
---|---|---|
0 | VDDEKGDSNDDYDSA | A0A075B6Q4_S24 |
1 | YDSAGLLSDEDCMSV | A0A075B6Q4_S35 |
2 | IADHLFWSEETKSRF | A0A075B6Q4_S57 |
3 | KSRFTEYSMTSSVMR | A0A075B6Q4_S68 |
4 | FTEYSMTSSVMRRNE | A0A075B6Q4_S71 |
Set the column name and param to calculate
Here we choose param_CDDM_upper because the sequences in the demo df are all capital. You can also choose other params.
results = predict_kinase_df(df,'site_seq',**param_CDDM_upper)
results
input dataframe has a length 5
Preprocessing
Finish preprocessing
Calculating position: [-7, -6, -5, -4, -3, -2, -1, 0, 1, 2, 3, 4, 5, 6, 7]
100%|██████████| 289/289 [00:05<00:00, 56.64it/s]
kinase | SRC | EPHA3 | FES | NTRK3 | ALK | EPHA8 | ABL1 | FLT3 | EPHB2 | FYN | ... | MEK5 | PKN2 | MAP2K7 | MRCKB | HIPK3 | CDK8 | BUB1 | MEKK3 | MAP2K3 | GRK1 |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 0.991760 | 1.093712 | 1.051750 | 1.067134 | 1.013682 | 1.097519 | 0.966379 | 0.982464 | 1.054986 | 1.055910 | ... | 1.314859 | 1.635470 | 1.652251 | 1.622672 | 1.362973 | 1.797155 | 1.305198 | 1.423618 | 1.504941 | 1.872020 |
1 | 0.910262 | 0.953743 | 0.942327 | 0.950601 | 0.872694 | 0.932586 | 0.846899 | 0.826662 | 0.915020 | 0.942713 | ... | 1.175454 | 1.402006 | 1.430392 | 1.215826 | 1.569373 | 1.716455 | 1.270999 | 1.195081 | 1.223082 | 1.793290 |
2 | 0.849866 | 0.899910 | 0.848895 | 0.879652 | 0.874959 | 0.899414 | 0.839200 | 0.836523 | 0.858040 | 0.867269 | ... | 1.408003 | 1.813739 | 1.454786 | 1.084522 | 1.352556 | 1.524663 | 1.377839 | 1.173830 | 1.305691 | 1.811849 |
3 | 0.803826 | 0.836527 | 0.800759 | 0.894570 | 0.839905 | 0.781001 | 0.847847 | 0.807040 | 0.805877 | 0.801402 | ... | 1.110307 | 1.703637 | 1.795092 | 1.469653 | 1.549936 | 1.491344 | 1.446922 | 1.055452 | 1.534895 | 1.741090 |
4 | 0.822793 | 0.796532 | 0.792343 | 0.839882 | 0.810122 | 0.781420 | 0.805251 | 0.795022 | 0.790380 | 0.864538 | ... | 1.062617 | 1.357689 | 1.485945 | 1.249266 | 1.456078 | 1.422782 | 1.376471 | 1.089629 | 1.121309 | 1.697524 |
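Each row of the result holds one score per kinase for one site, so a typical next step is to pick the highest-scoring kinases for each site. The sketch below does this for a single row represented as a plain dictionary; `top_kinases` is a hypothetical helper, and the values are a handful copied from row 0 of the table above rather than all 289 kinases.

```python
import heapq


def top_kinases(row_scores, k=3):
    """Return the k highest-scoring (kinase, score) pairs for one site."""
    return heapq.nlargest(k, row_scores.items(), key=lambda kv: kv[1])


# A few values copied from row 0 of the table above (not the full row)
row0 = {
    "SRC": 0.991760,
    "EPHA3": 1.093712,
    "PKN2": 1.635470,
    "MAP2K7": 1.652251,
    "CDK8": 1.797155,
    "GRK1": 1.872020,
}

print(top_kinases(row0))
```

On the DataFrame itself, assuming `results` holds only kinase score columns, pandas' `results.idxmax(axis=1)` returns the top-scoring column name for each row.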
Phosphorylation sites
Besides calculating sequence scores, we also provide multiple datasets of phosphorylation sites.
CPTAC pan-cancer phosphoproteomics
df = Data.get_cptac_ensembl_site()
df.head(3)
| | gene | site | site_seq | protein | gene_name | gene_site | protein_site |
---|---|---|---|---|---|---|---|
0 | ENSG00000003056.8 | S267 | DDQLGEESEERDDHL | ENSP00000000412.3 | M6PR | M6PR_S267 | ENSP00000000412_S267 |
1 | ENSG00000003056.8 | S267 | DDQLGEESEERDDHL | ENSP00000440488.2 | M6PR | M6PR_S267 | ENSP00000440488_S267 |
2 | ENSG00000048028.11 | S1053 | PPTIRPNSPYDLCSR | ENSP00000003302.4 | USP28 | USP28_S1053 | ENSP00000003302_S1053 |
Ochoa et al. human phosphoproteome
df = Data.get_ochoa_site()
df.head(3)
| | uniprot | position | residue | is_disopred | disopred_score | log10_hotspot_pval_min | isHotspot | uniprot_position | functional_score | current_uniprot | name | gene | Sequence | is_valid | site_seq | gene_site |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | A0A075B6Q4 | 24 | S | True | 0.91 | 6.839384 | True | A0A075B6Q4_24 | 0.149257 | A0A075B6Q4 | A0A075B6Q4_HUMAN | None | MDIQKSENEDDSEWEDVDDEKGDSNDDYDSAGLLSDEDCMSVPGKT... | True | VDDEKGDSNDDYDSA | A0A075B6Q4_S24 |
1 | A0A075B6Q4 | 35 | S | True | 0.87 | 9.192622 | False | A0A075B6Q4_35 | 0.136966 | A0A075B6Q4 | A0A075B6Q4_HUMAN | None | MDIQKSENEDDSEWEDVDDEKGDSNDDYDSAGLLSDEDCMSVPGKT... | True | YDSAGLLSDEDCMSV | A0A075B6Q4_S35 |
2 | A0A075B6Q4 | 57 | S | False | 0.28 | 0.818834 | False | A0A075B6Q4_57 | 0.125364 | A0A075B6Q4 | A0A075B6Q4_HUMAN | None | MDIQKSENEDDSEWEDVDDEKGDSNDDYDSAGLLSDEDCMSVPGKT... | True | IADHLFWSEETKSRF | A0A075B6Q4_S57 |
PhosphoSitePlus human phosphorylation site
df = Data.get_psp_human_site()
df.head(3)
| | gene | protein | uniprot | site | gene_site | SITE_GRP_ID | species | site_seq | LT_LIT | MS_LIT | MS_CST | CST_CAT# | Ambiguous_Site |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | YWHAB | 14-3-3 beta | P31946 | T2 | YWHAB_T2 | 15718712 | human | ______MtMDksELV | NaN | 3.0 | 1.0 | None | 0 |
1 | YWHAB | 14-3-3 beta | P31946 | S6 | YWHAB_S6 | 15718709 | human | __MtMDksELVQkAk | NaN | 8.0 | NaN | None | 0 |
2 | YWHAB | 14-3-3 beta | P31946 | Y21 | YWHAB_Y21 | 3426383 | human | LAEQAERyDDMAAAM | NaN | NaN | 4.0 | None | 0 |
Unique sites of combined Ochoa & PhosphoSitePlus
df = Data.get_combine_site_psp_ochoa()
df.head(3)
| | site_seq | gene_site | gene | source | num_site | acceptor | -7 | -6 | -5 | -4 | ... | -2 | -1 | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | AAAAAAASGGAGSDN | PBX1_S136 | PBX1 | ochoa | 1 | S | A | A | A | A | ... | A | A | S | G | G | A | G | S | D | N |
1 | AAAAAAASGGGVSPD | PBX2_S146 | PBX2 | ochoa | 1 | S | A | A | A | A | ... | A | A | S | G | G | G | V | S | P | D |
2 | AAAAAAASGVTTGKP | CLASR_S349 | CLASR | ochoa | 1 | S | A | A | A | A | ... | A | A | S | G | V | T | T | G | K | P |
Phosphorylation site sequence examples
All capital - length 15 (-7 to +7)
- QSEEEKLSPSPTTED
- TLQHVPDYRQNVYIP
- TMGLSARYGPQFTLQ
All capital - length 10 (-5 to +4)
- SRDPHYQDPH
- LDNPDYQQDF
- AAAAASGGAG
With lowercase - length 15 (-7 to +7)
- QsEEEKLsPsPTTED
- TLQHVPDyRQNVYIP
- TMGLsARyGPQFTLQ
With lowercase - length 10 (-5 to +4)
- sRDPHyQDPH
- LDNPDyQQDF
- AAAAAsGGAG
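Before scoring, it can help to confirm that a sequence is centered on a phospho-acceptor and, when using one of the all-capital params, to drop the lowercase phosphorylation marks. The sketch below is one way to do both. `check_site` and `to_all_capital` are hypothetical helpers, not katlas functions, and they assume that the acceptor sits at index `len(seq) // 2`, consistent with the (-7 to +7) and (-5 to +4) examples above.

```python
PHOSPHO_ACCEPTORS = set("STY")


def check_site(site_seq):
    """Return the central residue if it is S, T, or Y (any case); raise otherwise.

    Assumption: the acceptor is at index len(site_seq) // 2.
    """
    center = site_seq[len(site_seq) // 2]
    if center.upper() not in PHOSPHO_ACCEPTORS:
        raise ValueError(f"central residue {center!r} is not S, T, or Y")
    return center


def to_all_capital(site_seq):
    """Upper-case the sequence, discarding lowercase phosphorylation marks."""
    return site_seq.upper()


print(check_site("QsEEEKLsPsPTTED"))      # length 15, centered on index 7
print(check_site("sRDPHyQDPH"))           # length 10, centered on index 5
print(to_all_capital("TMGLsARyGPQFTLQ"))  # all-capital form of the same site
```

Upper-casing loses the information about neighboring phosphorylated residues, so it only makes sense when that information is not wanted for scoring.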