Skip to main content

tools for predicting kinome specificities

Project description

KATLAS

Open In Colab

KATLAS is a repository containing python tools to predict kinases given a substrate sequence. It also contains datasets of kinase substrate specificities and human phosphoproteomics.

References: Please cite the appropriate papers if KATLAS is helpful to your research.

Tutorials on Colab

Install

Install the latest version through git

!pip install git+https://github.com/sky1ove/katlas.git -Uqq

Import

from katlas.core import *

Quick start

We provide two methods to calculate substrate sequence:

  • Computational Data-Driven Method (CDDM)
  • Positional Scanning Peptide Array (PSPA)

We consider the input in two formats:

  • a single input string (phosphorylation site)
  • a csv/dataframe that contains a column of phosphorylation sites

For input sequences, we also consider it in two conditions:

  • all capital
  • contains lower cases indicating phosphorylation status

Single sequence as input

CDDM, all capital

predict_kinase('AAAAAAASGGAGSDN',**param_CDDM_upper)
considering string: ['-7A', '-6A', '-5A', '-4A', '-3A', '-2A', '-1A', '0S', '1G', '2G', '3A', '4G', '5S', '6D', '7N']

kinase
PAK6     2.032
ULK3     2.032
PRKX     2.012
ATR      1.991
PRKD1    1.988
         ...  
DDR2     0.928
EPHA4    0.928
TEK      0.921
KIT      0.915
FGFR3    0.910
Length: 289, dtype: float64

CDDM, with lower case indicating phosphorylation status

predict_kinase('AAAAAAAsGGAGsDN',**param_CDDM)
considering string: ['-7A', '-6A', '-5A', '-4A', '-3A', '-2A', '-1A', '0s', '1G', '2G', '3A', '4G', '5s', '6D', '7N']

kinase
ULK3     1.987
PAK6     1.981
PRKD1    1.946
PIM3     1.944
PRKX     1.939
         ...  
EPHA4    0.905
EGFR     0.900
TEK      0.898
FGFR3    0.894
KIT      0.882
Length: 289, dtype: float64

PSPA, with lower case indicating phosphorylation status

predict_kinase('AEEKEyHsEGG',**param_PSPA).head()
considering string: ['-5A', '-4E', '-3E', '-2K', '-1E', '0y', '1H', '2s', '3E', '4G', '5G']

kinase
EGFR     4.013
FGFR4    3.568
ZAP70    3.412
CSK      3.241
SYK      3.209
dtype: float64

To replicate the results from The Kinase Library (PSPA)

Check this link: The Kinase Library, and use log2(score) to rank, it shows same results with the below (with slight differences due to rounding).

predict_kinase('AEEKEyHSEGG',**param_PSPA).head(10)
considering string: ['-5A', '-4E', '-3E', '-2K', '-1E', '0y', '1H', '2S', '3E', '4G', '5G']

kinase
EGFR         3.181
FGFR4        2.390
CSK          2.308
ZAP70        2.068
SYK          1.998
PDHK1_TYR    1.922
RET          1.732
MATK         1.688
FLT1         1.627
BMPR2_TYR    1.456
dtype: float64
  • So far The kinase Library considers all tyr sequences in capital regardless of whether or not they contain lower cases, which is a small bug and should be fixed soon.
  • Kinase with “_TYR” indicates it is a dual specificity kinase tested in PSPA tyrosine setting, which has not been included in kinase-library yet.

We can also calculate the percentile score using a referenced score sheet.

# Percentile reference sheet
y_pct = Data.get_pspa_tyr_pct()

get_pct('AEEKEyHSEGG',**param_PSPA_y, pct_ref = y_pct)
considering string: ['-5A', '-4E', '-3E', '-2K', '-1E', '0Y', '1H', '2S', '3E', '4G', '5G']
log2(score) percentile
EGFR 3.181 96.787423
FGFR4 2.390 94.012303
CSK 2.308 95.201640
ZAP70 2.068 88.380041
SYK 1.998 85.522898
... ... ...
EPHA1 -3.501 12.139440
FES -3.699 21.216678
TNK1 -4.269 5.481887
TNK2 -4.577 2.050581
DDR2 -4.920 10.403281

High-throughput substrate scoring on a dataframe

Load your csv

# df = pd.read_csv('your_file.csv')

Load a demo df

# Load a demo df with phosphorylation sites
df = Data.get_ochoa_site().head()
df.iloc[:,-2:]
site_seq gene_site
0 VDDEKGDSNDDYDSA A0A075B6Q4_S24
1 YDSAGLLSDEDCMSV A0A075B6Q4_S35
2 IADHLFWSEETKSRF A0A075B6Q4_S57
3 KSRFTEYSMTSSVMR A0A075B6Q4_S68
4 FTEYSMTSSVMRRNE A0A075B6Q4_S71

Set the column name and param to calculate

Here we choose param_CDDM_upper, as the sequences in the demo df are all in capital. You can also choose other params.

results = predict_kinase_df(df,'site_seq',**param_CDDM_upper)
results
input dataframe has a length 5
Preprocessing
Finish preprocessing
Calculating position: [-7, -6, -5, -4, -3, -2, -1, 0, 1, 2, 3, 4, 5, 6, 7]

100%|██████████| 289/289 [00:05<00:00, 56.64it/s]
kinase SRC EPHA3 FES NTRK3 ALK EPHA8 ABL1 FLT3 EPHB2 FYN ... MEK5 PKN2 MAP2K7 MRCKB HIPK3 CDK8 BUB1 MEKK3 MAP2K3 GRK1
0 0.991760 1.093712 1.051750 1.067134 1.013682 1.097519 0.966379 0.982464 1.054986 1.055910 ... 1.314859 1.635470 1.652251 1.622672 1.362973 1.797155 1.305198 1.423618 1.504941 1.872020
1 0.910262 0.953743 0.942327 0.950601 0.872694 0.932586 0.846899 0.826662 0.915020 0.942713 ... 1.175454 1.402006 1.430392 1.215826 1.569373 1.716455 1.270999 1.195081 1.223082 1.793290
2 0.849866 0.899910 0.848895 0.879652 0.874959 0.899414 0.839200 0.836523 0.858040 0.867269 ... 1.408003 1.813739 1.454786 1.084522 1.352556 1.524663 1.377839 1.173830 1.305691 1.811849
3 0.803826 0.836527 0.800759 0.894570 0.839905 0.781001 0.847847 0.807040 0.805877 0.801402 ... 1.110307 1.703637 1.795092 1.469653 1.549936 1.491344 1.446922 1.055452 1.534895 1.741090
4 0.822793 0.796532 0.792343 0.839882 0.810122 0.781420 0.805251 0.795022 0.790380 0.864538 ... 1.062617 1.357689 1.485945 1.249266 1.456078 1.422782 1.376471 1.089629 1.121309 1.697524

Phosphorylation sites

Besides calculating sequence scores, we also provides multiple datasets of phosphorylation sites.

CPTAC pan-cancer phosphoproteomics

df = Data.get_cptac_ensembl_site()
df.head(3)
gene site site_seq protein gene_name gene_site protein_site
0 ENSG00000003056.8 S267 DDQLGEESEERDDHL ENSP00000000412.3 M6PR M6PR_S267 ENSP00000000412_S267
1 ENSG00000003056.8 S267 DDQLGEESEERDDHL ENSP00000440488.2 M6PR M6PR_S267 ENSP00000440488_S267
2 ENSG00000048028.11 S1053 PPTIRPNSPYDLCSR ENSP00000003302.4 USP28 USP28_S1053 ENSP00000003302_S1053

Ochoa et al. human phosphoproteome

df = Data.get_ochoa_site()
df.head(3)
uniprot position residue is_disopred disopred_score log10_hotspot_pval_min isHotspot uniprot_position functional_score current_uniprot name gene Sequence is_valid site_seq gene_site
0 A0A075B6Q4 24 S True 0.91 6.839384 True A0A075B6Q4_24 0.149257 A0A075B6Q4 A0A075B6Q4_HUMAN None MDIQKSENEDDSEWEDVDDEKGDSNDDYDSAGLLSDEDCMSVPGKT... True VDDEKGDSNDDYDSA A0A075B6Q4_S24
1 A0A075B6Q4 35 S True 0.87 9.192622 False A0A075B6Q4_35 0.136966 A0A075B6Q4 A0A075B6Q4_HUMAN None MDIQKSENEDDSEWEDVDDEKGDSNDDYDSAGLLSDEDCMSVPGKT... True YDSAGLLSDEDCMSV A0A075B6Q4_S35
2 A0A075B6Q4 57 S False 0.28 0.818834 False A0A075B6Q4_57 0.125364 A0A075B6Q4 A0A075B6Q4_HUMAN None MDIQKSENEDDSEWEDVDDEKGDSNDDYDSAGLLSDEDCMSVPGKT... True IADHLFWSEETKSRF A0A075B6Q4_S57

PhosphoSitePlus human phosphorylation site

df = Data.get_psp_human_site()
df.head(3)
gene protein uniprot site gene_site SITE_GRP_ID species site_seq LT_LIT MS_LIT MS_CST CST_CAT# Ambiguous_Site
0 YWHAB 14-3-3 beta P31946 T2 YWHAB_T2 15718712 human ______MtMDksELV NaN 3.0 1.0 None 0
1 YWHAB 14-3-3 beta P31946 S6 YWHAB_S6 15718709 human __MtMDksELVQkAk NaN 8.0 NaN None 0
2 YWHAB 14-3-3 beta P31946 Y21 YWHAB_Y21 3426383 human LAEQAERyDDMAAAM NaN NaN 4.0 None 0

Unique sites of combined Ochoa & PhosphoSitePlus

df = Data.get_combine_site_psp_ochoa()
df.head(3)
site_seq gene_site gene source num_site acceptor -7 -6 -5 -4 ... -2 -1 0 1 2 3 4 5 6 7
0 AAAAAAASGGAGSDN PBX1_S136 PBX1 ochoa 1 S A A A A ... A A S G G A G S D N
1 AAAAAAASGGGVSPD PBX2_S146 PBX2 ochoa 1 S A A A A ... A A S G G G V S P D
2 AAAAAAASGVTTGKP CLASR_S349 CLASR ochoa 1 S A A A A ... A A S G V T T G K P

Phosphorylation site sequence example

All capital - 15 length (-7 to +7)

  • QSEEEKLSPSPTTED
  • TLQHVPDYRQNVYIP
  • TMGLSARyGPQFTLQ

All capital - 10 length (-5 to +4)

  • SRDPHYQDPH
  • LDNPDyQQDF
  • AAAAAsGGAG

With lowercase - (-7 to +7)

  • QsEEEKLsPsPTTED
  • TLQHVPDyRQNVYIP
  • TMGLsARyGPQFTLQ

With lowercase - (-5 to +4)

  • sRDPHyQDPH
  • LDNPDyQQDF
  • AAAAAsGGAG

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

python-katlas-0.0.4.tar.gz (42.7 kB view details)

Uploaded Source

Built Distribution

python_katlas-0.0.4-py3-none-any.whl (38.8 kB view details)

Uploaded Python 3

File details

Details for the file python-katlas-0.0.4.tar.gz.

File metadata

  • Download URL: python-katlas-0.0.4.tar.gz
  • Upload date:
  • Size: 42.7 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/5.1.1 CPython/3.9.13

File hashes

Hashes for python-katlas-0.0.4.tar.gz
Algorithm Hash digest
SHA256 f8d647acc4396445df914eb5ef8d40c42a74cafc259e51e687691f6ece7abb40
MD5 33931b711bced7b35bfde8b0a131e9cd
BLAKE2b-256 91cb7519019482676d445f4900fa8582069377311725c590f34ce0323635fff6

See more details on using hashes here.

File details

Details for the file python_katlas-0.0.4-py3-none-any.whl.

File metadata

File hashes

Hashes for python_katlas-0.0.4-py3-none-any.whl
Algorithm Hash digest
SHA256 b491ad1ce4ad3429c89312f4de288e9b1d49bda891f3599835e335f1bf654fb4
MD5 613cf3172d62ddba8977a12c24ff2dfb
BLAKE2b-256 5fa2a13811c8406ac324511f7bbbdc4c89fab9cafbb5f04ed880beed48619324

See more details on using hashes here.

Supported by

AWS AWS Cloud computing and Security Sponsor Datadog Datadog Monitoring Fastly Fastly CDN Google Google Download Analytics Microsoft Microsoft PSF Sponsor Pingdom Pingdom Monitoring Sentry Sentry Error logging StatusPage StatusPage Status page