tools for predicting kinome specificities

KATLAS

KATLAS is a repository of Python tools for predicting kinases from a substrate sequence. It also contains datasets of kinase substrate specificities and human phosphoproteomics.

References: Please cite the appropriate papers if KATLAS is helpful to your research.

Reproduce datasets & figures

Follow the instructions in katlas_raw: https://github.com/sky1ove/katlas_raw

This requires the package with its dev extras: pip install 'python-katlas[dev]' -U

Web applications

Users can now run the analysis directly on the web without needing to code.

Check out our latest web platform: kinase-atlas.com

Tutorials on Colab

Install

pip install python-katlas -U

To use modules other than the core, install the dev extras: pip install 'python-katlas[dev]' -U

Import

from katlas.core import *

Quick start

We provide two methods to score a substrate sequence:

  • Computational Data-Driven Method (CDDM)
  • Positional Scanning Peptide Array (PSPA)

Input is accepted in two formats:

  • a single input string (phosphorylation site)
  • a csv/dataframe that contains a column of phosphorylation sites

Input sequences can be written in one of two ways:

  • all capital
  • with lowercase letters indicating phosphorylation status
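
The two conditions above can be told apart with a simple check. The sketch below is not part of KATLAS: `has_lowercase_phospho` and `pick_cddm_param_name` are hypothetical helpers, and the mapping to the names param_CDDM_upper and param_CDDM only mirrors the examples shown in this README.

```python
# Hypothetical helpers, not part of the katlas package.
PHOSPHO_LOWER = set("sty")  # lowercase s/t/y mark phosphorylated residues


def has_lowercase_phospho(seq: str) -> bool:
    """Return True if the sequence contains a lowercase s, t or y."""
    return any(ch in PHOSPHO_LOWER for ch in seq)


def pick_cddm_param_name(seq: str) -> str:
    """Return the name of the CDDM parameter set matching the sequence's case."""
    return "param_CDDM" if has_lowercase_phospho(seq) else "param_CDDM_upper"


print(pick_cddm_param_name("AAAAAAASGGAGSDN"))  # param_CDDM_upper
print(pick_cddm_param_name("AAAAAAAsGGAGsDN"))  # param_CDDM
```

The helper returns a name rather than the parameter object itself, so the sketch runs without katlas installed.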

Single sequence as input

CDDM, all capital

predict_kinase('AAAAAAASGGAGSDN',**param_CDDM_upper)
considering string: ['-7A', '-6A', '-5A', '-4A', '-3A', '-2A', '-1A', '0S', '1G', '2G', '3A', '4G', '5S', '6D', '7N']

kinase
PAK6     2.032
ULK3     2.032
PRKX     2.012
ATR      1.991
PRKD1    1.988
         ...  
DDR2     0.928
EPHA4    0.928
TEK      0.921
KIT      0.915
FGFR3    0.910
Length: 289, dtype: float64
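
The "considering string" line above labels each residue with its position relative to the phosphorylation site. A minimal sketch of that labelling, assuming the phospho-acceptor sits at index len(seq) // 2 (consistent with the outputs shown in this README); `label_positions` is a hypothetical name, and the library's own parsing may differ:

```python
def label_positions(seq: str) -> list[str]:
    """Prefix each residue with its offset from the central residue."""
    center = len(seq) // 2  # assumption: the acceptor is the middle residue
    return [f"{i - center}{aa}" for i, aa in enumerate(seq)]


print(label_positions("AAAAAAASGGAGSDN"))
```

For a 15-residue sequence this yields offsets -7 to 7, and for an 11-residue sequence -5 to 5.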

CDDM, with lower case indicating phosphorylation status

predict_kinase('AAAAAAAsGGAGsDN',**param_CDDM)
considering string: ['-7A', '-6A', '-5A', '-4A', '-3A', '-2A', '-1A', '0s', '1G', '2G', '3A', '4G', '5s', '6D', '7N']

kinase
ULK3     1.987
PAK6     1.981
PRKD1    1.946
PIM3     1.944
PRKX     1.939
         ...  
EPHA4    0.905
EGFR     0.900
TEK      0.898
FGFR3    0.894
KIT      0.882
Length: 289, dtype: float64

PSPA, with lower case indicating phosphorylation status

predict_kinase('AEEKEyHsEGG',**param_PSPA).head()
considering string: ['-5A', '-4E', '-3E', '-2K', '-1E', '0y', '1H', '2s', '3E', '4G', '5G']

kinase
EGFR     4.013
FGFR4    3.568
ZAP70    3.412
CSK      3.241
SYK      3.209
dtype: float64

To replicate the results from The Kinase Library (PSPA)

Look up the same sequence in The Kinase Library and rank by log2(score); the results match those below, with slight differences due to rounding.

predict_kinase('AEEKEyHSEGG',**param_PSPA).head(10)
considering string: ['-5A', '-4E', '-3E', '-2K', '-1E', '0y', '1H', '2S', '3E', '4G', '5G']

kinase
EGFR         3.181
FGFR4        2.390
CSK          2.308
ZAP70        2.068
SYK          1.998
PDHK1_TYR    1.922
RET          1.732
MATK         1.688
FLT1         1.627
BMPR2_TYR    1.456
dtype: float64
  • At present, The Kinase Library treats all tyrosine sequences as uppercase, regardless of whether they contain lowercase letters; this is a small bug that should be fixed soon.
  • A kinase with the “_TYR” suffix is a dual-specificity kinase tested in the PSPA tyrosine setting; these kinases are not yet included in kinase-library.

We can also calculate a percentile score using a reference score sheet.

# Percentile reference sheet
y_pct = Data.get_pspa_tyr_pct()

get_pct('AEEKEyHSEGG',**param_PSPA_y, pct_ref = y_pct)
considering string: ['-5A', '-4E', '-3E', '-2K', '-1E', '0Y', '1H', '2S', '3E', '4G', '5G']
       log2(score)  percentile
EGFR         3.181   96.787423
FGFR4        2.390   94.012303
CSK          2.308   95.201640
ZAP70        2.068   88.380041
SYK          1.998   85.522898
...            ...         ...
EPHA1       -3.501   12.139440
FES         -3.699   21.216678
TNK1        -4.269    5.481887
TNK2        -4.577    2.050581
DDR2        -4.920   10.403281

93 rows × 2 columns
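
To illustrate what a percentile against a reference sheet means, here is a minimal sketch. It assumes the percentile is the share of reference scores less than or equal to the query score; get_pct may define it differently. `percentile_of` and the reference values are hypothetical and are not taken from the KATLAS reference sheet.

```python
from bisect import bisect_right


def percentile_of(score: float, reference: list[float]) -> float:
    """Percentage of reference scores that are <= the given score."""
    ordered = sorted(reference)
    return 100.0 * bisect_right(ordered, score) / len(ordered)


# Hypothetical reference log2(score) values for one kinase
reference_scores = [-4.0, -2.0, 0.0, 1.0, 3.0]

print(percentile_of(1.0, reference_scores))  # 80.0
```

A percentile makes scores comparable across kinases whose raw score ranges differ, which is why the table above does not sort identically by the two columns.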

High-throughput substrate scoring on a dataframe

Load your csv

# df = pd.read_csv('your_file.csv')

Load a demo df

# Load a demo df with phosphorylation sites
df = Data.get_ochoa_site().head()
df.iloc[:,-2:]
          site_seq       gene_site
0  VDDEKGDSNDDYDSA  A0A075B6Q4_S24
1  YDSAGLLSDEDCMSV  A0A075B6Q4_S35
2  IADHLFWSEETKSRF  A0A075B6Q4_S57
3  KSRFTEYSMTSSVMR  A0A075B6Q4_S68
4  FTEYSMTSSVMRRNE  A0A075B6Q4_S71

Set the column name and param to calculate

Here we choose param_CDDM_upper because the sequences in the demo df are all uppercase. Other parameter sets can be used as well.

results = predict_kinase_df(df,'site_seq',**param_CDDM_upper)
results
input dataframe has a length 5
Preprocessing
Finish preprocessing
Calculating position: [-7, -6, -5, -4, -3, -2, -1, 0, 1, 2, 3, 4, 5, 6, 7]

100%|██████████| 289/289 [00:05<00:00, 56.64it/s]
kinase SRC EPHA3 FES NTRK3 ALK EPHA8 ABL1 FLT3 EPHB2 FYN ... MEK5 PKN2 MAP2K7 MRCKB HIPK3 CDK8 BUB1 MEKK3 MAP2K3 GRK1
0 0.991760 1.093712 1.051750 1.067134 1.013682 1.097519 0.966379 0.982464 1.054986 1.055910 ... 1.314859 1.635470 1.652251 1.622672 1.362973 1.797155 1.305198 1.423618 1.504941 1.872020
1 0.910262 0.953743 0.942327 0.950601 0.872694 0.932586 0.846899 0.826662 0.915020 0.942713 ... 1.175454 1.402006 1.430392 1.215826 1.569373 1.716455 1.270999 1.195081 1.223082 1.793290
2 0.849866 0.899910 0.848895 0.879652 0.874959 0.899414 0.839200 0.836523 0.858040 0.867269 ... 1.408003 1.813739 1.454786 1.084522 1.352556 1.524663 1.377839 1.173830 1.305691 1.811849
3 0.803826 0.836527 0.800759 0.894570 0.839905 0.781001 0.847847 0.807040 0.805877 0.801402 ... 1.110307 1.703637 1.795092 1.469653 1.549936 1.491344 1.446922 1.055452 1.534895 1.741090
4 0.822793 0.796532 0.792343 0.839882 0.810122 0.781420 0.805251 0.795022 0.790380 0.864538 ... 1.062617 1.357689 1.485945 1.249266 1.456078 1.422782 1.376471 1.089629 1.121309 1.697524

5 rows × 289 columns
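
Each row of the result holds one score per kinase, so a common next step is to pick the highest-scoring kinases for each site. The sketch below uses plain dictionaries and only four of the 289 columns, with values copied from the first two rows above, so the ranking is among that subset only. `top_kinases` is a hypothetical helper, not a KATLAS function.

```python
from heapq import nlargest

# Subset of the scores shown above (rows 0 and 1, four kinases only)
scores = {
    0: {"SRC": 0.991760, "MAP2K7": 1.652251, "CDK8": 1.797155, "GRK1": 1.872020},
    1: {"SRC": 0.910262, "MAP2K7": 1.430392, "CDK8": 1.716455, "GRK1": 1.793290},
}


def top_kinases(row: dict[str, float], k: int = 2) -> list[str]:
    """Return the k kinase names with the highest scores in a row."""
    return nlargest(k, row, key=row.get)


for index, row in scores.items():
    print(index, top_kinases(row))
```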

Phosphorylation sites

Besides calculating sequence scores, we also provide multiple datasets of phosphorylation sites.

CPTAC pan-cancer phosphoproteomics

df = Data.get_cptac_ensembl_site()
df.head(3)
                 gene   site         site_seq            protein  gene_name    gene_site           protein_site
0   ENSG00000003056.8   S267  DDQLGEESEERDDHL  ENSP00000000412.3       M6PR    M6PR_S267   ENSP00000000412_S267
1   ENSG00000003056.8   S267  DDQLGEESEERDDHL  ENSP00000440488.2       M6PR    M6PR_S267   ENSP00000440488_S267
2  ENSG00000048028.11  S1053  PPTIRPNSPYDLCSR  ENSP00000003302.4      USP28  USP28_S1053  ENSP00000003302_S1053

Ochoa et al. human phosphoproteome

df = Data.get_ochoa_site()
df.head(3)
uniprot position residue is_disopred disopred_score log10_hotspot_pval_min isHotspot uniprot_position functional_score current_uniprot name gene Sequence is_valid site_seq gene_site
0 A0A075B6Q4 24 S True 0.91 6.839384 True A0A075B6Q4_24 0.149257 A0A075B6Q4 A0A075B6Q4_HUMAN None MDIQKSENEDDSEWEDVDDEKGDSNDDYDSAGLLSDEDCMSVPGKT... True VDDEKGDSNDDYDSA A0A075B6Q4_S24
1 A0A075B6Q4 35 S True 0.87 9.192622 False A0A075B6Q4_35 0.136966 A0A075B6Q4 A0A075B6Q4_HUMAN None MDIQKSENEDDSEWEDVDDEKGDSNDDYDSAGLLSDEDCMSVPGKT... True YDSAGLLSDEDCMSV A0A075B6Q4_S35
2 A0A075B6Q4 57 S False 0.28 0.818834 False A0A075B6Q4_57 0.125364 A0A075B6Q4 A0A075B6Q4_HUMAN None MDIQKSENEDDSEWEDVDDEKGDSNDDYDSAGLLSDEDCMSVPGKT... True IADHLFWSEETKSRF A0A075B6Q4_S57
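
In the table above, site_seq is the 15-residue window around the 1-based position in Sequence. A minimal sketch of that relationship follows; `extract_window` is a hypothetical helper, and padding with "_" at the protein termini is an assumption. The protein string is only the prefix visible in the truncated Sequence column, which is enough for position 24.

```python
def extract_window(protein_seq: str, position: int, flank: int = 7, pad: str = "_") -> str:
    """Return the residues from position-flank to position+flank (1-based position)."""
    idx = position - 1
    if not 0 <= idx < len(protein_seq):
        raise ValueError("position is outside the protein sequence")
    # Pad both ends so windows near a terminus keep the same length
    padded = pad * flank + protein_seq + pad * flank
    return padded[idx : idx + 2 * flank + 1]


# Visible prefix of the Sequence column for A0A075B6Q4 (the rest is truncated)
prefix = "MDIQKSENEDDSEWEDVDDEKGDSNDDYDSAGLLSDEDCMSVPGKT"

print(extract_window(prefix, 24))  # VDDEKGDSNDDYDSA
```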

PhosphoSitePlus human phosphorylation sites

df = Data.get_psp_human_site()
df.head(3)
gene protein uniprot site gene_site SITE_GRP_ID species site_seq LT_LIT MS_LIT MS_CST CST_CAT# Ambiguous_Site
0 YWHAB 14-3-3 beta P31946 T2 YWHAB_T2 15718712 human ______MtMDksELV NaN 3.0 1.0 None 0
1 YWHAB 14-3-3 beta P31946 S6 YWHAB_S6 15718709 human __MtMDksELVQkAk NaN 8.0 NaN None 0
2 YWHAB 14-3-3 beta P31946 Y21 YWHAB_Y21 3426383 human LAEQAERyDDMAAAM NaN NaN 4.0 None 0

Unique sites of combined Ochoa & PhosphoSitePlus

df = Data.get_combine_site_psp_ochoa()
df.head(3)
          site_seq   gene_site   gene  source  num_site  acceptor  -7  -6  -5  -4  ...  -2  -1  0  1  2  3  4  5  6  7
0  AAAAAAASGGAGSDN   PBX1_S136   PBX1   ochoa         1         S   A   A   A   A  ...   A   A  S  G  G  A  G  S  D  N
1  AAAAAAASGGGVSPD   PBX2_S146   PBX2   ochoa         1         S   A   A   A   A  ...   A   A  S  G  G  G  V  S  P  D
2  AAAAAAASGVTTGKP  CLASR_S349  CLASR   ochoa         1         S   A   A   A   A  ...   A   A  S  G  V  T  T  G  K  P

3 rows × 21 columns

Phosphorylation site sequence example

All capital - length 15 (-7 to +7)

  • QSEEEKLSPSPTTED
  • TLQHVPDYRQNVYIP
  • TMGLSARYGPQFTLQ

All capital - length 10 (-5 to +4)

  • SRDPHYQDPH
  • LDNPDYQQDF
  • AAAAASGGAG

With lowercase - (-7 to +7)

  • QsEEEKLsPsPTTED
  • TLQHVPDyRQNVYIP
  • TMGLsARyGPQFTLQ

With lowercase - (-5 to +4)

  • sRDPHyQDPH
  • LDNPDyQQDF
  • AAAAAsGGAG
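
Before scoring, it can help to confirm that a site sequence has a phospho-acceptor (S, T or Y) at its center and to see where the lowercase phosphorylated residues sit. The sketch below assumes the center is at index len(seq) // 2, which matches both the 15-residue (-7 to +7) and 10-residue (-5 to +4) examples above; `check_site` is a hypothetical helper, not a KATLAS function.

```python
ACCEPTORS = set("STY")  # residues that can be phosphorylated


def check_site(seq: str) -> dict:
    """Summarise a site sequence: length, central residue and phospho offsets."""
    center = len(seq) // 2  # assumption: the acceptor is at the middle index
    acceptor = seq[center].upper()
    return {
        "length": len(seq),
        "acceptor": acceptor,
        "valid_acceptor": acceptor in ACCEPTORS,
        # offsets of lowercase s/t/y relative to the central residue
        "phospho_positions": [i - center for i, aa in enumerate(seq) if aa in "sty"],
    }


for example in ["QsEEEKLsPsPTTED", "sRDPHyQDPH", "TLQHVPDYRQNVYIP"]:
    print(example, check_site(example))
```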
