Predict the activity of CRISPR sgRNAs

Rule Set 3

Python package to predict the activity of CRISPR sgRNA sequences using Rule Set 3

Install

pip install git+ssh://git@github.com/gpp-rnd/rs3.git

Quick Start

Sequence based model

Import packages

from rs3.seq import predict_seq

Create a list of context sequences you want to predict

context_seqs = ['GACGAAAGCGACAACGCGTTCATCCGGGCA', 'AGAAAACACTAGCATCCCCACCCGCGGACT']

Predict on-target scores for these sequences with the predict_seq function, passing either 'Hsu2013' or 'Chen2013' as sequence_tracr to choose the tracrRNA to score with

predict_seq(context_seqs, sequence_tracr='Hsu2013')
array([-0.86673522,  1.09560723])
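
The example context sequences above are 30 nt long. In the activity table shown later in this README, each context sequence consists of 4 nt of upstream flank, the 20 nt sgRNA sequence, a 3 nt PAM, and 3 nt of downstream flank. The following is a minimal sketch for checking inputs before scoring, assuming that layout holds for every input (it is inferred from the examples here, not taken from the rs3 documentation). split_context is a hypothetical helper and is not part of rs3.

```python
def split_context(context_seq):
    """Split a 30 nt context sequence into flank, sgRNA, and PAM parts.

    Assumed layout (inferred from the README examples, not an rs3 API):
    4 nt upstream + 20 nt sgRNA + 3 nt PAM + 3 nt downstream.
    """
    seq = context_seq.upper()
    if len(seq) != 30:
        raise ValueError(f"expected 30 nt, got {len(seq)}: {context_seq!r}")
    if set(seq) - set("ACGT"):
        raise ValueError(f"non-ACGT characters in {context_seq!r}")
    return {
        "upstream": seq[:4],
        "sgrna": seq[4:24],
        "pam": seq[24:27],
        "downstream": seq[27:],
    }


context_seqs = ['GACGAAAGCGACAACGCGTTCATCCGGGCA', 'AGAAAACACTAGCATCCCCACCCGCGGACT']
parts = [split_context(s) for s in context_seqs]
# Collect the 3 nt PAM of each example so it can be inspected before scoring.
pams = [p["pam"] for p in parts]
print(pams)
```

Running a check like this first turns a malformed input into an explicit error rather than a score computed on a misaligned sequence.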

Target and sequence scores

With the predict function we can calculate both target scores and sequence scores. Target-based scores use information such as the amino acid sequence and whether the sgRNA targets within a protein domain.

import pandas as pd
from rs3.predict import predict
import gpplot
import seaborn as sns
import matplotlib.pyplot as plt

We'll use a list of ~250 sgRNAs from the GeckoV2 library as an example dataset

design_df = pd.read_table('test_data/sgrna-designs.txt')
design_df
Input Quota Target Taxon Target Gene ID Target Gene Symbol Target Transcript Target Reference Coords Target Alias CRISPR Mechanism Target Domain ... On-Target Rank Weight Off-Target Rank Weight Combined Rank Preselected As Matching Active Arrayed Oligos Matching Arrayed Constructs Pools Containing Matching Construct Pick Order Picking Round Picking Notes
0 PSMB7 2 9606 ENSG00000136930 PSMB7 ENST00000259457.8 NaN NaN CRISPRko CDS ... 1.0 1.0 7 GCAGATACAAGAGCAACTGA NaN BRDN0004619103 NaN 1 0 Preselected
1 PSMB7 2 9606 ENSG00000136930 PSMB7 ENST00000259457.8 NaN NaN CRISPRko CDS ... 1.0 1.0 48 AAAACTGGCACGACCATCGC NaN NaN NaN 2 0 Preselected
2 PRC1 2 9606 ENSG00000198901 PRC1 ENST00000394249.8 NaN NaN CRISPRko CDS ... 1.0 1.0 7 AAAAGATTTGCGCACCCAAG NaN NaN NaN 1 0 Preselected
3 PRC1 2 9606 ENSG00000198901 PRC1 ENST00000394249.8 NaN NaN CRISPRko CDS ... 1.0 1.0 8 CTTTGACCCAGACATAATGG NaN NaN NaN 2 0 Preselected
4 TOP1 2 9606 ENSG00000198900 TOP1 ENST00000361337.3 NaN NaN CRISPRko CDS ... 1.0 1.0 1 NaN NaN BRDN0001486452 NaN 2 1 NaN
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
395 RFC5 2 9606 ENSG00000111445 RFC5 ENST00000454402.7 NaN NaN CRISPRko CDS ... 1.0 1.0 23 TTTATATAGCTGTTTCGCAC NaN NaN NaN 1 0 Preselected
396 NXT1 2 9606 ENSG00000132661 NXT1 ENST00000254998.3 NaN NaN CRISPRko CDS ... 1.0 1.0 3 NaN NaN BRDN0002419367 NaN 2 1 NaN
397 NXT1 2 9606 ENSG00000132661 NXT1 ENST00000254998.3 NaN NaN CRISPRko CDS ... 1.0 1.0 31 TTTGCTGTCCCGCCTGTACA NaN NaN NaN 1 0 Preselected
398 NOL10 2 9606 ENSG00000115761 NOL10 ENST00000381685.10 NaN NaN CRISPRko CDS ... 1.0 1.0 3 NaN NaN NaN NaN 2 1 NaN
399 NOL10 2 9606 ENSG00000115761 NOL10 ENST00000381685.10 NaN NaN CRISPRko CDS ... 1.0 1.0 14 TTTGTCTGATGACTACTCAA NaN NaN NaN 1 0 Preselected

400 rows × 60 columns

gecko_activity = pd.read_csv('test_data/Aguirre2017_activity.csv')
gecko_activity
      sgRNA Sequence        sgRNA Context Sequence          Target Gene Symbol  Target Cut %  avg_mean_centered_neg_lfc
0     AAAAAACTTACCCCTTTGAC  AAAAAAAAAACTTACCCCTTTGACTGGCCA  CPSF6               22.2          -1.139819
1     AAAAACATTATCATTGAGCC  TGGCAAAAACATTATCATTGAGCCTGGATT  SKA3                62.3          -0.793055
2     AAAAAGAGATTGTCAAATCA  TATGAAAAAGAGATTGTCAAATCAAGGTAG  AQR                 3.8            0.946453
3     AAAAAGCATCTCTAGAAATA  TTCAAAAAAGCATCTCTAGAAATATGGTCC  ZNHIT6              61.7          -0.429590
4     AAAAAGCGAGATACCCGAAA  AAAAAAAAAGCGAGATACCCGAAAAGGCAG  ABCF1               9.4            0.734196
...   ...                   ...                             ...                 ...           ...
8654  TTTGTGGCAGCGAATCATAA  TGTCTTTGTGGCAGCGAATCATAATGGTTC  UMPS                43.8          -0.927345
8655  TTTGTTAATATCTGCTGAAC  TGAATTTGTTAATATCTGCTGAACAGGAGT  GTF2A1              40.3          -0.382060
8656  TTTGTTAGGATGTGCATTCC  TTTCTTTGTTAGGATGTGCATTCCAGGTAC  NAT10               16.4          -0.927645
8657  TTTGTTAGGTCATCGTATTG  GGTTTTTGTTAGGTCATCGTATTGAGGAAG  RPL4                33.5          -1.425502
8658  TTTGTTCCTTAGTTGCTGAC  TACTTTTGTTCCTTAGTTGCTGACAGGTCC  MRPL47              34.8          -1.268444

8659 rows × 5 columns

By listing both tracrRNAs (tracr=['Hsu2013', 'Chen2013']) and setting target=True, we calculate five scores: a sequence score for each tracrRNA, the target score, and, for each tracrRNA, the sum of its sequence score and the target score.

scored_designs = predict(design_df, tracr=['Hsu2013', 'Chen2013'], target=True,
                         n_jobs=2)
scored_designs
Getting amino acid sequences


100%|██████████| 4/4 [00:00<00:00, 94.47it/s]


Getting protein domains


100%|██████████| 200/200 [00:30<00:00,  6.50it/s]
/Users/pdeweird/opt/anaconda3/envs/rs3/lib/python3.8/site-packages/sklearn/base.py:310: UserWarning: Trying to unpickle estimator SimpleImputer from version 1.0.dev0 when using version 0.24.2. This might lead to breaking code or invalid results. Use at your own risk.
  warnings.warn(
/Users/pdeweird/opt/anaconda3/envs/rs3/lib/python3.8/site-packages/sklearn/base.py:310: UserWarning: Trying to unpickle estimator Pipeline from version 1.0.dev0 when using version 0.24.2. This might lead to breaking code or invalid results. Use at your own risk.
  warnings.warn(
Input Quota Target Taxon Target Gene ID Target Gene Symbol Target Transcript Target Reference Coords Target Alias CRISPR Mechanism Target Domain ... Pools Containing Matching Construct Pick Order Picking Round Picking Notes RS3 Sequence Score (Hsu2013 tracr) RS3 Sequence Score (Chen2013 tracr) Transcript Base RS3 Target Score RS3 Sequence (Hsu2013 tracr) + Target Score RS3 Sequence (Chen2013 tracr) + Target Score
0 PSMB7 2 9606 ENSG00000136930 PSMB7 ENST00000259457.8 NaN NaN CRISPRko CDS ... NaN 1 0 Preselected 0.750904 0.512534 ENST00000259457 0.273974 1.024878 0.786508
1 PSMB7 2 9606 ENSG00000136930 PSMB7 ENST00000259457.8 NaN NaN CRISPRko CDS ... NaN 2 0 Preselected -0.218514 -0.095684 ENST00000259457 -0.010152 -0.228667 -0.105837
2 PRC1 2 9606 ENSG00000198901 PRC1 ENST00000394249.8 NaN NaN CRISPRko CDS ... NaN 1 0 Preselected -0.126708 -0.307830 ENST00000394249 -0.018259 -0.144967 -0.326089
3 PRC1 2 9606 ENSG00000198901 PRC1 ENST00000394249.8 NaN NaN CRISPRko CDS ... NaN 2 0 Preselected 0.690050 0.390095 ENST00000394249 -0.089659 0.600392 0.300436
4 TOP1 2 9606 ENSG00000198900 TOP1 ENST00000361337.3 NaN NaN CRISPRko CDS ... NaN 2 1 NaN 0.451508 -0.169016 ENST00000361337 -0.018748 0.432760 -0.187764
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
395 RFC5 2 9606 ENSG00000111445 RFC5 ENST00000454402.7 NaN NaN CRISPRko CDS ... NaN 1 0 Preselected -0.220600 -0.022154 ENST00000454402 0.102902 -0.117698 0.080747
396 NXT1 2 9606 ENSG00000132661 NXT1 ENST00000254998.3 NaN NaN CRISPRko CDS ... NaN 2 1 NaN 0.621609 0.539656 ENST00000254998 0.220856 0.842465 0.760513
397 NXT1 2 9606 ENSG00000132661 NXT1 ENST00000254998.3 NaN NaN CRISPRko CDS ... NaN 1 0 Preselected 0.119830 0.012744 ENST00000254998 0.146767 0.266597 0.159511
398 NOL10 2 9606 ENSG00000115761 NOL10 ENST00000381685.10 NaN NaN CRISPRko CDS ... NaN 2 1 NaN 0.798633 0.646323 ENST00000381685 -0.039771 0.758861 0.606552
399 NOL10 2 9606 ENSG00000115761 NOL10 ENST00000381685.10 NaN NaN CRISPRko CDS ... NaN 1 0 Preselected 0.283254 0.264148 ENST00000381685 0.015462 0.298716 0.279611

400 rows × 66 columns
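
In the output above, each combined column appears to be the plain sum of the matching sequence score and the target score. The following is a small sketch that checks this on the first four displayed rows, with the values copied from the table. It only confirms the arithmetic for these rows and says nothing about how rs3 computes the columns internally.

```python
import numpy as np
import pandas as pd

seq_col = 'RS3 Sequence Score (Hsu2013 tracr)'
target_col = 'RS3 Target Score'
combined_col = 'RS3 Sequence (Hsu2013 tracr) + Target Score'

# First four rows of the scored_designs output shown above.
example = pd.DataFrame({
    seq_col: [0.750904, -0.218514, -0.126708, 0.690050],
    target_col: [0.273974, -0.010152, -0.018259, -0.089659],
    combined_col: [1.024878, -0.228667, -0.144967, 0.600392],
})

recomputed = example[seq_col] + example[target_col]
# The tolerance absorbs the 6-decimal rounding of the displayed values.
matches = np.isclose(recomputed, example[combined_col], atol=1e-5)
print(matches.all())
```

The same comparison can be run on the full scored_designs frame, and on the Chen2013 columns, by swapping in the corresponding column names.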

gecko_activity_scores = (gecko_activity.merge(scored_designs,
                                              how='inner',
                                              on=['sgRNA Sequence', 'sgRNA Context Sequence',
                                                  'Target Gene Symbol', 'Target Cut %']))
gecko_activity_scores
sgRNA Sequence sgRNA Context Sequence Target Gene Symbol Target Cut % avg_mean_centered_neg_lfc Input Quota Target Taxon Target Gene ID Target Transcript ... Pools Containing Matching Construct Pick Order Picking Round Picking Notes RS3 Sequence Score (Hsu2013 tracr) RS3 Sequence Score (Chen2013 tracr) Transcript Base RS3 Target Score RS3 Sequence (Hsu2013 tracr) + Target Score RS3 Sequence (Chen2013 tracr) + Target Score
0 AAAACTGGCACGACCATCGC CCGGAAAACTGGCACGACCATCGCTGGGGT PSMB7 16.4 -1.052943 PSMB7 2 9606 ENSG00000136930 ENST00000259457.8 ... NaN 2 0 Preselected -0.218514 -0.095684 ENST00000259457 -0.010152 -0.228667 -0.105837
1 AAAAGATTTGCGCACCCAAG TAGAAAAAGATTTGCGCACCCAAGTGGAAT PRC1 17.0 0.028674 PRC1 2 9606 ENSG00000198901 ENST00000394249.8 ... NaN 1 0 Preselected -0.126708 -0.307830 ENST00000394249 -0.018259 -0.144967 -0.326089
2 AAAAGTCCAAGCATAGCAAC CGGGAAAAGTCCAAGCATAGCAACAGGTAA TOP1 6.5 0.195309 TOP1 2 9606 ENSG00000198900 ENST00000361337.3 ... NaN 1 0 Preselected -0.356580 -0.082514 ENST00000361337 -0.418276 -0.774856 -0.500790
3 AAAGAAGCCTCAACTTCGTC AGCGAAAGAAGCCTCAACTTCGTCTGGAGA CENPW 37.5 -1.338209 CENPW 2 9606 ENSG00000203760 ENST00000368328.5 ... NaN 2 0 Preselected -0.663540 -0.303324 ENST00000368328 0.274739 -0.388801 -0.028585
4 AAAGTGTGCTTTGTTGGAGA TACTAAAGTGTGCTTTGTTGGAGATGGCTT NSA2 60.0 -0.175219 NSA2 2 9606 ENSG00000164346 ENST00000610426.5 ... NaN 2 0 Preselected -0.413636 -0.585179 ENST00000610426 -0.072158 -0.485794 -0.657337
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
247 TTTAGCCGGATGCGCAGTGA CTCTTTTAGCCGGATGCGCAGTGATGGTTT NKAPD1 20.6 0.627322 NKAPD1 2 9606 ENSG00000150776 ENST00000393047.8 ... CP1718 1 0 Preselected 0.298329 0.274344 ENST00000393047 -0.201158 0.097171 0.073187
248 TTTATATAGCTGTTTCGCAC TGTCTTTATATAGCTGTTTCGCACAGGCTA RFC5 21.5 -0.957190 RFC5 2 9606 ENSG00000111445 ENST00000454402.7 ... NaN 1 0 Preselected -0.220600 -0.022154 ENST00000454402 0.102902 -0.117698 0.080747
249 TTTGCTGTCCCGCCTGTACA GGCGTTTGCTGTCCCGCCTGTACATGGGCA NXT1 27.2 0.176827 NXT1 2 9606 ENSG00000132661 ENST00000254998.3 ... NaN 1 0 Preselected 0.119830 0.012744 ENST00000254998 0.146767 0.266597 0.159511
250 TTTGTCTGATGACTACTCAA AAATTTTGTCTGATGACTACTCAAAGGTAT NOL10 15.6 -0.043965 NOL10 2 9606 ENSG00000115761 ENST00000381685.10 ... NaN 1 0 Preselected 0.283254 0.264148 ENST00000381685 0.015462 0.298716 0.279611
251 TTTGTTAGGTCATCGTATTG GGTTTTTGTTAGGTCATCGTATTGAGGAAG RPL4 33.5 -1.425502 RPL4 2 9606 ENSG00000174444 ENST00000307961.11 ... CP1718 2 1 NaN -0.636302 -0.575100 ENST00000307961 0.021391 -0.614912 -0.553709

252 rows × 67 columns
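
The inner merge keeps only rows present in both tables (252 of the 400 designs here) and drops the rest without reporting them. The sketch below shows one way to audit such a merge, using pandas' indicator and validate options on made-up toy frames. The toy rows and the shortened key list are illustrative assumptions, not data from this README.

```python
import pandas as pd

# Toy stand-ins for gecko_activity and scored_designs (made-up rows).
activity = pd.DataFrame({
    'sgRNA Sequence': ['AAAA', 'CCCC', 'GGGG'],
    'Target Gene Symbol': ['G1', 'G2', 'G3'],
    'avg_mean_centered_neg_lfc': [-1.0, 0.2, 0.5],
})
designs = pd.DataFrame({
    'sgRNA Sequence': ['AAAA', 'CCCC', 'TTTT'],
    'Target Gene Symbol': ['G1', 'G2', 'G4'],
    'RS3 Target Score': [0.3, -0.1, 0.0],
})

keys = ['sgRNA Sequence', 'Target Gene Symbol']
# An outer merge with indicator=True labels each row by where it was found;
# validate raises MergeError if a key combination is duplicated on either side.
audit = activity.merge(designs, how='outer', on=keys,
                       indicator=True, validate='one_to_one')
counts = audit['_merge'].value_counts()
# Keeping only the 'both' rows reproduces what an inner merge would return.
merged = audit[audit['_merge'] == 'both'].drop(columns='_merge')
print(counts.to_dict())
print(len(merged))
```

Inspecting the left_only and right_only counts shows whether unmatched rows come from missing designs or from mismatched key values such as a differing Target Cut %.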

Since the Gecko library was screened with the tracrRNA from Hsu et al. 2013, we'll use the Hsu2013 sequence score plus the target score as our predictor of measured activity

plt.subplots(figsize=(4,4))
gpplot.point_densityplot(gecko_activity_scores, y='avg_mean_centered_neg_lfc',
                         x='RS3 Sequence (Hsu2013 tracr) + Target Score')
gpplot.add_correlation(gecko_activity_scores, y='avg_mean_centered_neg_lfc',
                       x='RS3 Sequence (Hsu2013 tracr) + Target Score')
sns.despine()

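[Figure: point density plot of avg_mean_centered_neg_lfc (y) against RS3 Sequence (Hsu2013 tracr) + Target Score (x), annotated with their correlation]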
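
The agreement between predicted and measured activity can also be summarised numerically without a plot. The following is a minimal sketch using pandas on five rows copied from the merged table above. Using Pearson correlation is an assumption, since this README does not state which correlation gpplot.add_correlation reports, and five rows are far too few to reflect the full 252-row result.

```python
import pandas as pd

activity_col = 'avg_mean_centered_neg_lfc'
score_col = 'RS3 Sequence (Hsu2013 tracr) + Target Score'

# Five rows copied from the merged table above (illustration only;
# the plot uses all 252 rows).
subset = pd.DataFrame({
    activity_col: [-1.052943, 0.028674, 0.195309, -1.338209, -0.175219],
    score_col: [-0.228667, -0.144967, -0.774856, -0.388801, -0.485794],
})

# Pearson correlation between measured activity and the combined RS3 score.
pearson_r = subset[activity_col].corr(subset[score_col], method='pearson')
print(f"Pearson r on {len(subset)} example rows: {pearson_r:.3f}")
```

On real data, the same two column names can be applied to gecko_activity_scores directly.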
