Predict the activity of CRISPR sgRNAs

Rule Set 3

Python package to predict the activity of CRISPR sgRNA sequences using Rule Set 3

Install

pip install git+ssh://git@github.com/gpp-rnd/rs3.git

Quick Start

Sequence based model

Import packages

from rs3.seq import predict_seq

Create a list of context sequences you want to predict

context_seqs = ['GACGAAAGCGACAACGCGTTCATCCGGGCA', 'AGAAAACACTAGCATCCCCACCCGCGGACT']

Predict on-target scores for these sequences with the predict_seq function, passing either 'Hsu2013' or 'Chen2013' as sequence_tracr to select the tracrRNA to score with

predict_seq(context_seqs, sequence_tracr='Hsu2013')
array([-0.86673522,  1.09560723])
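Before scoring, it can help to sanity-check the context sequences. The sketch below is not part of rs3: `split_context` is a hypothetical helper, and the layout it assumes (4 nt of 5' flank, the 20 nt sgRNA, a 3 nt PAM, 3 nt of 3' flank) is inferred from the example sequences in this README rather than taken from the package documentation.

```python
def split_context(context_seq):
    """Split a 30-nt context sequence into (5' flank, sgRNA, PAM, 3' flank).

    Hypothetical helper; the 4 + 20 + 3 + 3 layout is an assumption
    inferred from the example data in this README.
    """
    seq = context_seq.upper()
    if len(seq) != 30:
        raise ValueError(f"expected 30 nt, got {len(seq)}")
    if set(seq) - set("ACGT"):
        raise ValueError("context sequence contains non-ACGT characters")
    return seq[:4], seq[4:24], seq[24:27], seq[27:]


context_seqs = ['GACGAAAGCGACAACGCGTTCATCCGGGCA', 'AGAAAACACTAGCATCCCCACCCGCGGACT']
parts = [split_context(s) for s in context_seqs]
pams = [pam for _, _, pam, _ in parts]
print(pams)
```

Both example sequences carry an NGG PAM at positions 25 to 27 under this layout, which is a quick way to catch sequences that were extracted with the wrong offset.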

Target and sequence scores

Using the predict function we can calculate both target scores and sequence scores. Target-based scores use information such as the amino acid sequence and whether the sgRNA targets within a protein domain.

import pandas as pd
from rs3.predict import predict
import gpplot
import seaborn as sns
import matplotlib.pyplot as plt

We'll use a list of ~250 sgRNAs from the GeckoV2 library as an example dataset

design_df = pd.read_table('test_data/sgrna-designs.txt')
design_df
| | Input | Quota | Target Taxon | Target Gene ID | Target Gene Symbol | Target Transcript | Target Reference Coords | Target Alias | CRISPR Mechanism | Target Domain | ... | On-Target Rank Weight | Off-Target Rank Weight | Combined Rank | Preselected As | Matching Active Arrayed Oligos | Matching Arrayed Constructs | Pools Containing Matching Construct | Pick Order | Picking Round | Picking Notes |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | PSMB7 | 2 | 9606 | ENSG00000136930 | PSMB7 | ENST00000259457.8 | NaN | NaN | CRISPRko | CDS | ... | 1.0 | 1.0 | 7 | GCAGATACAAGAGCAACTGA | NaN | BRDN0004619103 | NaN | 1 | 0 | Preselected |
| 1 | PSMB7 | 2 | 9606 | ENSG00000136930 | PSMB7 | ENST00000259457.8 | NaN | NaN | CRISPRko | CDS | ... | 1.0 | 1.0 | 48 | AAAACTGGCACGACCATCGC | NaN | NaN | NaN | 2 | 0 | Preselected |
| 2 | PRC1 | 2 | 9606 | ENSG00000198901 | PRC1 | ENST00000394249.8 | NaN | NaN | CRISPRko | CDS | ... | 1.0 | 1.0 | 7 | AAAAGATTTGCGCACCCAAG | NaN | NaN | NaN | 1 | 0 | Preselected |
| 3 | PRC1 | 2 | 9606 | ENSG00000198901 | PRC1 | ENST00000394249.8 | NaN | NaN | CRISPRko | CDS | ... | 1.0 | 1.0 | 8 | CTTTGACCCAGACATAATGG | NaN | NaN | NaN | 2 | 0 | Preselected |
| 4 | TOP1 | 2 | 9606 | ENSG00000198900 | TOP1 | ENST00000361337.3 | NaN | NaN | CRISPRko | CDS | ... | 1.0 | 1.0 | 1 | NaN | NaN | BRDN0001486452 | NaN | 2 | 1 | NaN |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 395 | RFC5 | 2 | 9606 | ENSG00000111445 | RFC5 | ENST00000454402.7 | NaN | NaN | CRISPRko | CDS | ... | 1.0 | 1.0 | 23 | TTTATATAGCTGTTTCGCAC | NaN | NaN | NaN | 1 | 0 | Preselected |
| 396 | NXT1 | 2 | 9606 | ENSG00000132661 | NXT1 | ENST00000254998.3 | NaN | NaN | CRISPRko | CDS | ... | 1.0 | 1.0 | 3 | NaN | NaN | BRDN0002419367 | NaN | 2 | 1 | NaN |
| 397 | NXT1 | 2 | 9606 | ENSG00000132661 | NXT1 | ENST00000254998.3 | NaN | NaN | CRISPRko | CDS | ... | 1.0 | 1.0 | 31 | TTTGCTGTCCCGCCTGTACA | NaN | NaN | NaN | 1 | 0 | Preselected |
| 398 | NOL10 | 2 | 9606 | ENSG00000115761 | NOL10 | ENST00000381685.10 | NaN | NaN | CRISPRko | CDS | ... | 1.0 | 1.0 | 3 | NaN | NaN | NaN | NaN | 2 | 1 | NaN |
| 399 | NOL10 | 2 | 9606 | ENSG00000115761 | NOL10 | ENST00000381685.10 | NaN | NaN | CRISPRko | CDS | ... | 1.0 | 1.0 | 14 | TTTGTCTGATGACTACTCAA | NaN | NaN | NaN | 1 | 0 | Preselected |

400 rows × 60 columns

gecko_activity = pd.read_csv('test_data/Aguirre2017_activity.csv')
gecko_activity
| | sgRNA Sequence | sgRNA Context Sequence | Target Gene Symbol | Target Cut % | avg_mean_centered_neg_lfc |
|---|---|---|---|---|---|
| 0 | AAAAAACTTACCCCTTTGAC | AAAAAAAAAACTTACCCCTTTGACTGGCCA | CPSF6 | 22.2 | -1.139819 |
| 1 | AAAAACATTATCATTGAGCC | TGGCAAAAACATTATCATTGAGCCTGGATT | SKA3 | 62.3 | -0.793055 |
| 2 | AAAAAGAGATTGTCAAATCA | TATGAAAAAGAGATTGTCAAATCAAGGTAG | AQR | 3.8 | 0.946453 |
| 3 | AAAAAGCATCTCTAGAAATA | TTCAAAAAAGCATCTCTAGAAATATGGTCC | ZNHIT6 | 61.7 | -0.429590 |
| 4 | AAAAAGCGAGATACCCGAAA | AAAAAAAAAGCGAGATACCCGAAAAGGCAG | ABCF1 | 9.4 | 0.734196 |
| ... | ... | ... | ... | ... | ... |
| 8654 | TTTGTGGCAGCGAATCATAA | TGTCTTTGTGGCAGCGAATCATAATGGTTC | UMPS | 43.8 | -0.927345 |
| 8655 | TTTGTTAATATCTGCTGAAC | TGAATTTGTTAATATCTGCTGAACAGGAGT | GTF2A1 | 40.3 | -0.382060 |
| 8656 | TTTGTTAGGATGTGCATTCC | TTTCTTTGTTAGGATGTGCATTCCAGGTAC | NAT10 | 16.4 | -0.927645 |
| 8657 | TTTGTTAGGTCATCGTATTG | GGTTTTTGTTAGGTCATCGTATTGAGGAAG | RPL4 | 33.5 | -1.425502 |
| 8658 | TTTGTTCCTTAGTTGCTGAC | TACTTTTGTTCCTTAGTTGCTGACAGGTCC | MRPL47 | 34.8 | -1.268444 |

8659 rows × 5 columns

By passing both tracrRNAs, tracr=['Hsu2013', 'Chen2013'], and setting target=True, we calculate five scores: one sequence score for each tracrRNA, the target score, and, for each tracrRNA, the sum of its sequence score and the target score.

scored_designs = predict(design_df, tracr=['Hsu2013', 'Chen2013'], target=True,
                         n_jobs=2)
scored_designs
Getting amino acid sequences


100%|██████████| 4/4 [00:00<00:00, 94.47it/s]


Getting protein domains


100%|██████████| 200/200 [00:30<00:00,  6.50it/s]
/Users/pdeweird/opt/anaconda3/envs/rs3/lib/python3.8/site-packages/sklearn/base.py:310: UserWarning: Trying to unpickle estimator SimpleImputer from version 1.0.dev0 when using version 0.24.2. This might lead to breaking code or invalid results. Use at your own risk.
  warnings.warn(
/Users/pdeweird/opt/anaconda3/envs/rs3/lib/python3.8/site-packages/sklearn/base.py:310: UserWarning: Trying to unpickle estimator Pipeline from version 1.0.dev0 when using version 0.24.2. This might lead to breaking code or invalid results. Use at your own risk.
  warnings.warn(
| | Input | Quota | Target Taxon | Target Gene ID | Target Gene Symbol | Target Transcript | Target Reference Coords | Target Alias | CRISPR Mechanism | Target Domain | ... | Pools Containing Matching Construct | Pick Order | Picking Round | Picking Notes | RS3 Sequence Score (Hsu2013 tracr) | RS3 Sequence Score (Chen2013 tracr) | Transcript Base | RS3 Target Score | RS3 Sequence (Hsu2013 tracr) + Target Score | RS3 Sequence (Chen2013 tracr) + Target Score |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | PSMB7 | 2 | 9606 | ENSG00000136930 | PSMB7 | ENST00000259457.8 | NaN | NaN | CRISPRko | CDS | ... | NaN | 1 | 0 | Preselected | 0.750904 | 0.512534 | ENST00000259457 | 0.273974 | 1.024878 | 0.786508 |
| 1 | PSMB7 | 2 | 9606 | ENSG00000136930 | PSMB7 | ENST00000259457.8 | NaN | NaN | CRISPRko | CDS | ... | NaN | 2 | 0 | Preselected | -0.218514 | -0.095684 | ENST00000259457 | -0.010152 | -0.228667 | -0.105837 |
| 2 | PRC1 | 2 | 9606 | ENSG00000198901 | PRC1 | ENST00000394249.8 | NaN | NaN | CRISPRko | CDS | ... | NaN | 1 | 0 | Preselected | -0.126708 | -0.307830 | ENST00000394249 | -0.018259 | -0.144967 | -0.326089 |
| 3 | PRC1 | 2 | 9606 | ENSG00000198901 | PRC1 | ENST00000394249.8 | NaN | NaN | CRISPRko | CDS | ... | NaN | 2 | 0 | Preselected | 0.690050 | 0.390095 | ENST00000394249 | -0.089659 | 0.600392 | 0.300436 |
| 4 | TOP1 | 2 | 9606 | ENSG00000198900 | TOP1 | ENST00000361337.3 | NaN | NaN | CRISPRko | CDS | ... | NaN | 2 | 1 | NaN | 0.451508 | -0.169016 | ENST00000361337 | -0.018748 | 0.432760 | -0.187764 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 395 | RFC5 | 2 | 9606 | ENSG00000111445 | RFC5 | ENST00000454402.7 | NaN | NaN | CRISPRko | CDS | ... | NaN | 1 | 0 | Preselected | -0.220600 | -0.022154 | ENST00000454402 | 0.102902 | -0.117698 | 0.080747 |
| 396 | NXT1 | 2 | 9606 | ENSG00000132661 | NXT1 | ENST00000254998.3 | NaN | NaN | CRISPRko | CDS | ... | NaN | 2 | 1 | NaN | 0.621609 | 0.539656 | ENST00000254998 | 0.220856 | 0.842465 | 0.760513 |
| 397 | NXT1 | 2 | 9606 | ENSG00000132661 | NXT1 | ENST00000254998.3 | NaN | NaN | CRISPRko | CDS | ... | NaN | 1 | 0 | Preselected | 0.119830 | 0.012744 | ENST00000254998 | 0.146767 | 0.266597 | 0.159511 |
| 398 | NOL10 | 2 | 9606 | ENSG00000115761 | NOL10 | ENST00000381685.10 | NaN | NaN | CRISPRko | CDS | ... | NaN | 2 | 1 | NaN | 0.798633 | 0.646323 | ENST00000381685 | -0.039771 | 0.758861 | 0.606552 |
| 399 | NOL10 | 2 | 9606 | ENSG00000115761 | NOL10 | ENST00000381685.10 | NaN | NaN | CRISPRko | CDS | ... | NaN | 1 | 0 | Preselected | 0.283254 | 0.264148 | ENST00000381685 | 0.015462 | 0.298716 | 0.279611 |

400 rows × 66 columns
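The relationship between the score columns can be checked directly. The sketch below copies the first five rows of the output above into a small DataFrame and confirms that the combined column equals the Hsu2013 sequence score plus the target score, up to the rounding of the printed values. It only inspects the displayed numbers and makes no claim about how rs3 computes them internally.

```python
import numpy as np
import pandas as pd

seq_col = 'RS3 Sequence Score (Hsu2013 tracr)'
target_col = 'RS3 Target Score'
combined_col = 'RS3 Sequence (Hsu2013 tracr) + Target Score'

# Values copied from rows 0-4 of the scored_designs output above
scores = pd.DataFrame({
    seq_col: [0.750904, -0.218514, -0.126708, 0.690050, 0.451508],
    target_col: [0.273974, -0.010152, -0.018259, -0.089659, -0.018748],
    combined_col: [1.024878, -0.228667, -0.144967, 0.600392, 0.432760],
})

recomputed = scores[seq_col] + scores[target_col]
# Tolerance absorbs the six-decimal rounding of the displayed values
matches = np.allclose(recomputed, scores[combined_col], atol=1e-5)
print(matches)
```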

gecko_activity_scores = (gecko_activity.merge(scored_designs,
                                              how='inner',
                                              on=['sgRNA Sequence', 'sgRNA Context Sequence',
                                                  'Target Gene Symbol', 'Target Cut %']))
gecko_activity_scores
| | sgRNA Sequence | sgRNA Context Sequence | Target Gene Symbol | Target Cut % | avg_mean_centered_neg_lfc | Input | Quota | Target Taxon | Target Gene ID | Target Transcript | ... | Pools Containing Matching Construct | Pick Order | Picking Round | Picking Notes | RS3 Sequence Score (Hsu2013 tracr) | RS3 Sequence Score (Chen2013 tracr) | Transcript Base | RS3 Target Score | RS3 Sequence (Hsu2013 tracr) + Target Score | RS3 Sequence (Chen2013 tracr) + Target Score |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | AAAACTGGCACGACCATCGC | CCGGAAAACTGGCACGACCATCGCTGGGGT | PSMB7 | 16.4 | -1.052943 | PSMB7 | 2 | 9606 | ENSG00000136930 | ENST00000259457.8 | ... | NaN | 2 | 0 | Preselected | -0.218514 | -0.095684 | ENST00000259457 | -0.010152 | -0.228667 | -0.105837 |
| 1 | AAAAGATTTGCGCACCCAAG | TAGAAAAAGATTTGCGCACCCAAGTGGAAT | PRC1 | 17.0 | 0.028674 | PRC1 | 2 | 9606 | ENSG00000198901 | ENST00000394249.8 | ... | NaN | 1 | 0 | Preselected | -0.126708 | -0.307830 | ENST00000394249 | -0.018259 | -0.144967 | -0.326089 |
| 2 | AAAAGTCCAAGCATAGCAAC | CGGGAAAAGTCCAAGCATAGCAACAGGTAA | TOP1 | 6.5 | 0.195309 | TOP1 | 2 | 9606 | ENSG00000198900 | ENST00000361337.3 | ... | NaN | 1 | 0 | Preselected | -0.356580 | -0.082514 | ENST00000361337 | -0.418276 | -0.774856 | -0.500790 |
| 3 | AAAGAAGCCTCAACTTCGTC | AGCGAAAGAAGCCTCAACTTCGTCTGGAGA | CENPW | 37.5 | -1.338209 | CENPW | 2 | 9606 | ENSG00000203760 | ENST00000368328.5 | ... | NaN | 2 | 0 | Preselected | -0.663540 | -0.303324 | ENST00000368328 | 0.274739 | -0.388801 | -0.028585 |
| 4 | AAAGTGTGCTTTGTTGGAGA | TACTAAAGTGTGCTTTGTTGGAGATGGCTT | NSA2 | 60.0 | -0.175219 | NSA2 | 2 | 9606 | ENSG00000164346 | ENST00000610426.5 | ... | NaN | 2 | 0 | Preselected | -0.413636 | -0.585179 | ENST00000610426 | -0.072158 | -0.485794 | -0.657337 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 247 | TTTAGCCGGATGCGCAGTGA | CTCTTTTAGCCGGATGCGCAGTGATGGTTT | NKAPD1 | 20.6 | 0.627322 | NKAPD1 | 2 | 9606 | ENSG00000150776 | ENST00000393047.8 | ... | CP1718 | 1 | 0 | Preselected | 0.298329 | 0.274344 | ENST00000393047 | -0.201158 | 0.097171 | 0.073187 |
| 248 | TTTATATAGCTGTTTCGCAC | TGTCTTTATATAGCTGTTTCGCACAGGCTA | RFC5 | 21.5 | -0.957190 | RFC5 | 2 | 9606 | ENSG00000111445 | ENST00000454402.7 | ... | NaN | 1 | 0 | Preselected | -0.220600 | -0.022154 | ENST00000454402 | 0.102902 | -0.117698 | 0.080747 |
| 249 | TTTGCTGTCCCGCCTGTACA | GGCGTTTGCTGTCCCGCCTGTACATGGGCA | NXT1 | 27.2 | 0.176827 | NXT1 | 2 | 9606 | ENSG00000132661 | ENST00000254998.3 | ... | NaN | 1 | 0 | Preselected | 0.119830 | 0.012744 | ENST00000254998 | 0.146767 | 0.266597 | 0.159511 |
| 250 | TTTGTCTGATGACTACTCAA | AAATTTTGTCTGATGACTACTCAAAGGTAT | NOL10 | 15.6 | -0.043965 | NOL10 | 2 | 9606 | ENSG00000115761 | ENST00000381685.10 | ... | NaN | 1 | 0 | Preselected | 0.283254 | 0.264148 | ENST00000381685 | 0.015462 | 0.298716 | 0.279611 |
| 251 | TTTGTTAGGTCATCGTATTG | GGTTTTTGTTAGGTCATCGTATTGAGGAAG | RPL4 | 33.5 | -1.425502 | RPL4 | 2 | 9606 | ENSG00000174444 | ENST00000307961.11 | ... | CP1718 | 2 | 1 | NaN | -0.636302 | -0.575100 | ENST00000307961 | 0.021391 | -0.614912 | -0.553709 |

252 rows × 67 columns
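An inner merge silently drops rows that have no match, so it can be worth auditing the join. The sketch below uses two toy frames built from a few rows shown in this README; leaving the SKA3 guide out of the toy design table is purely illustrative and says nothing about the real files. It relies only on the standard pandas `merge` options `validate` and `indicator`.

```python
import pandas as pd

# Toy stand-ins for gecko_activity and scored_designs (subset of columns)
activity = pd.DataFrame({
    'sgRNA Sequence': ['AAAACTGGCACGACCATCGC', 'AAAAGATTTGCGCACCCAAG',
                       'AAAAACATTATCATTGAGCC'],
    'Target Gene Symbol': ['PSMB7', 'PRC1', 'SKA3'],
    'avg_mean_centered_neg_lfc': [-1.052943, 0.028674, -0.793055],
})
designs = pd.DataFrame({
    'sgRNA Sequence': ['AAAACTGGCACGACCATCGC', 'AAAAGATTTGCGCACCCAAG'],
    'Target Gene Symbol': ['PSMB7', 'PRC1'],
    'RS3 Target Score': [-0.010152, -0.018259],
})
keys = ['sgRNA Sequence', 'Target Gene Symbol']

# validate raises MergeError if a key combination is duplicated on either side
merged = activity.merge(designs, how='inner', on=keys, validate='one_to_one')

# A left merge with indicator=True shows which activity rows found no design
audit = activity.merge(designs, how='left', on=keys, indicator=True)
n_unmatched = int((audit['_merge'] == 'left_only').sum())
print(len(merged), n_unmatched)
```

The merge in this README also joins on 'Target Cut %', a floating-point column; exact float equality can be fragile if the two files were rounded differently, which is one more reason to inspect unmatched rows.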

Since Gecko was screened with the tracrRNA from Hsu et al. 2013, we'll use the Hsu2013 sequence plus target score as our predictor

plt.subplots(figsize=(4,4))
gpplot.point_densityplot(gecko_activity_scores, y='avg_mean_centered_neg_lfc',
                         x='RS3 Sequence (Hsu2013 tracr) + Target Score')
gpplot.add_correlation(gecko_activity_scores, y='avg_mean_centered_neg_lfc',
                       x='RS3 Sequence (Hsu2013 tracr) + Target Score')
sns.despine()

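[Figure: point density plot of avg_mean_centered_neg_lfc (y) against RS3 Sequence (Hsu2013 tracr) + Target Score (x), annotated with their correlation]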
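If gpplot is not available, the correlation between predicted and observed activity can be computed with pandas alone. The sketch below uses only the ten merged rows displayed above, so its numbers will differ from those for the full 252-row table; we make no claim about which statistic `gpplot.add_correlation` annotates. Spearman is obtained as the Pearson correlation of the ranks, which avoids a SciPy dependency.

```python
import pandas as pd

score_col = 'RS3 Sequence (Hsu2013 tracr) + Target Score'
activity_col = 'avg_mean_centered_neg_lfc'

# Rows 0-4 and 247-251 of gecko_activity_scores as printed above
subset = pd.DataFrame({
    activity_col: [-1.052943, 0.028674, 0.195309, -1.338209, -0.175219,
                   0.627322, -0.957190, 0.176827, -0.043965, -1.425502],
    score_col: [-0.228667, -0.144967, -0.774856, -0.388801, -0.485794,
                0.097171, -0.117698, 0.266597, 0.298716, -0.614912],
})

x = subset[score_col]
y = subset[activity_col]
pearson_r = x.corr(y)                  # Series.corr defaults to Pearson
spearman_r = x.rank().corr(y.rank())   # Pearson on ranks equals Spearman
print(f"Pearson r = {pearson_r:.3f}, Spearman r = {spearman_r:.3f}")
```

With the real data, replace `subset` with `gecko_activity_scores`.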
