Predict the activity of CRISPR sgRNAs
Rule Set 3
Python package to predict the activity of CRISPR sgRNA sequences using Rule Set 3
Install
pip install git+ssh://git@github.com/gpp-rnd/rs3.git
Quick Start
Sequence-based model
Import packages
from rs3.seq import predict_seq
Create a list of context sequences you want to predict
context_seqs = ['GACGAAAGCGACAACGCGTTCATCCGGGCA', 'AGAAAACACTAGCATCCCCACCCGCGGACT']
You can predict on-target scores for sequences using the predict_seq function, specifying either Hsu2013 or Chen2013 as the tracrRNA to score with
predict_seq(context_seqs, sequence_tracr='Hsu2013')
array([-0.86673522, 1.09560723])
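The context sequences passed to predict_seq are 30 nt long. Judging from the sgRNA Sequence and sgRNA Context Sequence columns in the example data later in this README, the layout appears to be 4 nt upstream, the 20 nt guide, a 3 nt PAM, and 3 nt downstream. The sketch below is a minimal input check under that assumption; split_context is a hypothetical helper and is not part of rs3.

```python
# Hypothetical helper, NOT part of the rs3 package.
# Assumed layout of a 30-nt context sequence (inferred from the example data):
#   4 nt upstream | 20 nt guide | 3 nt PAM | 3 nt downstream


def split_context(context_seq):
    """Split a 30-nt context sequence into its assumed parts."""
    seq = context_seq.upper()
    if len(seq) != 30:
        raise ValueError(f"expected 30 nt, got {len(seq)}: {context_seq!r}")
    if set(seq) - set("ACGT"):
        raise ValueError(f"non-ACGT characters in {context_seq!r}")
    return {
        "upstream": seq[:4],
        "guide": seq[4:24],
        "pam": seq[24:27],
        "downstream": seq[27:],
    }


context_seqs = ['GACGAAAGCGACAACGCGTTCATCCGGGCA', 'AGAAAACACTAGCATCCCCACCCGCGGACT']
parts = [split_context(s) for s in context_seqs]
for p in parts:
    # Both example sequences carry an NGG PAM at the assumed position
    assert p["pam"].endswith("GG"), p
    print(p["guide"], p["pam"])
```

Running a check like this before scoring can catch truncated or shifted sequences early, although the package may do its own validation.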
Target and sequence scores
Using the predict function, we can calculate both target scores and sequence scores. Target-based scores use information such as the amino acid sequence and whether the sgRNA targets within a protein domain.
import pandas as pd
from rs3.predict import predict
import gpplot
import seaborn as sns
import matplotlib.pyplot as plt
We'll use a list of ~250 sgRNAs from the GeckoV2 library as an example dataset
design_df = pd.read_table('test_data/sgrna-designs.txt')
design_df
| | Input | Quota | Target Taxon | Target Gene ID | Target Gene Symbol | Target Transcript | Target Reference Coords | Target Alias | CRISPR Mechanism | Target Domain | ... | On-Target Rank Weight | Off-Target Rank Weight | Combined Rank | Preselected As | Matching Active Arrayed Oligos | Matching Arrayed Constructs | Pools Containing Matching Construct | Pick Order | Picking Round | Picking Notes |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | PSMB7 | 2 | 9606 | ENSG00000136930 | PSMB7 | ENST00000259457.8 | NaN | NaN | CRISPRko | CDS | ... | 1.0 | 1.0 | 7 | GCAGATACAAGAGCAACTGA | NaN | BRDN0004619103 | NaN | 1 | 0 | Preselected |
1 | PSMB7 | 2 | 9606 | ENSG00000136930 | PSMB7 | ENST00000259457.8 | NaN | NaN | CRISPRko | CDS | ... | 1.0 | 1.0 | 48 | AAAACTGGCACGACCATCGC | NaN | NaN | NaN | 2 | 0 | Preselected |
2 | PRC1 | 2 | 9606 | ENSG00000198901 | PRC1 | ENST00000394249.8 | NaN | NaN | CRISPRko | CDS | ... | 1.0 | 1.0 | 7 | AAAAGATTTGCGCACCCAAG | NaN | NaN | NaN | 1 | 0 | Preselected |
3 | PRC1 | 2 | 9606 | ENSG00000198901 | PRC1 | ENST00000394249.8 | NaN | NaN | CRISPRko | CDS | ... | 1.0 | 1.0 | 8 | CTTTGACCCAGACATAATGG | NaN | NaN | NaN | 2 | 0 | Preselected |
4 | TOP1 | 2 | 9606 | ENSG00000198900 | TOP1 | ENST00000361337.3 | NaN | NaN | CRISPRko | CDS | ... | 1.0 | 1.0 | 1 | NaN | NaN | BRDN0001486452 | NaN | 2 | 1 | NaN |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
395 | RFC5 | 2 | 9606 | ENSG00000111445 | RFC5 | ENST00000454402.7 | NaN | NaN | CRISPRko | CDS | ... | 1.0 | 1.0 | 23 | TTTATATAGCTGTTTCGCAC | NaN | NaN | NaN | 1 | 0 | Preselected |
396 | NXT1 | 2 | 9606 | ENSG00000132661 | NXT1 | ENST00000254998.3 | NaN | NaN | CRISPRko | CDS | ... | 1.0 | 1.0 | 3 | NaN | NaN | BRDN0002419367 | NaN | 2 | 1 | NaN |
397 | NXT1 | 2 | 9606 | ENSG00000132661 | NXT1 | ENST00000254998.3 | NaN | NaN | CRISPRko | CDS | ... | 1.0 | 1.0 | 31 | TTTGCTGTCCCGCCTGTACA | NaN | NaN | NaN | 1 | 0 | Preselected |
398 | NOL10 | 2 | 9606 | ENSG00000115761 | NOL10 | ENST00000381685.10 | NaN | NaN | CRISPRko | CDS | ... | 1.0 | 1.0 | 3 | NaN | NaN | NaN | NaN | 2 | 1 | NaN |
399 | NOL10 | 2 | 9606 | ENSG00000115761 | NOL10 | ENST00000381685.10 | NaN | NaN | CRISPRko | CDS | ... | 1.0 | 1.0 | 14 | TTTGTCTGATGACTACTCAA | NaN | NaN | NaN | 1 | 0 | Preselected |
400 rows × 60 columns
gecko_activity = pd.read_csv('test_data/Aguirre2017_activity.csv')
gecko_activity
| | sgRNA Sequence | sgRNA Context Sequence | Target Gene Symbol | Target Cut % | avg_mean_centered_neg_lfc |
---|---|---|---|---|---|
0 | AAAAAACTTACCCCTTTGAC | AAAAAAAAAACTTACCCCTTTGACTGGCCA | CPSF6 | 22.2 | -1.139819 |
1 | AAAAACATTATCATTGAGCC | TGGCAAAAACATTATCATTGAGCCTGGATT | SKA3 | 62.3 | -0.793055 |
2 | AAAAAGAGATTGTCAAATCA | TATGAAAAAGAGATTGTCAAATCAAGGTAG | AQR | 3.8 | 0.946453 |
3 | AAAAAGCATCTCTAGAAATA | TTCAAAAAAGCATCTCTAGAAATATGGTCC | ZNHIT6 | 61.7 | -0.429590 |
4 | AAAAAGCGAGATACCCGAAA | AAAAAAAAAGCGAGATACCCGAAAAGGCAG | ABCF1 | 9.4 | 0.734196 |
... | ... | ... | ... | ... | ... |
8654 | TTTGTGGCAGCGAATCATAA | TGTCTTTGTGGCAGCGAATCATAATGGTTC | UMPS | 43.8 | -0.927345 |
8655 | TTTGTTAATATCTGCTGAAC | TGAATTTGTTAATATCTGCTGAACAGGAGT | GTF2A1 | 40.3 | -0.382060 |
8656 | TTTGTTAGGATGTGCATTCC | TTTCTTTGTTAGGATGTGCATTCCAGGTAC | NAT10 | 16.4 | -0.927645 |
8657 | TTTGTTAGGTCATCGTATTG | GGTTTTTGTTAGGTCATCGTATTGAGGAAG | RPL4 | 33.5 | -1.425502 |
8658 | TTTGTTCCTTAGTTGCTGAC | TACTTTTGTTCCTTAGTTGCTGACAGGTCC | MRPL47 | 34.8 | -1.268444 |
8659 rows × 5 columns
By listing both tracrRNAs, tracr=['Hsu2013', 'Chen2013'], and setting target=True, we calculate 5 unique scores: one sequence score for each tracr, the target score, and each sequence score plus the target score.
scored_designs = predict(design_df, tracr=['Hsu2013', 'Chen2013'], target=True,
n_jobs=2)
scored_designs
Getting amino acid sequences
100%|██████████| 4/4 [00:00<00:00, 94.47it/s]
Getting protein domains
100%|██████████| 200/200 [00:30<00:00, 6.50it/s]
/Users/pdeweird/opt/anaconda3/envs/rs3/lib/python3.8/site-packages/sklearn/base.py:310: UserWarning: Trying to unpickle estimator SimpleImputer from version 1.0.dev0 when using version 0.24.2. This might lead to breaking code or invalid results. Use at your own risk.
warnings.warn(
/Users/pdeweird/opt/anaconda3/envs/rs3/lib/python3.8/site-packages/sklearn/base.py:310: UserWarning: Trying to unpickle estimator Pipeline from version 1.0.dev0 when using version 0.24.2. This might lead to breaking code or invalid results. Use at your own risk.
warnings.warn(
| | Input | Quota | Target Taxon | Target Gene ID | Target Gene Symbol | Target Transcript | Target Reference Coords | Target Alias | CRISPR Mechanism | Target Domain | ... | Pools Containing Matching Construct | Pick Order | Picking Round | Picking Notes | RS3 Sequence Score (Hsu2013 tracr) | RS3 Sequence Score (Chen2013 tracr) | Transcript Base | RS3 Target Score | RS3 Sequence (Hsu2013 tracr) + Target Score | RS3 Sequence (Chen2013 tracr) + Target Score |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | PSMB7 | 2 | 9606 | ENSG00000136930 | PSMB7 | ENST00000259457.8 | NaN | NaN | CRISPRko | CDS | ... | NaN | 1 | 0 | Preselected | 0.750904 | 0.512534 | ENST00000259457 | 0.273974 | 1.024878 | 0.786508 |
1 | PSMB7 | 2 | 9606 | ENSG00000136930 | PSMB7 | ENST00000259457.8 | NaN | NaN | CRISPRko | CDS | ... | NaN | 2 | 0 | Preselected | -0.218514 | -0.095684 | ENST00000259457 | -0.010152 | -0.228667 | -0.105837 |
2 | PRC1 | 2 | 9606 | ENSG00000198901 | PRC1 | ENST00000394249.8 | NaN | NaN | CRISPRko | CDS | ... | NaN | 1 | 0 | Preselected | -0.126708 | -0.307830 | ENST00000394249 | -0.018259 | -0.144967 | -0.326089 |
3 | PRC1 | 2 | 9606 | ENSG00000198901 | PRC1 | ENST00000394249.8 | NaN | NaN | CRISPRko | CDS | ... | NaN | 2 | 0 | Preselected | 0.690050 | 0.390095 | ENST00000394249 | -0.089659 | 0.600392 | 0.300436 |
4 | TOP1 | 2 | 9606 | ENSG00000198900 | TOP1 | ENST00000361337.3 | NaN | NaN | CRISPRko | CDS | ... | NaN | 2 | 1 | NaN | 0.451508 | -0.169016 | ENST00000361337 | -0.018748 | 0.432760 | -0.187764 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
395 | RFC5 | 2 | 9606 | ENSG00000111445 | RFC5 | ENST00000454402.7 | NaN | NaN | CRISPRko | CDS | ... | NaN | 1 | 0 | Preselected | -0.220600 | -0.022154 | ENST00000454402 | 0.102902 | -0.117698 | 0.080747 |
396 | NXT1 | 2 | 9606 | ENSG00000132661 | NXT1 | ENST00000254998.3 | NaN | NaN | CRISPRko | CDS | ... | NaN | 2 | 1 | NaN | 0.621609 | 0.539656 | ENST00000254998 | 0.220856 | 0.842465 | 0.760513 |
397 | NXT1 | 2 | 9606 | ENSG00000132661 | NXT1 | ENST00000254998.3 | NaN | NaN | CRISPRko | CDS | ... | NaN | 1 | 0 | Preselected | 0.119830 | 0.012744 | ENST00000254998 | 0.146767 | 0.266597 | 0.159511 |
398 | NOL10 | 2 | 9606 | ENSG00000115761 | NOL10 | ENST00000381685.10 | NaN | NaN | CRISPRko | CDS | ... | NaN | 2 | 1 | NaN | 0.798633 | 0.646323 | ENST00000381685 | -0.039771 | 0.758861 | 0.606552 |
399 | NOL10 | 2 | 9606 | ENSG00000115761 | NOL10 | ENST00000381685.10 | NaN | NaN | CRISPRko | CDS | ... | NaN | 1 | 0 | Preselected | 0.283254 | 0.264148 | ENST00000381685 | 0.015462 | 0.298716 | 0.279611 |
400 rows × 66 columns
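In the output above, each combined column looks like the plain sum of a sequence score and the target score. The sketch below checks that reading against the first five rows of the table. It uses only values copied from the printed output, so it illustrates the arithmetic and says nothing about how rs3 computes the columns internally.

```python
import math

# Values copied from the first five rows of the scored_designs output:
# (Hsu2013 sequence score, Chen2013 sequence score, target score,
#  Hsu2013 + target score, Chen2013 + target score)
rows = [
    (0.750904, 0.512534, 0.273974, 1.024878, 0.786508),
    (-0.218514, -0.095684, -0.010152, -0.228667, -0.105837),
    (-0.126708, -0.307830, -0.018259, -0.144967, -0.326089),
    (0.690050, 0.390095, -0.089659, 0.600392, 0.300436),
    (0.451508, -0.169016, -0.018748, 0.432760, -0.187764),
]

# The table is rounded to 6 decimals, so allow a small absolute tolerance.
TOL = 5e-6


def is_additive(row, tol=TOL):
    """Return True if both combined scores equal sequence + target."""
    hsu, chen, target, hsu_combined, chen_combined = row
    return (math.isclose(hsu + target, hsu_combined, abs_tol=tol)
            and math.isclose(chen + target, chen_combined, abs_tol=tol))


all_additive = all(is_additive(r) for r in rows)
print(all_additive)
```

If this holds for the full table, the five scores carry three independent pieces of information: two sequence scores and one target score.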
gecko_activity_scores = (gecko_activity.merge(scored_designs,
how='inner',
on=['sgRNA Sequence', 'sgRNA Context Sequence',
'Target Gene Symbol', 'Target Cut %']))
gecko_activity_scores
| | sgRNA Sequence | sgRNA Context Sequence | Target Gene Symbol | Target Cut % | avg_mean_centered_neg_lfc | Input | Quota | Target Taxon | Target Gene ID | Target Transcript | ... | Pools Containing Matching Construct | Pick Order | Picking Round | Picking Notes | RS3 Sequence Score (Hsu2013 tracr) | RS3 Sequence Score (Chen2013 tracr) | Transcript Base | RS3 Target Score | RS3 Sequence (Hsu2013 tracr) + Target Score | RS3 Sequence (Chen2013 tracr) + Target Score |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | AAAACTGGCACGACCATCGC | CCGGAAAACTGGCACGACCATCGCTGGGGT | PSMB7 | 16.4 | -1.052943 | PSMB7 | 2 | 9606 | ENSG00000136930 | ENST00000259457.8 | ... | NaN | 2 | 0 | Preselected | -0.218514 | -0.095684 | ENST00000259457 | -0.010152 | -0.228667 | -0.105837 |
1 | AAAAGATTTGCGCACCCAAG | TAGAAAAAGATTTGCGCACCCAAGTGGAAT | PRC1 | 17.0 | 0.028674 | PRC1 | 2 | 9606 | ENSG00000198901 | ENST00000394249.8 | ... | NaN | 1 | 0 | Preselected | -0.126708 | -0.307830 | ENST00000394249 | -0.018259 | -0.144967 | -0.326089 |
2 | AAAAGTCCAAGCATAGCAAC | CGGGAAAAGTCCAAGCATAGCAACAGGTAA | TOP1 | 6.5 | 0.195309 | TOP1 | 2 | 9606 | ENSG00000198900 | ENST00000361337.3 | ... | NaN | 1 | 0 | Preselected | -0.356580 | -0.082514 | ENST00000361337 | -0.418276 | -0.774856 | -0.500790 |
3 | AAAGAAGCCTCAACTTCGTC | AGCGAAAGAAGCCTCAACTTCGTCTGGAGA | CENPW | 37.5 | -1.338209 | CENPW | 2 | 9606 | ENSG00000203760 | ENST00000368328.5 | ... | NaN | 2 | 0 | Preselected | -0.663540 | -0.303324 | ENST00000368328 | 0.274739 | -0.388801 | -0.028585 |
4 | AAAGTGTGCTTTGTTGGAGA | TACTAAAGTGTGCTTTGTTGGAGATGGCTT | NSA2 | 60.0 | -0.175219 | NSA2 | 2 | 9606 | ENSG00000164346 | ENST00000610426.5 | ... | NaN | 2 | 0 | Preselected | -0.413636 | -0.585179 | ENST00000610426 | -0.072158 | -0.485794 | -0.657337 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
247 | TTTAGCCGGATGCGCAGTGA | CTCTTTTAGCCGGATGCGCAGTGATGGTTT | NKAPD1 | 20.6 | 0.627322 | NKAPD1 | 2 | 9606 | ENSG00000150776 | ENST00000393047.8 | ... | CP1718 | 1 | 0 | Preselected | 0.298329 | 0.274344 | ENST00000393047 | -0.201158 | 0.097171 | 0.073187 |
248 | TTTATATAGCTGTTTCGCAC | TGTCTTTATATAGCTGTTTCGCACAGGCTA | RFC5 | 21.5 | -0.957190 | RFC5 | 2 | 9606 | ENSG00000111445 | ENST00000454402.7 | ... | NaN | 1 | 0 | Preselected | -0.220600 | -0.022154 | ENST00000454402 | 0.102902 | -0.117698 | 0.080747 |
249 | TTTGCTGTCCCGCCTGTACA | GGCGTTTGCTGTCCCGCCTGTACATGGGCA | NXT1 | 27.2 | 0.176827 | NXT1 | 2 | 9606 | ENSG00000132661 | ENST00000254998.3 | ... | NaN | 1 | 0 | Preselected | 0.119830 | 0.012744 | ENST00000254998 | 0.146767 | 0.266597 | 0.159511 |
250 | TTTGTCTGATGACTACTCAA | AAATTTTGTCTGATGACTACTCAAAGGTAT | NOL10 | 15.6 | -0.043965 | NOL10 | 2 | 9606 | ENSG00000115761 | ENST00000381685.10 | ... | NaN | 1 | 0 | Preselected | 0.283254 | 0.264148 | ENST00000381685 | 0.015462 | 0.298716 | 0.279611 |
251 | TTTGTTAGGTCATCGTATTG | GGTTTTTGTTAGGTCATCGTATTGAGGAAG | RPL4 | 33.5 | -1.425502 | RPL4 | 2 | 9606 | ENSG00000174444 | ENST00000307961.11 | ... | CP1718 | 2 | 1 | NaN | -0.636302 | -0.575100 | ENST00000307961 | 0.021391 | -0.614912 | -0.553709 |
252 rows × 67 columns
Since Gecko was screened with the tracrRNA from Hsu et al. 2013, we'll use the Hsu2013 sequence + target score as our predictor
plt.subplots(figsize=(4,4))
gpplot.point_densityplot(gecko_activity_scores, y='avg_mean_centered_neg_lfc',
x='RS3 Sequence (Hsu2013 tracr) + Target Score')
gpplot.add_correlation(gecko_activity_scores, y='avg_mean_centered_neg_lfc',
x='RS3 Sequence (Hsu2013 tracr) + Target Score')
sns.despine()
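The plot compares measured activity with the predicted score and annotates it with a correlation. As a rough sketch of what such a number means, the snippet below computes Pearson's r in plain Python on the ten rows visible in the gecko_activity_scores printout. Two assumptions apply: the pearson function is written here for illustration and is not taken from gpplot, and ten rows are far too few to reproduce the value for all 252 sgRNAs.

```python
import math

# (avg_mean_centered_neg_lfc, RS3 Sequence (Hsu2013 tracr) + Target Score)
# pairs copied from the ten visible rows of gecko_activity_scores.
pairs = [
    (-1.052943, -0.228667),
    (0.028674, -0.144967),
    (0.195309, -0.774856),
    (-1.338209, -0.388801),
    (-0.175219, -0.485794),
    (0.627322, 0.097171),
    (-0.957190, -0.117698),
    (0.176827, 0.266597),
    (-0.043965, 0.298716),
    (-1.425502, -0.614912),
]


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


measured = [p[0] for p in pairs]
predicted = [p[1] for p in pairs]
r = pearson(predicted, measured)
print(f"Pearson r on {len(pairs)} illustrative rows: {r:.3f}")
```

On the full merged table the same comparison could be made with pandas, for example by calling corr on the two columns.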