Python package to analyze the results of pooled CRISPR screens

These details have been verified by PyPI

Maintainers

goulddy ldrepano mhegde peterdeweirdt

These details have not been verified by PyPI

Project links

Homepage

GitHub Statistics

View statistics for this project via Libraries.io, or by using our public dataset on Google BigQuery

Development Status
- 3 - Alpha
Intended Audience
- Developers
License
- OSI Approved :: Apache Software License
Natural Language
- English
Programming Language

Project description

poola

Python package for pooled screen analysis

Install

Install from github for the latest development release:

pip install git+git://github.com/gpp-rnd/poola.git#egg=poola

Or install the most recent distribution from PyPi:

pip install poola

How to use

Additional packages required for this tutorial can be install using pip install -r requirements.txt

from poola import core as pool
import pandas as pd
import seaborn as sns
import gpplot
import matplotlib.pyplot as plt

To demonstrate the functionality of this module we'll use read counts from Sanson et al. 2018.

supp_reads = 'https://static-content.springer.com/esm/art%3A10.1038%2Fs41467-018-07901-8/MediaObjects/41467_2018_7901_MOESM4_ESM.xlsx'
read_counts = pd.read_excel(supp_reads,
                            sheet_name = 'A375_orig_tracr raw reads', 
                            header = None,
                            skiprows = 3, 
                            names = ['sgRNA Sequence', 'pDNA', 'A375_RepA', 'A375_RepB'], 
                            engine='openpyxl')
guide_annotations = pd.read_excel(supp_reads,
                                  sheet_name='sgRNA annotations', 
                                  engine='openpyxl')

The input data has three columns with read counts and one column with sgRNA annotations

read_counts.head()

.dataframe tbody tr th {
    vertical-align: top;
}

.dataframe thead th {
    text-align: right;
}

</style>

	sgRNA Sequence	pDNA	A375_RepA	A375_RepB
0	AAAAAAAATCCGGACAATGG	522	729	774
1	AAAAAAAGGATGGTGATCAA	511	1484	1393
2	AAAAAAATGACATTACTGCA	467	375	603
3	AAAAAAATGTCAGTCGAGTG	200	737	506
4	AAAAAACACAAGCAAGACCG	286	672	352

lognorms = pool.lognorm_columns(reads_df=read_counts, columns=['pDNA', 'A375_RepA', 'A375_RepB'])
filtered_lognorms = pool.filter_pdna(lognorm_df=lognorms, pdna_cols=['pDNA'], z_low=-3)
print('Filtered ' + str(lognorms.shape[0] - filtered_lognorms.shape[0]) + ' columns due to low pDNA abundance')

Filtered 576 columns due to low pDNA abundance

Note that the column names for the lognorms remain the same

lognorms.head()

.dataframe tbody tr th {
    vertical-align: top;
}

.dataframe thead th {
    text-align: right;
}

</style>

	sgRNA Sequence	pDNA	A375_RepA	A375_RepB
0	AAAAAAAATCCGGACAATGG	4.192756	3.373924	3.521755
1	AAAAAAAGGATGGTGATCAA	4.163726	4.326828	4.312620
2	AAAAAAATGACATTACTGCA	4.041390	2.540624	3.196767
3	AAAAAAATGTCAGTCGAGTG	2.930437	3.388159	2.973599
4	AAAAAACACAAGCAAGACCG	3.388394	3.268222	2.528233

lfc_df = pool.calculate_lfcs(lognorm_df=filtered_lognorms, ref_col='pDNA', target_cols=['A375_RepA', 'A375_RepB'])

We drop the pDNA column after calculating log-fold changes

lfc_df.head()

.dataframe tbody tr th {
    vertical-align: top;
}

.dataframe thead th {
    text-align: right;
}

</style>

	sgRNA Sequence	A375_RepA	A375_RepB
0	AAAAAAAATCCGGACAATGG	-0.818831	-0.671000
1	AAAAAAAGGATGGTGATCAA	0.163102	0.148894
2	AAAAAAATGACATTACTGCA	-1.500766	-0.844622
3	AAAAAAATGTCAGTCGAGTG	0.457721	0.043161
4	AAAAAACACAAGCAAGACCG	-0.120172	-0.860161

Since we only have two conditions it's easy to visualize replicates as a point densityplot using gpplot

plt.subplots(figsize=(4,4))
gpplot.point_densityplot(data=lfc_df, x='A375_RepA', y='A375_RepB')
gpplot.add_correlation(data=lfc_df, x='A375_RepA', y='A375_RepB')
sns.despine()

png

Since we see a strong correlation, we'll average the log-fold change of each sgRNA across replicates

avg_replicate_lfc_df = pool.average_replicate_lfcs(lfcs=lfc_df, guide_col='sgRNA Sequence', condition_indices=[0],
                                                   sep='_')

After averaging log-fold changes our dataframe is melted, so the condition column specifies the experimental condition (A375 here) and the n_obs specifies the number of replicates

avg_replicate_lfc_df.head()

.dataframe tbody tr th {
    vertical-align: top;
}

.dataframe thead th {
    text-align: right;
}

</style>

	sgRNA Sequence	condition	avg_lfc	n_obs
0	AAAAAAAATCCGGACAATGG	A375	-0.744916	2
1	AAAAAAAGGATGGTGATCAA	A375	0.155998	2
2	AAAAAAATGACATTACTGCA	A375	-1.172694	2
3	AAAAAAATGTCAGTCGAGTG	A375	0.250441	2
4	AAAAAACACAAGCAAGACCG	A375	-0.490166	2

It's sometimes helpful to group controls into pseudo-genes so they're easier to compare with target genes. Our annotation file maps from sgRNA sequences to gene symbols

remapped_annotations = pool.group_pseudogenes(annotations=guide_annotations, pseudogene_size=4, 
                                              gene_col='Annotated Gene Symbol', 
                                              control_regex=['NO_CURRENT'])
remapped_annotations.head()

.dataframe tbody tr th {
    vertical-align: top;
}

.dataframe thead th {
    text-align: right;
}

</style>

	sgRNA Sequence	Annotated Gene Symbol	Annotated Gene ID
0	AAAAAAAATCCGGACAATGG	SLC25A24	29957
1	AAAAAAAGGATGGTGATCAA	FASTKD3	79072
2	AAAAAAATGACATTACTGCA	BCAS2	10286
3	AAAAAAATGTCAGTCGAGTG	GPR18	2841
4	AAAAAACACAAGCAAGACCG	ZNF470	388566

We provide two methods for scaling log-fold change values to controls:

Z-score from a set of negative controls
Scale scores between a set of negative and positive controls

For both scoring methods, you can input either a regex or a list of genes to define control sets

For our set of negative controls, we'll use nonessential genes

nonessential_genes = (pd.read_table('https://raw.githubusercontent.com/gpp-rnd/genesets/master/human/non-essential-genes-Hart2014.txt',
                                    names=['gene'])
                      .gene)
annot_guide_lfcs = pool.annotate_guide_lfcs(avg_replicate_lfc_df, remapped_annotations, 'Annotated Gene Symbol',
                                            merge_on='sgRNA Sequence', z_score_neg_ctls=True,
                                            z_score_neg_ctl_genes=nonessential_genes)
annot_guide_lfcs.head()

.dataframe tbody tr th {
    vertical-align: top;
}

.dataframe thead th {
    text-align: right;
}

</style>

	sgRNA Sequence	condition	avg_lfc	n_obs	Annotated Gene Symbol	Annotated Gene ID	z_scored_avg_lfc
0	AAAAAAAATCCGGACAATGG	A375	-0.744916	2	SLC25A24	29957	-1.651921
1	AAAAAAAGGATGGTGATCAA	A375	0.155998	2	FASTKD3	79072	0.166585
2	AAAAAAATGACATTACTGCA	A375	-1.172694	2	BCAS2	10286	-2.515396
3	AAAAAAATGTCAGTCGAGTG	A375	0.250441	2	GPR18	2841	0.357219
4	AAAAAACACAAGCAAGACCG	A375	-0.490166	2	ZNF470	388566	-1.137705

To aggregate scores at the gene level, we specify columns to average, and columns to z-score. From our z-scores we calculate a p-value and FDR using the Benjamini-Hochberg procedure

gene_lfcs = pool.aggregate_gene_lfcs(annot_guide_lfcs, 'Annotated Gene Symbol',
                                     average_cols=['avg_lfc'],
                                     zscore_cols=['z_scored_avg_lfc'])
gene_lfcs.head()

.dataframe tbody tr th {
    vertical-align: top;
}

.dataframe thead th {
    text-align: right;
}

</style>

	condition	Annotated Gene Symbol	n_guides	avg_lfc	z_scored_avg_lfc	z_scored_avg_lfc_p_value	z_scored_avg_lfc_fdr
0	A375	A1BG	3	-0.084621	-0.552711	0.580461	0.872833
1	A375	A1CF	4	0.500314	1.723181	0.084856	0.338055
2	A375	A2M	4	0.288629	0.868604	0.385064	0.759829
3	A375	A2ML1	4	-0.481786	-2.241580	0.024989	0.132164
4	A375	A3GALT2	4	0.059362	-0.056952	0.954583	0.990693

Finally, to evaluate the quality this screen, we'll calculate the ROC-AUC between essential and nonessential genes for each condition

essential_genes = (pd.read_table('https://raw.githubusercontent.com/gpp-rnd/genesets/master/human/essential-genes-Hart2015.txt', 
                                 names=['gene'])
                   .gene)
roc_aucs, roc_df = pool.get_roc_aucs(lfcs=gene_lfcs, tp_genes=essential_genes, fp_genes=nonessential_genes, gene_col='Annotated Gene Symbol',
                                     condition_col='condition', score_col='avg_lfc')
print('ROC-AUC: ' + str(round(roc_aucs['ROC-AUC'].values[0], 3)))

ROC-AUC: 0.976

Note that we can also use this function to calculate roc-aucs at the guide level

annotated_guide_lfcs = lfc_df.merge(guide_annotations, how='inner', on='sgRNA Sequence')
roc_aucs, roc_df = pool.get_roc_aucs(lfcs=annotated_guide_lfcs, tp_genes=essential_genes, fp_genes=nonessential_genes, gene_col='Annotated Gene Symbol',
                                     condition_list=['A375_RepA', 'A375_RepB'])
roc_aucs

.dataframe tbody tr th {
    vertical-align: top;
}

.dataframe thead th {
    text-align: right;
}

</style>

	condition	ROC-AUC
0	A375_RepA	0.918505
1	A375_RepB	0.917600

And plot ROC curves from the roc_df

plt.subplots(figsize=(4,4))
sns.lineplot(data=roc_df, x='fpr', y='tpr', hue='condition', ci=None)
gpplot.add_xy_line()
sns.despine()

png

Project details

These details have been verified by PyPI

Maintainers

goulddy ldrepano mhegde peterdeweirdt

These details have not been verified by PyPI

Project links

Homepage

GitHub Statistics

View statistics for this project via Libraries.io, or by using our public dataset on Google BigQuery

Development Status
- 3 - Alpha
Intended Audience
- Developers
License
- OSI Approved :: Apache Software License
Natural Language
- English
Programming Language

Release history Release notifications | RSS feed

This version

0.0.16

Jun 3, 2022

0.0.15

Apr 11, 2022

0.0.14

Apr 7, 2022

0.0.12

Jun 2, 2021

0.0.11

Jun 2, 2021

0.0.10

Jun 2, 2021

0.0.9

Apr 20, 2021

0.0.8

Apr 15, 2021

0.0.7

Mar 24, 2021

0.0.6

Mar 24, 2021

0.0.5

Mar 22, 2021

0.0.4

Feb 17, 2021

0.0.3

Feb 17, 2021

0.0.2

Feb 17, 2021

0.0.1

Feb 17, 2021

Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

poola-0.0.16.tar.gz (20.0 kB view hashes)

Uploaded Jun 3, 2022 Source

Built Distribution

poola-0.0.16-py3-none-any.whl (15.3 kB view hashes)

Uploaded Jun 3, 2022 Python 3

Hashes for poola-0.0.16.tar.gz

Hashes for poola-0.0.16.tar.gz
Algorithm	Hash digest
SHA256	`fcb561e4a1f43fcdeb836689bd517fa01266038affb446b3c5cc7d5f727efc4e`
MD5	`c64e124be8c8da133d32290602b6ff33`
BLAKE2b-256	`b021ad4ffb7f2bbedd0256bb20e30078e986893f2fc1ddd1ee95d500d4ed3f40`

Hashes for poola-0.0.16-py3-none-any.whl

Hashes for poola-0.0.16-py3-none-any.whl
Algorithm	Hash digest
SHA256	`06c4066b5ee94da31b2455917428fbe7bbf984cb1bed56bf57f6d838621b604f`
MD5	`187834ef1ad011892952627f9c1911e3`
BLAKE2b-256	`82dda96f6ffc2b19fabbaf1a4853bbe052ecd2ee8c25f841387797afe6afb2e3`