PhIP-seq analysis tools
Project description
phip-stat: tools for analyzing PhIP-seq data
NOTE: This project is no longer being maintained. Please see the phippery, phip-flow, and related projects maintained by Erick Matsen's group: https://github.com/matsengrp/phippery and https://github.com/matsengrp/phip-flow
The PhIP-seq assay was first described in Larman et al. This repo contains code for processing raw PhIP-seq data into analysis-ready enrichment scores.
This code also implements multiple statistical models for processing PhIP-seq data, including the model described in the original Larman et al. paper (generalized-poisson-model). We currently recommend using one of the newer models implemented here (e.g., gamma-poisson-model).
Please submit issues to report any problems.
Installation
phip-stat runs on Python 3.6+ and minimally depends on click, tqdm, numpy, scipy, and pandas. The matrix factorization model also requires tensorflow.
pip install phip-stat
or to install the latest development version from GitHub
pip install git+https://github.com/lasersonlab/phip-stat.git
Usage
The overall flow of the pipeline is
- align — for each sample, count the number of reads derived from each possible library member
- merge — combine the count values from all samples into a single count matrix
- model — normalize counts and train a model to compute enrichment scores/hits
An entire NextSeq run with 500M reads can be processed in <30 min on a 4-core laptop (if aligning with a tool like kallisto).
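Each step communicates through plain tab-delimited text: merge yields a clones-by-samples count matrix that the model step consumes. A minimal sketch of what downstream code sees (the file name counts.tsv is an assumption matching the examples below):

import pandas as pd

# Rows are library members (clones); columns are samples.
counts = pd.read_csv("counts.tsv", sep="\t", index_col=0)
print(counts.shape)  # (n_clones, n_samples)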
Command-line interface
All the pipeline tools are accessed through the phip executable. Usage/options for every (sub)command can be obtained by passing -h.
$ phip -h
Usage: phip [OPTIONS] COMMAND [ARGS]...
phip -- PhIP-seq analysis tools
Options:
-h, --help Show this message and exit.
Commands:
align-parts align fastq files to peptide reference
compute-counts compute counts from aligned bam file
compute-pvals compute p-values from counts
groupby-sample group alignments by sample
join-barcodes annotate Illumina reads with barcodes Some...
merge-columns merge tab-delim files
split-fastq split fastq files into smaller chunks
Example pipeline 1: kallisto alignment followed by Gamma-Poisson model
This pipeline uses kallisto to pseudoalign the reads to the reference; it assumes an index has already been built from the library reference (e.g., with kallisto index -i reference.idx reference.fasta). Because kallisto writes each sample's results to its own output directory, the merge step uses a CLI tool designed for that directory structure. The merged values (TPM) also come out of kallisto already normalized, so no separate normalization step is needed.
# 1. align
kallisto quant --single --plaintext --fr-stranded -l 75 -s 0.1 -t 4 \
-i reference.idx -o sample_counts/sample1 sample1.fastq.gz
# ...
kallisto quant --single --plaintext --fr-stranded -l 75 -s 0.1 -t 4 \
-i reference.idx -o sample_counts/sampleN sampleN.fastq.gz
# 2. merge
phip merge-kallisto-tpm -i sample_counts -o cpm.tsv
# 3. model
phip gamma-poisson-model -t 99.9 -i cpm.tsv -o gamma-poisson
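For intuition: a gamma-Poisson model treats each clone's background read rate as gamma-distributed, which makes the marginal count distribution negative binomial, and clones are scored by their upper-tail probability. A conceptual sketch in Python, not the package's exact implementation (fitting a single sample and rounding TPM values to pseudo-counts are illustrative simplifications):

import pandas as pd
from scipy import stats

# Merged matrix from step 2: rows are clones, columns are samples.
cpm = pd.read_csv("cpm.tsv", sep="\t", index_col=0)

# Fit a gamma background to one sample's nonzero values.
counts = cpm.iloc[:, 0].round()
shape, _, scale = stats.gamma.fit(counts[counts > 0], floc=0)

# A gamma-distributed Poisson rate gives a negative binomial marginal;
# score each clone by its survival (upper-tail) probability.
p = 1.0 / (1.0 + scale)
pvals = stats.nbinom.sf(counts, shape, p)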
Example pipeline 2: exact-matching reads followed by matrix factorization
This pipeline matches each read (or a chosen prefix of the read) exactly against the reference, then merges the per-sample counts into a single matrix. The matrix is fit with a clipped low-rank factorization (extreme values are clipped so that strong enrichments do not distort the background model), and "hits" are called from the residuals with a heuristic, as sketched after the commands below.
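Conceptually, the exact-match counting step amounts to a hash lookup of each read prefix against the library. A simplified Python sketch (the package's actual implementation and file handling may differ):

import gzip
from collections import Counter

PREFIX = 75  # corresponds to -l 75 in the commands below

def read_fasta(path):
    # Minimal FASTA parser: yield (name, sequence) pairs.
    name, chunks = None, []
    with open(path) as fh:
        for line in fh:
            line = line.strip()
            if line.startswith(">"):
                if name is not None:
                    yield name, "".join(chunks)
                name, chunks = line[1:].split()[0], []
            else:
                chunks.append(line)
    if name is not None:
        yield name, "".join(chunks)

# Map each clone's first 75 nt to its identifier.
lookup = {seq[:PREFIX]: name for name, seq in read_fasta("reference.fasta")}

# Count reads whose first 75 nt exactly match a library member.
counts = Counter()
with gzip.open("sample1.fastq.gz", "rt") as fh:
    for i, line in enumerate(fh):
        if i % 4 == 1:  # the sequence line of each 4-line FASTQ record
            clone = lookup.get(line.strip()[:PREFIX])
            if clone is not None:
                counts[clone] += 1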
# 1. align
phip count-exact-matches -r reference.fasta -l 75 -o sample_counts/sample1.counts.tsv sample1.fastq.gz
# ...
phip count-exact-matches -r reference.fasta -l 75 -o sample_counts/sampleN.counts.tsv sampleN.fastq.gz
# 2. merge
phip merge-columns -m iter -i sample_counts -o counts.tsv
# 3. model
phip clipped-factorization-model --rank 2 -i counts.tsv -o residuals.tsv
phip call-hits -i residuals.tsv -o hits.tsv --beads-regex ".*BEADS_ONLY.*"
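Hit calling compares each sample's residuals against the beads-only (mock IP) control samples selected by --beads-regex. One simple version of such a heuristic, as a sketch (not necessarily the package's exact rule):

import pandas as pd

residuals = pd.read_csv("residuals.tsv", sep="\t", index_col=0)

# Beads-only (no-serum) control columns define the per-clone null.
beads = residuals.filter(regex=".*BEADS_ONLY.*")

# Call a hit when a sample's residual exceeds everything observed
# in the beads-only controls for that clone.
threshold = beads.max(axis=1)
hits = residuals.drop(columns=beads.columns).gt(threshold, axis=0)
hits.to_csv("hits.tsv", sep="\t")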
Example pipeline 3: bowtie2 alignment followed by normalization and Gamma-Poisson
This example uses bowtie2, which should give maximum sensitivity at the expense of speed. For each sample, the shell pipeline aligns the reads to the reference, sorts and converts to BAM, computes the coverage depth at each position of each clone, takes each clone's maximum observed depth as its count, and finally sorts by clone identifier.
# 1. align
echo "id\tsample1" > sample_counts/sample1.tsv
bowtie2 -p 4 -x reference_index -U sample1.fastq.gz \
| samtools sort -O BAM \
| samtools depth -aa -m 100000000 - \
| awk 'BEGIN {OFS="\t"} {counts[$1] = ($3 < counts[$1]) ? counts[$1] : $3} END {for (c in counts) {print c, counts[c]}}' \
| sort -k 1 \
>> sample_counts/sample1.tsv
# ...
echo "id\tsampleN" > sample_counts/sampleN.tsv
bowtie2 -p 4 -x reference_index -U sampleN.fastq.gz \
| samtools sort -O BAM \
| samtools depth -aa -m 100000000 - \
| awk 'BEGIN {OFS="\t"} {counts[$1] = ($3 < counts[$1]) ? counts[$1] : $3} END {for (c in counts) {print c, counts[c]}}' \
| sort -k 1 \
>> sample_counts/sampleN.tsv
# 2. merge -- NOTE: this performs a pandas outer join and loads all counts into memory
phip merge-columns -m outer -i sample_counts -o counts.tsv
# 3. model
phip normalize-counts -m size-factors -i counts.tsv -o normalized_counts.tsv
phip gamma-poisson-model -t 99.9 -i normalized_counts.tsv -o gamma-poisson
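The size-factors method is in the spirit of DESeq-style median-of-ratios normalization; a rough sketch of the idea (not necessarily the package's exact implementation):

import numpy as np
import pandas as pd

counts = pd.read_csv("counts.tsv", sep="\t", index_col=0)

# Log geometric mean of each clone across samples (zeros drop out as NaN).
log_counts = np.log(counts.replace(0, np.nan))
log_geo_means = log_counts.mean(axis=1)

# A sample's size factor is the median ratio of its counts to the
# clone-wise reference profile.
size_factors = np.exp(log_counts.sub(log_geo_means, axis=0).median(axis=0))
normalized = counts.div(size_factors, axis=1)
normalized.to_csv("normalized_counts.tsv", sep="\t")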
Snakemake recipes
We include several example Snakemake recipes for easily processing large sets of samples at once, e.g., workflows/example-kallisto-GamPois-factorization.snakefile. In general, the configuration section of each recipe must be edited to specify the location of the raw sequencing data; the workflow can then be run with, e.g., snakemake -s workflows/example-kallisto-GamPois-factorization.snakefile -j 4.
Running unit tests
Unit tests use the nose package and can be run with:
$ pip install nose # if not already installed
$ nosetests -sv test/
Project details
Download files
Download the file for your platform.
Source distribution: phip-stat-0.5.1.tar.gz
Built distribution: phip_stat-0.5.1-py3-none-any.whl
File details
Details for the file phip-stat-0.5.1.tar.gz.
File metadata
- Download URL: phip-stat-0.5.1.tar.gz
- Upload date:
- Size: 31.4 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/3.7.0 importlib_metadata/4.7.0 pkginfo/1.8.2 requests/2.26.0 requests-toolbelt/0.9.1 tqdm/4.62.2 CPython/3.8.10
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | ebf405bbede636f34a26e9d4cf0201a2d4603ca504492338078d43a3cd62c2fc |
| MD5 | c8c74b4b646755cc3fe857626398d9bf |
| BLAKE2b-256 | 301067885e116322b1859ab20a34076581b55b929bee338f0bda24d8f4187125 |
File details
Details for the file phip_stat-0.5.1-py3-none-any.whl.
File metadata
- Download URL: phip_stat-0.5.1-py3-none-any.whl
- Upload date:
- Size: 30.2 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/3.7.0 importlib_metadata/4.7.0 pkginfo/1.8.2 requests/2.26.0 requests-toolbelt/0.9.1 tqdm/4.62.2 CPython/3.8.10
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 257502939ea7884659a66f27165cabd1cc0fe44f4651eb1b904989819efeae7b |
| MD5 | a3a998603c0cc6acd60635cbebb50848 |
| BLAKE2b-256 | 11a4fc0e5e42a48c7bc25186a45ea540b348dca69c54692410b33f291b052bc0 |