

phip-stat: tools for analyzing PhIP-seq data

NOTE: This project is no longer being maintained. Please see the phippery, phip-flow, and related projects maintained by Erick Matsen's group: https://github.com/matsengrp/phippery https://github.com/matsengrp/phip-flow

The PhIP-seq assay was first described in Larman et al. This repo contains code for processing raw PhIP-seq data into analysis-ready enrichment scores.

This code also implements multiple statistical models for processing PhIP-seq data, including the model described in the original Larman et al. paper (generalized-poisson-model). We currently recommend using one of the newer models implemented here (e.g., gamma-poisson-model).

Please submit issues to report any problems.

Installation

phip-stat runs on Python 3.6+ and minimally depends on click, tqdm, numpy, scipy, and pandas. The matrix factorization model also requires tensorflow.

pip install phip-stat

or to install the latest development version from GitHub

pip install git+https://github.com/lasersonlab/phip-stat.git

Usage

The overall flow of the pipeline is:

  1. align — for each sample count the number of reads derived from each possible library member

  2. merge — combine the count values from all samples into a single count matrix

  3. model — normalize counts and train a model to compute enrichment scores/hits

An entire NextSeq run with 500M reads can be processed in <30 min on a 4-core laptop (if aligning with a tool like kallisto).
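The merge step (step 2) amounts to an outer join of per-sample count files on the library-member identifier. A minimal sketch in Python with pandas, assuming each file has an `id` column plus one sample column as produced by the align step (`merge_counts` is a hypothetical helper for illustration, not part of phip-stat):

```python
# Sketch of the merge step: combine per-sample count files into one matrix.
# Assumes each TSV has an "id" column and one sample-name column.
import pandas as pd

def merge_counts(paths):
    """Outer-join per-sample count TSVs on the library-member id."""
    frames = [pd.read_csv(p, sep="\t", index_col="id") for p in paths]
    # Outer join keeps every library member seen in any sample;
    # members missing from a sample get a count of 0.
    merged = pd.concat(frames, axis=1, join="outer").fillna(0).astype(int)
    return merged
```

Missing entries are filled with zero so downstream models see a dense clones-by-samples matrix.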

Command-line interface

All the pipeline tools are accessed through the phip executable. All (sub)command usage/options can be obtained by passing -h.

$ phip -h
Usage: phip [OPTIONS] COMMAND [ARGS]...

  phip -- PhIP-seq analysis tools

Options:
  -h, --help  Show this message and exit.

Commands:
  align-parts     align fastq files to peptide reference
  compute-counts  compute counts from aligned bam file
  compute-pvals   compute p-values from counts
  groupby-sample  group alignments by sample
  join-barcodes   annotate Illumina reads with barcodes
  merge-columns   merge tab-delim files
  split-fastq     split fastq files into smaller chunks

Example pipeline 1: kallisto alignment followed by Gamma-Poisson model

This pipeline uses kallisto to pseudoalign the reads to the reference. Because each alignment step writes its output to a directory, the merge step uses a CLI tool designed for this directory structure. The merged values are also pre-normalized, since kallisto reports abundances as TPM.

# 1. align
kallisto quant --single --plaintext --fr-stranded -l 75 -s 0.1 -t 4 \
    -i reference.idx -o sample_counts/sample1 sample1.fastq.gz
# ...
kallisto quant --single --plaintext --fr-stranded -l 75 -s 0.1 -t 4 \
    -i reference.idx -o sample_counts/sampleN sampleN.fastq.gz

# 2. merge
phip merge-kallisto-tpm -i sample_counts -o cpm.tsv

# 3. model
phip gamma-poisson-model -t 99.9 -i cpm.tsv -o gamma-poisson
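For intuition, the Gamma-Poisson model treats background counts as negative-binomially distributed (a Poisson whose rate is itself Gamma-distributed) and scores each observation by its upper-tail probability. The following is an illustration of that idea only, not phip-stat's implementation; the parameters `r` and `p` are fixed here, whereas a real fit would estimate them from the unenriched bulk of the data:

```python
# Illustration of the Gamma-Poisson idea: score counts by the upper-tail
# survival probability of a negative binomial background distribution.
import numpy as np
from scipy import stats

def gamma_poisson_scores(counts, r=2.0, p=0.5):
    """Return -log10 upper-tail p-values under a fixed NB(r, p) background.

    r and p would normally be fit to the data; fixed here for illustration.
    """
    counts = np.asarray(counts)
    pvals = stats.nbinom.sf(counts - 1, r, p)  # P(X >= count)
    return -np.log10(pvals)
```

Larger counts yield smaller tail probabilities and hence larger scores; a threshold like `-t 99.9` then keeps only the most extreme scores as hits.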

Example pipeline 2: exact-matching reads followed by matrix factorization

This pipeline matches each read (or a chosen subset of the read) exactly against the reference and then merges the per-sample counts into a single matrix. The matrix is then factored with a low-rank approximation (allowing for clipping) and "hits" are called with a heuristic.

# 1. align
phip count-exact-matches -r reference.fasta -l 75 -o sample_counts/sample1.counts.tsv sample1.fastq.gz
# ...
phip count-exact-matches -r reference.fasta -l 75 -o sample_counts/sampleN.counts.tsv sampleN.fastq.gz

# 2. merge
phip merge-columns -m iter -i sample_counts -o counts.tsv

# 3. model
phip clipped-factorization-model --rank 2 -i counts.tsv -o residuals.tsv
phip call-hits -i residuals.tsv -o hits.tsv --beads-regex ".*BEADS_ONLY.*"
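The factorization step removes structured background by subtracting a low-rank approximation of the count matrix, leaving residuals in which genuine enrichments stand out. A rough sketch of the residual idea using a plain truncated SVD (phip-stat's model is fit with tensorflow and additionally clips extreme values during fitting so hits do not distort the background; this sketch omits that):

```python
# Sketch of the low-rank residual idea: subtract the best rank-k
# approximation of the matrix (via SVD) and keep the residuals.
import numpy as np

def low_rank_residuals(X, rank=2):
    """Return X minus its best rank-`rank` approximation."""
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    approx = (U[:, :rank] * s[:rank]) @ Vt[:rank]
    return X - approx
```

If the background really is low-rank (e.g., driven by library composition and sample depth), the residuals for unenriched clones hover near zero while enriched clones remain large.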

Example pipeline 3: bowtie2 alignment followed by normalization and Gamma-Poisson

This example uses bowtie2, which should give the maximum sensitivity at the expense of speed. The main bowtie2 pipeline aligns reads to the reference, sorts and converts to BAM, computes the coverage depth at each position of each clone, takes the largest depth observed for each clone, and finally sorts by clone identifier.

# 1. align
printf 'id\tsample1\n' > sample_counts/sample1.tsv
bowtie2 -p 4 -x reference_index -U sample1.fastq.gz \
    | samtools sort -O BAM \
    | samtools depth -aa -m 100000000 - \
    | awk 'BEGIN {OFS="\t"} {counts[$1] = ($3 < counts[$1]) ? counts[$1] : $3} END {for (c in counts) {print c, counts[c]}}' \
    | sort -k 1 \
    >> sample_counts/sample1.tsv
# ...
printf 'id\tsampleN\n' > sample_counts/sampleN.tsv
bowtie2 -p 4 -x reference_index -U sampleN.fastq.gz \
    | samtools sort -O BAM \
    | samtools depth -aa -m 100000000 - \
    | awk 'BEGIN {OFS="\t"} {counts[$1] = ($3 < counts[$1]) ? counts[$1] : $3} END {for (c in counts) {print c, counts[c]}}' \
    | sort -k 1 \
    >> sample_counts/sampleN.tsv

# 2. merge -- NOTE: this performs a pandas outer join and loads all counts into memory
phip merge-columns -m outer -i sample_counts -o counts.tsv

# 3. model
phip normalize-counts -m size-factors -i counts.tsv -o normalized_counts.tsv
phip gamma-poisson-model -t 99.9 -i normalized_counts.tsv -o gamma-poisson
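The size-factors normalization follows the familiar median-of-ratios idea: compute each clone's geometric mean across samples as a pseudo-reference, take each sample's median ratio to that reference, and divide by it. An illustrative sketch of that idea, not phip-stat's exact code (`size_factor_normalize` is a hypothetical name):

```python
# Sketch of median-of-ratios size-factor normalization.
import numpy as np

def size_factor_normalize(counts):
    """counts: clones x samples array; returns counts scaled per sample."""
    counts = np.asarray(counts, dtype=float)
    with np.errstate(divide="ignore"):
        logs = np.log(counts)
    log_geo_mean = logs.mean(axis=1)        # per-clone log geometric mean
    mask = np.isfinite(log_geo_mean)        # drop clones with any zero count
    ratios = logs[mask] - log_geo_mean[mask, None]
    factors = np.exp(np.median(ratios, axis=0))  # one size factor per sample
    return counts / factors
```

Using the median ratio makes the factors robust to the minority of clones that are truly enriched in a sample.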

Snakemake recipes

We include several example Snakemake recipes for easily processing large sets of samples at once, e.g., workflows/example-kallisto-GamPois-factorization.snakefile. In general the configuration section must be edited to specify the location of the raw sequencing data.

Running unit tests

Unit tests use the nose package and can be run with:

$ pip install nose  # if not already installed
$ nosetests -sv test/
