Bioinformatic pipeline for the CHANGE-seq assay.

CHANGE-seq: Circularization for High-throughput Analysis of Nuclease Genome-wide Effects by Sequencing

This is a repository for CHANGE-seq analytical software, which takes sample-specific paired-end FASTQ files as input and produces a list of CHANGE-seq detected off-target cleavage sites as output.

Summary

This package implements a pipeline that takes in reads from the CHANGE-seq assay and returns detected cleavage sites as output. The individual pipeline steps are:

  1. Merge: Merge read1 and read2 for easier mapping to the genome.
  2. Read Alignment: Merged paired-end reads from the assay are aligned to the reference genome using the BWA-MEM algorithm with default parameters (Li, H., 2009).
  3. Cleavage Site Identification: Mapped sites are analyzed to determine which represent high-quality cleavage sites.
  4. Visualization of Results: Identified on-target and off-target cleavage sites are rendered as a color-coded alignment map for easy analysis of results.
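
For orientation, the alignment step corresponds roughly to the manual commands sketched below. This is an illustration under assumptions, not the pipeline's own code: the helper name alignment_commands and the output file names are hypothetical, and only the standard bwa mem, samtools sort, and samtools index invocations are used. The reference FASTA is assumed to have been indexed beforehand with bwa index.

```python
import os


def alignment_commands(reference, merged_fastq, out_dir, sample,
                       bwa="bwa", samtools="samtools"):
    """Return shell commands for a default-parameter BWA-MEM alignment.

    Hypothetical helper: the file names are illustrative only.
    """
    sam = os.path.join(out_dir, sample + ".sam")
    sorted_bam = os.path.join(out_dir, sample + "_sorted.bam")
    return [
        # Align the merged reads with BWA-MEM defaults and write SAM output.
        "{0} mem {1} {2} > {3}".format(bwa, reference, merged_fastq, sam),
        # Sort the alignments by coordinate and write a BAM file.
        "{0} sort -o {1} {2}".format(samtools, sorted_bam, sam),
        # Index the sorted BAM (creates a .bai file next to it).
        "{0} index {1}".format(samtools, sorted_bam),
    ]
```

The function only builds the command strings, so they can be inspected or logged before anything is executed.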

Installation

The easiest way to install the change-seq pipeline is via conda.


conda create -n changeseq -c conda-forge -c bioconda -c anaconda -c omnia -c tsailabSJ changeseq

source activate changeseq

changeseq.py -h

## BWA 0.7.17 and samtools 1.9 are automatically installed

Alternatively, you can git clone this repository and install from source:


git clone https://github.com/tsailabSJ/changeseq

cd changeseq

pip install -r requirements.txt

python setup.py install

changeseq.py -h

## Please install BWA and samtools if you choose this option

Download Reference Genome

The CHANGEseq package requires a reference genome for read mapping. You can use any genome of your choosing, but for all of our testing and original CHANGE-seq analyses we use hg19. Be sure to (g)unzip the FASTA file before use if it is compressed.
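
If gunzip is not at hand, a compressed FASTA can also be decompressed from Python. The following is a minimal sketch using only the standard library; the function name gunzip_fasta and both paths are placeholders, not part of the package.

```python
import gzip
import shutil


def gunzip_fasta(gz_path, fasta_path):
    """Decompress a gzipped FASTA file to fasta_path and return fasta_path."""
    with gzip.open(gz_path, "rb") as src, open(fasta_path, "wb") as dst:
        # Stream in chunks so a multi-gigabyte genome is never held in memory.
        shutil.copyfileobj(src, dst)
    return fasta_path
```

The returned path is what would go into the reference_genome field of the manifest.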

Usage

The change-seq pipeline requires a manifest YAML file specifying input files, output directory, and pipeline parameters. Once the YAML file is created, users can simply run changeseq.py all --manifest /path/to/manifest.yaml

Below is an example manifest.yaml file:

reference_genome: /data/joung/genomes/Homo_sapiens_assembly19.fasta
analysis_folder: /data/joung/CHANGE-Seq/test2

bwa: bwa
samtools: samtools

read_threshold: 4
window_size: 3
mapq_threshold: 50
start_threshold: 1
gap_threshold: 3
mismatch_threshold: 6
search_radius: 30
merged_analysis: True

samples:
    U2OS_exp1_VEGFA_site_1:
        target: GGGTGGGGGGAGTTTGCTCCNGG
        read1: /data/joung/sequencing_fastq/150902_M01326_0235_000000000-AHLT8/fastq/1_S1_L001_R1_001.fastq.gz
        read2: /data/joung/sequencing_fastq/150902_M01326_0235_000000000-AHLT8/fastq/1_S1_L001_R2_001.fastq.gz
        controlread1: /data/joung/sequencing_fastq/150902_M01326_0235_000000000-AHLT8/fastq/4_S4_L001_R1_001.fastq.gz
        controlread2: /data/joung/sequencing_fastq/150902_M01326_0235_000000000-AHLT8/fastq/4_S4_L001_R2_001.fastq.gz
        description: U2OS_exp1
    U2OS_exp1_EMX1:
        target: GAGTCCGAGCAGAAGAAGAANGG
        read1: /data/joung/sequencing_fastq/150902_M01326_0235_000000000-AHLT8/fastq/2_S2_L001_R1_001.fastq.gz
        read2: /data/joung/sequencing_fastq/150902_M01326_0235_000000000-AHLT8/fastq/2_S2_L001_R2_001.fastq.gz
        controlread1: /data/joung/sequencing_fastq/150902_M01326_0235_000000000-AHLT8/fastq/4_S4_L001_R1_001.fastq.gz
        controlread2: /data/joung/sequencing_fastq/150902_M01326_0235_000000000-AHLT8/fastq/4_S4_L001_R2_001.fastq.gz
        description: U2OS_exp1
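
A manifest like the one above can be sanity-checked before a long run is launched. The sketch below is not part of the package: it assumes the manifest has already been parsed into a dictionary (for instance with PyYAML's yaml.safe_load), the helper name missing_manifest_keys is hypothetical, and the key lists are copied from the example above. The numeric threshold keys are left out for brevity.

```python
# Key names taken from the example manifest; threshold keys are omitted here.
REQUIRED_TOP_LEVEL = ("reference_genome", "analysis_folder", "bwa", "samtools", "samples")
REQUIRED_PER_SAMPLE = ("target", "read1", "read2", "controlread1", "controlread2", "description")


def missing_manifest_keys(manifest):
    """Return a sorted list of missing keys, e.g. 'samples.EMX1.read2'."""
    missing = [key for key in REQUIRED_TOP_LEVEL if key not in manifest]
    for name, details in (manifest.get("samples") or {}).items():
        # A sample with no details at all is treated as missing every key.
        missing.extend(
            "samples.{0}.{1}".format(name, key)
            for key in REQUIRED_PER_SAMPLE
            if key not in (details or {})
        )
    return sorted(missing)
```

An empty list means every key checked here is present; it says nothing about whether the referenced files exist.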

Quickstart


git clone https://github.com/tsailabSJ/changeseq

cd changeseq/test

changeseq.py all --manifest CIRCLEseq_MergedTest.yaml
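
The same command can be launched from a Python script, for example when several manifests are processed in a batch. This is a minimal sketch assuming changeseq.py is on the PATH of the active environment; the helper names changeseq_command and run_changeseq are hypothetical.

```python
import subprocess


def changeseq_command(manifest_path):
    """Build the argument list for a full pipeline run on one manifest."""
    return ["changeseq.py", "all", "--manifest", manifest_path]


def run_changeseq(manifest_path):
    """Run the pipeline and raise CalledProcessError on a non-zero exit."""
    subprocess.check_call(changeseq_command(manifest_path))
```

Passing the arguments as a list avoids shell quoting problems when the manifest path contains spaces.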

Writing A Manifest File

Running the end-to-end analysis of the CHANGEseq package requires a number of inputs. To simplify their formatting and to encourage reproducibility, these parameters are passed to the pipeline via a manifest formatted as a YAML file, which allows easy-to-read specification of key-value pairs. The following fields are required in the manifest:

  • reference_genome: The absolute path to the reference genome FASTA file.
  • analysis_folder: The absolute path to the folder in which all pipeline outputs will be saved (referred to as output_folder under Pipeline Output).
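  • bwa: The absolute path to the bwa executable, or simply bwa if it is on your PATH, as in the example manifest.
  • samtools: The absolute path to the samtools executable, or simply samtools if it is on your PATH, as in the example manifest.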
  • read_threshold: The minimum number of reads at a location for that location to be called as a site. We recommend leaving it at the default value of 4.
  • window_size: Size of the sliding window. We recommend leaving it at the default value of 3.
  • mapq_threshold: Minimum read mapping quality score. We recommend leaving it at the default value of 50.
  • start_threshold: Tolerance for breakpoint location. We recommend leaving it at the default value of 1.
  • gap_threshold: Distance between breakpoints. We recommend leaving it at the default value of 3 for Cas9.
  • mismatch_threshold: Number of tolerated gaps in the fuzzy target search step. We recommend leaving it at the default value of 6.
  • read_length: Fastq file read length, default is 151.
  • PAM: PAM sequence, default is NGG.
  • merged_analysis: Whether or not the paired-read merging step should take place. The example manifest sets it to True.
  • samples: Lists the samples you wish to analyze and the details for each. Each sample name should be nested under the top level samples key, and each sample detail should be nested under the sample name. See the sample manifest for an example.
    • For each sample, you must provide the following parameters:
      • target: Target sequence for that sample. Accepts degenerate bases.
      • read1: The absolute path to the .FASTQ(.gz) file containing the read1 reads.
      • read2: The absolute path to the .FASTQ(.gz) file containing the read2 reads.
      • controlread1: The absolute path to the .FASTQ(.gz) file containing the control read1 reads.
      • controlread2: The absolute path to the .FASTQ(.gz) file containing the control read2 reads.
      • description: A brief description of the sample.
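
To illustrate what a degenerate target such as GAGTCCGAGCAGAAGAAGAANGG stands for, the sketch below expands the standard IUPAC nucleotide codes into a regular expression. It covers exact matching only and is not the pipeline's search routine, which tolerates mismatches up to mismatch_threshold; the function name target_to_regex is hypothetical.

```python
import re

# Standard IUPAC nucleotide codes and the bases each one stands for.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "[AG]", "Y": "[CT]", "S": "[CG]", "W": "[AT]",
    "K": "[GT]", "M": "[AC]", "B": "[CGT]", "D": "[AGT]",
    "H": "[ACT]", "V": "[ACG]", "N": "[ACGT]",
}


def target_to_regex(target):
    """Compile a target sequence with degenerate bases into a regex."""
    # An unknown character raises KeyError, which flags a malformed target.
    return re.compile("".join(IUPAC[base] for base in target.upper()))
```

For example, target_to_regex("GAGTCCGAGCAGAAGAAGAANGG") matches the same protospacer followed by AGG, CGG, GGG, or TGG.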

Pipeline Output

When running the full pipeline, the results of each step are written to the output_folder (analysis_folder in the example manifest), in a separate folder for each step. The output folders and their respective contents are as follows:

  • output_folder/aligned: Contains an alignment .sam, alignment .bam, sorted bam, and .bai index file for each sample.
  • output_folder/fastq: Merged .fastq.gz files for each sample.
  • output_folder/identified: Contains tab-delimited .txt files for each sample containing the identified DSBs, control DSBs, filtered DSBs, and read quantification.
  • output_folder/visualization: Contains a .svg vector image representing an alignment of all detected off-targets to the target site for each sample.
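
After a run, the layout above can be checked quickly from Python. This is a minimal sketch, not part of the package: the helper name summarize_outputs is hypothetical and the subfolder names are taken from the list above.

```python
import os

# Subfolder names as listed in the Pipeline Output section.
OUTPUT_SUBFOLDERS = ("aligned", "fastq", "identified", "visualization")


def summarize_outputs(output_folder):
    """Map each expected subfolder to its sorted file names (None if absent)."""
    summary = {}
    for sub in OUTPUT_SUBFOLDERS:
        path = os.path.join(output_folder, sub)
        summary[sub] = sorted(os.listdir(path)) if os.path.isdir(path) else None
    return summary
```

A value of None for a subfolder suggests that the corresponding step did not run or did not finish.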

FAQ

None yet. We will keep this updated as needed.
