Skip to main content

CPDseq data analysis

Project description

CPDSeqer

This package is used to do CPD sequence data analysis.

Prerequisites

Install tabix

apt-get install -y tabix

Installation

Install python main package

pip install git+git://github.com/shengqh/cpdseqer.git

Usage

Demultiplex fastq file

usage: cpdseqer demultiplex [-h] -i [INPUT] -o [OUTPUT] -b [BARCODEFILE]

optional arguments:
  -h, --help            show this help message and exit
  -i [INPUT], --input [INPUT]
                        Input fastq gzipped file
  -o [OUTPUT], --output [OUTPUT]
                        Output folder
  -b [BARCODEFILE], --barcodeFile [BARCODEFILE]
                        Tab-delimited file, first column is barcode, second
                        column is sample name

for example:

cpdseqer demultiplex -i sample.fastq.gz -o . -b barcode.txt

Example of barcode.txt, columns are separated by tab

ATCGCGAT        Control
GAACTGAT        UV

Align reads to genome using bowtie2

bowtie2 -p 8  -x hg38/bowtie2_index_2.3.5.1/GRCh38.p12.genome -U Control.fastq.gz \
  --sam-RG ID:Control --sam-RG LB:Control --sam-RG SM:Control --sam-RG PL:ILLUMINA \
  -S Control.sam 2> Control.log
  
samtools view -Shu -F 256 Control.sam | samtools sort -o Control.bam -T Control -

samtools index Control.bam

rm Control.sam

You can download hg38 bowtie2 index files:

mkdir hg38
cd hg38
wget https://cqsweb.app.vumc.org/download1/cpdseqer/hg38_bowtie2.tar.gz
tar -xzvf hg38_bowtie2.tar.gz
cd ..

Extract dinucleotide from bam file

usage: cpdseqer bam2dinucleotide [-h] -i [INPUT] -g [GENOME_SEQ_FILE]
                                 [-q [MAPPING_QUALITY]] -o [OUTPUT]

optional arguments:
  -h, --help            show this help message and exit
  -i [INPUT], --input [INPUT]
                        Input BAM file
  -g [GENOME_SEQ_FILE], --genome_seq_file [GENOME_SEQ_FILE]
                        Input genome seq file
  -q [MAPPING_QUALITY], --mapping_quality [MAPPING_QUALITY]
                        Minimum mapping quality of read (default 20)
  -o [OUTPUT], --output [OUTPUT]
                        Output file name

for example:

cpdseqer bam2dinucleotide \
  -i Control.bam \
  -g hg38/bowtie2_index_2.3.4.1/GRCh38.p12.genome.fa \
  -o Control.dinucleotide.bed.bgz

You can download bam files:

wget https://cqsweb.app.vumc.org/download1/cpdseqer/UV.bam
wget https://cqsweb.app.vumc.org/download1/cpdseqer/UV.bam.bai
wget https://cqsweb.app.vumc.org/download1/cpdseqer/Control.bam
wget https://cqsweb.app.vumc.org/download1/cpdseqer/Control.bam.bai

Get statistic based on bed file

usage: cpdseqer statistic [-h] -i [INPUT] -c [COORDINATE_LIST_FILE] [-s]
                          [--category_index CATEGORY_INDEX] [--add_chr] -o
                          [OUTPUT]

optional arguments:
  -h, --help            show this help message and exit
  -i [INPUT], --input [INPUT]
                        Input dinucleotide list file, first column is file
                        location, second column is file name
  -c [COORDINATE_LIST_FILE], --coordinate_list_file [COORDINATE_LIST_FILE]
                        Input coordinate list file, first column is file
                        location, second column is file name
  -s, --space           Use space rather than tab in coordinate files
  --category_index CATEGORY_INDEX
                        Zero-based category column index in coordinate file
  --add_chr             Add chr in chromosome name in coordinate file
  -o [OUTPUT], --output [OUTPUT]
                        Output file name

for example:

cpdseqer statistic --category_index 3 \
  -i cpd__fileList1.list \
  -c cpd__fileList2.list \
  -o cpd.dinucleotide.statistic.tsv

cpd__fileList1.list (separated by tab)

Control.dinucleotide.bed.bgz      Control
UV.dinucleotide.bed.bgz           UV

cpd__fileList2.list (separated by tab)

hg38_promoter.bed      Promoter
hg38_tf.bed            TFBinding

You can download example files as following scripts. The hg38_promoter.bed contains three columns only. So, Promoter (from cpd__fileList2.list definition) will be used as category name for all entries in the hg38_promoter.bed. The hg38_tf.bed contians four columns. The forth column in hg38_tf.bed indicates TF name which will be used as category name (--category_index 3) instead of TFBinding.

wget https://cqsweb.app.vumc.org/download1/cpdseqer/UV.dinucleotide.bed.bgz
wget https://cqsweb.app.vumc.org/download1/cpdseqer/UV.dinucleotide.bed.bgz.tbi
wget https://cqsweb.app.vumc.org/download1/cpdseqer/Control.dinucleotide.bed.bgz
wget https://cqsweb.app.vumc.org/download1/cpdseqer/Control.dinucleotide.bed.bgz.tbi
wget https://cqsweb.app.vumc.org/download1/cpdseqer/cpd__fileList1.list
wget https://cqsweb.app.vumc.org/download1/cpdseqer/cpd__fileList2.list
wget https://github.com/shengqh/cpdseqer/raw/master/data/hg38_promoter.bed
wget https://github.com/shengqh/cpdseqer/raw/master/data/hg38_tf.bed


Running cpdseqer using singularity

We also build docker container for cpdseqer which can be used by singularity.

Running directly

singularity exec -e docker://shengqh/cpdseqer bowtie2 -h
singularity exec -e docker://shengqh/cpdseqer cpdseqer -h

Convert docker image to singularity image first

singularity build cpdseqer.simg docker://shengqh/cpdseqer
singularity exec -e cpdseqer.simg bowtie2 -h
singularity exec -e cpdseqer.simg cpdseqer -h

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

cpdseqer-0.2.0.tar.gz (18.0 MB view hashes)

Uploaded Source

Supported by

AWS AWS Cloud computing and Security Sponsor Datadog Datadog Monitoring Fastly Fastly CDN Google Google Download Analytics Microsoft Microsoft PSF Sponsor Pingdom Pingdom Monitoring Sentry Sentry Error logging StatusPage StatusPage Status page