CPDseq data analysis
Project description
CPDSeqer
This package is used to do CPD sequence data analysis.
Prerequisites
Install tabix
apt-get install -y tabix
Installation
Install python main package
pip install git+git://github.com/shengqh/cpdseqer.git
Usage
Demultiplex fastq file
usage: cpdseqer demultiplex [-h] -i [INPUT] -o [OUTPUT] -b [BARCODEFILE]
optional arguments:
-h, --help show this help message and exit
-i [INPUT], --input [INPUT]
Input fastq gzipped file
-o [OUTPUT], --output [OUTPUT]
Output folder
-b [BARCODEFILE], --barcodeFile [BARCODEFILE]
Tab-delimited file, first column is barcode, second
column is sample name
for example:
cpdseqer demultiplex -i sample.fastq.gz -o . -b barcode.txt
Example of barcode.txt, columns are separated by tab
ATCGCGAT Control
GAACTGAT UV
Align reads to genome using bowtie2
bowtie2 -p 8 -x hg38/bowtie2_index_2.3.5.1/GRCh38.p12.genome -U Control.fastq.gz \
--sam-RG ID:Control --sam-RG LB:Control --sam-RG SM:Control --sam-RG PL:ILLUMINA \
-S Control.sam 2> Control.log
samtools view -Shu -F 256 Control.sam | samtools sort -o Control.bam -T Control -
samtools index Control.bam
rm Control.sam
You can download hg38 bowtie2 index files:
mkdir hg38
cd hg38
wget https://cqsweb.app.vumc.org/download1/cpdseqer/hg38_bowtie2.tar.gz
tar -xzvf hg38_bowtie2.tar.gz
cd ..
Extract dinucleotide from bam file
usage: cpdseqer bam2dinucleotide [-h] -i [INPUT] -g [GENOME_SEQ_FILE]
[-q [MAPPING_QUALITY]] -o [OUTPUT]
optional arguments:
-h, --help show this help message and exit
-i [INPUT], --input [INPUT]
Input BAM file
-g [GENOME_SEQ_FILE], --genome_seq_file [GENOME_SEQ_FILE]
Input genome seq file
-q [MAPPING_QUALITY], --mapping_quality [MAPPING_QUALITY]
Minimum mapping quality of read (default 20)
-o [OUTPUT], --output [OUTPUT]
Output file name
for example:
cpdseqer bam2dinucleotide \
-i Control.bam \
-g hg38/bowtie2_index_2.3.4.1/GRCh38.p12.genome.fa \
-o Control.dinucleotide.bed.bgz
You can download bam files:
wget https://cqsweb.app.vumc.org/download1/cpdseqer/UV.bam
wget https://cqsweb.app.vumc.org/download1/cpdseqer/UV.bam.bai
wget https://cqsweb.app.vumc.org/download1/cpdseqer/Control.bam
wget https://cqsweb.app.vumc.org/download1/cpdseqer/Control.bam.bai
Get statistic based on bed file
usage: cpdseqer statistic [-h] -i [INPUT] -c [COORDINATE_LIST_FILE] [-s]
[--category_index CATEGORY_INDEX] [--add_chr] -o
[OUTPUT]
optional arguments:
-h, --help show this help message and exit
-i [INPUT], --input [INPUT]
Input dinucleotide list file, first column is file
location, second column is file name
-c [COORDINATE_LIST_FILE], --coordinate_list_file [COORDINATE_LIST_FILE]
Input coordinate list file, first column is file
location, second column is file name
-s, --space Use space rather than tab in coordinate files
--category_index CATEGORY_INDEX
Zero-based category column index in coordinate file
--add_chr Add chr in chromosome name in coordinate file
-o [OUTPUT], --output [OUTPUT]
Output file name
for example:
cpdseqer statistic --category_index 3 \
-i cpd__fileList1.list \
-c cpd__fileList2.list \
-o cpd.dinucleotide.statistic.tsv
cpd__fileList1.list (separated by tab)
Control.dinucleotide.bed.bgz Control
UV.dinucleotide.bed.bgz UV
cpd__fileList2.list (separated by tab)
hg38_promoter.bed Promoter
hg38_tf.bed TFBinding
You can download example files as following scripts. The hg38_promoter.bed contains three columns only. So, Promoter (from cpd__fileList2.list definition) will be used as category name for all entries in the hg38_promoter.bed. The hg38_tf.bed contians four columns. The forth column in hg38_tf.bed indicates TF name which will be used as category name (--category_index 3) instead of TFBinding.
wget https://cqsweb.app.vumc.org/download1/cpdseqer/UV.dinucleotide.bed.bgz
wget https://cqsweb.app.vumc.org/download1/cpdseqer/UV.dinucleotide.bed.bgz.tbi
wget https://cqsweb.app.vumc.org/download1/cpdseqer/Control.dinucleotide.bed.bgz
wget https://cqsweb.app.vumc.org/download1/cpdseqer/Control.dinucleotide.bed.bgz.tbi
wget https://cqsweb.app.vumc.org/download1/cpdseqer/cpd__fileList1.list
wget https://cqsweb.app.vumc.org/download1/cpdseqer/cpd__fileList2.list
wget https://github.com/shengqh/cpdseqer/raw/master/data/hg38_promoter.bed
wget https://github.com/shengqh/cpdseqer/raw/master/data/hg38_tf.bed
Running cpdseqer using singularity
We also build docker container for cpdseqer which can be used by singularity.
Running directly
singularity exec -e docker://shengqh/cpdseqer bowtie2 -h
singularity exec -e docker://shengqh/cpdseqer cpdseqer -h
Convert docker image to singularity image first
singularity build cpdseqer.simg docker://shengqh/cpdseqer
singularity exec -e cpdseqer.simg bowtie2 -h
singularity exec -e cpdseqer.simg cpdseqer -h
Project details
Release history Release notifications | RSS feed
Download files
Download the file for your platform. If you're not sure which to choose, learn more about installing packages.
Source Distribution
File details
Details for the file cpdseqer-0.2.0.tar.gz
.
File metadata
- Download URL: cpdseqer-0.2.0.tar.gz
- Upload date:
- Size: 18.0 MB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/3.1.1 pkginfo/1.5.0.1 requests/2.23.0 setuptools/46.1.3 requests-toolbelt/0.9.1 tqdm/4.44.1 CPython/3.7.2
File hashes
Algorithm | Hash digest | |
---|---|---|
SHA256 | ff7800bb65c65529f09ee92d60dbb1f78b9c50eedcda2fd0bcd21eec57718b17 |
|
MD5 | 2b84245bbb346f62065fc3c8f9ad6417 |
|
BLAKE2b-256 | 2f383d1c889e06a14013c29e29a0f1c46f0b4b4e10b060114c71e5a9d733a65f |