Pipeline for Processing RNA-Seq datasets
# DropRNA
## Install baseq_drops
Python 3 and the `baseq_drops` package are required. Install the package with:

    pip install baseqdrops

After installation, the `baseq-Drop` command is available.
## Config file
The pipeline needs the following software and resources:
+ `star`: path to STAR, used for fast alignment of RNA-Seq reads;
+ `samtools`: path to samtools, used for sorting BAM files;
+ `whitelistDir`: directory holding the barcode whitelist files for indrop and 10X.
These files can be downloaded from XXX.
+ `cellranger_ref_<genome>`: Cell Ranger reference directory. The key steps of read alignment and gene tagging
are inspired by and borrowed from the open source cellranger pipeline
(https://github.com/10XGenomics/cellranger).
The genome index and transcriptome references can be downloaded
from https://support.10xgenomics.com/single-cell-gene-expression/software/downloads/latest.
In the config file, the Cell Ranger reference directory is named `cellranger_<genome>`.
The configuration is stored in a file such as `config_drops.ini`, which is passed to the command with `--config`:

    [Drops]
    samtools = /path/to/samtools
    star = /path/to/STAR
    whitelistDir = /path/to/whitelist_file_directory
    cellranger_hg38 = /path/to/reference/refdata-cellranger-GRCh38-1.2.0/
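To illustrate how a file with this layout can be read, here is a minimal sketch using Python's standard `configparser`. The helper `load_drops_config` is hypothetical and is not part of baseqdrops; it only assumes the `[Drops]` section and the `cellranger_<genome>` key naming described above.

```python
import configparser

# Example contents mirroring the config shown above (paths are placeholders).
CONFIG_TEXT = """
[Drops]
samtools = /path/to/samtools
star = /path/to/STAR
whitelistDir = /path/to/whitelist_file_directory
cellranger_hg38 = /path/to/reference/refdata-cellranger-GRCh38-1.2.0/
"""


def load_drops_config(text, genome):
    """Return tool paths and the reference directory for `genome`.

    Hypothetical helper for illustration; not part of baseqdrops.
    """
    parser = configparser.ConfigParser()
    parser.optionxform = str  # keep key case (e.g. whitelistDir) as written
    parser.read_string(text)
    section = parser["Drops"]
    ref_key = "cellranger_" + genome
    if ref_key not in section:
        raise KeyError(f"missing '{ref_key}' in [Drops]")
    return {
        "samtools": section["samtools"],
        "star": section["star"],
        "whitelistDir": section["whitelistDir"],
        "reference": section[ref_key],
    }


config = load_drops_config(CONFIG_TEXT, "hg38")
print(config["reference"])
```

Looking up the reference by `cellranger_<genome>` means that supporting another genome only requires one more key in the same section.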
## Process Steps
1. `Extract the Cell Barcode` Count the reads for each barcode; this generates `barcode_count.<sample>.csv`;
2. `Cell Barcode correction and filtering` Correct cell barcodes that have a 1 bp mismatch, then discard barcodes with fewer than the minimum number of reads;
3. `Split the reads of valid Cell Barcodes` Split the raw paired-end reads into 16 single-end files, keyed by the 2 bp barcode prefix, for multiprocessing. For example: `split.<sample>.<AA|AT|AC|AG...|GG>.fq`;
4. `Star Alignment` Align the FASTQ files in parallel; this generates a BAM file sorted by sequence header;
5. `Reads tagging` Tag each read's alignment position with the corresponding gene name;
6. `Generating UMI table`
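Steps 1 to 3 can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the baseqdrops implementation: all function names are hypothetical, correction is assumed to be against a whitelist, a barcode is corrected only when exactly one whitelist entry lies within one mismatch, and reads are kept in memory rather than written to `split.<sample>.<prefix>.fq` files.

```python
from collections import Counter
from itertools import product


def count_barcodes(barcodes):
    """Step 1: count how many reads carry each observed barcode."""
    return Counter(barcodes)


def one_mismatch_neighbors(barcode):
    """Yield every sequence that differs from `barcode` at exactly one base."""
    for i, base in enumerate(barcode):
        for alt in "ACGT":
            if alt != base:
                yield barcode[:i] + alt + barcode[i + 1:]


def correct_and_filter(counts, whitelist, min_reads):
    """Step 2: map raw barcodes to whitelist barcodes, then filter by reads.

    Assumption: a non-whitelist barcode is reassigned only when exactly one
    whitelist barcode is one mismatch away; ambiguous barcodes are dropped.
    """
    corrected = Counter()
    mapping = {}
    for barcode, n in counts.items():
        if barcode in whitelist:
            target = barcode
        else:
            hits = [c for c in one_mismatch_neighbors(barcode) if c in whitelist]
            if len(hits) != 1:
                continue  # unmatched or ambiguous
            target = hits[0]
        mapping[barcode] = target
        corrected[target] += n
    valid = {bc for bc, n in corrected.items() if n >= min_reads}
    return {raw: bc for raw, bc in mapping.items() if bc in valid}


def split_by_prefix(reads, mapping):
    """Step 3: bucket reads by the first two bases of the corrected barcode."""
    buckets = {"".join(p): [] for p in product("ACGT", repeat=2)}  # 16 buckets
    for raw_barcode, sequence in reads:
        barcode = mapping.get(raw_barcode)
        if barcode is not None:
            buckets[barcode[:2]].append((barcode, sequence))
    return buckets


# Toy data: (raw barcode, read sequence placeholder).
whitelist = {"AACCGG", "GGTTAA"}
reads = [("AACCGG", "r1"), ("AACCGT", "r2"), ("AACCGG", "r3"),
         ("GGTTAA", "r4"), ("CCCCCC", "r5")]
counts = count_barcodes(bc for bc, _ in reads)
mapping = correct_and_filter(counts, whitelist, min_reads=2)
buckets = split_by_prefix(reads, mapping)
print(len(buckets), [seq for _, seq in buckets["AA"]])
```

In the toy data, `AACCGT` is one mismatch away from `AACCGG` and is merged into it, while `GGTTAA` falls below `min_reads` and `CCCCCC` matches nothing, so both are discarded.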
## Run Command
The main options are:
+ `--config`: config file;
+ `--genome/-g`: genome version;
+ `--protocol/-p`: [10X|indrop|dropseq];
+ `--minreads`: minimum number of reads for a barcode to be kept;
+ `--name/-n`: sample name;
+ `--fq1/-1`: read 1 FASTQ file;
+ `--fq2/-2`: read 2 FASTQ file;
+ `--top_million_reads`: number of reads to use, in millions; mainly for testing the pipeline on a fraction of the reads (default 1000);
+ `--dir/-d`: output path.
If the hg38 reference (`cellranger_hg38`) is set in the config file, you can run:

    baseqdrops run_pipe --config ./config_drops.ini -g hg38 -p 10X --minreads 10000 -n 10X_test -1 10x_1.1.fq.gz -2 10x.2.fq.gz -d ./
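When the pipeline is launched from a Python script, for example once per sample, the same command line can be assembled as an argument list. The sketch below is a hypothetical wrapper, not part of baseqdrops; it uses only the flags that appear in the command above and checks the protocol against the three documented values.

```python
import shlex

PROTOCOLS = ("10X", "indrop", "dropseq")


def build_run_pipe_command(config, genome, protocol, minreads, name, fq1, fq2, outdir):
    """Assemble the `run_pipe` argument list from the documented flags.

    Hypothetical wrapper for illustration only.
    """
    if protocol not in PROTOCOLS:
        raise ValueError(f"protocol must be one of {PROTOCOLS}")
    return [
        "baseqdrops", "run_pipe",
        "--config", config,
        "-g", genome,
        "-p", protocol,
        "--minreads", str(minreads),
        "-n", name,
        "-1", fq1,
        "-2", fq2,
        "-d", outdir,
    ]


cmd = build_run_pipe_command(
    config="./config_drops.ini", genome="hg38", protocol="10X",
    minreads=10000, name="10X_test",
    fq1="10x_1.1.fq.gz", fq2="10x.2.fq.gz", outdir="./",
)
print(shlex.join(cmd))
# To launch the pipeline: subprocess.run(cmd, check=True)
```

Passing the list to `subprocess.run` avoids shell quoting problems with unusual sample names or paths.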
### For older version 10X results
For these data, the cell barcode length is 15 and the UMI length is 5:

    baseqdrops run_pipe --config ./config_drops.ini -g hg38 -p 10X --minreads 10000 -n 10X_test -1 10x_1.1.fq.gz -2 10x.2.fq.gz -d ./
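As a rough illustration of what these lengths mean, the sketch below slices a read into a 15-base cell barcode and a 5-base UMI. It assumes that the barcode comes first and is immediately followed by the UMI at the start of the read; that layout, and the function name, are assumptions for the example and are not taken from the baseqdrops source.

```python
def split_barcode_umi(read_seq, barcode_len=15, umi_len=5):
    """Return (cell_barcode, umi) taken from the start of `read_seq`.

    Assumed layout: barcode first, UMI immediately after it.
    """
    needed = barcode_len + umi_len
    if len(read_seq) < needed:
        raise ValueError(f"read shorter than {needed} bases")
    return read_seq[:barcode_len], read_seq[barcode_len:needed]


barcode, umi = split_barcode_umi("ACGTACGTACGTACG" + "TTGCA" + "GGGG")
print(barcode, umi)
```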