Pipeline for Processing RNA-Seq datasets

# DropRNA

## Install baseq_drops
The pipeline requires Python 3 and the baseq_drops package, which can be installed with pip:

    pip install baseqdrops

After installation, you will have a runnable command, invoked as `baseqdrops` in the examples below.

## Config file

The pipeline needs the following software and resources:

+ `star`: the STAR aligner, for fast alignment of RNA-Seq data;
+ `samtools`: used for sorting BAM files;
+ `whitelistDir`: the directory holding the barcode whitelist files for indrop and 10X.
  These files can be downloaded from XXX.
+ `cellranger_ref_<genome>`: the key steps of aligning reads and tagging them to genes
  are inspired by and borrowed from the open source cellranger pipeline
  (https://github.com/10XGenomics/cellranger).
  The genome index and transcriptome references can be downloaded
  from https://support.10xgenomics.com/single-cell-gene-expression/software/downloads/latest.
  In the config file, the directory of cellranger references is named `cellranger_<genome>`.

The settings are stored in a file named `config_drops.ini`, which is passed to the command with `--config`:

    [Drops]
    samtools = /path/to/samtools
    star = /path/to/STAR
    whitelistDir = /path/to/whitelist_file_directory
    cellranger_hg38 = /path/to/reference/refdata-cellranger-GRCh38-1.2.0/
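
As a rough illustration of how such a file can be read, the sketch below parses the `[Drops]` section with Python's standard `configparser` module. The helper name `load_drops_config` is hypothetical and not part of the package. The reference lookup assumes the `cellranger_<genome>` key naming described above.

```python
import configparser

# Example config text, copied from the section above.
EXAMPLE = """\
[Drops]
samtools = /path/to/samtools
star = /path/to/STAR
whitelistDir = /path/to/whitelist_file_directory
cellranger_hg38 = /path/to/reference/refdata-cellranger-GRCh38-1.2.0/
"""

def load_drops_config(text, genome):
    """Return the [Drops] settings for one genome (hypothetical helper)."""
    parser = configparser.ConfigParser()
    # Preserve key case; configparser lowercases keys by default.
    parser.optionxform = str
    parser.read_string(text)
    section = parser["Drops"]
    ref_key = "cellranger_" + genome
    if ref_key not in section:
        raise KeyError("no reference configured for genome: " + genome)
    return {
        "samtools": section["samtools"],
        "star": section["star"],
        "whitelistDir": section["whitelistDir"],
        "reference": section[ref_key],
    }

settings = load_drops_config(EXAMPLE, "hg38")
print(settings["reference"])
```

Failing early on a missing `cellranger_<genome>` key gives a clearer error than a later failure during alignment.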

## Process Steps
1. `Extract the Cell Barcode`: Count the reads for each barcode; this generates `barcode_count.<sample>.csv`.
2. `Cell Barcode correction and filtering`: Correct cell barcodes that have a 1 bp mismatch, then filter out barcodes with fewer than the minimum number of reads.
3. `Split the reads of valid Cell Barcodes`: The raw paired-end reads are split into 16 single-end files for multiprocessing, according to the 2 bp prefix of the barcode. For example: `split.<sample>.<AA|AT|AC|AG...|GG>.fq`.
4. `STAR Alignment`: The FASTQ files are aligned at the same time; a BAM file sorted by sequence header is generated.
5. `Reads tagging`: Tag each read's alignment position with the corresponding gene name.
6. `Generating UMI table`
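
The first three steps can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the package's implementation: all function names are hypothetical, barcodes are assumed to be already extracted as strings, and a mismatched barcode is corrected only when exactly one whitelist entry lies within one substitution of it.

```python
from collections import Counter
from itertools import product

BASES = "ACGT"

def count_barcodes(barcodes):
    """Step 1: count how many reads carry each barcode."""
    return Counter(barcodes)

def one_mismatch_neighbors(barcode):
    """Yield every sequence differing from `barcode` at exactly one position."""
    for i, original in enumerate(barcode):
        for base in BASES:
            if base != original:
                yield barcode[:i] + base + barcode[i + 1:]

def correct_and_filter(counts, whitelist, min_reads):
    """Step 2: merge 1-mismatch barcodes into whitelist entries, then filter."""
    corrected = Counter()
    for barcode, n in counts.items():
        if barcode in whitelist:
            corrected[barcode] += n
            continue
        hits = [c for c in one_mismatch_neighbors(barcode) if c in whitelist]
        # Assumption of this sketch: correct only unambiguous matches.
        if len(hits) == 1:
            corrected[hits[0]] += n
    return {bc: n for bc, n in corrected.items() if n >= min_reads}

def split_prefixes():
    """Step 3: the 16 two-base prefixes used to name the split files."""
    return ["".join(p) for p in product(BASES, repeat=2)]

# Toy data: one barcode ("AACT") is a single mismatch away from "AACG".
reads = ["AACG", "AACG", "AACT", "GGTA", "TTTT"]
whitelist = {"AACG", "GGTA"}
counts = count_barcodes(reads)
valid = correct_and_filter(counts, whitelist, min_reads=2)
print(valid)
print(len(split_prefixes()))
```

Splitting by a 2 bp prefix gives 4 x 4 = 16 groups, which is where the 16 files in step 3 come from.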

## Run Command

The main options are:

+ `--config`: path to the config file;
+ `--genome/-g`: genome version;
+ `--protocol/-p`: one of `10X`, `indrop`, or `dropseq`;
+ `--minreads`: minimum number of reads for a barcode;
+ `--name/-n`: sample name;
+ `--fq1/-1`: read 1 FASTQ file;
+ `--fq2/-2`: read 2 FASTQ file;
+ `--top_million_reads`: how many million reads to use, mainly for testing the pipeline on a fraction of the reads (default 1000);
+ `--dir/-d`: output path.

If `cellranger_hg38` is set in the config file, you can run the following:

    baseqdrops run_pipe --config ./config_drops.ini -g hg38 -p 10X --minreads 10000 -n 10X_test -1 10x_1.1.fq.gz -2 10x.2.fq.gz -d ./
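
When several samples need to be processed, the same command line can be assembled from Python. The sketch below uses only the flags shown in this document. The helper `build_run_command` is hypothetical and not part of the package, and its default values are simply those of the example above.

```python
import shlex

def build_run_command(name, fq1, fq2, genome="hg38", protocol="10X",
                      minreads=10000, config="./config_drops.ini", outdir="./"):
    """Assemble the run_pipe command as an argument list (hypothetical helper)."""
    return [
        "baseqdrops", "run_pipe",
        "--config", config,
        "-g", genome,
        "-p", protocol,
        "--minreads", str(minreads),
        "-n", name,
        "-1", fq1,
        "-2", fq2,
        "-d", outdir,
    ]

cmd = build_run_command("10X_test", "10x_1.1.fq.gz", "10x.2.fq.gz")
# Print the command for inspection; to execute it, pass the list to
# subprocess.run(cmd, check=True).
print(shlex.join(cmd))
```

Passing the arguments as a list rather than a single string avoids shell quoting problems with unusual file names.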

### For older version 10X results
In older 10X results, the cell barcode length is 15 and the UMI length is 5.

    baseqdrops run_pipe --config ./config_drops.ini -g hg38 -p 10X --minreads 10000 -n 10X_test -1 10x_1.1.fq.gz -2 10x.2.fq.gz -d ./
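
To make the two lengths concrete, the sketch below slices a sequence into a barcode and a UMI. It rests on an assumption not stated in this document: that the barcode comes first and the UMI follows it directly in the same read. The function name `split_barcode_umi` is hypothetical.

```python
def split_barcode_umi(read_seq, barcode_len=15, umi_len=5):
    """Split a read into (barcode, UMI).

    Assumes the barcode precedes the UMI with no gap between them.
    """
    needed = barcode_len + umi_len
    if len(read_seq) < needed:
        raise ValueError("read shorter than barcode + UMI length")
    return read_seq[:barcode_len], read_seq[barcode_len:needed]

# Toy read: a 15-base barcode, a 5-base UMI, then trailing bases.
read = "ACGTACGTACGTACG" + "TTGCA" + "AAAA"
barcode, umi = split_barcode_umi(read)
print(barcode, umi)
```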



This version

1.5
