Processing Drop-seq, 10X (3prime) and inDrop RNA-seq datasets
# baseqDrops
A versatile pipeline for processing datasets from 10X, indrop and Drop-seq.
## Install baseqDrops
baseqDrops requires Python 3 and is installed with pip:

    pip install baseqDrops

After installation, the `baseqDrops` command is available.
For efficient processing, a computer or server with at least 30 GB of memory and at least 8 CPU cores is recommended.
## Configuration file
The following software or resources are required:
+ `star`: STAR software, for fast alignment of RNA-Seq data to the genome;
+ `samtools`: For sorting the aligned bam file (version >=1.6);
+ `whitelistDir`: The barcode whitelist files for indrop and 10X should be placed under `whitelistDir`. These files can be downloaded from https://github.com/beiseq/baseqDrops/tree/master/whitelist;
+ `cellranger_ref_<genome>`: The key steps of read alignment and gene tagging are inspired by and borrowed from the open-source cellranger pipeline (https://github.com/10XGenomics/cellranger). The genome index and transcriptome references can be downloaded from https://support.10xgenomics.com/single-cell-gene-expression/software/downloads/latest.

In the config file, the cellranger reference directory is set under the key `cellranger_ref_<genome>`.
These settings are recorded in a configuration file, for example `config_drops.ini`:

    [Drops]
    samtools = /path/to/samtools
    star = /path/to/STAR
    whitelistDir = /path/to/whitelist_file_directory
    cellranger_ref_hg38 = /path/to/reference/refdata-cellranger-GRCh38-1.2.0/
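
Before launching a run, it can help to check that the config file contains every entry listed above. The following is a minimal sketch using Python's standard `configparser`; the helper name `load_drops_config` is hypothetical and is not part of baseqDrops, and the way baseqDrops itself parses the file may differ.

```python
import configparser


def load_drops_config(path, genome):
    """Read the [Drops] section and return the entries needed for one genome.

    Hypothetical helper for illustration; not part of the baseqDrops package.
    """
    parser = configparser.ConfigParser()
    # ConfigParser lowercases keys by default; keep them as written so that
    # `whitelistDir` is found under its documented name.
    parser.optionxform = str
    if not parser.read(path):
        raise FileNotFoundError(f"config file not found: {path}")
    section = parser["Drops"]
    ref_key = f"cellranger_ref_{genome}"
    required = ["samtools", "star", "whitelistDir", ref_key]
    missing = [key for key in required if key not in section]
    if missing:
        raise KeyError(f"missing entries in [Drops]: {', '.join(missing)}")
    return {key: section[key] for key in required}
```

For example, `load_drops_config("./config_drops.ini", "hg38")` would return a dictionary with the four paths, or raise `KeyError` if `cellranger_ref_hg38` is not defined.
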
## Help Information

    baseqDrops run-pipe --help

## Process Steps
1. `Cell Barcode Counting`: Counts the barcodes present in the dataset. This generates a file named barcode_count_<sample>.csv;
2. `Cell Barcode Correction, Aggregating and Filtering`: Corrects cell barcodes within a 1 bp mismatch, aggregates them, and filters barcodes by a minimum number of reads (default 5000). This generates a list of valid barcodes named barcode_stats_<sample>.csv;
3. `Split the Reads of Valid Cell Barcodes`: The raw paired-end reads are split into 16 single-end files, according to the 2 bp prefix of the barcode, for multiprocessing. The barcode_splits folder contains files like split.<sample>.<AA|AT|AC|AG...|GG>.fq;
4. `Alignment to Genome using STAR`: Several STAR processes (the number is set by `--parallel`) run at the same time, and the results are written to the star_align folder. The BAM files are then sorted by sequence header;
5. `Reads Tagging`: Tags each read with the name of the gene at its alignment position;
6. `Generating Expression Table`: Generates expression tables quantified both by UMI (Result.UMIs.<sample>.txt) and by raw read count (Result.Reads.<sample>.txt);
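
To make steps 2 and 3 more concrete, here is a minimal sketch of whitelist-based barcode correction within one mismatch, aggregation, filtering by read count, and bucketing by 2 bp prefix. It is an illustration under stated assumptions, not the code baseqDrops runs: all function names are hypothetical, ambiguous barcodes (more than one whitelist match) are simply dropped, and the threshold is assumed to be inclusive. Protocols without a whitelist would need a different correction strategy.

```python
from collections import Counter

BASES = "ACGT"


def one_mismatch_neighbours(barcode):
    """Yield every sequence that differs from `barcode` at exactly one position."""
    for i, original in enumerate(barcode):
        for base in BASES:
            if base != original:
                yield barcode[:i] + base + barcode[i + 1:]


def correct_barcode(barcode, whitelist):
    """Return the whitelist barcode within 1 mismatch, or None if absent or ambiguous."""
    if barcode in whitelist:
        return barcode
    hits = {n for n in one_mismatch_neighbours(barcode) if n in whitelist}
    # Assumption: a barcode matching several whitelist entries is discarded.
    return hits.pop() if len(hits) == 1 else None


def valid_barcodes(raw_counts, whitelist, min_reads=5000):
    """Correct, aggregate, and filter a {barcode: read_count} mapping."""
    aggregated = Counter()
    for barcode, n_reads in raw_counts.items():
        corrected = correct_barcode(barcode, whitelist)
        if corrected is not None:
            aggregated[corrected] += n_reads
    # Assumption: barcodes with exactly `min_reads` reads are kept.
    return {bc: n for bc, n in aggregated.items() if n >= min_reads}


def split_bucket(barcode):
    """Return the 2 bp prefix used to choose one of the 16 split files."""
    return barcode[:2]
```

Because the prefix has two positions with four possible bases each, there are 4 x 4 = 16 buckets, which matches the 16 split files described in step 3.
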
## Run Pipeline
The following parameters should be provided (run `baseqDrops run-pipe --help` for details):
+ `--outdir/-d`: Output path (default ./; results are stored in ./<name>);
+ `--config`: Path to the config file;
+ `--genome/-g`: Genome version [hg38/mm38/hgmm];
+ `--protocol/-p`: Protocol [10X|indrop|dropseq];
+ `--minreads`: Minimum number of reads required for a barcode;
+ `--name/-n`: Sample name; the folder <outdir>/<name> is created and used as the main directory;
+ `--parallel`: Number of STAR and tagging processes run at the same time (default 4; larger values need more memory);
+ `--fq1/-1`: Path to the paired-end read 1 file;
+ `--fq2/-2`: Path to the paired-end read 2 file;
+ `--top_million_reads`: For very large datasets, process only the first N million reads for a quick look; reads beyond that are skipped;

If your data is of human origin and `cellranger_ref_hg38` is defined in the configuration file, you can run:

    baseqDrops run-pipe --config ./config_drops.ini -g hg38 -p 10X --minreads 1000 -n 10X_test -1 10x_1.1.fq.gz -2 10x.2.fq.gz -d ./
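
When many samples are processed, the same command line can be assembled from a script. The sketch below builds the argument list from the flags documented above and prints it; the helper `build_run_pipe_command` is hypothetical, and `shlex.join` needs Python 3.8 or later. It only constructs the command and does not run it.

```python
import shlex

PROTOCOLS = {"10X", "indrop", "dropseq"}


def build_run_pipe_command(config, genome, protocol, name, fq1, fq2,
                           outdir="./", minreads=None):
    """Return the argv list for `baseqDrops run-pipe` using the documented flags."""
    if protocol not in PROTOCOLS:
        raise ValueError(f"unknown protocol: {protocol}")
    cmd = ["baseqDrops", "run-pipe",
           "--config", config, "-g", genome, "-p", protocol,
           "-n", name, "-1", fq1, "-2", fq2, "-d", outdir]
    if minreads is not None:
        cmd += ["--minreads", str(minreads)]
    return cmd


cmd = build_run_pipe_command("./config_drops.ini", "hg38", "10X", "10X_test",
                             "10x_1.1.fq.gz", "10x.2.fq.gz", minreads=1000)
print(shlex.join(cmd))
# To launch it, pass the list to subprocess.run(cmd, check=True).
```

Passing the arguments as a list avoids shell quoting problems when sample names or paths contain spaces.
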
## Run by Single Steps
The pipeline can also be run step by step. Provide all the parameters described above plus an extra `--step`, for example:

    baseqDrops run-pipe --config ./config.ini -g hg38 -p dropseq --minreads 1000 -n dropseq2 --top_million_reads 20 -1 dropseq_1.1.fq.gz -2 dropseq.2.fq.gz --step count -d ./

The available steps are:
+ `Cell Barcode Counting`: `--step count`
+ `Cell Barcode Correction, Aggregating and Filtering`: `--step stats`
+ `Split the Reads of Valid Cell Barcodes`: `--step split`
+ `Alignment to Genome using STAR`: `--step star`
+ `Reads Tagging`: `--step tagging`
+ `Generating Expression Table`: `--step table`
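
The step names above can be driven from a small script, for instance to resume a run from a later step. This is a sketch under the assumption that the steps are run in the order listed and that each step relies on the output of the previous ones; the function names are hypothetical and `base_args` stands for the shared parameters (`--config`, `-g`, `-p`, `-n`, `-1`, `-2`, `-d`, and so on).

```python
import subprocess

# Step names as listed above, in pipeline order.
STEPS = ["count", "stats", "split", "star", "tagging", "table"]


def stepwise_commands(base_args, start="count"):
    """Yield one argv list per step, from `start` through the last step."""
    if start not in STEPS:
        raise ValueError(f"unknown step: {start}")
    for step in STEPS[STEPS.index(start):]:
        yield ["baseqDrops", "run-pipe", *base_args, "--step", step]


def run_stepwise(base_args, start="count"):
    """Run the steps one after another, stopping at the first failure."""
    for cmd in stepwise_commands(base_args, start):
        # check=True raises CalledProcessError if a step exits with an error.
        subprocess.run(cmd, check=True)
```

Calling `run_stepwise(base_args, start="star")` would then rerun alignment, tagging, and table generation, assuming the outputs of the earlier steps are already present in the sample folder.
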
## Contact
For any questions, please email friedpine@gmail.com.