Skip to main content

Processing Drop-seq, 10X(3prime) and inDrop RNA-seq dataset

Project description

# baseqDrops
A versatile pipeline for processing dataset from 10X, indrop and Drop-seq.

## Install baseqDrops
We need python3 and a package called: baseqDrops, which could be installed by:

pip install baseqDrops

After install, you will have a runnable command `baseqDrops`

It is recommend for the computer or server to have memory >= 30Gb and CPU cores >=8 for efficient processing;

## Configuration file

The following software or resources are required:

+ `star`: STAR software, for fast alignment of RNA-Seq data to the genome;
+ `samtools`: For sorting the aligned bam file (version >=1.6);
+ `whitelistDir`: The barcode whitelist files for indrop and 10X should be placed under whitelistDir. These files could bed downloaded from https://github.com/beiseq/baseqDrops/tree/master/whitelist;
+ `cellranger_ref_<genome>`: The key process of read alignment and tagging to genes are inspired and borrowed from the open source cellranger pipeline(https://github.com/10XGenomics/cellranger). The references of genome index and transcriptome can be downloaded from https://support.10xgenomics.com/single-cell-gene-expression/software/downloads/latest.
In the config file, the directory of cellranger references is named as `cellranger_<genome>`.

While running command, the configures are recorded in the file called `config_drops.ini`:

[Drops]
samtools = /path/to/samtools
star = /path/to/STAR
whitelistDir = /path/to/whitelist_file_directory
cellranger_ref_hg38 = /path/to/reference/refdata-cellranger-GRCh38-1.2.0/

## For Help Informations

baseqDrops run-pipe --help

## Process Steps

1. `Cell Barcode Counting`: Counting the existed barcodes in dataset. This will generate a file named: barcode_count_<sample>.csv;
2. `Cell Barcode Correction, Aggregating and Filtering`: Correcting the cell barcodes within 1bp mismatch and then aggregating, filtering the barcode by minimum number of reads (default 5000), this will generate a valid barcode list named: barcode_stats_<sample>.csv;
3. `Split the Reads of Valid Cell Barcodes`: The raw pair-end raw reads are splitted to 16 single-end files for multiprocessing according to the 2bp prefix of the barcode; The folder of barcode_splits contains files like: split.<sample>.<AA|AT|AC|AG...|GG>.fq;
4. `Alignment to Genome using STAR`: Several (defined by --parallel/-p) STAR programs run at the same time, the results will be at folder named as star_align; The bam files are further sorted by sequence header;
5. `Reads Tagging`: Tagging the reads alignment position to the corresponding gene name;
6. `Generating Expression Table`: Both the expression table quantified by UMI (Result.UMIs.<sample>.txt) and raw read count (Result.Reads.<sample>.txt) will be generated;

## Run Pipeline

These parameters should be provided: (or run: baseqDrops run-pipe --help for information)

+ `--outdir/-d`: Output path (default ./, the result will be stored in ./<name>);
+ `--config`: Path to the config file;
+ `--genome/-g`: Genome version [hg38/mm38/hgmm];
+ `--protocol/-p`: [10X|indrop|dropseq];
+ `--minreads`: Minimum reads required for a barcode;
+ `--name/-n` : Name of sample, a folder of <outdir>/<name> will be created and be the main directory;
+ `--parallel` : The number of STAR and tagging processes runs at the same time (default is 4, need more memory for larger parallel number);
+ `--fq1/-1`: Path of Pair-end 1 sequencing file;
+ `--fq2/-2`: Path of Pair-end 2 sequencing file;
+ `--top_million_reads`: For huge dataset, you can choose to use part of the data for a quick look, the reads exceeding N million of reads will be skipped;

If your data is human origin and `cellranger_ref_hg38` has been defined in configuration file, you can run:

baseqDrops run-pipe --config ./config_drops.ini -g hg38 -p 10X --minreads 1000 -n 10X_test -1 10x_1.1.fq.gz -2 10x.2.fq.gz -d ./

## Run by Single Steps

We also provide step-wise ways for running the pipeline, all the parameters should be provided as described above, an extra "--step" should be provided, for example:

baseqDrops run-pipe --config ./config.ini -g hg38 -p dropseq --minreads 1000 -n dropseq2 --top_million_reads 20 -1 dropseq_1.1.fq.gz -2 dropseq.2.fq.gz --step count -d ./

The steps are listed:

+ `Cell Barcode Counting`: --step count
+ `Cell Barcode Correction, Aggregating and Filtering`: --step stats
+ `Split the Reads of Valid Cell Barcodes`: --step split
+ `Alignment to Genome using STAR`: --step star
+ `Reads Tagging` : --step tagging
+ `Generating Expression Table`: --step table

## Contact

For any questions, please email to: friedpine@gmail.com


Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

baseqDrops-2.0.tar.gz (20.4 kB view details)

Uploaded Source

Built Distribution

baseqDrops-2.0-py2.py3-none-any.whl (33.4 kB view details)

Uploaded Python 2 Python 3

File details

Details for the file baseqDrops-2.0.tar.gz.

File metadata

  • Download URL: baseqDrops-2.0.tar.gz
  • Upload date:
  • Size: 20.4 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/1.11.0 pkginfo/1.4.2 requests/2.18.4 setuptools/38.4.0 requests-toolbelt/0.8.0 tqdm/4.21.0 CPython/3.6.4

File hashes

Hashes for baseqDrops-2.0.tar.gz
Algorithm Hash digest
SHA256 775f40d1e4f394e3b48d44ed47cdbbff7aeed1117996f62900099f147cf6b82c
MD5 777836e05a54391ac0ffb5bd1b5b167b
BLAKE2b-256 c4c25b1323bb5da55797053b7b533b5e864a38c8363d39df365718c8beddccf3

See more details on using hashes here.

File details

Details for the file baseqDrops-2.0-py2.py3-none-any.whl.

File metadata

  • Download URL: baseqDrops-2.0-py2.py3-none-any.whl
  • Upload date:
  • Size: 33.4 kB
  • Tags: Python 2, Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/1.11.0 pkginfo/1.4.2 requests/2.18.4 setuptools/38.4.0 requests-toolbelt/0.8.0 tqdm/4.21.0 CPython/3.6.4

File hashes

Hashes for baseqDrops-2.0-py2.py3-none-any.whl
Algorithm Hash digest
SHA256 ccfbdb8f99f41fd898c09a40432c4605fc11f7fec95e32a16f5be1f11c7357ee
MD5 b701d3e4f3e02574f679e307d3ee2411
BLAKE2b-256 4f58b668bad105d17ab900757568eeaaeb0f4a824a538cfd8a21994121a673c9

See more details on using hashes here.

Supported by

AWS AWS Cloud computing and Security Sponsor Datadog Datadog Monitoring Fastly Fastly CDN Google Google Download Analytics Microsoft Microsoft PSF Sponsor Pingdom Pingdom Monitoring Sentry Sentry Error logging StatusPage StatusPage Status page