baseqDrops

Processing Drop-seq, 10X (3prime) and inDrop RNA-seq datasets

Development Status
- 3 - Alpha
Intended Audience
- Developers
License
- OSI Approved :: MIT License
Programming Language
- Python :: 3.6
Topic
- Software Development :: Build Tools

# baseqDrops
A versatile pipeline for processing datasets from 10X, inDrop and Drop-seq.

## Install baseqDrops
baseqDrops requires Python 3. The package can be installed with pip:

pip install baseqDrops

After installation, the `baseqDrops` command is available.

For efficient processing, a computer or server with at least 30 GB of memory and at least 8 CPU cores is recommended.

## Configuration file

The following software or resources are required:

+ `star`: The STAR aligner, for fast alignment of RNA-Seq reads to the genome;
+ `samtools`: For sorting the aligned BAM files (version >= 1.6);
+ `whitelistDir`: The directory holding the barcode whitelist files for indrop and 10X. These files can be downloaded from https://github.com/beiseq/baseqDrops/tree/master/whitelist;
+ `cellranger_ref_<genome>`: The directory of the cellranger reference for a genome. The key steps of read alignment and gene tagging are inspired by and borrowed from the open-source cellranger pipeline (https://github.com/10XGenomics/cellranger). The genome index and transcriptome references can be downloaded from https://support.10xgenomics.com/single-cell-gene-expression/software/downloads/latest.
In the config file, the cellranger reference directory is given under the key `cellranger_ref_<genome>`, for example `cellranger_ref_hg38`.

The settings used when running a command are kept in a configuration file called `config_drops.ini`:

[Drops]
samtools = /path/to/samtools
star = /path/to/STAR
whitelistDir = /path/to/whitelist_file_directory
cellranger_ref_hg38 = /path/to/reference/refdata-cellranger-GRCh38-1.2.0/
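
Before a long run it can help to check that the configuration file contains every entry the chosen genome needs. The following is a minimal sketch using Python's standard `configparser`; the helper name `load_drops_config` and the `REQUIRED_KEYS` tuple are hypothetical and are not part of baseqDrops, which may validate its configuration differently.

```python
import configparser

# Keys taken from the list above; the genome-specific key is added per call.
REQUIRED_KEYS = ("samtools", "star", "whitelistDir")

def load_drops_config(text, genome):
    """Parse config text and return the [Drops] entries needed for `genome`."""
    parser = configparser.ConfigParser()
    parser.read_string(text)
    if not parser.has_section("Drops"):
        raise KeyError("missing [Drops] section")
    section = parser["Drops"]
    # configparser matches option names case-insensitively, so
    # `whitelistDir` is found even though keys are stored in lowercase.
    wanted = REQUIRED_KEYS + ("cellranger_ref_" + genome,)
    missing = [key for key in wanted if key not in section]
    if missing:
        raise KeyError("missing config entries: " + ", ".join(missing))
    return {key: section[key] for key in wanted}

EXAMPLE = """
[Drops]
samtools = /path/to/samtools
star = /path/to/STAR
whitelistDir = /path/to/whitelist_file_directory
cellranger_ref_hg38 = /path/to/reference/refdata-cellranger-GRCh38-1.2.0/
"""

settings = load_drops_config(EXAMPLE, "hg38")
print(settings["star"])
```

Asking for a genome that has no `cellranger_ref_<genome>` entry raises a `KeyError` naming the missing key, which surfaces the problem before any alignment starts.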

## Getting Help

baseqDrops run-pipe --help

## Process Steps

1. `Cell Barcode Counting`: Counts the barcodes present in the dataset. This generates a file named barcode_count_<sample>.csv;
2. `Cell Barcode Correction, Aggregating and Filtering`: Corrects cell barcodes within a 1 bp mismatch, aggregates them, and filters the barcodes by a minimum number of reads (default 5000). This generates a list of valid barcodes named barcode_stats_<sample>.csv;
3. `Split the Reads of Valid Cell Barcodes`: The raw paired-end reads are split into 16 single-end files for multiprocessing, according to the 2 bp prefix of the barcode. The barcode_splits folder contains files like split.<sample>.<AA|AT|AC|AG...|GG>.fq;
4. `Alignment to Genome using STAR`: Several STAR processes (the number is set by `--parallel`) run at the same time, and the results are written to the star_align folder. The BAM files are then sorted by sequence header;
5. `Reads Tagging`: Tags each read's alignment position with the corresponding gene name;
6. `Generating Expression Table`: Generates two expression tables, one quantified by UMI (Result.UMIs.<sample>.txt) and one by raw read count (Result.Reads.<sample>.txt);
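
To make steps 2 and 3 more concrete, here is a minimal sketch of 1 bp barcode correction against a whitelist, aggregation, filtering by read count, and the 16 two-base prefixes used for splitting. It is an illustration under stated assumptions, not the baseqDrops implementation: the function names are hypothetical, a whitelist-based protocol is assumed, and the rule of discarding barcodes that match more than one whitelist entry is an assumption of this sketch.

```python
from collections import Counter

BASES = "ACGT"

def one_mismatch_neighbors(barcode):
    """Yield every sequence that differs from `barcode` at exactly one position."""
    for i, original in enumerate(barcode):
        for base in BASES:
            if base != original:
                yield barcode[:i] + base + barcode[i + 1:]

def correct_and_filter(raw_counts, whitelist, min_reads=5000):
    """Merge 1-mismatch barcodes into whitelist barcodes, then filter by reads."""
    merged = Counter()
    for barcode, count in raw_counts.items():
        if barcode in whitelist:
            merged[barcode] += count
            continue
        # Assumption: correct only when exactly one whitelist barcode is 1 bp away.
        hits = [n for n in one_mismatch_neighbors(barcode) if n in whitelist]
        if len(hits) == 1:
            merged[hits[0]] += count
    return {bc: n for bc, n in merged.items() if n >= min_reads}

# The 16 two-base prefixes (AA, AC, ..., TT) that select a split file.
PREFIXES = [a + b for a in BASES for b in BASES]

def split_prefix(barcode):
    """Return the 2 bp prefix of a barcode."""
    return barcode[:2]

# Toy data: 6 bp barcodes are used only to keep the example short.
raw = {"AACCGG": 4800, "AACCGT": 300, "TTGGCC": 900, "TTGGCA": 50, "GGGGGG": 7000}
whitelist = {"AACCGG", "TTGGCC"}
valid = correct_and_filter(raw, whitelist, min_reads=5000)
print(valid)  # {'AACCGG': 5100}
```

In this toy data `AACCGT` is one mismatch from `AACCGG`, so its 300 reads push that barcode over the threshold, while `GGGGGG` is dropped despite its high count because it is not within 1 bp of any whitelist entry.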

## Run Pipeline

The following parameters should be provided (run `baseqDrops run-pipe --help` for more information):

+ `--outdir/-d`: Output path (default ./; the results are stored in ./<name>);
+ `--config`: Path to the config file;
+ `--genome/-g`: Genome version [hg38/mm38/hgmm];
+ `--protocol/-p`: [10X|indrop|dropseq];
+ `--minreads`: Minimum reads required for a barcode;
+ `--name/-n`: Name of the sample; a folder <outdir>/<name> is created and used as the main directory;
+ `--parallel`: The number of STAR and tagging processes that run at the same time (default 4; a larger number needs more memory);
+ `--fq1/-1`: Path of the paired-end read 1 sequencing file;
+ `--fq2/-2`: Path of the paired-end read 2 sequencing file;
+ `--top_million_reads`: For a huge dataset, you can process only part of the data for a quick look; reads beyond the first N million are skipped;

If your data is of human origin and `cellranger_ref_hg38` has been defined in the configuration file, you can run:

baseqDrops run-pipe --config ./config_drops.ini -g hg38 -p 10X --minreads 1000 -n 10X_test -1 10x_1.1.fq.gz -2 10x.2.fq.gz -d ./

## Run by Single Steps

The pipeline can also be run one step at a time. All the parameters described above should still be provided, together with an extra `--step` option, for example:

baseqDrops run-pipe --config ./config.ini -g hg38 -p dropseq --minreads 1000 -n dropseq2 --top_million_reads 20 -1 dropseq_1.1.fq.gz -2 dropseq.2.fq.gz --step count -d ./

The available steps are:

+ `Cell Barcode Counting`: --step count
+ `Cell Barcode Correction, Aggregating and Filtering`: --step stats
+ `Split the Reads of Valid Cell Barcodes`: --step split
+ `Alignment to Genome using STAR`: --step star
+ `Reads Tagging` : --step tagging
+ `Generating Expression Table`: --step table
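
When the steps are run one by one, a small driver script keeps the shared parameters in one place. The following is a sketch only: the helper `build_step_commands` is hypothetical, it uses just the flags documented above, and it assumes the steps are meant to run in the listed order. The commands are printed rather than executed.

```python
import shlex

# Step names in the order listed above.
STEPS = ["count", "stats", "split", "star", "tagging", "table"]

def build_step_commands(config, genome, protocol, name, fq1, fq2,
                        outdir="./", minreads=None):
    """Return one `baseqDrops run-pipe` argument list per step."""
    base = ["baseqDrops", "run-pipe", "--config", config,
            "-g", genome, "-p", protocol, "-n", name,
            "-1", fq1, "-2", fq2, "-d", outdir]
    if minreads is not None:
        base += ["--minreads", str(minreads)]
    return [base + ["--step", step] for step in STEPS]

commands = build_step_commands(
    config="./config_drops.ini", genome="hg38", protocol="dropseq",
    name="dropseq2", fq1="dropseq_1.1.fq.gz", fq2="dropseq.2.fq.gz",
    minreads=1000,
)

for cmd in commands:
    print(shlex.join(cmd))
    # To execute for real, replace the print with:
    # subprocess.run(cmd, check=True)
```

Running each step with `check=True` would stop the sequence at the first failing step, so a later step never starts on incomplete output.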

## Contact

For any questions, please email: friedpine@gmail.com

Release history

- 2.0 (this version): Feb 2, 2019
- 1.5: Nov 21, 2018

Download files

Source Distribution

baseqDrops-2.0.tar.gz (20.4 kB)

Uploaded Feb 2, 2019 Source

Built Distribution

baseqDrops-2.0-py2.py3-none-any.whl (33.4 kB)

Uploaded Feb 2, 2019 (Python 2, Python 3)

File details

Details for the file baseqDrops-2.0.tar.gz.

File metadata

Download URL: baseqDrops-2.0.tar.gz
Upload date: Feb 2, 2019
Size: 20.4 kB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: twine/1.11.0 pkginfo/1.4.2 requests/2.18.4 setuptools/38.4.0 requests-toolbelt/0.8.0 tqdm/4.21.0 CPython/3.6.4

File hashes

Hashes for baseqDrops-2.0.tar.gz
Algorithm	Hash digest
SHA256	`775f40d1e4f394e3b48d44ed47cdbbff7aeed1117996f62900099f147cf6b82c`
MD5	`777836e05a54391ac0ffb5bd1b5b167b`
BLAKE2b-256	`c4c25b1323bb5da55797053b7b533b5e864a38c8363d39df365718c8beddccf3`

File details

Details for the file baseqDrops-2.0-py2.py3-none-any.whl.

File metadata

Download URL: baseqDrops-2.0-py2.py3-none-any.whl
Upload date: Feb 2, 2019
Size: 33.4 kB
Tags: Python 2, Python 3
Uploaded using Trusted Publishing? No
Uploaded via: twine/1.11.0 pkginfo/1.4.2 requests/2.18.4 setuptools/38.4.0 requests-toolbelt/0.8.0 tqdm/4.21.0 CPython/3.6.4

File hashes

Hashes for baseqDrops-2.0-py2.py3-none-any.whl
Algorithm	Hash digest
SHA256	`ccfbdb8f99f41fd898c09a40432c4605fc11f7fec95e32a16f5be1f11c7357ee`
MD5	`b701d3e4f3e02574f679e307d3ee2411`
BLAKE2b-256	`4f58b668bad105d17ab900757568eeaaeb0f4a824a538cfd8a21994121a673c9`
