pydemult

Streamed and parallel demultiplexing of fastq files in python

Intended Audience
- Science/Research
License
- OSI Approved :: MIT License
Programming Language
- Python :: 3
Topic
- Scientific/Engineering :: Bio-Informatics

Project description

# Streamed and parallel demultiplexing of fastq files

## Quickstart

` pydemult --fastq input.fastq.gz --barcodes barcodes.txt --threads 4 --writer-threads 16 `

## Requirements and usage

pydemult allows you to demultiplex fastq files in a streamed and parallel way. It expects that a sample barcode can be matched by a regular expression from the first line of each fastq entry and that sample barcodes are known in advance.

Suppose we have a file containing sample barcodes like this:

```
Sample   Barcode
sample1  CTTCAA
sample2  CAACAA
sample3  GTACGG
```

and a typical entry in the fastq file looks like this:

```
@HWI-ST808:140:H0L10ADXX:1:1101:8463:2:NNNNNN:CTTCCA
TTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTATGATGCTGTGAGTTCC
+
@CCDDDDFHHHHHJIJFDDDDDDDDDBDDDDDBB0@B#####################
```

Since the sample barcode is six bases long, we set the corresponding `--barcode-regex` option to `(.*):(?P<CB>[ATGCN]{6})` in the call:

` pydemult --fastq input.fastq.gz --barcodes barcodes.txt --barcode-regex "(.*):(?P<CB>[ATGCN]{6})" `
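To see what this pattern captures, the following minimal sketch applies it to the read name from the example entry using Python's standard `re` module. The helper name `extract_barcode` and the use of `re.match` are assumptions made for illustration; pydemult's internal matching code may differ.

```python
import re

# Pattern from the call above; the named group CB captures the sample barcode.
BARCODE_REGEX = re.compile(r"(.*):(?P<CB>[ATGCN]{6})")


def extract_barcode(read_name):
    """Return the CB group of a read name, or None if the pattern does not match.

    Hypothetical helper for illustration, not part of pydemult's API.
    """
    match = BARCODE_REGEX.match(read_name)
    return match.group("CB") if match else None


# First line of the example fastq entry shown above.
read_name = "@HWI-ST808:140:H0L10ADXX:1:1101:8463:2:NNNNNN:CTTCCA"
barcode = extract_barcode(read_name)
print(barcode)  # CTTCCA
```

Because `(.*)` is greedy, the group `CB` is taken from the last position in the read name where a colon is followed by six bases.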

### Barcode and UMI regular expressions

By default, pydemult parses the read name for the cell barcode with regular expressions. Cell barcodes are indicated by a named capturing group called `CB`, while (optional) UMIs are indicated by a named capturing group called `UMI`. Some examples include:

- `(.*):(?P<CB>[ATGCN]{11})`, for a cell barcode of length 11 that is present after the last colon of the read name.
- `(.*):CELL_(?P<CB>[ATGCN]{10}):UMI_(?P<UMI>[ATGCN]{8})`, for a cell barcode of length 10, followed by a UMI sequence of length 8. For DropSeq data preprocessed by the [umis](https://github.com/vals/umis) tool, a regex like this is advisable.
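As a rough illustration of the second pattern, the sketch below extracts both named groups with Python's `re` module. The read name is made up for this example, and how pydemult uses the captured UMI afterwards is not shown here.

```python
import re

# Second example pattern from the list above: cell barcode (CB) plus UMI.
CB_UMI_REGEX = re.compile(
    r"(.*):CELL_(?P<CB>[ATGCN]{10}):UMI_(?P<UMI>[ATGCN]{8})"
)

# Hypothetical read name, invented for illustration only.
read_name = "@READ_1:CELL_ACGTACGTAC:UMI_TTGGCCAA"

match = CB_UMI_REGEX.match(read_name)
# groupdict() returns only the named groups, keyed by group name.
groups = match.groupdict() if match else {}

cell_barcode = groups.get("CB")
umi = groups.get("UMI")  # would be None for a pattern without a UMI group
print(cell_barcode, umi)
```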

### Output

pydemult will create a compressed fastq file for each sample barcode, with the filename taken from the corresponding Sample column entry of barcodes.txt.

### A note on multithreading

pydemult divides its work into a demultiplexing part and an output part. The main thread streams the input and lazily distributes data blobs (of size `--buffer-size`) across n different demultiplexing threads (set with `--threads`), where the actual work happens. Demultiplexed input is then sent over to m threads for writing into individual output files (set with `--writer-threads`). Reading and demultiplexing are fast and CPU-bound operations, while output speed is determined by how fast data can be written to the underlying file system. In our experience, output is much slower than demultiplexing itself and requires proportionally more cores to speed up the runtime. We obtained best results when distributing output to three threads for each demultiplexing thread (1:3 ratio of `--threads` to `--writer-threads`).
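The division of labour described above can be sketched as a small producer/consumer pipeline. This is a simplified illustration, not pydemult's implementation: it uses `threading` and `queue` from the standard library, handles read names only instead of full fastq records, matches barcodes exactly, and collects output in memory rather than in compressed files. All function and parameter names (`run`, `demultiplex`, `write`, `writer_of`) are hypothetical.

```python
import queue
import re
import threading
from itertools import islice

BARCODE_REGEX = re.compile(r"(.*):(?P<CB>[ATGCN]{6})")
STOP = None  # sentinel telling a worker to shut down


def demultiplex(chunks, write_queues, barcode_to_sample, writer_of):
    """Demultiplexing worker: group read names by sample, pass them to writers."""
    while (chunk := chunks.get()) is not STOP:
        by_sample = {}
        for read_name in chunk:
            match = BARCODE_REGEX.match(read_name)
            sample = barcode_to_sample.get(match.group("CB")) if match else None
            if sample is not None:
                by_sample.setdefault(sample, []).append(read_name)
        for sample, reads in by_sample.items():
            write_queues[writer_of[sample]].put((sample, reads))


def write(write_queue, output):
    """Writer worker: stands in for appending to one output file per sample."""
    while (item := write_queue.get()) is not STOP:
        sample, reads = item
        output[sample].extend(reads)


def run(read_names, barcode_to_sample, threads=1, writer_threads=3, buffer_size=2):
    samples = sorted(set(barcode_to_sample.values()))
    # Each sample belongs to exactly one writer, so writers never share an output.
    writer_of = {s: i % writer_threads for i, s in enumerate(samples)}
    output = {s: [] for s in samples}
    chunks = queue.Queue(maxsize=2 * threads)  # bounded, so input is read lazily
    write_queues = [queue.Queue() for _ in range(writer_threads)]

    demuxers = [
        threading.Thread(
            target=demultiplex,
            args=(chunks, write_queues, barcode_to_sample, writer_of),
        )
        for _ in range(threads)
    ]
    writers = [threading.Thread(target=write, args=(q, output)) for q in write_queues]
    for worker in demuxers + writers:
        worker.start()

    # Main thread: stream the input in blobs of buffer_size entries.
    stream = iter(read_names)
    while chunk := list(islice(stream, buffer_size)):
        chunks.put(chunk)

    # Shut down demultiplexers first, then writers, so no data is lost.
    for _ in demuxers:
        chunks.put(STOP)
    for worker in demuxers:
        worker.join()
    for q in write_queues:
        q.put(STOP)
    for worker in writers:
        worker.join()
    return output


reads = ["@r1:CTTCAA", "@r2:CAACAA", "@r3:GTACGG", "@r4:CTTCAA", "@r5:NNNNNN"]
table = {"CTTCAA": "sample1", "CAACAA": "sample2", "GTACGG": "sample3"}
result = run(reads, table, threads=1, writer_threads=3)
print(result)
```

The bounded input queue is what makes the distribution lazy in this sketch: the main thread blocks once enough blobs are waiting, so the whole input is never held in memory at once. Assigning each sample to a single writer avoids locking around shared outputs.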

## License

The project is licensed under the MIT license. See the LICENSE file for details.

Release history

- 0.6 (Apr 15, 2020)
- 0.5 (Oct 31, 2019)
- 0.4.1 (Oct 26, 2018), this version
- 0.4 (Oct 17, 2018)
- 0.3 (Oct 4, 2018)
- 0.2 (Aug 28, 2018)
- 0.1 (Aug 23, 2018)

Download files

Source Distribution

pydemult-0.4.1.tar.gz (7.0 kB)

Uploaded Oct 26, 2018 Source

File details

Details for the file pydemult-0.4.1.tar.gz.

File metadata

Download URL: pydemult-0.4.1.tar.gz
Upload date: Oct 26, 2018
Size: 7.0 kB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: twine/1.11.0 pkginfo/1.4.2 requests/2.19.1 setuptools/40.4.3 requests-toolbelt/0.8.0 tqdm/4.26.0 CPython/3.6.6

File hashes

Hashes for pydemult-0.4.1.tar.gz
Algorithm	Hash digest
SHA256	`5a44dc6c819fd4394b282d6aeff4b571b61654486ce6e4b782d3cc0731c3a0e2`
MD5	`1044393ea96c153e1e7cf84a1d8d48ec`
BLAKE2b-256	`9bc0787fd2902fb62cb2e5a5622fc328b904e99739188a4fabe972164c46ea5c`
