Barcode identification from Long reads for AnalyZing single cell gene Expression

These details have not been verified by PyPI

Project links

Homepage

Project description

# Note: It's currently also a developmental branch where I am moving blaze in to pypi

Updates of the specific branch compared to the main branch:

Using fast-edit-distance instead of Levenshtein package for generating the empty barcode list
Add more columns into the putative barcode table:
- putative UMI
- UMI end position (for later trimming the adator-UMI sequence from each reads)
- Flanking bases before barcode and after UMI (for correction of insertion and deletion within the putative barcode and UMIs)
Add function to perform read-to-whitelist assignment. A putative barcode (16nt) with first be extended to include flanking bases from both sides. Then we scan though the whitelist and find the one with lowest subsequence ED (defined as the minimum edits required to make a shorter sequence a subsequence of the longer on). The UMI position will also be corrested if some insertion and deletion found within the 16nt putative barcode.
The bases before and included in UMI will be trimmed it the demultiplexed reads. The output format will be in fastq or fastq.gz. The header with be @<16 nt BC>_<12 nt UMI>#read_id

TODO in this branch:

restructure a code hierachy
- Dependency list/ conda env?
Build and test
github workflow for auto pushing to pip

Intended behaviour:

To install: pip install blaze

To import as a python module:

import blaze

# the function inported should be blaze.blaze, blaze.whitelist, blaze.demultiplex

To run as a excutable:
```
blaze [options] fastqs
```

BLAZE (Barcode identification from Long reads for AnalyZing single cell gene Expression)

Important Notes: This repo is actively being updated. Please make sure you have the latest release.

Keywords:

Oxford Nanopore sequencing, Demultiplexing, Single Cell, Barcode.

Overview

Combining single-cell RNA sequencing with Nanopore long-read sequencing enables isoform level analysis in single cells. However, due to the relatively high error rate in Nanopore reads, the demultiplexing of cell barcodes and Unique molecular Identifiers (UMIs) is challenging. This tool enables the accurate identification of barcodes solely from Nanopore reads. The output of BLAZE is a barcode whitelist that can be utilised by downstream tools such as FLAMES to quantify genes and isoforms in single cells. For a detailed description of how BLAZE works and its performance across different datasets, please see our Genome Biology paper.

Installation

Download and unzip the latest release, the scripts are in bin. Or with the command line:

wget https://github.com/shimlab/BLAZE/releases/download/v1.1.0/BLAZE_v1.1.0.zip
unzip BLAZE_v1.1.0.zip
cd BLAZE

Conda environment

All dependencies can be installed using "conda". The configuration file can be found at conda_env/:

conda config --set channel_priority false
conda env create -f conda_env/environment.yml
conda activate blaze

Dependencies

Biopython
pandas
numpy
tqdm
matplotlib
levenshtein

Test run

The following command runs BLAZE on a test dataset provided in /test/data. The expected output can be found here.

bash test/run_test.sh

Module:

`blaze.py`: Get putative barcodes and barcode whitelist from fastq files.

This script has been tested on Chromium Single Cell 3ʹ gene expression v3 and should be able to work on Chromium Single Cell 3ʹ gene expression v2, but it doesn't support any 10X 5' gene expression kits.

Required Input:

Folder of the fastq files
10X barcode whitelist: a file containing all possible 10x barcodes (more details). The 10X barcode whitelists for both the 10X Single Cell 3ʹ gene expression v2 and v3 chemistries have been packed in 10X_bc/. By default, this module assumes v3 chemistry and uses the file 10X_bc/3M-february-2018.zip. You may specify --kit-version=v2 to use the whitelist for v2 chemistry, or provide your own whitelist by specifying --full-white-list=<filename>. Note that barcodes outside this whitelist will never be found in the output.
expected number of cells: In the current version, the expected number of cells is a required input (specify --expect-cells=xx). Note that the output is NOT sensitive to the specified number, but a rough number is needed to determine the count threshold to output the whitelist.

Output:

Print when running: Overview stats of the number of reads with and without putative barcodes.
Putative barcode in each read, default filename: putative_bc.csv. It contains 3 columns:
- col1: read id
- col2: putative barcode (i.e. the basecalled barcode segment in each read, specifically the 16nt sequence after the identifed 10X adaptor within each read without correction for any basecalling errors)
- col3: minimum Phred score of the bases in the putative barcode
Note: col2 and 3 will be empty if no barcode is found within a read.
Cell-ranger style barcode whitelist, default filename: whitelist.csv.
"Barcode rank plot"" (or "knee plot") using the high-quality putative barcodes.

Understanding the BLAZE output

BLAZE follows the 3-step process:

Locate the putative barcode in each read: BLAZE first searches for the putative barcode (i.e. barcode in read without error correction) by locating 10X adapter and polyT in the reads. Both the sequenced strand and reverse complement strand are considered for each read. These putative barcodes and their quality scores (minQ) are recorded in putative_bc.csv. The IDs of reads where no putative barcode can be located are also listed in the file but without any putative barcode or minQ score.
Identify high-quality putative barcodes: BLAZE filters the putative barcodes to get a list of high-quality putative barcodes using the following criteria:
- High-quality putative barcodes should be an exact sequence match for barcodes in the complete 10x whitelist.
- minQ >= threshold (15 by default).
Note: this step happens internally without an output file.
Generate the barcode whitelist: BLAZE scans through the list of high-quality putative barcodes and counts the number of appearances of each unique barcode sequence.
Finally, BLAZE picks those unique barcodes whose counts are above a quantile-based threshold and writes them into whitelist.csv.

Note: With the input information provided, we could assign putative barcodes to the whitelist by analysing their edit distances. In practise we have found this is not required to identify the cell-associated barcodes. In addtion, downstream tools like FLAMES already perform this assignment and so we didn't reimplement it in BLAZE. Examples of running FLAMES using BLAZE's whitelist can be find at https://github.com/youyupei/bc_whitelist_analysis/.

Additional (optional) features

High-sensitivity mode

By default, BLAZE is configured to minimise false-positive barcode detections and is therefore relatively conservative. BLAZE has a high-sensitivity mode for users who prioritise high recall of the barcodes (and cells) present. To use specify --high-sensitivity-mode. Users should be aware that high-sensitivity mode trades higher recall (i.e. more true barcodes) for potentially lower precision (i.e. more non-cell associated barcodes) and therefore we recommended running an empty drops analysis and removing cells with an ambient profile if using BLAZE HS mode (see below).

Output a list of barcodes associated to empty droplets

Users may prioritise higher sensitivity by using high-sensitivity mode or a user-specified count threshold (run python3 blaze.py -h for more details), which may also ouput more non-cell associated barcodes. In such situations we recommended running an empty drops analysis[1] to distinguish cell-associated barcodes and barcodes from empty droplets with an ambient RNA expression profile. BLAZE can output a list of barcodes from empty droplets (empty_bc.csv) by specifying --emptydrop.

Criteria for selecting these empty droplet barcodes:

Barcodes in 10x full whitelist, and
Edit distance > 4 from all selected cell-associated BCs in the barcode whitelist and
High-quality putative barcode count < a certain number (specified using --emptydrop-max-count, default: infinity)

Example code:

Run BLAZE in default mode: the expected number of cells are set to be 1000 and run with 12 threads

python3 blaze.py --expect-cells=1000 --threads=12 path/to/fastq_pass

Run BLAZE in high-sensitivity mode: the expected number of cells are set to be 1000 and run with 12 threads

python3 blaze.py --high-sensitivity-mode --expect-cells=1000 --threads=12 path/to/fastq_pass

More information:

python3 blaze.py -h

Note: If you need to change any arguments after running blaze.py, you DO NOT have to re-run blaze.py but can use bin/update_whitelist.py. For example, after running the "Run BLAZE in default mode" code above (putative_bc.csv will be generated after running), if you also need output in "high-sensitivity mode" you could run:

python3 update_whitelist.py --expect-cells 1000 --high-sensitivity-mode --emptydrop --out-bc-whitelist=whitelist_hs putative_bc.csv

To obtain an output quickly. See more examples here or run python3 bin/update_whitelist.py -h for more details.

Citing BLAZE

If you find BLAZE useful for your work, please cite our paper:

You, Y., Prawer, Y. D., De Paoli-Iseppi, R., Hunt, C. P., Parish, C. L., Shim, H., & Clark, M. B. (2023) Identification of cell barcodes from long-read single-cell RNA-seq with BLAZE. Genome Biol 24, 66. You et al. 2023

Data availability

The data underlying the article "Identification of cell barcodes from long-read single-cell RNA-seq with BLAZE" are available from ENA under accession PRJEB54718. The processed data and scripts used in this study are available at https://github.com/youyupei/bc_whitelist_analysis/.

References

[1] Lun, A. T., Riesenfeld, S., Andrews, T., Gomes, T., & Marioni, J. C. (2019). EmptyDrops: distinguishing cells from empty droplets in droplet-based single-cell RNA sequencing data. Genome biology, 20(1), 1-9.

Project details

These details have not been verified by PyPI

Project links

Homepage

Release history Release notifications | RSS feed

2.5.1

Oct 25, 2024

2.5.0

Sep 30, 2024

2.5.0a0 pre-release

Sep 29, 2024

2.4.0

Aug 16, 2024

2.4.0a1 pre-release

Aug 15, 2024

2.3.0a2 pre-release

Jul 28, 2024

2.3.0a1 pre-release

Jul 1, 2024

2.2.1

May 17, 2024

2.2.0

Mar 19, 2024

2.2.0a1 pre-release

Mar 14, 2024

2.1.4

Nov 22, 2023

2.1.3

Nov 6, 2023

2.0.0

Oct 31, 2023

This version

2.0.0a9 pre-release

Oct 3, 2023

2.0.0a8 pre-release

Sep 20, 2023

2.0.0a7 pre-release

Sep 16, 2023

2.0.0a6 pre-release

Sep 4, 2023

2.0.0a5 pre-release

Sep 2, 2023

2.0.0a4 pre-release

Aug 24, 2023

2.0.0a3 pre-release

Aug 11, 2023

2.0.0a2 pre-release

Aug 9, 2023

2.0.0a1 pre-release

Aug 9, 2023

2.0.0a0 pre-release

Aug 5, 2023

Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distributions

No source distribution files available for this release.See tutorial on generating distribution archives.

Built Distribution

blaze2-2.0.0a9-py3-none-any.whl (17.4 MB view details)

Uploaded Oct 3, 2023 Python 3

File details

Details for the file blaze2-2.0.0a9-py3-none-any.whl.

File metadata

Download URL: blaze2-2.0.0a9-py3-none-any.whl
Upload date: Oct 3, 2023
Size: 17.4 MB
Tags: Python 3
Uploaded using Trusted Publishing? No
Uploaded via: twine/3.8.0 pkginfo/1.9.6 readme-renderer/34.0 requests/2.27.1 requests-toolbelt/1.0.0 urllib3/1.26.16 tqdm/4.64.1 importlib-metadata/4.8.3 keyring/23.4.1 rfc3986/1.5.0 colorama/0.4.5 CPython/3.6.12

File hashes

Hashes for blaze2-2.0.0a9-py3-none-any.whl
Algorithm	Hash digest
SHA256	`7f5687c19d711afcd16aa29a0713f0d7b91806a6a0d137f69457d900583514b2`
MD5	`6ece01415547c8207066575d90ab11bd`
BLAKE2b-256	`c497121a52797630df00c676a6852f53821e9c338644ff35e0a51102b0555eea`