FBILR: Find Barcode In Long Reads

Description

FBILR finds the best-matched barcode in each long read and reports detailed information about the hit, such as direction, location, and edit distance. Because the barcode is usually located at one end of the read (head or tail), and long reads are typically longer than 1,000 bp, FBILR restricts the search to within 200 nt of each end (-w option) to reduce computation and save time. FBILR can also run in parallel (-t option).
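The window restriction can be pictured with a minimal Python sketch. It is not FBILR's code: the helper name `end_windows` is hypothetical, and letting the two windows overlap for reads shorter than twice the width is an assumption, since the text does not say how short reads are handled.

```python
def end_windows(read, width=200):
    """Return (head, tail, tail_offset) search windows for one read.

    tail_offset is the 0-based position in the read where the tail window
    starts, so window coordinates can be mapped back to read coordinates.
    Assumption: for reads shorter than 2 * width the windows simply overlap.
    """
    head = read[:width]
    tail_offset = max(len(read) - width, 0)
    tail = read[tail_offset:]
    return head, tail, tail_offset


# A made-up 1,106 nt read: only 400 nt of it would be searched.
read = "A" * 50 + "ACGTACGT" + "C" * 1000 + "TTGGCCAA" + "G" * 40
head, tail, offset = end_windows(read, width=200)
print(len(head), len(tail), offset)  # 200 200 906
```

Searching two fixed-size windows keeps the cost per read independent of read length, which is where the time saving comes from.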

In FBILR, the edit distance measures the difference between the barcode sequence and the reference sequence, counting mismatched, inserted, and deleted bases. The edit distance is calculated with edlib.
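To make the definition concrete, here is a small pure-Python stand-in for this kind of search. FBILR itself uses edlib; the dynamic-programming function below (`infix_edit_distance`, a hypothetical name) only illustrates the idea of finding the lowest-cost placement of a barcode anywhere inside a window, and it is far slower than edlib.

```python
def infix_edit_distance(barcode, window):
    """Find the best match of barcode anywhere inside window.

    Returns (distance, start, end); start and end are 0-based and
    end-exclusive. Bases of window outside the match cost nothing.
    """
    m = len(barcode)
    # prev[i] = (cost, start) of the best alignment of barcode[:i]
    # that ends at the previous window position.
    prev = [(i, 0) for i in range(m + 1)]
    best = (m, 0, 0)  # matching nothing costs len(barcode)
    for j, base in enumerate(window, start=1):
        cur = [(0, j)]  # a match may start at any window position for free
        for i in range(1, m + 1):
            # mismatch (or match) of barcode[i - 1] against window[j - 1]
            sub = (prev[i - 1][0] + (barcode[i - 1] != base), prev[i - 1][1])
            # barcode base missing from the read
            dele = (cur[i - 1][0] + 1, cur[i - 1][1])
            # extra base in the read
            ins = (prev[i][0] + 1, prev[i][1])
            cur.append(min(sub, dele, ins))
        if cur[m][0] < best[0]:
            best = (cur[m][0], cur[m][1], j)
        prev = cur
    return best


print(infix_edit_distance("ACGT", "TTACGTTT"))  # (0, 2, 6)
```

A single mismatch or a single missing base each raises the distance by 1, for example `infix_edit_distance("ACGT", "TTAGGTTT")[0]` is 1.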

For each barcode, FBILR searches for the best-matched hit in four configurations: the forward barcode in the read head, the reverse barcode in the read head, the forward barcode in the read tail, and the reverse barcode in the read tail. The mode is selected with the -m option. In single-end mode, FBILR reports the hit with the minimum edit distance across all barcodes. In paired-end mode, FBILR reports the minimum-edit-distance hit in the read head and the minimum-edit-distance hit in the read tail, each across all barcodes.
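The selection step described above can be sketched in Python as follows. This is an illustration under stated assumptions, not FBILR's implementation: the `Hit` record and `report` function are hypothetical, "reverse barcode" is assumed to mean the reverse complement, the mode name "SE" is made up for the example (only "PE" appears in the usage), and ties are assumed to go to the first hit seen.

```python
from typing import NamedTuple


class Hit(NamedTuple):
    barcode: str
    orientation: str  # "F" (forward) or "R" (reverse)
    location: str     # "H" (head) or "T" (tail)
    start: int
    end: int
    distance: int


_COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse_complement(seq):
    """Assumed meaning of the 'reverse barcode'."""
    return seq.translate(_COMPLEMENT)[::-1]


def report(hits, mode="SE"):
    """Pick the hit(s) to report from all candidate hits of one read."""
    def best(candidates):
        return min(candidates, key=lambda h: h.distance)

    if mode == "PE":
        # one winner for the head and one for the tail
        return [best([h for h in hits if h.location == loc]) for loc in ("H", "T")]
    return [best(hits)]  # one overall winner


# Made-up candidate hits for two barcodes in one read.
hits = [
    Hit("Bar1", "F", "H", 10, 34, 2),
    Hit("Bar1", "R", "T", 950, 974, 0),
    Hit("Bar2", "F", "H", 12, 36, 1),
    Hit("Bar2", "R", "T", 948, 972, 4),
]
print(report(hits)[0].barcode)                    # Bar1
print([h.barcode for h in report(hits, "PE")])    # ['Bar2', 'Bar1']
```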

Schema

Here, we show the schema of barcodes found in a 100 nt read:

  • In case 1, the barcode is in the head of the read with an edit distance of 0 (fully matched).
  • In case 2, the barcode is in the middle of the read with an edit distance of 2 (2 mismatches).
  • In case 3, the barcode is in the tail of the read with an edit distance of 3 (1 mismatch and 2 deletions).

Finally, bar1 is the best-matched barcode in this read.

Installation

# Install from source
python setup.py test
python setup.py install

# Install from PyPI
pip install fbilr

Usage

The usage of FBILR is shown below:

# Single-end
fbilr -t 8 -w 200 -o matrix.tsv -b barcodes.fa reads.fq.gz
fbilr -t 8 -w 200 -b barcodes.fa reads.fq.gz | pigz -p 8 -c > matrix.tsv.gz

# Paired-end
fbilr -t 8 -w 200 -m PE -o matrix.tsv -b barcodes.fa reads.fq.gz

# Multiple barcode list
fbilr -t 8 -w 200 -o matrix.tsv -b barcodes1.fa,barcodes2.fa reads.fq.gz

# Ignore read name in output
fbilr -t 8 -w 200 -i -o matrix.tsv -b barcodes.fa reads.fq.gz

# Include read sequence and quality in output
fbilr -t 8 -w 200 -q -o matrix.tsv -b barcodes.fa reads.fq.gz

# Find barcode and split
fbilr -t 8 -w 200 -q -b barcodes.fa reads.fq.gz | your_custom_split_script.py
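As a starting point for a custom split script, the sketch below groups reads by their best-matched barcode. It is only an illustration: the function name `split_by_barcode` and the `max_distance` cutoff are hypothetical (the cutoff is not an FBILR option), and it assumes one barcode list in single-end mode with -q set, so that each row has 12 tab-separated columns whose last 4 can be written out unchanged as one FASTQ record.

```python
from collections import defaultdict


def split_by_barcode(lines, max_distance=3):
    """Group FASTQ records by best-matched barcode.

    Assumes 12 tab-separated columns per row:
    2 information + 6 hit + 4 FASTQ columns.
    """
    groups = defaultdict(list)
    for line in lines:
        cols = line.rstrip("\n").split("\t")
        if len(cols) != 12:
            raise ValueError(f"expected 12 columns, got {len(cols)}")
        barcode, distance = cols[2], int(cols[7])
        # max_distance is a hypothetical cutoff chosen for this example
        key = barcode if distance <= max_distance else "unassigned"
        groups[key].append("\n".join(cols[8:12]) + "\n")
    return dict(groups)


# Made-up rows in the assumed 12-column layout.
rows = [
    "read1\t8\tBar1\tF\tH\t0\t4\t0\t@read1\tACGTTTTT\t+\tIIIIIIII\n",
    "read2\t8\tBar2\tR\tT\t4\t8\t5\t@read2\tTTTTACGA\t+\tIIIIIIII\n",
]
groups = split_by_barcode(rows, max_distance=3)
print(sorted(groups))  # ['Bar1', 'unassigned']
```

Passing `sys.stdin` as `lines` would let the function consume the piped output; writing each group to its own FASTQ file is left to the caller.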

Output

FBILR outputs tab-delimited results consisting of multiple columns, listed below. Each row corresponds to one read in the input FASTQ file. A best-matched barcode is reported for every read, even if its edit distance is large.

column 1: read name ('.' if the '-i' option is set)
column 2: read length
column 3: barcode name
column 4: barcode orientation (F or R)
column 5: barcode location (H, M or T)
column 6: start in read (0-based, inclusive)
column 7: end in read (0-based, exclusive)
column 8: edit distance
column ...

# Example:
1b2e274b-9da7-4a5f-b40f-e6c36249d825    215     Bar4    R       T       172     196     0
ed320d59-77c6-41ba-895d-f4fdba5855f2    249     Bar2    F       H       29      53      0
9aa445f6-63b9-44e5-9b9c-43feea216b7a    492     Bar3    F       H       36      60      0
3087cbe0-7b00-40ff-837c-4cc59cf7e7ff    280     Bar4    R       T       239     263     0
15c53c45-ff43-4374-8716-049495d113aa    345     Bar4    F       H       27      50      3
21c0fe8d-1725-42ba-b490-eec2cd6f76b3    408     Bar2    F       H       27      51      0
90af744f-1367-493d-84e2-ca2375413e2d    551     Bar8    F       H       47      71      0

Columns 3 to 8 describe one hit (6 columns). The number of columns is flexible and depends on the number of barcode lists and the mode. The column structure is: information columns (2) + hit columns (6 * N) + FASTQ columns (4, optional).

With one barcode list, the number of columns is 2 + 6 in single-end mode and 2 + 6 * 2 in paired-end mode. If the -q option is set, 4 additional columns (name, sequence, "+", quality) are appended at the end.

With 2 barcode lists, the number of columns is 2 + 6 * 2 in single-end mode and 2 + 6 * 2 * 2 in paired-end mode.

Barcode lists   Mode         FASTQ included   Number of columns
1               Single-end   N                2 + 6 = 8
1               Single-end   Y                2 + 6 + 4 = 12
1               Paired-end   N                2 + 6 * 2 = 14
1               Paired-end   Y                2 + 6 * 2 + 4 = 18
2               Single-end   N                2 + 6 * 2 = 14
2               Single-end   Y                2 + 6 * 2 + 4 = 18
2               Paired-end   N                2 + 6 * 4 = 26
2               Paired-end   Y                2 + 6 * 4 + 4 = 30
3               Single-end   N                2 + 6 * 3 = 20
3               Single-end   Y                2 + 6 * 3 + 4 = 24
3               Paired-end   N                2 + 6 * 6 = 38
3               Paired-end   Y                2 + 6 * 6 + 4 = 42
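The column arithmetic above can be checked, and a row unpacked, with a short Python sketch. The names `expected_columns` and `parse_row` are hypothetical helpers, not part of FBILR. The sketch assumes the start, end, and edit distance fields are always integers, and it makes no assumption about the order of hits within a row: it simply returns them in the order they appear.

```python
HIT_FIELDS = ("barcode", "orientation", "location", "start", "end", "distance")


def expected_columns(n_lists, paired_end=False, with_fastq=False):
    """2 information columns + 6 per hit + 4 optional FASTQ columns."""
    n_hits = n_lists * (2 if paired_end else 1)
    return 2 + 6 * n_hits + (4 if with_fastq else 0)


def parse_row(line, n_lists=1, paired_end=False, with_fastq=False):
    """Split one tab-delimited output row into a dictionary."""
    cols = line.rstrip("\n").split("\t")
    want = expected_columns(n_lists, paired_end, with_fastq)
    if len(cols) != want:
        raise ValueError(f"expected {want} columns, got {len(cols)}")
    n_hits = n_lists * (2 if paired_end else 1)
    hits = []
    for k in range(n_hits):
        hit = dict(zip(HIT_FIELDS, cols[2 + 6 * k: 8 + 6 * k]))
        for key in ("start", "end", "distance"):
            hit[key] = int(hit[key])  # assumed to be integers
        hits.append(hit)
    fastq = tuple(cols[-4:]) if with_fastq else None
    return {"name": cols[0], "length": int(cols[1]), "hits": hits, "fastq": fastq}


# First row of the example output, with tabs between fields.
row = "1b2e274b-9da7-4a5f-b40f-e6c36249d825\t215\tBar4\tR\tT\t172\t196\t0"
record = parse_row(row)
print(record["hits"][0]["barcode"], record["hits"][0]["distance"])  # Bar4 0
print(expected_columns(3, paired_end=True, with_fastq=True))        # 42
```

Because start is inclusive and end is exclusive, the matched region of a read sequence `seq` is `seq[start:end]`.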

Splitting

Example

1. Demultiplexing XXX datasets.
2. Demultiplexing XXX datasets.

Packaging and distributing to PyPI

python -m build
python3 -m twine upload --repository pypi dist/*

Change logs

2023-09-13 (v1.2.0)

  1. Added tests for FBILR.
