
Anglerfish, a tool to demultiplex Illumina libraries from ONT data


Anglerfish


Introduction

Anglerfish is a tool designed to demultiplex Illumina libraries sequenced on Oxford Nanopore flowcells. Its primary purpose is quality control (QC), i.e. checking pool balancing, assessing contamination, estimating library insert sizes and so on.

For more information on how this can be used, please see this poster.

Installation

Requirements

  • Python3 (3.7)

Python modules:

  • biopython v. 1.70
  • python-levenshtein v. 0.12.0
  • numpy v. 1.19.2
  • pyyaml v. 6.0

Software:

  • minimap2 v. 2.20

From PyPI

pip install bio-anglerfish

From Bioconda

conda install -c bioconda anglerfish

Manually with Conda

First install miniconda, then:

git clone https://github.com/remiolsen/anglerfish.git
cd anglerfish
# Create the anglerfish conda environment
conda env create -f environment.yml
# Install anglerfish
pip install -e .

Development version

pip install --upgrade --force-reinstall git+https://github.com/remiolsen/anglerfish.git

Usage

Anglerfish requires two files to run.

  • A basecalled FASTQ file, for instance from Guppy (/path/to/ONTreads.fastq.gz)
  • A samplesheet containing the sample names and indices expected to be found in the sequencing run. (/path/to/samples.csv)

Example of a samplesheet file:

P12864_201,truseq_dual,TAATGCGC-CAGGACGT,/path/to/ONTreads.fastq.gz
P12864_202,truseq_dual,TAATGCGC-GTACTGAC,/path/to/ONTreads.fastq.gz
P9712_101, truseq_dual,ATTACTCG-TATAGCCT,/path/to/ONTreads.fastq.gz
P9712_102, truseq_dual,ATTACTCG-ATAGAGGC,/path/to/ONTreads.fastq.gz
P9712_103, truseq_dual,ATTACTCG-CCTATCCT,/path/to/ONTreads.fastq.gz
P9712_104, truseq_dual,ATTACTCG-GGCTCTGA,/path/to/ONTreads.fastq.gz
P9712_105, truseq_dual,ATTACTCG-AGGCGAAG,/path/to/ONTreads.fastq.gz
P9712_106, truseq_dual,ATTACTCG-TAATCTTA,/path/to/ONTreads.fastq.gz

Or using a single index (note that the samplesheet supports the wildcard * in file paths):

P12345_101,truseq,CAGGACGT,/path/to/*.fastq.gz

Then run:

anglerfish -s /path/to/samples.csv
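
The samplesheet layout shown above can be sanity-checked before a run. The following is a minimal sketch, not Anglerfish's own parser: the helper `read_samplesheet` and the field names `name`, `adaptor`, `i7`, `i5` and `fastq` are hypothetical, and it assumes that dual indices are written as i7-i5 and that whitespace around fields can be ignored.

```python
import csv
import io


def read_samplesheet(handle):
    """Parse samplesheet rows into dicts (hypothetical helper, not part of Anglerfish)."""
    samples = []
    for row in csv.reader(handle):
        fields = [field.strip() for field in row]
        if not any(fields):
            continue  # skip blank lines
        if len(fields) != 4:
            raise ValueError(f"expected 4 columns, got {len(fields)}: {row}")
        name, adaptor, index, fastq = fields
        # Assumption: dual indices are written as i7-i5; single indices have no dash
        i7, _, i5 = index.partition("-")
        samples.append(
            {"name": name, "adaptor": adaptor, "i7": i7, "i5": i5 or None, "fastq": fastq}
        )
    return samples


example = io.StringIO(
    "P12864_201,truseq_dual,TAATGCGC-CAGGACGT,/path/to/ONTreads.fastq.gz\n"
    "P9712_101, truseq_dual,ATTACTCG-TATAGCCT,/path/to/ONTreads.fastq.gz\n"
    "P12345_101,truseq,CAGGACGT,/path/to/*.fastq.gz\n"
)
samples = read_samplesheet(example)
for sample in samples:
    print(sample["name"], sample["adaptor"], sample["i7"], sample["i5"])
```

A row with the wrong number of columns raises a `ValueError`, which makes malformed lines easy to spot before starting the actual run.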

Options

Common

--out_fastq OUT_FASTQ, -o OUT_FASTQ
                      Analysis output folder (default: Current dir)
--samplesheet SAMPLESHEET, -s SAMPLESHEET
                      CSV formatted list of samples and barcodes
--threads THREADS, -t THREADS
                      Number of threads to use (default: 4)
--skip_demux, -c      Only do BC counting and not demuxing
--max-distance MAX_DISTANCE, -m MAX_DISTANCE
                      Manually set maximum edit distance for BC matching (default: set automatically to either 1 or 2)
--run_name RUN_NAME, -r RUN_NAME
                      Name of the run (default: anglerfish)
--debug, -d           Extra commandline output
--version, -v         Print version and quit
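
To illustrate what a maximum edit distance means for barcode matching, here is a minimal sketch using a pure-Python Levenshtein distance. It is not Anglerfish's implementation: the function names `edit_distance` and `assign_barcode` are hypothetical, and treating ties as ambiguous is an assumption made for this example only.

```python
def edit_distance(a, b):
    """Levenshtein distance via the classic dynamic-programming recurrence."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(
                min(
                    prev[j] + 1,  # deletion
                    curr[j - 1] + 1,  # insertion
                    prev[j - 1] + (ca != cb),  # substitution or match
                )
            )
        prev = curr
    return prev[-1]


def assign_barcode(observed, expected, max_distance):
    """Return the closest expected barcode within max_distance, else None."""
    scored = sorted((edit_distance(observed, bc), bc) for bc in expected)
    best_dist, best_bc = scored[0]
    if best_dist > max_distance:
        return None
    # Assumption for this sketch: a tie between two barcodes is ambiguous
    if len(scored) > 1 and scored[1][0] == best_dist:
        return None
    return best_bc


expected = ["CAGGACGT", "GTACTGAC", "TATAGCCT"]
hit = assign_barcode("CAGGACTT", expected, max_distance=1)   # one substitution away
miss = assign_barcode("CAGTTCTT", expected, max_distance=1)  # too many differences
print(hit, miss)
```

A larger maximum distance recovers more noisy reads but increases the risk of assigning a read to the wrong sample when barcodes in the pool are similar to each other.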

--max-unknowns / -u

Anglerfish will try to recover indices which are not specified in the samplesheet but follow the specified adaptor setup(s). This is analogous to undetermined indices as reported by Illumina demultiplexing. --max-unknowns will set the number of such indices reported.
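
The idea of reporting only the most frequent unexpected indices can be sketched as follows. This is an illustration of the concept rather than Anglerfish's code: the function `top_unknowns` and the toy index sequences are made up for the example.

```python
from collections import Counter


def top_unknowns(observed, expected, max_unknowns):
    """Count indices absent from the samplesheet and keep the most frequent ones."""
    expected = set(expected)
    counts = Counter(idx for idx in observed if idx not in expected)
    return counts.most_common(max_unknowns)


# Toy data: one expected index and three unexpected ones
observed = [
    "CAGGACGT", "GGGGGGGG", "GGGGGGGG", "ACACACAC",
    "GGGGGGGG", "ACACACAC", "TTTTTTTT",
]
report = top_unknowns(observed, expected=["CAGGACGT"], max_unknowns=2)
print(report)
```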

--lenient / -l

This will consider both orientations of the i5 barcode and will use the reverse complement (of what was entered in the samplesheet) only if significantly more reads were matched. This should be used with extreme care; the reason for it is that Anglerfish will try to guess which version of the Illumina samplesheet these indices were derived from. See this guide for when i5 should be reverse complemented and when it should not.
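
For reference, the two i5 orientations of a dual index can be derived as in the sketch below. The helper names `reverse_complement` and `flip_i5` are hypothetical, and the sketch assumes the index is written as i7-i5. How Anglerfish decides that "significantly more reads" matched is not covered here.

```python
COMPLEMENT = str.maketrans("ACGTN", "TGCAN")


def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence."""
    return seq.upper().translate(COMPLEMENT)[::-1]


def flip_i5(index):
    """Reverse-complement only the second (i5) half of an 'i7-i5' index string."""
    i7, sep, i5 = index.partition("-")
    if not sep:
        return index  # single index: nothing to flip
    return f"{i7}-{reverse_complement(i5)}"


flipped = flip_i5("ATTACTCG-TATAGCCT")
print(flipped)
```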

--ont_barcodes / -n

This is an ONT barcode-aware mode, in which each ONT barcode is mapped and treated separately. One use case is to put one Illumina pool per ONT barcode in order to spot unknown index collisions before combining the pools into a pool of pools for sequencing in the same lane. This mode requires the fastq files to be placed in folders named barcode01, barcode02, etc., as is the default for MinKNOW (23.04). Example of such an anglerfish samplesheet:

P12345_101,truseq,CAGGACGT,/path/to/barcode01/*.fastq.gz
P54321_101,truseq,ATTACTCG,/path/to/barcode02/*.fastq.gz
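
The folder convention can be checked with a short script that groups samples by the barcodeNN directory in their path. This is only a sketch of the idea: `ont_barcode_of` is a hypothetical helper, and the regular expression assumes folder names of the form "barcode" followed by digits.

```python
import re
from collections import defaultdict
from pathlib import PurePosixPath

# Assumption: ONT barcode folders are named "barcode" followed by digits
BARCODE_DIR = re.compile(r"^barcode\d+$")


def ont_barcode_of(path):
    """Return the barcodeNN folder name found in the path, or None."""
    for part in PurePosixPath(path).parts:
        if BARCODE_DIR.match(part):
            return part
    return None


rows = [
    ("P12345_101", "/path/to/barcode01/*.fastq.gz"),
    ("P54321_101", "/path/to/barcode02/*.fastq.gz"),
]
by_barcode = defaultdict(list)
for name, fastq in rows:
    by_barcode[ont_barcode_of(fastq)].append(name)

for barcode, names in sorted(by_barcode.items()):
    print(barcode, names)
```

Any sample that ends up under the key `None` points to a path outside a barcodeNN folder, which this mode would not be able to handle.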

Output files

In folder anglerfish_????_??_??_?????/

  • *.fastq.gz Demultiplexed reads (if any)
  • anglerfish_stats.txt Barcode statistics from anglerfish run
  • anglerfish_stats.json Machine readable anglerfish statistics
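
The machine-readable statistics can be picked up by downstream scripts. The sketch below locates the most recently modified output folder using the name pattern given above and loads the JSON file. It assumes the default run name (anglerfish), and `latest_stats` is a hypothetical helper. Since the JSON layout is not described here, the example only lists top-level keys.

```python
import glob
import json
import os


def latest_stats(base_dir="."):
    """Load anglerfish_stats.json from the newest matching output folder, or return None."""
    pattern = os.path.join(
        base_dir, "anglerfish_????_??_??_?????", "anglerfish_stats.json"
    )
    candidates = sorted(glob.glob(pattern), key=os.path.getmtime)
    if not candidates:
        return None
    with open(candidates[-1]) as handle:
        return json.load(handle)


stats = latest_stats()
if stats is not None:
    # The JSON layout is not documented here, so only inspect the top level
    print(sorted(stats) if isinstance(stats, dict) else type(stats).__name__)
```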

Credits

The Anglerfish code was written by @remiolsen but it would not exist without the contributions of @FranBonath, @taborsak, @ssjunnebo and Carl Rubin. Also, the Anglerfish logo was designed by @FranBonath.


