ont-fast5-api

Oxford Nanopore Technologies fast5 API software

ont_fast5_api

ont_fast5_api is a simple interface to HDF5 files of the Oxford Nanopore .fast5 file format.

Source code: https://github.com/nanoporetech/ont_fast5_api
Fast5 File Schema: https://github.com/nanoporetech/ont_h5_validator

It provides:

Concrete implementation of the fast5 file schema using the generic h5py library
Plain-English-named methods to interact with and reflect the fast5 file schema
Tools to convert between multi_read and single_read formats
Tools to compress/decompress raw data in files

Getting Started

The ont_fast5_api is available on PyPI and can be installed via pip:

pip install ont-fast5-api

Alternatively, it is available on GitHub, where it can be built from source:

git clone https://github.com/nanoporetech/ont_fast5_api
pip install ./ont_fast5_api

Dependencies

ont_fast5_api is a pure Python project and should run on most Python versions and operating systems.

It requires:

h5py: 2.6 or higher
NumPy: 1.11 or higher
six: 1.10 or higher
progressbar33: 2.3.1 or higher
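
As a rough way to check an environment against the minimums listed above, the following sketch compares installed versions using only the standard library. The helper names (version_tuple, meets_minimum, installed_version) are hypothetical and not part of ont_fast5_api, and the comparison only looks at the first three numeric components of a version string, which is a simplifying assumption.

```python
import re
from importlib import metadata

# Minimum versions copied from the dependency list above.
# The keys are assumed to be the PyPI distribution names.
MINIMUM_VERSIONS = {
    "h5py": "2.6",
    "numpy": "1.11",
    "six": "1.10",
    "progressbar33": "2.3.1",
}


def version_tuple(version):
    """Turn '2.10.0' into (2, 10, 0), padding missing components with zeros."""
    parts = [int(part) for part in re.findall(r"\d+", version)]
    return tuple((parts + [0, 0, 0])[:3])


def meets_minimum(installed, minimum):
    """Return True if the installed version is at least the minimum."""
    return version_tuple(installed) >= version_tuple(minimum)


def installed_version(name):
    """Return the installed version string, or None if the package is absent."""
    try:
        return metadata.version(name)
    except metadata.PackageNotFoundError:
        return None


if __name__ == "__main__":
    for name, minimum in MINIMUM_VERSIONS.items():
        found = installed_version(name)
        if found is None:
            print(f"{name}: not installed (requires {minimum} or higher)")
        else:
            status = "ok" if meets_minimum(found, minimum) else "too old"
            print(f"{name}: {found} ({status}, requires {minimum} or higher)")
```

In practice pip resolves these requirements during installation, so a check like this is mainly useful when debugging an existing environment.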

Interface - get_fast5_file

The ont_fast5_api provides a simple interface to access the data structures in .fast5 files of either single- or multi-read format using the same method calls.

For example, to print the raw data from all reads in a file:

from ont_fast5_api.fast5_interface import get_fast5_file

def print_all_raw_data():
    fast5_filepath = "test/data/single_reads/read0.fast5" # This can be a single- or multi-read file
    with get_fast5_file(fast5_filepath, mode="r") as f5:
        for read in f5.get_reads():
            raw_data = read.get_raw_data()
            print(read.read_id, raw_data)

Interface - Console Scripts

The ont_fast5_api provides terminal/command-line console_scripts for converting between files in the Oxford Nanopore single_read and multi_read .fast5 file formats. These are provided to ensure compatibility between tools which expect either the single_read or multi_read .fast5 file formats.

The scripts are added during installation and can be called from the terminal/command-line or from within python.

single_to_multi_fast5

This script converts folders containing single_read_fast5 files into multi_read_fast5 files:

single_to_multi_fast5
[required]
    -i, --input_path    INPUT_PATH      <(path) folder containing single_read_fast5 files>
    -s, --save_path     SAVE_PATH       <(path) to folder where multi_read fast5 files will be output>

[optional]
    -t, --threads       THREADS         <(int) number of CPU threads to use; default=1>
    -f, --filename_base FILENAME_BASE   <(string) name for new multi_read file; default="batch" (see note-1)>
    -n, --batch_size    BATCH_SIZE      <(int) number of single_reads to include in each multi_read file; default=4000>
    --recursive                         <if included, recursively search sub-directories for single_read files>

note-1: newly created multi_read files require a name. This is the filename_base with the batch count and .fast5 appended to it; e.g. -f batch yields batch_0.fast5, batch_1.fast5, ...

example usage:

single_to_multi_fast5 --input_path /data/reads --save_path /data/multi_reads \
    --filename_base batch_output --batch_size 100 --recursive

Where /data/reads and/or its subfolders contain single_read .fast5 files. The output will be multi_read fast5 files each containing 100 reads, in the folder: /data/multi_reads with the names: batch_output_0.fast5, batch_output_1.fast5 etc.
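
The naming rule from note-1 can be sketched in a few lines of Python. This is an illustration of the documented naming pattern, not the script's actual implementation; the function name batch_filenames is hypothetical, and the sketch assumes that a final partial batch still gets its own file.

```python
import math


def batch_filenames(num_reads, filename_base="batch", batch_size=4000):
    """Return the multi_read filenames expected for num_reads single reads.

    Assumes files are numbered from 0 and that any remainder of reads
    goes into one last, smaller file.
    """
    num_batches = math.ceil(num_reads / batch_size)
    return [f"{filename_base}_{index}.fast5" for index in range(num_batches)]


if __name__ == "__main__":
    # Mirrors the example above: --filename_base batch_output --batch_size 100
    for name in batch_filenames(250, filename_base="batch_output", batch_size=100):
        print(name)
```

With 250 input reads and a batch size of 100, this yields three names, batch_output_0.fast5 through batch_output_2.fast5.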

multi_to_single_fast5

This script converts folders containing multi_read_fast5 files into single_read_fast5 files:

multi_to_single_fast5
[required]
    -i, --input_path    INPUT_PATH  <(path) folder containing multi_read_fast5 files>
    -s, --save_path     SAVE_PATH   <(path) to folder where single_read fast5 files will be output>

[optional]
    -t, --threads       THREADS     <(int) number of CPU threads to use; default=1>
    --recursive                     <if included, recursively search sub-directories for multi_read files>

example usage:

multi_to_single_fast5 --input_path /data/multi_reads --save_path /data/single_reads \
    --recursive

Where /data/multi_reads and/or its subfolders contain multi_read .fast5 files. The output will be single_read .fast5 files in the folder /data/single_reads, with one subfolder per multi_read input file.
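
One way to drive the console script from Python is to build the argument list and hand it to subprocess. The sketch below is an assumption about how a caller might do this, not the package's own Python entry point; build_multi_to_single_command and run_conversion are hypothetical helper names, and only flags listed above are used.

```python
import subprocess


def build_multi_to_single_command(input_path, save_path, threads=1, recursive=False):
    """Build the argument list for the multi_to_single_fast5 console script."""
    command = [
        "multi_to_single_fast5",
        "--input_path", str(input_path),
        "--save_path", str(save_path),
        "--threads", str(threads),
    ]
    if recursive:
        command.append("--recursive")
    return command


def run_conversion(input_path, save_path, threads=1, recursive=False):
    """Run the conversion; raises CalledProcessError if the script fails."""
    command = build_multi_to_single_command(input_path, save_path, threads, recursive)
    subprocess.run(command, check=True)


if __name__ == "__main__":
    # Only print the command here; call run_conversion() to execute it.
    print(" ".join(build_multi_to_single_command(
        "/data/multi_reads", "/data/single_reads", recursive=True)))
```

Passing the arguments as a list avoids shell quoting problems with paths that contain spaces.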

fast5_subset

This script extracts reads from multi_read_fast5_file(s) based on a list of read_ids:

fast5_subset
[required]
    -i, --input         INPUT_PATH      <(path) to folder containing multi_read_fast5 files or an individual multi_read_fast5 file>
    -s, --save_path     SAVE_PATH       <(path) to folder where multi_read fast5 files will be output>
    -l, --read_id_list  SUMMARY_PATH    <(file) either sequencing_summary.txt file or a file containing a list of read_ids>

[optional]
    -f, --filename_base FILENAME_BASE   <(string) name for new multi_read file; default="batch" (see note-1)>
    -n, --batch_size    BATCH_SIZE      <(int) number of reads to include in each multi_read file; default=4000>
    --recursive                         <if included, recursively search sub-directories for multi_read files>

example usage:

fast5_subset --input /data/multi_reads --save_path /data/subset \
    --read_id_list read_id_list.txt --batch_size 100 --recursive

Where /data/multi_reads and/or its subfolders contain multi_read .fast5 files and read_id_list.txt is either a text file containing one read_id per line or a tsv file with a column named read_id. The output will be multi_read .fast5 files each containing 100 reads, in the folder /data/subset, with the default names batch_0.fast5, batch_1.fast5 etc.
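
The two accepted read_id list layouts can be told apart by looking at the first line. The following sketch shows one plausible way to do that with the standard csv module; it is not the parser used by fast5_subset, and load_read_ids is a hypothetical name. It takes the file contents as a string so it can be tried without any files on disk.

```python
import csv
import io


def load_read_ids(text):
    """Extract read_ids from either a plain list or a tsv with a read_id column."""
    lines = [line for line in text.splitlines() if line.strip()]
    if not lines:
        return []
    # A tsv summary is recognised by a header that contains 'read_id'.
    if "read_id" in lines[0].split("\t"):
        reader = csv.DictReader(io.StringIO("\n".join(lines)), delimiter="\t")
        return [row["read_id"] for row in reader]
    # Otherwise treat every non-empty line as one read_id.
    return [line.strip() for line in lines]


if __name__ == "__main__":
    plain_list = "read-a\nread-b\n"
    summary = "filename\tread_id\tchannel\nx.fast5\tread-a\t12\nx.fast5\tread-c\t40\n"
    print(load_read_ids(plain_list))
    print(load_read_ids(summary))
```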

demux_fast5

This script demultiplexes reads from multi_read_fast5_file(s).

Extracts reads into multiple directories based on column value in a summary file:

demux_fast5
[required]
  -i, --input          INPUT_PATH    <Path to Fast5 file or directory of Fast5 files>
  -s, --save_path      SAVE_PATH     <Directory to output MultiRead subsets>
  -l, --summary_file   SUMMARY_PATH  <TSV file containing read_id and demultiplex columns>

[optional]
  --read_id_column     COLUMN_NAME   <Name of read_id column in summary file (default 'read_id')>
  --demultiplex_column COLUMN_NAME   <Name of column for demultiplexing in summary file (default 'barcoding_arrangement')>
  -f, --filename_base  FILENAME_BASE <Root of output filename, default='batch' -> 'batch_0.fast5'>
  -n, --batch_size     BATCH_SIZE    <Number of reads per multi-read file, default 4000>
  -t, --threads        THREADS       <Maximum number of processes to use>
  -r, --recursive                    <Flag to search recursively through input directory for MultiRead fast5 files>
  --ignore_symlinks                  <Ignore symlinks when searching recursively for fast5 files>
  -c, --compression    COMPRESSION   <Target output compression type (vbz,vbz_legacy_v0,gzip,None)>

Intended use is for multiplexed experiments, for reads with different barcodes or from different genomes.

example usage:

demux_fast5 --input /data/multi_reads --save_path /data/demultiplexed_reads --summary_file barcoding_summary.txt

Where /data/multi_reads and/or its subfolders contain fast5 files from a multiplexed experiment, and barcoding_summary.txt is the output of guppy_barcoder. /data/demultiplexed_reads will contain one directory per barcode, each containing multi_read .fast5 files with names such as /data/demultiplexed_reads/barcode01/batch_0.fast5, /data/demultiplexed_reads/barcode02/batch_0.fast5 etc. Directories are named after the values in the demultiplex column.
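
The grouping step described above can be illustrated with a small standard-library sketch. It reads a tsv summary, groups read_ids by the demultiplex column, and builds the first output path for each group. The function names are hypothetical and this is not the demux_fast5 implementation; the column names are the defaults listed in the options.

```python
import csv
import io
from collections import defaultdict
from pathlib import PurePosixPath


def group_reads(summary_text, read_id_column="read_id",
                demultiplex_column="barcoding_arrangement"):
    """Map each demultiplex value (e.g. a barcode) to its list of read_ids."""
    groups = defaultdict(list)
    reader = csv.DictReader(io.StringIO(summary_text), delimiter="\t")
    for row in reader:
        groups[row[demultiplex_column]].append(row[read_id_column])
    return dict(groups)


def first_output_file(save_path, group, filename_base="batch"):
    """Path of the first multi_read file written for one group."""
    return str(PurePosixPath(save_path) / group / f"{filename_base}_0.fast5")


if __name__ == "__main__":
    summary = (
        "read_id\tbarcoding_arrangement\n"
        "r1\tbarcode01\n"
        "r2\tbarcode02\n"
        "r3\tbarcode01\n"
    )
    for barcode, read_ids in group_reads(summary).items():
        print(first_output_file("/data/demultiplexed_reads", barcode), read_ids)
```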

compress_fast5

This script copies and converts raw data between vbz and gzip compression formats:

compress_fast5
[required]
    -i, --input_path    INPUT_PATH  <(path) folder containing fast5 files>
    -s, --save_path     SAVE_PATH   <(path) to folder where output fast5 files will be written>
    -c, --compression   COMPRESSION <(str) [vbz, gzip] target compression format>

[optional]
    -t, --threads       THREADS     <(int) number of CPU threads to use; default=1>
    --recursive                     <if included, recursively search sub-directories for fast5 files>
    --sanitize                      <flag to remove optional groups (such as basecalling and modified base information)>

example usage:

compress_fast5 --input_path /data/uncompressed_reads --save_path /data/compressed_reads \
    --compression vbz --recursive --threads 40

Where /data/uncompressed_reads and/or its subfolders contain .fast5 files. The output will be a copy of the input folder structure containing compressed reads preserving both the folder structure and file type.

The optional --sanitize option can be used to greatly reduce file size when files contain optional data from the Guppy basecaller that could in principle be regenerated by running Guppy. The files output when using the sanitize option will be identical in structure to those output by MinKNOW when live basecalling is disabled.

NB: compress_fast5 copies .fast5 files in order to compress them, due to HDF5 implementation constraints. Further detail on HDF5 data management strategies can be found at: https://support.hdfgroup.org/HDF5/doc/Advanced/FileSpaceManagement/FileSpaceManagement.pdf

VBZ Compression

VBZ is a compression algorithm developed by Oxford Nanopore to reduce file size and improve read/write performance when handling raw data in Fast5 files. The previous default compression was GZIP; compared with GZIP, VBZ shows a compression improvement of >30% and a CPU performance improvement of >10X for compression and >5X for decompression. Further details of the implementation and benchmarks can be found here: https://github.com/nanoporetech/vbz_compression

Benchmarking the performance of compression within the ont_fast5_api against a normal file copy showed that compressing from gzip to vbz was approximately 2x slower than copying files. In other words, if it would take two hours to copy a set of files from an input folder to an output folder, then it should take four hours to compress those files with VBZ. Running the script without changing compression (i.e. the same type of compression in and out; gzip->gzip) was approximately 2x faster than a file copy, since it can utilise multiple threads.
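
The arithmetic in the paragraph above can be written as a small planning helper. The 2x ratios are the approximate figures quoted from the benchmark and will vary with hardware and thread count; estimate_hours is a hypothetical name, and only the two benchmarked cases are covered.

```python
def estimate_hours(copy_hours, source="gzip", target="vbz"):
    """Rough run-time estimate for compress_fast5 relative to a plain file copy.

    Uses the approximate ratios quoted above:
      gzip -> vbz            about 2x slower than a copy
      same format in and out about 2x faster than a copy
    """
    if source == "gzip" and target == "vbz":
        return copy_hours * 2.0
    if source == target:
        return copy_hours / 2.0
    raise ValueError(f"no benchmark figure quoted for {source} -> {target}")


if __name__ == "__main__":
    # A copy that takes two hours: four hours to convert to vbz,
    # about one hour for a gzip -> gzip pass.
    print(estimate_hours(2.0, "gzip", "vbz"))
    print(estimate_hours(2.0, "gzip", "gzip"))
```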

Glossary of Terms:

HDF5 file format - a portable file format for storing and managing data. It is designed for flexible and efficient I/O and for high-volume and complex data.

Fast5 - an implementation of the HDF5 file format, with specific data schemas for Oxford Nanopore sequencing data

Single read fast5 - A fast5 file containing all the data pertaining to a single Oxford Nanopore read. This may include raw signal data, run metadata, fastq-basecalls and any other additional analyses.

Multi read fast5 - A fast5 file containing data pertaining to multiple Oxford Nanopore reads.

Demultiplexing - A process of separating reads of an experiment where multiple samples were mixed together (multiplexed), into corresponding samples. Demultiplexing is based on markers that identify sample origin, e.g. unique barcodes or alignment to a reference genome.

Release history

4.1.3

Feb 28, 2024

4.1.2

Dec 21, 2023

4.1.1

Dec 20, 2022

4.1.0

Sep 28, 2022

4.0.2

Mar 10, 2022

4.0.1 yanked

Mar 10, 2022

Reason this release was yanked:

Bug when importing Fast5Read

4.0.0

Aug 11, 2021

3.3.0

Feb 17, 2021

3.2.0

Jan 28, 2021

3.1.6

Aug 20, 2020

3.1.5

Jun 15, 2020

3.1.4

Jun 12, 2020

3.1.3

May 28, 2020

3.1.2

May 4, 2020

3.1.1

Apr 3, 2020

3.1.0

Apr 2, 2020

3.0.2

Mar 17, 2020

3.0.1

Jan 29, 2020

3.0.0

Jan 20, 2020

2.0.1

Nov 28, 2019

2.0.0

Nov 19, 2019

1.4.8

Oct 22, 2019

1.4.7

Jul 29, 2019

1.4.4

Jun 18, 2019

1.4.3

Jun 12, 2019

1.4.2

Jun 10, 2019

1.4.1

Jun 6, 2019

1.4.0

May 29, 2019

1.3.0

Mar 1, 2019

1.2.0

Jan 11, 2019

1.1.1

Jan 10, 2019

1.1.0

Jan 7, 2019

1.0.1

Sep 26, 2018

1.0.0

Sep 26, 2018

0.4.1

Jul 25, 2017

0.3.3

Jul 6, 2017

0.3.2

Mar 22, 2017

Download files

Source Distribution

ont-fast5-api-4.1.3.tar.gz (2.3 MB)

Uploaded Feb 28, 2024 Source

Built Distribution

ont_fast5_api-4.1.3-py3-none-any.whl (2.3 MB)

Uploaded Feb 28, 2024 Python 3

File details

Details for the file ont-fast5-api-4.1.3.tar.gz.

File metadata

Download URL: ont-fast5-api-4.1.3.tar.gz
Upload date: Feb 28, 2024
Size: 2.3 MB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: twine/5.0.0 CPython/3.10.13

File hashes

Hashes for ont-fast5-api-4.1.3.tar.gz
Algorithm	Hash digest
SHA256	`302d10ed87b439f8f22c2c06d45d68d017e47dd8df9bd48f155cad041f464b68`
MD5	`4390bb2e34d41acb8fc2a04fd1e39a3d`
BLAKE2b-256	`6d1a6d108133f1b7770c9550bf63398119f9eb70492b0928b1f566704ec63ac9`

File details

Details for the file ont_fast5_api-4.1.3-py3-none-any.whl.

File metadata

Download URL: ont_fast5_api-4.1.3-py3-none-any.whl
Upload date: Feb 28, 2024
Size: 2.3 MB
Tags: Python 3
Uploaded using Trusted Publishing? No
Uploaded via: twine/5.0.0 CPython/3.10.13

File hashes

Hashes for ont_fast5_api-4.1.3-py3-none-any.whl
Algorithm	Hash digest
SHA256	`642a89775b370e44206625f03bd41330650656bbb29325dd958a34010667a970`
MD5	`5dd62d9fce94d7aad39d6ac75cc2a38b`
BLAKE2b-256	`f4ce0d6fe4e6fd7fbebd6948511663a14c2a6642a355ab550294dbb5f8065c58`
