Nanopore methylation/modified base calling detached from basecalling

Remora

Methylation/modified base calling separated from basecalling. Remora primarily provides an API to call modified bases for basecaller programs such as Bonito. Remora also provides the tools to prepare datasets, train modified base models and run simple inference.

Installation

Install from PyPI:

pip install ont-remora

Install from GitHub source for development:

git clone git@github.com:nanoporetech/remora.git
pip install -e remora/[tests]

It is recommended that Remora be installed in a virtual environment. For example: python3.8 -m venv --prompt remora --copies venv; source venv/bin/activate

See help for any Remora sub-command with the -h flag.

Getting Started

Remora models are trained to perform binary or categorical prediction of modified base content of a nanopore read. Models may also be trained to perform canonical base prediction, but this feature may be removed at a later time. The rest of the documentation will focus on the modified base detection task.

The Remora training/prediction input unit (referred to as a chunk) consists of:

  1. Section of normalized signal

  2. Canonical bases attributed to the section of signal

  3. Mapping between these two

Chunks have a fixed signal length defined at data preparation time and saved as a model attribute. A fixed position within the chunk is defined as the “focus position”. By default, this position is the center of the “focus base” being interrogated by the model.
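The chunk definition above can be illustrated with a minimal sketch. The function name, its arguments and the toy data are hypothetical and are not part of the Remora API; the sketch only shows how a fixed-length window can be cut around the center of a focus base, given a mapping from bases to signal indices.

```python
import numpy as np


def extract_chunk(norm_signal, seq_to_sig_map, focus_base, chunk_before, chunk_after):
    """Return a fixed-length signal window around the center of one base.

    Hypothetical helper for illustration; not Remora's implementation.
    """
    # Signal span [start, end) assigned to the focus base.
    start = int(seq_to_sig_map[focus_base])
    end = int(seq_to_sig_map[focus_base + 1])
    # Focus position: center of the focus base in signal coordinates.
    focus_pos = (start + end) // 2
    lo, hi = focus_pos - chunk_before, focus_pos + chunk_after
    if lo < 0 or hi > len(norm_signal):
        return None  # the window would run off the read, so skip this base
    return norm_signal[lo:hi]


# Toy read: 100 signal points, 10 bases of 10 points each.
signal = np.arange(100, dtype=np.float32)
seq_to_sig = np.arange(0, 101, 10, dtype=np.int32)
chunk = extract_chunk(signal, seq_to_sig, focus_base=5, chunk_before=20, chunk_after=20)
edge = extract_chunk(signal, seq_to_sig, focus_base=0, chunk_before=20, chunk_after=20)
```

Every chunk returned by this sketch has length chunk_before + chunk_after, which mirrors the fixed signal length described above. Bases too close to the read ends are skipped here rather than padded; that is a design choice of the sketch, not a statement about Remora.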

Pre-trained Models

See the selection of currently released models with remora model list_pretrained. Pre-trained models are stored remotely and can be downloaded with the remora model download command; otherwise they are downloaded on demand when needed.

Models may be run from Bonito. See Bonito documentation to apply Remora models.

More advanced research models may be supplied via Rerio. Note that older ONNX format models require Remora version < 2.0.

Python API

The Remora API can be applied to make modified base calls given a basecalled read via a RemoraRead object.

  • dacs (Data acquisition values) should be an int16 numpy array.

  • shift and scale are float values to convert dacs to mean=0 SD=1 scaling (or similar) for input to the Remora neural network.

  • str_seq is a string derived from the signal (can be either basecalls or other downstream derived sequence; e.g. mapped reference positions).

  • seq_to_sig_map should be an int32 numpy array of length len(str_seq) + 1 whose elements are indices within the dacs array assigned to each base in str_seq.
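As a minimal sketch of the inputs listed above, the following builds toy arrays with the stated dtypes and shapes using numpy only. The values are hypothetical. Two assumptions are made here and are not confirmed by this document: shift and scale are taken as the per-read mean and standard deviation, and normalization is applied as (dacs - shift) / scale.

```python
import numpy as np

# Hypothetical toy values; real reads come from a POD5 file and a basecaller.
dacs = np.array([480, 500, 520, 610, 590, 600, 400, 410, 390], dtype=np.int16)
seq = "ACG"

# Base i is assigned signal indices seq_to_sig_map[i]:seq_to_sig_map[i + 1].
seq_to_sig_map = np.array([0, 3, 6, 9], dtype=np.int32)

# Assumption: use the read mean and SD as shift and scale.
shift = float(dacs.mean())
scale = float(dacs.std())

# Assumption: normalized signal is (dacs - shift) / scale.
norm_signal = (dacs - shift) / scale

# Consistency checks implied by the descriptions above.
assert seq_to_sig_map.shape[0] == len(seq) + 1
assert seq_to_sig_map[-1] <= dacs.shape[0]
```

With these choices the normalized signal has mean 0 and SD 1 for the toy read. Other estimators of shift and scale give the "or similar" scaling mentioned above.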

from remora.model_util import load_model
from remora.data_chunks import RemoraRead
from remora.inference import call_read_mods

model, model_metadata = load_model("remora_train_results/model_best.pt")
read = RemoraRead(dacs, shift, scale, seq_to_sig_map, str_seq=seq)
mod_probs, _, pos = call_read_mods(
  read,
  model,
  model_metadata,
  return_mod_probs=True,
)

mod_probs contains the probability of each modeled modified base, as listed in model_metadata["mod_long_names"]. For example, run mod_probs.argmax(axis=1) to obtain the prediction for each input unit. pos contains the position (index in the input sequence) of each prediction within mod_probs.
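The post-processing step can be sketched with numpy alone. The arrays below are made-up stand-ins for the values returned by call_read_mods, and the sketch assumes one column per entry in mod_long_names, following the description above. Check mod_probs.shape against the model metadata before relying on this layout.

```python
import numpy as np

# Hypothetical stand-ins for model_metadata["mod_long_names"] and the
# outputs of call_read_mods; the values are invented for illustration.
mod_long_names = ["5mC", "5hmC"]
mod_probs = np.array([[0.70, 0.10], [0.05, 0.15], [0.20, 0.60]])
pos = np.array([3, 17, 42])

# Index of the most probable modeled modified base at each site.
best = mod_probs.argmax(axis=1)

# Pair each sequence position with the name of its top-scoring modification.
calls = [(int(p), mod_long_names[i]) for p, i in zip(pos, best)]
for p, name in calls:
    print(f"position {p}: {name}")
```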

Data Preparation

Remora data preparation begins from a POD5 file (containing signal data) and a BAM file containing basecalls from the POD5 file. Note that the BAM file must contain the move table (default in Bonito and --moves_out in Guppy).

The following example generates training data from canonical (PCR) and modified (M.SssI treatment) samples in the same fashion as the released 5mC CG-context models. Example reads and a kit14 level table can be found in the test/data/ directory of the Remora repository.

remora \
  dataset prepare \
  can_reads.pod5 \
  can_mappings.bam \
  --output-remora-training-file can_chunks.npz \
  --log-filename prep_can.log \
  --refine-kmer-level-table levels.txt \
  --refine-rough-rescale \
  --motif CG 0 \
  --mod-base-control
remora \
  dataset prepare \
  mod_reads.pod5 \
  mod_mappings.bam \
  --output-remora-training-file mod_chunks.npz \
  --log-filename prep_mod.log \
  --refine-kmer-level-table levels.txt \
  --refine-rough-rescale \
  --motif CG 0 \
  --mod-base m 5mC
remora \
  dataset merge \
  --input-dataset can_chunks.npz 10_000_000 \
  --input-dataset mod_chunks.npz 10_000_000 \
  --output-dataset chunks.npz

The resulting chunks.npz file can then be used to train a Remora model.

Model Training

Models are trained with the remora model train command. For example, a model can be trained with the following command.

remora \
  model train \
  chunks.npz \
  --model remora/models/ConvLSTM_w_ref.py \
  --device 0 \
  --output-path train_results

This command will produce a “best” model in torchscript format for use in Bonito or with the remora infer commands.

Model Inference

For testing purposes, Remora also provides inference commands.

remora \
  infer from_pod5_and_bam \
  can_signal.pod5 \
  can_basecalls.bam \
  --model train_results/model_best.pt \
  --out-file can_infer.bam \
  --device 0
remora \
  infer from_pod5_and_bam \
  mod_signal.pod5 \
  mod_basecalls.bam \
  --model train_results/model_best.pt \
  --out-file mod_infer.bam \
  --device 0

Finally, Remora provides tools to validate these results. Ground truth BED files reference positions where each read should be called as the modified or canonical base listed in the BED name field.

remora \
  validate from_modbams \
  --bam-and-bed can_infer.bam can_ground_truth.bed \
  --bam-and-bed mod_infer.bam mod_ground_truth.bed \
  --full-output-filename validation_results.txt
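As a rough sketch of what such ground truth files might contain, the snippet below formats BED lines whose name field carries the expected base. The contig name and positions are hypothetical, and the six-column layout (chrom, start, end, name, score, strand) and the codes "C" and "m" are assumptions based on the standard BED format and the --mod-base m 5mC option used earlier. Check the Remora test data for the exact expected layout.

```python
def bed_line(chrom, pos, strand, base_code):
    """Format one single-base BED record (hypothetical helper)."""
    # BED intervals are 0-based and half-open, so one base spans [pos, pos + 1).
    return f"{chrom}\t{pos}\t{pos + 1}\t{base_code}\t.\t{strand}"


# Hypothetical CG sites on a made-up contig.
sites = [("chr1", 100, "+"), ("chr1", 250, "+")]

# Canonical sample: every listed site should be called as canonical C.
can_lines = [bed_line(c, p, s, "C") for c, p, s in sites]

# Modified sample: every listed site should be called as 5mC (code "m").
mod_lines = [bed_line(c, p, s, "m") for c, p, s in sites]

print("\n".join(can_lines))
print("\n".join(mod_lines))
```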

Raw Signal Analysis

As of version 2.1, Remora makes raw signal analysis more accessible via two CLI commands and an improved API. The remora analyze command group contains two commands: plot ref_region and estimate_kmer_levels. Additional commands will be added to this group to support further raw signal analysis tasks.

The plot ref_region command is useful for gaining intuition into signal attributes and visualizing signal shifts around modified bases.

As an example using the test data, the following command produces the plot below.

remora \
  analyze plot ref_region \
  --pod5-and-bam can_reads.pod5 can_mappings.bam \
  --pod5-and-bam mod_reads.pod5 mod_mappings.bam \
  --ref-regions ref_regions.bed \
  --highlight-ranges mod_gt.bed \
  --refine-kmer-level-table levels.txt \
  --refine-rough-rescale \
  --log-filename log.txt

[Images: plot of the reference region (forward strand); plot of the reference region (reverse strand)]

The remora analyze estimate_kmer_levels command estimates the current level for each defined k-mer from the above signal. For each read, the mean level at each covered base is computed. Then, for all reads covering a reference location, the median of the read levels is taken. These values are grouped by k-mer (defined by --kmer-context-bases) and the median is taken over all occurrences of each k-mer to produce the output table. The following command exemplifies this.

remora \
  analyze estimate_kmer_levels \
  --pod5-and-bam can_reads.pod5 can_mappings.bam \
  --refine-kmer-level-table levels.txt \
  --refine-rough-rescale \
  --kmer-context-bases 1 1 \
  --min-coverage 3 \
  --num-workers 8 \
  --log-filename log.txt

Note that a reasonable starting k-mer table is necessary to obtain reasonable output here. This example uses only 14 reads; in practice --min-coverage should be >=10. This command also estimates only a 3-mer model (--kmer-context-bases 1 1), so the context can be increased on larger datasets for a more representative model.
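The aggregation performed by estimate_kmer_levels (per-read mean level, then the median across reads at each reference position, then the median across occurrences of each k-mer) can be sketched with numpy. This is an illustration under stated assumptions, not Remora's implementation: the function name and arguments are hypothetical, the per-read mean levels are taken as already computed, and before and after stand in for the two values of --kmer-context-bases.

```python
from collections import defaultdict

import numpy as np


def estimate_kmer_levels(ref_seq, read_levels, before=1, after=1, min_coverage=1):
    """Median level per k-mer (hypothetical helper).

    read_levels: array of shape (reads, reference positions) holding each
    read's mean level at each base, with NaN where a read has no coverage.
    """
    read_levels = np.asarray(read_levels, dtype=float)
    coverage = (~np.isnan(read_levels)).sum(axis=0)
    by_kmer = defaultdict(list)
    for i in range(before, len(ref_seq) - after):
        # Skip sites below the coverage cutoff (and uncovered sites).
        if coverage[i] < max(min_coverage, 1):
            continue
        # Median over the reads covering this reference position.
        site_level = np.nanmedian(read_levels[:, i])
        by_kmer[ref_seq[i - before : i + after + 1]].append(site_level)
    # Median over all occurrences of each k-mer.
    return {kmer: float(np.median(levels)) for kmer, levels in by_kmer.items()}


ref = "ACGACG"
levels = [
    [0.0, 1.0, 2.0, 3.0, 1.4, 0.0],
    [0.0, 1.2, 2.2, np.nan, 1.6, 0.0],
    [0.0, 0.8, 1.8, 2.0, 1.5, 0.0],
]
table = estimate_kmer_levels(ref, levels, before=1, after=1, min_coverage=3)
print(table)
```

In this toy input the GAC site is covered by only two reads, so it is dropped at min_coverage=3. This shows why a low cutoff on a small dataset can yield an incomplete or noisy table.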

The new metrics API allows access to these per-read, per-site metrics for more advanced statistical analysis. This API is primarily accessed via the remora.io.Read object.

The iPython notebooks included in this repository exemplify some common analyses. [TODO add notebooks to repo]

Terms and Licence

This is a research release provided under the terms of the Oxford Nanopore Technologies’ Public Licence. Research releases are provided as technology demonstrators to provide early access to features or stimulate Community development of tools. Support for this software will be minimal and is only provided directly by the developers. Feature requests, improvements, and discussions are welcome and can be implemented by forking and pull requests. Much as we would like to rectify every issue, the developers may have limited resource for support of this software. Research releases may be unstable and subject to rapid change by Oxford Nanopore Technologies.

© 2021 Oxford Nanopore Technologies Ltd. Remora is distributed under the terms of the Oxford Nanopore Technologies’ Public Licence.

Research Release

Research releases are provided as technology demonstrators to provide early access to features or stimulate Community development of tools. Support for this software will be minimal and is only provided directly by the developers. Feature requests, improvements, and discussions are welcome and can be implemented by forking and pull requests. However much as we would like to rectify every issue and piece of feedback users may have, the developers may have limited resource for support of this software. Research releases may be unstable and subject to rapid iteration by Oxford Nanopore Technologies.
