Skip to main content

Extract Methylation calls from ONT or PB long read data

Project description

LoReMe pipeline

TODO:

  • Tidy up rough edges in documentation (e.g. cli reference)
  • Review handling of ONT bedMethyl format
  • Find out why Modkit runs with only one thread
  • Rust-ify postprocessing functions
  • unit tests for postprocessing functions

LoReMe (Long Read Methylaton) is a Python package facilitating analysis of DNA methylation signals from Pacific Biosciences or Oxford Nanopore long read sequencing data.

It consists of an API and CLI for three distinct applications:

  1. Pacific Biosciences data processing. PB reads in SAM/BAM format are aligned to a reference genome with the special-purpose aligner pbmm2, a modified version of minimap2. Methylation calls are then piled up from the aligned reads with pb-CpG-tools.

  2. Oxford nanopore basecalling. ONT reads are optionally converted from FAST5 to POD5 format, then basecalled and aligned to a reference with dorado (dorado alignment also uses minimap2 under the hood), and finally piled up with modkit.

  3. Postprocessing and QC of methylation calls. Several functions are available to generate diagnostic statistics and plots.

See also the full documentation.

Other tools of interest: methylartist, modbamtools (modbamtools docs), methplotlib

Installation

In a Conda environment

The recommended way to install loreme is with a dedicated conda environment:

First create an environment including all dependencies:

conda create -n loreme -c conda-forge -c bioconda samtools pbmm2 \
  urllib3 pybedtools gff2bed seaborn pyfaidx psutil gputil tabulate \
  cython h5py iso8601 more-itertools tqdm
conda activate loreme

Then install with pip:

pip install loreme

You may also wish to install nvtop to monitor GPU usage:

conda install -c conda-forge nvtop

With pip

pip install loreme

Check installation

Check that the correct version was installed with loreme --version

Uninstall

To uninstall loreme:

loreme clean
pip uninstall loreme

Oxford Nanopore reads

Download example dataset

Use loreme download-ont-example to download a K. azureus (zebra perch) genome assembly, gene annotations, and a POD5 file containing ONT reads.

Note

This example analysis involves downloading and processing a 26 GB POD5 file of nanopore reads, which will be time consuming and resource-intensive. For that reason, it is recommended to reproduce it on a remote server (such as Seabiscuit for members of the Michael Lab at Salk).

mkdir loreme_example_Kazu/
loreme download-ont-example -d loreme_example_Kazu/

Consider going for a coffee break with your colleagues, as this will take a few minutes. Once the download is complete, the example datasets will be present in the desintation directory.

cd loreme_example_Kazu/
ls
Kazu_090722.primary_high_confidence.gff3.gz
Kazu_090722.softmasked.fasta.gz
Kazu_090722_95etoh_22C_6w_SRE1_deep.pod5

Download dorado

Calling methylation from ONT long reads requires the basecaller dorado . Download it by running

loreme download-dorado <platform>

This will download dorado and several basecalling models. The platform should be one of: linux-x64, linux-arm64, osx-arm64, win64, whichever matches your system. Running loreme download-dorado --help will show a hint as to the correct choice.

Note

For members of Michael Lab at Salk running on Seabiscuit, use loreme download-dorado linux-x64.

Modified basecalling

Note

If you have FAST5 data, convert it to POD5 using loreme dorado-convert (see loreme dorado-convert --help)

The POD5 file will need to be contained in its own directory on a scratch disk to be basecalled, so create a directory and move it there:

Note

Since basecalling ONT data is disk-read intensive, it will be slow on a spinning disk. The environment variable $SCRATCH should be set to a location on a fast SSD or NVMe disk for the following steps. For Michael lab members running on Seabiscuit, use this command: SCRATCH=/scratch/<username>.

mkdir $SCRATCH/Kazu_090722_pod5/
mv Kazu_090722_95etoh_22C_6w_SRE1_deep.pod5 $SCRATCH/Kazu_090722_pod5/

First check that a GPU is available to use:

loreme check-gpu

You can carry out modified basecalling (i.e. DNA methylation) with default parameters by running:

loreme dorado-basecall $SCRATCH/Kazu_090722_pod5/ $SCRATCH/Kazu_090722.bam

The input argument should be a directory containing one or more POD5 files. The output argument is a BAM file containing MM/ML tags. For other parameter options, see loreme dorado-basecall --help

Note

By default, LoReMe uses the "fast" basecalling model, which is less accurate. In practice you will probably want to use a higher accuracy model by setting the --accuracy option, e.g. loreme dorado-basecall --accuracy sup $SCRATCH/Kazu_090722_pod5/ $SCRATCH/Kazu_090722.bam

Note

While basecalling, tools like htop or mutil may show your memory usage increasing to hundreds of gigabytes. This is a bug in dorado and/or how Linux communicates with the GPU, and does not reflect actual memory usage.

To run dorado with only regular basecalling, use the --no-mod option:

loreme dorado-basecall --no-mod <pod5s/> <output.bam>

Alignment

The BAM file produced by dorado can be aligned to a reference index (FASTA or MMI file) with loreme dorado-align:

loreme dorado-align Kazu_090722.softmasked.fasta.gz $SCRATCH/Kazu_090722.bam Kazu_090722_aligned.bam

Download modkit

Piling up methylation calls from BAM data requires modkit . Download it by running:

loreme download-modkit

Check BAM file for MM/ML tags

Before processing with modkit, check that the BAM file contains MM/ML tags.

loreme check-tags Kazu_090722_aligned.bam

Pileup

The pileup step generates a bedMethyl file from an aligned BAM file.

Before running modkit, you will need to decompress the FASTA file.

gunzip Kazu_090722.softmasked.fasta.gz

Then you can generate a pileup

loreme modkit-pileup Kazu_090722.softmasked.fasta Kazu_090722_aligned.bam Kazu_090722.bed

Note

See loreme modkit-pileup --help for additional options. On a HPC system you may want to use additional threads with the -t flag.

Postprocessing

See also the Pacific Biosciences reads section for examples of postprocessing analysis that can be applied to bedMethyl files.

Calculate mean methylation level

loreme mean --total Kazu_090722.bed
Group        Methylation level (%)
0    79.64114117704054

Note

This and subsequent steps represent postprocessing that can be applied to either Pacific Biosciences or Oxford Nanopore datasets.

Methylation level of promoters

To calculate methylation levels of promoter regions:

loreme promoter Kazu_090722.primary_high_confidence.gff3.gz \
  Kazu_090722.bed --hist Kazu_090722_promoter.svg > Kazu_090722_promoter.bed
head Kazu_090722_promoter.bed
ctg1	159758	161758	Kazu_090722.ctg1.g000010	86.36	-
ctg1	256072	258072	Kazu_090722.ctg1.g000030	64.29	+
ctg1	321679	323679	Kazu_090722.ctg1.g000050	76.79	-
ctg1	397819	399819	Kazu_090722.ctg1.g000060	45.23	+
ctg1	437295	439295	Kazu_090722.ctg1.g000070	100.00	+
ctg1	444908	446908	Kazu_090722.ctg1.g000080	59.09	-
ctg1	499196	501196	Kazu_090722.ctg1.g000100	97.64	-
ctg1	547954	549954	Kazu_090722.ctg1.g000110	100.00	+
ctg1	577225	579225	Kazu_090722.ctg1.g000120	77.38	-
ctg1	662637	664637	Kazu_090722.ctg1.g000130	100.00	-

Histogram of methylation levels of promoter regions:

promoter methylation

Methylation level of gene bodies

To calculate methylation levels of gene bodies:

loreme gene-body Kazu_090722.primary_high_confidence.gff3.gz \
  Kazu_090722.bed --hist Kazu_090722_gene_body.svg > Kazu_090722_gene_body.bed
head Kazu_090722_gene_body.bed
ctg1	159267	159758	Kazu_090722.ctg1.g000010	95.00	-
ctg1	237075	241223	Kazu_090722.ctg1.g000020	92.05	-
ctg1	258072	259127	Kazu_090722.ctg1.g000030	100.00	+
ctg1	320906	321679	Kazu_090722.ctg1.g000050	66.67	-
ctg1	399819	411211	Kazu_090722.ctg1.g000060	83.36	+
ctg1	439295	440181	Kazu_090722.ctg1.g000070	77.78	+
ctg1	444411	444908	Kazu_090722.ctg1.g000080	100.00	-
ctg1	456252	477732	Kazu_090722.ctg1.g000090	89.00	+
ctg1	498921	499196	Kazu_090722.ctg1.g000100	100.00	-
ctg1	549954	558103	Kazu_090722.ctg1.g000110	98.23	+

Histogram of methylation levels of gene bodies:

gene body methylation

Gene methylation profile

To plot a profile of methylation across gene bodies:

loreme plot-genes gene-body Kazu_090722.primary_high_confidence.gff3.gz \
  Kazu_090722.bed Kazu_090722_genes.svg

Methylation profile across gene bodies:

gene methylation profile

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

loreme-0.1.12.tar.gz (26.3 kB view details)

Uploaded Source

Built Distribution

loreme-0.1.12-py3-none-any.whl (35.1 kB view details)

Uploaded Python 3

File details

Details for the file loreme-0.1.12.tar.gz.

File metadata

  • Download URL: loreme-0.1.12.tar.gz
  • Upload date:
  • Size: 26.3 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/4.0.2 CPython/3.10.2

File hashes

Hashes for loreme-0.1.12.tar.gz
Algorithm Hash digest
SHA256 6ad8fc7055e4c0cc9ccdff420b74b27318ad2f70d8fbff79a633b5c0bc8d5db4
MD5 1fd121f030aed7f36ee79b0f0d03d324
BLAKE2b-256 15187211ed187e2f1732245e4f93f6d2d1ea49511b0da16724e3269ec86381a5

See more details on using hashes here.

File details

Details for the file loreme-0.1.12-py3-none-any.whl.

File metadata

  • Download URL: loreme-0.1.12-py3-none-any.whl
  • Upload date:
  • Size: 35.1 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/4.0.2 CPython/3.10.2

File hashes

Hashes for loreme-0.1.12-py3-none-any.whl
Algorithm Hash digest
SHA256 bcbe142b80612d33fc15d31bb3bca7a7440995594dcf4ba77d26d5170900be8e
MD5 4677f2ea8f0b432b64afd663efada540
BLAKE2b-256 5085a3609be3beaedb9d1cbb9ecfcf9742ec347df4326d7477be38ec34b1b7f6

See more details on using hashes here.

Supported by

AWS AWS Cloud computing and Security Sponsor Datadog Datadog Monitoring Fastly Fastly CDN Google Google Download Analytics Microsoft Microsoft PSF Sponsor Pingdom Pingdom Monitoring Sentry Sentry Error logging StatusPage StatusPage Status page