A simple tool for visualizing the features that distinguish two groups of ONT data at the site level. It supports two re-squiggle programs (Tombo and f5c).

nanoCEM

The nanopore current events magnifier (nanoCEM) is a Python command-line tool that facilitates the analysis of DNA/RNA modification sites by visualizing statistical features of current events. nanoCEM can be used to showcase high-confidence sites and to observe the differences between a modified sample and a sample with low or no modification.

It supports two re-squiggle pipelines (Tombo and f5c) and both R9 and R10 data. If you want to view the signal of a single read or the raw signal, Squigualiser is recommended.

Example

Here is an example, using A2030 on 23S rRNA, to help you confirm the installation:

pip install nanoCEM
git clone https://github.com/lrslab/nanoCEM
cd nanoCEM/example
# plot from the f5c result
current_events_magnifier f5c -i data/wt/file -c data/ivt/file -o f5c_result \
--chrom NR_103073.1 --strand + \
--pos 2030 \
--ref data/23S_rRNA.fasta \
--base_shift 2 --rna --norm
# plot from the tombo result
current_events_magnifier tombo -i data/wt/single -c data/ivt/single -o tombo_result \
--chrom NR_103073.1 --strand + \
--pos 2030 \
--ref data/23S_rRNA.fasta \
--rna --cpu 4 --norm

These commands then generate PDF files containing the plots.

Data release

For the data and related commands used in our paper, please see our wiki.

Before you start, you should know

Re-squiggle

The electric current signal data produced from a nanopore read is referred to as a squiggle. Basecalls made from this squiggle generally contain some errors compared to a reference sequence. The re-squiggle algorithm defines a new assignment from squiggle to reference sequence, hence a re-squiggle. Although newer basecallers (Guppy/Bonito/Dorado) generate a BAM file with a move table that records the event index, re-squiggle is in most cases a finer alignment than the move table.
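To make the move table idea concrete, here is a minimal sketch of how a basecaller's move table can be turned into per-base signal ranges. It assumes the usual convention that each move entry covers `stride` raw samples and that a 1 marks the start of a new base; the function name and the toy values are hypothetical, and this is not how Tombo or f5c perform re-squiggle.

```python
def moves_to_segments(moves, stride, signal_start=0):
    """Convert a move table into (start, end) raw-signal ranges, one per base.

    Assumption: each entry of `moves` covers `stride` samples, and a value
    of 1 means a new base starts at that entry.
    """
    starts = [signal_start + i * stride for i, m in enumerate(moves) if m == 1]
    end = signal_start + len(moves) * stride
    # Each base ends where the next one starts; the last one ends at the signal end.
    return list(zip(starts, starts[1:] + [end]))


# Toy move table (hypothetical values): three bases over 30 samples.
segments = moves_to_segments([1, 0, 0, 1, 1, 0], stride=5)
print(segments)
```

Because every boundary falls on a multiple of the stride, this segmentation is coarse, which is one way to see why a re-squiggle step can give a finer alignment.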

Data format

Since the release of R10, ONT's data formats have become more diverse. They now include the original fast5 format, the new pod5 format, and the community-provided slow5/blow5 formats. The relationships among them and the conversion tools are shown in the following figure.

[Figure: relationships among the fast5, pod5, and slow5/blow5 formats and their conversion tools]

By default, nanoCEM assumes that the input provided by the user is in multi-fast5 format.

Reference and alignment

For RNA, the expected reference for the vast majority of species is a FASTA file of transcripts rather than the genome. This is because RNA undergoes splicing and other processing after transcription, so a single gene can produce multiple transcripts with different splice forms and exon compositions.

Base shift (only available for f5c)

Tombo and f5c work differently. f5c uses a k-mer model, which means a base must have at least 4 bases before it to receive an event. For example, in CTATG, f5c only returns a current event for the last G. As a result, the f5c results are always offset relative to Tombo. To make the results of the two methods comparable, so that they lead to similar conclusions, we recommend an offset of 2, which keeps f5c within 1 base of Tombo (default: 2). However, if you trust the unshifted f5c result, you can set the offset to 0.
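The offset can be illustrated with a small sketch. It assumes a 5-mer model in which each event is attributed to the last base of its k-mer, as in the CTATG example above, and it assumes that the shift moves the reported position back toward the middle of the k-mer. Both helper names are hypothetical, and the direction nanoCEM actually applies should be checked against its source.

```python
def kmer_event_positions(seq, k=5):
    """List (k-mer, 0-based index of its last base) for every k-mer in seq.

    Assumption: the event of a k-mer is reported at its last base.
    """
    return [(seq[i:i + k], i + k - 1) for i in range(len(seq) - k + 1)]


def apply_base_shift(position, base_shift=2):
    """Hypothetical helper: move a reported position back by base_shift bases."""
    return position - base_shift


for kmer, last_base in kmer_event_positions("CTATGCA"):
    # With base_shift=2 the event lands on the middle base of the 5-mer.
    print(kmer, last_base, apply_base_shift(last_base))
```

With a shift of 0 the positions are left exactly as reported for the last base of each k-mer.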

Installation

Requirement: Python >=3.7, <3.10

pip install nanoCEM==0.0.2.4

Other tools, if needed

pip install ont-fast5-api pod5
conda install -c bioconda f5c slow5tools minimap2 samtools
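Before installing, the Python requirement stated above can be checked from the interpreter itself. This is only a convenience sketch; the function name is hypothetical and the bounds are copied from the requirement line.

```python
import sys


def python_supported(version=None):
    """Return True if (major, minor) satisfies Python >=3.7, <3.10."""
    if version is None:
        version = sys.version_info[:2]
    return (3, 7) <= tuple(version) < (3, 10)


# Reports whether the interpreter running this script is in the supported range.
print(python_supported())
```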

Options

read_tombo_resquiggle

current_events_magnifier tombo -h
optional arguments:
  -h, --help            show this help message and exit
  --basecall_group BASECALL_GROUP
                        The attribute group to extract the training data from. e.g. RawGenomeCorrected_000
  --basecall_subgroup BASECALL_SUBGROUP
                        Basecall subgroup Nanoraw resquiggle into. Default is BaseCalled_template
  -i FAST5, --fast5 FAST5
                        fast5_file
  -c CONTROL_FAST5, --control_fast5 CONTROL_FAST5
                        control_fast5_file
  -o OUTPUT, --output OUTPUT
                        output_file
  --chrom CHROM         Gene or chromosome name(head of your fasta file)
  --pos POS             site of your interest
  --len LEN             region around the position (default:10)
  --strand STRAND       Strand of your interest (default:+)
  -t CPU, --cpu CPU     num of process (default:8)
  --ref REF             fasta file
  --overplot-number OVERPLOT_NUMBER (default:500)
                        Number of read will be used to plot
  --rna                 Turn on the RNA mode 
  --norm                Turn on the normalization

read_f5c_resquiggle

current_events_magnifier f5c -h
optional arguments:
  -h, --help            show this help message and exit
  -i INPUT, --input INPUT
                        path and suffix of blow5, bam file and paf files
  -c CONTROL, --control CONTROL
                        control path and suffix of blow5, bam file and paf files
  -o OUTPUT, --output OUTPUT
                        output_file
  --chrom CHROM         Gene or chromosome name(head of your fasta file)
  --pos POS             site of your interest
  --len LEN             region around the position (default:10)
  --strand STRAND       Strand of your interest (default:+)
  --ref REF             fasta file
  --overplot-number OVERPLOT_NUMBER (default:500)
                        Number of read will be used to plot
  --rna                 
                        Turn on the RNA mode
  --base_shift BASE_SHIFT
                        base shift if required (default:2)
  --norm                Turn on the normalization
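When the plotting step is driven from a script, the f5c options listed above can be assembled as an argument list. The sketch below only uses flags that appear in the help text; the function name and its defaults are hypothetical, and it builds the command without running it.

```python
import shlex


def build_f5c_plot_command(sample, control, output, chrom, pos, ref,
                           strand="+", base_shift=2, rna=False, norm=False):
    """Assemble a current_events_magnifier f5c call as a list of arguments."""
    cmd = ["current_events_magnifier", "f5c",
           "-i", sample, "-c", control, "-o", output,
           "--chrom", chrom, "--strand", strand,
           "--pos", str(pos), "--ref", ref,
           "--base_shift", str(base_shift)]
    if rna:
        cmd.append("--rna")
    if norm:
        cmd.append("--norm")
    return cmd


cmd = build_f5c_plot_command("data/wt/file", "data/ivt/file", "f5c_result",
                             chrom="NR_103073.1", pos=2030,
                             ref="data/23S_rRNA.fasta", rna=True, norm=True)
# Print a shell-safe version; pass `cmd` to subprocess.run to execute it.
print(" ".join(shlex.quote(arg) for arg in cmd))
```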

Quick start

1. Run basecalling and alignment on your ONT data

# assumes your fast5 folder is fast5/ and your reference is reference.fasta
# -q 30 is recommended for DNA and -q 5 for RNA, but you can try other filters on your data
# Guppy is just an example; other basecallers such as Bonito and Dorado can also be used
guppy_basecaller -i fast5/ -s ./guppy_out --recursive --device auto -c rna_r9.4.1_70bps_hac.cfg
cat guppy_out/*/*.fastq > all.fastq
minimap2 -ax map-ont -t 16 --MD reference.fasta all.fastq | samtools view -hbS -F 260 -q 5 - | samtools sort -@ 16 -o file.bam
samtools index file.bam

The -c option specifies the config file, which depends on your data.
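The samtools filter in the alignment command above, -F 260 -q 5, can be read as follows: 260 is 4 + 256, so reads flagged as unmapped (0x4) or as secondary alignments (0x100) are dropped, and reads with a mapping quality below 5 are dropped as well. A small sketch of the same rule, with a hypothetical function name:

```python
UNMAPPED = 0x4      # SAM flag bit: read is unmapped
SECONDARY = 0x100   # SAM flag bit: secondary alignment


def passes_filter(flag, mapq, exclude=UNMAPPED | SECONDARY, min_mapq=5):
    """Mimic `samtools view -F 260 -q 5` for a single alignment record."""
    return (flag & exclude) == 0 and mapq >= min_mapq


print(UNMAPPED | SECONDARY)   # the value passed to -F
print(passes_filter(16, 60))  # reverse-strand primary alignment
```

Supplementary alignments (0x800) are not covered by 260, so they pass this filter.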

2. Decide the chrom or transcript name and region of your interest

In this example, we plot the 23S rRNA, whose header in the FASTA file is NR_103073.1, and the site of interest is A2030 on the plus strand. The following commands therefore use --chrom NR_103073.1 --pos 2030 --strand +.
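Before plotting, it can help to confirm that the name given to --chrom matches a FASTA header and that the base at --pos is the one you expect. The sketch below assumes that --pos is 1-based, as the name A2030 suggests, and uses a made-up toy record rather than the real 23S rRNA sequence; the function names are hypothetical.

```python
def read_fasta(text):
    """Parse FASTA text into {name: sequence}, using the first word of each header."""
    records = {}
    name = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            name = line[1:].split()[0]
            records[name] = []
        elif name is not None:
            records[name].append(line)
    return {n: "".join(parts) for n, parts in records.items()}


def base_at(records, chrom, pos):
    """Return the base at a 1-based position (assumed convention for --pos)."""
    seq = records[chrom]
    if not 1 <= pos <= len(seq):
        raise ValueError(f"position {pos} is outside {chrom} (length {len(seq)})")
    return seq[pos - 1]


# Toy record (hypothetical); for real data, pass the contents of your FASTA file.
toy = ">toy_transcript example\nACGTAC\nGTTA\n"
records = read_fasta(toy)
print(base_at(records, "toy_transcript", 5))
```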

3. Subsample (Optional)

Re-squiggling is very time-consuming because it is applied to all reads, not only to the reads around the region of interest. nanoCEM therefore provides simple scripts to extract the reads you want to visualize. The extracted reads are copied to subsample_single/.

multi_to_single_fast5 -i fast5/ -s single/ --recursive -t 16
extract_sub_fast5_from_bam -i single/ -o subsample_single/ -b file.bam --chrom NR_103073.1 --pos 2030 
# Remember to subsample the fastq if you subsampled your fast5
extract_sub_fastq_from_bam -i all.fastq -o final.fastq -b file.bam --chrom NR_103073.1 --pos 2030
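The idea behind the subsampling step can be sketched as selecting the reads whose alignment covers the site. This is not the code of the extraction scripts; it assumes alignments are given as 0-based, half-open reference intervals (as in BAM) and that the position is 1-based. The read IDs and coordinates are made up.

```python
def reads_covering(alignments, chrom, pos):
    """Return IDs of reads whose aligned interval contains the 1-based position.

    `alignments` holds (read_id, chrom, start, end) with 0-based, half-open
    reference coordinates.
    """
    target = pos - 1  # convert to 0-based
    return [read_id for read_id, aln_chrom, start, end in alignments
            if aln_chrom == chrom and start <= target < end]


# Hypothetical alignments for illustration only.
alignments = [
    ("read_a", "NR_103073.1", 1500, 2400),
    ("read_b", "NR_103073.1", 0, 2029),
    ("read_c", "other_transcript", 1000, 3000),
]
print(reads_covering(alignments, "NR_103073.1", 2030))
```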

4. Re-squiggle

4.1 Tombo resquiggle (v1.5.0)

The first two steps below (data format conversion and re-squiggle) should be run on each of your two samples before the plotting step.

Data format conversion

If you subsampled, skip this step and use subsample_single/ as the input below instead of single/.

# assumes your fast5 folder is fast5/
multi_to_single_fast5 -i fast5/ -s single/ --recursive -t 16

Run tombo resquiggle

# if the fast5 files are not in single-read format, convert them first with ont-fast5-api
# single/ is the fast5 base directory

tombo preprocess annotate_raw_with_fastqs --fast5-basedir  single/ --fastq-filenames all.fastq --processes 16 
tombo resquiggle single/ reference.fasta --processes 16 --num-most-common-errors 5
# Notes:
# Tombo resquiggle can take a long time, so subsampling the aligned reads around the region of interest is recommended
# Run the Tombo pipeline above on each of your two samples; an SSD is recommended
# If you ran the subsample step (step 3), run the tombo commands on subsample_single/ instead of single/

Run current_events_magnifier to plot

# plot from the tombo result
current_events_magnifier tombo -i data/wt/single -c data/ivt/single -o tombo_result \
--chrom NR_103073.1 --strand + \
--pos 2030 \
--ref data/23S_rRNA.fasta \
--rna --cpu 4 --norm

4.2 f5c resquiggle (v1.2) (supports R10)

The first two steps below (data format conversion and re-squiggle) should be run on each of your two samples before the plotting step.

Data format conversion. If you subsampled, use subsample_single/ as the input in the following commands instead of fast5/.

slow5tools f2s fast5/ -d blow5_dir
slow5tools merge blow5_dir -o file.blow5
slow5tools index file.blow5

Run f5c resquiggle

Use --rna to turn on RNA mode and --pore r10 to re-squiggle R10 reads.

f5c resquiggle -c all.fastq file.blow5 -o file.paf --rna --pore r9

Run nanoCEM to plot

# run the pipeline above on each of your two samples, and give the bam/paf/blow5 files the same name apart from the extension
current_events_magnifier f5c -i data/wt/file -c data/ivt/file -o f5c_result \
--chrom NR_103073.1 --strand + \
--pos 2030 \
--ref data/23S_rRNA.fasta \
--base_shift 2 --rna --norm
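Since -i and -c take a shared path for the bam, paf, and blow5 files, a quick check that all three exist can save a failed run. The sketch below assumes the extensions .bam, .paf, and .blow5, inferred from the file names used in the commands above; the function name is hypothetical, and the demonstration uses a temporary directory instead of real data.

```python
import tempfile
from pathlib import Path

# Assumed extensions, taken from file.bam, file.paf and file.blow5 above.
EXTENSIONS = (".bam", ".paf", ".blow5")


def missing_inputs(prefix):
    """Return the expected input files for `prefix` that do not exist."""
    return [prefix + ext for ext in EXTENSIONS if not Path(prefix + ext).exists()]


# Demonstration in a temporary directory: only the bam and paf files are created.
with tempfile.TemporaryDirectory() as tmp:
    prefix = str(Path(tmp) / "file")
    for ext in (".bam", ".paf"):
        Path(prefix + ext).touch()
    demo_missing = missing_inputs(prefix)
    print(demo_missing)
```

For the example data, the same check would be called as missing_inputs("data/wt/file") and missing_inputs("data/ivt/file").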
