
A simple tool designed to visualize the features that distinguish two groups of ONT data at the site level. It supports two re-squiggle programs (Tombo and f5c).


nanoCEM

The nanopore current events magnifier (nanoCEM) is a Python command-line tool that facilitates the analysis of DNA/RNA modification sites by visualizing statistical features of current events. nanoCEM can be used to showcase high-confidence sites and to observe the differences between a modified sample and a sample with low or no modification.

It supports two re-squiggle pipelines (Tombo and f5c) and both R9 and R10 data. If you want to view the signal of a single read or the raw signal, Squigualiser is recommended.

Example

Here is an example, using A2030 on 23S rRNA, to help the user confirm the installation:

pip install nanoCEM
git clone https://github.com/lrslab/nanoCEM
cd nanoCEM/example
# process the f5c result
current_events_magnifier f5c -i data/wt/file -c data/ivt/file -o f5c_result \
--chrom NR_103073.1 --strand + \
--pos 2030 \
--ref data/23S_rRNA.fasta \
--base_shift 2 --rna --norm
# process the tombo result
current_events_magnifier tombo -i data/wt/single -c data/ivt/single -o tombo_result \
--chrom NR_103073.1 --strand + \
--pos 2030 \
--ref data/23S_rRNA.fasta \
--rna --cpu 4 --norm

These commands generate the result plots as PDF files.

Data release

For the data and related commands used in our paper, please see our wiki.

Before you start, you should know

Re-squiggle

The electric current signal data produced by a nanopore read is referred to as a squiggle. Basecalling this squiggle generally produces some errors compared to a reference sequence. The re-squiggle algorithm defines a new assignment from squiggle to reference sequence, hence the name re-squiggle. Newer basecallers (Guppy/Bonito/Dorado) can write a BAM file with a move table that records the event index, but in most cases re-squiggle gives a finer alignment than the move table.
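
To make the move table idea concrete, here is a minimal sketch (not nanoCEM's own code) of how a move table can be turned into per-base signal ranges. It assumes the common layout in which each entry covers `stride` raw samples and a 1 marks the start of a new base; the function name `moves_to_segments` and the toy values are hypothetical.

```python
def moves_to_segments(moves, stride, start=0):
    """Turn a 0/1 move table into (start, end) signal ranges, one per base.

    Assumption: each entry spans `stride` raw samples and a 1 marks
    the first stride of a new base.
    """
    starts = [start + i * stride for i, move in enumerate(moves) if move == 1]
    end = start + len(moves) * stride
    # Each base ends where the next one starts; the last one ends at `end`.
    return list(zip(starts, starts[1:] + [end]))


segments = moves_to_segments([1, 0, 1, 1, 0, 0], stride=5)
print(segments)  # [(0, 10), (10, 15), (15, 30)]
```

The boundaries here can only fall on multiples of the stride, which is one way to see why a re-squiggle step may give a finer assignment.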


Data format

Since the release of R10, ONT's data formats have become more diverse, including the original fast5 format, the new pod5 format, and the community-provided slow5/blow5 formats. The relationships between them and the conversion tools are shown in the following figure.

[Figure: ONT data formats (fast5, pod5, slow5/blow5) and the conversion tools between them]

In our program, we assume that the input provided by the user is the multi-fast5 format by default.

Reference and alignment

For RNA analysis, the expected input for the vast majority of species is a fasta file of transcripts rather than the genome. This is because RNA undergoes splicing and other processing after transcription, so a single gene can produce multiple transcripts with different splicing forms and exon compositions.

Base shift (only available for f5c)

Tombo and f5c work differently. f5c applies a k-mer model, so it reports a current event only for a base that has at least 4 bases before it. For example, in CTATG, f5c returns a current event only for the final G. As a result, f5c results are always offset relative to Tombo results. To make the results of the two methods comparable and lead to similar conclusions, we recommend a base shift of 2 (the default), which keeps positions within 1 base of Tombo's. However, if you trust the original f5c output, you can set the base shift to 0.
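
The offset can be illustrated with a small sketch. This is not nanoCEM's implementation: it assumes a 5-mer model whose event is reported at the last base of the k-mer (as in the CTATG example) and that the base shift moves the reported position back toward the start of the k-mer. The function name `kmer_event_positions` is hypothetical, and positions are 0-based within the toy sequence.

```python
def kmer_event_positions(seq, k=5, base_shift=0):
    """Return (kmer, position) pairs for every k-mer in `seq`.

    Assumption: the event is first assigned to the last base of the
    k-mer, then moved back by `base_shift` bases.
    """
    events = []
    for start in range(len(seq) - k + 1):
        kmer = seq[start:start + k]
        last_base = start + k - 1
        events.append((kmer, last_base - base_shift))
    return events


print(kmer_event_positions("CTATG"))                # [('CTATG', 4)]
print(kmer_event_positions("CTATG", base_shift=2))  # [('CTATG', 2)]
```

With a shift of 2, the event for CTATG is attributed to the middle base of the 5-mer instead of the final G.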

Installation

Requirement : Python >=3.7, <3.10

pip install nanoCEM==0.0.2.4
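
If the install fails, it may be worth confirming that the interpreter falls inside the stated range. The following is a minimal sketch of that check; the helper name `is_supported` is hypothetical and the bounds are copied from the requirement line above.

```python
import sys

MIN_VERSION = (3, 7)   # inclusive lower bound from the requirement line
MAX_VERSION = (3, 10)  # exclusive upper bound from the requirement line


def is_supported(version=None):
    """Return True if (major, minor) lies in [3.7, 3.10)."""
    major_minor = tuple((version or sys.version_info)[:2])
    return MIN_VERSION <= major_minor < MAX_VERSION


print(is_supported((3, 8)))   # True
print(is_supported((3, 10)))  # False
```

Calling `is_supported()` with no argument checks the interpreter that is running the script.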

Other tools, if needed

pip install ont-fast5-api pod5
conda install -c bioconda f5c slow5tools minimap2 samtools

Options

read_tombo_resquiggle

current_events_magnifier tombo -h
optional arguments:
  -h, --help            show this help message and exit
  --basecall_group BASECALL_GROUP
                        The attribute group to extract the training data from. e.g. RawGenomeCorrected_000
  --basecall_subgroup BASECALL_SUBGROUP
                        Basecall subgroup Nanoraw resquiggle into. Default is BaseCalled_template
  -i FAST5, --fast5 FAST5
                        fast5_file
  -c CONTROL_FAST5, --control_fast5 CONTROL_FAST5
                        control_fast5_file
  -o OUTPUT, --output OUTPUT
                        output_file
  --chrom CHROM         Gene or chromosome name(head of your fasta file)
  --pos POS             site of your interest
  --len LEN             region around the position (default:10)
  --strand STRAND       Strand of your interest (default:+)
  -t CPU, --cpu CPU     num of process (default:8)
  --ref REF             fasta file
  --overplot-number OVERPLOT_NUMBER (default:500)
                        Number of read will be used to plot
  --rna                 Turn on the RNA mode 
  --norm                Turn on the normalization

read_f5c_resquiggle

current_events_magnifier f5c -h
optional arguments:
  -h, --help            show this help message and exit
  -i INPUT, --input INPUT
                        path and suffix of blow5, bam file and paf files
  -c CONTROL, --control CONTROL
                        control path and suffix of blow5, bam file and paf files
  -o OUTPUT, --output OUTPUT
                        output_file
  --chrom CHROM         Gene or chromosome name(head of your fasta file)
  --pos POS             site of your interest
  --len LEN             region around the position (default:10)
  --strand STRAND       Strand of your interest (default:+)
  --ref REF             fasta file
  --overplot-number OVERPLOT_NUMBER (default:500)
                        Number of read will be used to plot
  --rna                 
                        Turn on the RNA mode
  --base_shift BASE_SHIFT
                        base shift if required (default:2)
  --norm                Turn on the normalization
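
When plotting several sites, it can be convenient to assemble the command from a script. The sketch below builds an argument list from the f5c options listed above; the helper name `build_f5c_command` is hypothetical, and the paths are the ones used in the example section.

```python
import shlex


def build_f5c_command(sample, control, output, chrom, pos, ref,
                      strand="+", base_shift=2, rna=True, norm=True):
    """Assemble a `current_events_magnifier f5c` call as an argument list."""
    cmd = [
        "current_events_magnifier", "f5c",
        "-i", sample, "-c", control, "-o", output,
        "--chrom", chrom, "--strand", strand,
        "--pos", str(pos), "--ref", ref,
        "--base_shift", str(base_shift),
    ]
    if rna:
        cmd.append("--rna")
    if norm:
        cmd.append("--norm")
    return cmd


cmd = build_f5c_command("data/wt/file", "data/ivt/file", "f5c_result",
                        "NR_103073.1", 2030, "data/23S_rRNA.fasta")
# Print a shell-safe version of the command for inspection.
print(" ".join(shlex.quote(arg) for arg in cmd))
```

The list can be passed to `subprocess.run(cmd, check=True)` to execute it once nanoCEM is installed.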

Quick start

1. Run basecalling and alignment on your ONT data

# assuming your fast5 folder is fast5/ and your reference is reference.fasta
# -q 30 is recommended for DNA and -q 5 for RNA, but you can try other filters on your data
# guppy is just an example; other basecallers such as Bonito and Dorado can also be used
guppy_basecaller -i fast5/ -s ./guppy_out --recursive --device auto -c rna_r9.4.1_70bps_hac.cfg
cat guppy_out/*/*.fastq > all.fastq
minimap2 -ax map-ont -t 16 --MD reference.fasta all.fastq | samtools view -hbS -F 260 -q 5 - | samtools sort -@ 16 -o file.bam
samtools index file.bam

The -c option of guppy_basecaller specifies the config file, which depends on your data.
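
The `-F 260 -q 5` filter in the samtools command above can be read as follows: 260 is the sum of the SAM flag bits for "unmapped" (4) and "secondary alignment" (256), and `-q 5` drops alignments with mapping quality below 5. Here is a small sketch of the same test in Python; the function name `passes_filter` is hypothetical.

```python
UNMAPPED = 0x4      # SAM flag bit: read is unmapped
SECONDARY = 0x100   # SAM flag bit: secondary alignment
EXCLUDE = UNMAPPED | SECONDARY  # 4 + 256 = 260


def passes_filter(flag, mapq, exclude=EXCLUDE, min_mapq=5):
    """Mimic `samtools view -F 260 -q 5` for a single alignment."""
    return (flag & exclude) == 0 and mapq >= min_mapq


print(EXCLUDE)                  # 260
print(passes_filter(0, 60))     # True: primary, forward strand
print(passes_filter(16, 60))    # True: primary, reverse strand
print(passes_filter(256, 60))   # False: secondary alignment
print(passes_filter(0, 3))      # False: mapping quality below 5
```

Note that supplementary alignments (flag bit 2048) are not excluded by 260.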

2. Decide the chrom or transcript name and region of your interest

In this example, I plot the 23S rRNA, whose header in the fasta file is NR_103073.1, and I am interested in A2030 on the plus strand. So the following commands use --chrom NR_103073.1 --pos 2030 --strand +.
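
Before running the plots, it can help to confirm that the chosen name matches a fasta header and that the reference base at the position is the one you expect. The sketch below does this with plain Python; the helper names are hypothetical, the record is a toy sequence rather than the real 23S rRNA, and it assumes the position is 1-based.

```python
def read_fasta(lines):
    """Parse FASTA lines into {name: sequence}; name is the first word of the header."""
    records, name = {}, None
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            name = line[1:].split()[0]
            records[name] = []
        elif name is not None:
            records[name].append(line)
    return {n: "".join(parts) for n, parts in records.items()}


def base_at(records, chrom, pos):
    """Return the base at a 1-based position (assumed convention)."""
    seq = records[chrom]
    if not 1 <= pos <= len(seq):
        raise ValueError(f"position {pos} is outside 1..{len(seq)}")
    return seq[pos - 1]


# Toy record for illustration only.
toy = [">toy_transcript example", "ACGT", "TTAG"]
records = read_fasta(toy)
print(base_at(records, "toy_transcript", 7))  # A
```

For the example data, the same functions could be applied to an open handle on data/23S_rRNA.fasta with the name NR_103073.1 and position 2030.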

3. Subsample (Optional)

Re-squiggle is very time-consuming because it is applied to all reads, not only the reads around the region of interest. So I provide simple scripts to extract the reads you want to visualize. The extracted reads are copied to subsample_single/.

multi_to_single_fast5 -i fast5/ -s single/ --recursive -t 16
extract_sub_fast5_from_bam -i single/ -o subsample_single/ -b file.bam --chrom NR_103073.1 --pos 2030 
# Remember to subsample the fastq as well if you subsampled the fast5 files
extract_sub_fastq_from_bam -i all.fastq -o final.fastq -b file.bam --chrom NR_103073.1 --pos 2030 
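
Conceptually, the subsample keeps only reads whose alignment overlaps the site of interest. The sketch below shows that selection on plain tuples; it is an illustration under assumptions, not the code of the scripts above. The function name `reads_covering`, the tuple layout, and the read ids are all hypothetical, and coordinates are taken as 1-based and inclusive.

```python
def reads_covering(alignments, chrom, pos):
    """Return ids of reads whose aligned span contains `pos` on `chrom`.

    `alignments` is an iterable of (read_id, chrom, start, end) tuples
    with 1-based inclusive coordinates (assumed layout).
    """
    return [read_id for read_id, c, start, end in alignments
            if c == chrom and start <= pos <= end]


toy_alignments = [
    ("read_a", "NR_103073.1", 1500, 2500),  # covers 2030
    ("read_b", "NR_103073.1", 100, 900),    # ends before 2030
    ("read_c", "other_rna", 2000, 2100),    # different reference
    ("read_d", "NR_103073.1", 2030, 2900),  # starts exactly at 2030
]
print(reads_covering(toy_alignments, "NR_103073.1", 2030))  # ['read_a', 'read_d']
```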

4. Re-squiggle

4.1 Tombo resquiggle (v1.5.0)

Run steps 1 and 2 on each of your two samples before step 3.

  1. Data format conversion

If you subsampled, skip this step and use subsample_single/ as the input below instead of single/.

# assuming your fast5 folder is fast5/
multi_to_single_fast5 -i fast5/ -s single/ --recursive -t 16
  2. Run tombo resquiggle
# if the fast5 files are not in single format, convert them to single format with ont-fast5-api
# single/ is the fast5s-base-directory

tombo preprocess annotate_raw_with_fastqs --fast5-basedir  single/ --fastq-filenames all.fastq --processes 16 
tombo resquiggle single/ reference.fasta --processes 16 --num-most-common-errors 5
# Notes:
# Tombo resquiggle can take a long time, so subsampling the aligned reads in your region of interest is recommended
# Run the Tombo pipeline above on each of your two samples; an SSD is recommended
# If you subsampled (section 3), run the tombo commands on subsample_single/ instead of single/
  3. Run current_events_magnifier to plot
# process the tombo result
current_events_magnifier tombo -i data/wt/single -c data/ivt/single -o tombo_result \
--chrom NR_103073.1 --strand + \
--pos 2030 \
--ref data/23S_rRNA.fasta \
--rna --cpu 4 --norm

4.2 f5c resquiggle (v1.2, supports R10)

Run steps 1 and 2 on each of your two samples before step 3.

  1. Data format conversion. If you subsampled, use subsample_single/ as the input below instead of fast5/.
slow5tools f2s fast5/ -d blow5_dir
slow5tools merge blow5_dir -o file.blow5
slow5tools index file.blow5
  2. Run f5c resquiggle

Use --rna to enable RNA mode and --pore r10 to re-squiggle R10 reads.

f5c resquiggle -c all.fastq file.blow5 -o file.paf --rna --pore r9
  3. Run nanoCEM to plot
# run the steps above on each of your two samples and keep the same file name (apart from the extension) for the bam/paf/blow5 files
current_events_magnifier f5c -i data/wt/file -c data/ivt/file -o f5c_result \
--chrom NR_103073.1 --strand + \
--pos 2030 \
--ref data/23S_rRNA.fasta \
--base_shift 2 --rna --norm
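
Since the f5c mode takes a shared path for the bam, paf, and blow5 files, a quick pre-flight check can catch a missing or misnamed file. This is a minimal sketch that assumes the three extensions are appended directly to the value given to -i or -c, as the file names in the steps above suggest; the helper name `missing_inputs` is hypothetical.

```python
from pathlib import Path

# Extensions inferred from the commands above (file.bam, file.paf, file.blow5).
EXTENSIONS = (".bam", ".paf", ".blow5")


def missing_inputs(prefix):
    """Return the expected input files that do not exist for `prefix`."""
    return [prefix + ext for ext in EXTENSIONS
            if not Path(prefix + ext).is_file()]


for prefix in ("data/wt/file", "data/ivt/file"):
    missing = missing_inputs(prefix)
    if missing:
        print("missing:", ", ".join(missing))
    else:
        print(prefix, "looks complete")
```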

