Skip to main content

A sample tool designed to visualize the features that distinguish between two groups of ONT data at the site level. It supports two re-squiggle pipeline(Tombo and f5c).

Project description

nanoCEM

The nanopore current events magnifier (nanoCEM) is a python command line to facilitate the analysis of DNA/RNA modification sites by visualizing statistical features of current events. CEM can be used to showcase high confidence sites and observe the difference based on the modification sample and the low or no modification sample.

It supports two re-squiggle pipeline(Tombo and f5c) and support R9 and R10. If you want to view single read signal or raw signal, Squigualiser is recommended.

Example

Here is an example that show the difference of A2030 on 23S rRNA. alt text alt text

Before start, you should know

Re-squiggle

In ONT technology, "resquiggle" refers to the process of converting the raw electrical signals from the sequencer into corresponding DNA/RNA sequence information, which is then corrected and realigned. This process utilizes the signal features of ONT sequencing, such as changes in electrical resistance and noisy signals, to capture information from the DNA/RNA sequence and analyze and interpret it. Although new basecaller program (Guppy/Boinito/Dorado) generated the bam file with move table to record the event index,but resquiggle is a more fine alignment than the move table in most cases.

Data format

Since the release of the R10, ONT's data formats have become more diverse, including the initial fast5 format, the new pod5 format, and community-provided slow5/blow5 formats. The relationship between them and conversion tools are shown in the following figure.

alt text

In our program, we assume that the input provided by the user is the multi-fast5 format by default.

Reference and alignment

For RNA showcase, the expected input for the vast majority of species is a fasta file of transcripts, rather than the genome. This is because RNA undergoes splicing and other phenomena after transcription, allowing a single gene to produce multiple different transcripts with varying splicing forms and exon compositions.

Base shift (only available for f5c)

The mechanism of tombo and f5c is different, f5c applied a k-mer model, which means base should satisfy at least 4 bases before it. For example, in CTATG, f5c will only return the last G's current event.So, compared to tombo, there is always an offset in the results of f5c. In order to make the results of the two methods comparable and draw similar conclusions, we recommend using an offset of 2 maintained a distance no greater than 1 base compared with Tombo. However, if you trust the original input, you can set the offset to 0.

Installation

Requirement : Python >=3.7, <3.10

pip install nanoCEM==0.0.4.4
pip install ont-fast5-api
conda install -c bioconda f5c slow5tools

Options

read_tombo_resquiggle

current_events_magnifier.py tombo -h
optional arguments:
  -h, --help            show this help message and exit
  --basecall_group BASECALL_GROUP
                        The attribute group to extract the training data from. e.g. RawGenomeCorrected_000
  --basecall_subgroup BASECALL_SUBGROUP
                        Basecall subgroup Nanoraw resquiggle into. Default is BaseCalled_template
  -i FAST5, --fast5 FAST5
                        fast5_file
  -c CONTROL_FAST5, --control_fast5 CONTROL_FAST5
                        control_fast5_file
  -o OUTPUT, --output OUTPUT
                        output_file
  --chrom CHROM         Gene or chromosome name(head of your fasta file)
  --pos POS             site of your interest
  --len LEN             region around the position (default:10)
  --strand STRAND       Strand of your interest (default:+)
  -t CPU, --cpu CPU     num of process (default:8)
  --ref REF             fasta file
  --overplot-number OVERPLOT_NUMBER (default:500)
                        Number of read will be used to plot
  --rna                 Turn on the RNA mode 

read_f5c_resquiggle

current_events_magnifier.py f5c -h
optional arguments:
  -h, --help            show this help message and exit
  -i INPUT, --input INPUT
                        path and suffix of blow5, bam file and paf files
  -c CONTROL, --control CONTROL
                        control path and suffix of blow5, bam file and paf files
  -o OUTPUT, --output OUTPUT
                        output_file
  --chrom CHROM         Gene or chromosome name(head of your fasta file)
  --pos POS             site of your interest
  --len LEN             region around the position (default:10)
  --strand STRAND       Strand of your interest (default:+)
  --ref REF             fasta file
  --overplot-number OVERPLOT_NUMBER (default:500)
                        Number of read will be used to plot
  --rna                 
                        Turn on the RNA mode
  --base_shift BASE_SHIFT
                        base shift if required (default:0)

Quick start

1. Run Basecaller and alignment on your ONT data

# assumed your fast5 file folder name is fast5/ and reference is reference.fasta
guppy_basecaller -i fast5/ -s ./guppy_out --recursive --device auto -c rna_r9.4.1_70bps_hac.cfg  &
cat guppy_out/*/*.fastq > all.fastq
minimap2 -ax map-ont -t 16 --MD reference.fasta all.fastq | samtools view -hbS -F 260 - | samtools sort -@ 16 -o file.bam
samtools index file.bam

Option -c means config file ,which will depend on your data

2. Decide the chrom or transcript name and region of your interest

In this sample, I plot the 23s rRNA whose header in fasta file is NR_103073.1 and I am intereted in A2030 on the plus strand. So for the following command , I used --chrom NR_103073.1 --pos 2030 --strand +.

3. Subsample (Optional)

Re-squiggle is a really time-consuming program, it will be applied on all reads not only the reads around interest region. So I provide a simple py file to help extract the reads you want to visualize. And the new reads will be copied to subsample_single/

multi_to_single_fast5 -i fast5/ -s single/ --recursive -t 16
extract_sub_fast5_from_bam.py -i single/ -o subsample_single/ -b file.bam --chrom NR_103073.1 --pos 2030 

4 Re-squiggle

4.1 Tombo resquiggle (v1.5.0)

Step 1 and 2 should run on your two sample respectively, before the step 5.

  1. Data format conversion

If you did the subsample,skip this step and used subsample_single as the following input rather than single/

# assumed your fast5 file folder name is fast5/
multi_to_single_fast5 -i fast5/ -s single/ --recursive -t 16
  1. Run tombo resquiggle
# if fast5 is not single format need to transfer to single format by ont-fast-api
# single is fast5s-base-directory

tombo preprocess annotate_raw_with_fastqs --fast5-basedir  single/ --fastq-filenames all.fastq --processes 16 
tombo resquiggle single/ reference.fasta --processes 16 --num-most-common-errors 5
# Notes:
# Tombo resquiggle will take various of time, which means subsample your aligned reads of the special region is recommended
# Run the Tombo pipeline above for your two sample respective, the SSD disk is recommended 
# If you ran step2, run the tombo command on subsample_single but single
  1. Run current_events_magnifier to plot
nanoCEM.py tombo -i sample1/single -c sample2/single -o tombo_result\
 --chrom NR_103073.1 --strand + --pos 2030 \
 --len 5 --cpu 4 --ref reference.fasta --rna

4.2 F5c resquiggle (v1.2) (support R10)

Step 1 and 2 should run on your two sample respectively, before the step 3.

  1. Data format conversion. If you did the subsample,skip this step and used subsample_single as the following input rather than fast5/
slow5tools f2s fast5/ -d blow5_dir
slow5tools merge blow5_dir -o file.blow5
slow5tools index file.blow5
  1. Run f5c resquiggle

Use --rna to turn to the rna mode and --pore r10 to re-squiggle reads from R10

f5c resquiggle -c all.fastq file.blow5 -o file.paf --rna --pore r9
  1. Run nanoCEM to plot
# run the pipeline below for your two sample respective and keep the suffix of bam/paf/blow5 is the same
current_events_magnifier.py f5c -i sample1/file -c sample2/file -o f5c_result \
--chrom NR_103073.1 --strand + \
--pos 3929 --len 10 \
--ref reference.fasta --base_shift 2 --rna

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

nanoCEM-0.0.1.5.tar.gz (20.0 kB view details)

Uploaded Source

File details

Details for the file nanoCEM-0.0.1.5.tar.gz.

File metadata

  • Download URL: nanoCEM-0.0.1.5.tar.gz
  • Upload date:
  • Size: 20.0 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/4.0.2 CPython/3.8.0

File hashes

Hashes for nanoCEM-0.0.1.5.tar.gz
Algorithm Hash digest
SHA256 ada4e89efcae460a4974ed0d84d95cd3faf7a7c1c4d469ba84ca4353278c0db3
MD5 8a99dd9b7bee223b0a8a516a5377a78d
BLAKE2b-256 d7902ab822216e7dd0523396b50f98a8a8dc5ad7be42221dddd8d904a2bb252b

See more details on using hashes here.

Supported by

AWS AWS Cloud computing and Security Sponsor Datadog Datadog Monitoring Fastly Fastly CDN Google Google Download Analytics Microsoft Microsoft PSF Sponsor Pingdom Pingdom Monitoring Sentry Sentry Error logging StatusPage StatusPage Status page