RNA 5' and 3' End Correction Tool with Intron Refinement and Ambiguity Resolution

RECTIFY

RNA 5' and 3' End Correction Tool with Intron reFinement and ambiguitY resolution

PyPI · License: MIT · Python 3.8+


Overview

Nanopore direct RNA sequencing offers unprecedented read lengths, but accurate transcript structure mapping requires solving four intertwined problems: spurious 3' ends created by poly(A) tail artifacts (indels and false splice junctions), soft-clipped 5' bases that actually align upstream of splice sites, homopolymer-driven soft-clipping at 3' ends, and conflicting junction calls between different aligners. RECTIFY solves all four through multi-aligner rectification, artifact-aware corrections, and optional NET-seq refinement, delivering nucleotide-precision 5' and 3' end coordinates and splice junction sets.

Use RECTIFY when you need:

  • Accurate cleavage and polyadenylation (CPA) site mapping from DRS data
  • Correction of poly(A) misalignment artifacts in A-tract regions
  • Robust splice junction calls from reads spanning multiple exons
  • Detection of alternative polyadenylation (APA) with cluster-level resolution
  • Differential expression analysis at gene and isoform levels
  • Optional NET-seq-informed refinement for A-tract ambiguity

Quick Start

Installation

# Via PyPI
pip install rectify-rna

# With visualization support (metagene plots, genome figures)
# Quote the extras so shells such as zsh do not expand the brackets
pip install "rectify-rna[visualize]"

# Via Conda (includes MEME Suite for motif discovery)
conda install -c conda-forge -c bioconda rectify-rna

Basic Usage

# Correct 3' ends from FASTQ (bundled yeast genome — no external files needed)
rectify correct reads.fastq.gz --organism yeast -o corrected.tsv

# Full pipeline: alignment → correction → analysis
rectify run reads.bam --genome genome.fa --annotation genes.gtf --output-dir results/

# Process NET-seq data (nascent RNA 3' ends)
rectify netseq netseq.bam --genome genome.fa --gff genes.gff -o netseq_output/

How It Works

RECTIFY reconstructs true RNA 3' and 5' ends through four sequential corrections, each addressing a specific alignment artifact.

1. 3' End Walk-Back: Recovering the True CPA Site

When poly(A) tails align to genomic A-tracts, aligners introduce indels and spurious splice junctions (N operations) to maximize alignment score, shifting the apparent 3' end far downstream of the true cleavage site. RECTIFY walks backward from the soft-clip boundary, skipping A's, deletions, T sequencing errors, and any intron-skip (N) operations it encounters, until it finds the first non-A/T agreement between genome and read — the true CPA site.

Figure: 3' End Walk-Back Correction

Why simple poly(A) trimming fails: The boundary between genomic A's and tail A's is ambiguous in A-tract regions. RECTIFY's walk-back algorithm handles deletions, T sequencing errors, and false splice junctions within the A-tract, recovering the true CPA position even when the aligner has spread the poly(A) signal across multiple genomic A-runs or introduced spurious N operations to reach downstream A-tracts. For minus-strand genes, the poly(A) tail appears as a poly(T) prefix extending leftward — RECTIFY applies identical logic in reverse orientation.

False junction cleanup is built-in: Poly(A) tails can cause aligners to introduce skip (N) operations to reach downstream A-tracts, creating spurious splice junctions. The same walk-back that corrects indel artifacts transparently absorbs these N operations — they require no separate detection step.

Figure: False Junction Walk-Back
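The walk-back described above can be sketched in a few lines of Python. This is a minimal illustration for a plus-strand read, not RECTIFY's actual code: the `Column` type, the `walk_back_3prime` function, and the toy alignment are all hypothetical, and the stop rule is a literal reading of "first non-A/T agreement between genome and read".

```python
from typing import List, NamedTuple, Optional


class Column(NamedTuple):
    """One alignment column in reference order (hypothetical helper type)."""
    op: str                    # CIGAR-like operation: "M", "I", "D" or "N"
    ref_pos: int               # 0-based reference coordinate where the op starts
    ref_base: Optional[str]    # None for "I" and "N"
    read_base: Optional[str]   # None for "D" and "N"


def walk_back_3prime(columns: List[Column]) -> Optional[int]:
    """Scan from the 3' aligned end toward the 5' end and return the
    reference position of the first match that is neither A nor T."""
    for col in reversed(columns):
        if col.op != "M":
            continue  # absorb deletions, insertions and intron skips (N)
        if col.ref_base == col.read_base and col.ref_base not in ("A", "T"):
            return col.ref_pos
        # Otherwise the column is an A, a T error or a mismatch: keep walking.
    return None


# Toy alignment: the tail was spread over an A-tract, a deletion and a
# spurious N skip that reaches a second A-tract further downstream.
example = [
    Column("M", 100, "C", "C"),
    Column("M", 101, "G", "G"),
    Column("M", 102, "C", "C"),
    Column("M", 103, "A", "A"),
    Column("M", 104, "A", "A"),
    Column("M", 105, "A", "T"),    # T sequencing error inside the tail
    Column("D", 106, "A", None),   # deletion inside the A-tract
    Column("N", 107, None, None),  # spurious skip to the next A-tract
    Column("M", 500, "A", "A"),
    Column("M", 501, "A", "A"),
]

original_end = example[-1].ref_pos
corrected_end = walk_back_3prime(example)
print(original_end, corrected_end)
```

In this toy case the apparent 3' end sits at position 501, while the walk-back stops at position 102, the last base before the A-tract. A minus-strand version would scan in the opposite direction over a poly(T) prefix.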

2. 5' End Junction Rescue: Recovering Soft-Clipped Bases at Splice Sites

Nanopore reads that begin near a splice junction frequently have their 5'-most bases soft-clipped rather than placed in the upstream exon. RECTIFY identifies these soft-clipped sequences, locates the nearest annotated donor site, and extends the alignment through the intron to recover the true transcription start position.

Figure: 5' End Junction Rescue
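As a rough sketch of the idea, the snippet below checks whether a soft-clipped 5' sequence matches the end of the exon upstream of an annotated intron. Everything here is an assumption for illustration: the function name `rescue_5prime`, the half-open `(intron_start, intron_end)` coordinates, the mismatch tolerance, and the requirement that the alignment begins exactly at the acceptor. Only the plus strand is handled.

```python
from typing import List, Optional, Tuple

Intron = Tuple[int, int]  # (intron_start, intron_end), 0-based, half-open


def rescue_5prime(
    genome: str,
    aln_start: int,
    clip: str,
    introns: List[Intron],
    max_mismatches: int = 1,
) -> Optional[Tuple[int, Intron]]:
    """Try to place soft-clipped 5' bases at the end of the upstream exon.

    Returns (rescued_5prime_position, intron_used), or None if no annotated
    intron explains the clipped bases.
    """
    for intron_start, intron_end in introns:
        if intron_end != aln_start:
            continue  # the alignment does not begin at this acceptor
        exon_start = intron_start - len(clip)
        if exon_start < 0:
            continue
        upstream = genome[exon_start:intron_start]
        mismatches = sum(a != b for a, b in zip(upstream, clip))
        if mismatches <= max_mismatches:
            return exon_start, (intron_start, intron_end)
    return None


# Toy locus: exon 1 (0-9), intron (10-23), exon 2 (24 onward).
genome = "ACGTACGGAT" + "GTAAGTTTTTCTAG" + "CCGATTGACC"
introns = [(10, 24)]

# The read aligns from position 24 and carries four clipped bases.
rescued = rescue_5prime(genome, aln_start=24, clip="GGAT", introns=introns)
print(rescued)
```

Here the clipped bases match the last four bases of exon 1, so the 5' end moves from position 24 to position 6 and the read gains the junction (10, 24).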

3. Soft-Clip Rescue: Recovering 3' Bases at Homopolymer Boundaries

Nanopore basecallers systematically under-call homopolymer runs. At CPA sites with upstream T-rich regions, the read therefore contains fewer T's than the reference, and the aligner soft-clips the non-T bases that follow instead of placing them in the correct exon. RECTIFY identifies these soft-clipped sequences, skips the remaining reference homopolymer bases, and matches the clipped bases to the reference positions immediately downstream.

Figure: Soft-Clip Rescue at Homopolymer Boundaries

This correction is especially critical for detecting true 3' ends in regions where weak basecalling and homopolymer under-calling create false soft-clip boundaries.
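The skip-then-match step can be illustrated with a small sketch. The function `rescue_softclip`, its exact-match requirement, and the `min_match` threshold are assumptions made for this example, not RECTIFY's implementation. It handles the simple case where the aligned portion ends inside an under-called homopolymer on the plus strand.

```python
from typing import Optional


def rescue_softclip(genome: str, aln_end: int, clip: str, min_match: int = 3) -> Optional[int]:
    """Skip the rest of an under-called homopolymer and re-anchor the clip.

    aln_end is the 0-based, exclusive end of the aligned portion.
    Returns the rescued exclusive end coordinate, or None if no rescue applies.
    """
    if aln_end <= 0 or aln_end >= len(genome) or len(clip) < min_match:
        return None
    homopolymer_base = genome[aln_end - 1]  # last aligned reference base
    pos = aln_end
    while pos < len(genome) and genome[pos] == homopolymer_base:
        pos += 1  # skip reference homopolymer bases the read did not call
    if pos == aln_end:
        return None  # the alignment did not end inside a homopolymer
    if genome[pos:pos + len(clip)] == clip:
        return pos + len(clip)
    return None


# The reference has seven T's; the read was basecalled with only four,
# so the aligner placed "GCATTTT" and soft-clipped "CGAG".
genome = "GCATTTTTTTCGAGC"
rescued_end = rescue_softclip(genome, aln_end=7, clip="CGAG")
print(rescued_end)
```

In this example the three uncalled T's are skipped and the clipped bases are placed at positions 10 to 13, moving the alignment end from 7 to 14.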

4. Multi-Aligner Rectification: Selecting the Optimal Junction Set

Different aligners make different tradeoffs at splice junctions. RECTIFY runs three aligners in parallel (minimap2, mapPacBio, gapmm2), applies soft-clip rescue to all outputs, scores each alignment by canonical splice sites and annotation matches, and selects the optimal rectified alignment per read.

Figure: Multi-Aligner Rectification Pipeline

Scoring criteria: Each alignment is scored by (1) number of GT-AG canonical junctions, (2) matches to annotated junctions in the provided GFF/GTF, and (3) remaining soft-clip length. The highest-scoring alignment is written to the output BAM.
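A per-read selection based on these three criteria might look like the sketch below. The weights, the `Candidate` class, and the linear scoring formula are assumptions chosen for illustration; the text above lists the criteria but does not specify how they are combined. Canonical-site checking is shown for the plus strand only.

```python
from dataclasses import dataclass
from typing import List, Set, Tuple

Junction = Tuple[int, int]  # (intron_start, intron_end), 0-based, half-open


@dataclass
class Candidate:
    aligner: str
    junctions: List[Junction]
    soft_clip_len: int


def is_canonical(genome: str, junction: Junction) -> bool:
    """True if the intron starts with GT and ends with AG (plus strand)."""
    start, end = junction
    return genome[start:start + 2] == "GT" and genome[end - 2:end] == "AG"


def score(
    cand: Candidate,
    genome: str,
    annotated: Set[Junction],
    w_canonical: float = 1.0,   # hypothetical weight
    w_annotated: float = 2.0,   # hypothetical weight
    w_clip: float = 0.1,        # hypothetical penalty per clipped base
) -> float:
    n_canonical = sum(is_canonical(genome, j) for j in cand.junctions)
    n_annotated = sum(j in annotated for j in cand.junctions)
    return w_canonical * n_canonical + w_annotated * n_annotated - w_clip * cand.soft_clip_len


genome = "ACGTACGGAT" + "GTAAGTTTTTCTAG" + "CCGATTGACC"
annotated = {(10, 24)}
candidates = [
    Candidate("minimap2", [(10, 24)], soft_clip_len=0),   # canonical, annotated
    Candidate("mapPacBio", [(11, 24)], soft_clip_len=0),  # shifted donor
    Candidate("gapmm2", [], soft_clip_len=4),             # junction missed, bases clipped
]

best = max(candidates, key=lambda c: score(c, genome, annotated))
print(best.aligner)
```

With these toy inputs the first candidate wins because its junction is both canonical and annotated and it leaves no bases clipped.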

# Multi-aligner rectification (default, DRS-optimized)
rectify align reads.fastq.gz --genome genome.fa --annotation genes.gff -o aligned.bam

# Single-aligner mode (faster, less accurate)
rectify align reads.fastq.gz --genome genome.fa --aligner minimap2 -o aligned.bam

Key Features

Feature                                 │ Benefit
Multi-Aligner Rectification             │ Runs minimap2, mapPacBio, gapmm2, scores each alignment, and selects the optimal rectified result per read
5' End Junction Recovery                │ Rescues soft-clipped bases by extending alignments through known splice junctions
3' End Walk-Back                        │ Walks backward from soft-clip boundary to recover true CPA site, transparently absorbing indels, T sequencing errors, and spurious splice junctions (N ops) in a single pass
Junction Ambiguity Resolution           │ Resolves reads matching multiple junctions using proportional assignment
Poly(A) Measurement                     │ Reports tail length including both aligned and soft-clipped bases
NET-seq Refinement                      │ Uses nascent RNA 3' ends to deconvolve A-tract ambiguity (optional)
Adaptive Clustering                     │ Groups nearby CPA sites using valley-based peak detection
Dual-Resolution Differential Expression │ DESeq2 at both gene level and cluster (isoform) level
APA Shift Analysis                      │ Detects significant proximal/distal CPA site usage changes
Visualization                           │ Metagene plots and genome browser figures (pip install rectify-rna[visualize])
Bundled Yeast Data                      │ S288C genome, SGD annotations, GO terms, WT NET-seq, 64K pre-computed A-tract CPA sites

Output and Results

Each read receives a corrected position with confidence scoring:

read_id   │ chrom │ strand │ original │ corrected │ shift │ confidence │ polya_len │ qc_flags
read001   │ chrI  │   +    │  147592  │   147585  │  -7   │    HIGH    │    42     │   PASS
read002   │ chrI  │   +    │  147594  │   147591  │  -3   │   MEDIUM   │    38     │   PASS
read003   │ chrII │   +    │  283109  │   283104  │  -5   │    LOW     │    31     │ AG_RICH
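For downstream scripting, the per-read output can be loaded with Python's standard `csv` module. The sketch below assumes the file is tab-delimited with a header row using the column names shown above. The sample rows are copied from the table so the example runs without an input file.

```python
import csv
import io

# Stand-in for corrected.tsv (assumed tab-delimited, header as in the table).
header = ["read_id", "chrom", "strand", "original", "corrected",
          "shift", "confidence", "polya_len", "qc_flags"]
records = [
    ["read001", "chrI", "+", "147592", "147585", "-7", "HIGH", "42", "PASS"],
    ["read002", "chrI", "+", "147594", "147591", "-3", "MEDIUM", "38", "PASS"],
    ["read003", "chrII", "+", "283109", "283104", "-5", "LOW", "31", "AG_RICH"],
]
sample = "\n".join("\t".join(fields) for fields in [header] + records) + "\n"

rows = list(csv.DictReader(io.StringIO(sample), delimiter="\t"))

# In the example rows, shift equals corrected minus original.
shifts_consistent = all(
    int(r["corrected"]) - int(r["original"]) == int(r["shift"]) for r in rows
)

# Keep only high-confidence reads that passed QC.
high_conf = [
    r["read_id"] for r in rows
    if r["confidence"] == "HIGH" and r["qc_flags"] == "PASS"
]
print(shifts_consistent, high_conf)
```

To read a real file, replace `io.StringIO(sample)` with an open file handle, for example `open("corrected.tsv", newline="")`.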

The rectify analyze command produces:

  • clusters.tsv — CPA site clusters with read counts per condition
  • deseq2_gene_results.tsv — Differential expression at gene level
  • deseq2_cluster_results.tsv — Differential expression at cluster (isoform) level
  • shift_results.tsv — Genes with statistically significant APA shifts
  • go_enrichment.tsv — GO term enrichment on shifted genes
  • motif_results/ — Enriched sequence motifs near CPA sites

NET-seq Refinement (Optional)

For organisms with nascent RNA (NET-seq) data, RECTIFY resolves remaining ambiguity within A-tracts. NET-seq samples RNA still attached to polymerase, providing a reference for true CPA positions. Because nascent RNA is oligo-adenylated after capture, the apparent 3' end can spread across adjacent genomic A's. RECTIFY corrects for this with non-negative least squares (NNLS) deconvolution, using a point-spread function derived from 5000+ zero-A calibration sites to recover true CPA positions.

Figure: Oligo(A) Spreading Artifact

Figure: Oligo(A) Deconvolution
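The deconvolution idea can be illustrated with a small, dependency-free sketch. It solves a non-negative least squares problem by projected gradient descent, which is a simple stand-in and not necessarily the solver RECTIFY uses. The point-spread function values, the direction of spreading, and all function names are made up for this example; a real point-spread function would come from the calibration sites described above.

```python
from typing import List


def spread(signal: List[float], psf: List[float]) -> List[float]:
    """Forward model: smear each position over the following len(psf) bases."""
    out = [0.0] * len(signal)
    for i, value in enumerate(signal):
        for k, weight in enumerate(psf):
            if i + k < len(signal):
                out[i + k] += weight * value
    return out


def spread_adjoint(residual: List[float], psf: List[float]) -> List[float]:
    """Transpose of the forward model, needed for the gradient."""
    out = [0.0] * len(residual)
    for i in range(len(residual)):
        for k, weight in enumerate(psf):
            if i + k < len(residual):
                out[i] += weight * residual[i + k]
    return out


def nnls_deconvolve(observed: List[float], psf: List[float],
                    n_iter: int = 500, step: float = 1.0) -> List[float]:
    """Minimize ||spread(x) - observed||^2 subject to x >= 0.

    A step of 1.0 is safe here because the PSF weights sum to 1.
    """
    x = [0.0] * len(observed)
    for _ in range(n_iter):
        residual = [p - o for p, o in zip(spread(x, psf), observed)]
        grad = spread_adjoint(residual, psf)
        x = [max(0.0, xi - step * gi) for xi, gi in zip(x, grad)]  # project onto x >= 0
    return x


psf = [0.6, 0.3, 0.1]                     # hypothetical point-spread function
true_sites = [0.0, 0.0, 10.0, 0.0, 0.0, 5.0, 0.0, 0.0]
observed = spread(true_sites, psf)        # simulated, noise-free signal
recovered = [round(v, 3) for v in nnls_deconvolve(observed, psf)]
print(recovered)
```

On this noise-free toy signal the two original sites are recovered. Real data are noisy, so the recovered profile would only approximate the true positions.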

For S. cerevisiae, bundled WT NET-seq data is auto-detected. For other organisms or mutant conditions, provide NET-seq bigWigs with the --netseq-dir flag.


Commands Reference

Command           │ Purpose
rectify correct   │ Correct 3' end positions (indel correction + A-tract resolution)
rectify analyze   │ Downstream analysis (clustering, DESeq2, GO enrichment, motifs)
rectify export    │ Export corrected positions to bigWig/bedGraph tracks
rectify extract   │ Extract per-read 5'/3' ends and junctions to TSV
rectify aggregate │ Group reads into 3'/5'/junction dataset files
rectify align     │ Align FASTQ with multi-aligner rectification
rectify netseq    │ Process NET-seq BAM files (3' extraction + deconvolution)
rectify run       │ Full pipeline: align (if FASTQ) → correct → analyze
rectify run-all   │ Full pipeline with provenance tracking and step-skip

Usage examples

# Correct 3' ends (bundled yeast genome, no external files needed)
rectify correct reads.fastq.gz --organism yeast -o corrected.tsv

# Correct with custom genome and optional NET-seq deconvolution
rectify correct reads.bam --genome genome.fa --netseq-dir my_netseq/ -o corrected.tsv

# Extract per-read features (5'/3' ends, junctions) to TSV
rectify extract reads.bam -o reads.tsv --genome genome.fa --annotation genes.gff

# Aggregate into separate 3'/5'/junction datasets by condition
rectify aggregate reads.bam -o aggregated/ --annotation genes.gff --mode all

# Differential expression analysis (gene and cluster level)
rectify analyze corrected.tsv --annotation genes.gtf --output-dir results/

# Export corrected positions as genome browser tracks
rectify export corrected.tsv -o tracks/ --genome genome.fa

# Complete pipeline from reads to differential expression
rectify run reads.bam --genome genome.fa --annotation genes.gtf --output-dir results/

# Process NET-seq data (nascent RNA 3' ends for A-tract refinement)
rectify netseq netseq.bam --genome genome.fa --gff genes.gff -o netseq_output/

Supported Technologies

  • Direct RNA sequencing: Nanopore direct RNA-seq (DRS)
  • Short-read quantification: QuantSeq (oligo-dT), PacBio Iso-Seq, NET-seq
  • General: Any poly(A)-tailed RNA-seq platform


Citation

Please cite RECTIFY if you use it in your research:

Roy KR, Chanfreau GF. Robust mapping of polyadenylated and non-polyadenylated RNA 3' ends at nucleotide resolution by 3'-end sequencing. Methods. 2020;176:4-13. PMID: 31128237

RECTIFY 2.0: Manuscript in preparation.


License

MIT — see LICENSE for details.

Contact

Kevin R. Roy
Email: kevinrjroy@gmail.com
GitHub: k-roy/RECTIFY
