RNA 5' and 3' End Correction Tool with Intron Refinement and Ambiguity Resolution
RECTIFY
RNA 5' and 3' End Correction Tool with Intron reFinement and ambiguitY resolution
Overview
Nanopore direct RNA sequencing offers unprecedented read lengths, but accurate transcript structure mapping requires solving four intertwined problems: spurious 3' ends created by poly(A) tail artifacts (indels and false splice junctions), soft-clipped 5' bases that actually align upstream of splice sites, homopolymer-driven soft-clipping at 3' ends, and conflicting junction calls between different aligners. RECTIFY solves all four through multi-aligner rectification, artifact-aware corrections, and optional NET-seq refinement, delivering nucleotide-precision 5' and 3' end coordinates and splice junction sets.
Use RECTIFY when you need:
- Accurate cleavage and polyadenylation (CPA) site mapping from DRS data
- Correction of poly(A) misalignment artifacts in A-tract regions
- Robust splice junction calls from reads spanning multiple exons
- Detection of alternative polyadenylation (APA) with cluster-level resolution
- Differential expression analysis at gene and isoform levels
- Optional NET-seq-informed refinement for A-tract ambiguity
Quick Start
Installation
# Via PyPI
pip install rectify-rna
# With visualization support (metagene plots, genome figures)
pip install "rectify-rna[visualize]"
# Via Conda (includes MEME Suite for motif discovery)
conda install -c conda-forge -c bioconda rectify-rna
Basic Usage
# Correct 3' ends from FASTQ (bundled yeast genome — no external files needed)
rectify correct reads.fastq.gz --organism yeast -o corrected.tsv
# Full pipeline: alignment → correction → analysis
rectify run reads.bam --genome genome.fa --annotation genes.gtf --output-dir results/
# Process NET-seq data (nascent RNA 3' ends)
rectify netseq netseq.bam --genome genome.fa --gff genes.gff -o netseq_output/
How It Works
RECTIFY reconstructs true RNA 3' and 5' ends through four sequential corrections, each addressing a specific alignment artifact.
1. 3' End Walk-Back: Recovering the True CPA Site
When poly(A) tails align to genomic A-tracts, aligners introduce indels and spurious splice junctions (N operations) to maximize alignment score, shifting the apparent 3' end far downstream of the true cleavage site. RECTIFY walks backward from the soft-clip boundary, skipping A's, deletions, T sequencing errors, and any intron-skip (N) operations it encounters, until it finds the first non-A/T agreement between genome and read — the true CPA site.
Why simple poly(A) trimming fails: In A-tract regions the boundary between genomic A's and tail A's is ambiguous, so trimming alone cannot locate the cleavage site. RECTIFY's walk-back handles deletions, T sequencing errors, and false splice junctions within the A-tract, recovering the true CPA position even when the aligner has spread the poly(A) signal across multiple genomic A-runs or introduced spurious N operations to reach downstream A-tracts. For minus-strand genes, the poly(A) tail appears as a poly(T) prefix extending leftward, and RECTIFY applies identical logic in reverse orientation.
False junction cleanup is built-in: Poly(A) tails can cause aligners to introduce skip (N) operations to reach downstream A-tracts, creating spurious splice junctions. The same walk-back that corrects indel artifacts transparently absorbs these N operations — they require no separate detection step.
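The walk-back described above can be sketched in a few lines of Python. This is a minimal illustration for plus-strand reads only, not RECTIFY's actual implementation. It assumes the alignment has already been unpacked into per-position columns; the `Column` layout and the `walk_back_cpa` name are hypothetical, and insertions in the read are left out for brevity.

```python
from typing import List, Optional, Tuple

# One alignment column: (reference position, reference base, read base).
# read_base is None where the read has a deletion or an intron-skip (N).
# This layout is an assumption made for the sketch.
Column = Tuple[int, str, Optional[str]]


def walk_back_cpa(columns: List[Column]) -> Optional[int]:
    """Walk backward from the 3' soft-clip boundary (plus strand).

    Returns the reference position of the first column, scanning from the
    3' end, where read and genome agree on a base that is neither A nor T.
    """
    for ref_pos, ref_base, read_base in reversed(columns):
        if read_base is None:
            continue  # deletion or N skip: absorbed, keep walking
        if ref_base == read_base and ref_base not in ("A", "T"):
            return ref_pos  # first non-A/T agreement
        # Everything else (A's, T errors, mismatches) is skipped.
    return None


# Toy alignment: the aligner pushed the tail into a downstream A-tract.
columns = [
    (100, "G", "G"),
    (101, "C", "C"),   # last non-A/T agreement
    (102, "A", "A"),
    (103, "A", None),  # deletion inside the A-tract
    (104, "A", "T"),   # T sequencing error
    (105, "G", None),  # spurious N skip ...
    (106, "C", None),  # ... across non-A genomic bases
    (107, "A", "A"),
    (108, "A", "A"),   # apparent 3' end reported by the aligner
]
corrected = walk_back_cpa(columns)
print(corrected)
```

In the toy alignment, the deletion, the T error, and the N skip are all crossed in the same backward pass, which is why no separate false-junction detection step is needed.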
2. 5' End Junction Rescue: Recovering Soft-Clipped Bases at Splice Sites
Nanopore reads that begin near a splice junction frequently have their 5'-most bases soft-clipped rather than placed in the upstream exon. RECTIFY identifies these soft-clipped sequences, locates the nearest annotated donor site, and extends the alignment across the intron into the upstream exon to recover the read's true 5' end position.
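As a rough illustration of this rescue, the sketch below checks whether a 5' soft-clip matches the last bases of the nearest annotated upstream exon. It is a simplified plus-strand sketch under stated assumptions, not RECTIFY's code: the function name `rescue_five_prime`, the exact-match requirement, and the 0-based coordinate convention are all hypothetical, and it does not verify that the read starts exactly at an acceptor site.

```python
from typing import Iterable, Optional


def rescue_five_prime(genome: str, clip_seq: str, aln_start: int,
                      donor_ends: Iterable[int]) -> Optional[int]:
    """Return the rescued 5' position (0-based), or None if no rescue applies.

    aln_start  : 0-based reference position of the first aligned base
    donor_ends : 0-based exclusive end coordinates of annotated upstream
                 exons (the intron starts at this coordinate)
    """
    if not clip_seq:
        return None
    # Keep donors that lie upstream of the read start and leave enough
    # exon sequence to hold the clipped bases.
    upstream = [d for d in donor_ends if len(clip_seq) <= d < aln_start]
    if not upstream:
        return None
    donor_end = max(upstream)  # nearest annotated donor
    start = donor_end - len(clip_seq)
    # Exact match required in this sketch; a real tool would likely
    # tolerate sequencing errors.
    if genome[start:donor_end] == clip_seq:
        return start
    return None


exon1 = "ACGTCCGA"
intron = "GTAAGTCTAACAG"
exon2 = "TTGCAGGC"
genome = exon1 + intron + exon2

# The read aligns from the first base of exon2 with "CGA" soft-clipped.
rescued = rescue_five_prime(genome, "CGA", len(exon1) + len(intron), [len(exon1)])
print(rescued)
```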
3. Soft-Clip Rescue: Recovering 3' Bases at Homopolymer Boundaries
Nanopore basecallers systematically under-call homopolymer runs. At CPA sites with upstream T-rich regions, the shortened run in the read cannot span the full reference homopolymer, so the aligner soft-clips the non-T bases beyond it instead of aligning them. RECTIFY identifies these soft-clipped sequences, skips the remaining reference homopolymer bases, and matches the clipped bases to the reference positions immediately downstream.
This correction is especially critical for detecting true 3' ends in regions where weak basecalling and homopolymer under-calling create false soft-clip boundaries.
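The skip-and-match step can be sketched as follows. This is a minimal plus-strand illustration with exact matching, offered as an assumption about how such a rescue could work rather than as RECTIFY's implementation; `rescue_homopolymer_clip` and its arguments are hypothetical names.

```python
from typing import Optional


def rescue_homopolymer_clip(genome: str, aln_end: int,
                            clip_seq: str) -> Optional[int]:
    """Return the rescued alignment end (0-based, exclusive), or None.

    aln_end  : exclusive end of the aligned portion, assumed to fall
               inside a reference homopolymer run
    clip_seq : soft-clipped read bases that follow the aligned portion
    """
    if not clip_seq or not 0 < aln_end < len(genome):
        return None
    homopolymer_base = genome[aln_end - 1]
    pos = aln_end
    # Skip reference homopolymer bases the basecaller under-called.
    while pos < len(genome) and genome[pos] == homopolymer_base:
        pos += 1
    if pos == aln_end:
        return None  # alignment did not stop inside a homopolymer
    # Match the clipped bases to the reference just past the run.
    if genome[pos:pos + len(clip_seq)] == clip_seq:
        return pos + len(clip_seq)
    return None


# Reference has six T's; the read carried only four, so the aligner
# stopped after the fourth T and soft-clipped "GCA".
genome = "GCATTTTTTGCAC"
new_end = rescue_homopolymer_clip(genome, 7, "GCA")
print(new_end)
```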
4. Multi-Aligner Rectification: Selecting the Optimal Junction Set
Different aligners make different tradeoffs at splice junctions. RECTIFY runs three aligners in parallel (minimap2, mapPacBio, gapmm2), applies soft-clip rescue to all outputs, scores each alignment by canonical splice sites and annotation matches, and selects the optimal rectified alignment per read.
Scoring criteria: Each alignment is scored by (1) number of GT-AG canonical junctions, (2) matches to annotated junctions in the provided GFF/GTF, and (3) remaining soft-clip length. The highest-scoring alignment is written to the output BAM.
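The three criteria can be combined in many ways, and the source does not state the weights. The sketch below assumes a simple lexicographic ordering (canonical junctions first, then annotated matches, then fewer soft-clipped bases) purely for illustration. The `Candidate` class and the `score` and `select_best` functions are hypothetical, and only plus-strand GT-AG introns are checked.

```python
from dataclasses import dataclass
from typing import List, Set, Tuple

Junction = Tuple[int, int]  # 0-based intron start, exclusive intron end


@dataclass
class Candidate:
    aligner: str
    junctions: List[Junction]
    soft_clip: int  # soft-clipped bases remaining after rescue


def score(genome: str, cand: Candidate,
          annotated: Set[Junction]) -> Tuple[int, int, int]:
    """Score tuple; larger is better under tuple comparison (assumed order)."""
    # Plus-strand check only: intron starts with GT and ends with AG.
    canonical = sum(
        genome[s:s + 2] == "GT" and genome[e - 2:e] == "AG"
        for s, e in cand.junctions
    )
    known = sum(j in annotated for j in cand.junctions)
    return (canonical, known, -cand.soft_clip)


def select_best(genome: str, candidates: List[Candidate],
                annotated: Set[Junction]) -> Candidate:
    return max(candidates, key=lambda c: score(genome, c, annotated))


genome = "ACGT" + "GTAAAAAG" + "CCCC"  # intron occupies [4, 12)
annotated = {(4, 12)}
candidates = [
    Candidate("minimap2", [(4, 12)], soft_clip=0),
    Candidate("mapPacBio", [(5, 12)], soft_clip=2),  # non-canonical junction
    Candidate("gapmm2", [(4, 12)], soft_clip=6),     # same junction, more clipping
]
best = select_best(genome, candidates, annotated)
print(best.aligner)
```

A tuple score keeps the comparison transparent; a weighted sum would work equally well if the relative importance of the criteria were known.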
# Multi-aligner rectification (default, DRS-optimized)
rectify align reads.fastq.gz --genome genome.fa --annotation genes.gff -o aligned.bam
# Single-aligner mode (faster, less accurate)
rectify align reads.fastq.gz --genome genome.fa --aligner minimap2 -o aligned.bam
Key Features
| Feature | Benefit |
|---|---|
| Multi-Aligner Rectification | Runs minimap2, mapPacBio, gapmm2, scores each alignment, and selects the optimal rectified result per read |
| 5' End Junction Recovery | Rescues soft-clipped bases by extending alignments through known splice junctions |
| 3' End Walk-Back | Walks backward from soft-clip boundary to recover true CPA site, transparently absorbing indels, T sequencing errors, and spurious splice junctions (N ops) in a single pass |
| Junction Ambiguity Resolution | Resolves reads matching multiple junctions using proportional assignment |
| Poly(A) Measurement | Reports tail length including both aligned and soft-clipped bases |
| NET-seq Refinement | Uses nascent RNA 3' ends to deconvolve A-tract ambiguity (optional) |
| Adaptive Clustering | Groups nearby CPA sites using valley-based peak detection |
| Dual-Resolution Differential Expression | DESeq2 at both gene level and cluster (isoform) level |
| APA Shift Analysis | Detects significant proximal/distal CPA site usage changes |
| Visualization | Metagene plots and genome browser figures (pip install rectify-rna[visualize]) |
| Bundled Yeast Data | S288C genome, SGD annotations, GO terms, WT NET-seq, 64K pre-computed A-tract CPA sites |
Output and Results
Each read receives a corrected position with confidence scoring:
read_id │ chrom │ strand │ original │ corrected │ shift │ confidence │ polya_len │ qc_flags
read001 │ chrI │ + │ 147592 │ 147585 │ -7 │ HIGH │ 42 │ PASS
read002 │ chrI │ + │ 147594 │ 147591 │ -3 │ MEDIUM │ 38 │ PASS
read003 │ chrII │ + │ 283109 │ 283104 │ -5 │ LOW │ 31 │ AG_RICH
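For downstream scripting, the per-read output can be loaded with the Python standard library. The sketch below assumes the file is tab-separated with the column names shown above; the exact on-disk layout is not confirmed by this document, and `load_corrections` is a hypothetical helper.

```python
import csv
import io

# Stand-in for corrected.tsv, assuming tab-separated columns as shown above.
example = (
    "read_id\tchrom\tstrand\toriginal\tcorrected\tshift\tconfidence\tpolya_len\tqc_flags\n"
    "read001\tchrI\t+\t147592\t147585\t-7\tHIGH\t42\tPASS\n"
    "read002\tchrI\t+\t147594\t147591\t-3\tMEDIUM\t38\tPASS\n"
    "read003\tchrII\t+\t283109\t283104\t-5\tLOW\t31\tAG_RICH\n"
)


def load_corrections(handle):
    """Parse rows into dicts, converting the numeric columns to int."""
    rows = []
    for row in csv.DictReader(handle, delimiter="\t"):
        for key in ("original", "corrected", "shift", "polya_len"):
            row[key] = int(row[key])
        rows.append(row)
    return rows


rows = load_corrections(io.StringIO(example))

# Keep reads that passed QC with at least MEDIUM confidence.
passing = [r["read_id"] for r in rows
           if r["qc_flags"] == "PASS" and r["confidence"] != "LOW"]
print(passing)
```

In the example rows, shift equals corrected minus original, so a negative shift means the 3' end moved upstream on the plus strand.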
The rectify analyze command produces:
- clusters.tsv — CPA site clusters with read counts per condition
- deseq2_gene_results.tsv — Differential expression at gene level
- deseq2_cluster_results.tsv — Differential expression at cluster (isoform) level
- shift_results.tsv — Genes with statistically significant APA shifts
- go_enrichment.tsv — GO term enrichment on shifted genes
- motif_results/ — Enriched sequence motifs near CPA sites
NET-seq Refinement (Optional)
For organisms with nascent RNA (NET-seq) data, RECTIFY resolves remaining ambiguity within A-tracts. NET-seq samples RNA still attached to polymerase, providing a reference for true CPA positions. Because nascent RNA is oligo-adenylated post-capture, the observed 3' end signal is spread away from the true positions. RECTIFY therefore applies non-negative least squares (NNLS) deconvolution, using a point-spread function derived from 5000+ zero-A calibration sites, to recover true CPA positions.
For S. cerevisiae, bundled WT NET-seq data is auto-detected. For other organisms or mutant conditions, provide NET-seq bigWigs with the --netseq-dir flag.
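To make the deconvolution idea concrete, the sketch below recovers per-position signal from a blurred profile by non-negative least squares. Everything specific here is an assumption for illustration: the three-value point-spread function, the downstream direction of the spread, and the function names are hypothetical, and a small projected-gradient solver stands in for a production NNLS routine such as `scipy.optimize.nnls` so that the example needs only the standard library.

```python
from typing import List


def convolution_matrix(psf: List[float], n: int) -> List[List[float]]:
    """A[i][j] = psf[i - j]: signal from true site j is observed at i >= j."""
    return [[psf[i - j] if 0 <= i - j < len(psf) else 0.0 for j in range(n)]
            for i in range(n)]


def nnls_projected_gradient(A: List[List[float]], b: List[float],
                            iterations: int = 2000) -> List[float]:
    """Minimise 0.5 * ||Ax - b||^2 subject to x >= 0."""
    n_rows, n_cols = len(A), len(A[0])
    # For a non-negative matrix, ||A||_2^2 <= (max row sum) * (max column sum),
    # so the reciprocal of that product is a safe step size.
    max_row = max(sum(row) for row in A)
    max_col = max(sum(A[i][j] for i in range(n_rows)) for j in range(n_cols))
    step = 1.0 / (max_row * max_col)
    x = [0.0] * n_cols
    for _ in range(iterations):
        residual = [sum(A[i][j] * x[j] for j in range(n_cols)) - b[i]
                    for i in range(n_rows)]
        grad = [sum(A[i][j] * residual[i] for i in range(n_rows))
                for j in range(n_cols)]
        # Gradient step, then projection onto the non-negative orthant.
        x = [max(0.0, x[j] - step * grad[j]) for j in range(n_cols)]
    return x


psf = [0.6, 0.3, 0.1]                      # hypothetical point-spread function
A = convolution_matrix(psf, 6)
observed = [0.0, 6.0, 3.0, 1.0, 3.0, 1.5]  # blurred signal from sites 1 and 4
estimate = nnls_projected_gradient(A, observed)
print([round(v, 3) for v in estimate])
```

The observed profile was built by blurring 10 reads at position 1 and 5 reads at position 4, so a correct solver should concentrate the estimate back onto those two positions.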
Commands Reference
| Command | Purpose |
|---|---|
| rectify correct | Correct 3' end positions (indel correction + A-tract resolution) |
| rectify analyze | Downstream analysis (clustering, DESeq2, GO enrichment, motifs) |
| rectify export | Export corrected positions to bigWig/bedGraph tracks |
| rectify extract | Extract per-read 5'/3' ends and junctions to TSV |
| rectify aggregate | Group reads into 3'/5'/junction dataset files |
| rectify align | Align FASTQ with multi-aligner rectification |
| rectify netseq | Process NET-seq BAM files (3' extraction + deconvolution) |
| rectify run | Full pipeline: align (if FASTQ) → correct → analyze |
| rectify run-all | Full pipeline with provenance tracking and step-skip |
Usage examples
# Correct 3' ends (bundled yeast genome, no external files needed)
rectify correct reads.fastq.gz --organism yeast -o corrected.tsv
# Correct with custom genome and optional NET-seq deconvolution
rectify correct reads.bam --genome genome.fa --netseq-dir my_netseq/ -o corrected.tsv
# Extract per-read features (5'/3' ends, junctions) to TSV
rectify extract reads.bam -o reads.tsv --genome genome.fa --annotation genes.gff
# Aggregate into separate 3'/5'/junction datasets by condition
rectify aggregate reads.bam -o aggregated/ --annotation genes.gff --mode all
# Differential expression analysis (gene and cluster level)
rectify analyze corrected.tsv --annotation genes.gtf --output-dir results/
# Export corrected positions as genome browser tracks
rectify export corrected.tsv -o tracks/ --genome genome.fa
# Complete pipeline from reads to differential expression
rectify run reads.bam --genome genome.fa --annotation genes.gtf --output-dir results/
# Process NET-seq data (nascent RNA 3' ends for A-tract refinement)
rectify netseq netseq.bam --genome genome.fa --gff genes.gff -o netseq_output/
Supported Technologies
- Direct RNA sequencing: Nanopore direct RNA-seq (DRS)
- Short-read quantification: QuantSeq (oligo-dT), PacBio Iso-Seq, NET-seq
- General: Any poly(A)-tailed RNA-seq platform
Citation
Please cite RECTIFY if you use it in your research:
Roy KR, Chanfreau GF. Robust mapping of polyadenylated and non-polyadenylated RNA 3' ends at nucleotide resolution by 3'-end sequencing. Methods. 2020;176:4-13. PMID: 31128237
RECTIFY 2.0: Manuscript in preparation.
License
MIT — see LICENSE for details.
Contact
Kevin R. Roy
- Email: kevinrjroy@gmail.com
- GitHub: k-roy/RECTIFY