Unified RNA 3' End Correction Framework for poly(A)-tailed RNA sequencing
RECTIFY: Unified RNA 3' End Correction Framework
RECTIFY (RNA 3' End Correction Tool Integrating False-priming and poly(A) ambiguity) is a unified framework for correcting 3' end mapping artifacts in poly(A)-tailed RNA sequencing data.
Overview
RECTIFY addresses two fundamental problems affecting RNA 3' end mapping:
- A-tract Ambiguity (Universal): Genomic A-tracts near true 3' ends create positional uncertainty affecting ALL poly(A)-tailed RNA-seq technologies
- Technology-Specific Artifacts:
  - AG mispriming: Internal priming on A/G-rich regions (oligo-dT methods)
  - Poly(A) tail alignment: Tail bases align to genomic A-tracts, creating systematic shifts (when poly(A) is sequenced)
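To make the A-tract ambiguity concrete, here is a minimal sketch (not RECTIFY's actual code) that computes the window of positions a + strand 3' end could occupy. The function name `atract_ambiguity_window` and the 0-based coordinate convention are assumptions made for illustration.

```python
def atract_ambiguity_window(genome, mapped_end):
    """Return (min_pos, max_pos) of plausible 3' ends on the + strand.

    `mapped_end` is the 0-based index of the last aligned base. Any genomic A
    touching that position could equally be a poly(A) tail base, so the window
    runs from the last non-A base upstream to the last A of the run downstream.
    """
    lo = mapped_end
    while lo >= 0 and genome[lo] == "A":
        lo -= 1  # step upstream past A's that may belong to the tail
    hi = mapped_end
    while hi + 1 < len(genome) and genome[hi + 1] == "A":
        hi += 1  # step downstream through the rest of the A-tract
    return max(lo, 0), hi


genome = "CTAGTCAAAAAGTC"
print(atract_ambiguity_window(genome, 8))  # end mapped inside the A-tract
print(atract_ambiguity_window(genome, 4))  # end mapped away from any A-tract
```

A 3' end mapped inside or directly upstream of the A-tract gets the same window, while an end with no adjacent A's gets a window of width zero.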
Key Features
- Modular correction strategies that apply based on sequencing technology
- Universal A-tract ambiguity detection for all poly(A)-tailed RNA-seq
- AG mispriming screening (from original RECTIFY, Roy & Chanfreau 2019)
- Poly(A) tail trimming and indel artifact correction (when the poly(A) tail is sequenced: Nanopore direct RNA-seq, Helicos, QuantSeq)
- NET-seq refinement (optional, technology-independent)
- Unified output format with confidence scores and QC flags
How It Works
RECTIFY corrects 3' end mapping artifacts by walking a read through multiple correction steps. Here's a single Nanopore read traversing the full pipeline:
===============================================================================
STEP 1: RAW ALIGNMENT
===============================================================================
The aligner maps a Nanopore direct RNA read to the genome. The poly(A) tail
is soft-clipped where the genomic A-tract ends, but the part of the tail that
aligned to genomic A's carries indels and mismatches introduced by the aligner
and by sequencing errors.
genome: 5'..CTAGTGACAGTCAAAAAAAA-AAACAAAAGTAAAAAAAAAAAA|CTAGCGATC..3'
                                                       |
                                            CPA site (true 3' end)
read:   5'..CTAGTGACAGTCAAAAAAAATAAA-AAAAA--AAAAAAAAAAAAAAAAAAAAA..
                                ^         ^^           |___________________|
                             T error     dels            soft-clipped tail
Key observations:
- The soft-clip boundary is placed at the mapped 3' end
- Deletions (-) = the aligner stretches the poly(A) tail across genomic
sequence, accepting indels to maximize the aligned length
- 'T' in the poly(A) tail = Nanopore sequencing error in the homopolymer
===============================================================================
STEP 2: INDEL CORRECTION - Walk Backwards to Find True 3' End
===============================================================================
Starting from soft-clip boundary, walk upstream comparing genome vs read
to find the first position where both agree on a non-A base:
genome: ..CTAGTGACAGTCAAAAAAAA-AAACAAAAGTAAAAAAAAAAAA|..
read:   ..CTAGTGACAGTCAAAAAAAATAAA-AAAAA--AAAAAAAAAA.|..
                     ^        ^         ^^
                   AGREE   T error     dels
                   (C=C)   (ignore)   (count)
RECTIFY walks back from soft-clip boundary:
- Skip all A's (ambiguous with poly(A) tail)
- Skip over deletions in the read, where the aligner stretched the poly(A) tail across non-A genomic bases: 3 bp total
- Ignore T (likely sequencing error in homopolymer)
- Find first non-A agreement: 'C' at upstream position
- Result: ambiguity window spanning the A-tract from upstream C to downstream C
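The walk-back rules above can be sketched on a pair of gapped alignment strings. This is an illustration under stated assumptions, not RECTIFY's implementation: the function `walk_back`, the gapped-string input, and the shortened example sequences are all hypothetical, and a real tool would read the same information from the BAM CIGAR.

```python
def walk_back(genome_aln, read_aln):
    """Walk upstream from the soft-clip boundary (the right-hand end).

    Both strings are gapped to equal length, with '-' marking a gap.
    Returns (column, deletions): the first column, scanning right to left,
    where genome and read agree on a non-A base, plus the number of deleted
    genomic bases skipped on the way. Column is None if no anchor is found.
    """
    if len(genome_aln) != len(read_aln):
        raise ValueError("alignment strings must have equal length")
    deletions = 0
    for col in range(len(genome_aln) - 1, -1, -1):
        g, r = genome_aln[col], read_aln[col]
        if r == "-":
            deletions += 1  # genomic base absent from the read
            continue
        if g == "-":
            continue  # extra base in the read, e.g. a homopolymer T error
        if g == r and g != "A":
            return col, deletions  # first non-A agreement: upstream anchor
        # A/A columns and mismatches are ambiguous, so keep walking
    return None, deletions


# Shortened, hypothetical version of the alignment drawn above.
genome_aln = "GACAGTCAAAA-AAACAAGTAAAA"
read_aln = "GACAGTCAAAATAAA-AA--AAAA"
anchor, n_del = walk_back(genome_aln, read_aln)
print(anchor, n_del)
```

The anchor column marks the upstream edge of the ambiguity window; everything between it and the soft-clip boundary could be either genomic sequence or tail.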
===============================================================================
STEP 3: NET-seq REFINEMENT (Optional)
===============================================================================
For species with NET-seq data, we can resolve the ambiguity window.
NET-seq captures nascent RNA bound to RNA Pol II. Cleavage intermediates
are oligo-adenylated (mean ~6.6 A's added) before release:
CPA site
|
v
___________ ✂️
| Pol II |====~~~CGUACGUAG*AAAAAAA <- oligo-A tail (~6 A's)
|___________| 3' added after cleavage
When this oligo-adenylated intermediate is sequenced and aligned to a
genome with downstream A's, the oligo-A tail extends the alignment:
                 true CPA
                     |
genome:  ...CGTACGTAG|AAAAAAAA|GTCACC...
                     |^^^^^^^^|
                  genomic A-tract
                         |
aligned: ...CGTACGTAG*AAAAAAA <- oligo-A aligns to genomic A's
                     |<----->|
                     SHIFT (apparent 3' end moves downstream)
This creates a spreading artifact: signal is shifted downstream.
RECTIFY uses NNLS (Non-Negative Least Squares) deconvolution to remove
the spreading artifact and recover true peak positions:
1. Build convolution matrix from 0A PSF (Point-Spread Function)
- ~54% of signal stays at true position
- ~46% spreads downstream (mean ~3bp)
2. Solve: observed = A @ true_peaks (regularized NNLS)
3. Assign reads PROPORTIONALLY to deconvolved peaks:
- If 2 peaks with 70%/30% signal → assign 0.7 and 0.3 reads
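The three steps above can be sketched in plain Python. Several things here are assumptions rather than RECTIFY's method: the PSF values beyond the 54% at offset 0 are made up, the solver is a simple projected-gradient NNLS without the regularization term the text mentions, and all function names are hypothetical. In practice `scipy.optimize.nnls` would be the usual choice for the solve.

```python
def convolution_matrix(psf, n):
    """Build an n x n matrix whose column j spreads a peak at j downstream."""
    matrix = [[0.0] * n for _ in range(n)]
    for j in range(n):
        for offset, weight in enumerate(psf):
            if j + offset < n:
                matrix[j + offset][j] = weight
    return matrix


def matvec(matrix, vector):
    """Multiply a list-of-rows matrix by a vector."""
    return [sum(a * b for a, b in zip(row, vector)) for row in matrix]


def nnls_projected_gradient(matrix, observed, n_iter=2000, step=1.0):
    """Minimise 0.5 * ||matrix @ x - observed||^2 subject to x >= 0.

    step=1.0 is safe here because the PSF sums to 1, which bounds the
    largest singular value of the convolution matrix by about 1.
    """
    n_rows, n_cols = len(matrix), len(matrix[0])
    x = [0.0] * n_cols
    for _ in range(n_iter):
        residual = [p - o for p, o in zip(matvec(matrix, x), observed)]
        gradient = [sum(matrix[i][j] * residual[i] for i in range(n_rows))
                    for j in range(n_cols)]
        # Gradient step, then clip negatives to zero (the projection).
        x = [max(0.0, xj - step * gj) for xj, gj in zip(x, gradient)]
    return x


# Hypothetical PSF: 54% stays in place, 46% spreads 1-4 nt downstream.
psf = [0.54, 0.20, 0.14, 0.08, 0.04]
true_peaks = [0, 0, 70, 0, 0, 30, 0, 0, 0, 0]
A = convolution_matrix(psf, len(true_peaks))
observed = matvec(A, true_peaks)           # what the spread signal looks like
estimate = nnls_projected_gradient(A, observed)
fractions = [value / sum(estimate) for value in estimate]
print([round(f, 3) for f in fractions])
```

The recovered fractions are what a read would be apportioned by: in this toy case the two peaks come back at roughly 70% and 30%, so a read in that window would contribute 0.7 and 0.3.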
Confidence assignment:
- HIGH: Single dominant peak (>90% of signal)
- MEDIUM: Dominant peak (>70% of signal)
- SPLIT: Multiple significant peaks (no dominant)
- LOW: Weak signal or no NET-seq data
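The confidence rules can be written as a small function. The 90% and 70% cut-offs come from the list above; the `min_total_signal` parameter and its default are assumptions, since the text does not say what counts as "weak signal".

```python
def assign_confidence(peak_signals, min_total_signal=5.0):
    """Map deconvolved peak signals in one ambiguity window to a label.

    `min_total_signal` is a hypothetical cut-off for "weak signal".
    """
    total = sum(peak_signals)
    if not peak_signals or total < min_total_signal:
        return "LOW"  # weak signal or no NET-seq data
    top_fraction = max(peak_signals) / total
    if top_fraction > 0.90:
        return "HIGH"  # single dominant peak
    if top_fraction > 0.70:
        return "MEDIUM"  # dominant peak with some competing signal
    return "SPLIT"  # several significant peaks, none dominant


for signals in ([100], [80, 20], [50, 30, 20], []):
    print(signals, assign_confidence(signals))
```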
===============================================================================
STEP 4: FINAL OUTPUT (with proportional apportionment)
===============================================================================
When NET-seq reveals multiple peaks within an ambiguity window, reads are
APPORTIONED PROPORTIONALLY to each peak position. This preserves quantitative
accuracy for downstream analysis.
Example 1: Single dominant peak (HIGH confidence)
-------------------------------------------------
Ambiguity window [31, 42], NET-seq shows single peak at position 35:
              31               35                             42
              |                |                              |
Ambiguity:    |===============================================|
NET-seq peak:                  #
Output: read001 → position 35, weight 1.0, confidence HIGH
Example 2: Multiple peaks (SPLIT confidence)
--------------------------------------------
NET-seq can reveal several CPA sites inside one read's ambiguity window. Here
a SINGLE Nanopore read with ambiguity window [31, 42] overlaps THREE distinct
NET-seq peaks, at positions 33, 36, and 40:
               31       33          36               40       42
               |        |           |                |        |
Ambiguity:     |==============================================|
NET-seq peaks:         ###          ##               #
                       50%          30%             20%
Output: read001 → split into THREE output rows (one read becomes three):
- read001 at position 33, weight 0.5
- read001 at position 36, weight 0.3
- read001 at position 40, weight 0.2
This preserves quantitative accuracy: if 100 reads map to this ambiguous
region, the output will have ~50 reads at pos 33, ~30 at pos 36, ~20 at pos 40.
Final output format:
+----------+----------+--------+------------+-----------------+
| read_id | position | weight | confidence | ambiguity_range |
+----------+----------+--------+------------+-----------------+
| read001 | 33 | 0.5 | SPLIT | 11 |
| read001 | 36 | 0.3 | SPLIT | 11 |
| read001 | 40 | 0.2 | SPLIT | 11 |
+----------+----------+--------+------------+-----------------+
Without NET-seq: Reports ambiguity window [31, 42] and uses leftmost
position (most conservative estimate), weight 1.0.
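A minimal sketch of the apportionment, including the fallback used when no NET-seq signal is available. The function `apportion_read` and the dictionary-based row layout are assumptions for illustration; the confidence column is left out to keep the example short.

```python
def apportion_read(read_id, window, peaks):
    """Split one read across NET-seq peaks inside its ambiguity window.

    `window` is (min_pos, max_pos); `peaks` maps position -> signal.
    Returns one output row per peak, with weights summing to 1.
    """
    lo, hi = window
    in_window = {pos: sig for pos, sig in peaks.items()
                 if lo <= pos <= hi and sig > 0}
    if not in_window:
        # No usable NET-seq signal: report the leftmost position, weight 1.0.
        return [{"read_id": read_id, "position": lo, "weight": 1.0,
                 "ambiguity_range": hi - lo}]
    total = sum(in_window.values())
    return [{"read_id": read_id, "position": pos, "weight": sig / total,
             "ambiguity_range": hi - lo}
            for pos, sig in sorted(in_window.items())]


rows = apportion_read("read001", (31, 42), {33: 50, 36: 30, 40: 20})
for row in rows:
    print(row)
```

Summing `weight` per position across all reads gives the fractional read counts described above, which is why downstream tools should aggregate weights rather than count rows.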
Installation
From source (development)
git clone https://github.com/k-roy/RECTIFY.git
cd RECTIFY
pip install -e .
From PyPI
pip install rectify-rna
Quick Start
QuantSeq (oligo-dT short-read)
rectify correct quantseq.bam \
--genome sacCer3.fa \
--annotation genes.gtf \
--polya-sequenced \
--output corrected_3ends.tsv
Nanopore direct RNA-seq with NET-seq refinement
rectify correct nanopore.bam \
--genome sacCer3.fa \
--annotation genes.gtf \
--polya-sequenced \
--aligner minimap2 \
--netseq-dir churchman_bigwigs/ \
--output corrected_3ends.tsv
Output Format
RECTIFY produces two output files:
1. Corrected Positions TSV (output.tsv)
Per-read corrected 3' end positions with QC metrics:
read_id chrom strand original_3prime corrected_3prime ambiguity_min ambiguity_max ambiguity_range polya_length correction_applied confidence qc_flags
read001 chrI + 147588 147585 147583 147588 5 42 atract_ambiguity high PASS
read002 chrI + 147593 147591 147591 147593 2 38 atract_ambiguity medium AG_RICH
Key columns:
- polya_length: Total observed poly(A) tail length = aligned A's (from A-tract) + soft-clipped A's
- ambiguity_range: Positional uncertainty due to poly(A) tail aligning to genomic A-tract
- correction_applied: atract_ambiguity (position correction), indel_correction, netseq_refinement
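The corrected-positions file can be read with the standard library. The sketch below rebuilds the two example rows as an in-memory TSV and adds a derived `shift` column; the helper `read_corrected_positions` and the `shift` field are illustrative additions, not part of RECTIFY's output.

```python
import csv
import io

HEADER = ["read_id", "chrom", "strand", "original_3prime", "corrected_3prime",
          "ambiguity_min", "ambiguity_max", "ambiguity_range", "polya_length",
          "correction_applied", "confidence", "qc_flags"]
ROWS = [
    ["read001", "chrI", "+", "147588", "147585", "147583", "147588", "5",
     "42", "atract_ambiguity", "high", "PASS"],
    ["read002", "chrI", "+", "147593", "147591", "147591", "147593", "2",
     "38", "atract_ambiguity", "medium", "AG_RICH"],
]
sample_tsv = "\n".join("\t".join(r) for r in [HEADER] + ROWS) + "\n"

INT_COLUMNS = ("original_3prime", "corrected_3prime", "ambiguity_min",
               "ambiguity_max", "ambiguity_range", "polya_length")


def read_corrected_positions(handle):
    """Parse the per-read TSV and add the size of each correction."""
    records = []
    for row in csv.DictReader(handle, delimiter="\t"):
        for col in INT_COLUMNS:
            row[col] = int(row[col])
        # Distance the 3' end moved (illustrative column, not in the file).
        row["shift"] = row["original_3prime"] - row["corrected_3prime"]
        records.append(row)
    return records


records = read_corrected_positions(io.StringIO(sample_tsv))
for rec in records:
    print(rec["read_id"], rec["corrected_3prime"], rec["shift"], rec["qc_flags"])
```

For a real run, replace the `io.StringIO` handle with `open("corrected_3ends.tsv", newline="")`.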
2. Processing Statistics TSV (output_stats.tsv)
Comprehensive read flow statistics through each filtering and correction stage:
metric count percent description
total_reads_in_bam 1000000 100.00 Total reads in BAM file
reads_unmapped 5000 0.50 Unmapped reads (skipped)
reads_secondary 2000 0.20 Secondary alignments (skipped)
reads_processed 993000 99.30 Reads with 3' ends corrected
ends_with_downstream_A 450000 45.32 3' ends with >=1 A immediately downstream
ends_ambiguous_atract 180000 18.13 3' ends in A-tract (ambiguity range > 0)
ends_shifted_atract_walking 85000 8.56 3' ends shifted by A-tract walking
reads_with_polya 890000 89.63 Reads with detected poly(A) tail
polya_length_mean 42.3 - Mean poly(A) tail length (bp)
polya_length_max 185 - Maximum poly(A) tail length (bp)
confidence_high 750000 75.53 High confidence assignments
confidence_medium 200000 20.14 Medium confidence assignments
confidence_low 43000 4.33 Low confidence assignments
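The percent column in this example appears to use two denominators: the read-flow rows are relative to total_reads_in_bam, and the remaining rows are relative to reads_processed. That reading is inferred from the numbers rather than stated by the tool; the short check below reproduces the example values under that assumption.

```python
counts = {
    "total_reads_in_bam": 1_000_000,
    "reads_unmapped": 5_000,
    "reads_secondary": 2_000,
    "reads_processed": 993_000,
    "ends_with_downstream_A": 450_000,
    "ends_ambiguous_atract": 180_000,
    "ends_shifted_atract_walking": 85_000,
    "reads_with_polya": 890_000,
    "confidence_high": 750_000,
    "confidence_medium": 200_000,
    "confidence_low": 43_000,
}
# Rows assumed to be expressed as a share of all reads in the BAM.
READ_FLOW = ("total_reads_in_bam", "reads_unmapped",
             "reads_secondary", "reads_processed")


def percent(metric):
    """Recompute the percent column under the inferred denominators."""
    if metric in READ_FLOW:
        denominator = counts["total_reads_in_bam"]
    else:
        denominator = counts["reads_processed"]
    return round(100.0 * counts[metric] / denominator, 2)


for name in counts:
    print(f"{name}\t{counts[name]}\t{percent(name):.2f}")
```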
Module Architecture
RECTIFY applies corrections modularly based on your data:
- Module 1: A-tract Ambiguity (always applied)
  - Identifies genomic A-tracts near 3' ends
  - Calculates ambiguity windows
- Module 2A: AG Mispriming (when oligo-dT priming used)
  - Screens for downstream AG-richness
  - Flags likely misprimed reads
- Module 2B: Poly(A) Detection (when poly(A) IS sequenced)
  - Measures observed poly(A) tail length (aligned A's + soft-clipped A's)
  - Reports tail length for QC (does NOT correct position; that's handled by A-tract detection)
- Module 2C: Indel Correction (when poly(A) IS sequenced)
  - Detects and corrects indel artifacts in aligned region
- Module 3: NET-seq Refinement (optional)
  - Resolves ambiguity using NET-seq data via NNLS deconvolution
  - Apportions reads PROPORTIONALLY to multiple peaks when ambiguity exists
  - Assigns confidence scores (HIGH/MEDIUM/SPLIT/LOW)
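The module selection described above amounts to a few conditionals. In this sketch the function `select_modules`, its three boolean parameters, and the module labels are hypothetical names chosen for illustration; they are not RECTIFY's internal API.

```python
def select_modules(oligo_dt_primed, polya_sequenced, netseq_available):
    """Return the correction modules that apply to a dataset, in order."""
    modules = ["atract_ambiguity"]  # Module 1 always runs
    if oligo_dt_primed:
        modules.append("ag_mispriming")  # Module 2A
    if polya_sequenced:
        modules.append("polya_detection")  # Module 2B
        modules.append("indel_correction")  # Module 2C
    if netseq_available:
        modules.append("netseq_refinement")  # Module 3
    return modules


# Nanopore direct RNA-seq with NET-seq data available.
print(select_modules(oligo_dt_primed=False, polya_sequenced=True,
                     netseq_available=True))
```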
Downstream Analysis
RECTIFY includes a comprehensive analyze command for downstream analysis of corrected 3' ends:
rectify analyze corrected_3ends.tsv \
--annotation genes.gtf \
--output-dir analysis_results/
Analysis Modules
| Module | Description | Output |
|---|---|---|
| Clustering | Group nearby CPA sites into clusters | clusters.tsv, count matrices |
| Differential Expression | DESeq2-based gene and cluster-level analysis | deseq2_results.tsv |
| PCA | Sample quality control and batch effect detection | pca_plot.png |
| Shift Analysis | Detect condition-specific CPA site usage changes | shift_results.tsv, browser plots |
| GO Enrichment | Functional enrichment of genes with shifted 3' ends | go_enrichment.tsv |
| Motif Discovery | Identify sequence motifs near CPA sites | motif_results/ |
Browser-Style Visualization
The shift analysis module generates publication-ready genome browser plots showing:
- Per-condition CPA site usage distributions
- Gene track with CDS boxes as directional pentagon arrows
- CPA cluster markers
Citation
If you use RECTIFY, please cite:
Original RECTIFY (AG mispriming correction):
Roy KR, Chanfreau GF. Robust mapping of polyadenylated and non-polyadenylated RNA 3' ends at nucleotide resolution by 3'-end sequencing. Methods. 2020 Apr 1;176:4-13. doi: 10.1016/j.ymeth.2019.05.016. PMID: 31128237
RECTIFY 2.0 (unified framework):
Manuscript in preparation
License
MIT License - See LICENSE for details
Contact
- Kevin R. Roy - kevinrjroy@gmail.com
- GitHub: k-roy/RECTIFY
Acknowledgments
- Original RECTIFY development supported by Chanfreau Lab, UCLA
- NET-seq data from Churchman Lab, Harvard Medical School