Unified RNA 3' End Correction Framework for poly(A)-tailed RNA sequencing
RECTIFY: Unified RNA 3' End Correction Framework
RECTIFY (RNA 3' End Correction Tool Integrating False-priming and poly(A) ambiguity) is a unified framework for correcting 3' end mapping artifacts in poly(A)-tailed RNA sequencing data.
Overview
RECTIFY addresses two fundamental problems affecting RNA 3' end mapping:
- A-tract Ambiguity (Universal): Genomic A-tracts near true 3' ends create positional uncertainty affecting ALL poly(A)-tailed RNA-seq technologies
- Technology-Specific Artifacts:
  - AG mispriming: Internal priming on A/G-rich regions (oligo-dT methods)
  - Poly(A) tail alignment: Tail bases align to genomic A-tracts creating systematic shifts (when poly(A) is sequenced)
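As a rough illustration of the AG mispriming problem, the sketch below scores how A/G-rich the genomic sequence immediately downstream of a candidate 3' end is. The function names, the 10 bp window, and the 0.8 threshold are assumptions chosen for this example, not RECTIFY's actual parameters, and only plus-strand coordinates are handled.

```python
def ag_fraction_downstream(genome: str, pos: int, window: int = 10) -> float:
    """Fraction of A/G bases in the `window` bases just downstream of `pos`.

    `pos` is a 0-based index of the candidate 3' end on the plus strand.
    """
    downstream = genome[pos + 1 : pos + 1 + window].upper()
    if not downstream:
        return 0.0
    return sum(base in "AG" for base in downstream) / len(downstream)


def is_likely_misprimed(
    genome: str, pos: int, window: int = 10, threshold: float = 0.8
) -> bool:
    """Flag a 3' end whose downstream sequence is A/G-rich (hypothetical cutoff)."""
    return ag_fraction_downstream(genome, pos, window) >= threshold


# Toy sequence: an A/G-rich stretch follows position 7.
genome = "CCTTGCTCAAGAAAGAAAGACCTTCC"
print(is_likely_misprimed(genome, 7))      # A/G-rich downstream region
print(ag_fraction_downstream(genome, 19))  # C/T-only downstream region
```

A real screen would also need minus-strand handling and a cutoff calibrated to the priming chemistry.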
Key Features
- Modular correction strategies that apply based on sequencing technology
- Universal A-tract ambiguity detection for all poly(A)-tailed RNA-seq
- AG mispriming screening (from original RECTIFY, Roy & Chanfreau 2019)
- Poly(A) tail trimming and indel artifact correction (for technologies that sequence the poly(A) tail: nanopore, Helicos, QuantSeq)
- NET-seq refinement (optional, technology-independent)
- Unified output format with confidence scores and QC flags
How It Works
RECTIFY corrects common 3' end mapping artifacts through a series of modular steps:
┌─────────────────────────────────────────────────────────────────────────────┐
│  EXAMPLE 1: Homopolymer Deletion Artifact (Nanopore)                        │
│  ═══════════════════════════════════════════════════                        │
│                                                                             │
│  True RNA:       5'...GCTAAGCTTAAAAAA-3' + AAAAAAAAAA (poly(A) tail)        │
│                                └────┘                                       │
│                           6A genomic tract                                  │
│                                                                             │
│  Genome:           ...GCTAAGCTTAAAAAA|GTCACC...   (| = true CPA site)       │
│                                                                             │
│  Nanopore read:    ...GCTAAGCTT--AAAA|GTCACC  (2bp deletion in A-tract)     │
│                                ↑↑                                           │
│                   systematic homopolymer error                              │
│                                                                             │
│  Problem: Aligner maps 3' end 2bp upstream of true position                 │
│           (deletion consumes genomic bases that should be in transcript)    │
│                                                                             │
│  RECTIFY: Detects A-tract deletion, adjusts position +2bp                   │
│  Result:  Correct 3' end position restored                                  │
└─────────────────────────────────────────────────────────────────────────────┘
┌─────────────────────────────────────────────────────────────────────────────┐
│  EXAMPLE 2: Multiple Indels Near 3' End                                     │
│  ══════════════════════════════════════                                     │
│                                                                             │
│  Genome:           ...TACGTTTTTTAAAAAA|GTCA...                              │
│                           └────┘└────┘                                      │
│                         T-tract A-tract                                     │
│                                                                             │
│  Nanopore read:    ...TACGT---TTAA-AAA|GTCA                                 │
│                            ↑↑↑    ↑                                         │
│                         3bp del 1bp del                                     │
│                                                                             │
│  RECTIFY logic:                                                             │
│    • T-tract deletion (3bp): TRUE artifact → correct +3bp                   │
│    • A-tract deletion (1bp): TRUE artifact → correct +1bp                   │
│    • Total correction: +4bp                                                 │
│                                                                             │
│  Note: Insertions do NOT shift reference coordinates (no correction needed) │
└─────────────────────────────────────────────────────────────────────────────┘
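The deletion logic in Examples 1 and 2 can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not RECTIFY's implementation: the function name, the 20 bp window, and the rule "a deletion counts if its bases are identical and match a flanking base" are all hypothetical, and the read is assumed to map to the plus strand.

```python
import re

# One CIGAR operation: a length followed by an operation code (SAM spec).
CIGAR_RE = re.compile(r"(\d+)([MIDNSHP=X])")


def homopolymer_deletion_shift(cigar: str, ref_start: int, genome: str,
                               window: int = 20) -> int:
    """Sum the lengths of homopolymer deletions near the read's 3' end.

    `ref_start` is the 0-based reference start of the alignment.
    """
    ref_pos = ref_start
    deletions = []  # (reference start, length) of each deletion
    for length_str, op in CIGAR_RE.findall(cigar):
        length = int(length_str)
        if op == "D":
            deletions.append((ref_pos, length))
        if op in "MDN=X":  # operations that consume reference bases
            ref_pos += length
        # Insertions ("I") and clips consume no reference bases,
        # so they never contribute to the shift.
    ref_end = ref_pos  # exclusive end of the alignment on the reference

    shift = 0
    for start, length in deletions:
        deleted = genome[start:start + length]
        flank = genome[max(start - 1, 0):start] + genome[start + length:start + length + 1]
        in_homopolymer = len(set(deleted)) == 1 and deleted[0] in flank
        if in_homopolymer and ref_end - (start + length) <= window:
            shift += length
    return shift


# Example 2: 3 bp T-tract deletion plus 1 bp A-tract deletion.
genome = "TACGTTTTTTAAAAAAGTCA"
print(homopolymer_deletion_shift("5M3D4M1D3M", 0, genome))  # 4
```

The CIGAR string `5M3D4M1D3M` encodes the read from Example 2 up to the CPA site; an alignment containing only an insertion, such as `5M2I11M`, yields a shift of 0.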
┌─────────────────────────────────────────────────────────────────────────────┐
│  EXAMPLE 3: A-tract Ambiguity Window                                        │
│  ═══════════════════════════════════                                        │
│                                                                             │
│  Genome:           ...CGTACAAAAAAAA|GTCACC...                               │
│                            └──────┘                                         │
│                          8bp A-tract                                        │
│                                                                             │
│  Problem: Any position within the A-tract could be the true 3' end          │
│           (indistinguishable from poly(A) tail)                             │
│                                                                             │
│                    ...CGTACAAAAAAAA|  ← could be here                       │
│                    ...CGTACAAAAAAA|A  ← or here                             │
│                    ...CGTACAAAAAA|AA  ← or here                             │
│                    ...CGTACAAAAA|AAA  ← etc.                                │
│                                                                             │
│  RECTIFY: Reports ambiguity window [pos-7, pos] with range=8                │
│           Confidence score reflects uncertainty                             │
└─────────────────────────────────────────────────────────────────────────────┘
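The window in Example 3 can be found by walking upstream from the mapped 3' end for as long as the genome reads `A`. The sketch below assumes a plus-strand read and a 0-based `pos` pointing at the last aligned base; the function name is hypothetical and this is not necessarily how RECTIFY computes the window.

```python
def a_tract_window(genome: str, pos: int) -> tuple[int, int]:
    """Return (window_min, window_max) of the A-tract ending at `pos`.

    If `pos` is not an A, the window collapses to the single position.
    """
    if genome[pos] != "A":
        return pos, pos
    window_min = pos
    # Extend upstream while the preceding genomic base is also an A.
    while window_min > 0 and genome[window_min - 1] == "A":
        window_min -= 1
    return window_min, pos


# Example 3: an 8 bp A-tract ending at index 12.
genome = "CGTACAAAAAAAAGTCACC"
print(a_tract_window(genome, 12))  # (5, 12), i.e. [pos-7, pos]
```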
┌─────────────────────────────────────────────────────────────────────────────┐
│  EXAMPLE 4: NET-seq Refinement                                              │
│  ═════════════════════════════                                              │
│                                                                             │
│  Genome:           ...CGTACAAAAAAAA|GTCACC...                               │
│                            └──────┘                                         │
│                        ambiguity window                                     │
│                                                                             │
│  NET-seq:                   ▁▂▃█▇▅▂▁                                        │
│  signal:                       ↑                                            │
│                           peak at -3                                        │
│                                                                             │
│  RECTIFY: Uses NET-seq Pol II occupancy to identify most likely             │
│           termination site within the ambiguity window                      │
│                                                                             │
│  Result:  Position refined to NET-seq peak, confidence = HIGH               │
└─────────────────────────────────────────────────────────────────────────────┘
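One simple way to picture the refinement in Example 4 is to take the position with the strongest NET-seq signal inside the ambiguity window. The sketch below makes that assumption explicitly; the function name, the dictionary-based signal, and the fallback rule are hypothetical, and RECTIFY's actual scoring may differ.

```python
def refine_with_netseq(window_min: int, window_max: int,
                       signal: dict[int, float], fallback: int) -> int:
    """Pick the position with the highest signal inside the window.

    `signal` maps genomic position to NET-seq coverage. If the window
    has no signal at all, the original position (`fallback`) is kept.
    """
    candidates = range(window_min, window_max + 1)
    best = max(candidates, key=lambda p: signal.get(p, 0.0))
    if signal.get(best, 0.0) <= 0.0:
        return fallback
    return best


# Toy signal over the window [5, 12], peaking 3 bp upstream of position 12.
signal = {5: 1.0, 6: 2.0, 7: 3.0, 8: 5.0, 9: 9.0, 10: 6.0, 11: 2.0, 12: 1.0}
print(refine_with_netseq(5, 12, signal, fallback=12))  # 9
```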
Pipeline Flow:
══════════════
┌──────────┐    ┌─────────────┐    ┌─────────────┐    ┌─────────────┐
│  Input   │    │  Module 1   │    │ Module 2A/B │    │  Module 3   │
│   BAM    │───▶│   A-tract   │───▶│  Poly(A) &  │───▶│   NET-seq   │
│          │    │  Ambiguity  │    │   Indels    │    │ Refinement  │
└──────────┘    └─────────────┘    └─────────────┘    └─────────────┘
                       │                  │                  │
                       ▼                  ▼                  ▼
                  ┌───────────────────────────────────────────────┐
                  │               Corrected 3' Ends               │
                  │    (position, ambiguity range, confidence)    │
                  └───────────────────────────────────────────────┘
Installation
From source (development)
git clone https://github.com/k-roy/RECTIFY.git
cd RECTIFY
pip install -e .
From PyPI
pip install rectify-rna
Quick Start
QuantSeq (oligo-dT short-read)
rectify correct quantseq.bam \
--genome sacCer3.fa \
--annotation genes.gtf \
--polya-sequenced \
--output corrected_3ends.tsv
Nanopore direct RNA-seq with NET-seq refinement
rectify correct nanopore.bam \
--genome sacCer3.fa \
--annotation genes.gtf \
--polya-sequenced \
--aligner minimap2 \
--netseq-dir churchman_bigwigs/ \
--output corrected_3ends.tsv
Output Format
RECTIFY produces a TSV file with corrected 3' end positions and QC metrics:
read_id chrom strand raw_position corrected_position ambiguity_min ambiguity_max ambiguity_range correction_type confidence qc_flags
read001 chrI + 147588 147585 147583 147588 5 polya_trim high PASS
read002 chrI + 147593 147591 147591 147593 2 ag_mispriming medium AG_RICH
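Because the output is plain TSV, it can be consumed with Python's standard `csv` module. The sketch below uses the column names and the two example rows shown above; the helper name `load_passing_ends` and the choice to keep only `PASS` rows are assumptions made for illustration.

```python
import csv
import io

# The two example rows from above, written out tab-separated.
SAMPLE = (
    "read_id\tchrom\tstrand\traw_position\tcorrected_position\t"
    "ambiguity_min\tambiguity_max\tambiguity_range\tcorrection_type\t"
    "confidence\tqc_flags\n"
    "read001\tchrI\t+\t147588\t147585\t147583\t147588\t5\tpolya_trim\thigh\tPASS\n"
    "read002\tchrI\t+\t147593\t147591\t147591\t147593\t2\tag_mispriming\tmedium\tAG_RICH\n"
)


def load_passing_ends(handle):
    """Return (chrom, strand, corrected_position) for rows flagged PASS."""
    reader = csv.DictReader(handle, delimiter="\t")
    return [
        (row["chrom"], row["strand"], int(row["corrected_position"]))
        for row in reader
        if row["qc_flags"] == "PASS"
    ]


passing = load_passing_ends(io.StringIO(SAMPLE))
print(passing)  # [('chrI', '+', 147585)]
```

To read a real results file, pass an open file handle (for example `open("corrected_3ends.tsv", newline="")`) instead of the `StringIO` object.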
Module Architecture
RECTIFY applies corrections modularly based on your data:
- Module 1: A-tract Ambiguity (always applied)
  - Identifies genomic A-tracts near 3' ends
  - Calculates ambiguity windows
- Module 2A: AG Mispriming (when oligo-dT priming used)
  - Screens for downstream AG-richness
  - Flags likely misprimed reads
- Module 2B+2C: Poly(A) Corrections (when poly(A) IS sequenced)
  - Models and trims poly(A) tails
  - Detects and removes indel artifacts
- Module 3: NET-seq Refinement (optional)
  - Resolves ambiguity using NET-seq data
  - Assigns confidence scores
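The conditions listed above amount to a small decision rule. The following sketch restates it in Python; the function and argument names are hypothetical and are not part of RECTIFY's API or command-line interface.

```python
def select_modules(oligo_dt_primed: bool, polya_sequenced: bool,
                   netseq_available: bool) -> list[str]:
    """List which correction modules apply to a dataset."""
    modules = ["1: A-tract ambiguity"]  # always applied
    if oligo_dt_primed:
        modules.append("2A: AG mispriming")
    if polya_sequenced:
        modules.append("2B+2C: poly(A) corrections")
    if netseq_available:
        modules.append("3: NET-seq refinement")
    return modules


# A dataset where the poly(A) tail is sequenced and NET-seq data is supplied.
print(select_modules(oligo_dt_primed=False, polya_sequenced=True,
                     netseq_available=True))
```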
Citation
If you use RECTIFY, please cite:
Original RECTIFY (AG mispriming correction):
Roy KR, Chanfreau GF. Robust mapping of polyadenylated and non-polyadenylated RNA 3' ends at nucleotide resolution by 3'-end sequencing. Methods. 2020 Apr 1;176:4-13. doi: 10.1016/j.ymeth.2019.05.016. PMID: 31128237
RECTIFY 2.0 (unified framework):
Manuscript in preparation
License
MIT License - See LICENSE for details
Contact
- Kevin R. Roy - kevinrjroy@gmail.com
- GitHub: k-roy/RECTIFY
Acknowledgments
- Original RECTIFY development supported by Chanfreau Lab, UCLA
- NET-seq data from Churchman Lab, Harvard Medical School
File details
Details for the file rectify_rna-2.1.0.tar.gz.
File metadata
- Download URL: rectify_rna-2.1.0.tar.gz
- Upload date:
- Size: 81.4 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.12.2
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | bf9407ab66bfa867fbe519a1c9a565d79f146bbae053f4c12e61dd207f2eb030 |
| MD5 | a9c883f869a5937bcd9fed8423581103 |
| BLAKE2b-256 | 7de2c417971f43a3181bd884bf97fa1b692f4149ba0e233d617ccf20b1aee5aa |
File details
Details for the file rectify_rna-2.1.0-py3-none-any.whl.
File metadata
- Download URL: rectify_rna-2.1.0-py3-none-any.whl
- Upload date:
- Size: 75.2 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.12.2
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | b0cb58104b82884809aeef898c8764c29304fca8bd49ae459388057878699125 |
| MD5 | 68e6322a897cc5c3331dc34a44b98c76 |
| BLAKE2b-256 | 2777c7b101f54d3ccbb219b8a764aa68d4b2931a0b92278257ae40d06bba46c2 |