Unified RNA 3' End Correction Framework for poly(A)-tailed RNA sequencing

These details have not been verified by PyPI

Project links

Project description

RECTIFY: Unified RNA 3' End Correction Framework

RECTIFY (RNA 3' End Correction Tool Integrating False-priming and poly(A) ambiguity) is a unified framework for correcting 3' end mapping artifacts in poly(A)-tailed RNA sequencing data.

Overview

RECTIFY addresses two fundamental problems affecting RNA 3' end mapping:

A-tract Ambiguity (Universal): Genomic A-tracts near true 3' ends create positional uncertainty affecting ALL poly(A)-tailed RNA-seq technologies
Technology-Specific Artifacts:
- AG mispriming: Internal priming on A/G-rich regions (oligo-dT methods)
- Poly(A) tail alignment: Tail bases align to genomic A-tracts creating systematic shifts (when poly(A) is sequenced)

Key Features

Modular correction strategies that apply based on sequencing technology
Universal A-tract ambiguity detection for all poly(A)-tailed RNA-seq
AG mispriming screening (from original RECTIFY, Roy & Chanfreau 2019)
Poly(A) tail trimming and indel artifact correction (for direct RNA-seq: nanopore, Helicos, QuantSeq)
NET-seq refinement (optional, technology-independent)
Unified output format with confidence scores and QC flags

How It Works

RECTIFY corrects 3' end mapping artifacts by walking a read through multiple correction steps. Here's a single Nanopore read traversing the full pipeline:

===============================================================================
STEP 1: RAW ALIGNMENT
===============================================================================

The aligner maps a Nanopore direct RNA read to the genome. The poly(A) tail
is soft-clipped where the genomic A-tract ends, but the aligned region of the tail
contains additional errors due to alignment and sequencing errors.

genome: 5'..CTAGTGACAGTCAAAAAAAA-AAACAAAAGTAAAAAAAAAAAA|CTAGCGATC..3'
                                                       |
                                              CPA site (true 3' end)

read:   5'..CTAGTGACAGTCAAAAAAAATAAA-AAAAA--AAAAAAAAAAAAAAAAAAAAA..
                                 ^      ^^  |___________________|
                              T error  dels   soft-clipped tail

Key observations:
  - Soft-clip boundary placed at mapped 3' end
  - Deletions (-) = aligner attempts to maximize alignment of the poly(A) tail
    at expense of indels
  - 'T' in the poly(A) tail = Nanopore sequencing error in the homopolymer

===============================================================================
STEP 2: INDEL CORRECTION - Walk Backwards to Find True 3' End
===============================================================================

Starting from soft-clip boundary, walk upstream comparing genome vs read
to find the first position where both agree on a non-A base:

genome: ..CTAGTGACAGTCAAAAAAAA-AAACAAAAGTAAAAAAAAAAAA|..
read:   ..CTAGTGACAGTCAAAAAAAATAAA-AAAAA--AAAAAAAAAA.|..
                    ^          ^      ^^
                 AGREE      T error  dels
                 (C=C)     (ignore) (count)

RECTIFY walks back from soft-clip boundary:
  - Skip all A's (ambiguous with poly(A) tail)
  - Ignore deletions where mRNA has the poly(A) tail: 3 bp total
  - Ignore T (likely sequencing error in homopolymer)
  - Find first non-A agreement: 'C' at upstream position
  - Result: ambiguity window spanning the A-tract from upstream C to downstream C

===============================================================================
STEP 3: NET-seq REFINEMENT (Optional)
===============================================================================

For species with NET-seq data, we can resolve the ambiguity window.

NET-seq captures nascent RNA bound to RNA Pol II. Cleavage intermediates
are oligo-adenylated (mean ~6.6 A's added) before release:

                              CPA site
                                 |
                                 v
    ___________                 ✂️
   |  Pol II   |====~~~CGUACGUAG*AAAAAAA     <- oligo-A tail (~6 A's)
   |___________|               3'              added after cleavage

When this oligo-adenylated intermediate is sequenced and aligned to a
genome with downstream A's, the oligo-A tail extends the alignment:

                         true CPA
                            |
genome:       ...CGTACGTAG|AAAAAAAA|GTCACC...
                          |^^^^^^^^|
                          genomic A-tract
                                   |
aligned:      ...CGTACGTAG*AAAAAAA      <- oligo-A aligns to genomic A's
                          |<----->|
                           SHIFT (apparent 3' end moves downstream)

This creates a spreading artifact: signal is shifted downstream.
RECTIFY uses NNLS (Non-Negative Least Squares) deconvolution to remove
the spreading artifact and recover true peak positions:

1. Build convolution matrix from 0A PSF (Point-Spread Function)
   - ~54% of signal stays at true position
   - ~46% spreads downstream (mean ~3bp)

2. Solve: observed = A @ true_peaks (regularized NNLS)

3. Assign reads PROPORTIONALLY to deconvolved peaks:
   - If 2 peaks with 70%/30% signal → assign 0.7 and 0.3 reads

Confidence assignment:
  - HIGH:   Single dominant peak (>90% of signal)
  - MEDIUM: Dominant peak (>70% of signal)
  - SPLIT:  Multiple significant peaks (no dominant)
  - LOW:    Weak signal or no NET-seq data

===============================================================================
STEP 4: FINAL OUTPUT (with proportional apportionment)
===============================================================================

When NET-seq reveals multiple peaks within an ambiguity window, reads are
APPORTIONED PROPORTIONALLY to each peak position. This preserves quantitative
accuracy for downstream analysis.

Example 1: Single dominant peak (HIGH confidence)
-------------------------------------------------
Ambiguity window [31, 42], NET-seq shows single peak at position 35:

                 31              35                          42
                 |               |                           |
Ambiguity:       |===============================================|
NET-seq peak:                    #

Output: read001 → position 35, weight 1.0, confidence HIGH

Example 2: Multiple peaks (SPLIT confidence)
--------------------------------------------
A SINGLE Nanopore read with ambiguity window [31, 42] can be refined using
NET-seq to reveal THREE distinct CPA sites at positions 33, 36, and 40:

                 31   33   36        40                       42
                 |    |    |         |                        |
Ambiguity:       |==============================================|
NET-seq peaks:        ###  ##        #
                      50%  30%       20%

Output: read001 → split into THREE output rows (one read becomes three):
  - read001 at position 33, weight 0.5
  - read001 at position 36, weight 0.3
  - read001 at position 40, weight 0.2

This preserves quantitative accuracy: if 100 reads map to this ambiguous
region, the output will have ~50 reads at pos 33, ~30 at pos 36, ~20 at pos 40.

Final output format:
+----------+----------+--------+------------+-----------------+
| read_id  | position | weight | confidence | ambiguity_range |
+----------+----------+--------+------------+-----------------+
| read001  |    33    |  0.5   |   SPLIT    |       11        |
| read001  |    36    |  0.3   |   SPLIT    |       11        |
| read001  |    40    |  0.2   |   SPLIT    |       11        |
+----------+----------+--------+------------+-----------------+

Without NET-seq: Reports ambiguity window [31, 42] and uses leftmost
position (most conservative estimate), weight 1.0.

Installation

From source (development)

git clone https://github.com/k-roy/RECTIFY.git
cd RECTIFY
pip install -e .

From PyPI

pip install rectify-rna

Quick Start

QuantSeq (oligo-dT short-read)

rectify correct quantseq.bam \
  --genome sacCer3.fa \
  --annotation genes.gtf \
  --polya-sequenced \
  --output corrected_3ends.tsv

Nanopore direct RNA-seq with NET-seq refinement

rectify correct nanopore.bam \
  --genome sacCer3.fa \
  --annotation genes.gtf \
  --polya-sequenced \
  --aligner minimap2 \
  --netseq-dir churchman_bigwigs/ \
  --output corrected_3ends.tsv

Output Format

RECTIFY produces two output files:

1. Corrected Positions TSV (`output.tsv`)

Per-read corrected 3' end positions with QC metrics:

read_id  chrom  strand  original_3prime  corrected_3prime  ambiguity_min  ambiguity_max  ambiguity_range  polya_length  correction_applied  confidence  qc_flags
read001  chrI   +       147588           147585            147583         147588         5                42            atract_ambiguity    high        PASS
read002  chrI   +       147593           147591            147591         147593         2                38            atract_ambiguity    medium      AG_RICH

Key columns:

polya_length: Total observed poly(A) tail length = aligned A's (from A-tract) + soft-clipped A's
ambiguity_range: Positional uncertainty due to poly(A) tail aligning to genomic A-tract
correction_applied: atract_ambiguity (position correction), indel_correction, netseq_refinement

2. Processing Statistics TSV (`output_stats.tsv`)

Comprehensive read flow statistics through each filtering and correction stage:

metric                       count      percent   description
total_reads_in_bam           1000000    100.00    Total reads in BAM file
reads_unmapped               5000       0.50      Unmapped reads (skipped)
reads_secondary              2000       0.20      Secondary alignments (skipped)
reads_processed              993000     99.30     Reads with 3' ends corrected
ends_with_downstream_A       450000     45.32     3' ends with >=1 A immediately downstream
ends_ambiguous_atract        180000     18.13     3' ends in A-tract (ambiguity range > 0)
ends_shifted_atract_walking  85000      8.56      3' ends shifted by A-tract walking
reads_with_polya             890000     89.63     Reads with detected poly(A) tail
polya_length_mean            42.3       -         Mean poly(A) tail length (bp)
polya_length_max             185        -         Maximum poly(A) tail length (bp)
confidence_high              750000     75.53     High confidence assignments
confidence_medium            200000     20.14     Medium confidence assignments
confidence_low               43000      4.33      Low confidence assignments

Module Architecture

RECTIFY applies corrections modularly based on your data:

Module 1: A-tract Ambiguity (always applied)
- Identifies genomic A-tracts near 3' ends
- Calculates ambiguity windows
Module 2A: AG Mispriming (when oligo-dT priming used)
- Screens for downstream AG-richness
- Flags likely misprimed reads
Module 2B: Poly(A) Detection (when poly(A) IS sequenced)
- Measures observed poly(A) tail length (aligned A's + soft-clipped A's)
- Reports tail length for QC (does NOT correct position - that's handled by A-tract detection)
Module 2C: Indel Correction (when poly(A) IS sequenced)
- Detects and corrects indel artifacts in aligned region
Module 3: NET-seq Refinement (optional)
- Resolves ambiguity using NET-seq data via NNLS deconvolution
- Apportions reads PROPORTIONALLY to multiple peaks when ambiguity exists
- Assigns confidence scores (HIGH/MEDIUM/SPLIT/LOW)

Downstream Analysis

RECTIFY includes a comprehensive analyze command for downstream analysis of corrected 3' ends:

rectify analyze corrected_3ends.tsv \
  --annotation genes.gtf \
  --output-dir analysis_results/

Analysis Modules

Module	Description	Output
Clustering	Group nearby CPA sites into clusters	`clusters.tsv`, count matrices
Differential Expression	DESeq2-based gene and cluster-level analysis	`deseq2_results.tsv`
PCA	Sample quality control and batch effect detection	`pca_plot.png`
Shift Analysis	Detect condition-specific CPA site usage changes	`shift_results.tsv`, browser plots
GO Enrichment	Functional enrichment of genes with shifted 3' ends	`go_enrichment.tsv`
Motif Discovery	Identify sequence motifs near CPA sites	`motif_results/`

Browser-Style Visualization

The shift analysis module generates publication-ready genome browser plots showing:

Per-condition CPA site usage distributions
Gene track with CDS boxes as directional pentagon arrows
CPA cluster markers

Citation

If you use RECTIFY, please cite:

Original RECTIFY (AG mispriming correction):

Roy KR, Chanfreau GF. Robust mapping of polyadenylated and non-polyadenylated RNA 3' ends at nucleotide resolution by 3'-end sequencing. Methods. 2020 Apr 1;176:4-13. doi: 10.1016/j.ymeth.2019.05.016. PMID: 31128237

RECTIFY 2.0 (unified framework):

Manuscript in preparation

License

MIT License - See LICENSE for details

Contact

Kevin R. Roy - kevinrjroy@gmail.com
GitHub: k-roy/RECTIFY

Acknowledgments

Original RECTIFY development supported by Chanfreau Lab, UCLA
NET-seq data from Churchman Lab, Harvard Medical School

Project details

These details have not been verified by PyPI

Project links

Release history Release notifications | RSS feed

2.7.8

Apr 9, 2026

2.7.7

Apr 5, 2026

2.3.0

Mar 19, 2026

2.2.0

Mar 19, 2026

This version

2.1.3

Mar 18, 2026

2.1.2

Mar 18, 2026

2.1.1

Mar 17, 2026

2.1.0

Mar 17, 2026

Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

rectify_rna-2.1.3.tar.gz (105.3 kB view details)

Uploaded Mar 18, 2026 Source

Built Distribution

If you're not sure about the file name format, learn more about wheel file names.

The dropdown lists show the available interpreters, ABIs, and platforms. Enable javascript to be able to filter the list of wheel files.

rectify_rna-2.1.3-py3-none-any.whl (91.9 kB view details)

Uploaded Mar 18, 2026 Python 3

File details

Details for the file rectify_rna-2.1.3.tar.gz.

File metadata

Download URL: rectify_rna-2.1.3.tar.gz
Upload date: Mar 18, 2026
Size: 105.3 kB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.2.0 CPython/3.12.2

File hashes

Hashes for rectify_rna-2.1.3.tar.gz
Algorithm	Hash digest
SHA256	`5b7248cf64e8a2345f27e5f5ccb116ee848557b3d96336d574a7cecbc322d5cd`
MD5	`91c0a2083d1aa57ca79939354d948f15`
BLAKE2b-256	`191c59b317f3b888fa94231ff318ddcb5c7e9bda329569a8b5646ec42ebd47a6`

See more details on using hashes here.

File details

Details for the file rectify_rna-2.1.3-py3-none-any.whl.

File metadata

Download URL: rectify_rna-2.1.3-py3-none-any.whl
Upload date: Mar 18, 2026
Size: 91.9 kB
Tags: Python 3
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.2.0 CPython/3.12.2

File hashes

Hashes for rectify_rna-2.1.3-py3-none-any.whl
Algorithm	Hash digest
SHA256	`82763e5fa595aa7e286ba37c4f2dd037797c46e84f806739f193fff4f7b66479`
MD5	`fbbd9fcc83807d42fec64fff802f0779`
BLAKE2b-256	`c28ba4f5b37c1fa66da5dcae0d87c28a6c3d44e90fce44d30c4ac54e775bf2ae`

See more details on using hashes here.

rectify-rna 2.1.3

Navigation

Verified details

Maintainers

Unverified details

Project links

Meta

Classifiers

Project description

RECTIFY: Unified RNA 3' End Correction Framework

Overview

Key Features

How It Works

Installation

From source (development)

From PyPI

Quick Start

QuantSeq (oligo-dT short-read)

Nanopore direct RNA-seq with NET-seq refinement

Output Format

1. Corrected Positions TSV (output.tsv)

2. Processing Statistics TSV (output_stats.tsv)

Module Architecture

Downstream Analysis

Analysis Modules

Browser-Style Visualization

Citation

License

Contact

Acknowledgments

Project details

Verified details

Maintainers

Unverified details

Project links

Meta

Classifiers

Release history Release notifications | RSS feed

Download files

Source Distribution

Built Distribution

File details

File metadata

File hashes

File details

File metadata

File hashes

1. Corrected Positions TSV (`output.tsv`)

2. Processing Statistics TSV (`output_stats.tsv`)