RECTIFY

RNA 5' and 3' End Correction Tool with Intron reFinement and ambiguitY resolution

Overview

Nanopore direct RNA sequencing (DRS) offers unprecedented read lengths, but accurate transcript structure mapping requires solving four intertwined problems: (1) spurious 3' ends created by poly(A) tail artifacts (indels and false splice junctions), (2) soft-clipped 5' bases that actually belong to the exon upstream of a splice site, (3) homopolymer-driven soft-clipping at 3' ends, and (4) conflicting junction calls between different aligners. RECTIFY addresses all four through multi-aligner rectification, artifact-aware corrections, and optional NET-seq refinement, and reports 5' and 3' end coordinates at single-nucleotide resolution together with splice junction sets.

Use RECTIFY when you need:

Accurate cleavage and polyadenylation (CPA) site mapping from DRS data
Correction of poly(A) misalignment artifacts in A-tract regions
Robust splice junction calls from reads spanning multiple exons
Detection of alternative polyadenylation (APA) with cluster-level resolution
Differential expression analysis at gene and isoform levels
Optional NET-seq-informed refinement for A-tract ambiguity

Quick Start

Installation

# Via PyPI
pip install rectify-rna

# With visualization support (metagene plots, genome figures)
pip install "rectify-rna[visualize]"

# Via Conda (includes MEME Suite for motif discovery)
conda install -c conda-forge -c bioconda rectify-rna

Basic Usage

# Correct 3' ends from FASTQ (bundled yeast genome — no external files needed)
rectify correct reads.fastq.gz --organism yeast -o corrected.tsv

# Full pipeline: alignment → correction → analysis
rectify run reads.bam --genome genome.fa --annotation genes.gtf --output-dir results/

# Process NET-seq data (nascent RNA 3' ends)
rectify netseq netseq.bam --genome genome.fa --gff genes.gff -o netseq_output/

How It Works

RECTIFY reconstructs true RNA 3' and 5' ends through four sequential corrections, each addressing a specific alignment artifact.

1. 3' End Walk-Back: Recovering the True CPA Site

When poly(A) tails align to genomic A-tracts, aligners introduce indels and spurious splice junctions (N operations) to maximize alignment score, shifting the apparent 3' end far downstream of the true cleavage site. RECTIFY walks backward from the soft-clip boundary, skipping A's, deletions, T sequencing errors, and any intron-skip (N) operations it encounters, until it finds the first non-A/T agreement between genome and read — the true CPA site.

[Figure: 3' End Walk-Back Correction]

Why simple poly(A) trimming fails: The boundary between genomic A's and tail A's is ambiguous in A-tract regions. RECTIFY's walk-back algorithm handles deletions, T sequencing errors, and false splice junctions within the A-tract, recovering the true CPA position even when the aligner has spread the poly(A) signal across multiple genomic A-runs or introduced spurious N operations to reach downstream A-tracts. For minus-strand genes, the poly(A) tail appears as a poly(T) prefix extending leftward — RECTIFY applies identical logic in reverse orientation.

False junction cleanup is built-in: Poly(A) tails can cause aligners to introduce skip (N) operations to reach downstream A-tracts, creating spurious splice junctions. The same walk-back that corrects indel artifacts transparently absorbs these N operations — they require no separate detection step.

[Figure: False Junction Walk-Back]
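The walk-back rule can be illustrated with a minimal Python sketch. It is not RECTIFY's implementation: the column representation, the function name `walk_back_three_prime`, and the plus-strand-only handling are assumptions made for illustration. Each alignment column is a `(ref_pos, ref_base, read_base)` tuple; a deletion has `read_base=None`, an insertion has `ref_pos=None`, and an intron skip (N) simply appears as a jump in `ref_pos`, so it needs no special case.

```python
from typing import List, Optional, Tuple

# One alignment column: (ref_pos, ref_base, read_base).
# Deletion: read_base is None. Insertion: ref_pos and ref_base are None.
# Hypothetical representation, chosen only for this example.
Column = Tuple[Optional[int], Optional[str], Optional[str]]


def walk_back_three_prime(columns: List[Column]) -> Optional[int]:
    """Scan a plus-strand alignment from its 3' end toward its 5' end and
    return the reference position of the first non-A/T base on which
    genome and read agree. Returns None if no such base exists."""
    for ref_pos, ref_base, read_base in reversed(columns):
        if ref_pos is None or read_base is None:
            continue  # insertion or deletion: skip
        if read_base in "AT":
            continue  # tail/genomic A or T sequencing error: skip
        if read_base == ref_base:
            return ref_pos  # first non-A/T agreement
        # non-A/T mismatch: not an agreement, keep walking
    return None


# Toy alignment: the aligner matched tail A's to an A-tract, introduced a
# deletion, and then jumped (N operation) to a second A-tract at 200-202.
columns: List[Column] = [
    (100, "G", "G"),
    (101, "C", "C"),
    (102, "A", "A"),
    (103, "A", "A"),
    (104, "A", None),   # deletion inside the A-tract
    (105, "A", "A"),
    # N operation: reference positions 106-199 are skipped
    (200, "A", "A"),
    (201, "A", "T"),    # T sequencing error inside the tail
    (202, "A", "A"),
]

corrected = walk_back_three_prime(columns)
print(f"apparent 3' end: 202, corrected 3' end: {corrected}")
```

In this toy case the apparent end at position 202 walks back across the spurious junction to position 101. A minus-strand version would scan in the opposite direction over a poly(T) prefix.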

2. 5' End Junction Rescue: Recovering Soft-Clipped Bases at Splice Sites

Nanopore reads that begin near a splice junction frequently have their 5'-most bases soft-clipped rather than placed in the upstream exon. RECTIFY identifies these soft-clipped sequences, locates the nearest annotated donor site, and extends the alignment through the intron to recover the true transcription start position.

[Figure: 5' End Junction Rescue]
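A minimal sketch of the junction rescue idea, under simplifying assumptions: plus strand only, 0-based coordinates, annotated introns given as half-open `(intron_start, intron_end)` intervals, and a rescue attempted only when the aligned portion starts exactly at the end of an annotated intron. The function name `rescue_five_prime` and the `max_mismatch` parameter are hypothetical and are not part of the RECTIFY API.

```python
from typing import List, Optional, Tuple


def rescue_five_prime(
    genome: str,
    aln_start: int,
    clipped: str,
    introns: List[Tuple[int, int]],
    max_mismatch: int = 0,
) -> Optional[int]:
    """Try to place 5' soft-clipped bases at the end of the upstream exon.

    Returns the rescued 5' coordinate (0-based) or None if no annotated
    intron ends at aln_start or the clipped bases do not match."""
    for intron_start, intron_end in introns:
        if intron_end != aln_start:
            continue  # alignment does not begin at this junction
        exon_from = intron_start - len(clipped)
        if exon_from < 0:
            continue  # clipped sequence is longer than the available exon
        upstream = genome[exon_from:intron_start]
        mismatches = sum(a != b for a, b in zip(upstream, clipped))
        if mismatches <= max_mismatch:
            return exon_from
    return None


# Toy genome: upstream exon, intron (GT...AG), downstream exon.
upstream_exon = "ACGTTGCA"        # positions 0-7
intron = "GTAAGTTTTAG"            # positions 8-18
downstream_exon = "CCGGATTC"      # positions 19-26
genome = upstream_exon + intron + downstream_exon

# The read aligned from position 19 with "TGCA" soft-clipped at its 5' end.
rescued = rescue_five_prime(genome, aln_start=19, clipped="TGCA", introns=[(8, 19)])
print(rescued)
```

Here the four clipped bases match the last four bases of the upstream exon, so the 5' end moves from position 19 to position 4 across the annotated intron.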

3. Soft-Clip Rescue: Recovering 5' Bases at Homopolymer Boundaries

Nanopore basecallers systematically under-call homopolymer runs. At CPA sites with upstream T-rich regions, this causes the aligner to soft-clip non-T bases rather than place them in the correct exon. RECTIFY identifies these soft-clipped sequences, skips the reference homopolymer bases that were left unaligned, and matches the clipped bases to the reference positions downstream of the run.

[Figure: Soft-Clip Rescue at Homopolymer Boundaries]

This correction is especially critical for detecting true 3' ends in regions where weak basecalling and homopolymer under-calling create false soft-clip boundaries.
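The skip-then-match step can be sketched as follows. This is an illustration under stated assumptions, not RECTIFY's code: it handles only the plus strand, uses 0-based coordinates with an exclusive alignment end, requires an exact match of the clipped bases, and the name `rescue_past_homopolymer` is hypothetical.

```python
from typing import Optional


def rescue_past_homopolymer(genome: str, aln_end: int, clipped: str) -> Optional[int]:
    """Given an alignment that stops inside a reference homopolymer
    (aln_end is exclusive), skip the rest of the run and test whether the
    3' soft-clipped bases match the reference just past it.

    Returns the new exclusive alignment end, or None if nothing matches."""
    if aln_end <= 0 or aln_end > len(genome) or not clipped:
        return None
    run_base = genome[aln_end - 1]  # base of the run the alignment ended in
    pos = aln_end
    while pos < len(genome) and genome[pos] == run_base:
        pos += 1  # skip reference homopolymer bases the read under-called
    if genome[pos:pos + len(clipped)] == clipped:
        return pos + len(clipped)
    return None


# Reference has six T's (positions 4-9); the read was basecalled with four,
# so the aligner stopped at position 8 and soft-clipped "GCA".
genome = "ACGGTTTTTTGCAAC"
new_end = rescue_past_homopolymer(genome, aln_end=8, clipped="GCA")
print(new_end)
```

The two unaligned reference T's are skipped and the clipped bases are placed at positions 10 to 12, moving the alignment end from 8 to 13.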

4. Multi-Aligner Rectification: Selecting the Optimal Junction Set

Different aligners make different tradeoffs at splice junctions. RECTIFY runs three aligners in parallel (minimap2, mapPacBio, gapmm2), applies soft-clip rescue to all outputs, scores each alignment by canonical splice sites and annotation matches, and selects the optimal rectified alignment per read.

[Figure: Multi-Aligner Rectification Pipeline]

Scoring criteria: Each alignment is scored by (1) number of GT-AG canonical junctions, (2) matches to annotated junctions in the provided GFF/GTF, and (3) remaining soft-clip length. The highest-scoring alignment is written to the output BAM.
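To make the three criteria concrete, here is a small Python sketch of per-read candidate selection. The README names the criteria but not how they are combined, so the linear score and the weights (`w_canonical`, `w_annotated`, `w_clip`) are assumptions for illustration, as are the `Junction` and `Candidate` types.

```python
from dataclasses import dataclass
from typing import List, Set, Tuple


@dataclass(frozen=True)
class Junction:
    start: int   # intron start
    end: int     # intron end
    motif: str   # donor-acceptor dinucleotides, e.g. "GT-AG"


@dataclass
class Candidate:
    aligner: str
    junctions: List[Junction]
    soft_clip: int  # soft-clipped bases remaining after rescue


def score(
    cand: Candidate,
    annotated: Set[Tuple[int, int]],
    w_canonical: float = 2.0,   # hypothetical weight
    w_annotated: float = 3.0,   # hypothetical weight
    w_clip: float = 0.1,        # hypothetical penalty per clipped base
) -> float:
    """Reward canonical and annotated junctions, penalize leftover soft-clip."""
    n_canonical = sum(j.motif == "GT-AG" for j in cand.junctions)
    n_annotated = sum((j.start, j.end) in annotated for j in cand.junctions)
    return w_canonical * n_canonical + w_annotated * n_annotated - w_clip * cand.soft_clip


def pick_best(cands: List[Candidate], annotated: Set[Tuple[int, int]]) -> Candidate:
    """Return the highest-scoring candidate alignment for one read."""
    return max(cands, key=lambda c: score(c, annotated))


annotated = {(1000, 1300)}
candidates = [
    Candidate("minimap2", [Junction(1000, 1300, "GT-AG")], soft_clip=12),
    Candidate("mapPacBio", [Junction(1002, 1300, "AT-AG")], soft_clip=0),
    Candidate("gapmm2", [Junction(1000, 1300, "GT-AG")], soft_clip=2),
]
best = pick_best(candidates, annotated)
print(best.aligner)
```

With these toy inputs, two candidates recover the annotated canonical junction, and the one with fewer remaining soft-clipped bases is selected.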

# Multi-aligner rectification (default, DRS-optimized)
rectify align reads.fastq.gz --genome genome.fa --annotation genes.gff -o aligned.bam

# Single-aligner mode (faster, less accurate)
rectify align reads.fastq.gz --genome genome.fa --aligner minimap2 -o aligned.bam

Key Features

Feature	Benefit
Multi-Aligner Rectification	Runs minimap2, mapPacBio, gapmm2, scores each alignment, and selects the optimal rectified result per read
5' End Junction Recovery	Rescues soft-clipped bases by extending alignments through known splice junctions
3' End Walk-Back	Walks backward from soft-clip boundary to recover true CPA site, transparently absorbing indels, T sequencing errors, and spurious splice junctions (N ops) in a single pass
Junction Ambiguity Resolution	Resolves reads matching multiple junctions using proportional assignment
Poly(A) Measurement	Reports tail length including both aligned and soft-clipped bases
NET-seq Refinement	Uses nascent RNA 3' ends to deconvolve A-tract ambiguity (optional)
Adaptive Clustering	Groups nearby CPA sites using valley-based peak detection
Dual-Resolution Differential Expression	DESeq2 at both gene level and cluster (isoform) level
APA Shift Analysis	Detects significant proximal/distal CPA site usage changes
Visualization	Metagene plots and genome browser figures (`pip install rectify-rna[visualize]`)
Bundled Yeast Data	S288C genome, SGD annotations, GO terms, WT NET-seq, 64K pre-computed A-tract CPA sites
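
One common way to implement the proportional assignment mentioned under Junction Ambiguity Resolution is to split each ambiguous read among its candidate junctions in proportion to the support those junctions receive from unambiguous reads. The sketch below assumes that scheme; the README does not specify the exact rule RECTIFY uses, and the function name `proportional_assign` and the even-split fallback are hypothetical.

```python
from typing import Dict, Iterable, Sequence


def proportional_assign(
    unique_counts: Dict[str, int],
    ambiguous_reads: Iterable[Sequence[str]],
) -> Dict[str, float]:
    """Distribute each ambiguous read across its candidate junctions,
    weighted by the number of unambiguous reads supporting each candidate.
    If no candidate has unique support, split the read evenly."""
    totals: Dict[str, float] = {j: float(c) for j, c in unique_counts.items()}
    for candidates in ambiguous_reads:
        support = [unique_counts.get(j, 0) for j in candidates]
        denom = sum(support)
        for junction, s in zip(candidates, support):
            share = s / denom if denom else 1.0 / len(candidates)
            totals[junction] = totals.get(junction, 0.0) + share
    return totals


unique_counts = {"J1": 30, "J2": 10}
ambiguous = [("J1", "J2")] * 4   # four reads compatible with both junctions
totals = proportional_assign(unique_counts, ambiguous)
print(totals)
```

Each ambiguous read contributes 0.75 to J1 and 0.25 to J2, so the four reads add 3 and 1 to the respective totals.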

Output and Results

Each read receives a corrected position with confidence scoring:

read_id   │ chrom │ strand │ original │ corrected │ shift │ confidence │ polya_len │ qc_flags
read001   │ chrI  │   +    │  147592  │   147585  │  -7   │    HIGH    │    42     │   PASS
read002   │ chrI  │   +    │  147594  │   147591  │  -3   │   MEDIUM   │    38     │   PASS
read003   │ chrII │   +    │  283109  │   283104  │  -5   │    LOW     │    31     │ AG_RICH
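
A short sketch of how this per-read table might be consumed downstream. It assumes that `corrected.tsv` is tab-delimited and carries the column names shown above (the box-drawing separators are only for display); verify both against your own output before relying on it. The helper name `read_corrections` is hypothetical.

```python
import csv
import io
from typing import Dict, List, TextIO

# In-memory stand-in for corrected.tsv, built from the rows shown above.
SAMPLE = "\n".join(
    "\t".join(row)
    for row in [
        ["read_id", "chrom", "strand", "original", "corrected",
         "shift", "confidence", "polya_len", "qc_flags"],
        ["read001", "chrI", "+", "147592", "147585", "-7", "HIGH", "42", "PASS"],
        ["read002", "chrI", "+", "147594", "147591", "-3", "MEDIUM", "38", "PASS"],
        ["read003", "chrII", "+", "283109", "283104", "-5", "LOW", "31", "AG_RICH"],
    ]
)


def read_corrections(handle: TextIO) -> List[Dict[str, object]]:
    """Parse the per-read table and convert numeric columns to int."""
    rows: List[Dict[str, object]] = []
    for rec in csv.DictReader(handle, delimiter="\t"):
        for key in ("original", "corrected", "shift", "polya_len"):
            rec[key] = int(rec[key])
        rows.append(rec)
    return rows


rows = read_corrections(io.StringIO(SAMPLE))

# Keep reads with HIGH or MEDIUM confidence that passed QC.
kept = [
    r for r in rows
    if r["confidence"] in {"HIGH", "MEDIUM"} and r["qc_flags"] == "PASS"
]
for r in kept:
    print(r["read_id"], r["chrom"], r["corrected"], r["shift"])
```

For a real file, replace `io.StringIO(SAMPLE)` with `open("corrected.tsv", newline="")`. In the rows shown, `shift` equals `corrected - original`.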

The rectify analyze command produces:

clusters.tsv — CPA site clusters with read counts per condition
deseq2_gene_results.tsv — Differential expression at gene level
deseq2_cluster_results.tsv — Differential expression at cluster (isoform) level
shift_results.tsv — Genes with statistically significant APA shifts
go_enrichment.tsv — GO term enrichment on shifted genes
motif_results/ — Enriched sequence motifs near CPA sites

NET-seq Refinement (Optional)

For organisms with nascent RNA (NET-seq) data, RECTIFY resolves the ambiguity that remains within A-tracts. NET-seq samples RNA still attached to polymerase, providing a reference for true CPA positions. Because nascent RNA is oligo-adenylated after capture, its apparent 3' ends are themselves spread across adjacent genomic A's. RECTIFY removes this spread by non-negative least squares (NNLS) deconvolution, using a point-spread function derived from more than 5,000 zero-A calibration sites, to recover true CPA positions.

[Figure: Oligo(A) Spreading Artifact]

[Figure: Oligo(A) Deconvolution]

For S. cerevisiae, bundled WT NET-seq data is auto-detected. For other organisms or mutant conditions, provide NET-seq bigWigs with the --netseq-dir flag.
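
The deconvolution idea can be sketched in a few lines of dependency-free Python. Several things here are assumptions for illustration: the point-spread function values are made up, the signal is modeled on the plus strand as a full linear convolution (`observed = true * psf`), and a simple projected gradient descent is swapped in for a library NNLS solver. It shows the non-negativity-constrained least-squares problem, not RECTIFY's actual solver or calibration.

```python
from typing import List


def convolve(signal: List[float], psf: List[float]) -> List[float]:
    """Full linear convolution: spreads each true site over len(psf) positions."""
    out = [0.0] * (len(signal) + len(psf) - 1)
    for i, s in enumerate(signal):
        for k, p in enumerate(psf):
            out[i + k] += s * p
    return out


def correlate(residual: List[float], psf: List[float], n: int) -> List[float]:
    """Adjoint of convolve (A^T r), needed for the least-squares gradient."""
    return [sum(p * residual[i + k] for k, p in enumerate(psf)) for i in range(n)]


def nnls_deconvolve(observed: List[float], psf: List[float], iterations: int = 500) -> List[float]:
    """Minimize ||convolve(x, psf) - observed||^2 subject to x >= 0
    by projected gradient descent (illustrative stand-in for an NNLS solver)."""
    n = len(observed) - len(psf) + 1
    # (sum of |psf|)^2 bounds the largest eigenvalue of A^T A, so this step is safe.
    step = 1.0 / (sum(abs(p) for p in psf) ** 2)
    x = [0.0] * n
    for _ in range(iterations):
        residual = [m - o for m, o in zip(convolve(x, psf), observed)]
        grad = correlate(residual, psf, n)
        x = [max(0.0, xi - step * g) for xi, g in zip(x, grad)]  # project onto x >= 0
    return x


# Hypothetical PSF: 60% of signal at the true site, 30% and 10% shifted by 1 and 2 nt.
psf = [0.6, 0.3, 0.1]
true_sites = [0.0, 10.0, 0.0, 0.0, 5.0, 0.0]
observed = convolve(true_sites, psf)

estimate = nnls_deconvolve(observed, psf)
print([round(v, 3) for v in estimate])
```

On this noise-free toy input the estimate converges to the original two sites. Real NET-seq signal is noisy, so a production solver would also need to handle regularization and strand orientation.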

Commands Reference

Command	Purpose
`rectify correct`	Correct 3' end positions (indel correction + A-tract resolution)
`rectify analyze`	Downstream analysis (clustering, DESeq2, GO enrichment, motifs)
`rectify export`	Export corrected positions to bigWig/bedGraph tracks
`rectify extract`	Extract per-read 5'/3' ends and junctions to TSV
`rectify aggregate`	Group reads into 3'/5'/junction dataset files
`rectify align`	Align FASTQ with multi-aligner rectification
`rectify netseq`	Process NET-seq BAM files (3' extraction + deconvolution)
`rectify run`	Full pipeline: align (if FASTQ) → correct → analyze
`rectify run-all`	Full pipeline with provenance tracking and step-skip

Usage examples

# Correct 3' ends (bundled yeast genome, no external files needed)
rectify correct reads.fastq.gz --organism yeast -o corrected.tsv

# Correct with custom genome and optional NET-seq deconvolution
rectify correct reads.bam --genome genome.fa --netseq-dir my_netseq/ -o corrected.tsv

# Extract per-read features (5'/3' ends, junctions) to TSV
rectify extract reads.bam -o reads.tsv --genome genome.fa --annotation genes.gff

# Aggregate into separate 3'/5'/junction datasets by condition
rectify aggregate reads.bam -o aggregated/ --annotation genes.gff --mode all

# Differential expression analysis (gene and cluster level)
rectify analyze corrected.tsv --annotation genes.gtf --output-dir results/

# Export corrected positions as genome browser tracks
rectify export corrected.tsv -o tracks/ --genome genome.fa

# Complete pipeline from reads to differential expression
rectify run reads.bam --genome genome.fa --annotation genes.gtf --output-dir results/

# Process NET-seq data (nascent RNA 3' ends for A-tract refinement)
rectify netseq netseq.bam --genome genome.fa --gff genes.gff -o netseq_output/

Supported Technologies

Direct RNA sequencing: Nanopore direct RNA-seq (DRS)
Short-read quantification: QuantSeq (oligo-dT), PacBio Iso-Seq, NET-seq
General: Any poly(A)-tailed RNA-seq platform

Citation

Please cite RECTIFY if you use it in your research:

Roy KR, Chanfreau GF. Robust mapping of polyadenylated and non-polyadenylated RNA 3' ends at nucleotide resolution by 3'-end sequencing. Methods. 2020;176:4-13. PMID: 31128237

RECTIFY 2.0: Manuscript in preparation.

License

MIT — see LICENSE for details.

Contact

Kevin R. Roy
Email: kevinrjroy@gmail.com
GitHub: k-roy/RECTIFY
