Unified RNA 3' End Correction Framework for poly(A)-tailed RNA sequencing

These details have not been verified by PyPI

Project links

Project description

RECTIFY

RNA 5' and 3' End Correction Tool with Intron reFinement and ambiguitY resolution

Precision transcript structure mapping for direct RNA nanopore sequencing data. Accurate 5' ends, 3' ends, and splice junctions through refinement of alignments at transcript termini, correction of artifacts due to sequencing errors, and selection of the optimal alignments from a panel of aligners (minimap2, gapmm2, and mapPacBio).

Quick Start

pip install rectify-rna

# Single sample — bundled yeast genome, no external files needed
rectify run-all reads.fastq.gz --Scer -o results/

# Multiple samples via manifest (typical usage)
rectify run-all --manifest samples.tsv --Scer -o results/

The manifest samples.tsv is a tab-separated file (no header):

wt_rep1.fastq.gz      WT    rep1
wt_rep2.fastq.gz      WT    rep2
wt_rep3.fastq.gz      WT    rep3
rna15_rep1.fastq.gz   rna15 rep1
rna15_rep2.fastq.gz   rna15 rep2
rna15_rep3.fastq.gz   rna15 rep3

Columns: filename (resolved relative to the manifest), group (condition label for DESeq2 contrasts), bio_rep.

Outputs (results/):

results/
├── <sample_id>/                        # Per-sample
│   ├── <sample_id>.consensus.bam       # Triple-aligner consensus alignment
│   ├── corrected_3ends.tsv             # Per-read corrected 3' ends, confidence, poly(A) length, fraction
│   ├── corrected_3ends_stats.tsv       # Correction statistics
│   ├── corrected_3ends_report.html     # Per-sample QC report
│   ├── junctions/junctions.tsv         # Splice junctions with partial-rescue evidence
│   └── PROVENANCE.json
│
└── combined/                           # Cross-sample analysis (≥2 samples in manifest)
    ├── cpa_clusters.tsv                # CPA site clusters with per-sample read counts
    ├── tables/
    │   ├── deseq2_genes_*.tsv          # Gene-level differential expression (≥2 conditions)
    │   ├── deseq2_clusters_*.tsv       # Cluster-level differential expression (≥2 conditions)
    │   └── shift_results.tsv           # APA shift analysis (≥2 conditions)
    ├── go_enrichment.tsv               # GO enrichment for DE genes (≥2 conditions)
    ├── motif_results/                  # Enriched sequence motifs near CPA sites
    ├── report.html                     # Combined QC and results report
    └── PROVENANCE.json

Key Features

Feature	Description
Spike-in Filtering	Removes synthetic spike-in reads (e.g. ENO2) by k-mer sequence matching, restricted to the spike-in gene locus to prevent false positives
5' End Junction Recovery	Attempts to rescue 5' soft-clipped bases in each aligner's output by extending through splice junctions
3' End A-tract Estimation	Estimates true CPA position for each aligner using downstream A-tract depth; used to score and break ties in consensus selection
Multi-Aligner Consensus	Scores rescued alignments from minimap2, mapPacBio, and gapmm2 — penalizing 5' soft-clips and 3' A-tract depth — and selects the best per read
3' End Correction (Walk-back)	Refines 3' end on the consensus BAM: fixes CIGAR deletion artifacts and walks upstream to the true CPA site; false splice junctions (N operations) are discarded automatically
Poly(A) Measurement	Reports tail length (aligned + soft-clipped)
Junction Ambiguity Resolution	Resolves reads matching multiple junctions using proportional assignment
NET-seq Refinement	Resolves A-tract ambiguity using nascent RNA data; reads assigned proportionally across peaks
Adaptive Clustering	Groups CPA sites with valley-based algorithm
Dual-Resolution DESeq2	Gene-level and cluster-level differential expression
APA Shift Analysis	Detects proximal/distal CPA site usage changes
Visualization	Metagene plots, genome browser figures (`pip install rectify-rna[visualize]`)

Bundled Data (Yeast)

For S. cerevisiae, RECTIFY includes the S288C genome, a merged annotation (see below), GO terms, and pre-deconvolved WT NET-seq data (pan-mutant consensus, NNLS-deconvolved offline)—no external files needed.

Bundled annotation (--Scer) fuses four sources into a single GFF3:

Source	Features	Reference
SGD R64-5-1	All protein-coding genes, tRNAs, snoRNAs, rDNA, etc.	SGD (yeastgenome.org)
CUTs (925)	Cryptic Unstable Transcripts	Xu et al. 2009, Nature 457:1033 (PMID:19169243)
SUTs (847)	Stable Unannotated Transcripts	Xu et al. 2009, Nature 457:1033 (PMID:19169243)
XUTs (1,658)	Xrn1-sensitive Unstable Transcripts	van Dijk et al. 2011, Nature 475:114 (PMID:21697827)

CUT/SUT coordinates are SGD-curated and lifted to R64-1-1. XUT coordinates are from the original van Dijk 2011 supplementary data (R64/sacCer3). All features use Roman numeral chromosome names (chrI–chrXVI, chrMito).

3' End Correction: Indel Artifacts in Poly(A) Regions

When poly(A) tails align to genomic A-tracts, aligners like minimap2 introduce indel artifacts to maximize alignment score. This shifts the apparent 3' end downstream of the true cleavage site.

3' End Indel Correction

The Problem: The aligner introduces deletions to extend alignment of poly(A) tail bases into downstream genomic A-tracts, shifting the apparent 3' end.

RECTIFY's Solution: Walk upstream from the soft-clip boundary through the aligned region:

Skip positions where genome = A (ambiguous with poly(A) tail)
Skip deletions (D in CIGAR) - these are alignment artifacts
Skip T sequencing errors in the tail
STOP at first non-A/T agreement between genome and read

Result: True 3' end at the correct position. Poly(A) length = soft-clipped + aligned A's + absorbed deletions.

Key insight for IGV users: For minus strand reads, the poly(A) tail appears as poly(T) extending leftward. RECTIFY corrects by shifting rightward toward the true CPA.

Soft-Clip Rescue at Homopolymer Boundaries

Nanopore basecallers systematically under-call homopolymer runs (e.g., calling 8 U's instead of 10). When this happens at CPA sites with upstream T-tracts, the aligner soft-clips the non-T bases instead of placing them correctly.

Soft-Clip Rescue

The Problem: The basecaller under-calls the T-tract, so when the aligner reaches the first non-T base, it can't fit it into the shortened homopolymer and soft-clips it instead.

RECTIFY's Solution:

Detect soft-clips adjacent to homopolymer boundaries
Skip over remaining reference homopolymer bases (the under-called T's)
Match soft-clipped bases to reference sequence beyond the homopolymer
Extend the 3' end to include the rescued bases

5' End Correction: Splice Junction Soft-Clips

Long reads spanning splice junctions often have soft-clipped bases at the 5' end where the aligner fails to find the exact junction boundary. RECTIFY recovers the true splice site.

5' Junction Rescue

The Problem: Soft-clipped bases at the 5' end actually match the upstream exon, but the aligner couldn't extend through the splice junction.

RECTIFY's Solution:

Identify reads with 5' soft-clips near annotated splice sites
Check if soft-clipped sequence matches upstream exon
Extend the alignment to the canonical splice donor (GT)

Result: Accurate read 5' end and recovered splice junction. Note: Due to 5'-to-3' degradation in direct RNA sequencing, the read's 5' end often does not represent the true TSS.

Multi-Aligner Consensus Pipeline

Different aligners make different tradeoffs at splice junctions. RECTIFY runs three aligners in parallel and selects the best alignment per read.

Multi-Aligner Consensus

The Problem: Different aligners handle the same read differently. Some soft-clip at splice boundaries while others find the junction.

RECTIFY's Solution:

Run all 3 aligners (minimap2, mapPacBio, gapmm2) on the same reads
Attempt to rescue each alignment's 5' soft-clips by extending through known junctions
Score the (potentially rescued) alignments by canonical splice sites (GT-AG) and annotation matches
Select the highest-scoring alignment per read
Output: Single consensus BAM with best alignment per read

Note: 3' false junctions from poly(A) artifacts are handled separately by walk back correction (see "3' False Junction Handling" below).

Usage:

# Multi-aligner consensus alignment (default, aligners run in parallel)
rectify align reads.fastq.gz --genome genome.fa --annotation genes.gff -o aligned.bam

# Parallel aligners with proportional thread allocation (minimap2 gets fewer threads
# since it's faster; mapPacBio and gapmm2 get more to finish at the same time)
rectify align reads.fastq.gz --genome genome.fa --annotation genes.gff \
    --parallel-aligners --threads 16 -o aligned.bam

# Single aligner mode (faster, less accurate)
rectify align reads.fastq.gz --genome genome.fa --aligner minimap2 -o aligned.bam

3' False Junction Handling

Poly(A) tails can create spurious "junctions" when the aligner introduces a skip (N) operation to align tail bases to a downstream genomic A-tract. RECTIFY's walk back correction completely handles this artifact.

False Junction Walk Back

The Problem: The aligner introduces an N (skip) to extend the poly(A) tail alignment into a downstream A-tract, creating a spurious junction that doesn't exist in the transcript.

RECTIFY's Solution: The walk back algorithm finds the true 3' end by walking upstream through ALL aligned A's until it finds the first non-A agreement between genome and read. Crucially, it DISCARDS any N (skip) operations it encounters.

Result:

Walk back finds the true CPA at the EXON/A boundary
The false junction (N operation) is completely ignored
No special false junction detection needed

This means 3' false junctions are NOT a problem for downstream analysis—walk back correction handles them automatically as part of finding the true cleavage and polyadenylation site.

Adaptive Clustering and Differential Expression

After correction, RECTIFY groups nearby CPA sites into clusters using a valley-based algorithm, then runs DESeq2 at both gene and cluster resolution.

Adaptive Clustering

Algorithm:

Find peaks (local maxima in 3' end signal)
Find valleys (local minima between peaks)
Set boundaries at midpoint between peak and valley (capped at ±10bp)

Why cluster-level analysis matters:

Genes often have MULTIPLE CPA sites (alternative polyadenylation)
Conditions may shift usage between proximal/distal sites
Cluster-level DESeq2 detects isoform-specific changes that gene-level misses

Dual-resolution output:

Level	Detects	Example
Gene	Total expression changes	HSP82 is 2-fold down in heat shock
Cluster	CPA site usage changes	FAS1 shifts from distal to proximal site

Output

Each read gets a corrected position with confidence scores:

read_id   │ chrom │ strand │ original │ corrected │ shift │ confidence │ polya_len │ fraction │ qc_flags
read001   │ chrI  │   +    │  147592  │   147585  │  -7   │    HIGH    │    42     │  1.0000  │   PASS
read002   │ chrI  │   +    │  147594  │   147591  │  -3   │   MEDIUM   │    38     │  1.0000  │   PASS
read003   │ chrII │   +    │  283109  │   283104  │  -5   │    LOW     │    31     │  0.6500  │ AG_RICH

The fraction column reflects proportional NET-seq assignment: when a read falls in an A-tract with multiple NET-seq peaks, it is split across peaks rather than snapped to a single position. Fractions sum to 1.0 per input read.

The rectify analyze command produces:

clusters.tsv - CPA site clusters with read counts per sample
deseq2_gene_results.tsv - Gene-level differential expression
deseq2_cluster_results.tsv - Cluster-level differential expression
shift_results.tsv - Genes with significant APA shifts
go_enrichment.tsv - GO enrichment for DE genes
motif_results/ - Enriched sequence motifs near CPA sites

Genomic category distribution

The primary per-sample output includes a horizontal bar chart (genomic_distribution.png) showing the fraction of reads and number of unique clusters falling into each genomic category — directly comparable to the style of Xu et al. 2009 Figure 1B/C:

3' UTR            ████████████████████████████████████████  ~90%
snoRNA +/- 300 bp ██  ~2%
intergenic        ███  ~3%
CUTs              █  ~1%
SUTs / XUTs       █  ~1%
5' UTR / CDS      ██  ~2%
antisense CDS     █  <1%

The x-axis uses a broken scale so the dominant 3' UTR bar and the minority categories are both visible. Multiple conditions (e.g. WT vs rrp6Δ) are overlaid as separate bars.

Category definitions and classification priority:

Category	Feature types	Source
3' UTR	`UTR3` annotation (or 3' 10% heuristic)	SGD
snoRNA +/- 300 bp	`snoRNA_gene` ± 300 bp on either strand	SGD
CUTs	`CUT` features	Xu et al. 2009
SUTs / XUTs	`SUT` or `XUT` features	Xu 2009 / van Dijk 2011
5' UTR / CDS	`UTR5` or `CDS` annotation	SGD
antisense CDS	Position overlaps CDS on opposite strand	SGD
intergenic / intronic	None of the above	—

When a position matches multiple categories, the highest-priority rule wins (order as listed above).

The companion genomic_distribution_3prime_summary.tsv table contains:

condition  category  display_label        reads   reads_pct  clusters
WT         UTR3      3' UTR               182034  91.2       4812
WT         snoRNA    snoRNA +/- 300 bp    3940    1.97       208
...

Transcript body distribution

A second figure (genomic_distribution_body.png) classifies each read by RNA biotype based on the majority-overlap of its full alignment span with annotated features. Replicates are merged by condition; one pie chart is shown per condition.

Biotype assignments and priority:

Biotype	Feature types matched	Source
protein-coding	`mRNA`, `CDS`	SGD
CUTs	`CUT`	Xu et al. 2009
SUTs / XUTs	`SUT`, `XUT`	Xu 2009 / van Dijk 2011
snoRNA	`snoRNA_gene`, `snoRNA`	SGD
tRNA	`tRNA_gene`, `tRNA`	SGD
rRNA	`rRNA_gene`, `rRNA`	SGD
LTR / retrotransposon	`LTR_retrotransposon`, `transposable_element_gene`	SGD
pseudogene	`pseudogene`	SGD
antisense	Read span overlaps an annotated feature on opposite strand	—
intergenic	No annotated feature overlaps the alignment span	—

The read is assigned to whichever overlapping feature has the most base-pair overlap. Ties are broken by the priority order above (protein-coding > CUT > SUT_XUT > ...).

The companion genomic_distribution_body_summary.tsv table contains:

condition  category      display_label    reads   reads_pct
WT         protein_coding  protein-coding  175000  87.5
WT         CUT             CUTs             4500   2.25
...

NET-seq Refinement (Optional)

For organisms with NET-seq data, RECTIFY resolves remaining A-tract ambiguity by assigning reads proportionally across NET-seq peaks rather than snapping to a single position.

Bundled yeast data (--Scer): pre-deconvolved pan-mutant NET-seq (WT + DST1Δ consensus, NNLS deconvolution applied once offline). No runtime deconvolution — reads are assigned directly using the pre-computed signal.

Custom data: provide raw NET-seq bigWigs with --netseq-dir. NNLS deconvolution is applied at runtime to recover true CPA positions from the oligo-adenylation spreading pattern.

Installation

# PyPI
pip install rectify-rna

# With visualization support
pip install rectify-rna[visualize]

# Conda (includes MEME Suite for motif discovery)
conda install -c conda-forge -c bioconda rectify-rna

Commands

All-in-one

Command	Description
`rectify run-all`	Full pipeline: align (if FASTQ) → correct → analyze → aggregate junctions. Skips completed steps automatically on re-run.

Individual steps

Run steps independently to re-process from any point in the pipeline.

Command	Description
`rectify align`	Align FASTQ with the aligner panel (minimap2, gapmm2, mapPacBio; all with DRS-optimized settings) and select the best alignment per read
`rectify correct`	Correct 5' ends, 3' ends, and junctions — indel correction at poly(A) boundaries, A-tract ambiguity resolution, NET-seq refinement
`rectify analyze`	Downstream analysis: CPA clustering, DESeq2, GO enrichment, motif discovery
`rectify export`	Export corrected positions to bigWig/bedGraph tracks
`rectify extract`	Extract per-read features from BAM to TSV (5'/3' ends, junctions, poly(A) length)
`rectify aggregate`	Aggregate reads into 3' end, 5' end, and junction-centered datasets
`rectify netseq`	Process NET-seq BAM files (3' end extraction, NNLS deconvolution; for standalone analysis or for assigning DRS 3' ends in A-tracts to their likely CPA sites)

Examples

# Multiple samples via manifest — typical multi-condition experiment
rectify run-all --manifest samples.tsv --Scer --filter-spikein ENO2 -o results/

# Single sample, bundled yeast genome
rectify run-all reads.fastq.gz --Scer -o results/

# Non-DRS protocol where poly(A) tail is not sequenced
rectify run-all reads.fastq.gz --genome genome.fa --annotation genes.gff \
    --no-polya-sequenced -o results/

# Re-run correction only (alignment already done)
rectify correct reads.bam --genome genome.fa --netseq-dir my_netseq/ -o corrected.tsv

# Re-run analysis only (correction already done)
rectify analyze corrected.tsv --annotation genes.gff --output-dir results/

# Process NET-seq data
rectify netseq netseq.bam --genome genome.fa --gff genes.gff -o netseq_output/

NET-seq Processing

RECTIFY includes a dedicated pipeline for processing NET-seq (Native Elongating Transcript sequencing) data. NET-seq captures nascent RNA 3' ends, which undergo oligo-adenylation during library preparation, creating characteristic signal spreading at A-tract regions.

The Problem: Oligo(A) Signal Spreading

Oligo(A) Spreading

With oligo(A) tails (~5.5 bp mean), reads can prime at ANY downstream A in genomic A-tracts. This causes signal to "spread" downstream, obscuring the true CPA site.

RECTIFY's Solution: NNLS Deconvolution

Oligo(A) Deconvolution

Using the Point-Spread-Function (PSF) derived from 5000+ 0A sites (sites with no downstream genomic A's, where the true position is known):

Build convolution matrix: A[i,j] = P(observe at j | true peak at i)
Solve NNLS with L2 regularization: min ||Ax - observed||² + λ||x||²
Recover true peak positions from deconvolved signal

Result: Sharper, more accurate 3' end signal with A-tract ambiguity resolved.

NET-seq Command

# Basic usage
rectify netseq input.bam --genome genome.fa --gff genes.gff -o output/

# With exclusion region control
rectify netseq input.bam --genome genome.fa --gff genes.gff \
    --include-rdna \        # Don't exclude rDNA locus
    --include-pol3 \        # Don't exclude Pol III genes (tRNAs)
    --exclude-mito \        # Exclude mitochondrial genome
    -o output/

# Disable deconvolution (raw 3' ends only)
rectify netseq input.bam --genome genome.fa --no-deconvolution -o output/

# Process multiple samples
rectify netseq sample1.bam sample2.bam sample3.bam \
    --genome genome.fa --gff genes.gff -o output/

Output Files

File	Description
`{sample}.unified_reads.parquet`	Per-read records (25 columns, same schema as nanopore)
`{sample}.raw.plus.bedgraph`	Raw 3' ends, plus strand, RPM-normalized
`{sample}.raw.minus.bedgraph`	Raw 3' ends, minus strand, RPM-normalized
`{sample}.deconv.plus.bedgraph`	Deconvolved signal, plus strand, RPM-normalized
`{sample}.deconv.minus.bedgraph`	Deconvolved signal, minus strand, RPM-normalized

Exclusion Regions

By default, RECTIFY excludes regions with non-standard transcription:

rDNA locus (chrXII ~450,000-490,000 in yeast): Highly repetitive, Pol I transcribed
Pol III genes (tRNAs, SNR6, RDN5, RPR1, SCR1): Different transcription termination mechanism
Flanking regions (100 bp by default): Buffer around excluded genes

These regions are auto-detected from GFF annotation. Use --include-rdna and --include-pol3 flags to include them if needed.

Strand-Aware Coordinate Mapping

NET-seq reads are short (~40-76 bp) and represent the 3' end of nascent RNA:

Strand	3' end position	3' soft-clip	Oligo(A) detection
+	`reference_end - 1` (rightmost)	Right clip	Count A's
-	`reference_start` (leftmost)	Left clip	Count T's (= RNA A's)

Visualization

Install with: pip install rectify-rna[visualize]

Metagene: single condition

Aggregate 3' end signal around a set of loci (TRT sites, CPA clusters, gene TES/TSS, motif matches).

import numpy as np
import matplotlib.pyplot as plt
from rectify.visualize import (
    MetagenePipeline,
    position_index_from_tsv,
    loci_from_pickle,
    loci_from_gff,
    plot_metagene_line,
    add_metagene_annotations,
    set_publication_style,
    despine,
    WONG_COLORS,
)

# Build a position index from RECTIFY corrected 3' end TSVs
# Multiple replicates merge automatically
index, total_reads = position_index_from_tsv(
    ["wt_rep1/corrected_3ends.tsv",
     "wt_rep2/corrected_3ends.tsv",
     "wt_rep3/corrected_3ends.tsv"],
    position_col='corrected_3prime',
)

# Load loci — choose the loader that matches your input
trt_loci = loci_from_pickle("cache/trt_signals_v3.pkl")            # TRT/CPA pkl cache
tes_loci = loci_from_gff("genes.gff", feature_type='gene', center='end')   # gene 3' ends
tss_loci = loci_from_gff("genes.gff", feature_type='gene', center='start') # gene 5' ends

# Compute metagene (strand coordinate transformation + reversal handled internally)
pipeline = MetagenePipeline()
result = pipeline.compute_center_profile(
    trt_loci, index,
    window=(-50, 50),
    total_reads=total_reads,
    verify_strands=True,   # raises StrandOrientationError on strand bug — never skip this
    cap_percentile=50,     # suppress outlier loci (e.g. TDH3) before peak detection
)

set_publication_style()
fig, ax = plt.subplots(figsize=(5, 3), constrained_layout=True)

x    = result['x']
mean = np.mean(result['profile_matrix'], axis=0)
sem  = np.std(result['profile_matrix'],  axis=0) / np.sqrt(result['n_loci'])

plot_metagene_line(ax, x, mean, sem,
                   color=WONG_COLORS['blue'],
                   label=f"WT (n={result['n_loci']})")
add_metagene_annotations(ax, positions=[0], labels=["TRT start"])
ax.set_xlabel("Position (bp from TRT start)")
ax.set_ylabel("RPM")
despine(ax)
ax.legend(fontsize=8)
fig.savefig("trt_metagene_wt.png", dpi=150, bbox_inches='tight')

Always set cap_percentile=50 with window ≥ 100 bp — a single highly expressed gene can dominate the aggregate and mask the biological peak.

Metagene: multi-condition ridge plot

Compare WT vs mutants at TRT sites, stacked as a ridge plot (conditions offset vertically so shapes can be compared without overlap).

import numpy as np
import matplotlib.pyplot as plt
from rectify.visualize import (
    MetagenePipeline,
    position_index_from_tsv,
    loci_from_pickle,
    plot_ridge_profiles,
    apply_window_sum_capping,
    set_publication_style,
    WONG_COLORS,
)

trt_loci = loci_from_pickle("cache/trt_signals_v3.pkl")
pipeline  = MetagenePipeline()

# Build per-condition profile matrices
CONDITIONS = {
    'wt':     ["wt_rep1/corrected_3ends.tsv",     "wt_rep2/corrected_3ends.tsv"],
    'rna15':  ["rna15_rep1/corrected_3ends.tsv",  "rna15_rep2/corrected_3ends.tsv"],
    'ysh1':   ["ysh1_rep1/corrected_3ends.tsv",   "ysh1_rep2/corrected_3ends.tsv"],
}

profiles = {}
for cond, paths in CONDITIONS.items():
    idx, n = position_index_from_tsv(paths, position_col='corrected_3prime')
    r = pipeline.compute_center_profile(
        trt_loci, idx, window=(-50, 50), total_reads=n,
        verify_strands=True, cap_percentile=50,
    )
    profiles[cond] = r['profile_matrix']

# Optional: cap outliers consistently across conditions before ridge plot
profiles_capped = {k: apply_window_sum_capping(v, percentile=90)[0]
                   for k, v in profiles.items()}

x = np.arange(-50, 51)
COLORS = {'wt': WONG_COLORS['blue'], 'rna15': WONG_COLORS['vermillion'],
          'ysh1': WONG_COLORS['green']}
LABELS = {'wt': 'WT', 'rna15': 'RNA15-AA', 'ysh1': 'YSH1-AA'}

set_publication_style()
fig, ax = plt.subplots(figsize=(6, 5), constrained_layout=True)
plot_ridge_profiles(ax, x, profiles_capped, colors=COLORS, labels=LABELS,
                    order=['wt', 'rna15', 'ysh1'])
ax.set_xlabel("Position (bp from TRT start)")
ax.axvline(0, color='#CC0000', linestyle='--', linewidth=0.8, alpha=0.7)
fig.savefig("trt_metagene_ridge.png", dpi=150, bbox_inches='tight')

Stacked read browser

Genome-browser-style view of individual nanopore reads, colored by transcript category. Uses LineCollection for fast batch rendering — 400 reads × 5 conditions renders as ~30 draw calls instead of 2000.

import pandas as pd
import matplotlib.pyplot as plt
from rectify.visualize import plot_stacked_read_panel

# ── Read classification is your responsibility (domain-specific) ───────────────
CAT_COLORS = {
    'CDS-internal':   '#d62728',  # red
    '3UTR-premature': '#ff9896',  # coral
    'intronic':       '#9467bd',  # purple
    'canonical':      '#2ca02c',  # green
    'readthrough':    '#e6ac00',  # gold
    'other':          '#aaaaaa',  # gray
}

# Load reads from RECTIFY TSV, filter to gene region
reads_df = pd.read_csv("wt_rep1/corrected_3ends.tsv", sep='\t',
                       usecols=['read_id', 'chrom', 'strand',
                                'corrected_3prime', 'five_prime_position',
                                'alignment_start', 'alignment_end',
                                'qc_flags', 'junctions'])
reads_df = reads_df[
    (reads_df['qc_flags'] == 'PASS') &
    (reads_df['chrom'] == 'chrXIV') &
    (reads_df['strand'] == '+') &
    (reads_df['alignment_start'] < 417000) &
    (reads_df['alignment_end']   > 413000)
]

# Sort by 3' end so reads with the same termination site appear contiguous
reads_df = reads_df.sort_values('corrected_3prime')

# Classify and map to color (your logic here)
reads_df['color'] = reads_df.apply(classify_read, axis=1).map(CAT_COLORS)

# ── Draw ───────────────────────────────────────────────────────────────────────
fig, axes = plt.subplots(
    3, 1,                                         # gene track + 2 conditions
    figsize=(14, 8),
    gridspec_kw={'height_ratios': [1, 3, 3]},
    constrained_layout=True,                      # NOT tight_layout()
)

# condition panel
n_rows = plot_stacked_read_panel(
    axes[1], reads_df,
    color_col='color',
    start_col='alignment_start',
    end_col='alignment_end',
    junction_col='junctions',  # "start-end,start-end" RECTIFY format
    gap=50,
)

# mark internal pA sites
for pos in [414500, 415200]:
    axes[1].axvline(pos, color='#d62728', linestyle='--', linewidth=1.0, alpha=0.8)

axes[1].set_xlim(413000, 417000)
axes[1].set_ylabel("WT", rotation=0, ha='right', va='center')

Locus loaders

Function	Input	Use for
`loci_from_pickle(path)`	`.pkl` cache	TRT/CPA caches
`loci_from_gff(path, feature_type, center)`	GFF3	TSS (`center='start'`), TES (`center='end'`)
`loci_from_motif_scan(sequences, motif)`	dict of sequences	Regex motif, both strands (e.g. `T{6,}`)
`loci_from_bed(path, center)`	BED file	Custom interval lists
`loci_from_tsv(path)`	TSV with chrom/strand/center	Pre-computed loci
`position_index_from_tsv(paths)`	RECTIFY `corrected_3ends.tsv`	3' end signal
`position_index_from_bigwig(path)`	bigWig	NET-seq, PAR-CLIP, any coverage

Figure utilities

from rectify.visualize import set_publication_style, save_multi_format, WONG_COLORS

set_publication_style()   # consistent fonts, tick sizes, rc params for all figures

# Color-blind-safe palette (use for all new figures)
blue       = WONG_COLORS['blue']        # #0072B2
orange     = WONG_COLORS['orange']      # #E69F00
green      = WONG_COLORS['green']       # #009E73
vermillion = WONG_COLORS['vermillion']  # #D55E00

# Save PNG + PDF + SVG in one call
save_multi_format(fig, "figures/fig1_trt_metagene")
# → figures/fig1_trt_metagene.png, .pdf, .svg

See PLOT_SKILLS.md for the full API reference and list of pitfalls.

Supported Technologies

Nanopore direct RNA-seq (minimap2)
QuantSeq (oligo-dT short-read)
PacBio Iso-Seq
NET-seq (nascent RNA 3' end sequencing)
Any poly(A)-tailed RNA-seq

Citation

Original RECTIFY:

Roy KR, Chanfreau GF. Robust mapping of polyadenylated and non-polyadenylated RNA 3' ends at nucleotide resolution by 3'-end sequencing. Methods. 2020;176:4-13. PMID: 31128237

RECTIFY 2.0: Manuscript in preparation

License

MIT License - See LICENSE for details

Contact

Kevin R. Roy - kevinrjroy@gmail.com
GitHub: k-roy/RECTIFY

Project details

These details have not been verified by PyPI

Project links

Release history Release notifications | RSS feed

2.7.8

Apr 9, 2026

This version

2.7.7

Apr 5, 2026

2.3.0

Mar 19, 2026

2.2.0

Mar 19, 2026

2.1.3

Mar 18, 2026

2.1.2

Mar 18, 2026

2.1.1

Mar 17, 2026

2.1.0

Mar 17, 2026

Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

rectify_rna-2.7.7.tar.gz (94.1 MB view details)

Uploaded Apr 5, 2026 Source

Built Distribution

If you're not sure about the file name format, learn more about wheel file names.

The dropdown lists show the available interpreters, ABIs, and platforms. Enable javascript to be able to filter the list of wheel files.

rectify_rna-2.7.7-py3-none-any.whl (94.1 MB view details)

Uploaded Apr 5, 2026 Python 3

File details

Details for the file rectify_rna-2.7.7.tar.gz.

File metadata

Download URL: rectify_rna-2.7.7.tar.gz
Upload date: Apr 5, 2026
Size: 94.1 MB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.2.0 CPython/3.12.2

File hashes

Hashes for rectify_rna-2.7.7.tar.gz
Algorithm	Hash digest
SHA256	`4a59308b15ec883e6e245e31f506c08fbeb424b634296052a21963caada647e9`
MD5	`f27c93ac9154283d12de521a5076c9a8`
BLAKE2b-256	`8972a65ab14b04cdecf0225872ea5e95a1f6e3b4569e50bb2e02b3c7aed6ced8`

See more details on using hashes here.

File details

Details for the file rectify_rna-2.7.7-py3-none-any.whl.

File metadata

Download URL: rectify_rna-2.7.7-py3-none-any.whl
Upload date: Apr 5, 2026
Size: 94.1 MB
Tags: Python 3
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.2.0 CPython/3.12.2

File hashes

Hashes for rectify_rna-2.7.7-py3-none-any.whl
Algorithm	Hash digest
SHA256	`61e6e34cfd72e1dfdd092f8888cde4d599573efbc323d7ce0b89ade319955090`
MD5	`a4d41c3f5a1efa5b3b4029b9b7ace2cb`
BLAKE2b-256	`fa11d268fe34032a856a2052f9433c6f08e118d4510c535bfd845b7e732fe8cc`

See more details on using hashes here.

rectify-rna 2.7.7

Navigation

Verified details

Maintainers

Unverified details

Project links

Meta

Classifiers

Project description

RECTIFY

Quick Start

Key Features

Bundled Data (Yeast)

3' End Correction: Indel Artifacts in Poly(A) Regions

Soft-Clip Rescue at Homopolymer Boundaries

5' End Correction: Splice Junction Soft-Clips

Multi-Aligner Consensus Pipeline

3' False Junction Handling

Adaptive Clustering and Differential Expression

Output

Genomic category distribution

Transcript body distribution

NET-seq Refinement (Optional)

Installation

Commands

All-in-one

Individual steps

Examples

NET-seq Processing

The Problem: Oligo(A) Signal Spreading

RECTIFY's Solution: NNLS Deconvolution

NET-seq Command

Output Files

Exclusion Regions

Strand-Aware Coordinate Mapping

Visualization

Metagene: single condition

Metagene: multi-condition ridge plot

Stacked read browser

Locus loaders

Figure utilities

Supported Technologies

Citation

License

Contact

Project details

Verified details

Maintainers

Unverified details

Project links

Meta

Classifiers

Release history Release notifications | RSS feed

Download files

Source Distribution

Built Distribution

File details

File metadata

File hashes

File details

File metadata

File hashes