Skip to main content

Unified RNA 3' End Correction Framework for poly(A)-tailed RNA sequencing

Project description

RECTIFY

RNA 5' and 3' End Correction Tool with Intron reFinement and ambiguitY resolution

PyPI version License: MIT Python 3.8+

Precision transcript structure mapping for direct RNA nanopore sequencing data. Accurate 5' ends, 3' ends, and splice junctions through refinement of alignments at transcript termini, correction of artifacts due to sequencing errors, and selection of the optimal alignments from a panel of aligners (minimap2, gapmm2, and mapPacBio).


Quick Start

pip install rectify-rna

# Single sample — bundled yeast genome, no external files needed
rectify run-all reads.fastq.gz --Scer -o results/

# Multiple samples via manifest (typical usage)
rectify run-all --manifest samples.tsv --Scer -o results/

The manifest samples.tsv is a tab-separated file (no header):

wt_rep1.fastq.gz      WT    rep1
wt_rep2.fastq.gz      WT    rep2
wt_rep3.fastq.gz      WT    rep3
rna15_rep1.fastq.gz   rna15 rep1
rna15_rep2.fastq.gz   rna15 rep2
rna15_rep3.fastq.gz   rna15 rep3

Columns: filename (resolved relative to the manifest), group (condition label for DESeq2 contrasts), bio_rep.

Outputs (results/):

results/
├── <sample_id>/                        # Per-sample
│   ├── <sample_id>.consensus.bam       # Triple-aligner consensus alignment
│   ├── corrected_3ends.tsv             # Per-read corrected 3' ends, confidence, poly(A) length, fraction
│   ├── corrected_3ends_stats.tsv       # Correction statistics
│   ├── corrected_3ends_report.html     # Per-sample QC report
│   ├── junctions/junctions.tsv         # Splice junctions with partial-rescue evidence
│   └── PROVENANCE.json
│
└── combined/                           # Cross-sample analysis (≥2 samples in manifest)
    ├── cpa_clusters.tsv                # CPA site clusters with per-sample read counts
    ├── tables/
    │   ├── deseq2_genes_*.tsv          # Gene-level differential expression (≥2 conditions)
    │   ├── deseq2_clusters_*.tsv       # Cluster-level differential expression (≥2 conditions)
    │   └── shift_results.tsv           # APA shift analysis (≥2 conditions)
    ├── go_enrichment.tsv               # GO enrichment for DE genes (≥2 conditions)
    ├── motif_results/                  # Enriched sequence motifs near CPA sites
    ├── report.html                     # Combined QC and results report
    └── PROVENANCE.json

Key Features

Feature Description
Spike-in Filtering Removes synthetic spike-in reads (e.g. ENO2) by k-mer sequence matching, restricted to the spike-in gene locus to prevent false positives
5' End Junction Recovery Attempts to rescue 5' soft-clipped bases in each aligner's output by extending through splice junctions
3' End A-tract Estimation Estimates true CPA position for each aligner using downstream A-tract depth; used to score and break ties in consensus selection
Multi-Aligner Consensus Scores rescued alignments from minimap2, mapPacBio, and gapmm2 — penalizing 5' soft-clips and 3' A-tract depth — and selects the best per read
3' End Correction (Walk-back) Refines 3' end on the consensus BAM: fixes CIGAR deletion artifacts and walks upstream to the true CPA site; false splice junctions (N operations) are discarded automatically
Poly(A) Measurement Reports tail length (aligned + soft-clipped)
Junction Ambiguity Resolution Resolves reads matching multiple junctions using proportional assignment
NET-seq Refinement Resolves A-tract ambiguity using nascent RNA data; reads assigned proportionally across peaks
Adaptive Clustering Groups CPA sites with valley-based algorithm
Dual-Resolution DESeq2 Gene-level and cluster-level differential expression
APA Shift Analysis Detects proximal/distal CPA site usage changes
Visualization Metagene plots, genome browser figures (pip install rectify-rna[visualize])

Bundled Data (Yeast)

For S. cerevisiae, RECTIFY includes the S288C genome, a merged annotation (see below), GO terms, and pre-deconvolved WT NET-seq data (pan-mutant consensus, NNLS-deconvolved offline)—no external files needed.

Bundled annotation (--Scer) fuses four sources into a single GFF3:

Source Features Reference
SGD R64-5-1 All protein-coding genes, tRNAs, snoRNAs, rDNA, etc. SGD (yeastgenome.org)
CUTs (925) Cryptic Unstable Transcripts Xu et al. 2009, Nature 457:1033 (PMID:19169243)
SUTs (847) Stable Unannotated Transcripts Xu et al. 2009, Nature 457:1033 (PMID:19169243)
XUTs (1,658) Xrn1-sensitive Unstable Transcripts van Dijk et al. 2011, Nature 475:114 (PMID:21697827)

CUT/SUT coordinates are SGD-curated and lifted to R64-1-1. XUT coordinates are from the original van Dijk 2011 supplementary data (R64/sacCer3). All features use Roman numeral chromosome names (chrIchrXVI, chrMito).


3' End Correction: Indel Artifacts in Poly(A) Regions

When poly(A) tails align to genomic A-tracts, aligners like minimap2 introduce indel artifacts to maximize alignment score. This shifts the apparent 3' end downstream of the true cleavage site.

3' End Indel Correction

The Problem: The aligner introduces deletions to extend alignment of poly(A) tail bases into downstream genomic A-tracts, shifting the apparent 3' end.

RECTIFY's Solution: Walk upstream from the soft-clip boundary through the aligned region:

  1. Skip positions where genome = A (ambiguous with poly(A) tail)
  2. Skip deletions (D in CIGAR) - these are alignment artifacts
  3. Skip T sequencing errors in the tail
  4. STOP at first non-A/T agreement between genome and read

Result: True 3' end at the correct position. Poly(A) length = soft-clipped + aligned A's + absorbed deletions.

Key insight for IGV users: For minus strand reads, the poly(A) tail appears as poly(T) extending leftward. RECTIFY corrects by shifting rightward toward the true CPA.


Soft-Clip Rescue at Homopolymer Boundaries

Nanopore basecallers systematically under-call homopolymer runs (e.g., calling 8 U's instead of 10). When this happens at CPA sites with upstream T-tracts, the aligner soft-clips the non-T bases instead of placing them correctly.

Soft-Clip Rescue

The Problem: The basecaller under-calls the T-tract, so when the aligner reaches the first non-T base, it can't fit it into the shortened homopolymer and soft-clips it instead.

RECTIFY's Solution:

  1. Detect soft-clips adjacent to homopolymer boundaries
  2. Skip over remaining reference homopolymer bases (the under-called T's)
  3. Match soft-clipped bases to reference sequence beyond the homopolymer
  4. Extend the 3' end to include the rescued bases

5' End Correction: Splice Junction Soft-Clips

Long reads spanning splice junctions often have soft-clipped bases at the 5' end where the aligner fails to find the exact junction boundary. RECTIFY recovers the true splice site.

5' Junction Rescue

The Problem: Soft-clipped bases at the 5' end actually match the upstream exon, but the aligner couldn't extend through the splice junction.

RECTIFY's Solution:

  1. Identify reads with 5' soft-clips near annotated splice sites
  2. Check if soft-clipped sequence matches upstream exon
  3. Extend the alignment to the canonical splice donor (GT)

Result: Accurate read 5' end and recovered splice junction. Note: Due to 5'-to-3' degradation in direct RNA sequencing, the read's 5' end often does not represent the true TSS.


Multi-Aligner Consensus Pipeline

Different aligners make different tradeoffs at splice junctions. RECTIFY runs three aligners in parallel and selects the best alignment per read.

Multi-Aligner Consensus

The Problem: Different aligners handle the same read differently. Some soft-clip at splice boundaries while others find the junction.

RECTIFY's Solution:

  1. Run all 3 aligners (minimap2, mapPacBio, gapmm2) on the same reads
  2. Attempt to rescue each alignment's 5' soft-clips by extending through known junctions
  3. Score the (potentially rescued) alignments by canonical splice sites (GT-AG) and annotation matches
  4. Select the highest-scoring alignment per read
  5. Output: Single consensus BAM with best alignment per read

Note: 3' false junctions from poly(A) artifacts are handled separately by walk back correction (see "3' False Junction Handling" below).

Usage:

# Multi-aligner consensus alignment (default, aligners run in parallel)
rectify align reads.fastq.gz --genome genome.fa --annotation genes.gff -o aligned.bam

# Parallel aligners with proportional thread allocation (minimap2 gets fewer threads
# since it's faster; mapPacBio and gapmm2 get more to finish at the same time)
rectify align reads.fastq.gz --genome genome.fa --annotation genes.gff \
    --parallel-aligners --threads 16 -o aligned.bam

# Single aligner mode (faster, less accurate)
rectify align reads.fastq.gz --genome genome.fa --aligner minimap2 -o aligned.bam

3' False Junction Handling

Poly(A) tails can create spurious "junctions" when the aligner introduces a skip (N) operation to align tail bases to a downstream genomic A-tract. RECTIFY's walk back correction completely handles this artifact.

False Junction Walk Back

The Problem: The aligner introduces an N (skip) to extend the poly(A) tail alignment into a downstream A-tract, creating a spurious junction that doesn't exist in the transcript.

RECTIFY's Solution: The walk back algorithm finds the true 3' end by walking upstream through ALL aligned A's until it finds the first non-A agreement between genome and read. Crucially, it DISCARDS any N (skip) operations it encounters.

Result:

  • Walk back finds the true CPA at the EXON/A boundary
  • The false junction (N operation) is completely ignored
  • No special false junction detection needed

This means 3' false junctions are NOT a problem for downstream analysis—walk back correction handles them automatically as part of finding the true cleavage and polyadenylation site.


Adaptive Clustering and Differential Expression

After correction, RECTIFY groups nearby CPA sites into clusters using a valley-based algorithm, then runs DESeq2 at both gene and cluster resolution.

Adaptive Clustering

Algorithm:

  1. Find peaks (local maxima in 3' end signal)
  2. Find valleys (local minima between peaks)
  3. Set boundaries at midpoint between peak and valley (capped at ±10bp)

Why cluster-level analysis matters:

  • Genes often have MULTIPLE CPA sites (alternative polyadenylation)
  • Conditions may shift usage between proximal/distal sites
  • Cluster-level DESeq2 detects isoform-specific changes that gene-level misses

Dual-resolution output:

Level Detects Example
Gene Total expression changes HSP82 is 2-fold down in heat shock
Cluster CPA site usage changes FAS1 shifts from distal to proximal site

Output

Each read gets a corrected position with confidence scores:

read_id   │ chrom │ strand │ original │ corrected │ shift │ confidence │ polya_len │ fraction │ qc_flags
read001   │ chrI  │   +    │  147592  │   147585  │  -7   │    HIGH    │    42     │  1.0000  │   PASS
read002   │ chrI  │   +    │  147594  │   147591  │  -3   │   MEDIUM   │    38     │  1.0000  │   PASS
read003   │ chrII │   +    │  283109  │   283104  │  -5   │    LOW     │    31     │  0.6500  │ AG_RICH

The fraction column reflects proportional NET-seq assignment: when a read falls in an A-tract with multiple NET-seq peaks, it is split across peaks rather than snapped to a single position. Fractions sum to 1.0 per input read.

The rectify analyze command produces:

  • clusters.tsv - CPA site clusters with read counts per sample
  • deseq2_gene_results.tsv - Gene-level differential expression
  • deseq2_cluster_results.tsv - Cluster-level differential expression
  • shift_results.tsv - Genes with significant APA shifts
  • go_enrichment.tsv - GO enrichment for DE genes
  • motif_results/ - Enriched sequence motifs near CPA sites

Genomic category distribution

The primary per-sample output includes a horizontal bar chart (genomic_distribution.png) showing the fraction of reads and number of unique clusters falling into each genomic category — directly comparable to the style of Xu et al. 2009 Figure 1B/C:

3' UTR            ████████████████████████████████████████  ~90%
snoRNA +/- 300 bp ██  ~2%
intergenic        ███  ~3%
CUTs              █  ~1%
SUTs / XUTs       █  ~1%
5' UTR / CDS      ██  ~2%
antisense CDS     █  <1%

The x-axis uses a broken scale so the dominant 3' UTR bar and the minority categories are both visible. Multiple conditions (e.g. WT vs rrp6Δ) are overlaid as separate bars.

Category definitions and classification priority:

Category Feature types Source
3' UTR UTR3 annotation (or 3' 10% heuristic) SGD
snoRNA +/- 300 bp snoRNA_gene ± 300 bp on either strand SGD
CUTs CUT features Xu et al. 2009
SUTs / XUTs SUT or XUT features Xu 2009 / van Dijk 2011
5' UTR / CDS UTR5 or CDS annotation SGD
antisense CDS Position overlaps CDS on opposite strand SGD
intergenic / intronic None of the above

When a position matches multiple categories, the highest-priority rule wins (order as listed above).

The companion genomic_distribution_3prime_summary.tsv table contains:

condition  category  display_label        reads   reads_pct  clusters
WT         UTR3      3' UTR               182034  91.2       4812
WT         snoRNA    snoRNA +/- 300 bp    3940    1.97       208
...

Transcript body distribution

A second figure (genomic_distribution_body.png) classifies each read by RNA biotype based on the majority-overlap of its full alignment span with annotated features. Replicates are merged by condition; one pie chart is shown per condition.

Biotype assignments and priority:

Biotype Feature types matched Source
protein-coding mRNA, CDS SGD
CUTs CUT Xu et al. 2009
SUTs / XUTs SUT, XUT Xu 2009 / van Dijk 2011
snoRNA snoRNA_gene, snoRNA SGD
tRNA tRNA_gene, tRNA SGD
rRNA rRNA_gene, rRNA SGD
LTR / retrotransposon LTR_retrotransposon, transposable_element_gene SGD
pseudogene pseudogene SGD
antisense Read span overlaps an annotated feature on opposite strand
intergenic No annotated feature overlaps the alignment span

The read is assigned to whichever overlapping feature has the most base-pair overlap. Ties are broken by the priority order above (protein-coding > CUT > SUT_XUT > ...).

The companion genomic_distribution_body_summary.tsv table contains:

condition  category      display_label    reads   reads_pct
WT         protein_coding  protein-coding  175000  87.5
WT         CUT             CUTs             4500   2.25
...

NET-seq Refinement (Optional)

For organisms with NET-seq data, RECTIFY resolves remaining A-tract ambiguity by assigning reads proportionally across NET-seq peaks rather than snapping to a single position.

Bundled yeast data (--Scer): pre-deconvolved pan-mutant NET-seq (WT + DST1Δ consensus, NNLS deconvolution applied once offline). No runtime deconvolution — reads are assigned directly using the pre-computed signal.

Custom data: provide raw NET-seq bigWigs with --netseq-dir. NNLS deconvolution is applied at runtime to recover true CPA positions from the oligo-adenylation spreading pattern.


Installation

# PyPI
pip install rectify-rna

# With visualization support
pip install rectify-rna[visualize]

# Conda (includes MEME Suite for motif discovery)
conda install -c conda-forge -c bioconda rectify-rna

Commands

All-in-one

Command Description
rectify run-all Full pipeline: align (if FASTQ) → correct → analyze → aggregate junctions. Skips completed steps automatically on re-run.

Individual steps

Run steps independently to re-process from any point in the pipeline.

Command Description
rectify align Align FASTQ with the aligner panel (minimap2, gapmm2, mapPacBio; all with DRS-optimized settings) and select the best alignment per read
rectify correct Correct 5' ends, 3' ends, and junctions — indel correction at poly(A) boundaries, A-tract ambiguity resolution, NET-seq refinement
rectify analyze Downstream analysis: CPA clustering, DESeq2, GO enrichment, motif discovery
rectify export Export corrected positions to bigWig/bedGraph tracks
rectify extract Extract per-read features from BAM to TSV (5'/3' ends, junctions, poly(A) length)
rectify aggregate Aggregate reads into 3' end, 5' end, and junction-centered datasets
rectify netseq Process NET-seq BAM files (3' end extraction, NNLS deconvolution; for standalone analysis or for assigning DRS 3' ends in A-tracts to their likely CPA sites)

Examples

# Multiple samples via manifest — typical multi-condition experiment
rectify run-all --manifest samples.tsv --Scer --filter-spikein ENO2 -o results/

# Single sample, bundled yeast genome
rectify run-all reads.fastq.gz --Scer -o results/

# Non-DRS protocol where poly(A) tail is not sequenced
rectify run-all reads.fastq.gz --genome genome.fa --annotation genes.gff \
    --no-polya-sequenced -o results/

# Re-run correction only (alignment already done)
rectify correct reads.bam --genome genome.fa --netseq-dir my_netseq/ -o corrected.tsv

# Re-run analysis only (correction already done)
rectify analyze corrected.tsv --annotation genes.gff --output-dir results/

# Process NET-seq data
rectify netseq netseq.bam --genome genome.fa --gff genes.gff -o netseq_output/

NET-seq Processing

RECTIFY includes a dedicated pipeline for processing NET-seq (Native Elongating Transcript sequencing) data. NET-seq captures nascent RNA 3' ends, which undergo oligo-adenylation during library preparation, creating characteristic signal spreading at A-tract regions.

The Problem: Oligo(A) Signal Spreading

Oligo(A) Spreading

With oligo(A) tails (~5.5 bp mean), reads can prime at ANY downstream A in genomic A-tracts. This causes signal to "spread" downstream, obscuring the true CPA site.

RECTIFY's Solution: NNLS Deconvolution

Oligo(A) Deconvolution

Using the Point-Spread-Function (PSF) derived from 5000+ 0A sites (sites with no downstream genomic A's, where the true position is known):

  1. Build convolution matrix: A[i,j] = P(observe at j | true peak at i)
  2. Solve NNLS with L2 regularization: min ||Ax - observed||² + λ||x||²
  3. Recover true peak positions from deconvolved signal

Result: Sharper, more accurate 3' end signal with A-tract ambiguity resolved.

NET-seq Command

# Basic usage
rectify netseq input.bam --genome genome.fa --gff genes.gff -o output/

# With exclusion region control
rectify netseq input.bam --genome genome.fa --gff genes.gff \
    --include-rdna \        # Don't exclude rDNA locus
    --include-pol3 \        # Don't exclude Pol III genes (tRNAs)
    --exclude-mito \        # Exclude mitochondrial genome
    -o output/

# Disable deconvolution (raw 3' ends only)
rectify netseq input.bam --genome genome.fa --no-deconvolution -o output/

# Process multiple samples
rectify netseq sample1.bam sample2.bam sample3.bam \
    --genome genome.fa --gff genes.gff -o output/

Output Files

File Description
{sample}.unified_reads.parquet Per-read records (25 columns, same schema as nanopore)
{sample}.raw.plus.bedgraph Raw 3' ends, plus strand, RPM-normalized
{sample}.raw.minus.bedgraph Raw 3' ends, minus strand, RPM-normalized
{sample}.deconv.plus.bedgraph Deconvolved signal, plus strand, RPM-normalized
{sample}.deconv.minus.bedgraph Deconvolved signal, minus strand, RPM-normalized

Exclusion Regions

By default, RECTIFY excludes regions with non-standard transcription:

  • rDNA locus (chrXII ~450,000-490,000 in yeast): Highly repetitive, Pol I transcribed
  • Pol III genes (tRNAs, SNR6, RDN5, RPR1, SCR1): Different transcription termination mechanism
  • Flanking regions (100 bp by default): Buffer around excluded genes

These regions are auto-detected from GFF annotation. Use --include-rdna and --include-pol3 flags to include them if needed.

Strand-Aware Coordinate Mapping

NET-seq reads are short (~40-76 bp) and represent the 3' end of nascent RNA:

Strand 3' end position 3' soft-clip Oligo(A) detection
+ reference_end - 1 (rightmost) Right clip Count A's
- reference_start (leftmost) Left clip Count T's (= RNA A's)

Visualization

Install with: pip install rectify-rna[visualize]

Metagene: single condition

Aggregate 3' end signal around a set of loci (TRT sites, CPA clusters, gene TES/TSS, motif matches).

import numpy as np
import matplotlib.pyplot as plt
from rectify.visualize import (
    MetagenePipeline,
    position_index_from_tsv,
    loci_from_pickle,
    loci_from_gff,
    plot_metagene_line,
    add_metagene_annotations,
    set_publication_style,
    despine,
    WONG_COLORS,
)

# Build a position index from RECTIFY corrected 3' end TSVs
# Multiple replicates merge automatically
index, total_reads = position_index_from_tsv(
    ["wt_rep1/corrected_3ends.tsv",
     "wt_rep2/corrected_3ends.tsv",
     "wt_rep3/corrected_3ends.tsv"],
    position_col='corrected_3prime',
)

# Load loci — choose the loader that matches your input
trt_loci = loci_from_pickle("cache/trt_signals_v3.pkl")            # TRT/CPA pkl cache
tes_loci = loci_from_gff("genes.gff", feature_type='gene', center='end')   # gene 3' ends
tss_loci = loci_from_gff("genes.gff", feature_type='gene', center='start') # gene 5' ends

# Compute metagene (strand coordinate transformation + reversal handled internally)
pipeline = MetagenePipeline()
result = pipeline.compute_center_profile(
    trt_loci, index,
    window=(-50, 50),
    total_reads=total_reads,
    verify_strands=True,   # raises StrandOrientationError on strand bug — never skip this
    cap_percentile=50,     # suppress outlier loci (e.g. TDH3) before peak detection
)

set_publication_style()
fig, ax = plt.subplots(figsize=(5, 3), constrained_layout=True)

x    = result['x']
mean = np.mean(result['profile_matrix'], axis=0)
sem  = np.std(result['profile_matrix'],  axis=0) / np.sqrt(result['n_loci'])

plot_metagene_line(ax, x, mean, sem,
                   color=WONG_COLORS['blue'],
                   label=f"WT (n={result['n_loci']})")
add_metagene_annotations(ax, positions=[0], labels=["TRT start"])
ax.set_xlabel("Position (bp from TRT start)")
ax.set_ylabel("RPM")
despine(ax)
ax.legend(fontsize=8)
fig.savefig("trt_metagene_wt.png", dpi=150, bbox_inches='tight')

Always set cap_percentile=50 with window ≥ 100 bp — a single highly expressed gene can dominate the aggregate and mask the biological peak.


Metagene: multi-condition ridge plot

Compare WT vs mutants at TRT sites, stacked as a ridge plot (conditions offset vertically so shapes can be compared without overlap).

import numpy as np
import matplotlib.pyplot as plt
from rectify.visualize import (
    MetagenePipeline,
    position_index_from_tsv,
    loci_from_pickle,
    plot_ridge_profiles,
    apply_window_sum_capping,
    set_publication_style,
    WONG_COLORS,
)

trt_loci = loci_from_pickle("cache/trt_signals_v3.pkl")
pipeline  = MetagenePipeline()

# Build per-condition profile matrices
CONDITIONS = {
    'wt':     ["wt_rep1/corrected_3ends.tsv",     "wt_rep2/corrected_3ends.tsv"],
    'rna15':  ["rna15_rep1/corrected_3ends.tsv",  "rna15_rep2/corrected_3ends.tsv"],
    'ysh1':   ["ysh1_rep1/corrected_3ends.tsv",   "ysh1_rep2/corrected_3ends.tsv"],
}

profiles = {}
for cond, paths in CONDITIONS.items():
    idx, n = position_index_from_tsv(paths, position_col='corrected_3prime')
    r = pipeline.compute_center_profile(
        trt_loci, idx, window=(-50, 50), total_reads=n,
        verify_strands=True, cap_percentile=50,
    )
    profiles[cond] = r['profile_matrix']

# Optional: cap outliers consistently across conditions before ridge plot
profiles_capped = {k: apply_window_sum_capping(v, percentile=90)[0]
                   for k, v in profiles.items()}

x = np.arange(-50, 51)
COLORS = {'wt': WONG_COLORS['blue'], 'rna15': WONG_COLORS['vermillion'],
          'ysh1': WONG_COLORS['green']}
LABELS = {'wt': 'WT', 'rna15': 'RNA15-AA', 'ysh1': 'YSH1-AA'}

set_publication_style()
fig, ax = plt.subplots(figsize=(6, 5), constrained_layout=True)
plot_ridge_profiles(ax, x, profiles_capped, colors=COLORS, labels=LABELS,
                    order=['wt', 'rna15', 'ysh1'])
ax.set_xlabel("Position (bp from TRT start)")
ax.axvline(0, color='#CC0000', linestyle='--', linewidth=0.8, alpha=0.7)
fig.savefig("trt_metagene_ridge.png", dpi=150, bbox_inches='tight')

Stacked read browser

Genome-browser-style view of individual nanopore reads, colored by transcript category. Uses LineCollection for fast batch rendering — 400 reads × 5 conditions renders as ~30 draw calls instead of 2000.

import pandas as pd
import matplotlib.pyplot as plt
from rectify.visualize import plot_stacked_read_panel

# ── Read classification is your responsibility (domain-specific) ───────────────
CAT_COLORS = {
    'CDS-internal':   '#d62728',  # red
    '3UTR-premature': '#ff9896',  # coral
    'intronic':       '#9467bd',  # purple
    'canonical':      '#2ca02c',  # green
    'readthrough':    '#e6ac00',  # gold
    'other':          '#aaaaaa',  # gray
}

# Load reads from RECTIFY TSV, filter to gene region
reads_df = pd.read_csv("wt_rep1/corrected_3ends.tsv", sep='\t',
                       usecols=['read_id', 'chrom', 'strand',
                                'corrected_3prime', 'five_prime_position',
                                'alignment_start', 'alignment_end',
                                'qc_flags', 'junctions'])
reads_df = reads_df[
    (reads_df['qc_flags'] == 'PASS') &
    (reads_df['chrom'] == 'chrXIV') &
    (reads_df['strand'] == '+') &
    (reads_df['alignment_start'] < 417000) &
    (reads_df['alignment_end']   > 413000)
]

# Sort by 3' end so reads with the same termination site appear contiguous
reads_df = reads_df.sort_values('corrected_3prime')

# Classify and map to color (your logic here)
reads_df['color'] = reads_df.apply(classify_read, axis=1).map(CAT_COLORS)

# ── Draw ───────────────────────────────────────────────────────────────────────
fig, axes = plt.subplots(
    3, 1,                                         # gene track + 2 conditions
    figsize=(14, 8),
    gridspec_kw={'height_ratios': [1, 3, 3]},
    constrained_layout=True,                      # NOT tight_layout()
)

# condition panel
n_rows = plot_stacked_read_panel(
    axes[1], reads_df,
    color_col='color',
    start_col='alignment_start',
    end_col='alignment_end',
    junction_col='junctions',  # "start-end,start-end" RECTIFY format
    gap=50,
)

# mark internal pA sites
for pos in [414500, 415200]:
    axes[1].axvline(pos, color='#d62728', linestyle='--', linewidth=1.0, alpha=0.8)

axes[1].set_xlim(413000, 417000)
axes[1].set_ylabel("WT", rotation=0, ha='right', va='center')

Locus loaders

Function Input Use for
loci_from_pickle(path) .pkl cache TRT/CPA caches
loci_from_gff(path, feature_type, center) GFF3 TSS (center='start'), TES (center='end')
loci_from_motif_scan(sequences, motif) dict of sequences Regex motif, both strands (e.g. T{6,})
loci_from_bed(path, center) BED file Custom interval lists
loci_from_tsv(path) TSV with chrom/strand/center Pre-computed loci
position_index_from_tsv(paths) RECTIFY corrected_3ends.tsv 3' end signal
position_index_from_bigwig(path) bigWig NET-seq, PAR-CLIP, any coverage

Figure utilities

from rectify.visualize import set_publication_style, save_multi_format, WONG_COLORS

set_publication_style()   # consistent fonts, tick sizes, rc params for all figures

# Color-blind-safe palette (use for all new figures)
blue       = WONG_COLORS['blue']        # #0072B2
orange     = WONG_COLORS['orange']      # #E69F00
green      = WONG_COLORS['green']       # #009E73
vermillion = WONG_COLORS['vermillion']  # #D55E00

# Save PNG + PDF + SVG in one call
save_multi_format(fig, "figures/fig1_trt_metagene")
# → figures/fig1_trt_metagene.png, .pdf, .svg

See PLOT_SKILLS.md for the full API reference and list of pitfalls.


Supported Technologies

  • Nanopore direct RNA-seq (minimap2)
  • QuantSeq (oligo-dT short-read)
  • PacBio Iso-Seq
  • NET-seq (nascent RNA 3' end sequencing)
  • Any poly(A)-tailed RNA-seq

Citation

Original RECTIFY:

Roy KR, Chanfreau GF. Robust mapping of polyadenylated and non-polyadenylated RNA 3' ends at nucleotide resolution by 3'-end sequencing. Methods. 2020;176:4-13. PMID: 31128237

RECTIFY 2.0: Manuscript in preparation


License

MIT License - See LICENSE for details

Contact

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

rectify_rna-2.7.7.tar.gz (94.1 MB view details)

Uploaded Source

Built Distribution

If you're not sure about the file name format, learn more about wheel file names.

rectify_rna-2.7.7-py3-none-any.whl (94.1 MB view details)

Uploaded Python 3

File details

Details for the file rectify_rna-2.7.7.tar.gz.

File metadata

  • Download URL: rectify_rna-2.7.7.tar.gz
  • Upload date:
  • Size: 94.1 MB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.2.0 CPython/3.12.2

File hashes

Hashes for rectify_rna-2.7.7.tar.gz
Algorithm Hash digest
SHA256 4a59308b15ec883e6e245e31f506c08fbeb424b634296052a21963caada647e9
MD5 f27c93ac9154283d12de521a5076c9a8
BLAKE2b-256 8972a65ab14b04cdecf0225872ea5e95a1f6e3b4569e50bb2e02b3c7aed6ced8

See more details on using hashes here.

File details

Details for the file rectify_rna-2.7.7-py3-none-any.whl.

File metadata

  • Download URL: rectify_rna-2.7.7-py3-none-any.whl
  • Upload date:
  • Size: 94.1 MB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.2.0 CPython/3.12.2

File hashes

Hashes for rectify_rna-2.7.7-py3-none-any.whl
Algorithm Hash digest
SHA256 61e6e34cfd72e1dfdd092f8888cde4d599573efbc323d7ce0b89ade319955090
MD5 a4d41c3f5a1efa5b3b4029b9b7ace2cb
BLAKE2b-256 fa11d268fe34032a856a2052f9433c6f08e118d4510c535bfd845b7e732fe8cc

See more details on using hashes here.

Supported by

AWS Cloud computing and Security Sponsor Datadog Monitoring Depot Continuous Integration Fastly CDN Google Download Analytics Pingdom Monitoring Sentry Error logging StatusPage Status page