Skip to main content

Multi-aligner consensus framework for nanopore RNA-seq splice junction detection

Project description

Nanopore COMPASS

COMPASS (Comparison Of Multiple alignment Programs for Alternative Splice site discovery) adapted for Oxford Nanopore long-read RNA sequencing data.

Overview

Nanopore COMPASS is a multi-aligner consensus framework for detecting splice junctions and alternative splice sites (ASS) in nanopore direct RNA sequencing data. It addresses nanopore-specific challenges including high error rates (~5-15%), homopolymer errors, and alignment artifacts.

Key Features

  • 4-aligner consensus strategy for robust junction detection
  • Nanopore error-tolerant scoring (reduced penalties for indels, homopolymer errors)
  • Quality-based filtering using perfect flanking sequence matches
  • Ambiguous junction resolution for homopolymer regions
  • RPG (ribosomal protein gene) analysis comparing highly-expressed genes vs genome-wide
  • Organism-specific configuration (yeast, human, extensible)

Original COMPASS

This implementation builds on the original COMPASS framework for short-read data:

Multi-Aligner Strategy

Nanopore COMPASS uses 4 complementary aligners with orthogonal algorithms:

Aligner Algorithm Type Primary Strength Speed
Minimap2 Seed-and-chain High sensitivity, general-purpose Fast (9-40x vs GraphMap2)
uLTRA Graph + collinear chaining Small exons (11-20nt), annotation-guided Medium
deSALT De Bruijn graph Complex isoforms, alternative splicing Medium-slow
GraphMap2 3-phase mapping High accuracy validation Slow

Consensus Selection Hierarchy:

  1. Lowest error-weighted alignment score
  2. Annotation match (if tie)
  3. Most junctions detected per read
  4. Aligner precedence: GraphMap2 > uLTRA > deSALT > Minimap2
  5. Random (if still tied)

Installation

Dependencies

Python (≥3.8):

  • pysam
  • pandas
  • numpy
  • pyyaml
  • matplotlib
  • seaborn

Aligners:

  • minimap2 - Fast splice-aware aligner
  • uLTRA - Annotation-guided alignment
  • deSALT - De Bruijn graph aligner
  • GraphMap2 - High-accuracy 3-phase mapping

R packages (optional, for plots):

  • ggplot2
  • UpSetR
  • ComplexHeatmap

Setup

# Clone or navigate to repository
cd /oak/stanford/groups/larsms/Users/kevinroy/software/nanopore-compass

# Add to Python path
export PYTHONPATH=$PYTHONPATH:/oak/stanford/groups/larsms/Users/kevinroy/software/nanopore-compass/src

# Install Python dependencies (conda recommended)
conda create -n nanopore-compass python=3.10 pysam pandas pyyaml matplotlib seaborn
conda activate nanopore-compass

# Install aligners (via conda or from source)
conda install -c bioconda minimap2 graphmap2
# uLTRA and deSALT: see respective GitHub repositories for installation

Quick Start

1. Prepare Configuration

Edit config/yeast.yaml to specify:

  • Reference genome FASTA path
  • Annotation GFF path
  • Introns file path (COMPASS format)
  • Aligner paths and parameters
  • Filtering thresholds

2. Run Alignments

# Option A: Run all aligners sequentially
./bin/align_and_process.sh \
  --config config/yeast.yaml \
  --reads sample.fastq \
  --output output_dir \
  --sample sample_name

# Option B: Run individual aligners
minimap2 -ax splice -uf -k14 -G 5000 ref.fa reads.fq > minimap2.sam
ultra -i annotations.gtf --ont ref.fa reads.fq -o ultra.sam
deSALT aln -r ref.fa -a annotations.gtf -k 15 reads.fq > desalt.sam
graphmap2 align -r ref.fa -d reads.fq -o graphmap2.sam --extcigar

3. Extract and Merge Junctions

from pathlib import Path
from compass_functions import JunctionExtractor
from junction_comparison import aggregate_multi_aligner_junctions
from utils import load_intron_annotations

# Extract junctions from each aligner
aligners = ['minimap2', 'ultra', 'desalt', 'graphmap2']
junction_tables = {}

for aligner in aligners:
    bam_path = Path(f"output_dir/{aligner}.sorted.bam")
    fasta_path = Path("ref.fa")

    extractor = JunctionExtractor(bam_path, fasta_path)
    junctions = extractor.extract_all_junctions()
    junction_tables[aligner] = junctions
    extractor.close()

# Aggregate across aligners
aggregated = aggregate_multi_aligner_junctions(junction_tables)

# Load annotations
annotated_introns = load_intron_annotations(Path("introns.tsv"))

# Filter for high confidence
from junction_comparison import filter_high_confidence_junctions

high_confidence = filter_high_confidence_junctions(
    aggregated,
    min_reads=5,
    min_aligners=2,
    min_perfect_matches=5
)

print(f"Total junctions: {len(aggregated)}")
print(f"High-confidence: {len(high_confidence)}")

4. Analyze Results

from junction_comparison import compare_aligner_performance

# Compare aligner performance
stats = compare_aligner_performance(junction_tables, annotated_introns)

for aligner, metrics in stats.items():
    print(f"\n{aligner}:")
    print(f"  Total junctions: {metrics['total_junctions']}")
    print(f"  Exact annotated: {metrics['exact_annotated_pct']:.1f}%")
    print(f"  Novel: {metrics['novel_pct']:.1f}%")

Filtering Strategy

Multi-Tier Filtering for Alternative Splice Sites

Tier 1: Minimum Read Support

  • Threshold: ≥5 reads per junction
  • Eliminates random alignment errors

Tier 2: Multi-Sample Consistency

  • Threshold: Junction in ≥2 biological replicates
  • Eliminates sample-specific artifacts

Tier 3: Flanking Sequence Quality (PRIMARY FILTER)

  • Threshold: ≥5-6bp perfect matches on BOTH sides of junction
  • Distinguishes true junctions from alignment artifacts
  • Key insight: True ASS even 2bp from annotation should have clean flanks

Tier 4: Splice Site Motif Scoring (for classification, not exclusion)

  • Score motifs but don't exclude based on non-canonical
  • Non-canonical 3'SS (non-YAG) can be genuine (Meyer et al. 2023)

Tier 5: Multi-Aligner Consensus

  • High confidence: ≥3 aligners OR ≥2 aligners including GraphMap2
  • Medium confidence: ≥2 aligners with orthogonal algorithms
  • Low confidence: Single aligner only

Recommended Thresholds

Conservative (default):

  • Min reads: 5
  • Min samples: 2
  • Perfect matches: 5bp both sides
  • Min aligners: 2

Publication-quality (stringent):

  • Min reads: 10
  • Min samples: 3
  • Perfect matches: 6bp both sides
  • Min aligners: 2 (with GraphMap2 validation)

Project Structure

nanopore-compass/
├── bin/                          # Executable scripts
│   ├── nanopore-compass          # Main pipeline
│   ├── align_and_process.sh      # Multi-aligner wrapper
│   └── combine_junctions.py      # Merge sample junctions
│
├── src/                          # Core modules
│   ├── utils.py                  # Shared utilities
│   ├── sequence_analysis.py      # Splice site scoring, motifs
│   ├── compass_functions.py      # Junction extraction, ambiguity resolution
│   └── junction_comparison.py    # Multi-aligner consensus
│
├── scripts/                      # Analysis scripts
│   ├── rpg_analysis.py           # RPG vs non-RPG comparison
│   ├── ass_detection.py          # Alternative splice site detection
│   ├── quality_control.R         # QC plots
│   └── aligner_comparison.R      # Multi-aligner visualization
│
├── config/                       # Configuration files
│   ├── yeast.yaml                # S. cerevisiae parameters
│   └── human.yaml                # H. sapiens parameters
│
├── reference/                    # Reference data
│   ├── S_cerevisiae_all_introns.tsv
│   └── S_cerevisiae_rpg_genes.txt
│
└── test/                         # Test data and scripts
    └── test_data/

Configuration

Configuration files use YAML format with organism-specific parameters:

Key Sections:

  • reference files: Genome FASTA, annotations, intron catalog
  • junction_parameters: Intron size constraints, quality thresholds
  • splice_sites: Organism-specific consensus sequences
  • aligners: Paths, parameters, and settings for each aligner
  • filtering: Multi-tier thresholds for junction validation
  • error_weights: Nanopore-specific error tolerance

See config/yeast.yaml for complete example.

Advanced Usage

RPG (Ribosomal Protein Gene) Analysis

Compare splicing characteristics between highly-expressed RPG genes and genome-wide:

from scripts.rpg_analysis import compare_rpg_vs_nonrpg

results = compare_rpg_vs_nonrpg(
    junctions_file="high_confidence_junctions.tsv",
    rpg_genes_file="reference/S_cerevisiae_rpg_genes.txt"
)

# Results include:
# - Fidelity comparison (exact annotated %)
# - Alternative splicing frequency
# - Splice site motif strength
# - Condition-specific effects (WT vs mutant)

Alternative Splice Site Detection

from scripts.ass_detection import detect_alternative_splice_sites

ass_events = detect_alternative_splice_sites(
    junctions_file="high_confidence_junctions.tsv",
    annotated_introns="reference/S_cerevisiae_all_introns.tsv",
    min_distance=10  # bp from annotated sites
)

# ASS types:
# - alt_5ss: Alternative 5' splice site
# - alt_3ss: Alternative 3' splice site
# - cryptic: Cryptic introns (within exons)
# - exon_skip: Exon skipping events

Ambiguous Junction Resolution

from compass_functions import resolve_ambiguous_junctions
from sequence_analysis import SpliceMotifScorer

# Initialize motif scorer
scorer = SpliceMotifScorer(
    five_ss_consensus="GTATGT",
    five_ss_penalties=[4, 3, 1, 1, 2, 1],
    three_ss_consensus="YYYAG",
    three_ss_penalties=[1, 1, 1, 3, 4]
)

# Resolve ambiguous junctions
resolved = resolve_ambiguous_junctions(
    junction_dict=raw_junctions,
    fasta_path=Path("ref.fa"),
    motif_scorer=scorer,
    max_shift=10
)

Nanopore-Specific Adaptations

Error-Weighted Scoring

# Nanopore error model
score = (
    mismatches * 1.0 +                # Full weight
    indels * 0.5 +                    # Reduced (common in nanopore)
    indels_in_homopolymer * 0.2 +     # Minimal (expected errors)
    soft_clips * 0.3                  # Moderate penalty
)

Quality-Based Filtering

Instead of distance-based filtering, use perfect flanking sequence matches:

# Extract from BAM alignment
upstream_matches, downstream_matches = count_perfect_matches_at_junction(read, junction_pos)

# Filter criteria
if upstream_matches >= 5 and downstream_matches >= 5:
    # High-quality junction (true biological event)
    pass
else:
    # Likely alignment artifact
    reject()

This approach correctly identifies true ASS even when only 2-3bp from annotated sites.


Parallel Processing for Large Datasets

For datasets with 100M+ reads, use the parallelization pipeline for ~25x speedup.

Design Principle

Pre-alignment FASTQ chunking is required for COMPASS 4-way consensus:

  • Same read can map to different genomic locations in different aligners
  • COMPASS compares the SAME read across all 4 aligners
  • Pre-alignment chunking guarantees same read order in all 4 output BAMs

Workflow

BAM → FASTQ → chunk (25x) → align with 4 aligners → 4-way consensus → merge

Usage

# Run complete parallel pipeline
bash scripts/run_parallel_pipeline.sh \
    --bam input.bam \
    --sample sample_name \
    --chunks 25 \
    --output parallel_output/

# This will:
# 1. Convert BAM to FASTQ (bedtools bamtofastq)
# 2. Split FASTQ into 25 chunks
# 3. Submit SLURM array job to align each chunk with all 4 aligners
# 4. Submit SLURM array job to process each chunk with 4-way consensus
# 5. Merge all chunks into final junction table

Performance

Single aligner (100M reads):

  • Serial: 250 min (4.2 hours)
  • Parallel (25 cores): 10 minutes
  • Speedup: ~25x

4 aligners (100M reads):

  • Serial: 1000 min (16.7 hours)
  • Parallel (25 cores): 40-50 minutes
  • Speedup: ~20-25x

Manual Workflow (Step-by-Step)

For more control, run steps individually:

# 1. Convert BAM to FASTQ
bedtools bamtofastq -i input.bam -fq input.fastq

# 2. Split FASTQ
python scripts/split_fastq.py \
    --fastq input.fastq \
    --chunks 25 \
    --output fastq_chunks/

# 3. Align each chunk (SLURM array job)
sbatch --array=1-25 slurm/align_chunk_4way.sh

# 4. Process with consensus (SLURM array job)
sbatch --array=1-25 slurm/process_chunk_consensus.sh

# 5. Merge results
python scripts/merge_junction_chunks.py \
    --input junction_chunks/ \
    --output final_junctions.tsv

See docs/PARALLELIZATION.md for complete technical details.


Output Files

Junction Tables

Per-aligner junctions: {aligner}_junctions.tsv

  • Columns: chrom, five_ss, three_ss, count, five_dinuc, three_dinuc, is_canonical, is_exact_annotated, etc.

Aggregated junctions: consensus_junctions.tsv

  • Additional columns: num_aligners, aligner_support, confidence, validators

Filtered junctions: high_confidence_junctions.tsv

  • Subset passing all filtering criteria

Statistics

Aligner comparison: aligner_performance.tsv

  • Metrics per aligner: total junctions, exact annotated %, novel %

Filtering summary: filtering_stats.tsv

  • Junction counts at each filtering tier

RPG analysis (if enabled): rpg_vs_nonrpg_summary.tsv

  • Comparative metrics between RPG and non-RPG introns

Performance

Computational Requirements (yeast genome, 1M reads):

  • Memory: 16-32 GB RAM
  • Storage: ~50 GB for intermediate files
  • Time: 4-8 hours (4 aligners, 8 threads each)

Speed by Aligner:

  • Minimap2: ~5-10 minutes
  • uLTRA: ~30-60 minutes
  • deSALT: ~1-2 hours
  • GraphMap2: ~4-6 hours (slowest but most accurate)

Optimization:

  • Run aligners in parallel (separate SLURM jobs)
  • Use Minimap2 + uLTRA for fast initial analysis
  • Add GraphMap2 for high-confidence validation

Citation

If you use Nanopore COMPASS, please cite:

Original COMPASS:

Roy KR, et al. (2023) COMPASS: Comparison Of Multiple Alignment Programs for Alternative Splice site discovery. RNA, 29(11):1833-1847. PMID: 37956322

Nanopore Adaptations:

[Your publication using this framework]

Aligners:

  • Minimap2: Li H. (2018) Bioinformatics 34(18):3094-3100
  • uLTRA: Sahlin K & Mäkinen V. (2021) Genome Biol 22:142
  • deSALT: Liu B, et al. (2019) Genome Biol 20:274
  • GraphMap2: Sović I, et al. (2016) Nat Commun 7:11307

License

[Specify license - consider MIT or GPL to match original COMPASS]

Contact

Kevin R. Roy Larsmon Lab, Stanford University [Contact information]

References

  • Original COMPASS: https://github.com/k-roy/COMPASS
  • Non-canonical splice sites: Meyer et al. (2023), PMC10711555
  • Yeast genome: Saccharomyces Genome Database (SGD)
  • Nanopore basecalling: Oxford Nanopore Technologies

Version History

v1.0.0 (2026-03-09)

  • Initial implementation
  • 4-aligner consensus strategy
  • Nanopore error model
  • Yeast configuration
  • RPG analysis module
  • Alternative splice site detection

Last updated: 2026-03-09

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

nanocompass-0.1.0.tar.gz (9.4 kB view details)

Uploaded Source

Built Distribution

If you're not sure about the file name format, learn more about wheel file names.

nanocompass-0.1.0-py3-none-any.whl (8.8 kB view details)

Uploaded Python 3

File details

Details for the file nanocompass-0.1.0.tar.gz.

File metadata

  • Download URL: nanocompass-0.1.0.tar.gz
  • Upload date:
  • Size: 9.4 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.2.0 CPython/3.12.2

File hashes

Hashes for nanocompass-0.1.0.tar.gz
Algorithm Hash digest
SHA256 c38b1df97d5a04b30b8bb6dcc2d9bb657d805cccbd5736c11663b25f56df4cc5
MD5 45f5e56646217a975ff26fd29019b1f3
BLAKE2b-256 b62670ab5f73f6015f7ec729c74f3c04020485d1a7be05350aed966ff9c2da92

See more details on using hashes here.

File details

Details for the file nanocompass-0.1.0-py3-none-any.whl.

File metadata

  • Download URL: nanocompass-0.1.0-py3-none-any.whl
  • Upload date:
  • Size: 8.8 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.2.0 CPython/3.12.2

File hashes

Hashes for nanocompass-0.1.0-py3-none-any.whl
Algorithm Hash digest
SHA256 9ea01b61949f0c5c7bed7931a35a1ed97740bc811984840870e8ebf79b6112cb
MD5 395be29fac6a780b0ef716ce5590f70d
BLAKE2b-256 ab18d992e3452f6c645ca2b89d9759c771f002dc2d68a0471312aa9c2a83fb17

See more details on using hashes here.

Supported by

AWS Cloud computing and Security Sponsor Datadog Monitoring Depot Continuous Integration Fastly CDN Google Download Analytics Pingdom Monitoring Sentry Error logging StatusPage Status page