Multi-aligner consensus framework for nanopore RNA-seq splice junction detection
Project description
Nanopore COMPASS
COMPASS (Comparison Of Multiple alignment Programs for Alternative Splice site discovery) adapted for Oxford Nanopore long-read RNA sequencing data.
Overview
Nanopore COMPASS is a multi-aligner consensus framework for detecting splice junctions and alternative splice sites (ASS) in nanopore direct RNA sequencing data. It addresses nanopore-specific challenges including high error rates (~5-15%), homopolymer errors, and alignment artifacts.
Key Features
- 4-aligner consensus strategy for robust junction detection
- Nanopore error-tolerant scoring (reduced penalties for indels, homopolymer errors)
- Quality-based filtering using perfect flanking sequence matches
- Ambiguous junction resolution for homopolymer regions
- RPG (ribosomal protein gene) analysis comparing highly-expressed genes vs genome-wide
- Organism-specific configuration (yeast, human, extensible)
Original COMPASS
This implementation builds on the original COMPASS framework for short-read data:
- Paper: Roy et al., RNA (2023), PMID: 37956322
- GitHub: https://github.com/k-roy/COMPASS
Multi-Aligner Strategy
Nanopore COMPASS uses 4 complementary aligners with orthogonal algorithms:
| Aligner | Algorithm Type | Primary Strength | Speed |
|---|---|---|---|
| Minimap2 | Seed-and-chain | High sensitivity, general-purpose | Fast (9-40x vs GraphMap2) |
| uLTRA | Graph + collinear chaining | Small exons (11-20nt), annotation-guided | Medium |
| deSALT | De Bruijn graph | Complex isoforms, alternative splicing | Medium-slow |
| GraphMap2 | 3-phase mapping | High accuracy validation | Slow |
Consensus Selection Hierarchy:
- Lowest error-weighted alignment score
- Annotation match (if tie)
- Most junctions detected per read
- Aligner precedence: GraphMap2 > uLTRA > deSALT > Minimap2
- Random (if still tied)
Installation
Dependencies
Python (≥3.8):
- pysam
- pandas
- numpy
- pyyaml
- matplotlib
- seaborn
Aligners:
- minimap2 - Fast splice-aware aligner
- uLTRA - Annotation-guided alignment
- deSALT - De Bruijn graph aligner
- GraphMap2 - High-accuracy 3-phase mapping
R packages (optional, for plots):
- ggplot2
- UpSetR
- ComplexHeatmap
Setup
# Clone or navigate to repository
cd /oak/stanford/groups/larsms/Users/kevinroy/software/nanopore-compass
# Add to Python path
export PYTHONPATH=$PYTHONPATH:/oak/stanford/groups/larsms/Users/kevinroy/software/nanopore-compass/src
# Install Python dependencies (conda recommended)
conda create -n nanopore-compass python=3.10 pysam pandas pyyaml matplotlib seaborn
conda activate nanopore-compass
# Install aligners (via conda or from source)
conda install -c bioconda minimap2 graphmap2
# uLTRA and deSALT: see respective GitHub repositories for installation
Quick Start
1. Prepare Configuration
Edit config/yeast.yaml to specify:
- Reference genome FASTA path
- Annotation GFF path
- Introns file path (COMPASS format)
- Aligner paths and parameters
- Filtering thresholds
2. Run Alignments
# Option A: Run all aligners sequentially
./bin/align_and_process.sh \
--config config/yeast.yaml \
--reads sample.fastq \
--output output_dir \
--sample sample_name
# Option B: Run individual aligners
minimap2 -ax splice -uf -k14 -G 5000 ref.fa reads.fq > minimap2.sam
ultra -i annotations.gtf --ont ref.fa reads.fq -o ultra.sam
deSALT aln -r ref.fa -a annotations.gtf -k 15 reads.fq > desalt.sam
graphmap2 align -r ref.fa -d reads.fq -o graphmap2.sam --extcigar
3. Extract and Merge Junctions
from pathlib import Path
from compass_functions import JunctionExtractor
from junction_comparison import aggregate_multi_aligner_junctions
from utils import load_intron_annotations
# Extract junctions from each aligner
aligners = ['minimap2', 'ultra', 'desalt', 'graphmap2']
junction_tables = {}
for aligner in aligners:
bam_path = Path(f"output_dir/{aligner}.sorted.bam")
fasta_path = Path("ref.fa")
extractor = JunctionExtractor(bam_path, fasta_path)
junctions = extractor.extract_all_junctions()
junction_tables[aligner] = junctions
extractor.close()
# Aggregate across aligners
aggregated = aggregate_multi_aligner_junctions(junction_tables)
# Load annotations
annotated_introns = load_intron_annotations(Path("introns.tsv"))
# Filter for high confidence
from junction_comparison import filter_high_confidence_junctions
high_confidence = filter_high_confidence_junctions(
aggregated,
min_reads=5,
min_aligners=2,
min_perfect_matches=5
)
print(f"Total junctions: {len(aggregated)}")
print(f"High-confidence: {len(high_confidence)}")
4. Analyze Results
from junction_comparison import compare_aligner_performance
# Compare aligner performance
stats = compare_aligner_performance(junction_tables, annotated_introns)
for aligner, metrics in stats.items():
print(f"\n{aligner}:")
print(f" Total junctions: {metrics['total_junctions']}")
print(f" Exact annotated: {metrics['exact_annotated_pct']:.1f}%")
print(f" Novel: {metrics['novel_pct']:.1f}%")
Filtering Strategy
Multi-Tier Filtering for Alternative Splice Sites
Tier 1: Minimum Read Support
- Threshold: ≥5 reads per junction
- Eliminates random alignment errors
Tier 2: Multi-Sample Consistency
- Threshold: Junction in ≥2 biological replicates
- Eliminates sample-specific artifacts
Tier 3: Flanking Sequence Quality (PRIMARY FILTER)
- Threshold: ≥5-6bp perfect matches on BOTH sides of junction
- Distinguishes true junctions from alignment artifacts
- Key insight: True ASS even 2bp from annotation should have clean flanks
Tier 4: Splice Site Motif Scoring (for classification, not exclusion)
- Score motifs but don't exclude based on non-canonical
- Non-canonical 3'SS (non-YAG) can be genuine (Meyer et al. 2023)
Tier 5: Multi-Aligner Consensus
- High confidence: ≥3 aligners OR ≥2 aligners including GraphMap2
- Medium confidence: ≥2 aligners with orthogonal algorithms
- Low confidence: Single aligner only
Recommended Thresholds
Conservative (default):
- Min reads: 5
- Min samples: 2
- Perfect matches: 5bp both sides
- Min aligners: 2
Publication-quality (stringent):
- Min reads: 10
- Min samples: 3
- Perfect matches: 6bp both sides
- Min aligners: 2 (with GraphMap2 validation)
Project Structure
nanopore-compass/
├── bin/ # Executable scripts
│ ├── nanopore-compass # Main pipeline
│ ├── align_and_process.sh # Multi-aligner wrapper
│ └── combine_junctions.py # Merge sample junctions
│
├── src/ # Core modules
│ ├── utils.py # Shared utilities
│ ├── sequence_analysis.py # Splice site scoring, motifs
│ ├── compass_functions.py # Junction extraction, ambiguity resolution
│ └── junction_comparison.py # Multi-aligner consensus
│
├── scripts/ # Analysis scripts
│ ├── rpg_analysis.py # RPG vs non-RPG comparison
│ ├── ass_detection.py # Alternative splice site detection
│ ├── quality_control.R # QC plots
│ └── aligner_comparison.R # Multi-aligner visualization
│
├── config/ # Configuration files
│ ├── yeast.yaml # S. cerevisiae parameters
│ └── human.yaml # H. sapiens parameters
│
├── reference/ # Reference data
│ ├── S_cerevisiae_all_introns.tsv
│ └── S_cerevisiae_rpg_genes.txt
│
└── test/ # Test data and scripts
└── test_data/
Configuration
Configuration files use YAML format with organism-specific parameters:
Key Sections:
reference files: Genome FASTA, annotations, intron catalogjunction_parameters: Intron size constraints, quality thresholdssplice_sites: Organism-specific consensus sequencesaligners: Paths, parameters, and settings for each alignerfiltering: Multi-tier thresholds for junction validationerror_weights: Nanopore-specific error tolerance
See config/yeast.yaml for complete example.
Advanced Usage
RPG (Ribosomal Protein Gene) Analysis
Compare splicing characteristics between highly-expressed RPG genes and genome-wide:
from scripts.rpg_analysis import compare_rpg_vs_nonrpg
results = compare_rpg_vs_nonrpg(
junctions_file="high_confidence_junctions.tsv",
rpg_genes_file="reference/S_cerevisiae_rpg_genes.txt"
)
# Results include:
# - Fidelity comparison (exact annotated %)
# - Alternative splicing frequency
# - Splice site motif strength
# - Condition-specific effects (WT vs mutant)
Alternative Splice Site Detection
from scripts.ass_detection import detect_alternative_splice_sites
ass_events = detect_alternative_splice_sites(
junctions_file="high_confidence_junctions.tsv",
annotated_introns="reference/S_cerevisiae_all_introns.tsv",
min_distance=10 # bp from annotated sites
)
# ASS types:
# - alt_5ss: Alternative 5' splice site
# - alt_3ss: Alternative 3' splice site
# - cryptic: Cryptic introns (within exons)
# - exon_skip: Exon skipping events
Ambiguous Junction Resolution
from compass_functions import resolve_ambiguous_junctions
from sequence_analysis import SpliceMotifScorer
# Initialize motif scorer
scorer = SpliceMotifScorer(
five_ss_consensus="GTATGT",
five_ss_penalties=[4, 3, 1, 1, 2, 1],
three_ss_consensus="YYYAG",
three_ss_penalties=[1, 1, 1, 3, 4]
)
# Resolve ambiguous junctions
resolved = resolve_ambiguous_junctions(
junction_dict=raw_junctions,
fasta_path=Path("ref.fa"),
motif_scorer=scorer,
max_shift=10
)
Nanopore-Specific Adaptations
Error-Weighted Scoring
# Nanopore error model
score = (
mismatches * 1.0 + # Full weight
indels * 0.5 + # Reduced (common in nanopore)
indels_in_homopolymer * 0.2 + # Minimal (expected errors)
soft_clips * 0.3 # Moderate penalty
)
Quality-Based Filtering
Instead of distance-based filtering, use perfect flanking sequence matches:
# Extract from BAM alignment
upstream_matches, downstream_matches = count_perfect_matches_at_junction(read, junction_pos)
# Filter criteria
if upstream_matches >= 5 and downstream_matches >= 5:
# High-quality junction (true biological event)
pass
else:
# Likely alignment artifact
reject()
This approach correctly identifies true ASS even when only 2-3bp from annotated sites.
Parallel Processing for Large Datasets
For datasets with 100M+ reads, use the parallelization pipeline for ~25x speedup.
Design Principle
Pre-alignment FASTQ chunking is required for COMPASS 4-way consensus:
- Same read can map to different genomic locations in different aligners
- COMPASS compares the SAME read across all 4 aligners
- Pre-alignment chunking guarantees same read order in all 4 output BAMs
Workflow
BAM → FASTQ → chunk (25x) → align with 4 aligners → 4-way consensus → merge
Usage
# Run complete parallel pipeline
bash scripts/run_parallel_pipeline.sh \
--bam input.bam \
--sample sample_name \
--chunks 25 \
--output parallel_output/
# This will:
# 1. Convert BAM to FASTQ (bedtools bamtofastq)
# 2. Split FASTQ into 25 chunks
# 3. Submit SLURM array job to align each chunk with all 4 aligners
# 4. Submit SLURM array job to process each chunk with 4-way consensus
# 5. Merge all chunks into final junction table
Performance
Single aligner (100M reads):
- Serial: 250 min (4.2 hours)
- Parallel (25 cores): 10 minutes
- Speedup: ~25x
4 aligners (100M reads):
- Serial: 1000 min (16.7 hours)
- Parallel (25 cores): 40-50 minutes
- Speedup: ~20-25x
Manual Workflow (Step-by-Step)
For more control, run steps individually:
# 1. Convert BAM to FASTQ
bedtools bamtofastq -i input.bam -fq input.fastq
# 2. Split FASTQ
python scripts/split_fastq.py \
--fastq input.fastq \
--chunks 25 \
--output fastq_chunks/
# 3. Align each chunk (SLURM array job)
sbatch --array=1-25 slurm/align_chunk_4way.sh
# 4. Process with consensus (SLURM array job)
sbatch --array=1-25 slurm/process_chunk_consensus.sh
# 5. Merge results
python scripts/merge_junction_chunks.py \
--input junction_chunks/ \
--output final_junctions.tsv
See docs/PARALLELIZATION.md for complete technical details.
Output Files
Junction Tables
Per-aligner junctions: {aligner}_junctions.tsv
- Columns: chrom, five_ss, three_ss, count, five_dinuc, three_dinuc, is_canonical, is_exact_annotated, etc.
Aggregated junctions: consensus_junctions.tsv
- Additional columns: num_aligners, aligner_support, confidence, validators
Filtered junctions: high_confidence_junctions.tsv
- Subset passing all filtering criteria
Statistics
Aligner comparison: aligner_performance.tsv
- Metrics per aligner: total junctions, exact annotated %, novel %
Filtering summary: filtering_stats.tsv
- Junction counts at each filtering tier
RPG analysis (if enabled): rpg_vs_nonrpg_summary.tsv
- Comparative metrics between RPG and non-RPG introns
Performance
Computational Requirements (yeast genome, 1M reads):
- Memory: 16-32 GB RAM
- Storage: ~50 GB for intermediate files
- Time: 4-8 hours (4 aligners, 8 threads each)
Speed by Aligner:
- Minimap2: ~5-10 minutes
- uLTRA: ~30-60 minutes
- deSALT: ~1-2 hours
- GraphMap2: ~4-6 hours (slowest but most accurate)
Optimization:
- Run aligners in parallel (separate SLURM jobs)
- Use Minimap2 + uLTRA for fast initial analysis
- Add GraphMap2 for high-confidence validation
Citation
If you use Nanopore COMPASS, please cite:
Original COMPASS:
Roy KR, et al. (2023) COMPASS: Comparison Of Multiple Alignment Programs for Alternative Splice site discovery. RNA, 29(11):1833-1847. PMID: 37956322
Nanopore Adaptations:
[Your publication using this framework]
Aligners:
- Minimap2: Li H. (2018) Bioinformatics 34(18):3094-3100
- uLTRA: Sahlin K & Mäkinen V. (2021) Genome Biol 22:142
- deSALT: Liu B, et al. (2019) Genome Biol 20:274
- GraphMap2: Sović I, et al. (2016) Nat Commun 7:11307
License
[Specify license - consider MIT or GPL to match original COMPASS]
Contact
Kevin R. Roy Larsmon Lab, Stanford University [Contact information]
References
- Original COMPASS: https://github.com/k-roy/COMPASS
- Non-canonical splice sites: Meyer et al. (2023), PMC10711555
- Yeast genome: Saccharomyces Genome Database (SGD)
- Nanopore basecalling: Oxford Nanopore Technologies
Version History
v1.0.0 (2026-03-09)
- Initial implementation
- 4-aligner consensus strategy
- Nanopore error model
- Yeast configuration
- RPG analysis module
- Alternative splice site detection
Last updated: 2026-03-09
Project details
Release history Release notifications | RSS feed
Download files
Download the file for your platform. If you're not sure which to choose, learn more about installing packages.
Source Distribution
Built Distribution
Filter files by name, interpreter, ABI, and platform.
If you're not sure about the file name format, learn more about wheel file names.
Copy a direct link to the current filters
File details
Details for the file nanocompass-0.1.0.tar.gz.
File metadata
- Download URL: nanocompass-0.1.0.tar.gz
- Upload date:
- Size: 9.4 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.12.2
File hashes
| Algorithm | Hash digest | |
|---|---|---|
| SHA256 |
c38b1df97d5a04b30b8bb6dcc2d9bb657d805cccbd5736c11663b25f56df4cc5
|
|
| MD5 |
45f5e56646217a975ff26fd29019b1f3
|
|
| BLAKE2b-256 |
b62670ab5f73f6015f7ec729c74f3c04020485d1a7be05350aed966ff9c2da92
|
File details
Details for the file nanocompass-0.1.0-py3-none-any.whl.
File metadata
- Download URL: nanocompass-0.1.0-py3-none-any.whl
- Upload date:
- Size: 8.8 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.12.2
File hashes
| Algorithm | Hash digest | |
|---|---|---|
| SHA256 |
9ea01b61949f0c5c7bed7931a35a1ed97740bc811984840870e8ebf79b6112cb
|
|
| MD5 |
395be29fac6a780b0ef716ce5590f70d
|
|
| BLAKE2b-256 |
ab18d992e3452f6c645ca2b89d9759c771f002dc2d68a0471312aa9c2a83fb17
|