RNA-seq Analysis Pipeline Testing and Optimization Resource with ML-powered recommendations and adaptive threshold optimization

RAPTOR v2.1.1

RAPTOR

RNA-seq Analysis Pipeline Testing and Optimization Resource

Making free science for everybody around the world 🌍

Quick Start • Features • Installation • Documentation • Pipelines • Citation


🆕 What's New in v2.1.1

Adaptive Threshold Optimizer (ATO)

Stop using arbitrary thresholds! The new Adaptive Threshold Optimizer determines data-driven significance cutoffs for differential expression analysis.

from raptor.threshold_optimizer import optimize_thresholds
import pandas as pd

df = pd.read_csv('deseq2_results.csv')
result = optimize_thresholds(df, goal='discovery')

print(f"Optimal logFC: {result.logfc_threshold:.2f}")
print(f"Significant genes: {result.n_significant}")
print(f"\n{result.methods_text}")  # Publication-ready!

Key Features:

  • Multiple p-value adjustment methods (BH, BY, Storey q-value, Holm, Bonferroni)
  • Five logFC optimization methods (MAD, mixture model, power-based, percentile, consensus)
  • π₀ estimation for the true null proportion
  • Three analysis goals: discovery, balanced, validation
  • Auto-generated publication methods text
  • Interactive dashboard integration
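
To make the first item in the list above concrete, the Benjamini-Hochberg (BH) adjustment can be written in a few lines of plain Python. This is a minimal sketch of the standard procedure, not RAPTOR's own code; the function name `benjamini_hochberg` and the sample p-values are assumptions made for illustration.

```python
def benjamini_hochberg(pvalues):
    """Return BH-adjusted p-values in the same order as the input."""
    m = len(pvalues)
    # Indices of the p-values, sorted from smallest to largest
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so adjusted values stay monotone
    for rank in range(m, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, pvalues[idx] * m / rank)
        adjusted[idx] = running_min
    return adjusted


# Hypothetical raw p-values for four genes
pvals = [0.01, 0.04, 0.03, 0.005]
padj = benjamini_hochberg(pvals)
significant = [p <= 0.05 for p in padj]
print(padj, significant)
```

In practice a maintained implementation, such as the one bundled with the package, is preferable to a hand-rolled version.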

What is RAPTOR?

RAPTOR is a comprehensive framework for benchmarking and optimizing RNA-seq differential expression analysis pipelines. Instead of guessing which pipeline works best for your data, RAPTOR provides evidence-based, ML-powered recommendations through systematic comparison of 8 popular pipelines.

Why RAPTOR?

Challenge                              RAPTOR Solution
Which pipeline should I use?           ✅ ML recommendations with 87% accuracy
What thresholds should I use?          ✅ Adaptive Threshold Optimizer (NEW!)
Is my data quality good enough?        ✅ Quality assessment with batch effect detection
How do I know results are reliable?    ✅ Ensemble analysis combining multiple pipelines
What resources do I need?              ✅ Resource monitoring with predictions
How do I present results?              ✅ Publication-ready automated reports

Features

Adaptive Threshold Optimizer (NEW!)

  • Data-driven logFC and p-value thresholds
  • Multiple statistical methods
  • Publication-ready methods text
  • Interactive dashboard page
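
One way to picture a data-driven logFC cutoff is the MAD (median absolute deviation) approach named among the ATO methods. The sketch below is an assumption about how such a rule could look, not a description of RAPTOR's internals: it treats the bulk of the logFC values as noise, estimates their spread robustly, and sets the cutoff at `k` robust standard deviations. The function name and the default `k=2.0` are hypothetical.

```python
import statistics


def mad_logfc_threshold(logfcs, k=2.0):
    """Return k robust standard deviations of the logFC distribution."""
    center = statistics.median(logfcs)
    # Median absolute deviation from the median
    mad = statistics.median(abs(x - center) for x in logfcs)
    # 1.4826 rescales the MAD to match the standard deviation
    # of a normal distribution
    robust_sd = 1.4826 * mad
    return k * robust_sd


# Hypothetical log2 fold changes: mostly noise plus two strong effects
logfcs = [-0.2, 0.1, 0.0, 0.3, -0.1, 2.5, -3.0]
threshold = mad_logfc_threshold(logfcs)
passing = [x for x in logfcs if abs(x) >= threshold]
print(f"|logFC| cutoff: {threshold:.3f}, passing: {passing}")
```

Because the median and MAD ignore extreme values, a handful of strongly changed genes does not inflate the cutoff the way a plain standard deviation would.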

ML-Based Recommendations

  • 87% prediction accuracy
  • Confidence scoring (0-100%)
  • Learns from 10,000+ analyses
  • Explains its reasoning

Quality Assessment

  • 6-component quality scoring
  • Batch effect detection
  • Outlier identification
  • Actionable recommendations

Ensemble Analysis

  • 5 combination methods
  • 33% fewer false positives
  • High-confidence gene lists
  • Consensus validation

Interactive Dashboard

  • Web-based interface (no coding!)
  • Real-time visualizations
  • Drag-and-drop data upload
  • One-click reports

Resource Monitoring

  • Real-time CPU/memory tracking
  • <1% performance overhead
  • Resource predictions
  • Cost estimation for cloud
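
The idea behind resource tracking can be sketched with the Python standard library alone. This is an illustrative assumption, not RAPTOR's monitoring code: `run_with_monitoring` is a hypothetical helper, and `tracemalloc` only sees memory allocated by Python itself (not by external aligners or R) and adds noticeable overhead of its own, so it does not reflect the overhead figure quoted above.

```python
import time
import tracemalloc


def run_with_monitoring(func, *args, **kwargs):
    """Run func and report wall time, CPU time, and peak Python memory."""
    tracemalloc.start()
    wall_start = time.perf_counter()
    cpu_start = time.process_time()
    try:
        result = func(*args, **kwargs)
    finally:
        # Collect measurements even if func raised an exception
        cpu_seconds = time.process_time() - cpu_start
        wall_seconds = time.perf_counter() - wall_start
        _, peak_bytes = tracemalloc.get_traced_memory()
        tracemalloc.stop()
    stats = {
        "wall_seconds": wall_seconds,
        "cpu_seconds": cpu_seconds,
        "peak_bytes": peak_bytes,
    }
    return result, stats


# Example: monitor a small in-memory computation
result, stats = run_with_monitoring(sum, range(1000))
print(result, stats)
```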

Quick Start

Option 1: Interactive Dashboard (Recommended)

# Install
pip install raptor-rnaseq

# Launch dashboard
raptor dashboard

# Opens at http://localhost:8501
# Upload data → Get ML recommendation → Use 🎯 Threshold Optimizer → Done!

Option 2: Command Line

# Profile your data and get ML recommendation
raptor profile --counts counts.csv --metadata metadata.csv --use-ml

# Run recommended pipeline
raptor run --pipeline 3 --data fastq/ --output results/

# Optimize thresholds (NEW!)
raptor optimize-thresholds --input results.csv --goal balanced

# Generate report
raptor report --results results/ --output report.html

Option 3: Python API

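import pandas as pd

from raptor import RNAseqDataProfiler, MLPipelineRecommender
from raptor.threshold_optimizer import optimize_thresholds

# Load counts and sample metadata (same files as in the command-line example)
counts = pd.read_csv('counts.csv', index_col=0)
metadata = pd.read_csv('metadata.csv')

# Profile your data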
profiler = RNAseqDataProfiler(counts, metadata)
profile = profiler.run_full_profile()

# Get ML recommendation
recommender = MLPipelineRecommender()
recommendation = recommender.recommend(profile)

print(f"Recommended: Pipeline {recommendation['pipeline_id']}")
print(f"Confidence: {recommendation['confidence']:.1%}")

# After running pipeline, optimize thresholds (NEW!)
de_results = pd.read_csv('de_results.csv')
result = optimize_thresholds(de_results, goal='balanced')
print(f"Optimal |logFC|: {result.logfc_threshold:.2f}")
print(result.methods_text)

Installation

Requirements

  • Python: 3.8 or higher
  • R: 4.0 or higher (for DE analysis)
  • RAM: 8GB minimum (16GB recommended)
  • Disk: 10GB free space
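
Some of the requirements above can be checked before installing. The following is a minimal sketch using only the standard library; `check_prerequisites` is a hypothetical helper, it only tests whether `Rscript` is on the PATH (not the R version), and it skips the RAM check because that needs a third-party package.

```python
import shutil
import sys


def check_prerequisites(min_python=(3, 8), path="."):
    """Collect a quick report on Python version, R availability, and disk space."""
    free_bytes = shutil.disk_usage(path).free
    return {
        # True when the running interpreter meets the minimum version
        "python_ok": sys.version_info >= min_python,
        # Full path to Rscript, or None when R is not on the PATH
        "rscript_path": shutil.which("Rscript"),
        # Free space on the volume holding `path`, in GB
        "free_disk_gb": free_bytes / 1024**3,
    }


report = check_prerequisites()
for key, value in report.items():
    print(f"{key}: {value}")
```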

Install from PyPI (Recommended)

pip install raptor-rnaseq

With optional dependencies:

# With dashboard support (the quotes keep shells such as zsh from expanding the brackets)
pip install "raptor-rnaseq[dashboard]"

# With all features
pip install "raptor-rnaseq[all]"
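
After installing, it can be useful to confirm which release pip actually picked up. A small sketch using the standard library's `importlib.metadata` follows; `installed_version` is a hypothetical helper name, and the distribution name is taken from the install command above.

```python
from importlib import metadata


def installed_version(dist_name="raptor-rnaseq"):
    """Return the installed version string, or None if it is not installed."""
    try:
        return metadata.version(dist_name)
    except metadata.PackageNotFoundError:
        return None


version = installed_version()
if version is None:
    print("raptor-rnaseq is not installed in this environment")
else:
    print(f"raptor-rnaseq {version} is installed")
```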

Install from GitHub

# Clone repository
git clone https://github.com/AyehBlk/RAPTOR.git
cd RAPTOR

# Install Python dependencies
pip install -r requirements.txt

# Verify installation
python install.py

Conda Environment

conda env create -f environment.yml
conda activate raptor

Pipelines

RAPTOR benchmarks 8 RNA-seq analysis pipelines:

ID  Pipeline                    Aligner   Quantifier     DE Tool     Speed        ML Rank
1   STAR-RSEM-DESeq2            STAR      RSEM           DESeq2      ⭐⭐           #2
2   HISAT2-StringTie-Ballgown   HISAT2    StringTie      Ballgown    ⭐⭐⭐          #5
3   Salmon-edgeR ⭐              Salmon    Salmon         edgeR       ⭐⭐⭐⭐⭐        #1
4   Kallisto-Sleuth             Kallisto  Kallisto       Sleuth      ⭐⭐⭐⭐⭐        #3
5   STAR-HTSeq-limma            STAR      HTSeq          limma-voom  ⭐⭐           #4
6   STAR-featureCounts-NOISeq   STAR      featureCounts  NOISeq      ⭐⭐           #6
7   Bowtie2-RSEM-EBSeq          Bowtie2   RSEM           EBSeq       ⭐⭐           #7
8   HISAT2-Cufflinks-Cuffdiff   HISAT2    Cufflinks      Cuffdiff    ⭐            #8

⭐ Pipeline 3 (Salmon-edgeR) is the ML model's most frequently recommended pipeline due to its optimal speed/accuracy balance.
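
When scripting around `raptor run --pipeline <ID>`, it can help to have the table above available as data. The sketch below simply transcribes that table into a dictionary; the `PIPELINES` name and its field names are assumptions for illustration and are not part of the RAPTOR API. Speed is stored as the number of stars.

```python
# Pipeline table transcribed from the README (speed = number of stars)
PIPELINES = {
    1: {"name": "STAR-RSEM-DESeq2", "aligner": "STAR", "quantifier": "RSEM",
        "de_tool": "DESeq2", "speed": 2, "ml_rank": 2},
    2: {"name": "HISAT2-StringTie-Ballgown", "aligner": "HISAT2", "quantifier": "StringTie",
        "de_tool": "Ballgown", "speed": 3, "ml_rank": 5},
    3: {"name": "Salmon-edgeR", "aligner": "Salmon", "quantifier": "Salmon",
        "de_tool": "edgeR", "speed": 5, "ml_rank": 1},
    4: {"name": "Kallisto-Sleuth", "aligner": "Kallisto", "quantifier": "Kallisto",
        "de_tool": "Sleuth", "speed": 5, "ml_rank": 3},
    5: {"name": "STAR-HTSeq-limma", "aligner": "STAR", "quantifier": "HTSeq",
        "de_tool": "limma-voom", "speed": 2, "ml_rank": 4},
    6: {"name": "STAR-featureCounts-NOISeq", "aligner": "STAR", "quantifier": "featureCounts",
        "de_tool": "NOISeq", "speed": 2, "ml_rank": 6},
    7: {"name": "Bowtie2-RSEM-EBSeq", "aligner": "Bowtie2", "quantifier": "RSEM",
        "de_tool": "EBSeq", "speed": 2, "ml_rank": 7},
    8: {"name": "HISAT2-Cufflinks-Cuffdiff", "aligner": "HISAT2", "quantifier": "Cufflinks",
        "de_tool": "Cuffdiff", "speed": 1, "ml_rank": 8},
}

# Pipeline IDs ordered from best to worst ML rank
ranked = sorted(PIPELINES, key=lambda pid: PIPELINES[pid]["ml_rank"])

# Pipeline IDs sharing the highest speed rating
max_speed = max(p["speed"] for p in PIPELINES.values())
fastest = [pid for pid, p in PIPELINES.items() if p["speed"] == max_speed]

print("By ML rank:", ranked)
print("Fastest:", fastest)
```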


Repository Structure

RAPTOR/
├── raptor/                 # Core Python package
│   ├── profiler.py         # Data profiling
│   ├── recommender.py      # Rule-based recommendations
│   ├── ml_recommender.py   # ML recommendations
│   ├── threshold_optimizer/ # 🆕 Adaptive Threshold Optimizer (v2.1.1)
│   │   ├── __init__.py
│   │   ├── ato.py          # Core ATO class
│   │   └── visualization.py # ATO visualizations
│   ├── data_quality_assessment.py
│   ├── ensemble_analysis.py
│   ├── resource_monitoring.py
│   └── ...
├── dashboard/              # Interactive web dashboard
├── pipelines/              # Pipeline configurations (8 pipelines)
├── scripts/                # Workflow scripts (00-10)
├── examples/               # Example scripts & demos
├── tests/                  # Test suite
├── docs/                   # Documentation
├── config/                 # Configuration templates
├── install.py              # Master installer
├── launch_dashboard.py     # Dashboard launcher
├── requirements.txt        # Python dependencies
└── setup.py                # Package setup

Documentation

Getting Started

Document           Description
INSTALLATION.md    Detailed installation guide
QUICK_START.md     5-minute quick start
DASHBOARD.md       Interactive dashboard guide

Core Features

Document                  Description
THRESHOLD_OPTIMIZER.md    🆕 Adaptive threshold optimization
PROFILE_RECOMMEND.md      Data profiling & recommendations
QUALITY_ASSESSMENT.md     Quality scoring & batch effects
BENCHMARKING.md           Pipeline benchmarking

Advanced Features

Document                  Description
ENSEMBLE.md               Multi-pipeline ensemble analysis
RESOURCE_MONITORING.md    Resource tracking
CLOUD_DEPLOYMENT.md       AWS/GCP/Azure deployment

Reference

Document        Description
PIPELINES.md    Pipeline details & selection guide
API.md          Python API reference
FAQ.md          Frequently asked questions
CHANGELOG.md    Version history

Usage Examples

Example 1: Quick Threshold Optimization (NEW!)

from raptor.threshold_optimizer import optimize_thresholds
import pandas as pd

# Load DE results
df = pd.read_csv('deseq2_results.csv')

# Optimize thresholds
result = optimize_thresholds(df, goal='balanced')

print(f"Optimal |logFC|: {result.logfc_threshold:.3f}")
print(f"Significant genes: {result.n_significant}")
print(f"π₀ estimate: {result.pi0:.3f}")

# Get publication methods text
print(result.methods_text)

# Save results
result.results_df.to_csv('optimized_results.csv')
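
For readers unfamiliar with the π₀ value printed above: it estimates the fraction of genes with no true differential expression. One common estimator is Storey's fixed-λ rule, π̂₀(λ) = #{p > λ} / (m · (1 − λ)), which relies on null p-values being uniformly distributed. The sketch below shows that rule in plain Python; it is not necessarily how RAPTOR computes `result.pi0`, and the function name and the choice λ = 0.5 are assumptions.

```python
def estimate_pi0(pvalues, lam=0.5):
    """Storey's fixed-lambda estimate of the proportion of true nulls."""
    if not pvalues:
        raise ValueError("pvalues must not be empty")
    if not 0 <= lam < 1:
        raise ValueError("lam must be in [0, 1)")
    m = len(pvalues)
    # p-values above lam are assumed to come almost entirely from true nulls
    n_above = sum(1 for p in pvalues if p > lam)
    # Null p-values are uniform, so a fraction (1 - lam) of them exceeds lam
    return min(1.0, n_above / (m * (1.0 - lam)))


# Hypothetical p-values for ten genes
pvals = [0.001, 0.002, 0.01, 0.03, 0.2, 0.4, 0.6, 0.7, 0.8, 0.9]
pi0 = estimate_pi0(pvals)
print(f"π₀ estimate: {pi0:.2f}")
```

A π₀ well below 1 suggests many genes are truly changing, which is the situation where the Storey q-value is less conservative than BH.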

Example 2: Full Workflow

from raptor import RNAseqDataProfiler, MLPipelineRecommender
from raptor.threshold_optimizer import optimize_thresholds
import pandas as pd

# 1. Profile data
counts = pd.read_csv('counts.csv', index_col=0)
metadata = pd.read_csv('metadata.csv')

profiler = RNAseqDataProfiler(counts, metadata, use_ml=True)
profile = profiler.profile(quality_check=True)
print(f"Quality Score: {profile['quality_score']}/100")

# 2. Get ML recommendation
recommender = MLPipelineRecommender()
recommendations = recommender.recommend(profile, n=3)
print(f"Recommended: {recommendations[0]['pipeline_name']}")

# 3. [Run recommended pipeline - produces DE results]
# raptor run --pipeline 3 ...

# 4. Optimize thresholds (NEW in v2.1.1)
de_results = pd.read_csv('deseq2_results.csv')
result = optimize_thresholds(
    de_results,
    logfc_col='log2FoldChange',
    pvalue_col='pvalue',
    goal='balanced'
)

print("\n🎯 Optimized Thresholds:")
print(f"   LogFC: |{result.logfc_threshold:.3f}|")
print(f"   Significant: {result.n_significant} genes")

# 5. Save results with methods text
result.results_df.to_csv('final_results.csv')
with open('methods.txt', 'w') as f:
    f.write(result.methods_text)

Example 3: Ensemble Analysis with ATO

from raptor.ensemble_analysis import EnsembleAnalyzer
from raptor.threshold_optimizer import optimize_thresholds

# Combine results from multiple pipelines
# (df1, df2, df3 are the DE result tables from each tool, loaded beforehand)
analyzer = EnsembleAnalyzer()
consensus = analyzer.combine_results(
    results_dict={'deseq2': df1, 'edger': df2, 'limma': df3},
    method='weighted_vote',
    min_agreement=2
)

# Use ATO for uniform thresholds across ensemble
result = optimize_thresholds(consensus['combined'], goal='balanced')
print(f"Consensus DE genes: {result.n_significant}")
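
The `min_agreement` idea in the example above can be illustrated without any RAPTOR code: a gene enters the consensus list only if enough pipelines call it significant. This is a simplified, unweighted vote written for illustration; it is not the `weighted_vote` method, and the function name and gene labels are hypothetical.

```python
from collections import Counter


def consensus_genes(gene_sets, min_agreement=2):
    """Return genes called significant by at least min_agreement pipelines."""
    votes = Counter()
    for genes in gene_sets.values():
        # set() guards against a pipeline listing the same gene twice
        votes.update(set(genes))
    return sorted(g for g, n in votes.items() if n >= min_agreement)


# Hypothetical significant-gene calls from three DE tools
calls = {
    "deseq2": {"GeneA", "GeneB", "GeneC"},
    "edger": {"GeneB", "GeneC", "GeneD"},
    "limma": {"GeneC", "GeneE"},
}
consensus = consensus_genes(calls, min_agreement=2)
print(consensus)
```

Raising `min_agreement` trades sensitivity for confidence: requiring all three tools here would keep only the gene every pipeline agrees on.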

Performance

ML Recommendation Accuracy

Metric              Value
Overall Accuracy    87%
Top-3 Accuracy      96%
Prediction Time     <0.1s
Training Data       10,000+ analyses

Threshold Optimizer Benefits

Metric                     Traditional    With ATO
Threshold justification    Arbitrary      Data-driven
Methods text               Manual         Auto-generated
False positives            Higher         Optimized
Reproducibility            Variable       Standardized

Contributing

We welcome contributions! RAPTOR is open-source and aims to make free science accessible to everyone.

# Fork and clone
git clone https://github.com/YOUR_USERNAME/RAPTOR.git

# Create feature branch
git checkout -b feature/amazing-feature

# Make changes and test
pytest tests/

# Submit pull request

See CONTRIBUTING.md for guidelines.


Citation

If you use RAPTOR in your research, please cite:

@software{bolouki2025raptor,
  author       = {Bolouki, Ayeh},
  title        = {RAPTOR: RNA-seq Analysis Pipeline Testing and Optimization Resource},
  year         = {2025},
  version      = {2.1.1},
  publisher    = {Zenodo},
  doi          = {10.5281/zenodo.17607161},
  url          = {https://github.com/AyehBlk/RAPTOR}
}


License

This project is licensed under the MIT License - see the LICENSE file for details.

MIT License
Copyright (c) 2025 Ayeh Bolouki

Contact

Ayeh Bolouki

  • 🏛️ GIGA, University of Liège, Belgium
  • 📧 Email: ayehbolouki1988@gmail.com
  • 🐙 GitHub: @AyehBlk
  • 🔬 Research: Computational Biology, Bioinformatics, Multi-omics Analysis

Acknowledgments

  • The Bioconductor community for the R package ecosystem
  • All users who provided feedback

⭐ Star this repository if you find RAPTOR useful!

RAPTOR v2.1.1 - Making pipeline selection evidence-based, not guesswork 🦖
