RNA-seq Analysis Pipeline Testing and Optimization Resource with ML-powered recommendations and adaptive threshold optimization
RAPTOR
RNA-seq Analysis Pipeline Testing and Optimization Resource
Making free science for everybody around the world
Quick Start • Features • Installation • Documentation • Pipelines • Citation
What's New in v2.1.1
Adaptive Threshold Optimizer (ATO)
Stop using arbitrary thresholds! The new Adaptive Threshold Optimizer determines data-driven significance cutoffs for differential expression analysis.
from raptor.threshold_optimizer import optimize_thresholds
import pandas as pd
df = pd.read_csv('deseq2_results.csv')
result = optimize_thresholds(df, goal='discovery')
print(f"Optimal logFC: {result.logfc_threshold:.2f}")
print(f"Significant genes: {result.n_significant}")
print(f"\n{result.methods_text}") # Publication-ready!
Key Features:
- Multiple p-value adjustment methods (BH, BY, Storey q-value, Holm, Bonferroni)
- Five logFC optimization methods (MAD, mixture model, power-based, percentile, consensus)
- π₀ estimation for true null proportion
- Three analysis goals: discovery, balanced, validation
- Auto-generated publication methods text
- Interactive dashboard integration
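To make the first bullet concrete, here is a minimal sketch of Benjamini-Hochberg (BH) adjustment written with NumPy. It is not RAPTOR's implementation; the function name `benjamini_hochberg` is hypothetical and only illustrates what a BH-adjusted p-value is.

```python
import numpy as np

def benjamini_hochberg(pvalues):
    """Return BH-adjusted p-values in the original input order (illustrative helper)."""
    p = np.asarray(pvalues, dtype=float)
    m = p.size
    order = np.argsort(p)                          # indices of p-values, ascending
    ranked = p[order] * m / np.arange(1, m + 1)    # p_(i) * m / i
    # Enforce monotonicity starting from the largest p-value, then cap at 1.
    ranked = np.minimum.accumulate(ranked[::-1])[::-1]
    adjusted = np.empty(m, dtype=float)
    adjusted[order] = np.clip(ranked, 0.0, 1.0)
    return adjusted

pvals = [0.01, 0.04, 0.03, 0.005]
padj = benjamini_hochberg(pvals)
print(padj)
```

Genes whose adjusted p-value falls below the chosen level (for example 0.05) are the ones reported at that false discovery rate.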
What is RAPTOR?
RAPTOR is a comprehensive framework for benchmarking and optimizing RNA-seq differential expression analysis pipelines. Instead of guessing which pipeline works best for your data, RAPTOR provides evidence-based, ML-powered recommendations through systematic comparison of 8 popular pipelines.
Why RAPTOR?
| Challenge | RAPTOR Solution |
|---|---|
| Which pipeline should I use? | ML recommendations with 87% accuracy |
| What thresholds should I use? | Adaptive Threshold Optimizer (NEW!) |
| Is my data quality good enough? | Quality assessment with batch effect detection |
| How do I know results are reliable? | Ensemble analysis combining multiple pipelines |
| What resources do I need? | Resource monitoring with predictions |
| How do I present results? | Automated, publication-ready reports |
Features
- Adaptive Threshold Optimizer (NEW!)
- ML-Based Recommendations
- Quality Assessment
- Ensemble Analysis
- Interactive Dashboard
- Resource Monitoring
Quick Start
Option 1: Interactive Dashboard (Recommended)
# Install
pip install raptor-rnaseq
# Launch dashboard
raptor dashboard
# Opens at http://localhost:8501
# Upload data → Get ML recommendation → Use Threshold Optimizer → Done!
Option 2: Command Line
# Profile your data and get ML recommendation
raptor profile --counts counts.csv --metadata metadata.csv --use-ml
# Run recommended pipeline
raptor run --pipeline 3 --data fastq/ --output results/
# Optimize thresholds (NEW!)
raptor optimize-thresholds --input results.csv --goal balanced
# Generate report
raptor report --results results/ --output report.html
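Before passing `counts.csv` and `metadata.csv` to the profiler, it can help to confirm that both files describe the same samples. The sketch below assumes a genes-by-samples count table and a metadata column named `sample`; both the layout and the helper `check_sample_ids` are assumptions for illustration, not part of RAPTOR.

```python
import pandas as pd

def check_sample_ids(counts, metadata, sample_col="sample"):
    """Report sample IDs that appear in only one of the two tables (hypothetical helper)."""
    count_ids = set(counts.columns)          # assumption: one column per sample
    meta_ids = set(metadata[sample_col])     # assumption: IDs live in `sample_col`
    return {
        "missing_in_metadata": sorted(count_ids - meta_ids),
        "missing_in_counts": sorted(meta_ids - count_ids),
    }

# Small in-memory stand-ins for counts.csv and metadata.csv
counts = pd.DataFrame(
    {"S1": [10, 0], "S2": [12, 3], "S3": [7, 1]},
    index=["geneA", "geneB"],
)
metadata = pd.DataFrame(
    {"sample": ["S1", "S2", "S4"], "condition": ["ctrl", "ctrl", "treated"]}
)

report = check_sample_ids(counts, metadata)
print(report)
```

An empty list under both keys means the two files line up.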
Option 3: Python API
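import pandas as pd
from raptor import RNAseqDataProfiler, MLPipelineRecommender
from raptor.threshold_optimizer import optimize_thresholds
# Load counts and sample metadata
counts = pd.read_csv('counts.csv', index_col=0)
metadata = pd.read_csv('metadata.csv')
# Profile your data
profiler = RNAseqDataProfiler(counts, metadata)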
profile = profiler.run_full_profile()
# Get ML recommendation
recommender = MLPipelineRecommender()
recommendation = recommender.recommend(profile)
print(f"Recommended: Pipeline {recommendation['pipeline_id']}")
print(f"Confidence: {recommendation['confidence']:.1%}")
# After running pipeline, optimize thresholds (NEW!)
de_results = pd.read_csv('de_results.csv')
result = optimize_thresholds(de_results, goal='balanced')
print(f"Optimal |logFC|: {result.logfc_threshold:.2f}")
print(result.methods_text)
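Once a |logFC| threshold has been chosen, applying it is a plain pandas filter. The sketch below is independent of RAPTOR: the column names follow DESeq2 conventions (`log2FoldChange`, `padj`), and the helper `call_significant` and the 0.05 level are assumptions for illustration.

```python
import pandas as pd

def call_significant(de, logfc_threshold, alpha=0.05,
                     logfc_col="log2FoldChange", padj_col="padj"):
    """Keep genes passing both the effect-size and the adjusted p-value cutoff."""
    keep = (de[logfc_col].abs() >= logfc_threshold) & (de[padj_col] < alpha)
    return de.loc[keep]

# Toy DE table standing in for de_results.csv
de = pd.DataFrame(
    {"log2FoldChange": [2.1, -0.3, -1.8, 0.9],
     "padj": [0.001, 0.20, 0.03, 0.04]},
    index=["geneA", "geneB", "geneC", "geneD"],
)

hits = call_significant(de, logfc_threshold=1.0)
print(list(hits.index))
```

Rows with a missing adjusted p-value are dropped automatically, because comparisons against NaN evaluate to False.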
Installation
Requirements
- Python: 3.8 or higher
- R: 4.0 or higher (for DE analysis)
- RAM: 8GB minimum (16GB recommended)
- Disk: 10GB free space
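Some of these requirements can be checked from Python before installing. The sketch below uses only the standard library; `check_requirements` is a hypothetical helper, and it only tests whether `Rscript` is on the PATH, not the R version or the available RAM.

```python
import shutil
import sys

def check_requirements(min_python=(3, 8), min_free_gb=10, path="."):
    """Check Python version, presence of Rscript, and free disk space (illustrative)."""
    free_gb = shutil.disk_usage(path).free / 1e9
    return {
        "python_ok": sys.version_info >= min_python,
        "rscript_found": shutil.which("Rscript") is not None,
        "disk_ok": free_gb >= min_free_gb,
    }

status = check_requirements()
for name, ok in status.items():
    print(f"{name}: {ok}")
```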
Install from PyPI (Recommended)
pip install raptor-rnaseq
With optional dependencies:
# With dashboard support
pip install "raptor-rnaseq[dashboard]"
# With all features
pip install "raptor-rnaseq[all]"
Install from GitHub
# Clone repository
git clone https://github.com/AyehBlk/RAPTOR.git
cd RAPTOR
# Install Python dependencies
pip install -r requirements.txt
# Verify installation
python install.py
Conda Environment
conda env create -f environment.yml
conda activate raptor
Pipelines
RAPTOR benchmarks 8 RNA-seq analysis pipelines:
| ID | Pipeline | Aligner | Quantifier | DE Tool | Speed | ML Rank |
|---|---|---|---|---|---|---|
| 1 | STAR-RSEM-DESeq2 | STAR | RSEM | DESeq2 | ⭐⭐ | #2 |
| 2 | HISAT2-StringTie-Ballgown | HISAT2 | StringTie | Ballgown | ⭐⭐⭐ | #5 |
| 3 | Salmon-edgeR ⭐ | Salmon | Salmon | edgeR | ⭐⭐⭐⭐⭐ | #1 |
| 4 | Kallisto-Sleuth | Kallisto | Kallisto | Sleuth | ⭐⭐⭐⭐⭐ | #3 |
| 5 | STAR-HTSeq-limma | STAR | HTSeq | limma-voom | ⭐⭐ | #4 |
| 6 | STAR-featureCounts-NOISeq | STAR | featureCounts | NOISeq | ⭐⭐ | #6 |
| 7 | Bowtie2-RSEM-EBSeq | Bowtie2 | RSEM | EBSeq | ⭐⭐ | #7 |
| 8 | HISAT2-Cufflinks-Cuffdiff | HISAT2 | Cufflinks | Cuffdiff | ⭐ | #8 |
⭐ Pipeline 3 (Salmon-edgeR) is the ML model's most frequently recommended pipeline due to its optimal speed/accuracy balance.
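For scripting around the `--pipeline` ID, the table can be mirrored as a small lookup structure. The sketch below copies the IDs, tools, and ML ranks from the table; the `Pipeline` type, the `PIPELINES` mapping, and `by_ml_rank` are hypothetical names and not part of the RAPTOR package.

```python
from typing import NamedTuple

class Pipeline(NamedTuple):
    name: str
    aligner: str
    quantifier: str
    de_tool: str
    ml_rank: int

# Values transcribed from the pipelines table
PIPELINES = {
    1: Pipeline("STAR-RSEM-DESeq2", "STAR", "RSEM", "DESeq2", 2),
    2: Pipeline("HISAT2-StringTie-Ballgown", "HISAT2", "StringTie", "Ballgown", 5),
    3: Pipeline("Salmon-edgeR", "Salmon", "Salmon", "edgeR", 1),
    4: Pipeline("Kallisto-Sleuth", "Kallisto", "Kallisto", "Sleuth", 3),
    5: Pipeline("STAR-HTSeq-limma", "STAR", "HTSeq", "limma-voom", 4),
    6: Pipeline("STAR-featureCounts-NOISeq", "STAR", "featureCounts", "NOISeq", 6),
    7: Pipeline("Bowtie2-RSEM-EBSeq", "Bowtie2", "RSEM", "EBSeq", 7),
    8: Pipeline("HISAT2-Cufflinks-Cuffdiff", "HISAT2", "Cufflinks", "Cuffdiff", 8),
}

def by_ml_rank(pipelines=PIPELINES):
    """Return pipeline IDs ordered from best to worst ML rank."""
    return sorted(pipelines, key=lambda pid: pipelines[pid].ml_rank)

ranked_ids = by_ml_rank()
print(ranked_ids)
print(PIPELINES[ranked_ids[0]].name)
```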
Repository Structure
RAPTOR/
├── raptor/                      # Core Python package
│   ├── profiler.py              # Data profiling
│   ├── recommender.py           # Rule-based recommendations
│   ├── ml_recommender.py        # ML recommendations
│   ├── threshold_optimizer/     # Adaptive Threshold Optimizer (new in v2.1.1)
│   │   ├── __init__.py
│   │   ├── ato.py               # Core ATO class
│   │   └── visualization.py     # ATO visualizations
│   ├── data_quality_assessment.py
│   ├── ensemble_analysis.py
│   ├── resource_monitoring.py
│   └── ...
├── dashboard/                   # Interactive web dashboard
├── pipelines/                   # Pipeline configurations (8 pipelines)
├── scripts/                     # Workflow scripts (00-10)
├── examples/                    # Example scripts & demos
├── tests/                       # Test suite
├── docs/                        # Documentation
├── config/                      # Configuration templates
├── install.py                   # Master installer
├── launch_dashboard.py          # Dashboard launcher
├── requirements.txt             # Python dependencies
└── setup.py                     # Package setup
Documentation
Getting Started
| Document | Description |
|---|---|
| INSTALLATION.md | Detailed installation guide |
| QUICK_START.md | 5-minute quick start |
| DASHBOARD.md | Interactive dashboard guide |
Core Features
| Document | Description |
|---|---|
| THRESHOLD_OPTIMIZER.md | Adaptive threshold optimization (new) |
| PROFILE_RECOMMEND.md | Data profiling & recommendations |
| QUALITY_ASSESSMENT.md | Quality scoring & batch effects |
| BENCHMARKING.md | Pipeline benchmarking |
Advanced Features
| Document | Description |
|---|---|
| ENSEMBLE.md | Multi-pipeline ensemble analysis |
| RESOURCE_MONITORING.md | Resource tracking |
| CLOUD_DEPLOYMENT.md | AWS/GCP/Azure deployment |
Reference
| Document | Description |
|---|---|
| PIPELINES.md | Pipeline details & selection guide |
| API.md | Python API reference |
| FAQ.md | Frequently asked questions |
| CHANGELOG.md | Version history |
Usage Examples
Example 1: Quick Threshold Optimization (NEW!)
from raptor.threshold_optimizer import optimize_thresholds
import pandas as pd
# Load DE results
df = pd.read_csv('deseq2_results.csv')
# Optimize thresholds
result = optimize_thresholds(df, goal='balanced')
print(f"Optimal |logFC|: {result.logfc_threshold:.3f}")
print(f"Significant genes: {result.n_significant}")
print(f"π₀ estimate: {result.pi0:.3f}")
# Get publication methods text
print(result.methods_text)
# Save results
result.results_df.to_csv('optimized_results.csv')
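To give a feel for the π₀ value printed above, here is a minimal sketch of Storey's fixed-λ estimator of the true null proportion: the share of p-values above λ, divided by (1 - λ). The README does not state which estimator RAPTOR uses, so the function `estimate_pi0` and the choice λ = 0.5 are assumptions for illustration only.

```python
import numpy as np

def estimate_pi0(pvalues, lam=0.5):
    """Storey-style estimate of the null proportion at a fixed lambda (illustrative)."""
    p = np.asarray(pvalues, dtype=float)
    p = p[~np.isnan(p)]                      # ignore genes without a p-value
    pi0 = np.mean(p > lam) / (1.0 - lam)     # nulls are roughly uniform above lambda
    return float(min(pi0, 1.0))              # a proportion cannot exceed 1

pvals = [0.001, 0.002, 0.01, 0.03, 0.2, 0.45, 0.6, 0.7, 0.85, 0.95]
pi0 = estimate_pi0(pvals)
print(f"pi0 estimate: {pi0:.2f}")
```

A value near 1 suggests few truly changing genes, while a lower value suggests a larger share of real signal.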
Example 2: Full Workflow
from raptor import RNAseqDataProfiler, MLPipelineRecommender
from raptor.threshold_optimizer import optimize_thresholds
import pandas as pd
# 1. Profile data
counts = pd.read_csv('counts.csv', index_col=0)
metadata = pd.read_csv('metadata.csv')
profiler = RNAseqDataProfiler(counts, metadata, use_ml=True)
profile = profiler.profile(quality_check=True)
print(f"Quality Score: {profile['quality_score']}/100")
# 2. Get ML recommendation
recommender = MLPipelineRecommender()
recommendations = recommender.recommend(profile, n=3)
print(f"Recommended: {recommendations[0]['pipeline_name']}")
# 3. [Run recommended pipeline - produces DE results]
# raptor run --pipeline 3 ...
# 4. Optimize thresholds (NEW in v2.1.1)
de_results = pd.read_csv('deseq2_results.csv')
result = optimize_thresholds(
    de_results,
    logfc_col='log2FoldChange',
    pvalue_col='pvalue',
    goal='balanced'
)
print("\nOptimized Thresholds:")
print(f"  LogFC: |{result.logfc_threshold:.3f}|")
print(f"  Significant: {result.n_significant} genes")
# 5. Save results with methods text
result.results_df.to_csv('final_results.csv')
with open('methods.txt', 'w') as f:
    f.write(result.methods_text)
Example 3: Ensemble Analysis with ATO
from raptor.ensemble_analysis import EnsembleAnalyzer
from raptor.threshold_optimizer import optimize_thresholds
# df1, df2, df3: DE result DataFrames from each pipeline, loaded beforehand
# Combine results from multiple pipelines
analyzer = EnsembleAnalyzer()
consensus = analyzer.combine_results(
    results_dict={'deseq2': df1, 'edger': df2, 'limma': df3},
    method='weighted_vote',
    min_agreement=2
)
# Use ATO for uniform thresholds across ensemble
result = optimize_thresholds(consensus['combined'], goal='balanced')
print(f"Consensus DE genes: {result.n_significant}")
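The idea behind `min_agreement` can be sketched without RAPTOR as a simple unweighted vote: a gene is kept when at least that many pipelines call it significant. This is a simplification of the `weighted_vote` method named above, whose weighting scheme is not described here; `consensus_genes` is a hypothetical helper.

```python
from collections import Counter

def consensus_genes(gene_sets, min_agreement=2):
    """Return genes called significant by at least `min_agreement` pipelines."""
    # set(genes) ensures each pipeline contributes at most one vote per gene
    votes = Counter(g for genes in gene_sets.values() for g in set(genes))
    return sorted(g for g, n in votes.items() if n >= min_agreement)

# Toy significant-gene lists per pipeline
calls = {
    "deseq2": ["geneA", "geneB", "geneC"],
    "edger": ["geneA", "geneC", "geneD"],
    "limma": ["geneA", "geneE"],
}

shared = consensus_genes(calls, min_agreement=2)
print(shared)
```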
Performance
ML Recommendation Accuracy
| Metric | Value |
|---|---|
| Overall Accuracy | 87% |
| Top-3 Accuracy | 96% |
| Prediction Time | <0.1s |
| Training Data | 10,000+ analyses |
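For readers unfamiliar with the distinction between overall and top-3 accuracy, the sketch below shows the usual definition of top-k accuracy on toy data. How RAPTOR measured the figures in the table is not described here, so `top_k_accuracy` and the example rankings are purely illustrative.

```python
def top_k_accuracy(ranked_predictions, truths, k=3):
    """Fraction of cases where the true best pipeline is among the top k predictions."""
    hits = sum(truth in ranked[:k] for ranked, truth in zip(ranked_predictions, truths))
    return hits / len(truths)

# Each inner list is a ranked recommendation of pipeline IDs for one dataset (toy data)
ranked = [[3, 1, 4], [1, 3, 5], [4, 3, 1], [2, 5, 6]]
best = [3, 3, 1, 8]   # the pipeline that actually performed best (toy data)

top1 = top_k_accuracy(ranked, best, k=1)
top3 = top_k_accuracy(ranked, best, k=3)
print(f"top-1: {top1:.2f}, top-3: {top3:.2f}")
```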
Threshold Optimizer Benefits
| Metric | Traditional | With ATO |
|---|---|---|
| Threshold justification | Arbitrary | Data-driven |
| Methods text | Manual | Auto-generated |
| False positives | Higher | Optimized |
| Reproducibility | Variable | Standardized |
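As an illustration of what a data-driven cutoff can look like, the sketch below derives a |logFC| threshold from the median absolute deviation (MAD) of the observed fold changes, scaled to approximate a standard deviation. This is one plausible reading of a MAD-based method, not RAPTOR's actual algorithm; the helper `mad_logfc_threshold` and the multiplier `k=2.0` are assumptions.

```python
import numpy as np

def mad_logfc_threshold(logfc, k=2.0):
    """Threshold = k robust standard deviations of the logFC distribution (illustrative)."""
    x = np.asarray(logfc, dtype=float)
    x = x[~np.isnan(x)]
    mad = np.median(np.abs(x - np.median(x)))
    # 1.4826 rescales the MAD to match the standard deviation of a normal distribution
    return float(k * 1.4826 * mad)

logfc = [-0.2, -0.1, 0.0, 0.1, 0.2, 3.0]
threshold = mad_logfc_threshold(logfc)
print(f"|logFC| threshold: {threshold:.3f}")
```

Because the MAD is robust, the single large fold change (3.0) barely moves the threshold, which is the property that makes it attractive compared with a fixed cutoff.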
Contributing
We welcome contributions! RAPTOR is open-source and aims to make free science accessible to everyone.
# Fork and clone
git clone https://github.com/YOUR_USERNAME/RAPTOR.git
# Create feature branch
git checkout -b feature/amazing-feature
# Make changes and test
pytest tests/
# Submit pull request
See CONTRIBUTING.md for guidelines.
Citation
If you use RAPTOR in your research, please cite:
@software{bolouki2025raptor,
  author    = {Bolouki, Ayeh},
  title     = {RAPTOR: RNA-seq Analysis Pipeline Testing and Optimization Resource},
  year      = {2025},
  version   = {2.1.1},
  publisher = {Zenodo},
  doi       = {10.5281/zenodo.17607161},
  url       = {https://github.com/AyehBlk/RAPTOR}
}
License
This project is licensed under the MIT License - see the LICENSE file for details.
MIT License
Copyright (c) 2025 Ayeh Bolouki
Contact
Ayeh Bolouki
- GIGA, University of Liège, Belgium
- Email: ayehbolouki1988@gmail.com
- GitHub: @AyehBlk
- Research: Computational Biology, Bioinformatics, Multi-omics Analysis
Acknowledgments
- The Bioconductor community for the R package ecosystem
- All users who provided feedback
⭐ Star this repository if you find RAPTOR useful!
RAPTOR v2.1.1 - Making pipeline selection evidence-based, not guesswork