RNA-seq Analysis Pipeline Testing and Optimization Resource
RAPTOR
Making free science for everybody around the world
Quick Start • Features • Installation • Architecture • Documentation • Pipelines • Citation
What is RAPTOR?
RAPTOR is a comprehensive framework for RNA-seq analysis that makes sophisticated differential expression workflows accessible to everyone. Stop wondering which pipeline to use or what thresholds to set: RAPTOR provides ML-powered recommendations and ensemble methods for robust, reproducible results.
Why RAPTOR?
| Challenge | RAPTOR Solution |
|---|---|
| Which pipeline should I use? | ML recommendations based on 32 dataset features |
| Which DE method (DESeq2/edgeR/limma)? | Ensemble analysis combines all methods |
| What thresholds should I use? | 4 optimization methods for data-driven cutoffs |
| Is my data quality good enough? | 6 outlier detection methods with consensus |
| How do I know results are reliable? | Ensemble consensus with direction checking |
| What if methods disagree? | Brown's method accounts for correlation |
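To make "ensemble consensus with direction checking" concrete, here is a minimal sketch of a voting consensus. It is not RAPTOR's implementation: the function name `vote_consensus`, the input layout (gene mapped to a `(log fold-change, adjusted p-value)` pair) and the cutoffs are assumptions made for illustration.

```python
def vote_consensus(results, padj_cutoff=0.05, min_votes=2):
    """Return genes called significant by at least `min_votes` methods
    whose log fold-changes all point in the same direction.

    `results` maps a method name to {gene: (log_fold_change, padj)}.
    This layout is hypothetical, chosen only for the example.
    """
    votes = {}        # gene -> number of methods calling it significant
    directions = {}   # gene -> set of observed signs (True means up)
    for table in results.values():
        for gene, (lfc, padj) in table.items():
            if padj < padj_cutoff:
                votes[gene] = votes.get(gene, 0) + 1
                directions.setdefault(gene, set()).add(lfc > 0)
    return sorted(
        gene for gene, n in votes.items()
        if n >= min_votes and len(directions[gene]) == 1
    )


toy = {
    "deseq2": {"geneA": (2.1, 0.001), "geneB": (-1.5, 0.01), "geneC": (0.8, 0.20)},
    "edger":  {"geneA": (1.9, 0.004), "geneB": (1.2, 0.03),  "geneC": (0.9, 0.04)},
    "limma":  {"geneA": (2.0, 0.002), "geneB": (-1.4, 0.02), "geneC": (0.7, 0.30)},
}
consensus = vote_consensus(toy)
print(consensus)
```

In this toy input, `geneB` is significant in all three methods but is dropped because the methods disagree on its direction, and `geneC` is dropped because only one method calls it.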
Features
Ensemble Analysis (NEW!)
Parameter Optimization (NEW!)
32-Feature Data Profiling
ML-Powered Recommendations
6 Production Pipelines
Quality Assessment
Interactive Dashboard
- Web-based interface (no coding!)
- Real-time visualizations
- Drag-and-drop data upload
- One-click ensemble analysis
- Export publication-ready reports
Quick Start
Option 1: Interactive Dashboard (Recommended)
# Install
pip install raptor-rnaseq
# Launch dashboard
streamlit run raptor/dashboard/app.py
# Opens at http://localhost:8501
# Upload data → Profile → Get recommendation → Run ensemble → Done!
Option 2: Command Line
# 1. Quality check
raptor qc --counts counts.csv --metadata metadata.csv
# 2. Profile your data
raptor profile --counts counts.csv --metadata metadata.csv --group-column condition
# 3. Get ML recommendation
raptor recommend --profile profile.json --method ml
# 4. Import DE results from different methods
raptor import-de --input deseq2.csv --method deseq2
raptor import-de --input edger.csv --method edger
raptor import-de --input limma.csv --method limma
# 5. Optimize thresholds (NEW!)
raptor optimize --de-result de_results.csv --method fdr-control --fdr-target 0.05
# 6. Ensemble analysis - combine all methods (NEW!)
raptor ensemble-compare --deseq2 deseq2.csv --edger edger.csv --limma limma.csv
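How `raptor optimize` reaches its `--fdr-target` is not described here. As background, the sketch below shows the Benjamini-Hochberg adjustment, a standard way to control the false discovery rate. The function name and toy p-values are illustrative assumptions, not part of RAPTOR.

```python
def benjamini_hochberg(pvalues):
    """Return Benjamini-Hochberg adjusted p-values in the input order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value to the smallest so that the adjusted
    # values stay monotone (each is the minimum over all larger ranks).
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


pvals = [0.01, 0.04, 0.03, 0.20]
padj = benjamini_hochberg(pvals)
# Indices of tests that pass a 5% FDR target.
significant = [i for i, q in enumerate(padj) if q <= 0.05]
print(padj, significant)
```

Filtering on the adjusted values at 0.05 keeps the expected proportion of false discoveries among the selected genes at or below 5% under the usual BH assumptions.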
Option 3: Python API
from raptor import (
    quick_quality_check,
    profile_data_quick,
    recommend_pipeline,
    optimize_with_fdr_control,
    ensemble_brown,
)
# 1. Quality check
qc_report = quick_quality_check('counts.csv', 'metadata.csv')
print(f"Outliers: {qc_report.outliers}")
# 2. Profile data (32 features extracted)
profile = profile_data_quick('counts.csv', 'metadata.csv', group_column='condition')
print(f"BCV: {profile.bcv:.3f} ({profile.bcv_category})")
# 3. Get ML recommendation
recommendation = recommend_pipeline(profile_file='profile.json', method='ml')
print(f"Recommended: {recommendation.pipeline_name} (confidence: {recommendation.confidence:.2f})")
# 4. After running DE analysis, optimize thresholds (NEW!)
# (de_result is assumed to be a DE result already imported into Python)
result = optimize_with_fdr_control(de_result, fdr_target=0.05)
print(f"Optimal thresholds: {result.optimal_threshold}")
# 5. Ensemble analysis - combine DESeq2, edgeR, limma (NEW!)
consensus = ensemble_brown({
    'deseq2': deseq2_result,
    'edger': edger_result,
    'limma': limma_result,
})
print(f"Consensus DE genes: {len(consensus.consensus_genes)}")
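The profile above reports a BCV (biological coefficient of variation). In the negative binomial model used by edgeR, the variance is `mu + phi * mu^2` and the BCV is the square root of the dispersion `phi`. The sketch below is a crude method-of-moments estimate on replicate counts from a single condition. It ignores library-size normalization and is not how RAPTOR computes `profile.bcv`; the function name `rough_bcv` is hypothetical.

```python
import math
from statistics import mean, variance


def rough_bcv(counts_by_gene):
    """Crude common-BCV estimate from per-gene replicate counts.

    Solves var = mu + phi * mu**2 for phi per gene, floors negative
    estimates at zero, and returns sqrt(mean(phi)).
    """
    dispersions = []
    for counts in counts_by_gene:
        mu = mean(counts)
        if mu <= 0:
            continue  # skip genes with no reads
        phi = (variance(counts) - mu) / mu ** 2
        dispersions.append(max(phi, 0.0))
    return math.sqrt(mean(dispersions)) if dispersions else float("nan")


toy_counts = [
    [100, 120, 80],  # overdispersed gene: phi = (400 - 100) / 100**2 = 0.03
    [50, 50, 50],    # no excess variance: phi floored at 0
]
bcv = rough_bcv(toy_counts)
print(f"rough BCV: {bcv:.3f}")
```

Dedicated tools estimate dispersion with shrinkage across genes, so treat this only as an aid to intuition about what the number means.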
Installation
Requirements
- Python: 3.8 - 3.12
- R: 4.0+ (optional, for Module 6 DE analysis)
- RAM: 4GB minimum (16GB recommended for pipelines)
- Disk: 500MB (Python package) / 5-8GB (with bioinformatics tools)
Install from PyPI (Recommended)
# Basic installation
pip install raptor-rnaseq
# With dashboard support (quotes keep shells such as zsh from expanding the brackets)
pip install "raptor-rnaseq[dashboard]"
# With all features
pip install "raptor-rnaseq[all]"
# Development installation
pip install "raptor-rnaseq[dev]"
Conda Installation
Core environment (Python only, ~500MB, 5-10 min):
conda env create -f environment.yml
conda activate raptor
Full environment (with STAR, Salmon, Kallisto, R, ~5-8GB, 30-60 min):
conda env create -f environment-full.yml
conda activate raptor-full
See docs/CONDA_ENVIRONMENTS.md for a detailed comparison.
Install from Source
# Clone repository
git clone https://github.com/AyehBlk/RAPTOR.git
cd RAPTOR
# Install in editable mode
pip install -e .
# Or with development tools
pip install -e ".[dev]"
# Verify installation
raptor --version
pytest tests/
Architecture
RAPTOR is organized into 9 modules spanning 4 analysis stages:
┌─────────────────────────────────────────────────────────────┐
│                        RAPTOR v2.2.0                        │
│             RNA-seq Analysis Pipeline Framework             │
└─────────────────────────────────────────────────────────────┘
Stage 1: Data Preparation & QC
├── Module 1: Quick Quantification (Salmon/Kallisto)
├── Module 2: Quality Assessment (6 outlier methods)
└── Module 3: Data Profiling (32 features)
Stage 2: Pipeline Selection
├── Module 4: ML Recommender (Random Forest)
└── Module 5: Production Pipelines (6 methods)
    ├── Salmon ⭐ (recommended)
    ├── Kallisto (fastest)
    ├── STAR + featureCounts
    ├── STAR + RSEM
    ├── STAR + Salmon (unique: BAM + bootstraps)
    └── HISAT2 + featureCounts
Stage 3: Differential Expression
├── Module 6: DE Analysis (R: DESeq2, edgeR, limma)
└── Module 7: DE Import (standardize any format)
Stage 4: Advanced Analysis ⭐ NEW in v2.2.0
├── Module 8: Parameter Optimization (4 methods)
│   ├── Ground Truth Optimization
│   ├── FDR Control Optimization
│   ├── Stability Optimization
│   └── Reproducibility Optimization
└── Module 9: Ensemble Analysis (5 methods)
    ├── Fisher's Method
    ├── Brown's Method
    ├── Robust Rank Aggregation
    ├── Voting Consensus
    └── Weighted Ensemble
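Of the ensemble methods listed under Module 9, Fisher's method is the simplest to write down: for k independent p-values, the statistic X = -2 * sum(ln p_i) follows a chi-square distribution with 2k degrees of freedom. The sketch below uses the closed-form chi-square tail for even degrees of freedom, so it needs only the standard library. It illustrates the textbook method and is not RAPTOR's `ensemble_fisher`.

```python
import math


def fisher_combine(pvalues):
    """Combine independent p-values with Fisher's method.

    Returns (statistic, combined_p). The statistic is chi-square
    distributed with 2 * len(pvalues) degrees of freedom under the null.
    """
    k = len(pvalues)
    # Clamp to avoid log(0) when a tool reports a p-value of exactly zero.
    stat = -2.0 * sum(math.log(max(p, 1e-300)) for p in pvalues)
    half = stat / 2.0
    # Survival function of chi-square with 2k degrees of freedom:
    # sf(x) = exp(-x/2) * sum_{i=0}^{k-1} (x/2)**i / i!
    term, total = 1.0, 1.0
    for i in range(1, k):
        term *= half / i
        total += term
    return stat, math.exp(-half) * total


stat, combined_p = fisher_combine([0.01, 0.02, 0.03])
print(f"X = {stat:.2f}, combined p = {combined_p:.2e}")
```

Fisher's method assumes the p-values are independent. DESeq2, edgeR and limma are run on the same counts, so their p-values are correlated, which is the motivation for Brown's method.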
Pipelines
RAPTOR supports 6 production RNA-seq quantification pipelines:
| Pipeline | Memory | Time | Produces | Best For | Recommended |
|---|---|---|---|---|---|
| salmon | 8 GB | 10-20 min | genes + isoforms + bootstraps | Standard DE analysis | ⭐ YES |
| kallisto | 4 GB | 5-10 min | genes + isoforms + bootstraps | Speed priority | |
| star_featurecounts | 32 GB | 40-70 min | BAM + genes | Gene-level publication | |
| star_rsem | 32 GB | 60-120 min | BAM + genes + isoforms | Isoform analysis | |
| star_salmon | 32 GB | 50-90 min | BAM + genes + isoforms + bootstraps | Unique: BAM + bootstraps | |
| hisat2_featurecounts | 16 GB | 30-60 min | BAM + genes | Low memory systems | |
⭐ Salmon is recommended for most use cases because of its balance of speed and accuracy and its bootstrap support.
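The memory and runtime figures in the table above can serve as a quick pre-filter before asking for an ML recommendation. The helper below is a hypothetical sketch, not a RAPTOR function: it hard-codes the table values (using the upper bound of each listed runtime) and returns the pipelines that fit a given amount of RAM, fastest first.

```python
# (name, memory in GB, upper bound of listed runtime in minutes, produces BAM)
# Values copied from the pipeline table; the structure itself is illustrative.
PIPELINES = [
    ("salmon", 8, 20, False),
    ("kallisto", 4, 10, False),
    ("star_featurecounts", 32, 70, True),
    ("star_rsem", 32, 120, True),
    ("star_salmon", 32, 90, True),
    ("hisat2_featurecounts", 16, 60, True),
]


def pipelines_that_fit(available_gb, need_bam=False):
    """Return pipeline names that fit in `available_gb` of RAM, fastest first."""
    candidates = [
        (minutes, name)
        for name, mem_gb, minutes, makes_bam in PIPELINES
        if mem_gb <= available_gb and (makes_bam or not need_bam)
    ]
    return [name for _, name in sorted(candidates)]


print(pipelines_that_fit(16))
print(pipelines_that_fit(16, need_bam=True))
```

On a 16 GB machine this leaves the two alignment-free quantifiers plus HISAT2, and only HISAT2 if BAM files are required.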
Pipeline Features
All pipelines support:
- Paired-end and single-end reads
- Automatic parameter optimization
- QC report generation
- Multi-threading
- Sample sheet-based workflows
Pipeline selection:
# List available pipelines
raptor pipeline list
# Get detailed info
raptor pipeline run --name salmon --help
# Run with ML recommendation
raptor recommend --profile profile.json --method ml
# Recommended: salmon (confidence: 0.89)
raptor pipeline run --name salmon --samples samples.csv --index salmon_index/
Repository Structure
RAPTOR/
├── raptor/  # Core Python package
│   ├── __init__.py  # Package initialization (v2.2.0)
│   ├── cli.py  # Command-line interface (11 commands)
│   ├── quality_assessment.py  # Module 2: QC (6 methods)
│   ├── profiler.py  # Module 3: Profiling (32 features)
│   ├── recommender.py  # Module 4: Rule-based
│   ├── ml_recommender.py  # Module 4: ML-based
│   ├── de_import.py  # Module 7: DE import
│   ├── parameter_optimization.py  # Module 8: Optimization ⭐ NEW
│   ├── ensemble.py  # Module 9: Ensemble ⭐ NEW
│   ├── simulation.py  # Simulation tools
│   │
│   ├── pipelines/  # Module 5: Production pipelines
│   │   ├── base.py
│   │   ├── salmon/
│   │   ├── kallisto/
│   │   ├── star_featurecounts/
│   │   ├── star_rsem/
│   │   ├── star_salmon/
│   │   └── hisat2_featurecounts/
│   │
│   ├── external_modules/  # Module 6: R integration
│   │   └── module6_de_analysis/
│   │       └── r_scripts/  # DESeq2, edgeR, limma
│   │
│   ├── dashboard/  # Interactive Streamlit app
│   │   ├── app.py
│   │   ├── pages/  # 9 dashboard pages
│   │   ├── components/
│   │   └── utils/
│   │
│   └── utils/  # Utilities
│       ├── validation.py
│       ├── errors.py
│       └── sample_sheet.py
│
├── docs/  # Documentation
│   ├── MODULE_1_Quick_Quantification.md
│   ├── MODULE_2_Quality_Assessment.md
│   ├── MODULE_3_Data_Profiling.md
│   ├── MODULE_3_QUICK_REFERENCE.md
│   ├── MODULE_4_Pipeline_Recommender.md
│   ├── MODULE_7_DE_Import.md
│   ├── MODULE_8_Parameter_Optimization.md ⭐ NEW
│   ├── MODULE_9_Ensemble_Analysis.md ⭐ NEW
│   ├── CONDA_ENVIRONMENTS.md
│   ├── RAPTOR_QUICK_REFERENCE.md  # Cheat sheet
│   └── RAPTOR_API_DOCUMENTATION.md  # Python API
│
├── examples/  # Example scripts
│   ├── 02_quality_assessment.py
│   ├── 03_data_profiler.py
│   ├── 04_recommender.py
│   ├── 07_DE_Import.py
│   ├── 08_Parameter_Optimization.py ⭐ NEW
│   └── 09_Ensemble_Analysis.py ⭐ NEW
│
├── tests/  # Test suite (85%+ coverage)
│   ├── test_profiler.py
│   ├── test_quality_assessment.py
│   ├── test_parameter_optimization.py ⭐ NEW
│   ├── test_ensemble.py ⭐ NEW
│   └── ...
│
├── templates/  # Sample sheets
│   ├── sample_sheet_paired.csv
│   └── sample_sheet_single.csv
│
├── .github/  # GitHub templates
│   └── ISSUE_TEMPLATE/
│
├── setup.py  # Package setup
├── requirements.txt  # Python dependencies
├── environment.yml  # Conda environment (core)
├── environment-full.yml  # Conda environment (complete)
├── CITATION.cff  # Citation metadata
├── CHANGELOG.md  # Version history
├── CONTRIBUTING.md  # Contribution guidelines
└── LICENSE  # MIT License
Documentation
Getting Started
| Document | Description |
|---|---|
| Quick Start | 5-minute quick start guide |
| Installation | Detailed installation instructions |
| CONDA_ENVIRONMENTS.md | Conda setup (core vs full) |
Core Features (v2.2.0)
| Document | Description |
|---|---|
| MODULE_2_Quality_Assessment.md | QC with 6 outlier methods |
| MODULE_3_Data_Profiling.md | 32-feature profiling |
| MODULE_3_QUICK_REFERENCE.md | Profiling cheat sheet |
| MODULE_4_Pipeline_Recommender.md | ML recommendations |
| MODULE_7_DE_Import.md | Import & standardize DE results |
| MODULE_8_Parameter_Optimization.md | ⭐ 4 optimization methods |
| MODULE_9_Ensemble_Analysis.md | ⭐ 5 ensemble methods |
Reference
| Document | Description |
|---|---|
| RAPTOR_QUICK_REFERENCE.md | Command cheat sheet |
| RAPTOR_API_DOCUMENTATION.md | Complete Python API |
| examples/ | Example scripts for all modules |
| CHANGELOG.md | Version history |
Usage Examples
Example 1: Complete Workflow (v2.2.0)
from raptor import (
    quick_quality_check,
    profile_data_quick,
    recommend_pipeline,
    import_deseq2,
    import_edger,
    import_limma,
    optimize_with_fdr_control,
    ensemble_brown,
)
# 1. Quality Check
print("Step 1: Quality Assessment...")
qc_report = quick_quality_check('counts.csv', 'metadata.csv')
if len(qc_report.outliers) > 0:
    print(f"⚠️ Warning: {len(qc_report.outliers)} outliers detected")
else:
    print("✅ No outliers detected")
# 2. Profile Data (32 features)
print("\nStep 2: Data Profiling...")
profile = profile_data_quick('counts.csv', 'metadata.csv', group_column='condition')
print(f" BCV: {profile.bcv:.3f} ({profile.bcv_category})")
print(f" Sample size: {profile.n_samples}")
# 3. Get ML Recommendation
print("\nStep 3: ML Recommendation...")
rec = recommend_pipeline(profile_file='results/profile/data_profile.json', method='ml')
print(f" Recommended: {rec.pipeline_name} (confidence: {rec.confidence:.2f})")
# 4. [Run recommended pipeline, then DE analysis in R]
# 5. Import DE Results
print("\nStep 4: Import DE Results...")
deseq2 = import_deseq2('deseq2_results.csv')
edger = import_edger('edger_results.csv')
limma = import_limma('limma_results.csv')
# 6. Optimize Thresholds (NEW!)
print("\nStep 5: Optimize Thresholds...")
opt_result = optimize_with_fdr_control(deseq2, fdr_target=0.05)
print(f" Optimal FDR: {opt_result.optimal_threshold['padj']:.3f}")
print(f" Optimal |logFC|: {opt_result.optimal_threshold['lfc']:.3f}")
# 7. Ensemble Analysis (NEW!)
print("\nStep 6: Ensemble Analysis (Brown's Method)...")
consensus = ensemble_brown({
    'deseq2': deseq2,
    'edger': edger,
    'limma': limma,
})
print(f" Consensus genes: {len(consensus.consensus_genes)}")
print(f" Direction consistency: {consensus.direction_consistency.mean():.1%}")
# 8. Export Results
consensus.to_csv('consensus_genes.csv')
print("\n✅ Analysis complete!")
Example 2: Ensemble Analysis Only
from raptor import import_de_result, ensemble_fisher, ensemble_brown, ensemble_rra
# Import results from different tools
deseq2 = import_de_result('deseq2_results.csv', method='deseq2')
edger = import_de_result('edger_results.csv', method='edger')
limma = import_de_result('limma_results.csv', method='limma')
# Try multiple ensemble methods
results = {}
# Fisher's Method (classic)
results['fisher'] = ensemble_fisher({'deseq2': deseq2, 'edger': edger, 'limma': limma})
# Brown's Method (recommended - accounts for correlation)
results['brown'] = ensemble_brown({'deseq2': deseq2, 'edger': edger, 'limma': limma})
# Robust Rank Aggregation
results['rra'] = ensemble_rra({'deseq2': deseq2, 'edger': edger, 'limma': limma})
# Compare results
for method, result in results.items():
    print(f"{method}: {len(result.consensus_genes)} consensus genes")
# Use Brown's method (best for correlated methods)
final_result = results['brown']
final_result.to_csv('final_consensus.csv')
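To show why correlation matters when combining p-values, the sketch below computes the two quantities Brown's method derives from the correlation structure: a scale factor `c` and an effective number of degrees of freedom `f`, so that Fisher's statistic is compared against `c * chi-square(f)` instead of `chi-square(2k)`. The covariance polynomial is the approximation usually attributed to Brown (1975). Whether RAPTOR's `ensemble_brown` uses this or a later refinement is not stated here, so treat the details as assumptions.

```python
from itertools import combinations


def brown_cov(rho):
    """Approximate cov(-2 ln p_i, -2 ln p_j) for test statistics with
    correlation rho (polynomial attributed to Brown, 1975)."""
    if rho >= 0:
        return rho * (3.25 + 0.75 * rho)
    return rho * (3.27 + 0.71 * rho)


def brown_scale_and_dof(corr):
    """Return (c, f) for a k x k correlation matrix given as nested lists."""
    k = len(corr)
    expected = 2.0 * k  # E[X], unaffected by correlation
    var = 4.0 * k + 2.0 * sum(
        brown_cov(corr[i][j]) for i, j in combinations(range(k), 2)
    )
    scale = var / (2.0 * expected)
    dof = 2.0 * expected ** 2 / var
    return scale, dof


independent = [[1, 0, 0], [0, 1, 0], [0, 0, 1]]
identical = [[1, 1, 1], [1, 1, 1], [1, 1, 1]]
print(brown_scale_and_dof(independent))
print(brown_scale_and_dof(identical))
```

With no correlation the result is `c = 1, f = 2k`, which is exactly Fisher's method. With three perfectly correlated methods it collapses to `c = 3, f = 2`, meaning the three p-values carry the evidence of a single test.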
Example 3: CLI Workflow
#!/bin/bash
# Complete RAPTOR v2.2.0 workflow using CLI
# Step 1: QC
raptor qc --counts counts.csv --metadata metadata.csv --output qc_results/
# Step 2: Profile
raptor profile --counts counts.csv --metadata metadata.csv --group-column condition
# Step 3: Recommend
raptor recommend --profile profile.json --method ml
# Step 4: Import DE results
raptor import-de --input deseq2_results.csv --method deseq2 --output imported/
raptor import-de --input edger_results.csv --method edger --output imported/
raptor import-de --input limma_results.csv --method limma --output imported/
# Step 5: Optimize thresholds (NEW!)
raptor optimize --de-result imported/deseq2.csv --method fdr-control --fdr-target 0.05
# Step 6: Ensemble analysis (NEW!)
raptor ensemble-compare \
--deseq2 imported/deseq2.csv \
--edger imported/edger.csv \
--limma imported/limma.csv \
--output ensemble_results/
echo "✅ Complete! Check ensemble_results/ for consensus genes."
Performance
Module Performance
| Module | Time | Memory | Key Output |
|---|---|---|---|
| Module 2: QC | 1-5 min | 4 GB | 6 methods consensus |
| Module 3: Profiler | 1-2 min | 4 GB | 32 features + BCV |
| Module 4: Recommender | <10 sec | <1 GB | ML recommendation |
| Module 8: Optimization | 5-30 min | 4 GB | Optimal thresholds |
| Module 9: Ensemble | <1 min | 2 GB | Consensus genes |
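The "6 methods consensus" output of Module 2 can be pictured with a much smaller example. The sketch below runs two generic outlier detectors on per-sample library sizes (a z-score rule and a median-absolute-deviation rule) and keeps only the samples that both flag. The detectors, cutoffs and function names are illustrative assumptions; RAPTOR's six methods are not listed in this document.

```python
from statistics import mean, median, stdev


def zscore_outliers(values, cutoff=2.0):
    """Indices whose distance from the mean exceeds `cutoff` standard deviations."""
    mu, sd = mean(values), stdev(values)
    if sd == 0:
        return set()
    return {i for i, v in enumerate(values) if abs(v - mu) / sd > cutoff}


def mad_outliers(values, cutoff=3.5):
    """Indices whose modified z-score (based on the MAD) exceeds `cutoff`."""
    med = median(values)
    mad = median([abs(v - med) for v in values])
    if mad == 0:
        return set()
    return {i for i, v in enumerate(values)
            if 0.6745 * abs(v - med) / mad > cutoff}


# Toy library sizes in millions of reads; sample 6 is clearly under-sequenced.
library_sizes = [20.1, 19.8, 21.0, 20.5, 19.9, 20.3, 5.0, 20.7]
flagged_by_both = zscore_outliers(library_sizes) & mad_outliers(library_sizes)
print(sorted(flagged_by_both))
```

Requiring agreement between detectors trades sensitivity for fewer false alarms, which is the usual reason to report a consensus instead of a single method's calls.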
Ensemble Analysis Benefits
| Metric | Single Method | Ensemble (Brown's) |
|---|---|---|
| False Positive Rate | Higher | 33% lower |
| Reproducibility | Variable | Higher |
| Confidence | Method-specific | Consensus-based |
| Publication Impact | Good | Better |
Contributing
We welcome contributions! RAPTOR is open-source and aims to make free science accessible to everyone.
# Fork and clone
git clone https://github.com/YOUR_USERNAME/RAPTOR.git
cd RAPTOR
# Create feature branch
git checkout -b feature/amazing-feature
# Make changes and test
pytest tests/
# Submit pull request
See CONTRIBUTING.md for detailed guidelines.
Ways to Contribute
- Report bugs via Issues
- Request features
- Improve documentation
- Submit pull requests
- Share use cases and feedback
- ⭐ Star the repository
Citation
If you use RAPTOR in your research, please cite:
@software{bolouki2026raptor,
author = {Bolouki, Ayeh},
title = {RAPTOR: RNA-seq Analysis Pipeline Testing and Optimization Resource},
year = {2026},
version = {2.2.0},
publisher = {Zenodo},
doi = {10.5281/zenodo.17607161},
url = {https://github.com/AyehBlk/RAPTOR}
}
License
This project is licensed under the MIT License - see the LICENSE file for details.
MIT License
Copyright (c) 2026 Ayeh Bolouki
Contact
Ayeh Bolouki
- GIGA, University of Liège, Belgium
- Email: ayehbolouki1988@gmail.com
- GitHub: @AyehBlk
- Research: Computational Biology, Bioinformatics, Multi-omics Analysis
Support
- Documentation: docs/
- Issues: GitHub Issues
- Discussions: GitHub Discussions
- Email: ayehbolouki1988@gmail.com
Acknowledgments
RAPTOR builds on the excellent work of the RNA-seq community:
- Bioconductor community for the R package ecosystem
- DESeq2 (Love et al., 2014) - Differential expression analysis
- edgeR (Robinson et al., 2010) - Empirical analysis of DGE
- limma (Ritchie et al., 2015) - Linear models for microarray and RNA-seq
- Salmon (Patro et al., 2017) - Wicked-fast transcript quantification
- Kallisto (Bray et al., 2016) - Near-optimal probabilistic RNA-seq quantification
- STAR (Dobin et al., 2013) - Ultrafast universal RNA-seq aligner
- All users who provided feedback and suggestions
⭐ Star this repository if you find RAPTOR useful!
RAPTOR v2.2.0 - Making pipeline selection evidence-based, not guesswork