siRNAforge - Multi-species gene to siRNA design, off-target prediction, and ranking. Comprehensive siRNA design toolkit for gene silencing
siRNAforge - Comprehensive siRNA Design Tool
siRNAforge is a toolkit for designing siRNAs with integrated off-target analysis. It runs on Python 3.9-3.12 and covers the path from gene query to ranked candidates: transcript retrieval, candidate design and scoring, and multi-species off-target prediction.
Key Features
- Algorithm-driven design - Comprehensive siRNA design with multi-component thermodynamic scoring
- Multi-species off-target analysis - BWA-MEM2 alignment (transcriptome + miRNA seed modes) across human, rat, and rhesus genomes
- Advanced scoring system - Composite scoring with seed-region specificity and secondary structure prediction
- ViennaRNA integration - Secondary structure prediction for enhanced design accuracy
- Chemical modifications metadata - Track 2'-O-methyl, 2'-fluoro, PS linkages, overhangs, and provenance
- Nextflow pipeline integration - Scalable, containerized workflow execution with automatic parallelization
- Modern Python architecture - Type-safe code with Pydantic models, async/await support, and rich CLI
- Lightning-fast dependency management - Built with `uv` for sub-second installs and virtual environment management
- Fully containerized - Docker images with all bioinformatics dependencies pre-installed
- Multi-database support - Ensembl, RefSeq, GENCODE integration for comprehensive transcript retrieval
Note: Supports Python 3.9-3.12. Python 3.13+ not yet supported due to ViennaRNA dependency constraints.
Quick Start
Installation Options
Docker (Recommended - Complete Environment):
# Pull the pre-built image with all dependencies
docker pull ghcr.io/austin-s-h/sirnaforge:latest
# Quick workflow example
docker run -v $(pwd):/workspace -w /workspace \
ghcr.io/austin-s-h/sirnaforge:latest \
sirnaforge workflow TP53 --output-dir results --genome-species human
# With custom parameters
docker run -v $(pwd):/workspace -w /workspace \
ghcr.io/austin-s-h/sirnaforge:latest \
sirnaforge workflow BRCA1 --gc-min 40 --gc-max 60 --sirna-length 21 --top-n 50
Conda Environment (Alternative - Local Development):
# Install micromamba (recommended - fastest), Miniforge, or Miniconda
# micromamba (fastest option):
curl -LsSf https://micro.mamba.pm/install.sh | bash
# Or Miniforge (includes mamba; supersedes the retired Mambaforge installers):
curl -L -O "https://github.com/conda-forge/miniforge/releases/latest/download/Miniforge3-$(uname)-$(uname -m).sh"
bash Miniforge3-$(uname)-$(uname -m).sh
# Create siRNAforge development environment
make conda-env
# Activate the environment
micromamba activate sirnaforge-dev # or conda activate sirnaforge-dev
# Install Python dependencies
make install-dev
# Run tests to verify installation
make test-local-python
Local Development Installation:
# Install uv (lightning-fast Python package manager)
curl -LsSf https://astral.sh/uv/install.sh | sh
# Clone and setup with development dependencies
git clone https://github.com/austin-s-h/sirnaforge
cd sirnaforge
make install-dev
# Run sanity checks to verify installation
make test-local-python
Essential Dependencies for Off-target Analysis
The Docker image includes all bioinformatics dependencies via conda environment (docker/environment-nextflow.yml):
- Nextflow (≥25.04.0) - Workflow orchestration and parallelization
- BWA-MEM2 (≥2.2.1) - High-performance genome alignment (transcriptome + miRNA seed analysis)
- SAMtools (≥1.19.2) - SAM/BAM file processing and indexing
- ViennaRNA (≥2.7.0) - RNA secondary structure prediction
- AWS CLI (≥2.0) - Automated genome reference downloads
- Java 17 - Nextflow runtime environment
For local development without Docker:
# Option 1: Use conda environment (includes all tools)
make conda-env
micromamba activate sirnaforge-dev # or conda activate sirnaforge-dev
# Option 2: Install bioinformatics tools via micromamba
curl -LsSf https://micro.mamba.pm/install.sh | bash
micromamba env create -f docker/environment-nextflow.yml
micromamba activate sirnaforge-env
Usage Examples
Complete Workflow (Gene Query to Results):
# Basic workflow with default parameters
uv run sirnaforge workflow TP53 --output-dir results
# Advanced workflow with custom parameters
uv run sirnaforge workflow BRCA1 \
--genome-species "human,rat,rhesus" \
--gc-min 40 --gc-max 60 \
--sirna-length 21 \
--top-n 50 \
--output-dir brca1_analysis
# Workflow from a pre-existing FASTA file (local path or remote URL)
uv run sirnaforge workflow --input-fasta transcripts.fasta \
--output-dir custom_analysis \
--offtarget-n 25 \
custom_gene_name
# Remote FASTA example
uv run sirnaforge workflow --input-fasta https://example.org/transcripts.fasta \
--output-dir remote_input_run \
remote_dataset
Individual Component Usage:
# Search for gene transcripts across databases
uv run sirnaforge search TP53 --output transcripts.fasta --database ensembl
# Design siRNAs from transcript sequences
uv run sirnaforge design transcripts.fasta --output results.csv --top-n 20
# Validate input files before processing
uv run sirnaforge validate candidates.fasta
# Display configuration and system information
uv run sirnaforge config
# Show detailed help for any command
uv run sirnaforge --help
uv run sirnaforge workflow --help
Python API
Programmatic Access for Custom Workflows:
import asyncio
from pathlib import Path
from sirnaforge.workflow import run_sirna_workflow
from sirnaforge.core.design import SiRNADesigner
from sirnaforge.models.sirna import DesignParameters, FilterCriteria
from sirnaforge.data.gene_search import search_gene_sync
# Complete async workflow with custom parameters
async def design_sirnas_custom():
    results = await run_sirna_workflow(
        gene_query="TP53",
        output_dir="results",
        database="ensembl",
        top_n_candidates=50,
        top_n_offtarget=15,
        genome_species=["human", "rat", "rhesus"],
        gc_min=40.0,
        gc_max=60.0,
        sirna_length=21,
    )
    return results

# Run the workflow
results = asyncio.run(design_sirnas_custom())
print(f"Designed {len(results.get('top_candidates', []))} siRNA candidates")
# Individual component usage for custom pipelines
def custom_design_pipeline():
    # 1. Search for gene transcripts
    transcripts = search_gene_sync(
        gene_query="BRCA1",
        database="ensembl",
        output_file="transcripts.fasta"
    )
    # 2. Configure design parameters
    design_params = DesignParameters(
        sirna_length=21,
        filters=FilterCriteria(
            gc_min=40,
            gc_max=60,
            avoid_patterns=["AAAA", "TTTT", "GGGG", "CCCC"]
        )
    )
    # 3. Initialize designer and generate candidates
    designer = SiRNADesigner(design_params)
    design_results = designer.design_from_file("transcripts.fasta")
    # 4. Process results
    for candidate in design_results.top_candidates[:10]:
        print(f"Candidate {candidate.id}:")
        print(f"  Guide: {candidate.guide_sequence}")
        print(f"  Score: {candidate.composite_score:.2f}")
        print(f"  GC%: {candidate.gc_content:.1f}")
        print(f"  Transcripts: {len(candidate.transcript_ids)}")
        print()
    return design_results
# Example: Batch processing multiple genes (one gene at a time)
async def batch_design_genes(genes: list[str]):
    results = {}
    for gene in genes:
        print(f"Processing {gene}...")
        gene_results = await run_sirna_workflow(
            gene_query=gene,
            output_dir=f"results_{gene.lower()}",
            top_n_candidates=20
        )
        results[gene] = gene_results
    return results

# Process multiple cancer-related genes
cancer_genes = ["TP53", "BRCA1", "BRCA2", "EGFR", "MYC"]
batch_results = asyncio.run(batch_design_genes(cancer_genes))
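The batch loop above awaits each gene in turn. If the workflow coroutine is safe to run concurrently (an assumption; the README does not say so), a bounded-concurrency variant might look like the sketch below. `fake_workflow` is a hypothetical stand-in for `run_sirna_workflow` so the example runs without network access, and `max_concurrent` is an illustrative parameter.

```python
import asyncio
from typing import Any, Awaitable, Callable


# Hypothetical stand-in for run_sirna_workflow so the sketch runs offline.
async def fake_workflow(gene_query: str, output_dir: str) -> dict[str, Any]:
    await asyncio.sleep(0.01)  # simulate I/O-bound work
    return {"gene": gene_query, "output_dir": output_dir}


async def batch_design_bounded(
    genes: list[str],
    workflow: Callable[..., Awaitable[dict[str, Any]]] = fake_workflow,
    max_concurrent: int = 2,
) -> dict[str, dict[str, Any]]:
    """Run one workflow per gene, with at most max_concurrent in flight."""
    semaphore = asyncio.Semaphore(max_concurrent)

    async def run_one(gene: str) -> tuple[str, dict[str, Any]]:
        async with semaphore:
            result = await workflow(
                gene_query=gene, output_dir=f"results_{gene.lower()}"
            )
        return gene, result

    pairs = await asyncio.gather(*(run_one(gene) for gene in genes))
    return dict(pairs)


batch = asyncio.run(batch_design_bounded(["TP53", "BRCA1", "MYC"]))
print(sorted(batch))
```

Capping concurrency with a semaphore keeps the number of simultaneous API requests and alignment jobs predictable; whether that is needed depends on the rate limits of the databases being queried.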
Architecture & Workflow
Complete Pipeline Overview
Gene Query → Transcript Search → ORF Validation → siRNA Design → Off-target Analysis → Ranked Results
- Gene Query: Multi-database gene search (Ensembl/RefSeq/GENCODE)
- Transcript Search: Canonical isoform selection
- ORF Validation: Coding frame validation
- siRNA Design: Thermodynamic + structure scoring
- Off-target Analysis: Multi-species BWA alignment (seed & transcriptome)
- Ranked Results: Scored & filtered siRNA candidates with off-target predictions
Core Components
Gene Search & Data Layer (sirnaforge.data.*)
- Multi-database integration: Ensembl, RefSeq, GENCODE APIs with automatic fallback
- Canonical transcript selection: Prioritizes protein-coding, longest transcripts
- Robust error handling: Network timeouts, API rate limiting, malformed responses
- Async/await support: Non-blocking I/O for improved performance
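The canonical transcript rule described above (protein-coding first, then longest) can be sketched as a single sort key. The `Transcript` dataclass and its field names here are hypothetical and are not taken from `sirnaforge.models.transcript`.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Transcript:
    # Hypothetical minimal record; the real model may differ.
    transcript_id: str
    biotype: str  # e.g. "protein_coding", "retained_intron"
    length: int   # sequence length in nucleotides


def pick_canonical(transcripts: list[Transcript]) -> Transcript:
    """Prefer protein-coding transcripts, then the longest among them."""
    if not transcripts:
        raise ValueError("no transcripts to choose from")
    return max(
        transcripts,
        key=lambda t: (t.biotype == "protein_coding", t.length),
    )


candidates = [
    Transcript("T1", "retained_intron", 4000),
    Transcript("T2", "protein_coding", 2500),
    Transcript("T3", "protein_coding", 1800),
]
canonical = pick_canonical(candidates)
print(canonical.transcript_id)  # T2
```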
ORF Analysis (sirnaforge.data.orf_analysis)
- Reading frame validation: Ensures proper coding sequence targeting
- Quality control reporting: Detailed validation logs and metrics
- Multi-transcript support: Handles gene isoforms and splice variants
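As an illustration of what reading frame validation typically checks, the sketch below tests a coding sequence for frame length, a start codon, a terminal stop codon, and premature stops. This is a generic check under stated assumptions, not the logic of `sirnaforge.data.orf_analysis`; the function name and messages are invented for the example.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def validate_orf(cds: str) -> list[str]:
    """Return the problems found in a coding sequence (empty list if none)."""
    seq = cds.upper().replace("U", "T")
    problems = []
    if len(seq) % 3 != 0:
        problems.append("length is not a multiple of 3")
    if not seq.startswith("ATG"):
        problems.append("missing ATG start codon")
    # Split into complete codons, ignoring any trailing partial codon.
    codons = [seq[i:i + 3] for i in range(0, len(seq) - len(seq) % 3, 3)]
    if not codons or codons[-1] not in STOP_CODONS:
        problems.append("missing terminal stop codon")
    if any(codon in STOP_CODONS for codon in codons[:-1]):
        problems.append("premature stop codon")
    return problems


print(validate_orf("ATGGCCTAA"))     # []
print(validate_orf("ATGTAAGCCTAA"))  # ['premature stop codon']
```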
siRNA Design Engine (sirnaforge.core.design)
- Algorithm-based candidate generation: Systematic 19-23 nucleotide window scanning
- Multi-component scoring system:
  - Thermodynamic properties: GC content (30-60%), melting temperature optimization
  - Secondary structure prediction: ViennaRNA integration for accessibility scoring
  - Position-specific penalties: 5' and 3' end optimization
  - Off-target risk assessment: Simplified seed-region analysis
- Composite scoring: Weighted combination of all scoring components
- Transcript consolidation: Deduplicates guide sequences across multiple transcript isoforms
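The window scanning, GC filter, and transcript consolidation steps listed above can be illustrated with a short sketch. It slides a fixed-length window over each transcript, keeps windows whose GC content falls in range, and groups identical windows across isoforms. Function names, defaults, and the return shape are assumptions made for the example and do not describe `SiRNADesigner` internals.

```python
def gc_percent(seq: str) -> float:
    """GC content of a non-empty sequence, in percent."""
    return 100.0 * sum(base in "GC" for base in seq) / len(seq)


def scan_windows(
    transcripts: dict[str, str],
    length: int = 21,
    gc_min: float = 30.0,
    gc_max: float = 60.0,
) -> dict[str, list[tuple[str, int]]]:
    """Map each passing window to the (transcript_id, 1-based position) pairs where it occurs."""
    hits: dict[str, list[tuple[str, int]]] = {}
    for transcript_id, seq in transcripts.items():
        seq = seq.upper().replace("T", "U")
        for start in range(len(seq) - length + 1):
            window = seq[start:start + length]
            if gc_min <= gc_percent(window) <= gc_max:
                # Identical windows from different isoforms share one entry.
                hits.setdefault(window, []).append((transcript_id, start + 1))
    return hits


# Toy input with a short window so the result is easy to check by hand.
hits = scan_windows({"tx1": "AAGCUAA", "tx2": "GCUAAUU"}, length=5)
print(hits["GCUAA"])  # [('tx1', 3), ('tx2', 1)]
```

Keying the result by sequence is what makes consolidation cheap: a window shared by several isoforms is scored once and reported with all of its locations.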
Off-target Analysis (sirnaforge.core.off_target)
- Adaptive BWA-MEM2 modes: Sensitive genome-wide alignment plus ultra-short miRNA seed analysis using tuned parameters
- Multi-species support: Human, rat, rhesus macaque genome analysis
- Advanced scoring: Position-weighted mismatch penalties with seed-region emphasis
- Scalable processing: Batch candidate analysis with parallel execution
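A position-weighted mismatch penalty with seed-region emphasis can be sketched as follows. The seed region is taken as guide positions 2-8, the usual convention; the weights (3.0 inside the seed, 1.0 elsewhere) and the function name are illustrative assumptions, not values used by siRNAforge.

```python
SEED_POSITIONS = range(2, 9)  # guide positions 2-8, 1-based


def mismatch_penalty(
    guide: str,
    aligned: str,
    seed_weight: float = 3.0,   # illustrative weight, not a siRNAforge default
    other_weight: float = 1.0,  # illustrative weight, not a siRNAforge default
) -> float:
    """Sum per-position mismatch weights; seed mismatches count more.

    `aligned` is the off-target site expressed in guide orientation,
    so that position i of both strings refers to the same guide base.
    """
    if len(guide) != len(aligned):
        raise ValueError("guide and aligned site must have equal length")
    penalty = 0.0
    for position, (g, a) in enumerate(zip(guide, aligned), start=1):
        if g != a:
            penalty += seed_weight if position in SEED_POSITIONS else other_weight
    return penalty


guide = "GUAACAUUUGAGCCUUCUGA"
seed_mismatch = guide[:2] + "C" + guide[3:]    # mismatch at position 3
tail_mismatch = guide[:14] + "A" + guide[15:]  # mismatch at position 15
print(mismatch_penalty(guide, seed_mismatch))  # 3.0
print(mismatch_penalty(guide, tail_mismatch))  # 1.0
```

Under this scheme a higher penalty means the site differs more from the guide, with differences in the seed weighted most heavily, so low-penalty sites are the ones that deserve attention as potential off-targets.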
Nextflow Pipeline Integration (nextflow_pipeline/)
- Containerized execution: Docker/Singularity support with pre-built environments
- Automatic resource management: Dynamic CPU/memory allocation based on workload
- Cloud-ready: AWS S3 genome reference integration with automatic downloading
- Fault tolerance: Resume capability and error recovery mechanisms
- Parallel processing: Multi-genome, multi-candidate simultaneous analysis
Modern Python Architecture
- Type safety: Full mypy compliance with Pydantic models for data validation
- Async/await: Non-blocking I/O throughout the pipeline for improved throughput
- Rich CLI: Beautiful terminal interface with progress bars, tables, and error formatting
- Comprehensive testing: Unit, integration, and pipeline tests with pytest
- Developer experience: Pre-commit hooks, automated formatting (black), linting (ruff)
Repository Structure
sirnaforge/
├── src/sirnaforge/              # Main package (modern src-layout)
│   ├── core/                    # Core algorithms and analysis engines
│   │   ├── design.py            # siRNA design, scoring, and candidate generation
│   │   ├── off_target.py        # BWA-MEM2 off-target analysis (transcriptome + miRNA seed)
│   │   └── thermodynamics.py    # ViennaRNA integration & structure prediction
│   ├── models/                  # Type-safe Pydantic data models
│   │   ├── sirna.py             # siRNA candidates, parameters, results
│   │   └── transcript.py        # Transcript and gene representations
│   ├── data/                    # Data access and integration layer
│   │   ├── gene_search.py       # Multi-database API integration
│   │   ├── orf_analysis.py      # Reading frame and coding validation
│   │   └── base.py              # Common utilities (FASTA parsing, etc.)
│   ├── pipeline/                # Nextflow workflow integration
│   │   ├── nextflow/            # Nextflow execution and config management
│   │   └── resources.py         # Resource and test data management
│   ├── utils/                   # Cross-cutting utilities
│   │   └── logging_utils.py     # Structured logging configuration
│   ├── cli.py                   # Rich CLI interface with Typer
│   └── workflow.py              # High-level workflow orchestration
├── tests/                       # Comprehensive test suite
│   ├── unit/                    # Component-specific unit tests
│   ├── integration/             # Cross-component integration tests
│   ├── pipeline/                # Nextflow pipeline validation tests
│   └── docker/                  # Container integration tests
├── nextflow_pipeline/           # Nextflow DSL2 workflow
│   ├── main.nf                  # Main workflow orchestration
│   ├── nextflow.config          # Execution and resource configuration
│   ├── modules/local/           # Custom process definitions
│   └── subworkflows/local/      # Reusable workflow components
├── docker/                      # Container definitions and environments
│   ├── Dockerfile               # Multi-stage production image
│   └── environment-nextflow.yml # Conda environment specification
├── docs/                        # Documentation and examples
│   ├── api_reference.rst        # API documentation
│   ├── tutorials/               # Step-by-step guides
│   └── examples/                # Working code examples
└── Configuration files
    ├── pyproject.toml           # Python packaging and tool configuration
    ├── Makefile                 # Development workflow automation
    └── uv.lock                  # Reproducible dependency resolution
Output Formats & Results
siRNAforge generates comprehensive, structured outputs for downstream analysis and experimental validation:
Workflow Output Structure
output_directory/
├── transcripts/                      # Retrieved transcript sequences
│   ├── {gene}_transcripts.fasta      # All retrieved transcript isoforms
│   └── temp_for_design.fasta         # Filtered sequences for design
├── orf_reports/                      # Open reading frame validation
│   └── {gene}_orf_validation.txt     # Coding sequence quality report
├── sirnaforge/                       # Core siRNA design results
│   ├── {gene}_sirna_results.csv      # Complete candidate table
│   ├── {gene}_top_candidates.fasta   # Top-ranked sequences for validation
│   └── {gene}_candidate_summary.txt  # Human-readable summary
├── off_target/                       # Off-target analysis results
│   ├── basic_analysis.json           # Simplified off-target metrics
│   ├── input_candidates.fasta        # Candidates sent for analysis
│   └── results/                      # Detailed Nextflow pipeline outputs
│       ├── aggregated/               # Combined multi-species results
│       └── individual_results/       # Per-candidate detailed analysis
├── workflow_manifest.json            # Complete workflow configuration
└── workflow_summary.json             # High-level results summary
Key Output Files
{gene}_sirna_results.csv - Complete candidate table with all scoring metrics:
id,guide_sequence,antisense_sequence,transcript_ids,position,gc_content,melting_temp,thermodynamic_score,secondary_structure_score,off_target_score,composite_score
TP53_001,GUAACAUUUGAGCCUUCUGA,UCAGAAGGCUCAAAUGUUAC,"ENST00000269305;ENST00000455263",245,47.6,52.3,0.85,0.92,0.78,4.22
TP53_002,CAUCAACUGAUUGUGCUGC,GCAGCACAAUCAGUUGAUG,"ENST00000269305",512,52.6,54.1,0.91,0.88,0.82,4.45
...
{gene}_top_candidates.fasta - Ready-to-order sequences for experimental validation:
>TP53_001 score=4.22 gc=47.6% transcripts=2
GUAACAUUUGAGCCUUCUGA
>TP53_002 score=4.45 gc=52.6% transcripts=1
CAUCAACUGAUUGUGCUGC
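The `key=value` fields in these FASTA headers are easy to read back programmatically. The parser below is a minimal sketch that assumes headers follow exactly the layout shown above (an identifier followed by space-separated `key=value` pairs); it is not a siRNAforge API.

```python
def parse_candidate_fasta(text: str) -> list[dict[str, str]]:
    """Parse FASTA records whose headers look like '>ID key=value key=value'."""
    records: list[dict[str, str]] = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            name, *fields = line[1:].split()
            record = {"id": name, "sequence": ""}
            # Keep only well-formed key=value fields from the header.
            record.update(dict(f.split("=", 1) for f in fields if "=" in f))
            records.append(record)
        elif records:
            # Sequence lines are appended to the most recent record.
            records[-1]["sequence"] += line
    return records


FASTA_TEXT = """\
>TP53_001 score=4.22 gc=47.6% transcripts=2
GUAACAUUUGAGCCUUCUGA
>TP53_002 score=4.45 gc=52.6% transcripts=1
CAUCAACUGAUUGUGCUGC
"""
records = parse_candidate_fasta(FASTA_TEXT)
print(records[0]["id"], records[0]["score"])
```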
{gene}_candidate_summary.txt - Human-readable summary report:
siRNAforge Design Summary for TP53
Generated: 2025-09-08 14:30:22
=================================
Input Statistics:
- Transcripts processed: 3
- Total sequence length: 2,847 bp
- Coding sequences: 1,182 bp
Design Results:
- Candidates generated: 1,156
- Passed filters: 234
- Top candidates selected: 50
Top 5 Candidates:
1. TP53_001: GUAACAUUUGAGCCUUCUGA (Score: 4.22, GC: 47.6%)
2. TP53_002: CAUCAACUGAUUGUGCUGC (Score: 4.45, GC: 52.6%)
...
Off-target Analysis Outputs:
{
  "analysis_summary": {
    "candidates_analyzed": 10,
    "total_off_targets": 15,
    "high_confidence_hits": 3
  },
  "by_species": {
    "human": {"transcriptome_hits": 8, "mirna_hits": 2},
    "rat": {"transcriptome_hits": 3, "mirna_hits": 1},
    "rhesus": {"transcriptome_hits": 1, "mirna_hits": 0}
  },
  "candidates": [
    {
      "candidate_id": "TP53_001",
      "guide_sequence": "GUAACAUUUGAGCCUUCUGA",
      "off_target_score": 0.78,
      "species_analysis": {
        "human": {"hits": 5, "seed_matches": 2},
        "rat": {"hits": 2, "seed_matches": 0}
      }
    }
  ]
}
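A report with this shape can be summarized with the standard `json` module. The sketch below embeds a trimmed copy of the sample above rather than reading a file, and the helper name `hits_per_species` is made up for the example.

```python
import json

# Trimmed copy of the sample report shown above.
REPORT_JSON = """
{
  "analysis_summary": {"candidates_analyzed": 10, "total_off_targets": 15, "high_confidence_hits": 3},
  "by_species": {
    "human": {"transcriptome_hits": 8, "mirna_hits": 2},
    "rat": {"transcriptome_hits": 3, "mirna_hits": 1},
    "rhesus": {"transcriptome_hits": 1, "mirna_hits": 0}
  }
}
"""


def hits_per_species(report: dict) -> dict[str, int]:
    """Add transcriptome and miRNA hits for each species."""
    return {
        species: counts["transcriptome_hits"] + counts["mirna_hits"]
        for species, counts in report["by_species"].items()
    }


report = json.loads(REPORT_JSON)
totals = hits_per_species(report)
print(totals)
```

For the sample values, the per-species sums (10, 4, and 1) add up to the reported `total_off_targets` of 15.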
Integration with Analysis Tools
For Laboratory Validation:
- FASTA files can be directly submitted to oligonucleotide synthesis providers
- CSV files import into Excel/R/Python for further analysis
- Candidate rankings support experimental prioritization
For Computational Analysis:
- JSON outputs enable programmatic result processing
- Structured CSV format supports statistical analysis and machine learning
- Off-target data facilitates safety assessment and regulatory compliance
For Visualization and Reporting:
- Summary reports provide publication-ready candidate lists
- Score distributions support quality control assessment
- Multi-species comparisons enable cross-species research applications
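As a small example of the CSV import path, the sketch below loads the sample rows from the results table with the standard `csv` module, splits the semicolon-separated `transcript_ids` column, and sorts by `composite_score`. Treating a higher composite score as better is an assumption made for the example.

```python
import csv
import io

# Sample rows copied from the {gene}_sirna_results.csv example above.
CSV_TEXT = """\
id,guide_sequence,antisense_sequence,transcript_ids,position,gc_content,melting_temp,thermodynamic_score,secondary_structure_score,off_target_score,composite_score
TP53_001,GUAACAUUUGAGCCUUCUGA,UCAGAAGGCUCAAAUGUUAC,"ENST00000269305;ENST00000455263",245,47.6,52.3,0.85,0.92,0.78,4.22
TP53_002,CAUCAACUGAUUGUGCUGC,GCAGCACAAUCAGUUGAUG,"ENST00000269305",512,52.6,54.1,0.91,0.88,0.82,4.45
"""

rows = list(csv.DictReader(io.StringIO(CSV_TEXT)))
for row in rows:
    # transcript_ids is a semicolon-separated list inside one CSV field.
    row["transcript_ids"] = row["transcript_ids"].split(";")

# Assumption: higher composite_score means a better candidate.
ranked = sorted(rows, key=lambda r: float(r["composite_score"]), reverse=True)
for row in ranked:
    print(row["id"], row["composite_score"], len(row["transcript_ids"]))
```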
Nextflow Pipeline Integration
The integrated Nextflow pipeline provides scalable, containerized off-target analysis:
Pipeline Features
- Multi-Species Analysis - Human, rat, rhesus macaque genomes
- Parallel Processing - Each siRNA candidate processed independently
- Auto Index Management - Downloads and builds BWA indices on demand
- Cloud Ready - AWS Batch, Kubernetes, SLURM support
- Comprehensive Results - TSV, JSON, and HTML outputs
Usage Examples
# Standalone pipeline execution
nextflow run nextflow_pipeline/main.nf \
--input candidates.fasta \
--genome_species "human,rat,rhesus" \
--outdir results
# With custom genome indices
nextflow run nextflow_pipeline/main.nf \
--input candidates.fasta \
--genome_indices "human:/path/to/human/index" \
--profile docker
# Using S3-hosted indices
nextflow run nextflow_pipeline/main.nf \
--input candidates.fasta \
--download_indexes true \
--profile aws
Pipeline Output Structure
results/
├── aggregated/                  # Final combined results
│   ├── combined_mirna_analysis.tsv
│   ├── combined_transcriptome_analysis.tsv
│   ├── combined_summary.json
│   └── analysis_report.html
└── individual_results/          # Per-candidate results
    ├── candidate_0001/
    ├── candidate_0002/
    └── ...
Development & Quality Assurance
Modern Development Environment with uv
siRNAforge leverages uv for lightning-fast dependency management and development workflows:
# Complete development setup (recommended)
git clone https://github.com/austin-s-h/sirnaforge
cd sirnaforge
make install-dev # Installs all dev dependencies
# Core development commands
make test-local-python # Fastest Python-only tests (markers=local_python)
make test-fast # Quick pytest suite excluding slow markers
make lint # Ruff (lint + format --check) and mypy
make check # lint-fix + test-fast for pre-commit parity
make docs # Build Sphinx documentation
make docker # Build the production Docker image
# Selective dependency installation
uv sync --group analysis # Jupyter, plotting, pandas extras
uv sync --group pipeline # Nextflow, Docker integration
uv sync --group docs # Sphinx documentation tools
uv sync --group lint # Pre-commit, mypy, ruff, black
# Production deployment (minimal dependencies)
uv sync --no-dev
Conda Environment Management
For local development with bioinformatics tools, siRNAforge provides conda environment management:
# Create complete development environment
make conda-env
# Update existing environment with new dependencies
make conda-env-update
# Remove environment (cleanup)
make conda-env-clean
# Activate environment for development
conda activate sirnaforge-dev
# Deactivate when done
conda deactivate
The conda environment includes all bioinformatics tools (BWA-MEM2, SAMtools, ViennaRNA, etc.) plus Python development dependencies, providing a complete local development setup without Docker.
Quality Assurance & Testing
Comprehensive Test Suite:
# Run all tests with coverage reporting
make test
# Output: >95% code coverage across all modules
# Fast development testing (unit tests only)
make test-fast
# Integration tests (includes external APIs)
uv run pytest tests/integration/ -v
# Pipeline tests (requires Docker/Nextflow)
uv run pytest tests/pipeline/ -v
# Specific test categories
uv run pytest tests/unit/test_design.py::test_scoring_algorithm -v
Code Quality Tools:
# Type checking with mypy (strict mode)
uv run mypy src/
# Result: Success: no issues found in 20 source files
# Code formatting with black
uv run black src tests
make format
# Linting with ruff (fast Python linter)
uv run ruff check src tests
make lint
# All quality checks together
make lint # Runs Ruff lint, Ruff format check, and mypy
Available Dependency Groups
| Group | Purpose | Key Tools |
|---|---|---|
| `dev` | Core development (auto-installed) | pytest, black, ruff |
| `test` | Testing frameworks | pytest-cov, pytest-xdist |
| `lint` | Code quality | mypy, ruff, black |
| `analysis` | Data science workflows | jupyter, matplotlib, pandas |
| `pipeline` | Nextflow integration | workflow tools, containers |
| `docs` | Documentation generation | sphinx, sphinx-rtd-theme |
Code Quality Standards
- Type Safety: Full mypy coverage with Pydantic models
- Formatting: Black + Ruff for consistent style
- Testing: Comprehensive pytest suite with >90% coverage
- CI/CD: GitHub Actions with multi-Python testing
- Security: Bandit + Safety dependency scanning
Performance & System Requirements
Performance Benchmarks
siRNA Design Performance:
- Small genes (1-5 transcripts): ~2-5 seconds
- Medium genes (5-20 transcripts): ~10-30 seconds
- Large genes (20+ transcripts): ~1-2 minutes
- Batch processing (10 genes): ~5-15 minutes
Off-target Analysis Performance:
- Per candidate (single species): ~30-60 seconds
- Multi-species (3 genomes): ~2-5 minutes per candidate
- Batch analysis (50 candidates): ~1-3 hours (parallelized)
System Requirements
Minimum Requirements:
- CPU: 2 cores, 2.0 GHz
- RAM: 4 GB (8 GB recommended for off-target analysis)
- Storage: 2 GB free space (+ 50 GB for genome indices)
- Network: Internet connection for gene searches and genome downloads
Recommended Configuration:
- CPU: 8+ cores, 3.0 GHz (for parallel Nextflow execution)
- RAM: 16-32 GB (for large-scale off-target analysis)
- Storage: SSD with 100+ GB (for genome indices and temporary files)
- Network: High-bandwidth connection for S3 genome downloads
Docker Resource Allocation:
# Recommended Docker settings
docker run --cpus="4" --memory="8g" \
-v $(pwd):/workspace -w /workspace \
ghcr.io/austin-s-h/sirnaforge:latest \
sirnaforge workflow TP53 --genome-species human,rat,rhesus
Docker Usage
Pre-built Images
# Pull latest stable release
docker pull ghcr.io/austin-s-h/sirnaforge:latest
# Run complete workflow
docker run --rm -v $(pwd):/data \
ghcr.io/austin-s-h/sirnaforge:latest \
sirnaforge workflow TP53 --output-dir /data/results
# Interactive development session
docker run -it --rm -v $(pwd):/data \
ghcr.io/austin-s-h/sirnaforge:latest bash
Building Custom Images
# Build production image
make docker
# Build with specific Python version
docker build --build-arg PYTHON_VERSION=3.11 \
-f docker/Dockerfile -t sirnaforge:py311 .
The Docker image uses micromamba with docker/environment-nextflow.yml for consistent bioinformatics tool installations across all environments.
Testing & Quality Assurance
Running Tests
| Command | Under the hood | When to use | Notes |
|---|---|---|---|
| `make test-local-python` | `uv run --group dev pytest -v -m "local_python"` | Fastest feedback loop during development | Python-only markers, no Docker/Nextflow required |
| `make test-unit` | `uv run --group dev pytest -v -m "unit"` | Validate core algorithms | Includes ~30 tests (~30s) |
| `make test-fast` | `uv run --group dev pytest -v -m "not slow"` | Pre-commit or PR checks | Skips slow/integration markers |
| `make test` | `uv run --group dev pytest -v` | Full Python suite | May include slow and docker-marked tests; expect >60s |
| `make test-ci` | `uv run --group dev pytest -m "ci" --junitxml=pytest-report.xml --cov=sirnaforge --cov-report=term-missing --cov-report=xml:coverage.xml -v` | CI pipelines needing artifacts | Produces coverage + JUnit reports |
| `make test-cov` | `uv run --group dev pytest --cov=sirnaforge --cov-report=html --cov-report=term-missing` | Local coverage runs | Outputs HTML coverage in htmlcov/ |
| `make lint` | Ruff lint + Ruff format check + MyPy | Quick code-quality gate | No automatic fixes |
| `make check` | `make lint-fix` + `make test-fast` | Pre-commit parity | Applies Ruff fixes before running fast pytest subset |
Docker-powered tiers share the same pytest markers but execute inside the published image:
| Command | Container invocation | Resource profile | Purpose |
|---|---|---|---|
| `make docker-test-smoke` | `docker run … python -m pytest -q -n 1 -m 'docker and smoke'` | 0.5 CPU / 256 MB | Minimal CI smoke (MUST PASS) |
| `make docker-test-fast` | `docker run … python -m pytest -q -n 1 -m 'docker and not slow'` | 1 CPU / 2 GB | Dev-friendly docker coverage |
| `make docker-test` | `docker run … python -m pytest -v -n 1 -m 'docker and (docker_integration or (not smoke))'` | 2 CPUs / 4 GB | Standard docker regression |
| `make docker-test-full` | `docker run … uv run --group dev pytest -v -n 2` | 4 CPUs / 8 GB | Release-grade validation |
Note: Run `make install-dev` once to install development dependencies and pre-commit hooks before using these targets. The full matrix of commands, filters, and expected runtimes lives in `docs/testing_guide.md`.
Docker smoke snapshot
For a quick environment sanity check, `make docker-test-smoke` exercises the published container image with toy data in ~40 seconds (0.5 CPU, 256 MB). A passing run prints 9 passed with no failures; any remaining pytest collection warnings are tracked in the test suite and should disappear once the dataclass fix in this branch lands.
Fast CI/CD with Toy Data
siRNAforge now includes an improved CI/CD workflow designed for quick feedback with minimal resources:
- Ultra-fast execution: < 15 minutes total
- Minimal resources: 256 MB memory, 0.5 CPU cores
- Toy data: < 500 bytes of test sequences
- Smoke tests: Essential functionality validation
# Trigger fast CI/CD workflow locally
pytest -m "smoke" --tb=short
# Use toy data for quick validation
ls tests/unit/data/toy_*.fasta
# Fast workflow vs comprehensive workflow
# Fast: 15 min, 256MB RAM, toy data
# Full: 60 min, 8GB RAM, real datasets
See docs/ci-cd-fast.md for detailed documentation.
Test Categories
- Unit Tests - Core algorithm validation
- Integration Tests - Component interaction testing
- Pipeline Tests - Nextflow workflow validation
- Docker Tests - Container functionality testing
Documentation
Local Documentation Building
# Install documentation dependencies
uv sync --group docs
# Build HTML documentation
make docs
# Generate CLI reference
make docs-cli
# Live-reload docs during editing
make docs-dev
Generated Documentation
- `docs/_build/html/` - Complete Sphinx HTML documentation (via `make docs`)
- `docs/CLI_REFERENCE.md` - Auto-generated CLI help (via `make docs-cli`)
- `docs/api_reference.rst` - Python API reference source
- `docs/modification_annotation_spec.md` - Chemical modifications metadata specification
See docs/getting_started.md for detailed tutorials and docs/deployment.md for deployment guides.
Chemical Modifications Metadata
siRNAforge supports structured annotation of chemical modifications, overhangs, and provenance information for siRNA sequences. This enables systematic tracking of modifications like 2'-O-methyl, 2'-fluoro, and phosphorothioate linkages.
Quick Example:
# Create metadata JSON file
cat > metadata.json << 'EOF'
{
"patisiran_ttr_guide": {
"id": "patisiran_ttr_guide",
"sequence": "AUGGAAUACUCUUGGUUAC",
"target_gene": "TTR",
"strand_role": "guide",
"overhang": "dTdT",
"chem_mods": [
{
"type": "2OMe",
"positions": [1, 4, 6, 11, 13, 16, 19]
}
],
"provenance": {
"source_type": "patent",
"identifier": "US10060921B2",
"url": "https://patents.google.com/patent/US10060921B2"
},
"confirmation_status": "confirmed"
}
}
EOF
# Annotate FASTA with metadata
sirnaforge sequences annotate sequences.fasta metadata.json -o annotated.fasta
# View sequences with metadata
sirnaforge sequences show annotated.fasta
sirnaforge sequences show annotated.fasta --format json
Features:
- Chemical Modifications - Annotate 2'-O-methyl, 2'-fluoro, PS linkages, LNA, etc.
- Position Tracking - 1-based position numbering for each modification
- Overhang Support - DNA (dTdT) or RNA (UU) overhangs
- Provenance - Track sources (patents, publications, clinical trials)
- Confirmation Status - Mark validated vs. predicted sequences
- FASTA Headers - Standardized key-value encoding in headers
- JSON Sidecars - Separate metadata files for easy curation
Common Modification Types:
- `2OMe` - 2'-O-methyl (nuclease resistance)
- `2F` - 2'-fluoro (enhanced stability)
- `PS` - Phosphorothioate (nuclease resistance)
- `LNA` - Locked Nucleic Acid (enhanced binding)
- `MOE` - 2'-O-methoxyethyl (improved pharmacokinetics)
Python API:
from sirnaforge.models.modifications import (
    ChemicalModification,
    Provenance,
    SequenceRecord,
    SourceType,
    StrandMetadata,
    StrandRole,
)

# Create metadata
metadata = StrandMetadata(
    id="my_sirna_guide",
    sequence="AUCGAUCGAUCGAUCGAUCGA",
    overhang="dTdT",
    chem_mods=[
        ChemicalModification(type="2OMe", positions=[1, 4, 6, 11])
    ],
    provenance=Provenance(
        source_type=SourceType.PUBLICATION,
        identifier="PMID12345678"
    )
)

# Generate FASTA with metadata
record = SequenceRecord(
    target_gene="BRCA1",
    strand_role=StrandRole.GUIDE,
    metadata=metadata
)
print(record.to_fasta())
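Because modification positions are 1-based, a quick range check before annotating can catch off-by-one mistakes in hand-curated metadata. The helper below is a standalone sketch with a hypothetical name; it does not use or describe validation built into `sirnaforge.models.modifications`.

```python
def out_of_range_positions(sequence: str, positions: list[int]) -> list[int]:
    """Return the 1-based positions that fall outside 1..len(sequence)."""
    return [p for p in positions if not 1 <= p <= len(sequence)]


# Sequence and positions taken from the metadata.json example above.
ttr_guide = "AUGGAAUACUCUUGGUUAC"  # 19 nt
print(out_of_range_positions(ttr_guide, [1, 4, 6, 11, 13, 16, 19]))  # []
print(out_of_range_positions(ttr_guide, [0, 19, 20]))                # [0, 20]
```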
See docs/modification_annotation_spec.md for complete specification, API reference, and examples.
Contributing
We welcome contributions to siRNAforge! Here's how to get started:
Development Setup
- Fork the repository on GitHub
- Clone your fork: `git clone https://github.com/yourusername/sirnaforge`
- Set up the development environment: `make install-dev`
- Create a feature branch: `git checkout -b feature/amazing-feature`
Development Workflow
# Make your changes
# ...
# Ensure code quality
make lint # Check code style and types
make format # Auto-format code
make test-local-python # Fast sanity suite
make check # Auto-fix lint + fast pytest
# Commit and push
git add .
git commit -m 'Add amazing feature'
git push origin feature/amazing-feature
Contribution Guidelines
- Code Style: Follow Black formatting and Ruff linting rules
- Type Hints: All new code must include type annotations
- Tests: Add tests for new functionality
- Documentation: Update docstrings and documentation
- Commit Messages: Use conventional commit format
Pull Request Process
- Ensure all tests pass and code is properly formatted
- Update documentation for any API changes
- Add entries to `CHANGELOG.md` for user-facing changes
- Create a pull request with a clear description
See CONTRIBUTING.md for detailed guidelines.
License
This project is licensed under the MIT License - see the LICENSE file for details.
Acknowledgments
siRNAforge builds upon excellent open-source tools and libraries:
- ViennaRNA Package - RNA secondary structure prediction
- BWA-MEM2 - Fast and accurate sequence alignment
- Nextflow - Workflow management and containerization
- BioPython - Python bioinformatics toolkit
- Pydantic - Data validation and type safety
- Modern Python Stack - uv, Typer, Rich for developer experience
Note: Much of the code in this repository was developed with assistance from AI agents, but all code has been reviewed, tested, and validated by human developers.