
siRNAforge - Multi-species gene to siRNA design, off-target prediction, and ranking. Comprehensive siRNA design toolkit for gene silencing


🧬 siRNAforge - Comprehensive siRNA Design Tool

Multi-species gene to siRNA design, off-target prediction, and ranking

siRNAforge is a toolkit for designing siRNAs with integrated off-target analysis. It runs on Python 3.9-3.12 and combines thermodynamic scoring, secondary-structure prediction, and multi-species off-target alignment in a single gene-to-candidate workflow for research and biotechnology use.

✨ Key Features

  • 🎯 Algorithm-driven design - Comprehensive siRNA design with multi-component thermodynamic scoring
  • 🔍 Multi-species off-target analysis - BWA-MEM2 alignment (transcriptome + miRNA seed modes) across human, rat, rhesus genomes
  • 📊 Advanced scoring system - Composite scoring with seed-region specificity and secondary structure prediction
  • 🧪 ViennaRNA integration - Secondary structure prediction for enhanced design accuracy
  • 🧬 Chemical modifications metadata - Track 2'-O-methyl, 2'-fluoro, PS linkages, overhangs, and provenance
  • 🔬 Nextflow pipeline integration - Scalable, containerized workflow execution with automatic parallelization
  • 🐍 Modern Python architecture - Type-safe code with Pydantic models, async/await support, and rich CLI
  • ⚡ Lightning-fast dependency management - Built with uv for sub-second installs and virtual environment management
  • 🐳 Fully containerized - Docker images with all bioinformatics dependencies pre-installed
  • 🧬 Multi-database support - Ensembl, RefSeq, GENCODE integration for comprehensive transcript retrieval

Note: Supports Python 3.9-3.12. Python 3.13+ not yet supported due to ViennaRNA dependency constraints.

🚀 Quick Start

Installation Options

🐳 Docker (Recommended - Complete Environment):

# Pull the pre-built image with all dependencies
docker pull ghcr.io/austin-s-h/sirnaforge:latest

# Quick workflow example
docker run -v $(pwd):/workspace -w /workspace \
  ghcr.io/austin-s-h/sirnaforge:latest \
  sirnaforge workflow TP53 --output-dir results --genome-species human

# With custom parameters
docker run -v $(pwd):/workspace -w /workspace \
  ghcr.io/austin-s-h/sirnaforge:latest \
  sirnaforge workflow BRCA1 --gc-min 40 --gc-max 60 --sirna-length 21 --top-n 50

🐍 Conda Environment (Alternative - Local Development):

# Install micromamba (recommended - fastest), Miniforge, or Miniconda
# micromamba (fastest option):
curl -LsSf https://micro.mamba.pm/install.sh | bash

# Or Miniforge (successor to the discontinued Mambaforge installer):
curl -L -O "https://github.com/conda-forge/miniforge/releases/latest/download/Miniforge3-$(uname)-$(uname -m).sh"
bash Miniforge3-$(uname)-$(uname -m).sh

# Create siRNAforge development environment
make conda-env

# Activate the environment
micromamba activate sirnaforge-dev  # or conda activate sirnaforge-dev

# Install Python dependencies
make install-dev

# Run tests to verify installation
make test-local-python

🖥️ Local Development Installation:

# Install uv (lightning-fast Python package manager)
curl -LsSf https://astral.sh/uv/install.sh | sh

# Clone and setup with development dependencies
git clone https://github.com/austin-s-h/sirnaforge
cd sirnaforge
make install-dev

# Run sanity checks to verify installation
make test-local-python

Essential Dependencies for Off-target Analysis

The Docker image includes all bioinformatics dependencies via conda environment (docker/environment-nextflow.yml):

  • ✅ Nextflow (≥25.04.0) - Workflow orchestration and parallelization
  • ✅ BWA-MEM2 (≥2.2.1) - High-performance genome alignment (transcriptome + miRNA seed analysis)
  • ✅ SAMtools (≥1.19.2) - SAM/BAM file processing and indexing
  • ✅ ViennaRNA (≥2.7.0) - RNA secondary structure prediction
  • ✅ AWS CLI (≥2.0) - Automated genome reference downloads
  • ✅ Java 17 - Nextflow runtime environment

For local development without Docker:

# Option 1: Use conda environment (includes all tools)
make conda-env
micromamba activate sirnaforge-dev  # or conda activate sirnaforge-dev

# Option 2: Install bioinformatics tools via micromamba
curl -LsSf https://micro.mamba.pm/install.sh | bash
micromamba env create -f docker/environment-nextflow.yml
micromamba activate sirnaforge-env

Usage Examples

🎯 Complete Workflow (Gene Query to Results):

# Basic workflow with default parameters
uv run sirnaforge workflow TP53 --output-dir results

# Advanced workflow with custom parameters
uv run sirnaforge workflow BRCA1 \
  --genome-species "human,rat,rhesus" \
  --gc-min 40 --gc-max 60 \
  --sirna-length 21 \
  --top-n 50 \
  --output-dir brca1_analysis

# Workflow from a pre-existing FASTA file (local path or remote URL)
uv run sirnaforge workflow --input-fasta transcripts.fasta \
  --output-dir custom_analysis \
  --offtarget-n 25 \
  custom_gene_name

# Remote FASTA example
uv run sirnaforge workflow --input-fasta https://example.org/transcripts.fasta \
  --output-dir remote_input_run \
  remote_dataset
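
When the CLI is driven from a script, it can help to assemble the argument list programmatically rather than by string concatenation. The sketch below is only an illustration: `build_workflow_command` is a hypothetical helper (not part of siRNAforge), and it uses only the flags that appear in the examples above.

```python
import shlex
from typing import List, Optional, Sequence


def build_workflow_command(
    gene: str,
    output_dir: str,
    species: Sequence[str] = ("human",),
    gc_min: Optional[float] = None,
    gc_max: Optional[float] = None,
    top_n: Optional[int] = None,
) -> List[str]:
    """Hypothetical helper: build an argv list for `sirnaforge workflow`."""
    argv = ["sirnaforge", "workflow", gene, "--output-dir", output_dir]
    # Species are passed as one comma-separated value, as in the examples above.
    argv += ["--genome-species", ",".join(species)]
    if gc_min is not None:
        argv += ["--gc-min", str(gc_min)]
    if gc_max is not None:
        argv += ["--gc-max", str(gc_max)]
    if top_n is not None:
        argv += ["--top-n", str(top_n)]
    return argv


cmd = build_workflow_command(
    "BRCA1",
    "brca1_analysis",
    species=["human", "rat", "rhesus"],
    gc_min=40,
    gc_max=60,
    top_n=50,
)
# shlex.join quotes each argument safely for display or logging.
print(shlex.join(cmd))
```

The resulting list can be handed to `subprocess.run(cmd, check=True)`, optionally prefixed with `["uv", "run"]` to match the invocations shown above.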

🔍 Individual Component Usage:

# Search for gene transcripts across databases
uv run sirnaforge search TP53 --output transcripts.fasta --database ensembl

# Design siRNAs from transcript sequences
uv run sirnaforge design transcripts.fasta --output results.csv --top-n 20

# Validate input files before processing
uv run sirnaforge validate candidates.fasta

# Display configuration and system information
uv run sirnaforge config

# Show detailed help for any command
uv run sirnaforge --help
uv run sirnaforge workflow --help

Python API

🔧 Programmatic Access for Custom Workflows:

import asyncio
from pathlib import Path
from sirnaforge.workflow import run_sirna_workflow
from sirnaforge.core.design import SiRNADesigner
from sirnaforge.models.sirna import DesignParameters, FilterCriteria
from sirnaforge.data.gene_search import search_gene_sync

# Complete async workflow with custom parameters
async def design_sirnas_custom():
    results = await run_sirna_workflow(
        gene_query="TP53",
        output_dir="results",
        database="ensembl",
        top_n_candidates=50,
        top_n_offtarget=15,
        genome_species=["human", "rat", "rhesus"],
        gc_min=40.0,
        gc_max=60.0,
        sirna_length=21,
    )
    return results

# Run the workflow
results = asyncio.run(design_sirnas_custom())
print(f"✅ Designed {len(results.get('top_candidates', []))} siRNA candidates")

# Individual component usage for custom pipelines
def custom_design_pipeline():
    # 1. Search for gene transcripts
    transcripts = search_gene_sync(
        gene_query="BRCA1",
        database="ensembl",
        output_file="transcripts.fasta"
    )

    # 2. Configure design parameters
    design_params = DesignParameters(
        sirna_length=21,
        filters=FilterCriteria(
            gc_min=40,
            gc_max=60,
            avoid_patterns=["AAAA", "TTTT", "GGGG", "CCCC"]
        )
    )

    # 3. Initialize designer and generate candidates
    designer = SiRNADesigner(design_params)
    design_results = designer.design_from_file("transcripts.fasta")

    # 4. Process results
    for candidate in design_results.top_candidates[:10]:
        print(f"Candidate {candidate.id}:")
        print(f"  Guide: {candidate.guide_sequence}")
        print(f"  Score: {candidate.composite_score:.2f}")
        print(f"  GC%: {candidate.gc_content:.1f}")
        print(f"  Transcripts: {len(candidate.transcript_ids)}")
        print()

    return design_results

# Example: Batch processing multiple genes
async def batch_design_genes(genes: list[str]):
    results = {}
    for gene in genes:
        print(f"Processing {gene}...")
        gene_results = await run_sirna_workflow(
            gene_query=gene,
            output_dir=f"results_{gene.lower()}",
            top_n_candidates=20
        )
        results[gene] = gene_results
    return results

# Process multiple cancer-related genes
cancer_genes = ["TP53", "BRCA1", "BRCA2", "EGFR", "MYC"]
batch_results = asyncio.run(batch_design_genes(cancer_genes))
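
The batch example above awaits each gene in turn, so the genes are processed one after another. If several workflows can safely run at once (this README does not say whether they can, and external APIs may rate-limit requests), a bounded-concurrency variant might look like the sketch below. `fake_workflow` is a hypothetical stand-in so the snippet runs without siRNAforge installed; in practice a wrapper around `run_sirna_workflow` would take its place.

```python
import asyncio
from typing import Awaitable, Callable, Dict, Sequence


async def run_batch(
    genes: Sequence[str],
    worker: Callable[[str], Awaitable[dict]],
    max_concurrent: int = 2,
) -> Dict[str, dict]:
    """Run `worker` for every gene with at most `max_concurrent` in flight."""
    semaphore = asyncio.Semaphore(max_concurrent)

    async def guarded(gene: str):
        async with semaphore:  # limits how many workers run simultaneously
            return gene, await worker(gene)

    pairs = await asyncio.gather(*(guarded(gene) for gene in genes))
    return dict(pairs)


async def fake_workflow(gene: str) -> dict:
    # Hypothetical placeholder for run_sirna_workflow(gene_query=gene, ...).
    await asyncio.sleep(0.01)
    return {"gene": gene, "top_candidates": []}


batch = asyncio.run(run_batch(["TP53", "BRCA1", "MYC"], fake_workflow))
print(sorted(batch))
```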

🏗️ Architecture & Workflow

Complete Pipeline Overview

Gene Query → Transcript Search → ORF Validation → siRNA Design → Off-target Analysis → Ranked Results
     ↓              ↓                ↓               ↓               ↓                    ↓
Multi-database   Canonical       Coding Frame   Thermodynamic   Multi-species BWA    Scored & Filtered
Gene Search      Isoform         Validation     + Structure     Alignment (seed &    siRNA Candidates
(Ensembl/        Selection                      Scoring         transcriptome)       with Off-target
RefSeq/GENCODE)                                                                    Predictions

Core Components

🔍 Gene Search & Data Layer (sirnaforge.data.*)

  • Multi-database integration: Ensembl, RefSeq, GENCODE APIs with automatic fallback
  • Canonical transcript selection: Prioritizes protein-coding, longest transcripts
  • Robust error handling: Network timeouts, API rate limiting, malformed responses
  • Async/await support: Non-blocking I/O for improved performance

🧬 ORF Analysis (sirnaforge.data.orf_analysis)

  • Reading frame validation: Ensures proper coding sequence targeting
  • Quality control reporting: Detailed validation logs and metrics
  • Multi-transcript support: Handles gene isoforms and splice variants

🎯 siRNA Design Engine (sirnaforge.core.design)

  • Algorithm-based candidate generation: Systematic 19-23 nucleotide window scanning

  • Multi-component scoring system:

    • Thermodynamic properties: GC content (30-60%), melting temperature optimization
    • Secondary structure prediction: ViennaRNA integration for accessibility scoring
    • Position-specific penalties: 5' and 3' end optimization
    • Off-target risk assessment: Simplified seed-region analysis
  • Composite scoring: Weighted combination of all scoring components

  • Transcript consolidation: Deduplicates guide sequences across multiple transcript isoforms

🔍 Off-target Analysis (sirnaforge.core.off_target)

  • Adaptive BWA-MEM2 modes: Sensitive genome-wide alignment plus ultra-short miRNA seed analysis using tuned parameters
  • Multi-species support: Human, rat, rhesus macaque genome analysis
  • Advanced scoring: Position-weighted mismatch penalties with seed-region emphasis
  • Scalable processing: Batch candidate analysis with parallel execution
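
The window scanning and GC filtering listed under the siRNA Design Engine can be illustrated with a small, self-contained sketch. This is not siRNAforge's implementation: `scan_windows` and `gc_percent` are hypothetical names, positions are 0-based here, the homopolymer patterns are written in RNA letters, and the real engine additionally applies thermodynamic, structural, and off-target scoring.

```python
from typing import Iterator, Sequence, Tuple


def gc_percent(seq: str) -> float:
    """Percentage of G and C bases in a sequence."""
    return 100.0 * sum(base in "GC" for base in seq) / len(seq)


def scan_windows(
    transcript: str,
    length: int = 21,
    gc_min: float = 40.0,
    gc_max: float = 60.0,
    avoid: Sequence[str] = ("AAAA", "UUUU", "GGGG", "CCCC"),
) -> Iterator[Tuple[int, str, float]]:
    """Yield (start, window, gc) for every target-site window passing the filters."""
    seq = transcript.upper().replace("T", "U")  # work in RNA letters
    for start in range(len(seq) - length + 1):
        window = seq[start:start + length]
        gc = gc_percent(window)
        if not gc_min <= gc <= gc_max:
            continue  # outside the allowed GC range
        if any(pattern in window for pattern in avoid):
            continue  # contains a homopolymer run
        yield start, window, gc


# Toy input, not a real transcript.
for start, window, gc in scan_windows("ACGT" * 10):
    print(f"{start:>3} {window} GC={gc:.1f}%")
```

A generator keeps memory use flat for long transcripts; ranking and deduplication across isoforms would happen in later steps.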

🔬 Nextflow Pipeline Integration (nextflow_pipeline/)

  • Containerized execution: Docker/Singularity support with pre-built environments
  • Automatic resource management: Dynamic CPU/memory allocation based on workload
  • Cloud-ready: AWS S3 genome reference integration with automatic downloading
  • Fault tolerance: Resume capability and error recovery mechanisms
  • Parallel processing: Multi-genome, multi-candidate simultaneous analysis

⚡ Modern Python Architecture

  • Type safety: Full mypy compliance with Pydantic models for data validation
  • Async/await: Non-blocking I/O throughout the pipeline for improved throughput
  • Rich CLI: Beautiful terminal interface with progress bars, tables, and error formatting
  • Comprehensive testing: Unit, integration, and pipeline tests with pytest
  • Developer experience: Pre-commit hooks, automated formatting (black), linting (ruff)

Repository Structure

sirnaforge/
├── 📦 src/sirnaforge/              # Main package (modern src-layout)
│   ├── 🎯 core/                   # Core algorithms and analysis engines
│   │   ├── design.py              # siRNA design, scoring, and candidate generation
│   │   ├── off_target.py          # BWA-MEM2 off-target analysis (transcriptome + miRNA seed)
│   │   └── thermodynamics.py     # ViennaRNA integration & structure prediction
│   ├── 📊 models/                 # Type-safe Pydantic data models
│   │   ├── sirna.py              # siRNA candidates, parameters, results
│   │   └── transcript.py         # Transcript and gene representations
│   ├── 💾 data/                   # Data access and integration layer
│   │   ├── gene_search.py        # Multi-database API integration
│   │   ├── orf_analysis.py       # Reading frame and coding validation
│   │   └── base.py               # Common utilities (FASTA parsing, etc.)
│   ├── 🔧 pipeline/               # Nextflow workflow integration
│   │   ├── nextflow/             # Nextflow execution and config management
│   │   └── resources.py          # Resource and test data management
│   ├── 🛠️ utils/                  # Cross-cutting utilities
│   │   └── logging_utils.py      # Structured logging configuration
│   ├── 📟 cli.py                  # Rich CLI interface with Typer
│   └── workflow.py               # High-level workflow orchestration
├── 🧪 tests/                      # Comprehensive test suite
│   ├── unit/                     # Component-specific unit tests
│   ├── integration/              # Cross-component integration tests
│   ├── pipeline/                 # Nextflow pipeline validation tests
│   └── docker/                   # Container integration tests
├── 🌊 nextflow_pipeline/          # Nextflow DSL2 workflow
│   ├── main.nf                   # Main workflow orchestration
│   ├── nextflow.config           # Execution and resource configuration
│   ├── modules/local/            # Custom process definitions
│   └── subworkflows/local/       # Reusable workflow components
├── 🐳 docker/                     # Container definitions and environments
│   ├── Dockerfile                # Multi-stage production image
│   └── environment-nextflow.yml  # Conda environment specification
├── 📚 docs/                       # Documentation and examples
│   ├── api_reference.rst         # API documentation
│   ├── tutorials/                # Step-by-step guides
│   └── examples/                 # Working code examples
└── 🔧 Configuration files
    ├── pyproject.toml            # Python packaging and tool configuration
    ├── Makefile                  # Development workflow automation
    └── uv.lock                   # Reproducible dependency resolution

📊 Output Formats & Results

siRNAforge generates comprehensive, structured outputs for downstream analysis and experimental validation:

Workflow Output Structure

output_directory/
├── 📁 transcripts/                  # Retrieved transcript sequences
│   ├── {gene}_transcripts.fasta     # All retrieved transcript isoforms
│   └── temp_for_design.fasta        # Filtered sequences for design
├── 📁 orf_reports/                  # Open reading frame validation
│   └── {gene}_orf_validation.txt    # Coding sequence quality report
├── 📁 sirnaforge/                   # Core siRNA design results
│   ├── {gene}_sirna_results.csv     # Complete candidate table
│   ├── {gene}_top_candidates.fasta  # Top-ranked sequences for validation
│   └── {gene}_candidate_summary.txt # Human-readable summary
├── 📁 off_target/                   # Off-target analysis results
│   ├── basic_analysis.json          # Simplified off-target metrics
│   ├── input_candidates.fasta       # Candidates sent for analysis
│   └── results/                     # Detailed Nextflow pipeline outputs
│       ├── aggregated/              # Combined multi-species results
│       └── individual_results/      # Per-candidate detailed analysis
├── 📄 workflow_manifest.json        # Complete workflow configuration
└── 📄 workflow_summary.json         # High-level results summary

Key Output Files

🎯 {gene}_sirna_results.csv - Complete candidate table with all scoring metrics:

```csv
id,guide_sequence,antisense_sequence,transcript_ids,position,gc_content,melting_temp,thermodynamic_score,secondary_structure_score,off_target_score,composite_score
TP53_001,GUAACAUUUGAGCCUUCUGA,UCAGAAGGCUCAAAUGUUAC,"ENST00000269305;ENST00000455263",245,47.6,52.3,0.85,0.92,0.78,4.22
TP53_002,CAUCAACUGAUUGUGCUGC,GCAGCACAAUCAGUUGAUG,"ENST00000269305",512,52.6,54.1,0.91,0.88,0.82,4.45
...
```

🧬 {gene}_top_candidates.fasta - Ready-to-order sequences for experimental validation:

>TP53_001 score=4.22 gc=47.6% transcripts=2
GUAACAUUUGAGCCUUCUGA
>TP53_002 score=4.45 gc=52.6% transcripts=1
CAUCAACUGAUUGUGCUGC
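
The FASTA headers above carry `key=value` annotations, which makes them easy to read back in a script. The following sketch is a minimal parser written against the sample shown here; `parse_header` and `read_candidates` are hypothetical helper names, and real output files may contain additional fields.

```python
from typing import Dict, Iterator, List, Optional, Tuple


def parse_header(line: str) -> Tuple[str, Dict[str, str]]:
    """Split '>ID key=value ...' into the ID and a dict of annotations."""
    fields = line.lstrip(">").split()
    seq_id, pairs = fields[0], fields[1:]
    attrs = dict(pair.split("=", 1) for pair in pairs if "=" in pair)
    return seq_id, attrs


def read_candidates(text: str) -> Iterator[Tuple[str, Dict[str, str], str]]:
    """Yield (id, annotations, sequence) for each FASTA record in `text`."""
    seq_id: Optional[str] = None
    attrs: Dict[str, str] = {}
    chunks: List[str] = []
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue
        if line.startswith(">"):
            if seq_id is not None:
                yield seq_id, attrs, "".join(chunks)
            seq_id, attrs = parse_header(line)
            chunks = []
        else:
            chunks.append(line)
    if seq_id is not None:
        yield seq_id, attrs, "".join(chunks)


example = """>TP53_001 score=4.22 gc=47.6% transcripts=2
GUAACAUUUGAGCCUUCUGA
>TP53_002 score=4.45 gc=52.6% transcripts=1
CAUCAACUGAUUGUGCUGC
"""
records = list(read_candidates(example))
# Pick the record with the highest numeric score annotation.
best = max(records, key=lambda record: float(record[1]["score"]))
print(best[0], best[2])
```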

📋 {gene}_candidate_summary.txt - Human-readable summary report:

siRNAforge Design Summary for TP53
Generated: 2025-09-08 14:30:22
=================================

Input Statistics:
- Transcripts processed: 3
- Total sequence length: 2,847 bp
- Coding sequences: 1,182 bp

Design Results:
- Candidates generated: 1,156
- Passed filters: 234
- Top candidates selected: 50

Top 5 Candidates:
1. TP53_001: GUAACAUUUGAGCCUUCUGA (Score: 4.22, GC: 47.6%)
2. TP53_002: CAUCAACUGAUUGUGCUGC (Score: 4.45, GC: 52.6%)
...

🔍 Off-target Analysis Outputs:

{
  "analysis_summary": {
    "candidates_analyzed": 10,
    "total_off_targets": 15,
    "high_confidence_hits": 3
  },
  "by_species": {
    "human": {"transcriptome_hits": 8, "mirna_hits": 2},
    "rat": {"transcriptome_hits": 3, "mirna_hits": 1},
    "rhesus": {"transcriptome_hits": 1, "mirna_hits": 0}
  },
  "candidates": [
    {
      "candidate_id": "TP53_001",
      "guide_sequence": "GUAACAUUUGAGCCUUCUGA",
      "off_target_score": 0.78,
      "species_analysis": {
        "human": {"hits": 5, "seed_matches": 2},
        "rat": {"hits": 2, "seed_matches": 0}
      }
    }
  ]
}
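
A report shaped like the JSON above can be summarized with the standard library alone. The sketch below embeds the sample so it runs as-is; the key names are taken from that sample, the helper functions are hypothetical, and the zero-seed-match threshold is an arbitrary illustration rather than a recommended cutoff.

```python
import json
from typing import Dict, List

report = json.loads("""
{
  "analysis_summary": {"candidates_analyzed": 10, "total_off_targets": 15, "high_confidence_hits": 3},
  "by_species": {
    "human": {"transcriptome_hits": 8, "mirna_hits": 2},
    "rat": {"transcriptome_hits": 3, "mirna_hits": 1},
    "rhesus": {"transcriptome_hits": 1, "mirna_hits": 0}
  },
  "candidates": [
    {
      "candidate_id": "TP53_001",
      "guide_sequence": "GUAACAUUUGAGCCUUCUGA",
      "off_target_score": 0.78,
      "species_analysis": {
        "human": {"hits": 5, "seed_matches": 2},
        "rat": {"hits": 2, "seed_matches": 0}
      }
    }
  ]
}
""")


def hits_per_species(data: dict) -> Dict[str, int]:
    """Total transcriptome plus miRNA hits for each species."""
    return {
        species: counts["transcriptome_hits"] + counts["mirna_hits"]
        for species, counts in data["by_species"].items()
    }


def flag_seed_matches(data: dict, max_seed_matches: int = 0) -> List[str]:
    """IDs of candidates whose summed seed matches exceed the threshold."""
    flagged = []
    for candidate in data["candidates"]:
        total = sum(s["seed_matches"] for s in candidate["species_analysis"].values())
        if total > max_seed_matches:
            flagged.append(candidate["candidate_id"])
    return flagged


print(hits_per_species(report))
print(flag_seed_matches(report))
```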

Integration with Analysis Tools

🔬 For Laboratory Validation:

  • FASTA files can be directly submitted to oligonucleotide synthesis providers
  • CSV files import into Excel/R/Python for further analysis
  • Candidate rankings support experimental prioritization

🖥️ For Computational Analysis:

  • JSON outputs enable programmatic result processing
  • Structured CSV format supports statistical analysis and machine learning
  • Off-target data facilitates safety assessment and regulatory compliance

📊 For Visualization and Reporting:

  • Summary reports provide publication-ready candidate lists
  • Score distributions support quality control assessment
  • Multi-species comparisons enable cross-species research applications
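
As one example of the CSV import mentioned above, the results table can be loaded with Python's `csv` module, ranked, and sanity-checked. The sketch uses the two illustrative rows from the `{gene}_sirna_results.csv` sample, assumes that a higher `composite_score` is better (the README does not state the direction), and checks that each antisense sequence is the reverse complement of its guide.

```python
import csv
import io

SAMPLE = """id,guide_sequence,antisense_sequence,transcript_ids,position,gc_content,melting_temp,thermodynamic_score,secondary_structure_score,off_target_score,composite_score
TP53_001,GUAACAUUUGAGCCUUCUGA,UCAGAAGGCUCAAAUGUUAC,"ENST00000269305;ENST00000455263",245,47.6,52.3,0.85,0.92,0.78,4.22
TP53_002,CAUCAACUGAUUGUGCUGC,GCAGCACAAUCAGUUGAUG,"ENST00000269305",512,52.6,54.1,0.91,0.88,0.82,4.45
"""


def reverse_complement_rna(seq: str) -> str:
    """Reverse complement of an RNA sequence (A<->U, C<->G)."""
    return seq.translate(str.maketrans("ACGU", "UGCA"))[::-1]


# With a real file, replace io.StringIO(SAMPLE) with open(path, newline="").
rows = list(csv.DictReader(io.StringIO(SAMPLE)))

# Assumption: larger composite_score means a better candidate.
ranked = sorted(rows, key=lambda row: float(row["composite_score"]), reverse=True)

for row in ranked:
    paired = reverse_complement_rna(row["guide_sequence"]) == row["antisense_sequence"]
    transcripts = row["transcript_ids"].split(";")  # semicolon-separated IDs
    print(row["id"], row["composite_score"], len(transcripts), paired)
```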

🔬 Nextflow Pipeline Integration

The integrated Nextflow pipeline provides scalable, containerized off-target analysis:

Pipeline Features

  • Multi-Species Analysis - Human, rat, rhesus macaque genomes
  • Parallel Processing - Each siRNA candidate processed independently
  • Auto Index Management - Downloads and builds BWA indices on demand
  • Cloud Ready - AWS Batch, Kubernetes, SLURM support
  • Comprehensive Results - TSV, JSON, and HTML outputs

Usage Examples

# Standalone pipeline execution
nextflow run nextflow_pipeline/main.nf \
  --input candidates.fasta \
  --genome_species "human,rat,rhesus" \
  --outdir results

# With custom genome indices
nextflow run nextflow_pipeline/main.nf \
  --input candidates.fasta \
  --genome_indices "human:/path/to/human/index" \
  -profile docker

# Using S3-hosted indices
nextflow run nextflow_pipeline/main.nf \
  --input candidates.fasta \
  --download_indexes true \
  -profile aws

Pipeline Output Structure

results/
├── aggregated/                    # Final combined results
│   ├── combined_mirna_analysis.tsv
│   ├── combined_transcriptome_analysis.tsv
│   ├── combined_summary.json
│   └── analysis_report.html
└── individual_results/            # Per-candidate results
    ├── candidate_0001/
    ├── candidate_0002/
    └── ...

🛠️ Development & Quality Assurance

Modern Development Environment with uv

siRNAforge leverages uv for lightning-fast dependency management and development workflows:

# Complete development setup (recommended)
git clone https://github.com/austin-s-h/sirnaforge
cd sirnaforge
make install-dev  # Installs all dev dependencies

# Core development commands
make test-local-python  # Fastest Python-only tests (markers=local_python)
make test-fast          # Quick pytest suite excluding slow markers
make lint               # Ruff (lint + format --check) and mypy
make check              # lint-fix + test-fast for pre-commit parity
make docs               # Build Sphinx documentation
make docker             # Build the production Docker image

# Selective dependency installation
uv sync --group analysis    # Jupyter, plotting, pandas extras
uv sync --group pipeline    # Nextflow, Docker integration
uv sync --group docs        # Sphinx documentation tools
uv sync --group lint        # Pre-commit, mypy, ruff, black

# Production deployment (minimal dependencies)
uv sync --no-dev

Conda Environment Management

For local development with bioinformatics tools, siRNAforge provides conda environment management:

# Create complete development environment
make conda-env

# Update existing environment with new dependencies
make conda-env-update

# Remove environment (cleanup)
make conda-env-clean

# Activate environment for development
conda activate sirnaforge-dev

# Deactivate when done
conda deactivate

The conda environment includes all bioinformatics tools (BWA-MEM2, SAMtools, ViennaRNA, etc.) plus Python development dependencies, providing a complete local development setup without Docker.

Quality Assurance & Testing

🧪 Comprehensive Test Suite:

# Run the full Python test suite
make test

# Run with coverage reporting (HTML report in htmlcov/)
make test-cov

# Fast development testing (skips tests marked slow)
make test-fast

# Integration tests (includes external APIs)
uv run pytest tests/integration/ -v

# Pipeline tests (requires Docker/Nextflow)
uv run pytest tests/pipeline/ -v

# Specific test categories
uv run pytest tests/unit/test_design.py::test_scoring_algorithm -v

🔍 Code Quality Tools:

# Type checking with mypy (strict mode)
uv run mypy src/
# Result: Success: no issues found in 20 source files

# Code formatting with black
uv run black src tests
make format

# Linting with ruff (fast Python linter)
uv run ruff check src tests
make lint

# All quality checks together
make lint  # Ruff lint, Ruff format check, and mypy

Available Dependency Groups

| Group | Purpose | Key Tools |
|---|---|---|
| dev | Core development (auto-installed) | pytest, black, ruff |
| test | Testing frameworks | pytest-cov, pytest-xdist |
| lint | Code quality | mypy, ruff, black |
| analysis | Data science workflows | jupyter, matplotlib, pandas |
| pipeline | Nextflow integration | workflow tools, containers |
| docs | Documentation generation | sphinx, sphinx-rtd-theme |

Code Quality Standards

  • Type Safety: Full mypy coverage with Pydantic models
  • Formatting: Black + Ruff for consistent style
  • Testing: Comprehensive pytest suite with >90% coverage
  • CI/CD: GitHub Actions with multi-Python testing
  • Security: Bandit + Safety dependency scanning

⚡ Performance & System Requirements

Performance Benchmarks

🧬 siRNA Design Performance:

  • Small genes (1-5 transcripts): ~2-5 seconds
  • Medium genes (5-20 transcripts): ~10-30 seconds
  • Large genes (20+ transcripts): ~1-2 minutes
  • Batch processing (10 genes): ~5-15 minutes

🔍 Off-target Analysis Performance:

  • Per candidate (single species): ~30-60 seconds
  • Multi-species (3 genomes): ~2-5 minutes per candidate
  • Batch analysis (50 candidates): ~1-3 hours (parallelized)

System Requirements

🔧 Minimum Requirements:

  • CPU: 2 cores, 2.0 GHz
  • RAM: 4 GB (8 GB recommended for off-target analysis)
  • Storage: 2 GB free space (+ 50 GB for genome indices)
  • Network: Internet connection for gene searches and genome downloads

⚡ Recommended Configuration:

  • CPU: 8+ cores, 3.0 GHz (for parallel Nextflow execution)
  • RAM: 16-32 GB (for large-scale off-target analysis)
  • Storage: SSD with 100+ GB (for genome indices and temporary files)
  • Network: High-bandwidth connection for S3 genome downloads

🐳 Docker Resource Allocation:

# Recommended Docker settings
docker run --cpus="4" --memory="8g" \
  -v $(pwd):/workspace -w /workspace \
  ghcr.io/austin-s-h/sirnaforge:latest \
  sirnaforge workflow TP53 --genome-species human,rat,rhesus

🐳 Docker Usage

Pre-built Images

# Pull latest stable release
docker pull ghcr.io/austin-s-h/sirnaforge:latest

# Run complete workflow
docker run --rm -v $(pwd):/data \
  ghcr.io/austin-s-h/sirnaforge:latest \
  sirnaforge workflow TP53 --output-dir /data/results

# Interactive development session
docker run -it --rm -v $(pwd):/data \
  ghcr.io/austin-s-h/sirnaforge:latest bash

Building Custom Images

# Build production image
make docker

# Build with specific Python version
docker build --build-arg PYTHON_VERSION=3.11 \
  -f docker/Dockerfile -t sirnaforge:py311 .

The Docker image uses micromamba with docker/environment-nextflow.yml for consistent bioinformatics tool installations across all environments.

🧪 Testing & Quality Assurance

Running Tests

| Command | Under the hood | When to use | Notes |
|---|---|---|---|
| make test-local-python | uv run --group dev pytest -v -m "local_python" | Fastest feedback loop during development | Python-only markers, no Docker/Nextflow required |
| make test-unit | uv run --group dev pytest -v -m "unit" | Validate core algorithms | Includes ~30 tests (~30s) |
| make test-fast | uv run --group dev pytest -v -m "not slow" | Pre-commit or PR checks | Skips slow/integration markers |
| make test | uv run --group dev pytest -v | Full Python suite | May include slow and docker-marked tests; expect >60s |
| make test-ci | uv run --group dev pytest -m "ci" --junitxml=pytest-report.xml --cov=sirnaforge --cov-report=term-missing --cov-report=xml:coverage.xml -v | CI pipelines needing artifacts | Produces coverage + JUnit reports |
| make test-cov | uv run --group dev pytest --cov=sirnaforge --cov-report=html --cov-report=term-missing | Local coverage runs | Outputs HTML coverage in htmlcov/ |
| make lint | Ruff lint + Ruff format check + MyPy | Quick code-quality gate | No automatic fixes |
| make check | make lint-fix + make test-fast | Pre-commit parity | Applies Ruff fixes before running fast pytest subset |

Docker-powered tiers share the same pytest markers but execute inside the published image:

| Command | Container invocation | Resource profile | Purpose |
|---|---|---|---|
| make docker-test-smoke | docker run … python -m pytest -q -n 1 -m 'docker and smoke' | 0.5 CPU / 256 MB | Minimal CI smoke (MUST PASS) |
| make docker-test-fast | docker run … python -m pytest -q -n 1 -m 'docker and not slow' | 1 CPU / 2 GB | Dev-friendly docker coverage |
| make docker-test | docker run … python -m pytest -v -n 1 -m 'docker and (docker_integration or (not smoke))' | 2 CPUs / 4 GB | Standard docker regression |
| make docker-test-full | docker run … uv run --group dev pytest -v -n 2 | 4 CPUs / 8 GB | Release-grade validation |

ℹ️ Run make install-dev once to install development dependencies and pre-commit hooks before using these targets. The full matrix of commands, filters, and expected runtimes lives in docs/testing_guide.md.

Docker smoke snapshot

For a quick environment sanity check, make docker-test-smoke exercises the published container image with toy data in ~40 seconds (0.5 CPU, 256 MB). A passing run prints 9 passed with no failures; any remaining pytest collection warnings are tracked in the test suite and should disappear once the dataclass fix in this branch lands.

Fast CI/CD with Toy Data ⚡

siRNAforge includes a fast CI/CD workflow designed for quick feedback with minimal resources:

  • ⚡ Ultra-fast execution: < 15 minutes total
  • 🪶 Minimal resources: 256MB memory, 0.5 CPU cores
  • 🧸 Toy data: < 500 bytes of test sequences
  • 🔥 Smoke tests: Essential functionality validation

# Run the smoke tests used by the fast CI/CD workflow locally
pytest -m "smoke" --tb=short

# Use toy data for quick validation
ls tests/unit/data/toy_*.fasta

# Fast workflow vs comprehensive workflow
# Fast:    15 min,  256MB RAM, toy data
# Full:    60 min,    8GB RAM, real datasets

See docs/ci-cd-fast.md for detailed documentation.

Test Categories

  • Unit Tests - Core algorithm validation
  • Integration Tests - Component interaction testing
  • Pipeline Tests - Nextflow workflow validation
  • Docker Tests - Container functionality testing

📚 Documentation

Local Documentation Building

# Install documentation dependencies
uv sync --group docs

# Build HTML documentation
make docs

# Generate CLI reference
make docs-cli

# Live-reload docs during editing
make docs-dev

Generated Documentation

  • docs/_build/html/ - Complete Sphinx HTML documentation (via make docs)
  • docs/CLI_REFERENCE.md - Auto-generated CLI help (via make docs-cli)
  • docs/api_reference.rst - Python API reference source
  • docs/modification_annotation_spec.md - Chemical modifications metadata specification

📖 See docs/getting_started.md for detailed tutorials and docs/deployment.md for deployment guides.

Chemical Modifications Metadata

siRNAforge supports structured annotation of chemical modifications, overhangs, and provenance information for siRNA sequences. This enables systematic tracking of modifications like 2'-O-methyl, 2'-fluoro, and phosphorothioate linkages.

Quick Example:

# Create metadata JSON file
cat > metadata.json << 'EOF'
{
  "patisiran_ttr_guide": {
    "id": "patisiran_ttr_guide",
    "sequence": "AUGGAAUACUCUUGGUUAC",
    "target_gene": "TTR",
    "strand_role": "guide",
    "overhang": "dTdT",
    "chem_mods": [
      {
        "type": "2OMe",
        "positions": [1, 4, 6, 11, 13, 16, 19]
      }
    ],
    "provenance": {
      "source_type": "patent",
      "identifier": "US10060921B2",
      "url": "https://patents.google.com/patent/US10060921B2"
    },
    "confirmation_status": "confirmed"
  }
}
EOF

# Annotate FASTA with metadata
sirnaforge sequences annotate sequences.fasta metadata.json -o annotated.fasta

# View sequences with metadata
sirnaforge sequences show annotated.fasta
sirnaforge sequences show annotated.fasta --format json

Features:

  • 🧪 Chemical Modifications - Annotate 2'-O-methyl, 2'-fluoro, PS linkages, LNA, etc.
  • 📍 Position Tracking - 1-based position numbering for each modification
  • 🔗 Overhang Support - DNA (dTdT) or RNA (UU) overhangs
  • 📚 Provenance - Track sources (patents, publications, clinical trials)
  • ✅ Confirmation Status - Mark validated vs. predicted sequences
  • 🗂️ FASTA Headers - Standardized key-value encoding in headers
  • 📄 JSON Sidecars - Separate metadata files for easy curation

Common Modification Types:

  • 2OMe - 2'-O-methyl (nuclease resistance)
  • 2F - 2'-fluoro (enhanced stability)
  • PS - Phosphorothioate (nuclease resistance)
  • LNA - Locked Nucleic Acid (enhanced binding)
  • MOE - 2'-O-methoxyethyl (improved pharmacokinetics)
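
Because modification positions are 1-based, a quick range check on a JSON sidecar can catch off-by-one mistakes before annotation. The sketch below is a standalone illustration using a trimmed copy of the metadata example above; `position_errors` is a hypothetical helper, and whether siRNAforge performs an equivalent check internally is not stated here.

```python
import json
from typing import List

metadata = json.loads("""
{
  "patisiran_ttr_guide": {
    "id": "patisiran_ttr_guide",
    "sequence": "AUGGAAUACUCUUGGUUAC",
    "chem_mods": [
      {"type": "2OMe", "positions": [1, 4, 6, 11, 13, 16, 19]}
    ]
  }
}
""")


def position_errors(entry: dict) -> List[str]:
    """Report modification positions outside the 1-based range of the sequence."""
    length = len(entry["sequence"])
    errors = []
    for mod in entry.get("chem_mods", []):
        for pos in mod["positions"]:
            if not 1 <= pos <= length:
                errors.append(f"{mod['type']}: position {pos} outside 1..{length}")
    return errors


entry = metadata["patisiran_ttr_guide"]
print(position_errors(entry))  # no errors expected for the sample entry

# A deliberately broken copy: position 22 does not exist in a 19-nt sequence.
bad = dict(entry, chem_mods=[{"type": "2F", "positions": [3, 22]}])
print(position_errors(bad))
```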

Python API:

from sirnaforge.models.modifications import (
    StrandMetadata,
    ChemicalModification,
    Provenance,
    SourceType
)

# Create metadata
metadata = StrandMetadata(
    id="my_sirna_guide",
    sequence="AUCGAUCGAUCGAUCGAUCGA",
    overhang="dTdT",
    chem_mods=[
        ChemicalModification(type="2OMe", positions=[1, 4, 6, 11])
    ],
    provenance=Provenance(
        source_type=SourceType.PUBLICATION,
        identifier="PMID12345678"
    )
)

# Generate FASTA with metadata
from sirnaforge.models.modifications import SequenceRecord, StrandRole
record = SequenceRecord(
    target_gene="BRCA1",
    strand_role=StrandRole.GUIDE,
    metadata=metadata
)
print(record.to_fasta())

📖 See docs/modification_annotation_spec.md for complete specification, API reference, and examples.

🤝 Contributing

We welcome contributions to siRNAforge! Here's how to get started:

Development Setup

  1. Fork the repository on GitHub
  2. Clone your fork: git clone https://github.com/yourusername/sirnaforge
  3. Set up the development environment: make install-dev
  4. Create a feature branch: git checkout -b feature/amazing-feature

Development Workflow

# Make your changes
# ...

# Ensure code quality
make lint               # Check code style and types
make format             # Auto-format code
make test-local-python  # Fast sanity suite
make check              # Auto-fix lint + fast pytest

# Commit and push
git add .
git commit -m 'feat: add amazing feature'
git push origin feature/amazing-feature

Contribution Guidelines

  • Code Style: Follow Black formatting and Ruff linting rules
  • Type Hints: All new code must include type annotations
  • Tests: Add tests for new functionality
  • Documentation: Update docstrings and documentation
  • Commit Messages: Use conventional commit format
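
For the commit message guideline, a small checker along these lines can catch subjects that do not follow the conventional commit shape `type(scope): description`. This is a sketch under assumptions: the list of accepted types follows common Conventional Commits usage and is not taken from this project's configuration, and `is_conventional` is a hypothetical helper, not part of siRNAforge.

```python
import re

# Commonly used Conventional Commits types. Which types this project
# actually accepts is an assumption.
COMMIT_TYPES = (
    "feat", "fix", "docs", "style", "refactor",
    "perf", "test", "build", "ci", "chore", "revert",
)

_types = "|".join(COMMIT_TYPES)
# type, optional (scope), optional '!' for breaking changes, then ': description'
PATTERN = re.compile(rf"^(?:{_types})(?:\([\w./-]+\))?!?: \S.*$")


def is_conventional(subject: str) -> bool:
    """Return True if a commit subject line matches the conventional shape."""
    return PATTERN.match(subject) is not None


for subject in ("feat: add amazing feature", "Add amazing feature"):
    print(f"{subject!r}: {is_conventional(subject)}")
```

A check like this could run from a commit-msg hook, though CONTRIBUTING.md is the place to confirm what the project actually enforces.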

Pull Request Process

  1. Ensure all tests pass and code is properly formatted
  2. Update documentation for any API changes
  3. Add entries to CHANGELOG.md for user-facing changes
  4. Create a pull request with a clear description

See CONTRIBUTING.md for detailed guidelines.

📄 License

This project is licensed under the MIT License - see the LICENSE file for details.

๐Ÿ™ Acknowledgments

siRNAforge builds upon excellent open-source tools and libraries:

  • ViennaRNA Package - RNA secondary structure prediction
  • BWA-MEM2 - Fast and accurate sequence alignment
  • Nextflow - Workflow management and containerization
  • BioPython - Python bioinformatics toolkit
  • Pydantic - Data validation and type safety
  • Modern Python Stack - uv, Typer, Rich for developer experience

Note: Much of the code in this repository was developed with assistance from AI agents, but all code has been reviewed, tested, and validated by human developers.
