AI-Powered Multi-Technology Genome Assembler with GPU Acceleration

StrandWeaver

AI & ML-Powered Multi-Technology Genome Assembler with GPU Acceleration

StrandWeaver is a next-generation genome assembly pipeline that applies machine learning to the hardest parts of genome assembly: graph path resolution, error correction, haplotype phasing, and structural variant detection. It combines technology-aware error correction, graph-based assembly with a haplotype-aware graph neural network, iterative polishing, and integrated SV calling into a single pipeline for multimodal sequencing data, including the latest Element data. The pipeline is GPU-accelerated (NVIDIA CUDA or Apple Silicon MPS) with CPU fallback. Google Colab notebooks are provided for custom model training.

StrandWeaver is deeply inspired by the incredible work implemented in MaSuRCA (1), Verkko (2), and Hifiasm (3). StrandWeaver adds:

  • GNN-based haplotype-aware path resolution that simplifies graph topology while strictly protecting biological variation (SNPs, indels, CNVs)
  • Unified multi-technology error correction architecture spanning ONT, PacBio HiFi, ultra-long reads, and short reads (Illumina, Element, Ultima, PacBio SBB)
  • ML-optimized k-mer selection for every assembly stage (graph construction, overlap, extension, polishing)
  • Ancient DNA damage repair with trained deamination models (C→T/G→A)
  • Assembly-time structural variant detection via XGBoost + LightGBM ensemble
  • Integrated QV estimation, polishing, and gap filling
  • One-command custom model training for any organism or sequencing technology

Pre-trained models ship with v0.2+ and load automatically. See the AI Model Training Guide for custom training.

🆕 What's New in v0.3

  • All 7 AI modules now ship with trained models optimized on real + synthetic data (HG002, CHM13) — download via strandweaver download
  • ErrorSmith correction optimized for 17 chemistries / platforms
  • SVScribe F1-macro improved from 0.557 → 0.957 with separate long-read-only and multi-technology (long-read, ultra-long, and Hi-C or other proximity-ligation data) XGBoost + LightGBM ensembles, per-class threshold tuning, and biology-informed Bayesian priors
  • Standalone strandweaver <step> commands for jumping directly to individual pipeline stages
  • Integrated QV estimation (qv), iterative polishing (polish), and gap filling (gap-fill)
  • 200× faster k-mer spectra extraction
  • PyPI installable (pip install strandweaver)
  • Training notebooks included for all model types

✨ Features

  • 🧬 Multi-Technology Support: Illumina HiSeq2500, PacBio Onso/Revio/Sequel II, Element Aviti/UltraQ, Ultima, ONT R9 (Guppy*, ligation/ultra-long, simplex), ONT R10 (Guppy*/Dorado, ligation/ultra-long, simplex/duplex)
  • 🔀 Hybrid Assembly: Combine any mix of platforms in a unified assembly graph — DBG for short reads, OLC for long reads, automatically selected
  • 🧬 Diploid-Aware: Protects SNP-level variation (>99.5% identity threshold), never collapses across haplotype boundaries, and refines iteratively with phasing context fed back to the ML models
  • 🏛️ Ancient DNA Mode: mapDamage2-inspired C→T/G→A deamination repair with configurable confidence thresholds
  • 🛡️ 80-Feature Edge Scoring (EdgeWarden): 26 static + 34 temporal + 20 expanded features; graceful fallback to static if alignment data unavailable
  • 🧠 GNN Path Resolution (PathGNN): GATv2Conv attention network for haplotype-aware graph simplification with strict variation protection
  • 🧵 Ultra-Long Read Routing (ThreadCompass): Multi-start pathfinding with confidence scoring and topology feedback
  • 🔀 Hi-C Scaffolding & Phasing (HaplotypeDetangler): Spectral clustering on Hi-C contact matrices for chromosome-scale haplotype separation
  • 🔍 Assembly-Time SV Detection (SVScribe): DEL/INS/INV/DUP/TRA calls from graph topology + UL spanning + Hi-C validation
  • 📊 Integrated QV, Polishing & Gap Filling: End-to-end finishing built into the pipeline
  • 📄 Rich Output: GFA graphs, BandageNG visualization, N50/L50/QV stats, VCF/JSON SV calls, phasing info, IGV/UCSC tracks, chromosome classification
  • 🔌 Modular: All AI modules can be disabled (--classical) for heuristic-only assembly
  • 🧪 Custom Training: CLI commands to generate synthetic training data and retrain all models for your organism

* We recommend against using Guppy-basecalled data, although models are provided. The error profile is very difficult to train on, so these models make binary calls only ("error" or "correct base") rather than classifying the type of error. The pipeline still outputs "best guess" corrected reads, but we strongly recommend either hard-masking the flagged errors or treating them with great care.


🎯 Model Performance

See the AI Model Training Guide for full per-class breakdowns, per-family ErrorSmith metrics, and training details.

Module | Task | Accuracy / R² | F1-macro | CV (5-fold)
🛡️ EdgeWarden | Edge quality scoring (per-technology) | 0.881 | 0.896 | 0.878 ± 0.002
🔧 ErrorSmith (v18) | Per-base error classification (11 families, 17 chemistries) | 0.949 | 0.950 | per-family (see Training Guide)
🧬 PathGNN | Graph-aware edge classification | 0.897 | 0.897 | 0.897 ± 0.001
🔀 DiploidAI | Haplotype phasing (26 features) | 0.862 | 0.862 | 0.858 ± 0.001
🧵 ThreadCompass | Ultra-long read routing | R²=0.997 | - | R²=0.997 ± 0.0003
🔍 SVScribe (v7.2) | Two-stage dual-tier SV detection | S1 F1=0.991 | S2 F1=0.957 | S2 min-class=0.876
🧠 K-Weaver (DBG) | De Bruijn graph k-mer selection | R²=0.863 | - | 0.863 ± 0.064
🧠 K-Weaver (UL Overlap) | Ultra-long overlap k-mer selection | R²=0.982 | - | 0.982 ± 0.020
🧠 K-Weaver (Extension) | Contig extension k-mer selection | R²=0.849 | - | 0.849 ± 0.074
🧠 K-Weaver (Polish) | Polishing k-mer selection | R²=0.881 | - | 0.881 ± 0.067

Highlights of Training Results

For full figure sets, see Training Doc.

[Figure: SV Call Matrix (multi-technology confusion matrix)]
[Figure: ErrorSmith Current Chemistry Training F1s]
[Figure: ErrorSmith Current Chemistry Substitution Precision and Recall]

🔧 Installation

Requirements

  • Python 3.9+
  • 8 GB RAM minimum (32+ GB recommended for large genomes)
  • Disk space: 50-100 GB for intermediate files (genome-dependent)

Dependencies

StrandWeaver uses a modular dependency system with three installation tiers:

Core Dependencies (Always Installed):

  • Bioinformatics: biopython>=1.79, pysam>=0.19.0
  • Numerical: numpy>=1.21.0, scipy>=1.9.0, pandas>=1.3.0
  • Graph Processing: networkx>=2.6.0
  • Hi-C/Phasing: scipy>=1.9.0, scikit-learn>=1.3.0
  • CLI/IO: click>=8.0.0, pyyaml>=6.0, tqdm>=4.62.0, h5py>=3.5.0
  • Performance: numba>=0.54.0, joblib>=1.1.0

AI/ML Dependencies (Optional - [ai] flag):

  • PyTorch: torch>=2.0.0 (CPU or GPU support)
  • Graph Neural Networks: pytorch-geometric>=2.3.0
  • Gradient Boosting: xgboost>=2.0.0

These are required for:

  • Loading pre-trained models (shipped with v0.2+)
  • Custom model training
  • GPU-accelerated assembly (if CUDA available)

Development Dependencies (Optional - [dev] flag):

  • Testing: pytest>=7.0.0, pytest-cov>=4.0.0
  • Visualization: matplotlib>=3.4.0, seaborn>=0.11.0
  • Documentation: Development tools for contributors

GPU Support:

  • CUDA 11.8+ for NVIDIA GPUs (automatic via PyTorch)
  • MPS backend for Apple Silicon (automatic in macOS 12.3+)
  • CPU fallback if no GPU detected
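
The fallback order above can be sketched in a few lines of Python. This is only an illustration of the CUDA → MPS → CPU order, not StrandWeaver's own device-selection code; `pick_device` is a hypothetical helper name, and it returns "cpu" when PyTorch is not installed.

```python
def pick_device() -> str:
    """Return 'cuda', 'mps', or 'cpu', following the fallback order above."""
    try:
        import torch
    except ImportError:
        # Without the optional AI/ML dependencies there is no GPU backend.
        return "cpu"
    if torch.cuda.is_available():
        return "cuda"
    # torch.backends.mps is missing in older PyTorch builds, so look it up safely.
    mps = getattr(torch.backends, "mps", None)
    if mps is not None and mps.is_available():
        return "mps"
    return "cpu"


if __name__ == "__main__":
    print(f"Selected device: {pick_device()}")
```

Running the snippet before an assembly is a quick way to see which backend PyTorch would offer on the current machine.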

Platform Compatibility:

  • ✅ Linux (x86_64, ARM64)
  • ✅ macOS (Intel, Apple Silicon with MPS acceleration)
  • ✅ Windows (WSL2 recommended)

Install from GitHub (Recommended)

StrandWeaver has many dependencies, especially with the AI/ML training extras, so installing into a virtual environment is highly recommended. All testing was performed in Python venvs, but Conda should work equally well.

# Basic installation
pip install git+https://github.com/pgrady1322/strandweaver.git

# Recommended: Complete installation with all dependencies
pip install "git+https://github.com/pgrady1322/strandweaver.git#egg=strandweaver[all]"

# With AI/training dependencies (PyTorch, XGBoost)
pip install "git+https://github.com/pgrady1322/strandweaver.git#egg=strandweaver[ai]"

# With development and testing dependencies
pip install "git+https://github.com/pgrady1322/strandweaver.git#egg=strandweaver[dev]"

🚀 Quick Start

StrandWeaver offers two execution modes:

Mode | Usage | Best For
Direct | strandweaver <command> [options] | Small/medium datasets, local workstation, testing
Nextflow | strandweaver <command> [options] --nextflow | Large datasets, HPC clusters, parallel processing

Python CLI Mode

Basic Long Read Assembly Example (PacBio)

strandweaver pipeline \
  --hifi-long-reads hifi_reads.fastq \
  -o assembly_output/ \
  -t 8

Hybrid Assembly with Multiple Ultra-Long Read Types

Note: The --ont-ul flag is used for path-finding reads. Any platform of long reads can be provided, but shorter long reads will degrade the assembly. The --ont-ul name is retained for clarity / comparison with other assemblers.

strandweaver pipeline \
  --hifi-long-reads hifi_reads.fastq \
  --ont-long-reads ont_reads.fastq.gz \
  --ont-ul ultralong_reads.fastq \
  -o assembly_output/ \
  -t 16

Mixed Technology Assembly with Hi-C

Note: Any proximity-ligation platform can be provided. StrandWeaver optimizes for Hi-C and Omni-C just as well as for Pore-C and CiFi.

strandweaver pipeline \
  --hifi-long-reads hifi_reads.fastq \
  --ont-ul ultralong_reads.fastq \
  --hic-r1 hic_R1.fastq --hic-r2 hic_R2.fastq \
  -o assembly_output/ \
  -t 16

Nextflow Mode (HPC/Cloud)

Local Execution

nextflow run strandweaver/nextflow/main.nf \
  --hifi hifi_reads.fastq \
  --ont ont_reads.fastq \
  --hic_r1 hic_R1.fastq \
  --hic_r2 hic_R2.fastq \
  --outdir results/ \
  -profile local

SLURM Cluster with Singularity

nextflow run strandweaver/nextflow/main.nf \
  --hifi hifi_reads.fastq \
  --ont ont_reads.fastq \
  --ont_ul ultralong_reads.fastq \
  --hic_r1 hic_R1.fastq \
  --hic_r2 hic_R2.fastq \
  --outdir results/ \
  -profile slurm,singularity \
  -resume

Huge Genome Mode (Plants, Insects) (parallel k-mer extraction)

nextflow run strandweaver/nextflow/main.nf \
  --hifi hifi_reads.fastq \
  --huge \
  --outdir results/ \
  -profile slurm

See nextflow/README.md for complete Nextflow documentation.

Nextflow Profiles

Profile | Description | Use Case
local | Direct execution on local machine | Testing, small datasets
docker | Docker containerization | Reproducibility, dependency isolation
singularity | Singularity containers | HPC clusters (no root required)
slurm | SLURM cluster scheduler | HPC parallel processing
test | Use synthetic E. coli data | Quick validation

Combine profiles with commas: -profile slurm,singularity

Error Correction

# Direct mode
strandweaver correct --hifi reads.fq.gz -o corrected/ -t 8

# Nextflow mode (automatic parallelization)
strandweaver correct --hifi reads.fq.gz -o corrected/ \
  --nextflow --nf-profile slurm --correction-batch-size 100000

ErrorSmith Chemistry Designation

ErrorSmith uses chemistry-aware models trained on 17 sequencing platform / chemistry combinations. Specify the chemistry by appending :chemistry_id to the reads path, or by passing the corresponding chemistry flag immediately after the data flag. Both forms are equivalent — use whichever is clearer for your command. If your exact kit / flow cell combination is not listed, pick the closest available model. When no chemistry is specified the default is applied automatically.

# Colon syntax (preferred — keeps reads + chemistry together)
--flag reads.fastq:chemistry_id

# Flag syntax (equivalent)
--flag reads.fastq --flag-chemistry chemistry_id

Read flag | Chemistry flag | Available chemistries | Default
--hifi-long-reads | --hifi-chemistry | pacbio_hifi_sequel2, pacbio_hifi_revio, pacbio_hifi_vega | pacbio_hifi_revio
--ont-long-reads | --ont-chemistry | ont_lsk110_r941, ont_lsk114_r1041, ont_r1041_duplex, ont_herro_corrected | ont_lsk114_r1041
--ont-ul | --ont-ul-chemistry | ont_ulk001_r941, ont_ulk114_r1041, ont_ulk114_r1041_hiacc, ont_ulk114_r1041_dorado | ont_ulk114_r1041
-r1 / -r2 | --short-read-chemistry | illumina_hiseq2500, pacbio_onso, element_aviti, element_aviti_lng, element_ultraq, ultima_ug100 | illumina_hiseq2500

Short reads: the colon suffix can go on either -r1 or -r2 — StrandWeaver applies the same chemistry model to the pair. Alternatively, use the standalone --short-read-chemistry flag.

Multi-chemistry support. When an assembly combines reads from more than one chemistry within the same read type (e.g. ONT R9 + R10 datasets, or multiple ultra-long preps), pass the read flag once per dataset:

# Two ONT chemistries — colon syntax
--ont-long-reads r9.fastq:ont_lsk110_r941 \
--ont-long-reads r10.fastq:ont_lsk114_r1041

# Same thing — flag syntax
--ont-long-reads r9.fastq --ont-chemistry ont_lsk110_r941 \
--ont-long-reads r10.fastq --ont-chemistry ont_lsk114_r1041

# Two ultra-long preps with different basecallers
--ont-ul ul_standard.fastq:ont_ulk114_r1041 \
--ont-ul ul_dorado.fastq:ont_ulk114_r1041_dorado

ErrorSmith applies the corresponding chemistry-specific error model to each read set independently, then merges the corrected outputs before downstream graph construction.
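
To make the colon syntax concrete, here is a minimal Python sketch of how a `path:chemistry_id` argument could be split. It is not StrandWeaver's actual parser: `split_reads_spec` and `KNOWN_CHEMISTRIES` are hypothetical names, and the IDs are simply copied from the chemistry tables in this section. Checking the suffix against the known IDs means a path that happens to contain a colon is left alone.

```python
from typing import Tuple

# Chemistry IDs copied from the tables in this section (17 in total).
KNOWN_CHEMISTRIES = {
    "pacbio_hifi_sequel2", "pacbio_hifi_revio", "pacbio_hifi_vega",
    "ont_lsk110_r941", "ont_lsk114_r1041", "ont_r1041_duplex",
    "ont_herro_corrected",
    "ont_ulk001_r941", "ont_ulk114_r1041", "ont_ulk114_r1041_hiacc",
    "ont_ulk114_r1041_dorado",
    "illumina_hiseq2500", "pacbio_onso", "element_aviti",
    "element_aviti_lng", "element_ultraq", "ultima_ug100",
}


def split_reads_spec(spec: str, default: str) -> Tuple[str, str]:
    """Split 'path:chemistry_id' into (path, chemistry).

    Falls back to `default` when no recognised chemistry suffix is present.
    """
    path, sep, suffix = spec.rpartition(":")
    if sep and suffix in KNOWN_CHEMISTRIES:
        return path, suffix
    # No recognised suffix: treat the whole string as the reads path.
    return spec, default


if __name__ == "__main__":
    print(split_reads_spec("r9.fastq:ont_lsk110_r941", "ont_lsk114_r1041"))
    print(split_reads_spec("r10.fastq", "ont_lsk114_r1041"))
```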

Chemistry → platform mapping reference
Chemistry ID | Platform | Read Type | Kit / Flow Cell
pacbio_hifi_sequel2 | PacBio Sequel II | HiFi long | CCS
pacbio_hifi_revio | PacBio Revio | HiFi long | CCS
pacbio_hifi_vega | PacBio Vega | HiFi long | CCS
ont_lsk110_r941 | ONT R9.4.1 | Ligation long | LSK110
ont_lsk114_r1041 | ONT R10.4.1 | Ligation long | LSK114
ont_r1041_duplex | ONT R10.4.1 | Duplex long | LSK114 duplex
ont_herro_corrected | ONT (Herro-corrected) | Long (corrected) | Herro error-corrected reads
ont_ulk001_r941 | ONT R9.4.1 | Ultra-long | ULK001
ont_ulk114_r1041 | ONT R10.4.1 | Ultra-long | ULK114
ont_ulk114_r1041_hiacc | ONT R10.4.1 | Ultra-long (HiAcc) | ULK114, high-accuracy mode
ont_ulk114_r1041_dorado | ONT R10.4.1 | Ultra-long (Dorado) | ULK114, Dorado basecaller
illumina_hiseq2500 | Illumina HiSeq 2500 | Short | TruSeq
pacbio_onso | PacBio Onso | Short | SBB
element_aviti | Element Aviti | Short | AVITI chemistry
element_aviti_lng | Element Aviti | Long | AVITI long-read chemistry
element_ultraq | Element UltraQ | Short | UltraQ chemistry
ultima_ug100 | Ultima Genomics UG100 | Short | UG100 flow chemistry

# Example: Revio HiFi + ONT duplex + Dorado ultra-long (colon syntax)
strandweaver pipeline \
  --hifi-long-reads hifi.fastq:pacbio_hifi_revio \
  --ont-long-reads duplex.bam:ont_r1041_duplex \
  --ont-ul ultralong.bam:ont_ulk114_r1041_dorado \
  -o assembly_output/ -t 16

# Example: Element Aviti short reads + ONT ultra-long (colon on -r1)
strandweaver pipeline \
  -r1 aviti_R1.fastq:element_aviti -r2 aviti_R2.fastq \
  --ont-ul ultralong.fastq \
  -o assembly_output/ -t 16

# Same short-read example using the flag syntax instead
strandweaver pipeline \
  -r1 aviti_R1.fastq -r2 aviti_R2.fastq --short-read-chemistry element_aviti \
  --ont-ul ultralong.fastq \
  -o assembly_output/ -t 16

# Example: mixed ONT chemistries (R9 + R10 reads in one assembly)
strandweaver pipeline \
  --ont-long-reads r9_reads.fastq:ont_lsk110_r941 \
  --ont-long-reads r10_reads.fastq:ont_lsk114_r1041 \
  --hifi-long-reads hifi.fastq \
  -o assembly_output/ -t 16

The profile and batch-correct commands accept a single --chemistry flag with any of the 17 values above.

Individual Processing Commands

StrandWeaver provides standalone commands for each processing stage, for cases where you only need corrected reads or read error profiles. It also supports mapping reads to GFA graphs and calling SVs on GFA graphs. Each command supports both direct and Nextflow execution.

Error Correction

# HiFi + ONT correction with chemistry specified (colon syntax)
strandweaver correct \
  --hifi-long-reads hifi.fastq:pacbio_hifi_revio \
  --ont-long-reads ont.fastq:ont_lsk114_r1041 \
  -o corrected/ -t 16

# Mixed ONT chemistries with Nextflow parallelization (flag syntax)
strandweaver correct \
  --ont-long-reads r9.fastq --ont-chemistry ont_lsk110_r941 \
  --ont-long-reads r10.fastq --ont-chemistry ont_lsk114_r1041 \
  -o corrected/ \
  --nextflow --nf-profile slurm --correction-batch-size 100000

K-mer Extraction

# Direct mode
strandweaver extract-kmers --hifi reads.fq.gz -k 31 -o kmers.pkl -t 8

# Nextflow mode (huge genomes >10GB)
strandweaver extract-kmers --hifi reads.fq.gz -k 31 -o kmers.pkl \
  --nextflow --nf-profile slurm --kmer-batch-size 2000000

Edge Scoring

# Direct mode
strandweaver nf-score-edges -e edges.json -a aligns.bam -o scored.json -t 8

# Nextflow mode (large graphs)
strandweaver nf-score-edges -e edges.json -a aligns.bam -o scored.json \
  --nextflow --nf-profile slurm --edge-batch-size 10000

Ultra-Long Read Mapping

# Direct mode
strandweaver map-ul -u ul_reads.fq.gz -g graph.gfa -o aligns.paf --use-gpu

# Nextflow mode
strandweaver map-ul -u ul_reads.fq.gz -g graph.gfa -o aligns.paf \
  --nextflow --nf-profile slurm --use-gpu --ul-batch-size 100

Hi-C Alignment

# Direct mode
strandweaver align-hic --hic-r1 R1.fq.gz --hic-r2 R2.fq.gz \
  -g graph.gfa -o aligns.bam -t 8

# Nextflow mode (large Hi-C datasets)
strandweaver align-hic --hic-r1 R1.fq.gz --hic-r2 R2.fq.gz \
  -g graph.gfa -o aligns.bam \
  --nextflow --nf-profile slurm --hic-batch-size 500000

Structural Variant Detection

# Direct mode
strandweaver nf-detect-svs -g graph.gfa -o variants.vcf -t 8

# Nextflow mode (large graphs)
strandweaver nf-detect-svs -g graph.gfa -o variants.vcf \
  --nextflow --nf-profile slurm --sv-batch-size 1000

Performance Guidelines

When to use Direct mode: Dataset < 10GB, local workstation with 8+ cores, testing and debugging.

When to use Nextflow mode: Dataset > 10GB, HPC cluster available, need resume capability, want automatic parallelization.

Command | Direct (1 node) | Nextflow (20 nodes) | Speedup
correct | 20 hours | 2 hours | 10×
extract-kmers | 8 hours | 1.5 hours | ≈5.3×
nf-score-edges | 8 hours | 1.5 hours | ≈5.3×
map-ul | 6 hours | 1 hour | 6×
align-hic | 10 hours | 1.5 hours | ≈6.7×
nf-detect-svs | 4 hours | 1 hour | 4×
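
Speedup here is simply direct wall-clock time divided by Nextflow wall-clock time. The short sketch below recomputes it from the timings listed in the table and adds parallel efficiency (speedup divided by node count) as an extra, derived figure that the table itself does not report. The function names are illustrative.

```python
# Wall-clock hours from the table above: (direct on 1 node, Nextflow on 20 nodes).
TIMINGS = {
    "correct": (20.0, 2.0),
    "extract-kmers": (8.0, 1.5),
    "nf-score-edges": (8.0, 1.5),
    "map-ul": (6.0, 1.0),
    "align-hic": (10.0, 1.5),
    "nf-detect-svs": (4.0, 1.0),
}
NODES = 20


def speedup(direct_hours: float, nextflow_hours: float) -> float:
    """Ratio of direct runtime to Nextflow runtime."""
    return direct_hours / nextflow_hours


def parallel_efficiency(direct_hours: float, nextflow_hours: float, nodes: int) -> float:
    """Fraction of ideal linear scaling achieved (1.0 would be perfect)."""
    return speedup(direct_hours, nextflow_hours) / nodes


if __name__ == "__main__":
    for command, (direct, nextflow) in TIMINGS.items():
        s = speedup(direct, nextflow)
        e = parallel_efficiency(direct, nextflow, NODES)
        print(f"{command:15s} speedup {s:4.1f}x  efficiency {e:.0%}")
```

Efficiency well below 100% is expected for stages with serial setup or merge steps, which is one reason Direct mode remains the better choice for small datasets.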

🎯 Use Cases

Machine-Learning-Tuned Genome Assembly with SV Calls

Combine ONT, HiFi, ultra-long reads, and Hi-C for chromosome-scale phased assemblies:

strandweaver pipeline \
  --ont-long-reads ont.fastq \
  --hifi-long-reads hifi.fastq \
  --ont-ul ultralong.fastq \
  --hic-r1 hic_R1.fastq --hic-r2 hic_R2.fastq \
  --use-ai \
  -o genome_assembly/ \
  -t 32

Ancient DNA Assembly

Optimize for deamination damage with specialized error correction:

strandweaver pipeline \
  -r1 ancient_reads.fastq --technology1 ancient \
  -o ancient_assembly/ \
  -t 16

Note that the assembly can also be run WITHOUT damage awareness features for comparison.
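
For orientation, the deamination signature that ancient DNA mode targets can be measured with a few lines of Python. In typical double-stranded library preparations, C→T mismatches are enriched at the 5' ends of reads and G→A mismatches at the 3' ends. The sketch below is not StrandWeaver's trained damage model or mapDamage2; it assumes reads are already aligned to a reference without gaps (each pair is an equal-length read/reference string), and all function names are hypothetical.

```python
from typing import List, Sequence, Tuple

AlignedPair = Tuple[str, str]  # (read, reference), gap-free and equal length


def _mismatch_rate(pairs: Sequence[AlignedPair], ref_base: str,
                   read_base: str, n_pos: int) -> List[float]:
    """Per-position fraction of `ref_base` sites that were read as `read_base`."""
    sites = [0] * n_pos
    hits = [0] * n_pos
    for read, ref in pairs:
        for i in range(min(n_pos, len(read), len(ref))):
            if ref[i] == ref_base:
                sites[i] += 1
                if read[i] == read_base:
                    hits[i] += 1
    return [h / s if s else 0.0 for h, s in zip(hits, sites)]


def ct_rate_5prime(pairs: Sequence[AlignedPair], n_pos: int = 5) -> List[float]:
    """C->T rate at each of the first n_pos positions from the 5' end."""
    return _mismatch_rate(pairs, "C", "T", n_pos)


def ga_rate_3prime(pairs: Sequence[AlignedPair], n_pos: int = 5) -> List[float]:
    """G->A rate at each of the last n_pos positions (index 0 = final base)."""
    reversed_pairs = [(read[::-1], ref[::-1]) for read, ref in pairs]
    return _mismatch_rate(reversed_pairs, "G", "A", n_pos)


if __name__ == "__main__":
    toy = [("TACGA", "CACGG"), ("CACGG", "CACGG")]
    print("C->T from 5' end:", ct_rate_5prime(toy))
    print("G->A from 3' end:", ga_rate_3prime(toy))
```

A profile in which both rates decay quickly away from the read ends is the usual sign of authentic damage, and comparing runs with and without damage awareness is most informative on such data.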

SV-Rich Genome Analysis

Detect structural variants during assembly for cancer or population genomics:

strandweaver pipeline \
  --hifi-long-reads tumor.fastq \
  --hic-r1 hic_R1.fastq --hic-r2 hic_R2.fastq \
  --min-sv-size 30 \
  -o tumor_assembly/ \
  -t 24

Highly Heterozygous Diploid Assembly

Maintain haplotype separation for F1 hybrids or outcrossing species:

strandweaver pipeline \
  --hifi-long-reads hifi.fastq \
  --hic-r1 hic_R1.fastq --hic-r2 hic_R2.fastq \
  --ploidy diploid \
  --edge-filter-mode strict \
  -o diploid_assembly/ \
  -t 32

🔬 Pipeline Reference

Preprocessing

  1. Classify: Auto-detect sequencing technologies from FASTQ headers (supports ONT chemistry detection with LongBow)
  2. KWeaver: ML-based k-mer optimization with rule-based fallback for dynamic k-mer selection
  3. Profile: Error pattern profiling (substitutions, indels, homopolymers) with visualization
  4. Correct: Technology-aware read correction (ONTCorrector, PacBioCorrector)

Core Assembly

  1. Graph Building: Graph construction (type of graph based on read type) from reads with streaming architecture
  2. EdgeWarden: AI-powered graph edge filtering with 80-feature scoring
  3. PathWeaver: GNN-based haplotype-aware path resolution with variation protection
  4. StringGraph: Ultra-long read overlay for long-range connections
  5. ThreadCompass: UL read routing optimization with trained models
  6. Hi-C Integration: Proximity ligation contact matrix construction and edge addition
  7. HaplotypeDetangler: Hi-C-augmented phasing via spectral clustering
  8. Iteration: 3+ refinement cycles with phasing-aware filtering
  9. SVScribe: Graph-based structural variant detection (DEL, INS, INV, DUP, TRA)
  10. Iterate or Finalize: Extract contigs and scaffolds with comprehensive statistics, or pass the feature-scored graph back into the pipeline for iteration 2+.
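
The spectral clustering idea behind the HaplotypeDetangler step can be illustrated for the simplest two-group case. The sketch below is a toy, pure-Python version under stated assumptions: it takes a small symmetric contact matrix, builds the graph Laplacian, and splits nodes by the sign of the Fiedler vector (found by shifted power iteration). It is not StrandWeaver's implementation, which the README does not describe in detail; a production version would more likely rely on an optimized eigensolver. `fiedler_partition` is a hypothetical name.

```python
from typing import List


def fiedler_partition(contacts: List[List[float]], iterations: int = 200) -> List[int]:
    """Split nodes into two groups by the sign of the Fiedler vector.

    `contacts` is a symmetric matrix of non-negative contact counts.
    """
    n = len(contacts)
    degree = [sum(row) for row in contacts]
    # Eigenvalues of the Laplacian L = D - W are bounded by 2 * max degree,
    # so power iteration on M = shift*I - L favours L's smallest eigenvalues.
    shift = 2.0 * max(degree)
    vec = [float(i) for i in range(n)]
    for _ in range(iterations):
        mean = sum(vec) / n
        vec = [v - mean for v in vec]  # project out the trivial constant vector
        nxt = []
        for i in range(n):
            lap_i = degree[i] * vec[i] - sum(contacts[i][j] * vec[j] for j in range(n))
            nxt.append(shift * vec[i] - lap_i)
        norm = sum(x * x for x in nxt) ** 0.5
        if norm == 0.0:
            break
        vec = [x / norm for x in nxt]
    return [0 if v < 0 else 1 for v in vec]


if __name__ == "__main__":
    # Two contigs per haplotype: strong contacts within, weak contacts between.
    toy = [
        [0, 10, 1, 1],
        [10, 0, 1, 1],
        [1, 1, 0, 10],
        [1, 1, 10, 0],
    ]
    print(fiedler_partition(toy))
```

Which group is labelled 0 or 1 is arbitrary; only the grouping itself is meaningful.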

Post-Assembly Analysis

  1. Misassembly Report: Putative misassembly detection using multi-signal evidence (EdgeWarden confidence, coverage discontinuities, UL read conflicts, Hi-C violations). Outputs TSV and BED reports for genome-browser visualization.
    • --misassembly-report / --no-misassembly-report — Enabled by default
    • --misassembly-min-confidence HIGH|MEDIUM|LOW — Minimum confidence to flag (default: MEDIUM)
    • --misassembly-format tsv,bed,json — Comma-separated output formats (default: tsv,bed)
  2. Chromosome Classification: Multi-tier scaffold classification to identify chromosomes vs. assembly artifacts (enabled with --id-chromosomes).
    • Tier 1 (always): Length, coverage, GC, connectivity, telomere detection
    • Tier 2 (always): Gene content analysis (ORF / BLAST / Augustus / BUSCO)
    • Tier 3 (--id-chromosomes-advanced): Hi-C self-contact patterns, synteny
    • Telomere flags: --telomere-sequence (default: TTAGGG), --telomere-min-units (default: 10), --telomere-search-depth (default: 5000 bp)
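
As an illustration of what the telomere flags control, the sketch below looks for a run of at least `min_units` tandem copies of the motif within `search_depth` bases of each scaffold end, using the documented defaults. It is an assumption about the general approach, not StrandWeaver's detection code. The motif is also accepted as its reverse complement because scaffold orientation is arbitrary, and `telomere_ends` is a hypothetical name.

```python
import re
from typing import Tuple

_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq: str) -> str:
    """Reverse-complement an uppercase A/C/G/T sequence."""
    return seq.translate(_COMPLEMENT)[::-1]


def telomere_ends(seq: str, motif: str = "TTAGGG", min_units: int = 10,
                  search_depth: int = 5000) -> Tuple[bool, bool]:
    """Return (left_end_has_telomere, right_end_has_telomere)."""
    seq = seq.upper()
    motif = motif.upper()
    # One alternative per strand, each requiring min_units tandem copies.
    variants = sorted({motif, reverse_complement(motif)})
    pattern = re.compile(
        "|".join(rf"(?:{re.escape(m)}){{{min_units},}}" for m in variants)
    )
    left = seq[:search_depth]
    right = seq[-search_depth:] if search_depth > 0 else ""
    return bool(pattern.search(left)), bool(pattern.search(right))


if __name__ == "__main__":
    scaffold = "CCCTAA" * 12 + "ACGT" * 2000 + "TTAGGG" * 12
    print(telomere_ends(scaffold))
```

A scaffold with telomeric repeats at both ends is a strong hint of a complete chromosome, which is why this signal sits in Tier 1.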

Output Generation

  1. GFA Export: Assembly graphs in GFA format with sequences
  2. BandageNG: Visualization files with coverage tracks and final StrandWeaver scores on a 0 to 1 scale (long/UL/Hi-C).
  3. Statistics: N50, L50, coverage metrics, variation protection counts
  4. SV Calls: Structural variants in VCF and JSON formats
  5. Phasing Info: Haplotype assignments and confidence scores
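
For readers unfamiliar with the contiguity statistics listed above, N50 and L50 can be computed from contig lengths as follows. This is the standard definition, shown as a small Python sketch rather than StrandWeaver's own code: N50 is the length of the contig at which the running total, taken from longest to shortest, first reaches half the assembly size, and L50 is the number of contigs needed to get there.

```python
from typing import Iterable, Tuple


def n50_l50(lengths: Iterable[int]) -> Tuple[int, int]:
    """Return (N50, L50) for a collection of contig lengths."""
    ordered = sorted(lengths, reverse=True)
    half = sum(ordered) / 2
    running = 0
    for count, length in enumerate(ordered, start=1):
        running += length
        if running >= half:
            return length, count
    return 0, 0  # empty input


if __name__ == "__main__":
    print(n50_l50([2, 3, 4, 5, 6]))
```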

Post-Assembly CLI Options Reference

Flag | Default | Description
--misassembly-report / --no-misassembly-report | Enabled | Generate misassembly report (TSV + BED)
--misassembly-min-confidence | MEDIUM | Minimum confidence for flags: HIGH, MEDIUM, LOW
--misassembly-format | tsv,bed | Comma-separated output formats: tsv, bed, json
--id-chromosomes | Off | Enable scaffold → chromosome classification (Tiers 1-2)
--id-chromosomes-advanced | Off | Add Hi-C pattern analysis & synteny (Tier 3)
--gene-detection-method | orf | Gene detection: orf (no deps), blast, augustus, busco
--blast-db | nr | BLAST database for gene detection
--telomere-sequence | TTAGGG | Telomere repeat motif. Alternatives: TTTAGGG (plants), TTAGG (insects)
--telomere-min-units | 10 | Minimum tandem repeats to call a telomere
--telomere-search-depth | 5000 | Base-pairs to search at each scaffold end

🤖 Custom Training

Pre-trained models load automatically. Custom training is only needed for organism-specific optimization (extreme repeat content, unusual ploidy, novel sequencing chemistries).

# 1. Generate training data (graph-only mode: ~7 min/genome, ~8.5 MB/genome)
strandweaver train generate-data \
  --genome-size 1000000 -n 200 \
  --graph-training --graph-only \
  -o training_data/

# 2. Train all models with cross-validation
strandweaver train run --data-dir training_data/ -o trained_models/

# 3. Assemble (custom models load automatically, or specify --model-dir)
strandweaver pipeline --hifi-long-reads reads.fastq.gz -o assembly/

See trained_models/TRAINING.md for the full training guide, Colab GPU notebooks, hyperparameter search spaces, and per-class performance breakdowns. See strandweaver/user_training/README.md for the synthetic genome generation parameter reference.

Output Files

output/
├── contigs.fasta                  # Primary assembly contigs
├── final_assembly.fasta           # Polished, length-filtered contigs
├── scaffolds.fasta                # Hi-C scaffolded sequences
├── assembly_graph.gfa             # Assembly graph (GFA format)
├── assembly_stats.json            # N50, L50, QV, coverage statistics
├── misassembly_report.tsv         # Putative misassemblies (tab-delimited)
├── misassembly_report.bed         # Misassemblies (genome browser BED)
├── chromosome_classification.json # Scaffold → chromosome classification
├── sv_calls.vcf                   # Structural variant calls
├── phasing_info.json              # Haplotype assignments
├── coverage_long.csv              # Long read coverage (BandageNG)
├── coverage_ul.csv                # Ultra-long coverage (BandageNG)
├── coverage_hic.csv               # Hi-C support (BandageNG)
├── kmer_predictions.json          # K-mer optimization results
├── error_profile_<tech>_<n>.json  # Per-technology error profiles
└── pipeline.log                   # Complete execution log
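
The QV value reported in assembly_stats.json is on the Phred scale, where QV = -10 · log10(per-base error rate). The README does not say how StrandWeaver estimates the error count itself, so the sketch below only shows the standard conversion between error rate and QV. The function names are illustrative.

```python
import math


def phred_qv(errors: int, total_bases: int) -> float:
    """Phred-scaled quality for `errors` wrong bases out of `total_bases`."""
    if total_bases <= 0:
        raise ValueError("total_bases must be positive")
    if errors == 0:
        return float("inf")  # no observed errors
    return -10.0 * math.log10(errors / total_bases)


def error_rate(qv: float) -> float:
    """Per-base error probability implied by a Phred QV."""
    return 10.0 ** (-qv / 10.0)


if __name__ == "__main__":
    print(f"1 error per 1,000 bases -> QV {phred_qv(1, 1000):.1f}")
    print(f"QV 40 -> error rate {error_rate(40):.0e}")
```

As a rule of thumb, each additional 10 QV points corresponds to a tenfold drop in error rate.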

Troubleshooting

Installation Issues

Problem: ModuleNotFoundError or import errors

# Solution: Reinstall with all dependencies
pip install --force-reinstall "git+https://github.com/pgrady1322/strandweaver.git#egg=strandweaver[all]"

Problem: Python version incompatibility

# Check Python version (requires 3.9+)
python3 --version

# Create conda environment with correct Python version
conda create -n strandweaver python=3.11
conda activate strandweaver
pip install "git+https://github.com/pgrady1322/strandweaver.git#egg=strandweaver[all]"

Problem: PyTorch/GPU issues

# For CUDA support, install PyTorch separately first
pip install torch --index-url https://download.pytorch.org/whl/cu118
pip install "git+https://github.com/pgrady1322/strandweaver.git#egg=strandweaver[all]"

# Verify GPU detection
python3 -c "import torch; print(f'GPU available: {torch.cuda.is_available()}')"

Assembly Quality Issues

Problem: Low N50 or fragmented assembly

  • Check error rate & coverage: Aim for 30×+ HiFi or 50×+ ONT
    strandweaver profile -i reads.fastq -o profile.json
    
  • Add ultra-long reads, or a subset of your LONGEST READS: Dramatically improves contiguity
    strandweaver pipeline --hifi-long-reads hifi.fastq --ont-ul ultralong.fastq -o improved/
    
  • Enable Hi-C scaffolding: For chromosome-scale assemblies
    strandweaver pipeline --hifi-long-reads hifi.fastq --hic-r1 hic_R1.fastq --hic-r2 hic_R2.fastq -o scaffolded/
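
If no dedicated ultra-long dataset is available, the suggestion above to pass a subset of your longest reads can be prepared with a short script. The sketch below keeps the N longest records from a FASTQ stream. It assumes simple four-line FASTQ records (no wrapped sequence lines) and uncompressed input; the helper names are hypothetical and not part of StrandWeaver.

```python
import heapq
import io
from typing import Iterator, List, TextIO, Tuple

FastqRecord = Tuple[str, str, str]  # (header, sequence, quality)


def read_fastq(handle: TextIO) -> Iterator[FastqRecord]:
    """Yield records from a four-line-per-record FASTQ stream."""
    while True:
        header = handle.readline().rstrip("\n")
        if not header:
            return  # end of file
        seq = handle.readline().rstrip("\n")
        handle.readline()  # '+' separator line
        qual = handle.readline().rstrip("\n")
        yield header, seq, qual


def longest_reads(handle: TextIO, n: int) -> List[FastqRecord]:
    """Return the n longest records, longest first, holding only n in memory."""
    return heapq.nlargest(n, read_fastq(handle), key=lambda rec: len(rec[1]))


def write_fastq(records: List[FastqRecord], handle: TextIO) -> None:
    """Write records back out in four-line FASTQ format."""
    for header, seq, qual in records:
        handle.write(f"{header}\n{seq}\n+\n{qual}\n")


if __name__ == "__main__":
    demo = "@a\nACGT\n+\nIIII\n@b\nACGTACGT\n+\nIIIIIIII\n@c\nAC\n+\nII\n"
    out = io.StringIO()
    write_fastq(longest_reads(io.StringIO(demo), 2), out)
    print(out.getvalue(), end="")
```

The resulting file can then be supplied through --ont-ul alongside the full read set.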
    

Problem: Collapsed heterozygous regions

# Use diploid mode with strict edge filtering to preserve variation
strandweaver pipeline \
  --hifi-long-reads reads.fastq \
  --ploidy diploid \
  --edge-filter-mode strict \
  -o diploid_assembly/

Problem: Assembly produces too many contigs (over-fragmented)

  • Reduce edge filtering stringency:
    strandweaver pipeline --hifi-long-reads reads.fastq --edge-filter-mode lenient -o assembly/
    
  • Increase k-mer size: For high-coverage, low-error data
    strandweaver pipeline --hifi-long-reads reads.fastq --kmer-size-assembly 51 -o assembly/
    

Problem: Assembly is chimeric or has misassemblies

  • Enable stricter filtering:
    strandweaver pipeline --hifi-long-reads reads.fastq --edge-filter-mode strict -o assembly/
    
  • Add Hi-C validation: Long-range contact validation prevents chimeras
    strandweaver pipeline --hifi-long-reads hifi.fastq --hic-r1 hic_R1.fastq --hic-r2 hic_R2.fastq -o validated/
    

Performance & Resource Issues

Problem: Out of memory (OOM) errors

# Limit memory usage
strandweaver pipeline \
  --hifi-long-reads reads.fastq \
  -o assembly/ \
  --memory-limit 16

# Or reduce graph coverage via sampling
strandweaver pipeline \
  --hifi-long-reads reads.fastq \
  -o assembly/ \
  --sample-size-graph 500000

Problem: Assembly is too slow

# Increase threads (use all available cores)
strandweaver pipeline --hifi-long-reads reads.fastq -t $(nproc) -o assembly/

# Disable AI features for faster heuristic-only assembly
strandweaver pipeline --hifi-long-reads reads.fastq --classical -o fast_assembly/

# Skip profiling step if reads are already well-characterized
strandweaver pipeline --hifi-long-reads reads.fastq --skip-profiling -o quick_assembly/

Problem: Disk space issues

# Export only FASTA (skip GFA graphs to save space)
strandweaver pipeline \
  --hifi-long-reads reads.fastq \
  -o assembly/ \
  --output-format fasta

# Use a separate output directory on a larger drive
strandweaver pipeline \
  --hifi-long-reads reads.fastq \
  -o /mnt/large_drive/assembly/

Input Data Issues

Problem: "Unsupported file format" error

# StrandWeaver accepts: FASTQ, FASTA, gzipped variants
# Check file format
file reads.fastq

# Convert BAM to FASTQ if needed
samtools bam2fq reads.bam > reads.fastq

# Decompress if needed
gunzip -c reads.fastq.gz > reads.fastq

Problem: Technology auto-detection fails

# Manually specify read technology
strandweaver pipeline \
  -r1 reads.fastq --technology1 ont \
  -o assembly/

# Supported: illumina, ancient, ont, ont_ultralong, pacbio

Problem: Ancient DNA damage not detected

# Explicitly specify ancient DNA technology
strandweaver pipeline \
  -r1 ancient_reads.fastq --technology1 ancient \
  -o ancient_assembly/

# Check damage profile first
strandweaver profile -i ancient_reads.fastq --technology ancient -o damage_profile.json

AI/ML Issues

Problem: AI features not working

# Check if AI dependencies installed
python3 -c "import torch, xgboost; print('AI dependencies OK')"

# Install AI dependencies if missing
pip install "git+https://github.com/pgrady1322/strandweaver.git#egg=strandweaver[ai]"

Problem: "No trained models found" warning

# Pre-trained models ship with v0.2+ and should load automatically.
# If you see this warning, the trained_models/ directory may be missing.

# Reinstall to restore pre-trained models:
pip install --force-reinstall "git+https://github.com/pgrady1322/strandweaver.git#egg=strandweaver[all]"

# Or train custom models:
strandweaver train generate-data --genome-size 5000000 --graph-training --graph-only -o training_data/
strandweaver train run --data-dir training_data/ -o trained_models/
# See trained_models/TRAINING.md for the complete training guide

Problem: GPU not being used

# Force GPU usage with explicit backend
strandweaver pipeline \
  --hifi-long-reads reads.fastq \
  -o assembly/ \
  --gpu-backend cuda

# On Apple Silicon
strandweaver pipeline \
  --hifi-long-reads reads.fastq \
  -o assembly/ \
  --gpu-backend mps

# Check GPU memory usage during assembly
watch -n 1 nvidia-smi

Output Issues

Problem: No structural variants detected

# Ensure SV detection is happening (enabled by default in the pipeline)
# SVs require ultra-long or Hi-C data for validation
strandweaver pipeline \
  --hifi-long-reads hifi.fastq \
  --ont-ul ultralong.fastq \
  --min-sv-size 30 \
  -o assembly/

Problem: Missing output files

# Check pipeline.log for errors
tail -100 output/pipeline.log

# Export all output formats (FASTA + GFA) with intermediate graphs
strandweaver pipeline \
  --hifi-long-reads reads.fastq \
  -o assembly/ \
  --output-format both \
  --export-intermediate-graphs

Problem: GFA file won't load in BandageNG

# Validate GFA format
grep "^S" assembly_graph.gfa | head -5

# Regenerate with both FASTA + GFA export
strandweaver pipeline \
  --hifi-long-reads reads.fastq \
  -o assembly/ \
  --output-format both
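
Before regenerating, it can help to check which lines are malformed. The sketch below is a minimal sanity check based on the GFA 1 layout, in which a segment line has the form `S<TAB>name<TAB>sequence` (the sequence may be `*`). It counts record types and reports segment lines with missing fields. It is not a full GFA validator and not part of StrandWeaver; `summarize_gfa` is a hypothetical name.

```python
from collections import Counter
from typing import Dict, List, Tuple


def summarize_gfa(text: str) -> Tuple[Dict[str, int], List[int]]:
    """Count GFA record types and list 1-based line numbers of bad S lines."""
    counts: Counter = Counter()
    bad_segments: List[int] = []
    for lineno, line in enumerate(text.splitlines(), start=1):
        if not line.strip():
            continue  # ignore blank lines
        fields = line.split("\t")
        counts[fields[0]] += 1
        # Segment lines need a name and a sequence (or '*') after the 'S'.
        if fields[0] == "S" and (len(fields) < 3 or not fields[1] or not fields[2]):
            bad_segments.append(lineno)
    return dict(counts), bad_segments


if __name__ == "__main__":
    demo = "H\tVN:Z:1.0\nS\tutg1\tACGT\nS\tutg2\nL\tutg1\t+\tutg2\t-\t0M\n"
    record_counts, bad = summarize_gfa(demo)
    print("records:", record_counts)
    print("malformed S lines:", bad)
```

A file with zero S records, or with space-separated rather than tab-separated fields, is a common reason for graph viewers to reject it.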

Common Error Messages

ValueError: Coverage too low for reliable assembly

  • Solution: Increase sequencing coverage (aim for 30×+ minimum)
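
Coverage can be estimated before running the pipeline as total sequenced bases divided by the expected genome size. The sketch below applies that arithmetic and compares it with the targets mentioned in this README (30× for HiFi, 50× for ONT). The genome size is something you supply, and the function names are illustrative rather than part of StrandWeaver.

```python
from typing import Iterable

# Coverage targets quoted in this README's troubleshooting advice.
TARGET_COVERAGE = {"hifi": 30.0, "ont": 50.0}


def estimated_coverage(read_lengths: Iterable[int], genome_size: int) -> float:
    """Total sequenced bases divided by the expected genome size."""
    if genome_size <= 0:
        raise ValueError("genome_size must be positive")
    return sum(read_lengths) / genome_size


def meets_target(read_lengths: Iterable[int], genome_size: int, technology: str) -> bool:
    """True if estimated coverage reaches the target for `technology`."""
    return estimated_coverage(read_lengths, genome_size) >= TARGET_COVERAGE[technology]


if __name__ == "__main__":
    lengths = [10_000] * 300  # 3 Mb of reads
    print(f"{estimated_coverage(lengths, 100_000):.1f}x")
    print("HiFi target met:", meets_target(lengths, 100_000, "hifi"))
    print("ONT target met:", meets_target(lengths, 100_000, "ont"))
```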

RuntimeError: Graph construction failed - no valid k-mer overlaps

  • Solution: Try different k-mer size with --kmer-size-assembly 31 or --kmer-size-assembly 51

MemoryError: Unable to allocate array

  • Solution: Use --memory-limit 16 to cap memory or reduce coverage with --sample-size-graph 500000

ImportError: cannot import name 'ThreadCompass'

  • Solution: Reinstall package with pip install --force-reinstall strandweaver

FileNotFoundError: [Errno 2] No such file or directory

  • Solution: Use absolute paths for input/output files or check current working directory

Getting Help

Check version and installation:

strandweaver --version
strandweaver --help

Enable verbose logging:

strandweaver pipeline \
  --hifi-long-reads reads.fastq \
  -o assembly/ \
  --log-level DEBUG

Report issues: GitHub Issues

Contact: patrickgsgrady@gmail.com


🗺️ Roadmap

v0.3 — Current Release

| # | Feature | Status | Notes |
|---|---------|--------|-------|
| 1 | K-Weaver Trained Models | ✅ Complete | 4 regression models (DBG, UL overlap, extension, polish) via Optuna + Colab GPU |
| 2 | ErrorSmith Trained Models (v18) | ✅ Complete | 11-family, 17-chemistry, 45-feature two-stage ensemble on HG002/CHM13 (acc=0.949, F1=0.950) |
| 3 | SVScribe Overhaul (v7.2) | ✅ Complete | Two-stage dual-tier XGBoost + LightGBM ensemble, F1-macro 0.557 → 0.957 (S2) |
| 4 | Standalone assemble Command | ✅ Complete | Same assembly engine as pipeline, usable independently |
| 5 | QV Estimation & Gap Filling | ✅ Complete | qv, polish, gap-fill CLI commands wired into _step_finish() |
| 6 | Technology-Specific Subsampling | ✅ Complete | --subsample-{hifi,ont,ont-ul,illumina,ancient} flags |
| 7 | 200× Faster K-mer Extraction | ✅ Complete | Streaming architecture with parallel batch processing |
| 8 | PyPI Packaging | ✅ Complete | pip install strandweaver |
| 9 | Training Notebooks | ✅ Complete | Colab notebooks for all model types |
| 10 | Validate: Reference Comparison | Planned | --reference flag accepted but comparison logic not yet implemented |
| 11 | BUSCO Integration | Planned | --busco-lineage present on validate but not wired |
| 12 | Decontamination Screening | Planned | --decontaminate flag stubbed |

v0.2 Release Notes

| # | Feature | Notes |
|---|---------|-------|
| 1 | Trained ML Models | XGBoost + GNN for EdgeWarden, PathGNN, DiploidAI, ThreadCompass, SVScribe |
| 2 | DiploidAI Integration | 26-feature phasing wired into HaplotypeDetangler |
| 3 | Bubble-Aware Local Phasing | Genomics audit G2 |
| 4 | Genomics Audit (24 items) | G1–G24 resolved |
| 5 | Git LFS | Model weights tracked |
| 6 | Graph-Only Training | 3.3× faster, 27× less disk |

Future

  • Polyploid assembly (--ploidy beyond haploid/diploid)
  • PacBio/ONT native metadata detection from BAM/POD5 headers
  • Additional ancient DNA damage models beyond deamination

📚 Documentation


📄 License

Dual-licensed: Noncommercial Academic (default) and Commercial. See LICENSE_ACADEMIC.md and LICENSE_COMMERCIAL.md.

  • Free for nonprofit academic research at universities and research institutes
  • Commercial license required for any for-profit use, industry-funded research, or integration into commercial products/pipelines
  • Source-available, not OSI open-source

Contact patrickgsgrady@gmail.com for commercial licensing.


📧 Contact

Patrick Grady | dr.pgrady(at)gmail.com


📈 Citation and References

@software{strandweaver2026,
  author = {Grady, Patrick and Green, Rich},
  title = {StrandWeaver: AI-Powered Multi-Technology Genome Assembler},
  year = {2026},
  url = {https://github.com/pgrady1322/strandweaver}
}
  1. Zimin AV, Puiu D, Luo MC, Zhu T, Koren S, Yorke JA, Dvorak J, Salzberg S. Hybrid assembly of the large and highly repetitive genome of Aegilops tauschii, a progenitor of bread wheat, with the mega-reads algorithm. Genome Research. 2017 Jan 1:066100.
  2. Rautiainen, M., Nurk, S., Walenz, B.P. et al. Telomere-to-telomere assembly of diploid chromosomes with Verkko. Nat Biotechnol 41, 1474–1482 (2023). https://doi.org/10.1038/s41587-023-01662-6
  3. Cheng, H., Concepcion, G.T., Feng, X. et al. Haplotype-resolved de novo assembly using phased assembly graphs with hifiasm. Nat Methods 18, 170–175 (2021). https://doi.org/10.1038/s41592-020-01056-5

StrandWeaver 🧬⚡🤖
