AI-Powered Multi-Technology Genome Assembler with GPU Acceleration
StrandWeaver
AI & ML-Powered Multi-Technology Genome Assembler with GPU Acceleration
StrandWeaver is a next-generation genome assembly pipeline that applies machine learning to the hardest parts of genome assembly: graph path resolution, error correction, haplotype phasing, and structural variant detection. It combines technology-aware error correction, graph-based assembly with a haplotype-aware graph neural network, iterative polishing, and integrated SV calling into a single pipeline for multimodal sequencing data, including the latest Element data. The pipeline is GPU-accelerated (NVIDIA CUDA or Apple Silicon MPS) with CPU fallback. Google Colab notebooks are provided for custom model training.
StrandWeaver is deeply inspired by the incredible work implemented in MaSuRCA (1), Verkko (2), and Hifiasm (3). StrandWeaver adds:
- GNN-based haplotype-aware path resolution that simplifies graph topology while strictly protecting biological variation (SNPs, indels, CNVs)
- Unified multi-technology error correction architecture spanning ONT, PacBio HiFi, ultra-long reads, and short reads (Illumina, Element, Ultima, PacBio SBB)
- ML-optimized k-mer selection for every assembly stage (graph construction, overlap, extension, polishing)
- Ancient DNA damage repair with trained deamination models (C→T/G→A)
- Assembly-time structural variant detection via XGBoost + LightGBM ensemble
- Integrated QV estimation, polishing, and gap filling
- One-command custom model training for any organism or sequencing technology
Pre-trained models ship with v0.2+ and load automatically. See the AI Model Training Guide for custom training.
🆕 What's New in v0.3
- All 7 AI modules now ship with trained models optimized on real + synthetic data (HG002, CHM13); download via `strandweaver download`
- ErrorSmith correction optimized for 17 chemistries / platforms
- SVScribe F1-macro improved from 0.557 → 0.957 with separate long-read-only and multi-technology (long-read, ultra-long, proximity-ligation) XGBoost + LightGBM ensembles, per-class threshold tuning, and biology-informed Bayesian priors
- Standalone `strandweaver (step)` commands to skip to different parts of the pipeline
- Integrated QV estimation (`qv`), iterative polishing (`polish`), and gap filling (`gap-fill`)
- 200× faster k-mer spectra extraction
- PyPI installable (`pip install strandweaver`)
- Training notebooks included for all model types
✨ Features
- 🧬 Multi-Technology Support: Illumina HiSeq2500, PacBio Onso/Revio/Sequel II, Element Aviti/UltraQ, Ultima, ONT R9/R10 (Guppy*, ligation/ultra-long, simplex), ONT R10 (Guppy*/Dorado, ligation/ultra-long, simplex/duplex)
- 🔀 Hybrid Assembly: Combine any mix of platforms in a unified assembly graph — DBG for short reads, OLC for long reads, automatically selected
- 🧬 Diploid-Aware: Protects SNP-level variation (>99.5% identity threshold), never collapses across haplotype boundaries, iterative refinement with phasing context backfed to ML models
- 🏛️ Ancient DNA Mode: mapDamage2-inspired C→T/G→A deamination repair with configurable confidence thresholds
- 🛡️ 80-Feature Edge Scoring (EdgeWarden): 26 static + 34 temporal + 20 expanded features; graceful fallback to static if alignment data unavailable
- 🧠 GNN Path Resolution (PathGNN): GATv2Conv attention network for haplotype-aware graph simplification with strict variation protection
- 🧵 Ultra-Long Read Routing (ThreadCompass): Multi-start pathfinding with confidence scoring and topology feedback
- 🔀 Hi-C Scaffolding & Phasing (HaplotypeDetangler): Spectral clustering on Hi-C contact matrices for chromosome-scale haplotype separation
- 🔍 Assembly-Time SV Detection (SVScribe): DEL/INS/INV/DUP/TRA calls from graph topology + UL spanning + Hi-C validation
- 📊 Integrated QV, Polishing & Gap Filling: End-to-end finishing built into the pipeline
- 📄 Rich Output: GFA graphs, BandageNG visualization, N50/L50/QV stats, VCF/JSON SV calls, phasing info, IGV/UCSC tracks, chromosome classification
- 🔌 Modular: All AI modules can be disabled (`--classical`) for heuristic-only assembly
- 🧪 Custom Training: CLI commands to generate synthetic training data and retrain all models for your organism
\* We recommend against using Guppy-basecalled data, although models are provided. The error profile is very difficult to train on, so these models make binary calls only ("error" or "correct base") rather than determining the type of error. The pipeline still outputs "best guess" corrected reads, but we highly recommend either hard-masking the flagged errors or treating them very carefully.
🎯 Model Performance
See the AI Model Training Guide for full per-class breakdowns, per-family ErrorSmith metrics, and training details.
| Module | Task | Accuracy / R² | F1-macro | CV (5-fold) |
|---|---|---|---|---|
| 🛡️ EdgeWarden | Edge quality scoring (per-technology) | 0.881 | 0.896 | 0.878 ± 0.002 |
| 🔧 ErrorSmith (v18) | Per-base error classification (11 families, 17 chemistries) | 0.949 | 0.950 | per-family (see Training Guide) |
| 🧬 PathGNN | Graph-aware edge classification | 0.897 | 0.897 | 0.897 ± 0.001 |
| 🔀 DiploidAI | Haplotype phasing (26 features) | 0.862 | 0.862 | 0.858 ± 0.001 |
| 🧵 ThreadCompass | Ultra-long read routing | R²=0.997 | — | R²=0.997 ± 0.0003 |
| 🔍 SVScribe (v7.2) | Two-stage dual-tier SV detection | S1 F1=0.991 | S2 F1=0.957 | S2 min-class=0.876 |
| 🧠 K-Weaver (DBG) | De Bruijn graph k-mer selection | R²=0.863 | — | 0.863 ± 0.064 |
| 🧠 K-Weaver (UL Overlap) | Ultra-long overlap k-mer selection | R²=0.982 | — | 0.982 ± 0.020 |
| 🧠 K-Weaver (Extension) | Contig extension k-mer selection | R²=0.849 | — | 0.849 ± 0.074 |
| 🧠 K-Weaver (Polish) | Polishing k-mer selection | R²=0.881 | — | 0.881 ± 0.067 |
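For readers interpreting the F1-macro column: macro-averaged F1 is the unweighted mean of the per-class F1 scores, so a rare class counts as much as a common one. The sketch below shows the calculation in plain Python. The function name `f1_macro` and the toy labels are made up for illustration and are unrelated to StrandWeaver's evaluation code or data.

```python
from collections import Counter


def f1_macro(y_true, y_pred):
    """Unweighted mean of per-class F1 over every class seen in either list."""
    classes = sorted(set(y_true) | set(y_pred))
    tp, fp, fn = Counter(), Counter(), Counter()
    for truth, pred in zip(y_true, y_pred):
        if truth == pred:
            tp[truth] += 1
        else:
            fp[pred] += 1   # predicted class gains a false positive
            fn[truth] += 1  # true class gains a false negative
    scores = []
    for c in classes:
        denom = 2 * tp[c] + fp[c] + fn[c]
        scores.append(2 * tp[c] / denom if denom else 0.0)
    return sum(scores) / len(scores)


# Toy labels (hypothetical): one DEL is miscalled as INS.
truth = ["DEL", "DEL", "INS", "INV"]
calls = ["DEL", "INS", "INS", "INV"]
print(round(f1_macro(truth, calls), 3))  # 0.778
```

Because every class is weighted equally, a single weak class pulls the macro score down noticeably, which is why the table also reports a minimum per-class value for SVScribe.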
Highlights of Training Results
For full figure sets, see Training Doc.
[Figures: SV Call Matrix; ErrorSmith Current Chemistry Training F1s; ErrorSmith Current Chemistry Substitution Precision and Recall]
📋 Contents
- Installation
- Quick Start
- Use Cases
- Pipeline Reference
- Custom Training
- Troubleshooting
- Roadmap
- License
- Citation
🔧 Installation
Requirements
- Python 3.9+
- 8 GB RAM minimum (32+ GB recommended for large genomes)
- Disk space: 50-100 GB for intermediate files (genome-dependent)
Dependencies
StrandWeaver uses a modular dependency system with three installation tiers:
Core Dependencies (Always Installed):
- Bioinformatics: `biopython>=1.79`, `pysam>=0.19.0`
- Numerical: `numpy>=1.21.0`, `scipy>=1.9.0`, `pandas>=1.3.0`
- Graph Processing: `networkx>=2.6.0`
- Hi-C/Phasing: `scipy>=1.9.0`, `scikit-learn>=1.3.0`
- CLI/IO: `click>=8.0.0`, `pyyaml>=6.0`, `tqdm>=4.62.0`, `h5py>=3.5.0`
- Performance: `numba>=0.54.0`, `joblib>=1.1.0`
AI/ML Dependencies (Optional - [ai] flag):
- PyTorch: `torch>=2.0.0` (CPU or GPU support)
- Graph Neural Networks: `pytorch-geometric>=2.3.0`
- Gradient Boosting: `xgboost>=2.0.0`
These are required for:
- Loading pre-trained models (shipped with v0.2+)
- Custom model training
- GPU-accelerated assembly (if CUDA available)
Development Dependencies (Optional - [dev] flag):
- Testing: `pytest>=7.0.0`, `pytest-cov>=4.0.0`
- Visualization: `matplotlib>=3.4.0`, `seaborn>=0.11.0`
- Documentation: Development tools for contributors
GPU Support:
- CUDA 11.8+ for NVIDIA GPUs (automatic via PyTorch)
- MPS backend for Apple Silicon (automatic in macOS 12.3+)
- CPU fallback if no GPU detected
Platform Compatibility:
- ✅ Linux (x86_64, ARM64)
- ✅ macOS (Intel, Apple Silicon with MPS acceleration)
- ✅ Windows (via WSL2 recommended)
Install from GitHub (Recommended)
StrandWeaver has several dependencies, especially if you plan on installing the AI/ML training dependencies, so it is highly recommended to install in a virtual environment (all testing was performed in Python venvs, but Conda should work equally well).
# Basic installation
pip install git+https://github.com/pgrady1322/strandweaver.git
# Recommended: Complete installation with all dependencies
pip install "git+https://github.com/pgrady1322/strandweaver.git#egg=strandweaver[all]"
# With AI/training dependencies (PyTorch, XGBoost)
pip install "git+https://github.com/pgrady1322/strandweaver.git#egg=strandweaver[ai]"
# With development and testing features
pip install "git+https://github.com/pgrady1322/strandweaver.git#egg=strandweaver[dev]"
🚀 Quick Start
StrandWeaver offers two execution modes:
| Mode | Usage | Best For |
|---|---|---|
| Direct | `strandweaver <command> [options]` | Small/medium datasets, local workstation, testing |
| Nextflow | `strandweaver <command> [options] --nextflow` | Large datasets, HPC clusters, parallel processing |
Python CLI Mode
Basic Long Read Assembly Example (PacBio)
strandweaver pipeline \
--hifi-long-reads hifi_reads.fastq \
-o assembly_output/ \
-t 8
Hybrid Assembly with Multiple Ultra-Long Read Types
Note: The --ont-ul flag is used for path-finding reads. Any platform of long reads can be provided, but shorter long reads will degrade the assembly. The --ont-ul name is retained for clarity / comparison with other assemblers.
strandweaver pipeline \
--hifi-long-reads hifi_reads.fastq \
--ont-long-reads ont_reads.fastq.gz \
--ont-ul ultralong_reads.fastq \
-o assembly_output/ \
-t 16
Mixed Technology Assembly with Hi-C
Note: ANY platform of proximity ligation tech can be provided. StrandWeaver will optimize for Hi-C and Omni-C just as well as Pore-C and CiFi.
strandweaver pipeline \
--hifi-long-reads hifi_reads.fastq \
--ont-ul ultralong_reads.fastq \
--hic-r1 hic_R1.fastq --hic-r2 hic_R2.fastq \
-o assembly_output/ \
-t 16
Nextflow Mode (HPC/Cloud)
Local Execution
nextflow run strandweaver/nextflow/main.nf \
--hifi hifi_reads.fastq \
--ont ont_reads.fastq \
--hic_r1 hic_R1.fastq \
--hic_r2 hic_R2.fastq \
--outdir results/ \
-profile local
SLURM Cluster with Singularity
nextflow run strandweaver/nextflow/main.nf \
--hifi hifi_reads.fastq \
--ont ont_reads.fastq \
--ont_ul ultralong_reads.fastq \
--hic_r1 hic_R1.fastq \
--hic_r2 hic_R2.fastq \
--outdir results/ \
-profile slurm,singularity \
-resume
Huge Genome Mode (Plants, Insects) (parallel k-mer extraction)
nextflow run strandweaver/nextflow/main.nf \
--hifi hifi_reads.fastq \
--huge \
--outdir results/ \
-profile slurm
See nextflow/README.md for complete Nextflow documentation.
Nextflow Profiles
| Profile | Description | Use Case |
|---|---|---|
| `local` | Direct execution on local machine | Testing, small datasets |
| `docker` | Docker containerization | Reproducibility, dependency isolation |
| `singularity` | Singularity containers | HPC clusters (no root required) |
| `slurm` | SLURM cluster scheduler | HPC parallel processing |
| `test` | Use synthetic E. coli data | Quick validation |
Combine profiles with commas: -profile slurm,singularity
Error Correction
# Direct mode
strandweaver correct --hifi reads.fq.gz -o corrected/ -t 8
# Nextflow mode (automatic parallelization)
strandweaver correct --hifi reads.fq.gz -o corrected/ \
--nextflow --nf-profile slurm --correction-batch-size 100000
ErrorSmith Chemistry Designation
ErrorSmith uses chemistry-aware models trained on 17 sequencing platform / chemistry
combinations. Specify the chemistry by appending :chemistry_id to the reads
path, or by passing the corresponding chemistry flag immediately after the
data flag. Both forms are equivalent — use whichever is clearer for your
command. If your exact kit / flow cell combination is not listed, pick the
closest available model. When no chemistry is specified the default is applied
automatically.
# Colon syntax (preferred — keeps reads + chemistry together)
--flag reads.fastq:chemistry_id
# Flag syntax (equivalent)
--flag reads.fastq --flag-chemistry chemistry_id
| Read flag | Chemistry flag | Available chemistries | Default |
|---|---|---|---|
| `--hifi-long-reads` | `--hifi-chemistry` | `pacbio_hifi_sequel2`, `pacbio_hifi_revio`, `pacbio_hifi_vega` | `pacbio_hifi_revio` |
| `--ont-long-reads` | `--ont-chemistry` | `ont_lsk110_r941`, `ont_lsk114_r1041`, `ont_r1041_duplex`, `ont_herro_corrected` | `ont_lsk114_r1041` |
| `--ont-ul` | `--ont-ul-chemistry` | `ont_ulk001_r941`, `ont_ulk114_r1041`, `ont_ulk114_r1041_hiacc`, `ont_ulk114_r1041_dorado` | `ont_ulk114_r1041` |
| `-r1` / `-r2` | `--short-read-chemistry` | `illumina_hiseq2500`, `pacbio_onso`, `element_aviti`, `element_aviti_lng`, `element_ultraq`, `ultima_ug100` | `illumina_hiseq2500` |
Short reads: the colon suffix can go on either `-r1` or `-r2`; StrandWeaver applies the same chemistry model to the pair. Alternatively, use the standalone `--short-read-chemistry` flag.
Multi-chemistry support. When an assembly combines reads from more than one chemistry within the same read type (e.g. ONT R9 + R10 datasets, or multiple ultra-long preps), pass the read flag once per dataset:
# Two ONT chemistries — colon syntax
--ont-long-reads r9.fastq:ont_lsk110_r941 \
--ont-long-reads r10.fastq:ont_lsk114_r1041
# Same thing — flag syntax
--ont-long-reads r9.fastq --ont-chemistry ont_lsk110_r941 \
--ont-long-reads r10.fastq --ont-chemistry ont_lsk114_r1041
# Two ultra-long preps with different basecallers
--ont-ul ul_standard.fastq:ont_ulk114_r1041 \
--ont-ul ul_dorado.fastq:ont_ulk114_r1041_dorado
ErrorSmith applies the corresponding chemistry-specific error model to each read set independently, then merges the corrected outputs before downstream graph construction.
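To make the colon syntax concrete, here is a minimal sketch of how a `reads.fastq:chemistry_id` argument could be split into a path and a chemistry ID. This is an illustration only, not StrandWeaver's actual parser: the function name `split_reads_spec` and the fallback behavior are assumptions, and only the chemistry IDs are taken from the table above.

```python
# Chemistry IDs copied from the table above. The parsing rules are assumed.
KNOWN_CHEMISTRIES = {
    "pacbio_hifi_sequel2", "pacbio_hifi_revio", "pacbio_hifi_vega",
    "ont_lsk110_r941", "ont_lsk114_r1041", "ont_r1041_duplex",
    "ont_herro_corrected", "ont_ulk001_r941", "ont_ulk114_r1041",
    "ont_ulk114_r1041_hiacc", "ont_ulk114_r1041_dorado",
    "illumina_hiseq2500", "pacbio_onso", "element_aviti",
    "element_aviti_lng", "element_ultraq", "ultima_ug100",
}


def split_reads_spec(spec, default):
    """Split 'path:chemistry_id' into (path, chemistry).

    Only the text after the LAST colon is considered, and only when it is a
    known chemistry ID, so paths that contain colons are left untouched.
    """
    path, sep, suffix = spec.rpartition(":")
    if sep and suffix in KNOWN_CHEMISTRIES:
        return path, suffix
    return spec, default


print(split_reads_spec("r9.fastq:ont_lsk110_r941", "ont_lsk114_r1041"))
print(split_reads_spec("hifi.fastq", "pacbio_hifi_revio"))
```

In this sketch a misspelled suffix is silently treated as part of the path and the default chemistry is used. A real parser would more likely reject it with an error.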
Chemistry → platform mapping reference
| Chemistry ID | Platform | Read Type | Kit / Flow Cell |
|---|---|---|---|
| `pacbio_hifi_sequel2` | PacBio Sequel II | HiFi long | CCS |
| `pacbio_hifi_revio` | PacBio Revio | HiFi long | CCS |
| `pacbio_hifi_vega` | PacBio Vega | HiFi long | CCS |
| `ont_lsk110_r941` | ONT R9.4.1 | Ligation long | LSK110 |
| `ont_lsk114_r1041` | ONT R10.4.1 | Ligation long | LSK114 |
| `ont_r1041_duplex` | ONT R10.4.1 | Duplex long | LSK114 duplex |
| `ont_herro_corrected` | ONT (Herro-corrected) | Long (corrected) | Herro error-corrected reads |
| `ont_ulk001_r941` | ONT R9.4.1 | Ultra-long | ULK001 |
| `ont_ulk114_r1041` | ONT R10.4.1 | Ultra-long | ULK114 |
| `ont_ulk114_r1041_hiacc` | ONT R10.4.1 | Ultra-long (HiAcc) | ULK114, high-accuracy mode |
| `ont_ulk114_r1041_dorado` | ONT R10.4.1 | Ultra-long (Dorado) | ULK114, Dorado basecaller |
| `illumina_hiseq2500` | Illumina HiSeq 2500 | Short | TruSeq |
| `pacbio_onso` | PacBio Onso | Short | SBB |
| `element_aviti` | Element Aviti | Short | AVITI chemistry |
| `element_aviti_lng` | Element Aviti | Long | AVITI long-read chemistry |
| `element_ultraq` | Element UltraQ | Short | UltraQ chemistry |
| `ultima_ug100` | Ultima Genomics UG100 | Short | UG100 flow chemistry |
# Example: Revio HiFi + ONT duplex + Dorado ultra-long (colon syntax)
strandweaver pipeline \
--hifi-long-reads hifi.fastq:pacbio_hifi_revio \
--ont-long-reads duplex.bam:ont_r1041_duplex \
--ont-ul ultralong.bam:ont_ulk114_r1041_dorado \
-o assembly_output/ -t 16
# Example: Element Aviti short reads + ONT ultra-long (colon on -r1)
strandweaver pipeline \
-r1 aviti_R1.fastq:element_aviti -r2 aviti_R2.fastq \
--ont-ul ultralong.fastq \
-o assembly_output/ -t 16
# Same short-read example using the flag syntax instead
strandweaver pipeline \
-r1 aviti_R1.fastq -r2 aviti_R2.fastq --short-read-chemistry element_aviti \
--ont-ul ultralong.fastq \
-o assembly_output/ -t 16
# Example: mixed ONT chemistries (R9 + R10 reads in one assembly)
strandweaver pipeline \
--ont-long-reads r9_reads.fastq:ont_lsk110_r941 \
--ont-long-reads r10_reads.fastq:ont_lsk114_r1041 \
--hifi-long-reads hifi.fastq \
-o assembly_output/ -t 16
The `profile` and `batch-correct` commands accept a single `--chemistry` flag
with any of the 17 chemistry IDs above.
Individual Processing Commands
StrandWeaver provides standalone commands for each processing stage for instances in which you may want just corrected reads or read error profiles. StrandWeaver also supports mapping of reads to GFA graphs, and calling SVs on GFA graphs. Each command supports both direct and Nextflow execution.
Error Correction
# HiFi + ONT correction with chemistry specified (colon syntax)
strandweaver correct \
--hifi-long-reads hifi.fastq:pacbio_hifi_revio \
--ont-long-reads ont.fastq:ont_lsk114_r1041 \
-o corrected/ -t 16
# Mixed ONT chemistries with Nextflow parallelization (flag syntax)
strandweaver correct \
--ont-long-reads r9.fastq --ont-chemistry ont_lsk110_r941 \
--ont-long-reads r10.fastq --ont-chemistry ont_lsk114_r1041 \
-o corrected/ \
--nextflow --nf-profile slurm --correction-batch-size 100000
K-mer Extraction
# Direct mode
strandweaver extract-kmers --hifi reads.fq.gz -k 31 -o kmers.pkl -t 8
# Nextflow mode (huge genomes >10GB)
strandweaver extract-kmers --hifi reads.fq.gz -k 31 -o kmers.pkl \
--nextflow --nf-profile slurm --kmer-batch-size 2000000
Edge Scoring
# Direct mode
strandweaver nf-score-edges -e edges.json -a aligns.bam -o scored.json -t 8
# Nextflow mode (large graphs)
strandweaver nf-score-edges -e edges.json -a aligns.bam -o scored.json \
--nextflow --nf-profile slurm --edge-batch-size 10000
Ultra-Long Read Mapping
# Direct mode
strandweaver map-ul -u ul_reads.fq.gz -g graph.gfa -o aligns.paf --use-gpu
# Nextflow mode
strandweaver map-ul -u ul_reads.fq.gz -g graph.gfa -o aligns.paf \
--nextflow --nf-profile slurm --use-gpu --ul-batch-size 100
Hi-C Alignment
# Direct mode
strandweaver align-hic --hic-r1 R1.fq.gz --hic-r2 R2.fq.gz \
-g graph.gfa -o aligns.bam -t 8
# Nextflow mode (large Hi-C datasets)
strandweaver align-hic --hic-r1 R1.fq.gz --hic-r2 R2.fq.gz \
-g graph.gfa -o aligns.bam \
--nextflow --nf-profile slurm --hic-batch-size 500000
Structural Variant Detection
# Direct mode
strandweaver nf-detect-svs -g graph.gfa -o variants.vcf -t 8
# Nextflow mode (large graphs)
strandweaver nf-detect-svs -g graph.gfa -o variants.vcf \
--nextflow --nf-profile slurm --sv-batch-size 1000
Performance Guidelines
When to use Direct mode: Dataset < 10GB, local workstation with 8+ cores, testing and debugging.
When to use Nextflow mode: Dataset > 10GB, HPC cluster available, need resume capability, want automatic parallelization.
| Command | Direct (1 node) | Nextflow (20 nodes) | Speedup |
|---|---|---|---|
| `correct` | 20 hours | 2 hours | 10× |
| `extract-kmers` | 8 hours | 1.5 hours | 5× |
| `nf-score-edges` | 8 hours | 1.5 hours | 5× |
| `map-ul` | 6 hours | 1 hour | 6× |
| `align-hic` | 10 hours | 1.5 hours | 7× |
| `nf-detect-svs` | 4 hours | 1 hour | 4× |
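The speedup column is the ratio of direct to Nextflow wall-clock time. Dividing that ratio by the node count gives a rough parallel efficiency, which can help when deciding how many nodes to request. The sketch below recomputes both from the table's figures. The helper name `speedup_and_efficiency` is made up for illustration, and the efficiency figure is a back-of-the-envelope estimate, not a StrandWeaver benchmark.

```python
# (direct hours on 1 node, Nextflow hours on 20 nodes), taken from the table above
RUNTIMES = {
    "correct": (20, 2),
    "extract-kmers": (8, 1.5),
    "nf-score-edges": (8, 1.5),
    "map-ul": (6, 1),
    "align-hic": (10, 1.5),
    "nf-detect-svs": (4, 1),
}
NODES = 20


def speedup_and_efficiency(direct_hours, nextflow_hours, nodes=NODES):
    """Return (speedup, parallel efficiency) for one command."""
    speedup = direct_hours / nextflow_hours
    return speedup, speedup / nodes


for command, (direct, nextflow) in RUNTIMES.items():
    s, e = speedup_and_efficiency(direct, nextflow)
    print(f"{command:15s} speedup {s:4.1f}x  efficiency {e:.0%}")
```

An efficiency well below 100% suggests that a smaller allocation may deliver much of the same wall-clock benefit, although the real scaling depends on the dataset and cluster.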
🎯 Use Cases
Machine-Learning-Tuned Genome Assembly with SV Calls
Combine ONT, HiFi, ultra-long reads, and Hi-C for chromosome-scale phased assemblies:
strandweaver pipeline \
--ont-long-reads ont.fastq \
--hifi-long-reads hifi.fastq \
--ont-ul ultralong.fastq \
--hic-r1 hic_R1.fastq --hic-r2 hic_R2.fastq \
--use-ai \
-o genome_assembly/ \
-t 32
Ancient DNA Assembly
Optimize for deamination damage with specialized error correction:
strandweaver pipeline \
-r1 ancient_reads.fastq --technology1 ancient \
-o ancient_assembly/ \
-t 16
Note that the assembly can also be run WITHOUT damage awareness features for comparison.
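For context on what "deamination damage" looks like in data: ancient DNA typically shows an excess of C→T mismatches near the 5' end of reads (and G→A near the 3' end). The sketch below computes a simple position-wise C→T frequency from ungapped read/reference pairs. It is a simplified illustration of the kind of signal a damage profile captures, not StrandWeaver's trained model. The function name `c_to_t_profile` and the toy sequences are hypothetical.

```python
def c_to_t_profile(pairs, window=10):
    """Fraction of reference C bases read as T, per distance from the 5' end.

    `pairs` is an iterable of (reference, read) strings of equal length,
    assumed to be aligned without gaps.
    """
    c_total = [0] * window
    c_to_t = [0] * window
    for ref, read in pairs:
        for i, (r, q) in enumerate(zip(ref[:window], read[:window])):
            if r == "C":
                c_total[i] += 1
                if q == "T":
                    c_to_t[i] += 1
    return [hit / n if n else 0.0 for hit, n in zip(c_to_t, c_total)]


# Toy alignments (hypothetical): two of three reads show C->T at position 0.
toy = [("CACGT", "TACGT"), ("CCGTA", "TCGTA"), ("CAGTC", "CAGTC")]
print(c_to_t_profile(toy, window=3))
```

A profile that is elevated at the first few positions and decays toward the read interior is the classic damage signature. A flat profile suggests modern DNA or contamination.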
SV-Rich Genome Analysis
Detect structural variants during assembly for cancer or population genomics:
strandweaver pipeline \
--hifi-long-reads tumor.fastq \
--hic-r1 hic_R1.fastq --hic-r2 hic_R2.fastq \
--min-sv-size 30 \
-o tumor_assembly/ \
-t 24
Highly Heterozygous Diploid Assembly
Maintain haplotype separation for F1 hybrids or outcrossing species:
strandweaver pipeline \
--hifi-long-reads hifi.fastq \
--hic-r1 hic_R1.fastq --hic-r2 hic_R2.fastq \
--ploidy diploid \
--edge-filter-mode strict \
-o diploid_assembly/ \
-t 32
🔬 Pipeline Reference
Preprocessing
- Classify: Auto-detect sequencing technologies from FASTQ headers (supports ONT chemistry detection with LongBow)
- KWeaver: ML-based k-mer optimization with rule-based fallback for dynamic k-mer selection
- Profile: Error pattern profiling (substitutions, indels, homopolymers) with visualization
- Correct: Technology-aware read correction (ONTCorrector, PacBioCorrector)
Core Assembly
- Graph Building: Graph construction (type of graph based on read type) from reads with streaming architecture
- EdgeWarden: AI-powered graph edge filtering with 80-feature scoring
- PathWeaver: GNN-based haplotype-aware path resolution with variation protection
- StringGraph: Ultra-long read overlay for long-range connections
- ThreadCompass: UL read routing optimization with trained models
- Hi-C Integration: Proximity ligation contact matrix construction and edge addition
- HaplotypeDetangler: Hi-C-augmented phasing via spectral clustering
- Iteration: 3+ refinement cycles with phasing-aware filtering
- SVScribe: Graph-based structural variant detection (DEL, INS, INV, DUP, TRA)
- Iterate or Finalize: Contig and scaffold extraction with comprehensive statistics, or pass graph with feature scoring to pipeline for iteration 2+.
Post-Assembly Analysis
- Misassembly Report: Putative misassembly detection using multi-signal evidence (EdgeWarden confidence, coverage discontinuities, UL read conflicts, Hi-C violations). Outputs TSV and BED reports for genome-browser visualization.
  - `--misassembly-report` / `--no-misassembly-report`: Enabled by default
  - `--misassembly-min-confidence HIGH|MEDIUM|LOW`: Minimum confidence to flag (default: MEDIUM)
  - `--misassembly-format tsv,bed,json`: Comma-separated output formats (default: tsv,bed)
- Chromosome Classification: Multi-tier scaffold classification to identify chromosomes vs. assembly artifacts.
  - Tier 1 (always): Length, coverage, GC, connectivity, telomere detection
  - Tier 2 (always): Gene content analysis (ORF / BLAST / Augustus / BUSCO)
  - Tier 3 (`--id-chromosomes-advanced`): Hi-C self-contact patterns, synteny
  - Telomere flags: `--telomere-sequence` (default: TTAGGG), `--telomere-min-units` (default: 10), `--telomere-search-depth` (default: 5000 bp)
Output Generation
- GFA Export: Assembly graphs in GFA format with sequences
- BandageNG: Visualization files with coverage tracks and final StrandWeaver scores in the 0-1 range (long/UL/Hi-C).
- Statistics: N50, L50, coverage metrics, variation protection counts
- SV Calls: Structural variants in VCF and JSON formats
- Phasing Info: Haplotype assignments and confidence scores
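As a reminder of what the N50 and L50 statistics mean: sort the contigs from longest to shortest and accumulate lengths until half the total assembly size is reached. N50 is the length of the contig that crosses that point, and L50 is the number of contigs needed to get there. A minimal sketch (the function name `n50_l50` is made up for illustration and is not part of StrandWeaver's API):

```python
def n50_l50(lengths):
    """Return (N50, L50) for a list of contig lengths."""
    ordered = sorted(lengths, reverse=True)
    half = sum(ordered) / 2
    running = 0
    for count, length in enumerate(ordered, start=1):
        running += length
        if running >= half:
            return length, count
    raise ValueError("no contig lengths given")


print(n50_l50([100, 80, 60, 40, 20]))  # (80, 2)
```

Here the total is 300, so the halfway point (150) is crossed by the second-longest contig.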
Post-Assembly CLI Options Reference
| Flag | Default | Description |
|---|---|---|
| `--misassembly-report` / `--no-misassembly-report` | Enabled | Generate misassembly report (TSV + BED) |
| `--misassembly-min-confidence` | `MEDIUM` | Minimum confidence for flags: HIGH, MEDIUM, LOW |
| `--misassembly-format` | `tsv,bed` | Comma-separated output formats: tsv, bed, json |
| `--id-chromosomes` | Off | Enable scaffold → chromosome classification (Tiers 1-2) |
| `--id-chromosomes-advanced` | Off | Add Hi-C pattern analysis & synteny (Tier 3) |
| `--gene-detection-method` | `orf` | Gene detection: orf (no deps), blast, augustus, busco |
| `--blast-db` | `nr` | BLAST database for gene detection |
| `--telomere-sequence` | `TTAGGG` | Telomere repeat motif. Alternatives: TTTAGGG (plants), TTAGG (insects) |
| `--telomere-min-units` | `10` | Minimum tandem repeats to call a telomere |
| `--telomere-search-depth` | `5000` | Base-pairs to search at each scaffold end |
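To illustrate how the three telomere options interact, the sketch below looks for at least `min_units` tandem copies of the motif (or its reverse complement) within `search_depth` bases of each scaffold end. It only mirrors the documented defaults. The function name `has_telomere` and the regex approach are assumptions, and StrandWeaver's actual detection logic may differ.

```python
import re


def revcomp(seq):
    """Reverse complement of an uppercase DNA string."""
    return seq.translate(str.maketrans("ACGT", "TGCA"))[::-1]


def has_telomere(scaffold, motif="TTAGGG", min_units=10, search_depth=5000):
    """Return (start_hit, end_hit) for tandem telomere repeats at each end.

    Either strand orientation of the motif is accepted at either end.
    """
    pattern = re.compile(
        "(?:{m}){{{n},}}|(?:{rc}){{{n},}}".format(
            m=re.escape(motif), rc=re.escape(revcomp(motif)), n=min_units
        )
    )
    seq = scaffold.upper()
    start_hit = bool(pattern.search(seq[:search_depth]))
    end_hit = bool(pattern.search(seq[-search_depth:]))
    return start_hit, end_hit


# Toy scaffold (hypothetical): 12 repeat units at the start, only 9 at the end.
toy = "CCCTAA" * 12 + "ACGT" * 3000 + "TTAGGG" * 9
print(has_telomere(toy))  # (True, False)
```

Lowering `min_units` to 9 would flag the second end as well, which shows how the threshold trades sensitivity against false positives from short interstitial repeats.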
🤖 Custom Training
Pre-trained models load automatically. Custom training is only needed for organism-specific optimization (extreme repeat content, unusual ploidy, novel sequencing chemistries).
# 1. Generate training data (graph-only mode: ~7 min/genome, ~8.5 MB/genome)
strandweaver train generate-data \
--genome-size 1000000 -n 200 \
--graph-training --graph-only \
-o training_data/
# 2. Train all models with cross-validation
strandweaver train run --data-dir training_data/ -o trained_models/
# 3. Assemble (custom models load automatically, or specify --model-dir)
strandweaver pipeline --hifi-long-reads reads.fastq.gz -o assembly/
See trained_models/TRAINING.md for the full training guide, Colab GPU notebooks, hyperparameter search spaces, and per-class performance breakdowns. See strandweaver/user_training/README.md for the synthetic genome generation parameter reference.
Output Files
output/
├── contigs.fasta # Primary assembly contigs
├── final_assembly.fasta # Polished, length-filtered contigs
├── scaffolds.fasta # Hi-C scaffolded sequences
├── assembly_graph.gfa # Assembly graph (GFA format)
├── assembly_stats.json # N50, L50, QV, coverage statistics
├── misassembly_report.tsv # Putative misassemblies (tab-delimited)
├── misassembly_report.bed # Misassemblies (genome browser BED)
├── chromosome_classification.json # Scaffold → chromosome classification
├── sv_calls.vcf # Structural variant calls
├── phasing_info.json # Haplotype assignments
├── coverage_long.csv # Long read coverage (BandageNG)
├── coverage_ul.csv # Ultra-long coverage (BandageNG)
├── coverage_hic.csv # Hi-C support (BandageNG)
├── kmer_predictions.json # K-mer optimization results
├── error_profile_<tech>_<n>.json # Per-technology error profiles
└── pipeline.log # Complete execution log
Troubleshooting
Installation Issues
Problem: ModuleNotFoundError or import errors
# Solution: Reinstall with all dependencies
pip install --force-reinstall "git+https://github.com/pgrady1322/strandweaver.git#egg=strandweaver[all]"
Problem: Python version incompatibility
# Check Python version (requires 3.9+)
python3 --version
# Create conda environment with correct Python version
conda create -n strandweaver python=3.11
conda activate strandweaver
pip install "git+https://github.com/pgrady1322/strandweaver.git#egg=strandweaver[all]"
Problem: PyTorch/GPU issues
# For CUDA support, install PyTorch separately first
pip install torch --index-url https://download.pytorch.org/whl/cu118
pip install "git+https://github.com/pgrady1322/strandweaver.git#egg=strandweaver[all]"
# Verify GPU detection
python3 -c "import torch; print(f'GPU available: {torch.cuda.is_available()}')"
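The one-liner above only checks CUDA. The sketch below extends the check to follow the CUDA → MPS → CPU fallback order described under GPU Support. It uses only standard PyTorch calls (`torch.cuda.is_available()` and `torch.backends.mps.is_available()`). The function name `pick_device` is made up for illustration, and StrandWeaver's own backend selection may work differently.

```python
def pick_device():
    """Return 'cuda', 'mps', or 'cpu', preferring a GPU backend when visible."""
    try:
        import torch
    except ImportError:
        # Without the [ai] extras there is no PyTorch, so only CPU is possible.
        return "cpu"
    if torch.cuda.is_available():
        return "cuda"
    mps = getattr(torch.backends, "mps", None)  # absent in very old PyTorch builds
    if mps is not None and mps.is_available():
        return "mps"
    return "cpu"


print(f"Selected device: {pick_device()}")
```

If this prints `cpu` on a machine with a GPU, the usual causes are a CPU-only PyTorch wheel or a driver/CUDA version mismatch.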
Assembly Quality Issues
Problem: Low N50 or fragmented assembly
- Check error rate & coverage: Aim for 30×+ HiFi or 50×+ ONT
strandweaver profile -i reads.fastq -o profile.json
- Add ultra-long reads, or a subset of your LONGEST READS: Dramatically improves contiguity
strandweaver pipeline --hifi-long-reads hifi.fastq --ont-ul ultralong.fastq -o improved/
- Enable Hi-C scaffolding: For chromosome-scale assemblies
strandweaver pipeline --hifi-long-reads hifi.fastq --hic-r1 hic_R1.fastq --hic-r2 hic_R2.fastq -o scaffolded/
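To check whether a dataset meets the coverage targets above, depth can be estimated as total sequenced bases divided by the expected genome size. The sketch below does this for a standard four-line-per-record FASTQ file (plain or gzipped). The function name `estimate_coverage` is made up for illustration, and the genome size is a value you supply.

```python
import gzip


def estimate_coverage(fastq_path, genome_size):
    """Estimate sequencing depth as total read bases / genome size.

    Assumes four lines per FASTQ record (no wrapped sequence lines).
    """
    opener = gzip.open if str(fastq_path).endswith(".gz") else open
    total_bases = 0
    with opener(fastq_path, "rt") as handle:
        for i, line in enumerate(handle):
            if i % 4 == 1:  # second line of each record holds the sequence
                total_bases += len(line.strip())
    return total_bases / genome_size


# Example call (file name and genome size are placeholders):
# depth = estimate_coverage("hifi.fastq.gz", 3_100_000_000)
# print(f"~{depth:.1f}x coverage")
```

This counts raw bases, so the depth that survives correction and filtering will be somewhat lower.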
Problem: Collapsed heterozygous regions
# Use diploid mode with strict edge filtering to preserve variation
strandweaver pipeline \
--hifi-long-reads reads.fastq \
--ploidy diploid \
--edge-filter-mode strict \
-o diploid_assembly/
Problem: Assembly produces too many contigs (over-fragmented)
- Reduce edge filtering stringency:
strandweaver pipeline --hifi-long-reads reads.fastq --edge-filter-mode lenient -o assembly/
- Increase k-mer size: For high-coverage, low-error data
strandweaver pipeline --hifi-long-reads reads.fastq --kmer-size-assembly 51 -o assembly/
Problem: Assembly is chimeric or has misassemblies
- Enable stricter filtering:
strandweaver pipeline --hifi-long-reads reads.fastq --edge-filter-mode strict -o assembly/
- Add Hi-C validation: Long-range contact validation prevents chimeras
strandweaver pipeline --hifi-long-reads hifi.fastq --hic-r1 hic_R1.fastq --hic-r2 hic_R2.fastq -o validated/
Performance & Resource Issues
Problem: Out of memory (OOM) errors
# Limit memory usage
strandweaver pipeline \
--hifi-long-reads reads.fastq \
-o assembly/ \
--memory-limit 16
# Or reduce graph coverage via sampling
strandweaver pipeline \
--hifi-long-reads reads.fastq \
-o assembly/ \
--sample-size-graph 500000
Problem: Assembly is too slow
# Increase threads (use all available cores)
strandweaver pipeline --hifi-long-reads reads.fastq -t $(nproc) -o assembly/
# Disable AI features for faster heuristic-only assembly
strandweaver pipeline --hifi-long-reads reads.fastq --classical -o fast_assembly/
# Skip profiling step if reads are already well-characterized
strandweaver pipeline --hifi-long-reads reads.fastq --skip-profiling -o quick_assembly/
Problem: Disk space issues
# Export only FASTA (skip GFA graphs to save space)
strandweaver pipeline \
--hifi-long-reads reads.fastq \
-o assembly/ \
--output-format fasta
# Use a separate output directory on a larger drive
strandweaver pipeline \
--hifi-long-reads reads.fastq \
-o /mnt/large_drive/assembly/
Input Data Issues
Problem: "Unsupported file format" error
# StrandWeaver accepts: FASTQ, FASTA, gzipped variants
# Check file format
file reads.fastq
# Convert BAM to FASTQ if needed
samtools fastq reads.bam > reads.fastq
# Decompress if needed
gunzip -c reads.fastq.gz > reads.fastq
Problem: Technology auto-detection fails
# Manually specify read technology
strandweaver pipeline \
-r1 reads.fastq --technology1 ont \
-o assembly/
# Supported: illumina, ancient, ont, ont_ultralong, pacbio
Problem: Ancient DNA damage not detected
# Explicitly specify ancient DNA technology
strandweaver pipeline \
-r1 ancient_reads.fastq --technology1 ancient \
-o ancient_assembly/
# Check damage profile first
strandweaver profile -i ancient_reads.fastq --technology ancient -o damage_profile.json
AI/ML Issues
Problem: AI features not working
# Check if AI dependencies installed
python3 -c "import torch, xgboost; print('AI dependencies OK')"
# Install AI dependencies if missing
pip install "git+https://github.com/pgrady1322/strandweaver.git#egg=strandweaver[ai]"
Problem: "No trained models found" warning
# Pre-trained models ship with v0.2+ and should load automatically.
# If you see this warning, the trained_models/ directory may be missing.
# Reinstall to restore pre-trained models:
pip install --force-reinstall "git+https://github.com/pgrady1322/strandweaver.git#egg=strandweaver[all]"
# Or train custom models:
strandweaver train generate-data --genome-size 5000000 --graph-training --graph-only -o training_data/
strandweaver train run --data-dir training_data/ -o trained_models/
# See trained_models/TRAINING.md for the complete training guide
Problem: GPU not being used
# Force GPU usage with explicit backend
strandweaver pipeline \
--hifi-long-reads reads.fastq \
-o assembly/ \
--gpu-backend cuda
# On Apple Silicon
strandweaver pipeline \
--hifi-long-reads reads.fastq \
-o assembly/ \
--gpu-backend mps
# Check GPU memory usage during assembly
watch -n 1 nvidia-smi
Output Issues
Problem: No structural variants detected
# Ensure SV detection is happening (enabled by default in the pipeline)
# SVs require ultra-long or Hi-C data for validation
strandweaver pipeline \
--hifi-long-reads hifi.fastq \
--ont-ul ultralong.fastq \
--min-sv-size 30 \
-o assembly/
Problem: Missing output files
# Check pipeline.log for errors
tail -100 output/pipeline.log
# Export all output formats (FASTA + GFA) with intermediate graphs
strandweaver pipeline \
--hifi-long-reads reads.fastq \
-o assembly/ \
--output-format both \
--export-intermediate-graphs
Problem: GFA file won't load in BandageNG
# Validate GFA format
grep "^S" assembly_graph.gfa | head -5
# Regenerate with both FASTA + GFA export
strandweaver pipeline \
--hifi-long-reads reads.fastq \
-o assembly/ \
--output-format both
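The `grep` check above only confirms that segment lines exist. A slightly stronger sanity check is to confirm that every link (`L` line) refers to a segment (`S` line) that is actually defined, since dangling links are a common reason a graph fails to load. The sketch below does this for tab-delimited GFA 1 files. The function name `check_gfa` is made up for illustration and this is not a full GFA validator.

```python
def check_gfa(path):
    """Return a list of problems found in a GFA 1 file (empty if none)."""
    segments, links, problems = set(), [], []
    with open(path) as handle:
        for lineno, line in enumerate(handle, start=1):
            fields = line.rstrip("\n").split("\t")
            if fields[0] == "S":
                if len(fields) < 3:
                    problems.append(f"line {lineno}: S line needs name and sequence")
                else:
                    segments.add(fields[1])
            elif fields[0] == "L":
                # L <from> <from_orient> <to> <to_orient> <overlap>
                if len(fields) < 6:
                    problems.append(f"line {lineno}: L line needs 5 fields after the tag")
                else:
                    links.append((lineno, fields[1], fields[3]))
    for lineno, src, dst in links:
        for name in (src, dst):
            if name not in segments:
                problems.append(f"line {lineno}: link references unknown segment {name}")
    if not segments:
        problems.append("no S (segment) lines found")
    return problems


# Example call (path is a placeholder):
# for problem in check_gfa("assembly/assembly_graph.gfa"):
#     print(problem)
```

An empty result means the basic segment/link structure is consistent. It does not rule out other issues such as malformed optional tags.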
Common Error Messages
ValueError: Coverage too low for reliable assembly
- Solution: Increase sequencing coverage (aim for 30×+ minimum)
RuntimeError: Graph construction failed - no valid k-mer overlaps
- Solution: Try a different k-mer size with `--kmer-size-assembly 31` or `--kmer-size-assembly 51`
MemoryError: Unable to allocate array
- Solution: Use `--memory-limit 16` to cap memory or reduce coverage with `--sample-size-graph 500000`
ImportError: cannot import name 'ThreadCompass'
- Solution: Reinstall the package with `pip install --force-reinstall strandweaver`
FileNotFoundError: [Errno 2] No such file or directory
- Solution: Use absolute paths for input/output files or check current working directory
Getting Help
Check version and installation:
strandweaver --version
strandweaver --help
Enable verbose logging:
strandweaver pipeline \
--hifi-long-reads reads.fastq \
-o assembly/ \
--log-level DEBUG
Report issues: GitHub Issues
Contact: patrickgsgrady@gmail.com
🗺️ Roadmap
v0.3 — Current Release
| # | Feature | Status | Notes |
|---|---|---|---|
| 1 | K-Weaver Trained Models | ✅ Complete | 4 regression models (DBG, UL overlap, extension, polish) via Optuna + Colab GPU |
| 2 | ErrorSmith Trained Models (v18) | ✅ Complete | 11-family, 17-chemistry, 45-feature two-stage ensemble on HG002/CHM13 (acc=0.949, F1=0.950) |
| 3 | SVScribe Overhaul (v7.2) | ✅ Complete | Two-stage dual-tier XGBoost + LightGBM ensemble, F1-macro 0.557 → 0.957 (S2) |
| 4 | Standalone assemble Command | ✅ Complete | Same assembly engine as pipeline, usable independently |
| 5 | QV Estimation & Gap Filling | ✅ Complete | qv, polish, gap-fill CLI commands wired into _step_finish() |
| 6 | Technology-Specific Subsampling | ✅ Complete | --subsample-{hifi,ont,ont-ul,illumina,ancient} flags |
| 7 | 200× Faster K-mer Extraction | ✅ Complete | Streaming architecture with parallel batch processing |
| 8 | PyPI Packaging | ✅ Complete | pip install strandweaver |
| 9 | Training Notebooks | ✅ Complete | Colab notebooks for all model types |
| 10 | Validate — Reference Comparison | Planned | --reference flag accepted but comparison logic not yet implemented |
| 11 | BUSCO Integration | Planned | --busco-lineage present on validate but not wired |
| 12 | Decontamination Screening | Planned | --decontaminate flag stubbed |
v0.2 Release Notes
| # | Feature | Notes |
|---|---|---|
| 1 | Trained ML Models | XGBoost + GNN for EdgeWarden, PathGNN, DiploidAI, ThreadCompass, SVScribe |
| 2 | DiploidAI Integration | 26-feature phasing wired into HaplotypeDetangler |
| 3 | Bubble-Aware Local Phasing | Genomics audit G2 |
| 4 | Genomics Audit (24 items) | G1–G24 resolved |
| 5 | Git LFS | Model weights tracked |
| 6 | Graph-Only Training | 3.3× faster, 27× less disk |
Future
- Polyploid assembly (--ploidy beyond haploid/diploid)
- PacBio/ONT native metadata detection from BAM/POD5 headers
- Additional ancient DNA damage models beyond deamination
📚 Documentation
- AI Model Training Guide — Model performance, custom training, Colab notebooks
- User Training Module — Synthetic genome generation & parameter reference
📄 License
Dual-licensed: Noncommercial Academic (default) and Commercial. See LICENSE_ACADEMIC.md and LICENSE_COMMERCIAL.md.
- Free for nonprofit academic research at universities and research institutes
- Commercial license required for any for-profit use, industry-funded research, or integration into commercial products/pipelines
- Source-available, not OSI open-source
Contact patrickgsgrady@gmail.com for commercial licensing.
📧 Contact
Patrick Grady | dr.pgrady(at)gmail.com
📈 Citation and References
@software{strandweaver2026,
author = {Grady, Patrick and Green, Rich},
title = {StrandWeaver: AI-Powered Multi-Technology Genome Assembler},
year = {2026},
url = {https://github.com/pgrady1322/strandweaver}
}
1. Zimin AV, Puiu D, Luo MC, Zhu T, Koren S, Yorke JA, Dvorak J, Salzberg S. Hybrid assembly of the large and highly repetitive genome of Aegilops tauschii, a progenitor of bread wheat, with the mega-reads algorithm. Genome Research. 2017 Jan 1:066100.
2. Rautiainen, M., Nurk, S., Walenz, B.P. et al. Telomere-to-telomere assembly of diploid chromosomes with Verkko. Nat Biotechnol 41, 1474–1482 (2023). https://doi.org/10.1038/s41587-023-01662-6
3. Cheng, H., Concepcion, G.T., Feng, X. et al. Haplotype-resolved de novo assembly using phased assembly graphs with hifiasm. Nat Methods 18, 170–175 (2021). https://doi.org/10.1038/s41592-020-01056-5
StrandWeaver 🧬⚡🤖