AI-Powered Multi-Technology Genome Assembler with GPU Acceleration

Maintainers

pgrady1322

StrandWeaver

AI & ML-Powered Multi-Technology Genome Assembler with GPU Acceleration

StrandWeaver is a genome assembly pipeline that applies machine learning to the hardest parts of genome assembly: graph path resolution, error correction, haplotype phasing, and structural variant detection. It combines technology-aware error correction, graph-based assembly with a haplotype-aware graph neural network, iterative polishing, and integrated SV calling into a single pipeline for multimodal sequencing data, including data from the latest Element platforms. The pipeline is GPU-accelerated (NVIDIA CUDA or Apple Silicon MPS) with CPU fallback, and Google Colab notebooks are provided for custom model training.

StrandWeaver is deeply inspired by the incredible work implemented in MaSuRCA (1), Verkko (2), and Hifiasm (3). StrandWeaver adds:

GNN-based haplotype-aware path resolution that simplifies graph topology while strictly protecting biological variation (SNPs, indels, CNVs)
Unified multi-technology error correction architecture spanning ONT, PacBio HiFi, ultra-long reads, and short reads (Illumina, Element, Ultima, PacBio SBB)
ML-optimized k-mer selection for every assembly stage (graph construction, overlap, extension, polishing)
Ancient DNA damage repair with trained deamination models (C→T/G→A)
Assembly-time structural variant detection via XGBoost + LightGBM ensemble
Integrated QV estimation, polishing, and gap filling
One-command custom model training for any organism or sequencing technology

Pre-trained models ship with v0.2+ and load automatically. See the AI Model Training Guide for custom training.

🆕 What's New in v0.3

All 7 AI modules now ship with trained models optimized on real + synthetic data (HG002, CHM13) — download via strandweaver download

ErrorSmith correction optimized for 17 chemistries / platforms

SVScribe F1-macro improved from 0.557 to 0.957 with separate long-read-only and multi-tech (long-read, ultra-long, Hi-C or other proximity-ligation) XGBoost + LightGBM ensembles, per-class threshold tuning, and biology-informed Bayesian priors

Standalone strandweaver <step> commands to skip to different parts of the pipeline

Integrated QV estimation (qv), iterative polishing (polish), and gap filling (gap-fill)

200× faster k-mer spectra extraction

PyPI installable (pip install strandweaver)

Training notebooks included for all model types

✨ Features

🧬 Multi-Technology Support: Illumina HiSeq2500, PacBio Onso/Revio/Sequel II, Element Aviti/UltraQ, Ultima, ONT R9 (Guppy*, ligation/ultra-long, simplex), ONT R10 (Guppy*/Dorado, ligation/ultra-long, simplex/duplex)
🔀 Hybrid Assembly: Combine any mix of platforms in a unified assembly graph — DBG for short reads, OLC for long reads, automatically selected
🧬 Diploid-Aware: Protects SNP-level variation (>99.5% identity threshold), never collapses across haplotype boundaries, and refines iteratively with phasing context fed back to the ML models
🏛️ Ancient DNA Mode: mapDamage2-inspired C→T/G→A deamination repair with configurable confidence thresholds
🛡️ 80-Feature Edge Scoring (EdgeWarden): 26 static + 34 temporal + 20 expanded features; graceful fallback to static if alignment data unavailable
🧠 GNN Path Resolution (PathGNN): GATv2Conv attention network for haplotype-aware graph simplification with strict variation protection
🧵 Ultra-Long Read Routing (ThreadCompass): Multi-start pathfinding with confidence scoring and topology feedback
🔀 Hi-C Scaffolding & Phasing (HaplotypeDetangler): Spectral clustering on Hi-C contact matrices for chromosome-scale haplotype separation
🔍 Assembly-Time SV Detection (SVScribe): DEL/INS/INV/DUP/TRA calls from graph topology + UL spanning + Hi-C validation
📊 Integrated QV, Polishing & Gap Filling: End-to-end finishing built into the pipeline
📄 Rich Output: GFA graphs, BandageNG visualization, N50/L50/QV stats, VCF/JSON SV calls, phasing info, IGV/UCSC tracks, chromosome classification
🔌 Modular: All AI modules can be disabled (--classical) for heuristic-only assembly
🧪 Custom Training: CLI commands to generate synthetic training data and retrain all models for your organism

* We recommend against using Guppy-basecalled data, although models are provided. The Guppy error profile is very difficult to train on, so these models make binary calls only ("error" or "correct base") rather than determining the type of error. The pipeline outputs "best guess" corrected reads, but we strongly recommend either hard-masking the flagged errors or treating them very carefully.
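The hard-masking recommended above can be sketched in a few lines. This is an illustration only, not StrandWeaver's own code: the helper name `hard_mask` and the assumption that flagged errors arrive as a list of 0-based positions are made up for the example.

```python
def hard_mask(sequence: str, error_positions: list, mask_char: str = "N") -> str:
    """Replace every flagged base with mask_char (hypothetical helper)."""
    bases = list(sequence)
    for pos in error_positions:
        if 0 <= pos < len(bases):  # silently skip out-of-range indices
            bases[pos] = mask_char
    return "".join(bases)


# Positions 1 and 4 were flagged as errors by a binary error/correct model.
print(hard_mask("ACGTACGT", [1, 4]))  # ANGTNCGT
```

Masking with N keeps read length and coordinates intact, so downstream steps can simply ignore the uncertain bases instead of trusting a guessed correction.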

🎯 Model Performance

See the AI Model Training Guide for full per-class breakdowns, per-family ErrorSmith metrics, and training details.

Module	Task	Accuracy / R²	F1-macro	CV (5-fold)
🛡️ EdgeWarden	Edge quality scoring (per-technology)	0.881	0.896	0.878 ± 0.002
🔧 ErrorSmith (v18)	Per-base error classification (11 families, 17 chemistries)	0.949	0.950	per-family (see Training Guide)
🧬 PathGNN	Graph-aware edge classification	0.897	0.897	0.897 ± 0.001
🔀 DiploidAI	Haplotype phasing (26 features)	0.862	0.862	0.858 ± 0.001
🧵 ThreadCompass	Ultra-long read routing	R²=0.997	—	R²=0.997 ± 0.0003
🔍 SVScribe (v7.2)	Two-stage dual-tier SV detection	S1 F1=0.991	S2 F1=0.957	S2 min-class=0.876
🧠 K-Weaver (DBG)	De Bruijn graph k-mer selection	R²=0.863	—	0.863 ± 0.064
🧠 K-Weaver (UL Overlap)	Ultra-long overlap k-mer selection	R²=0.982	—	0.982 ± 0.020
🧠 K-Weaver (Extension)	Contig extension k-mer selection	R²=0.849	—	0.849 ± 0.074
🧠 K-Weaver (Polish)	Polishing k-mer selection	R²=0.881	—	0.881 ± 0.067
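For readers unfamiliar with the F1-macro column: it is the unweighted mean of the per-class F1 scores, so rare classes count as much as common ones. The sketch below computes it from scratch; the function name and the toy SV labels are illustrative and are not taken from the StrandWeaver training code.

```python
def f1_macro(y_true, y_pred):
    """Unweighted mean of per-class F1 over all classes seen in either list."""
    classes = sorted(set(y_true) | set(y_pred))
    scores = []
    for c in classes:
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == c and p == c)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != c and p == c)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == c and p != c)
        denom = 2 * tp + fp + fn
        # F1 = 2TP / (2TP + FP + FN); defined as 0 when the class never occurs
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


truth = ["DEL", "DEL", "INS", "INV"]
calls = ["DEL", "INS", "INS", "INV"]
print(round(f1_macro(truth, calls), 4))
```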

Highlights of Training Results

For full figure sets, see Training Doc.

SV Call Matrix

ErrorSmith Current Chemistry Training F1s

ErrorSmith Current Chemistry Substitution Precision and Recall

🔧 Installation

Requirements

Python 3.9+
8 GB RAM minimum (32+ GB recommended for large genomes)
Disk space: 50-100 GB for intermediate files (genome-dependent)

Dependencies

StrandWeaver uses a modular dependency system with three installation tiers:

Core Dependencies (Always Installed):

Bioinformatics: biopython>=1.79, pysam>=0.19.0
Numerical: numpy>=1.21.0, scipy>=1.9.0, pandas>=1.3.0
Graph Processing: networkx>=2.6.0
Hi-C/Phasing: scipy>=1.9.0, scikit-learn>=1.3.0
CLI/IO: click>=8.0.0, pyyaml>=6.0, tqdm>=4.62.0, h5py>=3.5.0
Performance: numba>=0.54.0, joblib>=1.1.0

AI/ML Dependencies (Optional - [ai] flag):

PyTorch: torch>=2.0.0 (CPU or GPU support)
Graph Neural Networks: pytorch-geometric>=2.3.0
Gradient Boosting: xgboost>=2.0.0

These are required for:

Loading pre-trained models (shipped with v0.2+)
Custom model training
GPU-accelerated assembly (if CUDA available)

Development Dependencies (Optional - [dev] flag):

Testing: pytest>=7.0.0, pytest-cov>=4.0.0
Visualization: matplotlib>=3.4.0, seaborn>=0.11.0
Documentation: Development tools for contributors

GPU Support:

CUDA 11.8+ for NVIDIA GPUs (automatic via PyTorch)
MPS backend for Apple Silicon (automatic in macOS 12.3+)
CPU fallback if no GPU detected

Platform Compatibility:

✅ Linux (x86_64, ARM64)
✅ macOS (Intel, Apple Silicon with MPS acceleration)
✅ Windows (via WSL2 recommended)

Install from GitHub (Recommended)

StrandWeaver has many dependencies, especially if you plan on installing the AI/ML training dependencies, so installing into a virtual environment is highly recommended. All testing was performed in Python venvs, but Conda should work equally well.

# Basic installation
pip install git+https://github.com/pgrady1322/strandweaver.git

# Recommended: Complete installation with all dependencies
pip install "git+https://github.com/pgrady1322/strandweaver.git#egg=strandweaver[all]"

# With AI/training dependencies (PyTorch, XGBoost)
pip install "git+https://github.com/pgrady1322/strandweaver.git#egg=strandweaver[ai]"

# With development and testing dependencies (pytest, matplotlib)
pip install "git+https://github.com/pgrady1322/strandweaver.git#egg=strandweaver[dev]"

🚀 Quick Start

StrandWeaver offers two execution modes:

Mode	Usage	Best For
Direct	`strandweaver <command> [options]`	Small/medium datasets, local workstation, testing
Nextflow	`strandweaver <command> [options] --nextflow`	Large datasets, HPC clusters, parallel processing

Python CLI Mode

Basic Long Read Assembly Example (PacBio)

strandweaver pipeline \
  --hifi-long-reads hifi_reads.fastq \
  -o assembly_output/ \
  -t 8

Hybrid Assembly with Multiple Ultra-Long Read Types

Note: The --ont-ul flag is used for path-finding reads. Any platform of long reads can be provided, but shorter long reads will degrade the assembly. The --ont-ul name is retained for clarity / comparison with other assemblers.

strandweaver pipeline \
  --hifi-long-reads hifi_reads.fastq \
  --ont-long-reads ont_reads.fastq.gz \
  --ont-ul ultralong_reads.fastq \
  -o assembly_output/ \
  -t 16

Mixed Technology Assembly with Hi-C

Note: Data from any proximity-ligation platform can be provided. StrandWeaver optimizes for Hi-C and Omni-C just as well as for Pore-C and CiFi.

strandweaver pipeline \
  --hifi-long-reads hifi_reads.fastq \
  --ont-ul ultralong_reads.fastq \
  --hic-r1 hic_R1.fastq --hic-r2 hic_R2.fastq \
  -o assembly_output/ \
  -t 16

Nextflow Mode (HPC/Cloud)

Local Execution

nextflow run strandweaver/nextflow/main.nf \
  --hifi hifi_reads.fastq \
  --ont ont_reads.fastq \
  --hic_r1 hic_R1.fastq \
  --hic_r2 hic_R2.fastq \
  --outdir results/ \
  -profile local

SLURM Cluster with Singularity

nextflow run strandweaver/nextflow/main.nf \
  --hifi hifi_reads.fastq \
  --ont ont_reads.fastq \
  --ont_ul ultralong_reads.fastq \
  --hic_r1 hic_R1.fastq \
  --hic_r2 hic_R2.fastq \
  --outdir results/ \
  -profile slurm,singularity \
  -resume

Huge Genome Mode (Plants, Insects) (parallel k-mer extraction)

nextflow run strandweaver/nextflow/main.nf \
  --hifi hifi_reads.fastq \
  --huge \
  --outdir results/ \
  -profile slurm

See nextflow/README.md for complete Nextflow documentation.

Nextflow Profiles

Profile	Description	Use Case
`local`	Direct execution on local machine	Testing, small datasets
`docker`	Docker containerization	Reproducibility, dependency isolation
`singularity`	Singularity containers	HPC clusters (no root required)
`slurm`	SLURM cluster scheduler	HPC parallel processing
`test`	Use synthetic E. coli data	Quick validation

Combine profiles with commas: -profile slurm,singularity

Error Correction

# Direct mode
strandweaver correct --hifi reads.fq.gz -o corrected/ -t 8

# Nextflow mode (automatic parallelization)
strandweaver correct --hifi reads.fq.gz -o corrected/ \
  --nextflow --nf-profile slurm --correction-batch-size 100000

ErrorSmith Chemistry Designation

ErrorSmith uses chemistry-aware models trained on 17 sequencing platform / chemistry combinations. Specify the chemistry by appending :chemistry_id to the reads path, or by passing the corresponding chemistry flag immediately after the data flag. Both forms are equivalent — use whichever is clearer for your command. If your exact kit / flow cell combination is not listed, pick the closest available model. When no chemistry is specified the default is applied automatically.

# Colon syntax (preferred — keeps reads + chemistry together)
--flag reads.fastq:chemistry_id

# Flag syntax (equivalent)
--flag reads.fastq --flag-chemistry chemistry_id

Read flag	Chemistry flag	Available chemistries	Default
`--hifi-long-reads`	`--hifi-chemistry`	`pacbio_hifi_sequel2`, `pacbio_hifi_revio`, `pacbio_hifi_vega`	`pacbio_hifi_revio`
`--ont-long-reads`	`--ont-chemistry`	`ont_lsk110_r941`, `ont_lsk114_r1041`, `ont_r1041_duplex`, `ont_herro_corrected`	`ont_lsk114_r1041`
`--ont-ul`	`--ont-ul-chemistry`	`ont_ulk001_r941`, `ont_ulk114_r1041`, `ont_ulk114_r1041_hiacc`, `ont_ulk114_r1041_dorado`	`ont_ulk114_r1041`
`-r1` / `-r2`	`--short-read-chemistry`	`illumina_hiseq2500`, `pacbio_onso`, `element_aviti`, `element_aviti_lng`, `element_ultraq`, `ultima_ug100`	`illumina_hiseq2500`

Short reads: the colon suffix can go on either -r1 or -r2 — StrandWeaver applies the same chemistry model to the pair. Alternatively, use the standalone --short-read-chemistry flag.
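To make the colon syntax concrete, here is a minimal sketch of how a `path:chemistry_id` value could be split while falling back to the default. It is not StrandWeaver's actual parser: `split_read_spec` is a hypothetical helper, and only the chemistry IDs are taken from the tables in this section.

```python
# Chemistry IDs as listed in this README.
KNOWN_CHEMISTRIES = {
    "pacbio_hifi_sequel2", "pacbio_hifi_revio", "pacbio_hifi_vega",
    "ont_lsk110_r941", "ont_lsk114_r1041", "ont_r1041_duplex",
    "ont_herro_corrected", "ont_ulk001_r941", "ont_ulk114_r1041",
    "ont_ulk114_r1041_hiacc", "ont_ulk114_r1041_dorado",
    "illumina_hiseq2500", "pacbio_onso", "element_aviti",
    "element_aviti_lng", "element_ultraq", "ultima_ug100",
}


def split_read_spec(spec: str, default: str):
    """Return (path, chemistry) for 'reads.fastq:chemistry_id' or plain 'reads.fastq'."""
    path, sep, suffix = spec.rpartition(":")
    # Only treat the suffix as a chemistry if it is a known ID, so paths that
    # contain a colon for other reasons (e.g. 'C:\\data\\reads.fastq') survive.
    if sep and suffix in KNOWN_CHEMISTRIES:
        return path, suffix
    return spec, default


print(split_read_spec("hifi.fastq:pacbio_hifi_revio", "pacbio_hifi_revio"))
print(split_read_spec("ont.fastq", "ont_lsk114_r1041"))
```

Splitting on the last colon and validating the suffix is one reasonable design; a real parser might instead reject unknown IDs with an error.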

Multi-chemistry support. When an assembly combines reads from more than one chemistry within the same read type (e.g. ONT R9 + R10 datasets, or multiple ultra-long preps), pass the read flag once per dataset:

# Two ONT chemistries — colon syntax
--ont-long-reads r9.fastq:ont_lsk110_r941 \
--ont-long-reads r10.fastq:ont_lsk114_r1041

# Same thing — flag syntax
--ont-long-reads r9.fastq --ont-chemistry ont_lsk110_r941 \
--ont-long-reads r10.fastq --ont-chemistry ont_lsk114_r1041

# Two ultra-long preps with different basecallers
--ont-ul ul_standard.fastq:ont_ulk114_r1041 \
--ont-ul ul_dorado.fastq:ont_ulk114_r1041_dorado

ErrorSmith applies the corresponding chemistry-specific error model to each read set independently, then merges the corrected outputs before downstream graph construction.

Chemistry → platform mapping reference

Chemistry ID	Platform	Read Type	Kit / Flow Cell
`pacbio_hifi_sequel2`	PacBio Sequel II	HiFi long	CCS
`pacbio_hifi_revio`	PacBio Revio	HiFi long	CCS
`pacbio_hifi_vega`	PacBio Vega	HiFi long	CCS
`ont_lsk110_r941`	ONT R9.4.1	Ligation long	LSK110
`ont_lsk114_r1041`	ONT R10.4.1	Ligation long	LSK114
`ont_r1041_duplex`	ONT R10.4.1	Duplex long	LSK114 duplex
`ont_herro_corrected`	ONT (Herro-corrected)	Long (corrected)	Herro error-corrected reads
`ont_ulk001_r941`	ONT R9.4.1	Ultra-long	ULK001
`ont_ulk114_r1041`	ONT R10.4.1	Ultra-long	ULK114
`ont_ulk114_r1041_hiacc`	ONT R10.4.1	Ultra-long (HiAcc)	ULK114, high-accuracy mode
`ont_ulk114_r1041_dorado`	ONT R10.4.1	Ultra-long (Dorado)	ULK114, Dorado basecaller
`illumina_hiseq2500`	Illumina HiSeq 2500	Short	TruSeq
`pacbio_onso`	PacBio Onso	Short	SBB
`element_aviti`	Element Aviti	Short	AVITI chemistry
`element_aviti_lng`	Element Aviti	Long	AVITI long-read chemistry
`element_ultraq`	Element UltraQ	Short	UltraQ chemistry
`ultima_ug100`	Ultima Genomics UG100	Short	UG100 flow chemistry

# Example: Revio HiFi + ONT duplex + Dorado ultra-long (colon syntax)
strandweaver pipeline \
  --hifi-long-reads hifi.fastq:pacbio_hifi_revio \
  --ont-long-reads duplex.bam:ont_r1041_duplex \
  --ont-ul ultralong.bam:ont_ulk114_r1041_dorado \
  -o assembly_output/ -t 16

# Example: Element Aviti short reads + ONT ultra-long (colon on -r1)
strandweaver pipeline \
  -r1 aviti_R1.fastq:element_aviti -r2 aviti_R2.fastq \
  --ont-ul ultralong.fastq \
  -o assembly_output/ -t 16

# Same short-read example using the flag syntax instead
strandweaver pipeline \
  -r1 aviti_R1.fastq -r2 aviti_R2.fastq --short-read-chemistry element_aviti \
  --ont-ul ultralong.fastq \
  -o assembly_output/ -t 16

# Example: mixed ONT chemistries (R9 + R10 reads in one assembly)
strandweaver pipeline \
  --ont-long-reads r9_reads.fastq:ont_lsk110_r941 \
  --ont-long-reads r10_reads.fastq:ont_lsk114_r1041 \
  --hifi-long-reads hifi.fastq \
  -o assembly_output/ -t 16

The profile and batch-correct commands accept a single --chemistry flag with any of the 17 values above.

Individual Processing Commands

StrandWeaver provides standalone commands for each processing stage, for cases where you only want corrected reads or read error profiles. It also supports mapping reads to GFA graphs and calling SVs on GFA graphs. Each command supports both direct and Nextflow execution.

Error Correction

# HiFi + ONT correction with chemistry specified (colon syntax)
strandweaver correct \
  --hifi-long-reads hifi.fastq:pacbio_hifi_revio \
  --ont-long-reads ont.fastq:ont_lsk114_r1041 \
  -o corrected/ -t 16

# Mixed ONT chemistries with Nextflow parallelization (flag syntax)
strandweaver correct \
  --ont-long-reads r9.fastq --ont-chemistry ont_lsk110_r941 \
  --ont-long-reads r10.fastq --ont-chemistry ont_lsk114_r1041 \
  -o corrected/ \
  --nextflow --nf-profile slurm --correction-batch-size 100000

K-mer Extraction

# Direct mode
strandweaver extract-kmers --hifi reads.fq.gz -k 31 -o kmers.pkl -t 8

# Nextflow mode (huge genomes >10GB)
strandweaver extract-kmers --hifi reads.fq.gz -k 31 -o kmers.pkl \
  --nextflow --nf-profile slurm --kmer-batch-size 2000000
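As background for what k-mer extraction produces, the sketch below counts canonical k-mers and derives a k-mer spectrum (how many distinct k-mers occur once, twice, and so on). It is a plain in-memory illustration with made-up function names and says nothing about StrandWeaver's streaming implementation or its `kmers.pkl` format.

```python
from collections import Counter

_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def canonical(kmer: str) -> str:
    """Lexicographically smaller of a k-mer and its reverse complement."""
    return min(kmer, kmer.translate(_COMPLEMENT)[::-1])


def kmer_spectrum(reads, k: int) -> Counter:
    """Map multiplicity -> number of distinct canonical k-mers seen that often."""
    counts = Counter()
    for read in reads:
        seq = read.upper()
        for i in range(len(seq) - k + 1):
            kmer = seq[i:i + k]
            if "N" not in kmer:  # skip windows containing ambiguous bases
                counts[canonical(kmer)] += 1
    return Counter(counts.values())


print(dict(kmer_spectrum(["ACGTA", "ACGTA"], 3)))
```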

Edge Scoring

# Direct mode
strandweaver nf-score-edges -e edges.json -a aligns.bam -o scored.json -t 8

# Nextflow mode (large graphs)
strandweaver nf-score-edges -e edges.json -a aligns.bam -o scored.json \
  --nextflow --nf-profile slurm --edge-batch-size 10000

Ultra-Long Read Mapping

# Direct mode
strandweaver map-ul -u ul_reads.fq.gz -g graph.gfa -o aligns.paf --use-gpu

# Nextflow mode
strandweaver map-ul -u ul_reads.fq.gz -g graph.gfa -o aligns.paf \
  --nextflow --nf-profile slurm --use-gpu --ul-batch-size 100

Hi-C Alignment

# Direct mode
strandweaver align-hic --hic-r1 R1.fq.gz --hic-r2 R2.fq.gz \
  -g graph.gfa -o aligns.bam -t 8

# Nextflow mode (large Hi-C datasets)
strandweaver align-hic --hic-r1 R1.fq.gz --hic-r2 R2.fq.gz \
  -g graph.gfa -o aligns.bam \
  --nextflow --nf-profile slurm --hic-batch-size 500000

Structural Variant Detection

# Direct mode
strandweaver nf-detect-svs -g graph.gfa -o variants.vcf -t 8

# Nextflow mode (large graphs)
strandweaver nf-detect-svs -g graph.gfa -o variants.vcf \
  --nextflow --nf-profile slurm --sv-batch-size 1000

Performance Guidelines

When to use Direct mode: Dataset < 10GB, local workstation with 8+ cores, testing and debugging.

When to use Nextflow mode: Dataset > 10GB, HPC cluster available, need resume capability, want automatic parallelization.

Command	Direct (1 node)	Nextflow (20 nodes)	Speedup
`correct`	20 hours	2 hours	10×
`extract-kmers`	8 hours	1.5 hours	5×
`nf-score-edges`	8 hours	1.5 hours	5×
`map-ul`	6 hours	1 hour	6×
`align-hic`	10 hours	1.5 hours	7×
`nf-detect-svs`	4 hours	1 hour	4×

🎯 Use Cases

Machine-Learning-Tuned Genome Assembly with SV Calls

Combine ONT, HiFi, ultra-long reads, and Hi-C for chromosome-scale phased assemblies:

strandweaver pipeline \
  --ont-long-reads ont.fastq \
  --hifi-long-reads hifi.fastq \
  --ont-ul ultralong.fastq \
  --hic-r1 hic_R1.fastq --hic-r2 hic_R2.fastq \
  --use-ai \
  -o genome_assembly/ \
  -t 32

Ancient DNA Assembly

Optimize for deamination damage with specialized error correction:

strandweaver pipeline \
  -r1 ancient_reads.fastq --technology1 ancient \
  -o ancient_assembly/ \
  -t 16

Note that the assembly can also be run WITHOUT damage awareness features for comparison.
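The damage signal this mode targets can be illustrated with a small profile calculation: in typical double-stranded ancient DNA libraries, C→T mismatches pile up near the 5' end of reads and G→A mismatches near the 3' end. The sketch below assumes ungapped, equal-length (reference, read) pairs on the read strand; the function name and input format are assumptions for the example, not StrandWeaver or mapDamage2 code.

```python
def damage_profile(pairs, window: int = 5):
    """Per-position C->T rate from the 5' end and G->A rate from the 3' end."""
    ct_hits, c_sites = [0] * window, [0] * window
    ga_hits, g_sites = [0] * window, [0] * window
    for ref, read in pairs:
        if len(ref) != len(read):
            raise ValueError("reference and read must have equal length")
        n = len(ref)
        for i in range(min(window, n)):
            if ref[i] == "C":          # i bases in from the 5' end
                c_sites[i] += 1
                ct_hits[i] += int(read[i] == "T")
            j = n - 1 - i              # i bases in from the 3' end
            if ref[j] == "G":
                g_sites[i] += 1
                ga_hits[i] += int(read[j] == "A")
    ct = [h / s if s else 0.0 for h, s in zip(ct_hits, c_sites)]
    ga = [h / s if s else 0.0 for h, s in zip(ga_hits, g_sites)]
    return ct, ga


pairs = [("CACGG", "TACGA"), ("CCAGG", "CCAGG")]
print(damage_profile(pairs, window=2))
```

Rates that fall off quickly with distance from the read ends suggest authentic deamination rather than ordinary sequencing error.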

SV-Rich Genome Analysis

Detect structural variants during assembly for cancer or population genomics:

strandweaver pipeline \
  --hifi-long-reads tumor.fastq \
  --hic-r1 hic_R1.fastq --hic-r2 hic_R2.fastq \
  --min-sv-size 30 \
  -o tumor_assembly/ \
  -t 24

Highly Heterozygous Diploid Assembly

Maintain haplotype separation for F1 hybrids or outcrossing species:

strandweaver pipeline \
  --hifi-long-reads hifi.fastq \
  --hic-r1 hic_R1.fastq --hic-r2 hic_R2.fastq \
  --ploidy diploid \
  --edge-filter-mode strict \
  -o diploid_assembly/ \
  -t 32

🔬 Pipeline Reference

Preprocessing

Classify: Auto-detect sequencing technologies from FASTQ headers (supports ONT chemistry detection with LongBow)
KWeaver: ML-based k-mer optimization with rule-based fallback for dynamic k-mer selection
Profile: Error pattern profiling (substitutions, indels, homopolymers) with visualization
Correct: Technology-aware read correction (ONTCorrector, PacBioCorrector)

Core Assembly

Graph Building: Graph construction (type of graph based on read type) from reads with streaming architecture
EdgeWarden: AI-powered graph edge filtering with 80-feature scoring
PathWeaver: GNN-based haplotype-aware path resolution with variation protection
StringGraph: Ultra-long read overlay for long-range connections
ThreadCompass: UL read routing optimization with trained models
Hi-C Integration: Proximity ligation contact matrix construction and edge addition
HaplotypeDetangler: Hi-C-augmented phasing via spectral clustering
Iteration: 3+ refinement cycles with phasing-aware filtering
SVScribe: Graph-based structural variant detection (DEL, INS, INV, DUP, TRA)
Iterate or Finalize: Contig and scaffold extraction with comprehensive statistics, or pass graph with feature scoring to pipeline for iteration 2+.
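The spectral clustering idea behind Hi-C phasing can be sketched with the simplest case, a two-way split using the sign of the Fiedler vector of the normalized graph Laplacian. This is a toy under stated assumptions (a small dense, symmetric contact matrix and exactly two groups); the function name is hypothetical and the real HaplotypeDetangler may work quite differently.

```python
import numpy as np


def spectral_bipartition(contacts) -> np.ndarray:
    """Label each node 0 or 1 from the sign of the Fiedler vector."""
    w = np.asarray(contacts, dtype=float)
    w = (w + w.T) / 2.0                 # enforce symmetry
    np.fill_diagonal(w, 0.0)            # ignore self-contacts
    degree = w.sum(axis=1)
    inv_sqrt = np.zeros_like(degree)
    connected = degree > 0
    inv_sqrt[connected] = 1.0 / np.sqrt(degree[connected])
    # Normalized Laplacian: L = I - D^(-1/2) W D^(-1/2)
    laplacian = np.eye(len(w)) - inv_sqrt[:, None] * w * inv_sqrt[None, :]
    _, vectors = np.linalg.eigh(laplacian)   # eigenvalues in ascending order
    fiedler = vectors[:, 1]                  # second-smallest eigenvalue
    return (fiedler > 0).astype(int)


# Two groups of three contigs: strong contacts within, weak contacts between.
strong, weak = 10.0, 1.0
matrix = np.full((6, 6), weak)
matrix[:3, :3] = strong
matrix[3:, 3:] = strong
print(spectral_bipartition(matrix))
```

Which group receives label 0 or 1 is arbitrary, because an eigenvector's sign is not determined; only the grouping is meaningful.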

Post-Assembly Analysis

Misassembly Report: Putative misassembly detection using multi-signal evidence (EdgeWarden confidence, coverage discontinuities, UL read conflicts, Hi-C violations). Outputs TSV and BED reports for genome-browser visualization.
- --misassembly-report / --no-misassembly-report — Enabled by default
- --misassembly-min-confidence HIGH|MEDIUM|LOW — Minimum confidence to flag (default: MEDIUM)
- --misassembly-format tsv,bed,json — Comma-separated output formats (default: tsv,bed)
Chromosome Classification: Multi-tier scaffold classification to identify chromosomes vs. assembly artifacts.
- Tier 1 (always): Length, coverage, GC, connectivity, telomere detection
- Tier 2 (always): Gene content analysis (ORF / BLAST / Augustus / BUSCO)
- Tier 3 (--id-chromosomes-advanced): Hi-C self-contact patterns, synteny
- Telomere flags: --telomere-sequence (default: TTAGGG), --telomere-min-units (default: 10), --telomere-search-depth (default: 5000 bp)
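The telomere flags above map naturally onto a tandem-repeat search at each scaffold end. The following sketch reuses the documented defaults (TTAGGG, 10 units, 5000 bp) but is otherwise an assumption: the function names are hypothetical, and searching for both the motif and its reverse complement is a choice made for this example.

```python
import re

_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def has_telomere(region: str, motif: str = "TTAGGG", min_units: int = 10) -> bool:
    """True if region holds >= min_units tandem copies of motif on either strand."""
    for unit in (motif, motif.translate(_COMPLEMENT)[::-1]):
        pattern = f"(?:{re.escape(unit)}){{{min_units},}}"
        if re.search(pattern, region):
            return True
    return False


def telomere_ends(scaffold: str, motif: str = "TTAGGG",
                  min_units: int = 10, search_depth: int = 5000):
    """Return (start_has_telomere, end_has_telomere) for one scaffold."""
    seq = scaffold.upper()
    # For scaffolds shorter than search_depth the two windows overlap.
    return (has_telomere(seq[:search_depth], motif, min_units),
            has_telomere(seq[-search_depth:], motif, min_units))


scaffold = "CCCTAA" * 12 + "ACGT" * 100 + "TTAGGG" * 3
print(telomere_ends(scaffold, search_depth=100))  # (True, False)
```

Here the start carries 12 tandem CCCTAA units (the reverse complement of TTAGGG), while the end has only 3 units and falls below the 10-unit minimum.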

Output Generation

GFA Export: Assembly graphs in GFA format with sequences
BandageNG: Visualization files with coverage tracks and final StrandWeaver scores in the 0-1 range (long/UL/Hi-C).
Statistics: N50, L50, coverage metrics, variation protection counts
SV Calls: Structural variants in VCF and JSON formats
Phasing Info: Haplotype assignments and confidence scores
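For reference, N50 and L50 follow the standard definitions: sort contigs from longest to shortest and walk down until the running total reaches half the assembly size. N50 is the length of the contig where that happens and L50 is how many contigs it took. The helper below is a generic sketch, not code from StrandWeaver.

```python
def n50_l50(lengths):
    """Return (N50, L50) for a list of contig lengths."""
    ordered = sorted(lengths, reverse=True)
    half = sum(ordered) / 2
    running = 0
    for count, length in enumerate(ordered, start=1):
        running += length
        if running >= half:
            return length, count
    raise ValueError("no contig lengths given")


print(n50_l50([80, 70, 50, 40, 30, 20]))  # (70, 2)
```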

Post-Assembly CLI Options Reference

Flag	Default	Description
`--misassembly-report` / `--no-misassembly-report`	Enabled	Generate misassembly report (TSV + BED)
`--misassembly-min-confidence`	`MEDIUM`	Minimum confidence for flags: `HIGH`, `MEDIUM`, `LOW`
`--misassembly-format`	`tsv,bed`	Comma-separated output formats: `tsv`, `bed`, `json`
`--id-chromosomes`	Off	Enable scaffold → chromosome classification (Tiers 1-2)
`--id-chromosomes-advanced`	Off	Add Hi-C pattern analysis & synteny (Tier 3)
`--gene-detection-method`	`orf`	Gene detection: `orf` (no deps), `blast`, `augustus`, `busco`
`--blast-db`	`nr`	BLAST database for gene detection
`--telomere-sequence`	`TTAGGG`	Telomere repeat motif. Alternatives: `TTTAGGG` (plants), `TTAGG` (insects)
`--telomere-min-units`	`10`	Minimum tandem repeats to call a telomere
`--telomere-search-depth`	`5000`	Base-pairs to search at each scaffold end

🤖 Custom Training

Pre-trained models load automatically. Custom training is only needed for organism-specific optimization (extreme repeat content, unusual ploidy, novel sequencing chemistries).

# 1. Generate training data (graph-only mode: ~7 min/genome, ~8.5 MB/genome)
strandweaver train generate-data \
  --genome-size 1000000 -n 200 \
  --graph-training --graph-only \
  -o training_data/

# 2. Train all models with cross-validation
strandweaver train run --data-dir training_data/ -o trained_models/

# 3. Assemble (custom models load automatically, or specify --model-dir)
strandweaver pipeline --hifi-long-reads reads.fastq.gz -o assembly/

See trained_models/TRAINING.md for the full training guide, Colab GPU notebooks, hyperparameter search spaces, and per-class performance breakdowns. See strandweaver/user_training/README.md for the synthetic genome generation parameter reference.

Output Files

output/
├── contigs.fasta                  # Primary assembly contigs
├── final_assembly.fasta           # Polished, length-filtered contigs
├── scaffolds.fasta                # Hi-C scaffolded sequences
├── assembly_graph.gfa             # Assembly graph (GFA format)
├── assembly_stats.json            # N50, L50, QV, coverage statistics
├── misassembly_report.tsv         # Putative misassemblies (tab-delimited)
├── misassembly_report.bed         # Misassemblies (genome browser BED)
├── chromosome_classification.json # Scaffold → chromosome classification
├── sv_calls.vcf                   # Structural variant calls
├── phasing_info.json              # Haplotype assignments
├── coverage_long.csv              # Long read coverage (BandageNG)
├── coverage_ul.csv                # Ultra-long coverage (BandageNG)
├── coverage_hic.csv               # Hi-C support (BandageNG)
├── kmer_predictions.json          # K-mer optimization results
├── error_profile_<tech>_<n>.json  # Per-technology error profiles
└── pipeline.log                   # Complete execution log
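A quick way to inspect sv_calls.vcf is to tally calls by type. The sketch below reads generic VCF text and assumes each record carries an `SVTYPE` key in its INFO column, which is the usual VCF convention for structural variants but is not confirmed here for StrandWeaver's output. The function name is illustrative.

```python
from collections import Counter


def count_sv_types(vcf_lines) -> Counter:
    """Count records per SVTYPE from an iterable of VCF lines."""
    counts = Counter()
    for line in vcf_lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and header lines
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 8:
            continue  # not a complete VCF record
        info = dict(item.split("=", 1) for item in fields[7].split(";") if "=" in item)
        counts[info.get("SVTYPE", "UNKNOWN")] += 1
    return counts


example = [
    "##fileformat=VCFv4.2\n",
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\n",
    "ctg1\t100\tsv1\tN\t<DEL>\t.\tPASS\tSVTYPE=DEL;SVLEN=-500\n",
    "ctg1\t900\tsv2\tN\t<INS>\t.\tPASS\tSVTYPE=INS;SVLEN=120\n",
    "ctg2\t50\tsv3\tN\t<DEL>\t.\tPASS\tSVTYPE=DEL;SVLEN=-80\n",
]
print(dict(count_sv_types(example)))
```

To run it on real output, pass an open file handle, for example the result of `open("output/sv_calls.vcf")`.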

Troubleshooting

Installation Issues

Problem: ModuleNotFoundError or import errors

# Solution: Reinstall with all dependencies
pip install --force-reinstall "git+https://github.com/pgrady1322/strandweaver.git#egg=strandweaver[all]"

Problem: Python version incompatibility

# Check Python version (requires 3.9+)
python3 --version

# Create conda environment with correct Python version
conda create -n strandweaver python=3.11
conda activate strandweaver
pip install "git+https://github.com/pgrady1322/strandweaver.git#egg=strandweaver[all]"

Problem: PyTorch/GPU issues

# For CUDA support, install PyTorch separately first
pip install torch --index-url https://download.pytorch.org/whl/cu118
pip install "git+https://github.com/pgrady1322/strandweaver.git#egg=strandweaver[all]"

# Verify GPU detection
python3 -c "import torch; print(f'GPU available: {torch.cuda.is_available()}')"

Assembly Quality Issues

Problem: Low N50 or fragmented assembly

Check error rate & coverage: Aim for 30×+ HiFi or 50×+ ONT

strandweaver profile -i reads.fastq -o profile.json

Add ultra-long reads, or a subset of your LONGEST READS: Dramatically improves contiguity

strandweaver pipeline --hifi-long-reads hifi.fastq --ont-ul ultralong.fastq -o improved/

Enable Hi-C scaffolding: For chromosome-scale assemblies

strandweaver pipeline --hifi-long-reads hifi.fastq --hic-r1 hic_R1.fastq --hic-r2 hic_R2.fastq -o scaffolded/

Problem: Collapsed heterozygous regions

# Use diploid mode with strict edge filtering to preserve variation
strandweaver pipeline \
  --hifi-long-reads reads.fastq \
  --ploidy diploid \
  --edge-filter-mode strict \
  -o diploid_assembly/

Problem: Assembly produces too many contigs (over-fragmented)

Reduce edge filtering stringency:

strandweaver pipeline --hifi-long-reads reads.fastq --edge-filter-mode lenient -o assembly/

Increase k-mer size: For high-coverage, low-error data

strandweaver pipeline --hifi-long-reads reads.fastq --kmer-size-assembly 51 -o assembly/

Problem: Assembly is chimeric or has misassemblies

Enable stricter filtering:

strandweaver pipeline --hifi-long-reads reads.fastq --edge-filter-mode strict -o assembly/

Add Hi-C validation: Long-range contact validation prevents chimeras

strandweaver pipeline --hifi-long-reads hifi.fastq --hic-r1 hic_R1.fastq --hic-r2 hic_R2.fastq -o validated/

Performance & Resource Issues

Problem: Out of memory (OOM) errors

# Limit memory usage
strandweaver pipeline \
  --hifi-long-reads reads.fastq \
  -o assembly/ \
  --memory-limit 16

# Or reduce graph coverage via sampling
strandweaver pipeline \
  --hifi-long-reads reads.fastq \
  -o assembly/ \
  --sample-size-graph 500000

Problem: Assembly is too slow

# Increase threads (use all available cores)
strandweaver pipeline --hifi-long-reads reads.fastq -t $(nproc) -o assembly/

# Disable AI features for faster heuristic-only assembly
strandweaver pipeline --hifi-long-reads reads.fastq --classical -o fast_assembly/

# Skip profiling step if reads are already well-characterized
strandweaver pipeline --hifi-long-reads reads.fastq --skip-profiling -o quick_assembly/

Problem: Disk space issues

# Export only FASTA (skip GFA graphs to save space)
strandweaver pipeline \
  --hifi-long-reads reads.fastq \
  -o assembly/ \
  --output-format fasta

# Use a separate output directory on a larger drive
strandweaver pipeline \
  --hifi-long-reads reads.fastq \
  -o /mnt/large_drive/assembly/

Input Data Issues

Problem: "Unsupported file format" error

# StrandWeaver accepts: FASTQ, FASTA, gzipped variants
# Check file format
file reads.fastq

# Convert BAM to FASTQ if needed
samtools fastq reads.bam > reads.fastq

# Decompress if needed
gunzip -c reads.fastq.gz > reads.fastq

Problem: Technology auto-detection fails

# Manually specify read technology
strandweaver pipeline \
  -r1 reads.fastq --technology1 ont \
  -o assembly/

# Supported: illumina, ancient, ont, ont_ultralong, pacbio

Problem: Ancient DNA damage not detected

# Explicitly specify ancient DNA technology
strandweaver pipeline \
  -r1 ancient_reads.fastq --technology1 ancient \
  -o ancient_assembly/

# Check damage profile first
strandweaver profile -i ancient_reads.fastq --technology ancient -o damage_profile.json

AI/ML Issues

Problem: AI features not working

# Check if AI dependencies installed
python3 -c "import torch, xgboost; print('AI dependencies OK')"

# Install AI dependencies if missing
pip install "git+https://github.com/pgrady1322/strandweaver.git#egg=strandweaver[ai]"

Problem: "No trained models found" warning

# Pre-trained models ship with v0.2+ and should load automatically.
# If you see this warning, the trained_models/ directory may be missing.

# Reinstall to restore pre-trained models:
pip install --force-reinstall "git+https://github.com/pgrady1322/strandweaver.git#egg=strandweaver[all]"

# Or train custom models:
strandweaver train generate-data --genome-size 5000000 --graph-training --graph-only -o training_data/
strandweaver train run --data-dir training_data/ -o trained_models/
# See trained_models/TRAINING.md for the complete training guide

Problem: GPU not being used

# Force GPU usage with explicit backend
strandweaver pipeline \
  --hifi-long-reads reads.fastq \
  -o assembly/ \
  --gpu-backend cuda

# On Apple Silicon
strandweaver pipeline \
  --hifi-long-reads reads.fastq \
  -o assembly/ \
  --gpu-backend mps

# Check GPU memory usage during assembly
watch -n 1 nvidia-smi

Output Issues

Problem: No structural variants detected

# Ensure SV detection is happening (enabled by default in the pipeline)
# SVs require ultra-long or Hi-C data for validation
strandweaver pipeline \
  --hifi-long-reads hifi.fastq \
  --ont-ul ultralong.fastq \
  --min-sv-size 30 \
  -o assembly/

Problem: Missing output files

# Check pipeline.log for errors
tail -100 output/pipeline.log

# Export all output formats (FASTA + GFA) with intermediate graphs
strandweaver pipeline \
  --hifi-long-reads reads.fastq \
  -o assembly/ \
  --output-format both \
  --export-intermediate-graphs

Problem: GFA file won't load in BandageNG

# Validate GFA format
grep "^S" assembly_graph.gfa | head -5

# Regenerate with both FASTA + GFA export
strandweaver pipeline \
  --hifi-long-reads reads.fastq \
  -o assembly/ \
  --output-format both
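Beyond the grep check, a slightly deeper sanity test is to confirm that every link refers to a segment that exists. The sketch below assumes GFA 1 records (`S` lines with a name and sequence, `L` lines with from/to segment names) and is a generic illustration, not a StrandWeaver or BandageNG utility.

```python
def check_gfa(lines):
    """Return (segment_count, link_count, dangling_links) for GFA 1 text."""
    segments = set()
    links = []
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if fields[0] == "S" and len(fields) >= 3:
            segments.add(fields[1])
        elif fields[0] == "L" and len(fields) >= 6:
            # L <from> <from_orient> <to> <to_orient> <overlap>
            links.append((fields[1], fields[3]))
    dangling = [(a, b) for a, b in links if a not in segments or b not in segments]
    return len(segments), len(links), dangling


gfa = [
    "H\tVN:Z:1.0",
    "S\tctg1\tACGT",
    "S\tctg2\tGGCC",
    "L\tctg1\t+\tctg2\t-\t0M",
    "L\tctg1\t+\tctg3\t+\t0M",
]
print(check_gfa(gfa))
```

A non-empty list of dangling links is one plausible reason a viewer refuses to load a graph, though other format problems are possible.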

Common Error Messages

ValueError: Coverage too low for reliable assembly

Solution: Increase sequencing coverage (aim for 30×+ minimum)
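Before re-sequencing, it can help to estimate the coverage you already have: total sequenced bases divided by the expected genome size. The sketch below assumes plain four-line FASTQ records and a genome size you supply yourself; the function names are illustrative and this is not how StrandWeaver computes coverage internally.

```python
import gzip


def open_reads(path: str):
    """Open a FASTQ file as text, transparently handling .gz (helper for real files)."""
    return gzip.open(path, "rt") if path.endswith(".gz") else open(path, "rt")


def total_bases(fastq_lines) -> int:
    """Sum sequence lengths, assuming strict 4-line FASTQ records."""
    bases = 0
    for index, line in enumerate(fastq_lines):
        if index % 4 == 1:  # second line of each record is the sequence
            bases += len(line.strip())
    return bases


def estimated_coverage(fastq_lines, genome_size_bp: int) -> float:
    """Coverage = total sequenced bases / genome size."""
    return total_bases(fastq_lines) / genome_size_bp


records = ["@r1\n", "ACGTACGT\n", "+\n", "IIIIIIII\n",
           "@r2\n", "ACGTACGTACGT\n", "+\n", "IIIIIIIIIIII\n"]
print(estimated_coverage(records, genome_size_bp=10))  # 2.0
```

For a real dataset, pass `open_reads("reads.fastq.gz")` as the first argument and the expected genome size in base pairs as the second.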

RuntimeError: Graph construction failed - no valid k-mer overlaps

Solution: Try different k-mer size with --kmer-size-assembly 31 or --kmer-size-assembly 51

MemoryError: Unable to allocate array

Solution: Use --memory-limit 16 to cap memory or reduce coverage with --sample-size-graph 500000

ImportError: cannot import name 'ThreadCompass'

Solution: Reinstall package with pip install --force-reinstall strandweaver

FileNotFoundError: [Errno 2] No such file or directory

Solution: Use absolute paths for input/output files or check current working directory

Getting Help

Check version and installation:

strandweaver --version
strandweaver --help

Enable verbose logging:

strandweaver pipeline \
  --hifi-long-reads reads.fastq \
  -o assembly/ \
  --log-level DEBUG

Report issues: GitHub Issues

Contact: patrickgsgrady@gmail.com

🗺️ Roadmap

v0.3 — Current Release

#	Feature	Status	Notes
1	K-Weaver Trained Models	✅ Complete	4 regression models (DBG, UL overlap, extension, polish) via Optuna + Colab GPU
2	ErrorSmith Trained Models (v18)	✅ Complete	11-family, 17-chemistry, 45-feature two-stage ensemble on HG002/CHM13 (acc=0.949, F1=0.950)
3	SVScribe Overhaul (v7.2)	✅ Complete	Two-stage dual-tier XGBoost + LightGBM ensemble, F1-macro 0.557 → 0.957 (S2)
4	Standalone `assemble` Command	✅ Complete	Same assembly engine as `pipeline`, usable independently
5	QV Estimation & Gap Filling	✅ Complete	`qv`, `polish`, `gap-fill` CLI commands wired into `_step_finish()`
6	Technology-Specific Subsampling	✅ Complete	`--subsample-{hifi,ont,ont-ul,illumina,ancient}` flags
7	200× Faster K-mer Extraction	✅ Complete	Streaming architecture with parallel batch processing
8	PyPI Packaging	✅ Complete	`pip install strandweaver`
9	Training Notebooks	✅ Complete	Colab notebooks for all model types
10	Validate — Reference Comparison	Planned	`--reference` flag accepted but comparison logic not yet implemented
11	BUSCO Integration	Planned	`--busco-lineage` present on `validate` but not wired
12	Decontamination Screening	Planned	`--decontaminate` flag stubbed

v0.2 Release Notes

#	Feature	Notes
1	Trained ML Models	XGBoost + GNN for EdgeWarden, PathGNN, DiploidAI, ThreadCompass, SVScribe
2	DiploidAI Integration	26-feature phasing wired into HaplotypeDetangler
3	Bubble-Aware Local Phasing	Genomics audit G2
4	Genomics Audit (24 items)	G1–G24 resolved
5	Git LFS	Model weights tracked
6	Graph-Only Training	3.3× faster, 27× less disk

Future

Polyploid assembly (--ploidy beyond haploid/diploid)
PacBio/ONT native metadata detection from BAM/POD5 headers
Additional ancient DNA damage models beyond deamination

📚 Documentation

AI Model Training Guide — Model performance, custom training, Colab notebooks
User Training Module — Synthetic genome generation & parameter reference

📄 License

Dual-licensed: Noncommercial Academic (default) and Commercial. See LICENSE_ACADEMIC.md and LICENSE_COMMERCIAL.md.

Free for nonprofit academic research at universities and research institutes
Commercial license required for any for-profit use, industry-funded research, or integration into commercial products/pipelines
Source-available, not OSI open-source

Contact patrickgsgrady@gmail.com for commercial licensing.

📧 Contact

Patrick Grady | dr.pgrady(at)gmail.com

📈 Citation and References

@software{strandweaver2026,
  author = {Grady, Patrick and Green, Rich},
  title = {StrandWeaver: AI-Powered Multi-Technology Genome Assembler},
  year = {2026},
  url = {https://github.com/pgrady1322/strandweaver}
}

1. Zimin, A.V., Puiu, D., Luo, M.C. et al. Hybrid assembly of the large and highly repetitive genome of Aegilops tauschii, a progenitor of bread wheat, with the mega-reads algorithm. Genome Res 27, 787–792 (2017). https://doi.org/10.1101/gr.213405.116
2. Rautiainen, M., Nurk, S., Walenz, B.P. et al. Telomere-to-telomere assembly of diploid chromosomes with Verkko. Nat Biotechnol 41, 1474–1482 (2023). https://doi.org/10.1038/s41587-023-01662-6
3. Cheng, H., Concepcion, G.T., Feng, X. et al. Haplotype-resolved de novo assembly using phased assembly graphs with hifiasm. Nat Methods 18, 170–175 (2021). https://doi.org/10.1038/s41592-020-01056-5

StrandWeaver 🧬⚡🤖

Release history

This version

0.3.0.dev0 pre-release

Mar 21, 2026

Download files

Source Distribution

strandweaver-0.3.0.dev0.tar.gz (520.8 kB)

Uploaded Mar 21, 2026 Source

Built Distribution

strandweaver-0.3.0.dev0-py3-none-any.whl (497.6 kB)

Uploaded Mar 21, 2026 Python 3

File details

Details for the file strandweaver-0.3.0.dev0.tar.gz.

File metadata

Download URL: strandweaver-0.3.0.dev0.tar.gz
Upload date: Mar 21, 2026
Size: 520.8 kB
Tags: Source
Uploaded using Trusted Publishing? Yes
Uploaded via: twine/6.1.0 CPython/3.13.7

File hashes

Hashes for strandweaver-0.3.0.dev0.tar.gz
Algorithm	Hash digest
SHA256	`b5bc7a1686f950f117869435d20050219aef6621442c084d277092fe5f8b021f`
MD5	`8a69d1baf11afc782e6c32d1bc81022a`
BLAKE2b-256	`056b2c54ac8fb3e74688883195dcd8af899a7eae72e9e0b066be329d38181642`
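
A downloaded archive can be checked against the SHA256 digest listed above with the standard-library `hashlib`. This is a minimal sketch. The function names are illustrative, and the local file name is assumed to match the download.

```python
import hashlib

# SHA256 digest copied from the table above for strandweaver-0.3.0.dev0.tar.gz
EXPECTED_SHA256 = "b5bc7a1686f950f117869435d20050219aef6621442c084d277092fe5f8b021f"


def sha256_of(path, chunk_size=1 << 20):
    """Hash a file in chunks so large downloads need not fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_digest(path, expected=EXPECTED_SHA256):
    """Return True when the file's SHA256 equals the expected hex digest."""
    return sha256_of(path) == expected.strip().lower()
```

`matches_digest("strandweaver-0.3.0.dev0.tar.gz")` should return `True` for an intact download of the source distribution.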

Provenance

The following attestation bundles were made for strandweaver-0.3.0.dev0.tar.gz:

Publisher: publish.yml on pgrady1322/strandweaver

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

Statement:
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: strandweaver-0.3.0.dev0.tar.gz
- Subject digest: b5bc7a1686f950f117869435d20050219aef6621442c084d277092fe5f8b021f
- Sigstore transparency entry: 1154436119
- Sigstore integration time: Mar 21, 2026
Source repository:
- Permalink: pgrady1322/strandweaver@b09df42c6e8a35c86f903973f431c96b2706a884
- Branch / Tag: refs/heads/main
- Owner: https://github.com/pgrady1322
- Access: public
Publication detail:
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: publish.yml@b09df42c6e8a35c86f903973f431c96b2706a884
- Trigger Event: workflow_dispatch

File details

Details for the file strandweaver-0.3.0.dev0-py3-none-any.whl.

File metadata

Download URL: strandweaver-0.3.0.dev0-py3-none-any.whl
Upload date: Mar 21, 2026
Size: 497.6 kB
Tags: Python 3
Uploaded using Trusted Publishing? Yes
Uploaded via: twine/6.1.0 CPython/3.13.7

File hashes

Hashes for strandweaver-0.3.0.dev0-py3-none-any.whl
Algorithm	Hash digest
SHA256	`05060a109c4452905140bf589c0df802f53f0b8c5d1234e034ac3c6a19657526`
MD5	`c4dd36c7cd93eca90697cb277ab77b7b`
BLAKE2b-256	`1e484b7245952f70961ab2348a60bbd7e403b7b93a88c1fb8006fd8139faf1d3`

Provenance

The following attestation bundles were made for strandweaver-0.3.0.dev0-py3-none-any.whl:

Publisher: publish.yml on pgrady1322/strandweaver

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

Statement:
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: strandweaver-0.3.0.dev0-py3-none-any.whl
- Subject digest: 05060a109c4452905140bf589c0df802f53f0b8c5d1234e034ac3c6a19657526
- Sigstore transparency entry: 1154436120
- Sigstore integration time: Mar 21, 2026
Source repository:
- Permalink: pgrady1322/strandweaver@b09df42c6e8a35c86f903973f431c96b2706a884
- Branch / Tag: refs/heads/main
- Owner: https://github.com/pgrady1322
- Access: public
Publication detail:
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: publish.yml@b09df42c6e8a35c86f903973f431c96b2706a884
- Trigger Event: workflow_dispatch
