Python tools for detecting structural variation from nanopore sequence data
nanomonsv
Introduction
nanomonsv is software for detecting somatic structural variations from paired (tumor and matched control) cancer genome sequence data. It is described in the following paper. If you use nanomonsv or any resource in this repository, please cite this paper.
Precise characterization of somatic complex structural variations from tumor/control paired long-read sequencing data with nanomonsv, Shiraishi et al., Nucleic Acids Research, 2023, [link].
Key features:
- Single-nucleotide breakpoint resolution using consensus sequences from long-read alignments.
- LINE1 insertion classification: Distinguishes Solo L1, Partnered L1 (transduction), and Orphan L1 (orphan transduction), and identifies source L1 elements.
- Two detection modules: Canonical SV module for standard SVs with high precision and recall, and Single breakend SV module for complex SVs involving highly repetitive sequences (centromeres, LINE1, viruses) that can only be identified with long reads.
- Haplotype-aware (v0.9.0+): Reports per-haplotype supporting read counts (HP1, HP2, unphased) using phasing information from the input BAM file. This enables phasing of SV breakpoints.
Installation
pip install nanomonsv
You can also install via conda (bioconda channel). Occasionally the conda releases lag behind PyPI.
conda create -n nanomonsv -c conda-forge -c bioconda nanomonsv
Dependencies
| Tool | Required for | Notes |
|---|---|---|
| htslib (tabix, bgzip) | parse, get | Must be in PATH |
| racon | get | Consensus generation (default) |
| mafft | get (--use_mafft) | For backward compatibility |
| bwa | insert_classify | |
| minimap2 | insert_classify | |
| bedtools | insert_classify | |
| RepeatMasker | insert_classify | |
Python >=3.9, pysam, numpy, parasail
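As a rough pre-flight check, the dependency table above can be turned into a small script that reports which external tools are not on PATH. This is only a sketch: the grouping by subcommand mirrors the table, the executable names (for example `RepeatMasker`) are assumed to match the usual binary names, and `missing_tools` is a hypothetical helper, not part of nanomonsv. mafft is left out because it is only needed with --use_mafft.

```python
import shutil

# Executables from the dependency table, grouped by the subcommand that needs them.
# mafft is omitted because it is only required when --use_mafft is given.
REQUIRED_TOOLS = {
    "parse": ["tabix", "bgzip"],
    "get": ["tabix", "bgzip", "racon"],
    "insert_classify": ["bwa", "minimap2", "bedtools", "RepeatMasker"],
}


def missing_tools(subcommand, which=shutil.which):
    """Return the tools for `subcommand` that cannot be found on PATH."""
    return [tool for tool in REQUIRED_TOOLS[subcommand] if which(tool) is None]


if __name__ == "__main__":
    for name in REQUIRED_TOOLS:
        missing = missing_tools(name)
        status = "ok" if not missing else "missing: " + ", ".join(missing)
        print(f"{name}: {status}")
```

The `which` argument is injectable only so that the check can be exercised without touching the real PATH.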
Input requirements
- BAM or CRAM file aligned by minimap2
- For CRAM files, specify --reference_fasta
- S3 paths (e.g., s3://bucket/path.bam) are supported via pip install nanomonsv[s3]. Note that network latency may significantly slow down processing compared to local files.
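The requirements above can be summarised per input path. The following sketch inspects only the path string (it does not open the file or check the aligner), and `describe_alignment_input` is a hypothetical helper introduced for illustration.

```python
from pathlib import PurePosixPath


def describe_alignment_input(path):
    """Summarise what the input requirements imply for one alignment path."""
    suffix = PurePosixPath(path).suffix.lower()
    if suffix not in (".bam", ".cram"):
        raise ValueError(f"expected a .bam or .cram file, got: {path}")
    return {
        "format": suffix.lstrip(".").upper(),
        # CRAM input should be accompanied by --reference_fasta.
        "needs_reference_fasta": suffix == ".cram",
        # S3 paths need the optional extra: pip install nanomonsv[s3]
        "needs_s3_extra": path.startswith("s3://"),
    }
```

For example, `describe_alignment_input("tumor.cram")` flags that a reference FASTA should be supplied.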
Quick Start
- Prepare the reference genome (here, GDC GRCh38 reference genome).
wget https://api.gdc.cancer.gov/data/254f697d-310d-4d7d-a27b-27fbf767a834 -O GRCh38.d1.vd1.fa.tar.gz
tar xvf GRCh38.d1.vd1.fa.tar.gz
- (Optional but highly recommended) Download a control panel from zenodo. See Control Panel for available panels and how to choose.
wget https://zenodo.org/api/records/11470934/files/1kg-ont-vienna_hg38_no_singleton.tar.gz/content \
-O 1kg-ont-vienna_hg38_no_singleton.tar.gz
tar xvf 1kg-ont-vienna_hg38_no_singleton.tar.gz
- (Optional but highly recommended) Download a simple repeat BED file. Pre-built files for GRCh38 and CHM13 are included in this repository under resource/simple_repeats.
- Parse putative SV supporting reads.
nanomonsv parse tumor.bam output/tumor
nanomonsv parse ctrl.bam output/ctrl
- Get the final result.
nanomonsv get output/tumor tumor.bam GRCh38.d1.vd1.fa \
--control_prefix output/ctrl --control_bam ctrl.bam \
--control_panel_prefix 1kg-ont-vienna_hg38_no_singleton \
--simple_repeat_bed resource/simple_repeats/human_GRCh38_simpleRepeat.bed.gz
You will find the result file output/tumor.nanomonsv.result.txt.
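When the Quick Start steps are driven from a script, it can help to build the command lines as argument lists first. The sketch below uses only the subcommands and options shown above; `build_parse_cmd` and `build_get_cmd` are hypothetical helper names, and nothing is executed until the lists are passed to `subprocess.run`.

```python
import shlex


def build_parse_cmd(alignment_file, output_prefix):
    """Argument list for `nanomonsv parse`."""
    return ["nanomonsv", "parse", alignment_file, output_prefix]


def build_get_cmd(tumor_prefix, tumor_bam, reference_fa, control_prefix=None,
                  control_bam=None, control_panel_prefix=None,
                  simple_repeat_bed=None):
    """Argument list for `nanomonsv get`, adding only the options that are set."""
    cmd = ["nanomonsv", "get", tumor_prefix, tumor_bam, reference_fa]
    optional = [
        ("--control_prefix", control_prefix),
        ("--control_bam", control_bam),
        ("--control_panel_prefix", control_panel_prefix),
        ("--simple_repeat_bed", simple_repeat_bed),
    ]
    for flag, value in optional:
        if value is not None:
            cmd += [flag, value]
    return cmd


if __name__ == "__main__":
    steps = [
        build_parse_cmd("tumor.bam", "output/tumor"),
        build_parse_cmd("ctrl.bam", "output/ctrl"),
        build_get_cmd("output/tumor", "tumor.bam", "GRCh38.d1.vd1.fa",
                      control_prefix="output/ctrl", control_bam="ctrl.bam"),
    ]
    for cmd in steps:
        # Print for review; to execute, use subprocess.run(cmd, check=True).
        print(shlex.join(cmd))
```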
Usage
parse
Parses all supporting reads of putative somatic SVs.
nanomonsv parse [-h] [--reference_fasta reference.fa] [--debug]
[--split_alignment_check_margin SPLIT_ALIGNMENT_CHECK_MARGIN]
[--minimum_breakpoint_ambiguity MINIMUM_BREAKPOINT_AMBIGUITY]
alignment_file output_prefix
- alignment_file: Path to input indexed BAM or CRAM file
- output_prefix: Output file prefix
- --reference_fasta: Path to reference genome (recommended for CRAM files)
After successful completion, you will find:
{output_prefix}.deletion.sorted.bed.gz, {output_prefix}.insertion.sorted.bed.gz, {output_prefix}.rearrangement.sorted.bedpe.gz, {output_prefix}.bp_info.sorted.bed.gz and their indexes (.tbi files).
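Before running get, one might confirm that parse produced all of the files listed above. A minimal sketch, assuming each index is simply the data file name with `.tbi` appended; `expected_parse_outputs` and `missing_parse_outputs` are hypothetical helpers.

```python
from pathlib import Path

# Suffixes of the four data files written by the parse step.
PARSE_SUFFIXES = [
    ".deletion.sorted.bed.gz",
    ".insertion.sorted.bed.gz",
    ".rearrangement.sorted.bedpe.gz",
    ".bp_info.sorted.bed.gz",
]


def expected_parse_outputs(output_prefix):
    """List the data files and their .tbi indexes for a given prefix."""
    paths = []
    for suffix in PARSE_SUFFIXES:
        data_file = output_prefix + suffix
        paths += [data_file, data_file + ".tbi"]
    return paths


def missing_parse_outputs(output_prefix):
    """Return the expected files that do not exist on disk."""
    return [p for p in expected_parse_outputs(output_prefix) if not Path(p).is_file()]
```

An empty list from `missing_parse_outputs("output/tumor")` suggests the parse step completed for that prefix.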
get
Gets the SV result from parsed supporting reads.
nanomonsv get [-h] [--control_prefix CONTROL_PREFIX]
[--control_bam CONTROL_BAM]
[--control_panel_prefix CONTROL_PANEL_PREFIX]
[--simple_repeat_bed SIMPLE_REPEAT_BED]
[--min_tumor_variant_read_num MIN_TUMOR_VARIANT_READ_NUM]
[--min_tumor_VAF MIN_TUMOR_VAF]
[--max_control_variant_read_num MAX_CONTROL_VARIANT_READ_NUM]
[--max_control_VAF MAX_CONTROL_VAF]
[--cluster_margin_size CLUSTER_MARGIN_SIZE]
[--median_mapQ_thres MEDIAN_MAPQ_THRES]
[--max_overhang_size_thres MAX_OVERHANG_SIZE_THRES]
[--var_read_min_mapq VAR_READ_MIN_MAPQ]
[--qv10] [--qv15] [--qv20] [--qv25] [--use_mafft]
[--no_single_bnd] [--processes PROCESSES]
[--sort_option SORT_OPTION] [--max_memory_minimap2] [--debug]
tumor_prefix tumor_bam reference.fa
- tumor_prefix: Prefix to the tumor data set in the parse step
- tumor_bam: Path to input indexed BAM file
- reference.fa: Path to reference genome used for the alignment
Recommended options
| Option | Recommendation | Description |
|---|---|---|
| --control_prefix / --control_bam | Strongly recommended | Matched control for somatic filtering. We strongly recommend using matched control data whenever possible. |
| --control_panel_prefix | Recommended | Non-matched control panel (see Control Panel) |
| --simple_repeat_bed | Strongly recommended | Filter indels in simple repeats. BED files provided in resource/simple_repeats |
| --use_mafft | Not recommended | Use mafft instead of racon for consensus generation (for backward compatibility) |
| --no_single_bnd | Not recommended | Disable single breakend SV detection. See wiki |
| --processes N | Optional | Multi-processing mode |
Quality presets
| Preset | Recommended for |
|---|---|
| --qv10 | ONT data with median Q10 (e.g., Guppy before v5) |
| --qv15 | ONT data with median Q15 (e.g., Guppy v5/v6) |
| --qv20 | ONT data with median Q20+ (e.g., Dorado SUP, Q20+ chemistry) |
| --qv25 | PacBio HiFi data |
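The preset table can be encoded as a small lookup. Note the assumptions: the table only names Q10, Q15, and Q20+ as reference points, so the cut-offs used below for values in between (15 and 20 as lower bounds) are an illustrative choice, and `suggest_quality_preset` is a hypothetical helper rather than anything nanomonsv provides.

```python
def suggest_quality_preset(platform, median_qscore=None):
    """Suggest a preset flag from the platform and, for ONT, the median Q score."""
    platform = platform.lower()
    if platform in ("pacbio", "hifi", "pacbio_hifi"):
        return "--qv25"
    if platform != "ont":
        raise ValueError(f"unknown platform: {platform}")
    if median_qscore is None:
        raise ValueError("median_qscore is required for ONT data")
    # Assumed boundaries: the table lists Q10, Q15 and Q20+ only.
    if median_qscore >= 20:
        return "--qv20"
    if median_qscore >= 15:
        return "--qv15"
    return "--qv10"
```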
merge_control
Merges non-matched control panel supporting reads obtained by parse.
nanomonsv merge_control [-h] prefix_list_file output_prefix
- prefix_list_file: List of output_prefix generated at the parse stage
- output_prefix: Prefix to the merged control supporting reads
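A sketch of preparing the input for merge_control follows. It assumes the prefix list file is plain text with one parse output prefix per line, which is not stated explicitly above, so check this against the tool's own documentation. Both function names are hypothetical.

```python
def prefix_list_text(prefixes):
    """Render parse output prefixes as file content, one per line (assumed format)."""
    return "".join(prefix + "\n" for prefix in prefixes)


def build_merge_control_cmd(prefix_list_file, output_prefix):
    """Argument list for `nanomonsv merge_control`."""
    return ["nanomonsv", "merge_control", prefix_list_file, output_prefix]


if __name__ == "__main__":
    # Hypothetical control samples that were each processed with `nanomonsv parse`.
    prefixes = ["panel/sample1", "panel/sample2", "panel/sample3"]
    with open("prefix_list.txt", "w") as handle:
        handle.write(prefix_list_text(prefixes))
    print(" ".join(build_merge_control_cmd("prefix_list.txt", "panel/merged")))
```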
insert_classify
Classifies long insertions into mobile element insertions (LINE1, Alu, SVA, processed pseudogene).
nanomonsv insert_classify [-h] [--debug] sv_list_file output_file reference.fa gencode.gtf.gz LINE1_db
- sv_list_file: VCF file or nanomonsv get result file (nanomonsv.result.txt)
- output_file: Path to the output file
- reference.fa: Path to the reference genome
- gencode.gtf.gz: Path to gene annotation GTF file. We recommend Gencode basic annotation (e.g., gencode.v49.basic.annotation.gtf.gz)
- LINE1_db: Path to LINE1 database. Use the files in resource/LINE1_db
validate
Validates candidate SVs by alignment of tumor and matched control BAM files. This may be helpful for evaluating SV tools on short-read platforms when pairs of short-read and long-read sequencing data are available.
nanomonsv validate [-h] [--control_bam CONTROL_BAM]
[--var_read_min_mapq VAR_READ_MIN_MAPQ] [--debug]
sv_list_file tumor_bam output reference.fa
- sv_list_file: SV candidate list file (only Chr_1 to Inserted_Seq columns are necessary)
- tumor_bam: Path to input indexed BAM file
- output: Path to the output file
- reference.fa: Path to the reference genome
Output Format
Canonical SV result ({tumor_prefix}.nanomonsv.result.txt)
| Column | Description |
|---|---|
| Chr_1 | Chromosome for the 1st breakpoint |
| Pos_1 | Coordinate for the 1st breakpoint |
| Dir_1 | Direction of the 1st breakpoint |
| Chr_2 | Chromosome for the 2nd breakpoint |
| Pos_2 | Coordinate for the 2nd breakpoint |
| Dir_2 | Direction of the 2nd breakpoint |
| Inserted_Seq | Inserted nucleotides within the breakpoints (--- if none) |
| SV_ID | Identifier of SVs |
| Checked_Read_Num_Tumor | Total reads in the tumor used for validation alignment |
| Supporting_Read_Num_Tumor | Variant reads in the tumor from validation alignment |
| Supporting_Read_Num_Tumor_HP_BP1 | Haplotype counts of variant reads at breakpoint 1 (HP1,HP2,unphased) |
| Supporting_Read_Num_Tumor_HP_BP2 | Haplotype counts of variant reads at breakpoint 2 (HP1,HP2,unphased) |
| Checked_Read_Num_Control | Total reads in the matched control used for validation alignment |
| Supporting_Read_Num_Control | Variant reads in the matched control from validation alignment |
| Is_Filter | Filter status (PASS or filter reason such as Simple_repeat) |
A VCF format file ({tumor_prefix}.nanomonsv.result.vcf) is also generated. See the wiki page for details on filtering.
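The columns above can be read with the standard csv module. This sketch assumes the result file is tab-delimited with a header row carrying the column names in the table, and that the haplotype fields are three comma-separated integers. The `Tumor_VAF` value is computed here as supporting reads divided by checked reads, which is an illustrative definition and not necessarily the one nanomonsv uses internally.

```python
import csv


def parse_hp_counts(field):
    """Split an 'HP1,HP2,unphased' field into three integers."""
    hp1, hp2, unphased = (int(value) for value in field.split(","))
    return {"HP1": hp1, "HP2": hp2, "unphased": unphased}


def read_passing_svs(lines):
    """Yield rows with Is_Filter == PASS, adding a tumor VAF and parsed HP counts."""
    for row in csv.DictReader(lines, delimiter="\t"):
        if row["Is_Filter"] != "PASS":
            continue
        checked = int(row["Checked_Read_Num_Tumor"])
        supporting = int(row["Supporting_Read_Num_Tumor"])
        # Illustrative VAF: supporting reads over checked reads.
        row["Tumor_VAF"] = supporting / checked if checked else 0.0
        row["HP_BP1"] = parse_hp_counts(row["Supporting_Read_Num_Tumor_HP_BP1"])
        row["HP_BP2"] = parse_hp_counts(row["Supporting_Read_Num_Tumor_HP_BP2"])
        yield row


if __name__ == "__main__":
    with open("output/tumor.nanomonsv.result.txt") as handle:
        for sv in read_passing_svs(handle):
            print(sv["SV_ID"], sv["Chr_1"], sv["Pos_1"], f"{sv['Tumor_VAF']:.2f}")
```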
Single breakend result ({tumor_prefix}.nanomonsv.sbnd.result.txt)
Generated by default. Use --no_single_bnd to disable.
| Column | Description |
|---|---|
| Chr_1 | Chromosome of the breakpoint |
| Pos_1 | Coordinate of the breakpoint |
| Dir_1 | Direction of the breakpoint |
| Contig | Assembled contig sequence at the breakpoint |
| SV_ID | Identifier of the single breakend |
| Checked_Read_Num_Tumor | Total reads in the tumor used for validation alignment |
| Supporting_Read_Num_Tumor | Variant reads in the tumor from validation alignment |
| Supporting_Read_Num_Tumor_HP | Haplotype counts of variant reads (HP1,HP2,unphased) |
| Checked_Read_Num_Control | Total reads in the matched control used for validation alignment |
| Supporting_Read_Num_Control | Variant reads in the matched control from validation alignment |
| Is_Filter | Filter status (PASS, Simple_repeat, Canonical_SV_overlap, or combinations) |
A VCF format file ({tumor_prefix}.nanomonsv.sbnd.result.vcf) is also generated,
using VCF single breakend notation (e.g., N. or .N in ALT field with SVTYPE=BND).
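To pick single breakend records out of a VCF without extra dependencies, the ALT notation described above can be tested directly. The sketch below treats an ALT allele as a single breakend when exactly one end is a dot, and requires SVTYPE=BND in the INFO column. It reads the standard fixed VCF columns (ALT is the fifth, INFO the eighth) and does not interpret breakend direction. Function names are hypothetical.

```python
def is_single_breakend_alt(alt):
    """True if ALT has a leading or trailing '.' (but not both), e.g. 'N.' or '.N'."""
    if len(alt) < 2:
        return False  # a bare "." is a missing value, not a breakend
    return alt.startswith(".") != alt.endswith(".")


def single_breakend_records(lines):
    """Yield (chrom, pos, alt) for single breakend records in VCF text lines."""
    for line in lines:
        if line.startswith("#"):
            continue  # skip meta-information and header lines
        fields = line.rstrip("\n").split("\t")
        chrom, pos, alt, info = fields[0], int(fields[1]), fields[4], fields[7]
        if "SVTYPE=BND" in info.split(";") and is_single_breakend_alt(alt):
            yield chrom, pos, alt
```

Passing an open file handle for `{tumor_prefix}.nanomonsv.sbnd.result.vcf` as `lines` works because file objects iterate line by line.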
insert_classify result
| Column | Description |
|---|---|
| Insert_Type | Type of insertion (Solo_L1, Partnered_L1, Orphan_L1, Alu, SVA, PSD) |
| Is_Inversion | Inverted form for Solo LINE1 (Simple, Inverted, Other) |
| L1_Ratio | Match rate with LINE1 sequences |
| Alu_Ratio | Match rate with Alu sequences |
| SVA_Ratio | Match rate with SVA sequences |
| RMSK_Info | Summary information of RepeatMasker |
| Alignment_Info | Alignment information to the human genome |
| Inserted_Pos | Inserted position (for tandem duplication or nested LINE1 transduction) |
| Is_PolyA_T | Extracted poly-A or poly-T sequences |
| Target_Site_Duplication | Nucleotides of target site duplications |
| L1_Source_Info | Inferred source site of LINE1 transduction |
| PSD_Gene | Processed pseudogene name |
| PSD_Overlap_Ratio | Match rate with the pseudogene |
| PSD_Exon_Num | Number of pseudogene exons matched with the inserted sequence |
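A quick way to summarise an insert_classify result is to tally the Insert_Type column. As with the other sketches, this assumes a tab-delimited file with a header row containing the column names listed above; `count_insert_types` and `count_l1_insertions` are hypothetical helpers.

```python
import csv
from collections import Counter

# The three LINE1 categories named in the Insert_Type description.
L1_TYPES = ("Solo_L1", "Partnered_L1", "Orphan_L1")


def count_insert_types(lines):
    """Count rows per Insert_Type value."""
    reader = csv.DictReader(lines, delimiter="\t")
    return Counter(row["Insert_Type"] for row in reader)


def count_l1_insertions(type_counts):
    """Total number of insertions classified as any LINE1 category."""
    return sum(type_counts[name] for name in L1_TYPES)
```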
Control Panel
We strongly recommend using a control panel for filtering common SVs and sequencing noise.
Pre-built control panels are available at zenodo.
You can also create your own from your sequencing data using merge_control.
Pre-built control panels
| Panel | Samples | Reference | Source |
|---|---|---|---|
| 1000G ONT Vienna | 1,019 | GRCh38 / CHM13 | 1000 Genomes Project |
| HPRC Nanopore (Guppy v4) | ~30 | GRCh38 / CHM13 | HPRC release 1 |
| HPRC Nanopore (Guppy v6) | ~40 | GRCh38 / CHM13 | HPRC release 1 |
| HPRC PacBio HiFi | ~30 | GRCh38 / CHM13 | HPRC release 1 |
For ONT data, the 1000G ONT Vienna panel (1,019 samples) is recommended for its large sample size. We recommend using a control panel as close as possible in platform and basecall quality. When unsure, a noisier panel (e.g., Guppy v4) tends to be more versatile.
When you use these control panels and publish, please cite:
- Liao et al., Nature, 2023 (doi:10.1038/s41586-023-05896-x) for HPRC panels
- Schloissnig et al., Nature, 2025 (doi:10.1038/s41586-025-09290-7) for 1000G ONT Vienna panels
Example Data
The Oxford Nanopore sequencing data used in the paper are available through the public sequence repository (BioProject ID: PRJDB10898).
Results of nanomonsv for the above data are available here. Please kindly cite the NAR paper when you use these data.
See the tutorial wiki page for an example workflow on analyzing the COLO829 sample.
Citation
Shiraishi et al., Precise characterization of somatic complex structural variations from tumor/control paired long-read sequencing data with nanomonsv, Nucleic Acids Research, 2023, [link].