A novel tool for accurately merging haplotype-based SV calls and comparing SVs across reference genomes
EchoSV
EchoSV is a versatile tool for comparing and merging structural variant (SV) call sets generated using different reference genomes. It studies how SVs "echo" across these references through a hybrid workflow that combines liftover and graph-based matching.
Given two or more SV call sets from the same sample—each aligned to a different reference—EchoSV can perform two primary operations:
- Compare: Generates a detailed comparison identifying overlapping variants and those exclusive to a specific reference, e.g., calls across GRCh38, CHM13, and a donor-specific assembly (DSA).
- Merge: Consolidates multiple SV call sets into a single, unified output, e.g., merging two DSA haplotype-based call sets into one consolidated file.
Requirements
EchoSV depends on the following Python packages:
- pysam: Read and write BAM/CRAM files and VCF records for variant processing.
- intervaltree: Efficiently store and query genomic intervals to detect overlapping SVs.
- Biopython (Bio): Parse and manipulate sequence data during liftover steps.
- scipy: Perform statistical analysis and numerical computations on variant metrics.
- networkx: Construct and traverse graphs that model structural variant matches.
- pandas: Tabular data manipulation for comparison outputs.
- numpy: Numerical computations on variant metrics.
- rich: Formatted terminal output and progress display.
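Before running EchoSV, it can help to confirm that these packages are importable in the active environment. The following is a minimal sketch, not part of EchoSV itself; the helper name `missing_modules` is hypothetical, and the list uses the import names of the packages above (Biopython is imported as `Bio`).

```python
from importlib.util import find_spec

# Import names for the packages listed above (Biopython is imported as "Bio").
ECHOSV_DEPENDENCIES = [
    "pysam", "intervaltree", "Bio", "scipy",
    "networkx", "pandas", "numpy", "rich",
]


def missing_modules(names):
    """Return the top-level module names that cannot be found in this environment."""
    return [name for name in names if find_spec(name) is None]


if __name__ == "__main__":
    missing = missing_modules(ECHOSV_DEPENDENCIES)
    if missing:
        print("Missing packages:", ", ".join(missing))
    else:
        print("All EchoSV dependencies are importable.")
```

Anything reported as missing should be covered by `pip install -r requirements.txt` in the installation steps.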
Installation
Option 1: From GitHub (recommended)
git clone git@github.com:parklab/EchoSV.git
cd EchoSV
pip install -r requirements.txt
pip install .
Option 2: Via PyPI
pip install echosv
Usage
The EchoSV workflow consists of four main steps: chain, merge (optional), genotype, and match. The instructions and examples below use the test data, which can be downloaded from Zenodo.
Step 0: Download and uncompress test data
Download the EchoSV test data echosv_test_data.tar.gz from Zenodo and decompress it:
tar -xzvf echosv_test_data.tar.gz
Step 1: Generate chains
The chain command generates a liftover chain file that maps coordinates from ref2 (the source assembly) to ref1 (the target reference). Before running chain, align ref2 against ref1 using minimap2's asm-to-asm mode and index the output:
minimap2 -a -x asm5 --cs ref1.fa ref2.fa \
| samtools view -hSb - \
| samtools sort -O BAM -o ref2_to_ref1.bam
samtools index ref2_to_ref1.bam
Then generate the chain file. To obtain contig lengths, EchoSV automatically looks for a pre-built index of the ref2 FASTA; if none exists, it parses the FASTA directly, which is slower for large assemblies. You can generate an index with samtools faidx ref2.fa or samtools dict ref2.fa > ref2.fa.dict.
echosv chain \
-b test_data/input_data/chm13_to_grch38.bam \
-f test_data/input_data/chm13.fa \
-o test_data/chm13_to_grch38.chain.gz
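The index-first lookup described above can be sketched as follows. This is an illustration of the idea, not EchoSV's own code: the function name `contig_lengths` is hypothetical, only the `.fai` index written by samtools faidx is handled, and the fallback assumes an uncompressed FASTA.

```python
import os


def contig_lengths(fasta_path):
    """Return {contig: length}, preferring a samtools .fai index when present."""
    fai_path = fasta_path + ".fai"
    lengths = {}
    if os.path.exists(fai_path):
        # .fai columns: name, length, offset, bases per line, bytes per line
        with open(fai_path) as handle:
            for line in handle:
                fields = line.rstrip("\n").split("\t")
                if len(fields) >= 2:
                    lengths[fields[0]] = int(fields[1])
        return lengths

    # Fallback: scan the whole FASTA, which is slower for large assemblies.
    name = None
    with open(fasta_path) as handle:
        for line in handle:
            line = line.strip()
            if line.startswith(">"):
                parts = line[1:].split()
                name = parts[0] if parts else ""
                lengths[name] = 0
            elif name is not None:
                lengths[name] += len(line)
    return lengths
```

Reading the index touches one short line per contig, whereas the fallback reads every base, which is why indexing a large assembly first pays off.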
Parameters
- -b: Path to the ref2-to-ref1 alignment (BAM format, must be indexed)
- -f: Path to the ref2 reference FASTA
- -o: Output chain file for coordinate mapping (a coverage BED file is also written alongside)
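To make the role of the chain file concrete, here is a simplified sketch of coordinate liftover from ref2 to ref1. It is not EchoSV's implementation and does not read the chain file format: the block layout `(src_start, dst_start, size)` and the function name `lift_position` are assumptions for illustration, and only same-strand, ungapped blocks are handled.

```python
from bisect import bisect_right


def lift_position(blocks, pos):
    """Map a ref2 position to ref1 using ungapped aligned blocks.

    blocks: list of (src_start, dst_start, size) tuples, sorted by src_start,
            non-overlapping and on the same strand (simplified, hypothetical layout).
    Returns the lifted position, or None if pos falls outside every block.
    """
    starts = [block[0] for block in blocks]
    i = bisect_right(starts, pos) - 1  # last block starting at or before pos
    if i < 0:
        return None
    src_start, dst_start, size = blocks[i]
    if pos >= src_start + size:
        return None  # pos lies in an unaligned gap between blocks
    return dst_start + (pos - src_start)


blocks = [(100, 1000, 50), (200, 1100, 30)]
print(lift_position(blocks, 120))  # inside the first block
print(lift_position(blocks, 150))  # between blocks, cannot be lifted
```

Positions that fall between aligned blocks cannot be lifted, which is one reason a purely liftover-based comparison can miss matches.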
Step 2: Merge SV call sets from the same reference (optional)
The merge command merges multiple SV call sets that were called against the same reference genome (e.g., outputs from multiple callers). It is typically run before genotype and match so that each reference has a single unified call set for cross-reference comparison. Scripts to reproduce the analysis from our paper are available in scripts/.
# Merge multiple VCFs from the same reference into a single call set
echosv merge \
-i grch38_colo829_caller1.vcf.gz grch38_colo829_caller2.vcf.gz [...] \
-o grch38_colo829_svs.vcf.gz \
--merge --new
# Extract high-confidence SVs (≥4 supporting callers, ≥2 platforms)
echosv merge \
-i grch38_colo829_svs.vcf.gz \
-o grch38_colo829_svs_highconf.vcf.gz \
--extract
Pre-built gap BED files for the references used in this study are provided in the src/echosv/beds/ directory; a different gap BED can be supplied with --gaps-bed.
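The high-confidence criteria used by the extraction step (at least 4 supporting callers, at least 2 platforms, and not near a reference gap) can be sketched as a simple filter. This is an illustration under assumptions, not EchoSV's code: the record fields, the function names, and the 1000 bp gap margin are hypothetical, since the source does not state how close to a gap counts as "near".

```python
def near_gap(chrom, pos, gaps, margin):
    """True if pos lies within `margin` bp of any gap interval on chrom."""
    return any(start - margin <= pos <= end + margin
               for start, end in gaps.get(chrom, []))


def is_high_confidence(sv, gaps, min_callers=4, min_platforms=2, margin=1000):
    """Apply the caller, platform, and gap criteria to one SV record.

    sv: dict with "chrom", "pos", "callers", "platforms" (hypothetical layout).
    gaps: {chrom: [(start, end), ...]} as loaded from a gap BED file.
    margin: assumed distance in bp that counts as "near" a gap.
    """
    if len(set(sv["callers"])) < min_callers:
        return False
    if len(set(sv["platforms"])) < min_platforms:
        return False
    return not near_gap(sv["chrom"], sv["pos"], gaps, margin)
```

Callers and platforms are counted as distinct values, so a caller reported twice for the same SV contributes only once.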
Parameters:
- -i: Input VCF file(s): space-separated list for --merge, single file for --extract
- -o: Output file path
- -a / --atol: Positional tolerance in bp for matching breakpoints (default: 500)
- -s / --sizetol: Minimum size-similarity ratio for matching SVs (default: 0.5)
- -c / --checksvtype: Require matching SV types when merging
- --merge: Write a merged VCF from the comparison result
- --new: Build merged VCF records from scratch (use with --merge)
- --extract: Extract high-confidence SVs (≥4 supporting callers and ≥2 platforms)
- --gaps-bed: BED file of reference gap / N regions; SVs near gaps are excluded when using --extract
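The tolerance parameters above can be read as a pairwise test between two SV records. The sketch below shows one plausible interpretation and is not taken from EchoSV's source: it assumes the size-similarity ratio is the smaller SV length divided by the larger, and the record fields and the function name `svs_match` are hypothetical.

```python
def svs_match(a, b, atol=500, sizetol=0.5, check_svtype=False):
    """Decide whether two SV records describe the same event.

    a, b: dicts with "chrom", "pos", "svlen", "svtype" (hypothetical layout).
    atol: maximum breakpoint distance in bp.
    sizetol: minimum ratio of smaller to larger SV length (assumed definition).
    """
    if a["chrom"] != b["chrom"]:
        return False
    if check_svtype and a["svtype"] != b["svtype"]:
        return False
    if abs(a["pos"] - b["pos"]) > atol:
        return False
    small, large = sorted((abs(a["svlen"]), abs(b["svlen"])))
    if large == 0:
        return True  # both records have no length; treat sizes as equal
    return small / large >= sizetol
```

With the defaults, a 300 bp deletion and a 200 bp deletion whose breakpoints lie 400 bp apart would match (ratio 0.67), while a 300 bp and a 100 bp deletion at the same position would not (ratio 0.33).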
Step 3: Collect supporting reads
The genotype command collects supporting reads for each SV from BAM files and annotates the VCF with allele-frequency and read-name fields used by the graph-based matching in Step 4.
echosv genotype --longread \
-i test_data/input_data/grch38_colo829_somatic_svs.vcf.gz \
-b test_data/input_data/chm13_to_grch38.bam \
-o test_data/grch38_colo829_genotyped.vcf.gz
Parameters
- --longread: Collect supporting reads from long-read alignments
- --shortread: Collect supporting reads from short-read alignments
- -i: Input SV VCF file
- -b: BAM file(s); multiple BAMs can be provided space-separated
- -o: Output VCF with annotated supporting-read information
Step 4: Match SVs across references
The match command compares SV call sets across different reference genomes using a two-step hybrid approach: liftover-based coordinate matching followed by graph-based matching on shared supporting reads (echo score).
# Compare SV call sets and report concordant / reference-exclusive variants
echosv match -i test_data/test_colo829_config.json
# Compare SV call sets between DSA haplotypes and also produce a merged DSA-based VCF
echosv match -i dsa_merge_colo829_config.json --merge
The input is a JSON config file specifying reference labels, genotyped VCFs, chain files, and the output path. See test_data/test_colo829_config.json below for a working example.
Example JSON
{
  "refs": { "1": "grch38", "2": "chm13", "3": "dsa" },
  "vcfs": {
    "1": "./test_data/grch38_colo829_genotyped.vcf.gz",
    "2": "./test_data/chm13_colo829_genotyped.vcf.gz",
    "3": "./test_data/dsa_colo829_genotyped.vcf.gz"
  },
  "chains": {
    "2_to_1": "./test_data/chm13_to_grch38.chain.gz",
    "3_to_1": "./test_data/colo829bl_hap*_grch38.chain.gz"
  },
  "output": "./test_data/colo829_svs_comparison.txt"
}
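A config like the one above can be sanity-checked before running match. The sketch below is not part of EchoSV; the function name `check_config` is hypothetical, and the rules it applies (every reference has a VCF, chain keys follow the `<source>_to_<target>` pattern and use known reference IDs) are inferred from the example rather than from a documented schema.

```python
import json

REQUIRED_KEYS = ("refs", "vcfs", "chains", "output")


def check_config(config):
    """Return a list of problems found in a match config dict (empty if none)."""
    problems = [f"missing key: {key}" for key in REQUIRED_KEYS if key not in config]
    if problems:
        return problems
    refs = set(config["refs"])
    # Every reference label is expected to have a genotyped VCF.
    for ref_id in sorted(refs - set(config["vcfs"])):
        problems.append(f"no VCF for reference {ref_id}")
    # Chain keys are expected to look like "<source id>_to_<target id>".
    for key in config["chains"]:
        src, sep, dst = key.partition("_to_")
        if not sep or src not in refs or dst not in refs:
            problems.append(f"bad chain key: {key}")
    return problems


example = json.loads("""
{
  "refs": {"1": "grch38", "2": "chm13"},
  "vcfs": {"1": "a.vcf.gz", "2": "b.vcf.gz"},
  "chains": {"2_to_1": "chm13_to_grch38.chain.gz"},
  "output": "comparison.txt"
}
""")
print(check_config(example))
```

The file names in `example` are placeholders; an empty list means none of the checks failed.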
Parameters
- -i: Input config JSON file
- --merge: Merge concordant SVs across references and write a unified VCF
- --multiplat: Use multi-platform genotyping information during matching
- -m / --min_echo_score: Minimum echo score to consider two SVs a match (default: 0.5)
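The graph-based step can be pictured as linking SVs that share supporting reads and then grouping the linked SVs. The sketch below is only an illustration: the source does not define the echo score, so `echo_score` here is a stand-in (shared reads divided by the size of the smaller read set), and the function names and input layout are hypothetical. EchoSV lists networkx as a dependency; this sketch uses a small union-find instead so that it runs with the standard library alone, and it does not restrict links to SVs from different references.

```python
def echo_score(reads_a, reads_b):
    """Stand-in score: fraction of the smaller read set shared by both SVs."""
    a, b = set(reads_a), set(reads_b)
    if not a or not b:
        return 0.0
    return len(a & b) / min(len(a), len(b))


def match_groups(svs, min_echo_score=0.5):
    """Group SV ids connected by a score at or above the threshold.

    svs: {sv_id: iterable of supporting read names} (hypothetical layout).
    Returns a sorted list of groups (connected components of the match graph).
    """
    parent = {sv_id: sv_id for sv_id in svs}

    def find(x):
        # Follow parent links to the root, compressing the path on the way.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    ids = sorted(svs)
    for i, u in enumerate(ids):
        for v in ids[i + 1:]:
            if echo_score(svs[u], svs[v]) >= min_echo_score:
                parent[find(u)] = find(v)  # add an edge by merging components

    groups = {}
    for sv_id in ids:
        groups.setdefault(find(sv_id), []).append(sv_id)
    return sorted(groups.values())
```

Raising the threshold splits weakly connected SVs into separate groups, while lowering it merges more of them.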
License
This project is licensed under the MIT License — see the LICENSE file for details.
Contact
Feel free to open an issue on GitHub or contact Yuwei Zhang (yuwei_zhang@hms.harvard.edu) if you have any questions about EchoSV.
File details
Details for the file echosv-1.0.tar.gz.
File metadata
- Download URL: echosv-1.0.tar.gz
- Upload date:
- Size: 51.0 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.11.14
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 6796764a27ec5e1557e993fc02096adaaddc737eb4544627c3313adb9b0c1275 |
| MD5 | 970faba3edf61485fb27887ca6c81fe3 |
| BLAKE2b-256 | 1a6cc79e4a180ca8ff4f3ce73275d5bf28c9070782c48f3388646d2100ac653f |
File details
Details for the file echosv-1.0-py3-none-any.whl.
File metadata
- Download URL: echosv-1.0-py3-none-any.whl
- Upload date:
- Size: 55.8 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.11.14
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | e141c0eeb1e5dab7d37ba582213c06c3ed18b45c2e88847810da0bfbd9180758 |
| MD5 | ee70c7aef7b1e80839aebc44ca1d680e |
| BLAKE2b-256 | 3d5b665e9285f9243b7e40029067980107b2e267890c317fa5cd3c1cacd88c9f |