A novel tool for accurately merging haplotype-based SV calls and comparing SVs across reference genomes

Project description

EchoSV

EchoSV is a versatile tool for comparing and merging structural variant (SV) call sets generated using different reference genomes. It studies how SVs "echo" across these references through a hybrid workflow that combines liftover and graph-based matching.

EchoSV Workflow

Given two or more SV call sets from the same sample—each aligned to a different reference—EchoSV can perform two primary operations:

Compare: Generates a detailed comparison identifying overlapping variants and those exclusive to a specific reference, e.g., calls across GRCh38, CHM13, and a donor-specific assembly (DSA).
Merge: Consolidates multiple SV call sets into a single, unified output, e.g., merging two DSA haplotype-based call sets into one consolidated file.

Requirements
Installation
Usage
License
Contact

Requirements

EchoSV depends on the following Python packages:

pysam: Read and write BAM/CRAM files and VCF records for variant processing.
intervaltree: Efficiently store and query genomic intervals to detect overlapping SVs.
Biopython (Bio): Parse and manipulate sequence data during liftover steps.
scipy: Perform statistical analysis and numerical computations on variant metrics.
networkx: Construct and traverse graphs that model structural variant matches.
pandas: Tabular data manipulation for comparison outputs.
numpy: Numerical computations on variant metrics.
rich: Formatted terminal output and progress display.
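
Before installing, it can help to confirm which of these packages are already importable. The snippet below is a small convenience sketch and not part of EchoSV; the helper name `missing_modules` is made up for illustration, and the import names are taken from the list above (Biopython is imported as `Bio`).

```python
from importlib.util import find_spec

# Import names from the requirements list; Biopython is imported as "Bio".
REQUIRED_MODULES = [
    "pysam", "intervaltree", "Bio", "scipy",
    "networkx", "pandas", "numpy", "rich",
]


def missing_modules(names=REQUIRED_MODULES):
    """Return the import names that cannot be found in the current environment."""
    return [name for name in names if find_spec(name) is None]


if __name__ == "__main__":
    missing = missing_modules()
    if missing:
        print("Missing packages:", ", ".join(missing))
    else:
        print("All EchoSV dependencies are importable.")
```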

Installation

Option 1: From GitHub (recommended)

git clone git@github.com:parklab/EchoSV.git
cd EchoSV
pip install -r requirements.txt
pip install .

Option 2: Via PyPI

pip install echosv

Usage

The EchoSV workflow consists of four main steps: chain, merge (optional), genotype, and match. The instructions and examples below use the test data, which can be downloaded from Zenodo.

Step 0: Download and uncompress test data

Download the EchoSV test data echosv_test_data.tar.gz from Zenodo and decompress it:

tar -xzvf echosv_test_data.tar.gz

Step 1: Generate chains

The chain command generates a liftover chain file that maps coordinates from ref2 (the source assembly) to ref1 (the target reference). Before running chain, align ref2 against ref1 using minimap2's asm-to-asm mode and index the output:

minimap2 -a -x asm5 --cs ref1.fa ref2.fa \
    | samtools view -hSb - \
    | samtools sort -O BAM -o ref2_to_ref1.bam
samtools index ref2_to_ref1.bam

Then generate the chain file. EchoSV automatically looks for a pre-built index of the ref2 FASTA to obtain the contig lengths; if none exists, it parses the FASTA directly (slower for large assemblies). You can generate an index with samtools faidx ref2.fa, or a sequence dictionary with samtools dict ref2.fa > ref2.fa.dict.

echosv chain \
    -b test_data/input_data/chm13_to_grch38.bam \
    -f test_data/input_data/chm13.fa \
    -o test_data/chm13_to_grch38.chain.gz

Parameters

-b: Path to the ref2-to-ref1 alignment (BAM format, must be indexed)
-f: Path to the ref2 reference FASTA
-o: Output chain file for coordinate mapping (a coverage BED file is also written alongside)
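
To illustrate why a pre-built index speeds up the contig-length lookup described above, here is a minimal sketch of the idea. It is not EchoSV's actual code: the function name `contig_lengths` is hypothetical, only the `.fai` index (not the `.dict` file) is handled, and the fallback assumes an uncompressed FASTA.

```python
import os


def contig_lengths(fasta_path):
    """Return {contig: length}, preferring fasta_path + '.fai' over a full FASTA scan."""
    lengths = {}
    fai_path = fasta_path + ".fai"
    if os.path.exists(fai_path):
        # .fai columns: name, length, offset, bases per line, bytes per line
        with open(fai_path) as fai:
            for line in fai:
                if line.strip():
                    name, length = line.split("\t")[:2]
                    lengths[name] = int(length)
        return lengths

    # Fallback: read every sequence line, which is slow for large assemblies.
    name = None
    with open(fasta_path) as fasta:
        for line in fasta:
            line = line.strip()
            if line.startswith(">"):
                name = line[1:].split()[0]  # contig name ends at first whitespace
                lengths[name] = 0
            elif name is not None:
                lengths[name] += len(line)
    return lengths
```

With an index present, only one short line per contig is read, whereas the fallback has to touch every base in the assembly.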

Step 2: Merge SV call sets from the same reference (optional)

The merge command merges multiple SV call sets that were called against the same reference genome (e.g., outputs from multiple callers). This step is typically run before genotype and match so that each reference has a single unified call set for cross-reference comparison. Scripts to reproduce the analysis from our paper are available in scripts/.

# Merge multiple VCFs from the same reference into a single call set
echosv merge \
    -i grch38_colo829_caller1.vcf.gz grch38_colo829_caller2.vcf.gz [...] \
    -o grch38_colo829_svs.vcf.gz \
    --merge --new

# Extract high-confidence SVs (≥4 supporting callers, ≥2 platforms)
echosv merge \
    -i grch38_colo829_svs.vcf.gz \
    -o grch38_colo829_svs_highconf.vcf.gz \
    --extract

Pre-built gap BED files for the references used in this study are provided in the src/echosv/beds/ directory; a custom gap BED can be passed with --gaps-bed.

Parameters

-i: Input VCF file(s) — space-separated list for --merge, single file for --extract
-o: Output file path
-a / --atol: Positional tolerance in bp for matching breakpoints (default: 500)
-s / --sizetol: Minimum size-similarity ratio for matching SVs (default: 0.5)
-c / --checksvtype: Require matching SV types when merging
--merge: Write a merged VCF from the comparison result
--new: Build merged VCF records from scratch (use with --merge)
--extract: Extract high-confidence SVs (≥4 supporting callers and ≥2 platforms)
--gaps-bed: BED file of reference gap / N regions; SVs near gaps are excluded when using --extract
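
The tolerance parameters above can be pictured with a small sketch. This is an illustration under stated assumptions rather than EchoSV's implementation: the `SV` record, the single-position comparison, and the min/max definition of the size-similarity ratio are all assumptions made for the example. Only the default values (500 bp, 0.5) and the extraction thresholds (at least 4 callers and at least 2 platforms) come from the parameter list.

```python
from dataclasses import dataclass


@dataclass
class SV:
    """Hypothetical minimal SV record used only for this example."""
    chrom: str
    pos: int
    length: int
    svtype: str


def is_same_sv(a, b, atol=500, sizetol=0.5, check_svtype=False):
    """Decide whether two calls plausibly describe the same SV.

    Assumed logic: same chromosome, positions within `atol` bp, and a
    smaller/larger length ratio of at least `sizetol`.
    """
    if a.chrom != b.chrom:
        return False
    if check_svtype and a.svtype != b.svtype:
        return False
    if abs(a.pos - b.pos) > atol:
        return False
    smaller, larger = sorted((abs(a.length), abs(b.length)))
    ratio = smaller / larger if larger else 1.0
    return ratio >= sizetol


def is_high_confidence(n_callers, n_platforms, min_callers=4, min_platforms=2):
    """Apply the thresholds quoted for --extract to caller/platform counts."""
    return n_callers >= min_callers and n_platforms >= min_platforms


call_a = SV("chr1", 1000, 300, "DEL")
call_b = SV("chr1", 1400, 200, "DEL")
print(is_same_sv(call_a, call_b))  # 400 bp apart, size ratio 200/300
```

In this example the two deletions are 400 bp apart and their size ratio is about 0.67, so they pass both default thresholds.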

Step 3: Collect supporting reads

The genotype command collects supporting reads for each SV from BAM files and annotates the VCF with allele-frequency and read-name fields used by the graph-based matching in Step 4.

echosv genotype --longread \
    -i test_data/input_data/grch38_colo829_somatic_svs.vcf.gz \
    -b test_data/input_data/chm13_to_grch38.bam \
    -o test_data/grch38_colo829_genotyped.vcf.gz

Parameters

--longread: Collect supporting reads from long-read alignments
--shortread: Collect supporting reads from short-read alignments
-i: Input SV VCF file
-b: BAM file(s) — multiple BAMs can be provided space-separated
-o: Output VCF with annotated supporting-read information
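
When genotyping several call sets, the command line can be assembled from Python. The sketch below only composes the flags documented above; the helper name `genotype_command` and the file names in the usage lines are hypothetical, and nothing is executed until the list is passed to `subprocess.run`.

```python
import shlex


def genotype_command(vcf, bams, out_vcf, read_type="longread"):
    """Build the argument list for `echosv genotype` from the documented flags."""
    if read_type not in ("longread", "shortread"):
        raise ValueError("read_type must be 'longread' or 'shortread'")
    if not bams:
        raise ValueError("at least one BAM file is required")
    # Multiple BAMs follow -b as separate, space-separated arguments.
    return ["echosv", "genotype", f"--{read_type}",
            "-i", vcf, "-b", *bams, "-o", out_vcf]


# Hypothetical file names, for illustration only.
cmd = genotype_command("calls.vcf.gz", ["run1.bam", "run2.bam"], "calls_genotyped.vcf.gz")
print(shlex.join(cmd))
# To execute once EchoSV is installed: subprocess.run(cmd, check=True)
```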

Step 4: Match SVs across references

The match command compares SV call sets across different reference genomes using a two-step hybrid approach: liftover-based coordinate matching followed by graph-based matching on shared supporting reads (echo score).

# Compare SV call sets and report concordant / reference-exclusive variants
echosv match -i test_data/test_colo829_config.json

# Compare SV call sets between DSA haplotypes and also produce a merged DSA-based VCF
echosv match -i dsa_merge_colo829_config.json --merge

The input is a JSON config file specifying reference labels, genotyped VCFs, chain files, and the output path. The example below, based on test_data/test_colo829_config.json, shows a working configuration.

Example JSON

{
    "refs":   { "1": "grch38", "2": "chm13", "3": "dsa" },
    "vcfs":   { "1": "./test_data/grch38_colo829_genotyped.vcf.gz",
                "2": "./test_data/chm13_colo829_genotyped.vcf.gz",
                "3": "./test_data/dsa_colo829_genotyped.vcf.gz" },
    "chains": { "2_to_1": "./test_data/chm13_to_grch38.chain.gz",
                "3_to_1": "./test_data/colo829bl_hap*_grch38.chain.gz" },
    "output": "./test_data/colo829_svs_comparison.txt"
}
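
A config like the one above can be sanity-checked before it is passed to echosv match. The checker below is a sketch based only on the structure visible in the example; the rules it enforces (every reference ID has a VCF, chain keys look like SOURCE_to_TARGET and use known reference IDs) are assumptions inferred from that example, not a specification of what EchoSV itself validates.

```python
import json


def check_match_config(config):
    """Return a list of problems found in a match config dict (empty if none)."""
    problems = [f"missing top-level key: {key}"
                for key in ("refs", "vcfs", "chains", "output")
                if key not in config]
    if problems:
        return problems

    refs = config["refs"]
    for ref_id, label in refs.items():
        if ref_id not in config["vcfs"]:
            problems.append(f"no VCF listed for reference {ref_id} ({label})")

    for chain_key in config["chains"]:
        source, sep, target = chain_key.partition("_to_")
        if not sep:
            problems.append(f"chain key not of the form SOURCE_to_TARGET: {chain_key}")
        elif source not in refs or target not in refs:
            problems.append(f"chain key refers to an unknown reference: {chain_key}")
    return problems


def check_match_config_file(path):
    """Load a JSON config from disk and run the same checks."""
    with open(path) as handle:
        return check_match_config(json.load(handle))
```

An empty list means the structural checks passed; file existence and glob patterns such as hap* are deliberately not examined here.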

Parameters

-i: Input config JSON file
--merge: Merge concordant SVs across references and write a unified VCF
--multiplat: Use multi-platform genotyping information during matching
-m / --min_echo_score: Minimum echo score to consider two SVs a match (default: 0.5)
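
The idea of matching SVs through shared supporting reads can be sketched as follows. The exact definition of the echo score is not given here, so this example substitutes the Jaccard index of the two read-name sets as a stand-in, and groups SVs with a small union-find instead of a networkx graph. The function names and SV identifiers are hypothetical; only the 0.5 default threshold comes from the parameter list.

```python
from itertools import combinations


def echo_like_score(reads_a, reads_b):
    """Jaccard index of two read-name sets (an assumed stand-in for the echo score)."""
    a, b = set(reads_a), set(reads_b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def group_matches(support, min_score=0.5):
    """Group SV IDs whose supporting reads overlap enough.

    `support` maps an SV ID to its supporting read names. SVs are linked when
    their score reaches `min_score`; connected SVs end up in one group.
    """
    parent = {sv: sv for sv in support}

    def find(x):
        # Follow parent links to the group representative, compressing the path.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for u, v in combinations(support, 2):
        if echo_like_score(support[u], support[v]) >= min_score:
            parent[find(u)] = find(v)

    groups = {}
    for sv in support:
        groups.setdefault(find(sv), []).append(sv)
    return sorted(sorted(group) for group in groups.values())


support = {
    "grch38:sv1": {"read1", "read2", "read3"},
    "chm13:sv7": {"read2", "read3", "read4"},
    "dsa:sv2": {"read9"},
}
print(group_matches(support))
```

Here the GRCh38 and CHM13 calls share two of four distinct reads (score 0.5) and are grouped, while the DSA call shares none and stays on its own. A fuller version would probably compare only calls from different references.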

License

This project is licensed under the MIT License — see the LICENSE file for details.

Contact

Feel free to open an issue on GitHub or contact Yuwei Zhang (yuwei_zhang@hms.harvard.edu) if you have any questions about EchoSV.

Project details

This version

1.0

May 15, 2026

Download files

Source Distribution

echosv-1.0.tar.gz (51.0 kB)

Uploaded May 15, 2026 Source

Built Distribution

echosv-1.0-py3-none-any.whl (55.8 kB)

Uploaded May 15, 2026 Python 3

File details

Details for the file echosv-1.0.tar.gz.

File metadata

Download URL: echosv-1.0.tar.gz
Upload date: May 15, 2026
Size: 51.0 kB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.2.0 CPython/3.11.14

File hashes

Hashes for echosv-1.0.tar.gz
Algorithm	Hash digest
SHA256	`6796764a27ec5e1557e993fc02096adaaddc737eb4544627c3313adb9b0c1275`
MD5	`970faba3edf61485fb27887ca6c81fe3`
BLAKE2b-256	`1a6cc79e4a180ca8ff4f3ce73275d5bf28c9070782c48f3388646d2100ac653f`

File details

Details for the file echosv-1.0-py3-none-any.whl.

File metadata

Download URL: echosv-1.0-py3-none-any.whl
Upload date: May 15, 2026
Size: 55.8 kB
Tags: Python 3
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.2.0 CPython/3.11.14

File hashes

Hashes for echosv-1.0-py3-none-any.whl
Algorithm	Hash digest
SHA256	`e141c0eeb1e5dab7d37ba582213c06c3ed18b45c2e88847810da0bfbd9180758`
MD5	`ee70c7aef7b1e80839aebc44ca1d680e`
BLAKE2b-256	`3d5b665e9285f9243b7e40029067980107b2e267890c317fa5cd3c1cacd88c9f`
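
A downloaded file can be checked against the SHA256 digests listed above with Python's standard hashlib module. This is a generic verification sketch; the helper names are illustrative, and the expected digests are copied from the two hash tables in this section.

```python
import hashlib

# SHA256 digests copied from the file hash tables above.
EXPECTED_SHA256 = {
    "echosv-1.0.tar.gz":
        "6796764a27ec5e1557e993fc02096adaaddc737eb4544627c3313adb9b0c1275",
    "echosv-1.0-py3-none-any.whl":
        "e141c0eeb1e5dab7d37ba582213c06c3ed18b45c2e88847810da0bfbd9180758",
}


def sha256_of(path, chunk_size=1 << 20):
    """Compute the SHA256 hex digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_published_hash(path, filename):
    """Compare a local file against the published digest for `filename`."""
    return sha256_of(path) == EXPECTED_SHA256[filename]
```

For example, `matches_published_hash("echosv-1.0.tar.gz", "echosv-1.0.tar.gz")` should return True for an intact download of the source distribution.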
