A novel tool for accurately merging haplotype-based SV calls and comparing SVs across reference genomes

EchoSV

EchoSV is a versatile tool for comparing and merging structural variant (SV) call sets generated using different reference genomes. It studies how SVs "echo" across these references through a hybrid workflow that combines liftover and graph-based matching.

EchoSV Workflow

Given two or more SV call sets from the same sample—each aligned to a different reference—EchoSV can perform two primary operations:

  • Compare: Generates a detailed comparison identifying overlapping variants and those exclusive to a specific reference, e.g., calls across GRCh38, CHM13, and a donor-specific assembly (DSA).
  • Merge: Consolidates multiple SV call sets into a single, unified output, e.g., merging two DSA haplotype-based call sets into one consolidated file.

Requirements

EchoSV depends on the following Python packages:

  • pysam: Read and write BAM/CRAM files and VCF records for variant processing.
  • intervaltree: Efficiently store and query genomic intervals to detect overlapping SVs.
  • Biopython (Bio): Parse and manipulate sequence data during liftover steps.
  • scipy: Perform statistical analysis and numerical computations on variant metrics.
  • networkx: Construct and traverse graphs that model structural variant matches.
  • pandas: Tabular data manipulation for comparison outputs.
  • numpy: Numerical computations on variant metrics.
  • rich: Formatted terminal output and progress display.
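
Before installing, it can help to see which of the packages above are already available in the current environment. The following is a minimal sketch, not part of EchoSV; the helper name `missing_modules` is hypothetical, and the import names are the usual ones for these packages (Biopython is imported as `Bio`).

```python
from importlib.util import find_spec

# Import names for the packages listed above (Biopython is imported as "Bio").
REQUIRED_MODULES = ["pysam", "intervaltree", "Bio", "scipy",
                    "networkx", "pandas", "numpy", "rich"]


def missing_modules(names=REQUIRED_MODULES):
    """Return the top-level module names that cannot be found (hypothetical helper)."""
    return [name for name in names if find_spec(name) is None]


if __name__ == "__main__":
    missing = missing_modules()
    print("missing:", ", ".join(missing) if missing else "none")
```

Anything reported as missing should be pulled in by either installation option below.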

Installation

Option 1: From GitHub (recommended)

git clone git@github.com:parklab/EchoSV.git
cd EchoSV
pip install -r requirements.txt
pip install .

Option 2: Via PyPI

pip install echosv

Usage

The EchoSV workflow consists of four main steps: chain, merge (optional), genotype, and match. The instructions and examples below use the test data, which can be downloaded from Zenodo.

Step 0: Download and uncompress test data

Download the EchoSV test data echosv_test_data.tar.gz from Zenodo and decompress it:

tar -xzvf echosv_test_data.tar.gz

Step 1: Generate chains

The chain command generates a liftover chain file that maps coordinates from ref2 (the source assembly) to ref1 (the target reference). Before running chain, align ref2 against ref1 using minimap2's asm-to-asm mode and index the output:

minimap2 -a -x asm5 --cs ref1.fa ref2.fa \
    | samtools view -hSb - \
    | samtools sort -O BAM -o ref2_to_ref1.bam
samtools index ref2_to_ref1.bam

Then generate the chain file. To obtain contig lengths, EchoSV automatically looks for a pre-built index of the ref2 FASTA; if none exists, it parses the FASTA directly (slower for large assemblies). You can build an index with samtools faidx ref2.fa or samtools dict ref2.fa > ref2.fa.dict.

echosv chain \
    -b test_data/input_data/chm13_to_grch38.bam \
    -f test_data/input_data/chm13.fa \
    -o test_data/chm13_to_grch38.chain.gz

Parameters:

  • -b: Path to the ref2-to-ref1 alignment (BAM format, must be indexed)
  • -f: Path to the ref2 reference FASTA
  • -o: Output chain file for coordinate mapping (a coverage BED file is also written alongside)
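
The index-first lookup of contig lengths described above can be illustrated as follows. This is only a sketch of the general idea, not EchoSV's own code: the function name `contig_lengths` is hypothetical, it handles only the `.fai` index (not the `.dict` variant), and it assumes an uncompressed FASTA.

```python
import os


def contig_lengths(fasta_path):
    """Return {contig: length}, preferring a samtools .fai index if one exists."""
    fai_path = fasta_path + ".fai"
    lengths = {}
    if os.path.exists(fai_path):
        # .fai columns: name, length, offset, bases per line, bytes per line
        with open(fai_path) as fai:
            for line in fai:
                if not line.strip():
                    continue
                name, length = line.split("\t")[:2]
                lengths[name] = int(length)
        return lengths
    # Fallback: scan the whole FASTA (slow for large assemblies).
    name = None
    with open(fasta_path) as fasta:
        for line in fasta:
            line = line.strip()
            if line.startswith(">"):
                # Contig name is the header text up to the first whitespace.
                name = line[1:].split()[0]
                lengths[name] = 0
            elif name is not None:
                lengths[name] += len(line)
    return lengths
```

Reading the index touches one short line per contig, whereas the fallback has to read every base, which is why building an index first pays off for large assemblies.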

Step 2: Merge SV call sets from the same reference (optional)

The merge command merges multiple SV call sets that were called against the same reference genome (e.g., outputs from multiple callers) into a single call set. This step is typically run before genotype and match so that each reference has one unified call set for cross-reference comparison. Scripts to reproduce the analysis from our paper are available in scripts/.

# Merge multiple VCFs from the same reference into a single call set
echosv merge \
    -i grch38_colo829_caller1.vcf.gz grch38_colo829_caller2.vcf.gz [...] \
    -o grch38_colo829_svs.vcf.gz \
    --merge --new

# Extract high-confidence SVs (≥4 supporting callers, ≥2 platforms)
echosv merge \
    -i grch38_colo829_svs.vcf.gz \
    -o grch38_colo829_svs_highconf.vcf.gz \
    --extract

Pre-built gap BED files for the references used in this study are provided in the src/echosv/beds/ directory; a custom gap BED can be supplied with the gap BED option (--gaps-bed in the parameter list below).
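
The high-confidence criteria used by --extract (at least 4 supporting callers, at least 2 platforms, and not near a reference gap) can be sketched as a simple filter. This is an illustration under stated assumptions, not EchoSV's implementation: the record layout (`callers`, `platforms`, `chrom`, `pos`), the function names, and the `margin` that defines "near a gap" are all hypothetical, since the README does not specify them.

```python
def near_gap(chrom, pos, gaps, margin):
    """True if pos lies within `margin` bp of a gap interval on chrom.

    `gaps` maps chromosome -> list of (start, end) half-open BED intervals.
    """
    return any(start - margin <= pos < end + margin
               for start, end in gaps.get(chrom, []))


def is_high_confidence(sv, gaps, min_callers=4, min_platforms=2, margin=1000):
    """Apply the caller, platform, and gap criteria to one SV record.

    `margin` is an assumed value for illustration only.
    """
    return (len(set(sv["callers"])) >= min_callers
            and len(set(sv["platforms"])) >= min_platforms
            and not near_gap(sv["chrom"], sv["pos"], gaps, margin))
```

For example, an SV supported by four callers on two platforms passes unless its position falls inside, or within `margin` bp of, an interval in the gap BED.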

Parameters:

  • -i: Input VCF file(s) — space-separated list for --merge, single file for --extract
  • -o: Output file path
  • -a / --atol: Positional tolerance in bp for matching breakpoints (default: 500)
  • -s / --sizetol: Minimum size-similarity ratio for matching SVs (default: 0.5)
  • -c / --checksvtype: Require matching SV types when merging
  • --merge: Write a merged VCF from the comparison result
  • --new: Build merged VCF records from scratch (use with --merge)
  • --extract: Extract high-confidence SVs (≥4 supporting callers and ≥2 platforms)
  • --gaps-bed: BED file of reference gap / N regions; SVs near gaps are excluded when using --extract
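
To make the roles of --atol, --sizetol, and --checksvtype concrete, here is a minimal sketch of a pairwise matching test. It assumes that two calls match when their breakpoints lie within `atol` bp of each other and the ratio of the smaller to the larger SV length is at least `sizetol`; the exact rule EchoSV applies is not stated in this README, and the dictionary keys and function names are hypothetical.

```python
def size_similarity(len_a, len_b):
    """Ratio of the smaller to the larger SV length, in [0, 1]."""
    a, b = abs(len_a), abs(len_b)
    if max(a, b) == 0:
        return 1.0
    return min(a, b) / max(a, b)


def svs_match(sv_a, sv_b, atol=500, sizetol=0.5, check_svtype=False):
    """Decide whether two SV calls on the same reference look like one event."""
    if sv_a["chrom"] != sv_b["chrom"]:
        return False
    if check_svtype and sv_a["svtype"] != sv_b["svtype"]:
        return False
    if abs(sv_a["pos"] - sv_b["pos"]) > atol:
        return False
    return size_similarity(sv_a["svlen"], sv_b["svlen"]) >= sizetol
```

Under these assumptions, a 1000 bp deletion at position 10000 and a 600 bp deletion at position 10300 match with the defaults (300 bp apart, size ratio 0.6), while shrinking the second call to 400 bp drops the ratio to 0.4 and breaks the match.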

Step 3: Collect supporting reads

The genotype command collects supporting reads for each SV from BAM files and annotates the VCF with allele-frequency and read-name fields used by the graph-based matching in Step 4.

echosv genotype --longread \
    -i test_data/input_data/grch38_colo829_somatic_svs.vcf.gz \
    -b test_data/input_data/chm13_to_grch38.bam \
    -o test_data/grch38_colo829_genotyped.vcf.gz

Parameters:

  • --longread: Collect supporting reads from long-read alignments
  • --shortread: Collect supporting reads from short-read alignments
  • -i: Input SV VCF file
  • -b: BAM file(s) — multiple BAMs can be provided space-separated
  • -o: Output VCF with annotated supporting-read information

Step 4: Match SVs across references

The match command compares SV call sets across different reference genomes using a two-step hybrid approach: liftover-based coordinate matching followed by graph-based matching on shared supporting reads (echo score).

# Compare SV call sets and report concordant / reference-exclusive variants
echosv match -i test_data/test_colo829_config.json

# Compare SV call sets between DSA haplotypes and also produce a merged DSA-based VCF
echosv match -i dsa_merge_colo829_config.json --merge

The input is a JSON config file specifying reference labels, genotyped VCFs, chain files, and the output path. See test_data/test_colo829_config.json below for a working example.

Example JSON
{
    "refs":   { "1": "grch38", "2": "chm13", "3": "dsa" },
    "vcfs":   { "1": "./test_data/grch38_colo829_genotyped.vcf.gz",
                "2": "./test_data/chm13_colo829_genotyped.vcf.gz",
                "3": "./test_data/dsa_colo829_genotyped.vcf.gz" },
    "chains": { "2_to_1": "./test_data/chm13_to_grch38.chain.gz",
                "3_to_1": "./test_data/colo829bl_hap*_grch38.chain.gz" },
    "output": "./test_data/colo829_svs_comparison.txt"
}
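
A config with this layout can be sanity-checked before running match. The sketch below is not EchoSV's own validation: the function name `load_config` is hypothetical, and the rules it enforces (refs and vcfs share the same index keys, and each chain key has the form `<source>_to_<target>` with both indices present in refs) are inferred from the example above rather than from a documented schema.

```python
import json


def load_config(text):
    """Parse a match config and check that its sections are consistent."""
    cfg = json.loads(text)
    for key in ("refs", "vcfs", "chains", "output"):
        if key not in cfg:
            raise ValueError(f"missing config section: {key}")
    if set(cfg["refs"]) != set(cfg["vcfs"]):
        raise ValueError("refs and vcfs must use the same index keys")
    for name in cfg["chains"]:
        # Chain keys look like "2_to_1": source index, then target index.
        src, sep, dst = name.partition("_to_")
        if not sep or src not in cfg["refs"] or dst not in cfg["refs"]:
            raise ValueError(f"chain key {name!r} does not name two known refs")
    return cfg
```

Calling `load_config(open("test_data/test_colo829_config.json").read())` would return the parsed dictionary or raise a `ValueError` naming the first inconsistency found.
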

Parameters:

  • -i: Input config JSON file
  • --merge: Merge concordant SVs across references and write a unified VCF
  • --multiplat: Use multi-platform genotyping information during matching
  • -m / --min_echo_score: Minimum echo score to consider two SVs a match (default: 0.5)
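
The graph-based step can be pictured as follows: SVs are nodes, two SVs are linked when their shared-read score reaches the --min_echo_score threshold, and each connected group is treated as one event seen on several references. The sketch below illustrates that idea only. The README does not define the echo score, so a Jaccard overlap of read-name sets is used here as a stand-in, and the function names and SV identifiers are hypothetical.

```python
from itertools import combinations


def shared_read_score(reads_a, reads_b):
    """Jaccard overlap of two read-name sets (a stand-in, not EchoSV's formula)."""
    a, b = set(reads_a), set(reads_b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def match_groups(sv_reads, min_score=0.5):
    """Group SV ids linked by a score >= min_score (connected components)."""
    parent = {sv: sv for sv in sv_reads}

    def find(x):
        # Union-find lookup with path halving.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for a, b in combinations(sv_reads, 2):
        if shared_read_score(sv_reads[a], sv_reads[b]) >= min_score:
            parent[find(a)] = find(b)

    groups = {}
    for sv in sv_reads:
        groups.setdefault(find(sv), set()).add(sv)
    return sorted(groups.values(), key=lambda g: sorted(g))
```

With `{"grch38:sv1": {"r1", "r2", "r3"}, "chm13:sv7": {"r1", "r2", "r3", "r4"}, "dsa:sv2": {"r9"}}`, the first two calls share three of four reads and fall into one group, while the third stays on its own as a reference-exclusive call.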

License

This project is licensed under the MIT License — see the LICENSE file for details.

Contact

Feel free to open an issue on GitHub or contact Yuwei Zhang (yuwei_zhang@hms.harvard.edu) if you have any questions about EchoSV.
