paraphase: HiFi-based SMN1/SMN2 variant caller
Project description
Paraphase: HiFi-based SMN1/SMN2 variant caller
SMN1, the gene that causes spinal muscular atrophy, is considered a 'dark' region of the genome due to high sequence similarity with its paralog SMN2. Paraphase is a Python tool that takes HiFi BAMs as input (WGS or enrichment), assembles complete SMN1 and SMN2 haplotypes, determines copy numbers and makes phased variant calls for both genes. It also categorizes the haplotypes, enabling future haplotype-based screening of silent carriers (2+0). Please check out our preprint for more details about the method and our population-wide haplotype analysis.
Contact
If you need assistance or have suggestions, please don't hesitate to reach out by email or open a GitHub issue. Xiao Chen: xchen@pacificbiosciences.com
Dependencies
- samtools
- minimap2
- singularity (if you would like to run DeepVariant and generate vcfs on assembled haplotypes)
Installation
Paraphase can be installed through pip:
pip install paraphase
Alternatively, Paraphase can be installed from GitHub.
git clone https://github.com/PacificBiosciences/paraphase
cd paraphase
python setup.py install
Running the program
paraphase -b input.bam -o output_directory
Alternatively when you have a list of bam files
paraphase -l list.txt -o output_directory
Required parameters:
-b
: Input BAM file or-l
: List of BAM files (one per line)-o
: Output directory
Optional parameters:
-v
: If specified, Paraphase will run DeepVariant to produce VCFs for each haplotype (singularity is required).-c
: Config file, default config file isparaphase/data/smn1/config.yaml
.-t
: Number of threads, used when-l
is specified.-d
: File listing average genome depth per sample, with two columns, sample ID and depth values, separated by tab or space. This saves run time by skipping the step to calculate genome depth.--samtools
--minimap2
--singularity
The paths to samtools, minimaps and singularity (only required if -v
is specified) can be provided through the --samtools
, --minimap2
and --singularity
parameters or by modifying the tools
section of the config.yaml
file.
Note that currently only GRCh38 is supported. We will support GRCh37 in the future if there is request.
Interpreting the output
Paraphase produces a few output files in the directory specified by -o
, with the sample ID as the prefix.
_realigned_tagged.bam
: This BAM file can be loaded into IGV for visualization of haplotypes, see the next section.- If
-v
is specified, Paraphase will generate VCF files produced by DeepVariant. A VCF file is written for each haplotype, and there is also a_variants.vcf
file containing merged variants from all haplotypes. This is a nonstandard scenario with variable ploidy, and we will continue to improve the variant filtering and merging steps. Any suggestions are welcome. .json
: Main output file, summerizes haplotypes and variant calls for each sample. Details of the fields are explained below:smn1_cn
: copy number of SMN1, anull
call indicates that Paraphase finds only one haplotype but depth does not unambiguously support a copy number of one or two.smn2_cn
: copy number of SMN2, anull
call indicates that Paraphase finds only one haplotype but depth does not unambiguously support a copy number of one or two.smn2_del78_cn
: copy number of SMN2Δ7–8 (SMN2 with a deletion of Exon7-8)smn1_read_number
: number of reads containing c.840Csmn2_read_number
: number of reads containing c.840Tsmn2_del78_read_number
: number of reads containing the known deletion of Exon7-8 on SMN2smn1_haplotypes
: SMN1 haplotypes assembledsmn2_haplotypes
: SMN2 haplotypes assembledsmn2_del78_haplotypes
: SMN2Δ7–8 haplotypes assembledtwo_copy_haplotypes
: haplotypes that are present in two copies based on depth. In rare cases two haplotypes are identical and we infer that there exist two of them instead of one by checking the read depth.haplotype_details
: lists each haplotype and the variants it contains, followed by the boundary of the region that was resolved on the haplotype, as well as the haplogroup assignment. The variants listed here could be a subset of those called by DeepVariant, as we use a simple pileup method to call variants and we exclude homopolymer regions. For most accurate variant calls, please use the-v
option.
Visualizing the output
We can visualize the haplotypes by loading the output file _realigned_tagged.bam
into IGV and grouping reads by the HP
tag.
In this example, there are two copies of SMN1 and two copies of SMN2. All reads are aligned to SMN1. Reads in blue are uniquely assigned to a haplotype, while reads in gray can be assigned to more than one possible haplotype and a random one is selected (this happens when two haplotypes are identical over a region). The Unassigned
category contains reads that carry bases that do not agree with any haplotypes (this could be due to sequencing errors or incompletely assembled haplotypes).
In this second example, smn1hap1
is present in about twice as many reads as the other haplotypes, so Paraphase infers that there are two copies of SMN1 (the haplotype sequences are identical).
The examples
folders contains IGV sessions showing more examples.
Project details
Release history Release notifications | RSS feed
Download files
Download the file for your platform. If you're not sure which to choose, learn more about installing packages.
Source Distribution
Built Distribution
Hashes for paraphase-0.1.0-py3-none-any.whl
Algorithm | Hash digest | |
---|---|---|
SHA256 | 2a761933934ae349ed3633d37bfca6820bab74a8f393d804e6f0fbf0578df94e |
|
MD5 | de3fc6d1fa2a9439406dece06bd77255 |
|
BLAKE2b-256 | a1e774d0128b1baa89b9386dfa05b6e806e8bdcbd5f06ed5e4f4e9f2b05d9e56 |