sma-finder

A tool for diagnosing spinal muscular atrophy (SMA) using exome or genome sequencing data

These details have not been verified by PyPI

Project links

Homepage

GitHub Statistics

View statistics for this project via Libraries.io, or by using our public dataset on Google BigQuery

Project description

SMA Finder

SMA Finder is a tool for diagnosing spinal muscular atrophy (SMA) using exome or genome sequencing data.
It takes a reference sequence (FASTA) and 1 or more alignment files (CRAM or BAM) as input, evaluates reads that overlap the c.840 position in SMN1 and SMN2, and then reports whether the input sample(s) have complete loss of functional SMN1.

Running this tool on 13 positive controls and 10,434 negative controls showed 100% sensitivity and specificity (details below).

Limitations:

does not report SMA carrier status or SMN2 copy number.
does not detect the ~5% of cases caused by SMN1 loss-of-function mutations that do not involve the c.840 position
requires at least 14 reads to overlap the c.840 position in SMN1 plus SMN2 in order to make a call

Install

python3 -m pip install sma-finder

Example

Example command:

sma_finder --verbose --hg38-reference-fasta /ref/hg38.fa  sample1.cram

Command output:

Input args:
    --hg38-reference-fasta: /ref/hg38.fa
    --output-tsv: sample1.sma_finder_results.tsv
    CRAMS or BAMS: sample1.cram
----
Output row #1:
        filename_prefix                     sample1
        file_type                           cram
        genome_version                      hg38
        sample_id                           s1
        sma_status                          has SMA
        confidence_score                    168
        c840_reads_with_smn1_base_C         0
        c840_total_reads                    174
Wrote 1 rows to sample1.sma_finder_results.tsv

Usage

Usage help text:

sma_finder --help

usage: sma_finder.py [-h] [--hg37-reference-fasta HG37_REFERENCE_FASTA]
                     [--hg38-reference-fasta HG38_REFERENCE_FASTA]
                     [--t2t-reference-fasta T2T_REFERENCE_FASTA]
                     [-o OUTPUT_TSV] [-v]
                     cram_or_bam_path [cram_or_bam_path ...]

positional arguments:
  cram_or_bam_path      One or more CRAM or BAM file paths

optional arguments:
  -h, --help            show this help message and exit
  --hg37-reference-fasta HG37_REFERENCE_FASTA
                        HG37 reference genome FASTA path. This should be
                        specified if the input bam or cram is aligned to HG37.
  --hg38-reference-fasta HG38_REFERENCE_FASTA
                        HG38 reference genome FASTA path. This should be
                        specified if the input bam or cram is aligned to HG38.
  --t2t-reference-fasta T2T_REFERENCE_FASTA
                        T2T reference genome FASTA path. This should be
                        specified if the input bam or cram is aligned to the
                        CHM13 telomere-to-telomere benchmark.
  -o OUTPUT_TSV, --output-tsv OUTPUT_TSV
                        Optional output tsv file path
  -v, --verbose         Whether to print extra details during the run

Output

The output .tsv contains one row per input CRAM or BAM file and has the following columns:

filename_prefix	CRAM or BAM filename prefix. If the input file is /path/sample1.cram this would be "sample1".
file_type	"cram" or "bam"
genome_version	"hg37", "hg38", or "t2t"
sample_id	sample id from the CRAM or BAM file header (parsed from the read group metadata)
sma_status	possible values are: "has SMA" "does not have SMA" "not enough coverage at SMN c.840 position"
confidence_score	PHRED-scaled integer score measuring the level of confidence that the sma_status is correct. The bigger the score, the higher the confidence. It is calculated in a similar way to the PL field in GATK HaplotypeCaller genotypes.
c840_reads_with_smn1_base_C	number of reads that have a 'C' nucleotide at the c.840 position in SMN1 plus SMN2
c840_total_reads	total number of reads overlapping the c.840 position in SMN1 plus SMN2

Details

This poster on SMA Finder was presented at the SVAR22 conference:

Combining results from multiple samples

After running multiple SMA Finder instances, it is possible to combine the output tables into a single table using the following shell command:

combined_table_filename=combined_results.tsv
head -n 1 $(ls *.tsv | head -n 1) > ${combined_table_filename}   # get table header from the 1st table 
for i in *.tsv; do
    tail -n +2 $i >> ${combined_table_filename}    # concatenate all tables
done

Plotting combined results

A scatter plot showing read counts from many samples can be generated using the plot_SMN1_SMN2_scatter command:

Project details

These details have not been verified by PyPI

Project links

Homepage

GitHub Statistics

View statistics for this project via Libraries.io, or by using our public dataset on Google BigQuery

Release history Release notifications | RSS feed

1.4.4

Jan 31, 2024

This version

1.4.3

Jan 29, 2024

1.4.2

Jan 16, 2024

1.4.1

Jan 16, 2024

1.4

Jan 16, 2024

1.3

Mar 14, 2023

1.2

Feb 5, 2023

1.1

Oct 30, 2022

1.0.2

Oct 17, 2022

1.0.1

Oct 17, 2022

1.0

Sep 20, 2022

Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

sma_finder-1.4.3.tar.gz (13.2 kB view hashes)

Uploaded Jan 29, 2024 Source

Built Distribution

sma_finder-1.4.3-py3-none-any.whl (13.8 kB view hashes)

Uploaded Jan 29, 2024 Python 3

Hashes for sma_finder-1.4.3.tar.gz

Hashes for sma_finder-1.4.3.tar.gz
Algorithm	Hash digest
SHA256	`659f4321444a4beb0878478eb0269c2c5977e2be586c34aa182eb00671219219`
MD5	`000341315705e1160b5d4b057299083d`
BLAKE2b-256	`a6aacea2a8d0ff2be5ffea0db6e30797c5f991b0847d9cf684ed968f18ef5aa2`

Hashes for sma_finder-1.4.3-py3-none-any.whl

Hashes for sma_finder-1.4.3-py3-none-any.whl
Algorithm	Hash digest
SHA256	`4527f9dc6753d3fa9b57ce437a189bd04c1851141bd8baf388ba55c8110986e4`
MD5	`f2a720b824d4266c8a7c97515588348a`
BLAKE2b-256	`2ed085439eb76dc108615cbed62155fb96f8a07e791756690fb56a1dd8bf7cb0`