Python module to convert SAM to MIDSV format.
midsv
midsv is a Python module to convert SAM to MIDSV format.
MIDSV (Match, Insertion, Deletion, Substitution, and inVersion) is a comma-separated format that represents the differences between a reference and a query. Each MIDSV string has as many comma-separated fields as the reference has nucleotides.
⚠️ MIDSV is intended for target amplicon sequences (10-100 kbp). It may crash by running out of memory when whole chromosomes are used as the reference.
MIDSV provides MIDSV, CSSPLIT, and QSCORE.
- MIDSV is a simple representation focusing on mutations
- CSSPLIT keeps original nucleotides
- QSCORE provides the Phred quality score of each nucleotide
MIDSV (formerly named MIDS) details are described in our paper.
Installation
From PyPI:
pip install midsv
From Bioconda:
conda install -c bioconda midsv
Usage
midsv.transform(
sam: list[list],
midsv: bool = True,
cssplit: bool = True,
qscore: bool = True) -> list[dict]
midsv.transform()
returns a list of dictionaries incudingQNAME
,RNAME
,MIDSV
,CSSPLIT
, andQSCORE
.MIDSV
,CSSPLIT
, andQSCORE
are comma-separated and have the same reference sequence length.
import midsv
# Perfect match
sam = [
['@SQ', 'SN:example', 'LN:10'],
['match', '0', 'example', '1', '60', '10M', '*', '0', '0', 'ACGTACGTAC', '0123456789', 'cs:Z:=ACGTACGTAC']
]
midsv.transform(sam)
# [{
# 'QNAME': 'match',
# 'RNAME': 'example',
# 'MIDSV': 'M,M,M,M,M,M,M,M,M,M',
# 'CSSPLIT': '=A,=C,=G,=T,=A,=C,=G,=T,=A,=C',
# 'QSCORE': '15,16,17,18,19,20,21,22,23,24'
# }]
# Insertion, deletion and substitution
sam = [
['@SQ', 'SN:example', 'LN:10'],
['indel_sub', '0', 'example', '1', '60', '5M3I1M2D2M', '*', '0', '0', 'ACGTGTTTCGT', '01234!!!567', 'cs:Z:=ACGT*ag+ttt=C-aa=GT']
]
midsv.transform(sam)
# [{
# 'QNAME': 'indel_sub',
# 'RNAME': 'example',
# 'MIDSV': 'M,M,M,M,S,3M,D,D,M,M',
# 'CSSPLIT': '=A,=C,=G,=T,*AG,+T|+T|+T|=C,-A,-A,=G,=T',
# 'QSCORE': '15,16,17,18,19,0|0|0|20,-1,-1,21,22'
# }]
# Large deletion
sam = [
['@SQ', 'SN:example', 'LN:10'],
['large-deletion', '0', 'example', '1', '60', '2M', '*', '0', '0', 'AC', '01', 'cs:Z:=AC'],
['large-deletion', '0', 'example', '9', '60', '2M', '*', '0', '0', 'AC', '89', 'cs:Z:=AC']
]
midsv.transform(sam)
# [
# {'QNAME': 'large-deletion',
# 'RNAME': 'example',
# 'MIDSV': 'M,M,D,D,D,D,D,D,M,M',
# 'CSSPLIT': '=A,=C,N,N,N,N,N,N,=A,=C',
# 'QSCORE': '15,16,-1,-1,-1,-1,-1,-1,23,24'}
# ]
# Inversion
sam = [
['@SQ', 'SN:example', 'LN:10'],
['inversion', '0', 'example', '1', '60', '5M', '*', '0', '0', 'ACGTA', '01234', 'cs:Z:=ACGTA'],
['inversion', '16', 'example', '6', '60', '3M', '*', '0', '0', 'CGT', '567', 'cs:Z:=CGT'],
['inversion', '2048', 'example', '9', '60', '2M', '*', '0', '0', 'AC', '89', 'cs:Z:=AC']
]
midsv.transform(sam)
# [
# {'QNAME': 'inversion',
# 'RNAME': 'example',
# 'MIDSV': 'M,M,M,M,M,m,m,m,M,M',
# 'CSSPLIT': '=A,=C,=G,=T,=A,=c,=g,=t,=A,=C',
# 'QSCORE': '15,16,17,18,19,20,21,22,23,24'}
# ]
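The QSCORE values in the examples above are consistent with the standard Phred+33 encoding of the SAM QUAL field, where each score is the character's ASCII code minus 33. A minimal sketch of that decoding follows; the helper name `phred33` is hypothetical and not part of midsv.

```python
def phred33(qual: str) -> list[int]:
    """Decode a SAM QUAL string into Phred scores (ASCII code minus 33)."""
    return [ord(char) - 33 for char in qual]


print(phred33("0123456789"))  # [15, 16, 17, 18, 19, 20, 21, 22, 23, 24]
print(phred33("!!!"))         # [0, 0, 0]
```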
Operators
MIDSV
Op | Description |
---|---|
M | Identical sequence |
[1-9][0-9]* | Insertion to the reference |
D | Deletion from the reference |
S | Substitution |
N | Unknown |
[mdsn] | Inversion |
MIDSV represents an insertion as an integer (the number of inserted nucleotides) and appends the operator that follows it.
If a five-nucleotide insertion is followed by three matches, MIDSV returns 5M,M,M (not 5,M,M,M), since 5M,M,M keeps the number of comma-separated fields equal to the reference sequence length.
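Because the insertion count is attached to the operator that follows it, each comma-separated field can be split into an insertion length and a single operator. The sketch below is one way to do that; `parse_midsv` and `TOKEN` are hypothetical names, not part of the midsv API, and the pattern assumes only the operators listed in the table above.

```python
import re

# Hypothetical helper, not part of midsv: an optional insertion count
# followed by exactly one operator (lowercase marks an inversion).
TOKEN = re.compile(r"^(\d*)([MDSNmdsn])$")


def parse_midsv(midsv: str) -> list[tuple[int, str]]:
    """Return (inserted nucleotides, operator) for every reference position."""
    parsed = []
    for field in midsv.split(","):
        match = TOKEN.match(field)
        if match is None:
            raise ValueError(f"Unexpected MIDSV field: {field!r}")
        count, op = match.groups()
        parsed.append((int(count) if count else 0, op))
    return parsed


fields = parse_midsv("M,M,M,M,S,3M,D,D,M,M")
print(len(fields))                # 10, the reference length
print(sum(n for n, _ in fields))  # 3 inserted nucleotides
```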
CSSPLIT
Op | Regex | Description |
---|---|---|
= | [ACGTN] | Identical sequence |
+ | [ACGTN] | Insertion to the reference |
- | [ACGTN] | Deletion from the reference |
* | [ACGTN][ACGTN] | Substitution |
(none) | [acgtn] | Inversion (lowercase nucleotides) |
\| | | Separator of insertion sites |
CSSPLIT uses | to separate nucleotides in insertion sites.
Therefore, +A|+C|+G|+T|=A can be easily split into [+A, +C, +G, +T, =A] by "+A|+C|+G|+T|=A".split("|") in Python.
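The same split can be applied to a whole CSSPLIT string to locate insertion sites. The following is a minimal sketch using the CSSPLIT value from the insertion example in Usage; `insertion_sites` is a hypothetical helper, not a midsv function.

```python
def insertion_sites(cssplit: str) -> list[tuple[int, str, str]]:
    """Return (1-based reference position, inserted bases, anchor field)."""
    sites = []
    for position, field in enumerate(cssplit.split(","), start=1):
        # Every token before the last "|" is an inserted nucleotide such as "+T".
        *insertions, anchor = field.split("|")
        if insertions:
            inserted = "".join(token[1:] for token in insertions)
            sites.append((position, inserted, anchor))
    return sites


print(insertion_sites("=A,=C,=G,=T,*AG,+T|+T|+T|=C,-A,-A,=G,=T"))
# [(6, 'TTT', '=C')]
```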
QSCORE
Op | Description |
---|---|
-1 | Unknown |
\| | Separator at insertion sites |
QSCORE uses -1 at deleted or unknown nucleotides.
As with CSSPLIT, QSCORE uses | to separate quality scores in insertion sites.
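A QSCORE string can be turned into integers with two nested splits. This sketch uses the QSCORE value from the insertion example in Usage; `parse_qscore` is a hypothetical helper, not part of midsv.

```python
def parse_qscore(qscore: str) -> list[list[int]]:
    """Return one list of scores per reference position.

    Scores of inserted nucleotides come first; the last value belongs to the
    reference position itself (-1 for a deletion or unknown nucleotide).
    """
    return [[int(q) for q in field.split("|")] for field in qscore.split(",")]


scores = parse_qscore("15,16,17,18,19,0|0|0|20,-1,-1,21,22")
print(scores[5])  # [0, 0, 0, 20]
print(scores[6])  # [-1]
```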
Helper functions
Read SAM file
midsv.read_sam(path_of_sam: str | Path) -> list[list]
midsv.read_sam reads a SAM file into a list of lists.
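How midsv.read_sam parses the file internally is not described here. The sketch below assumes it amounts to splitting each line on tabs, which matches the list-of-lists shape used in the Usage examples. `read_sam_sketch` is a hypothetical stand-in, not the library function.

```python
from __future__ import annotations

import tempfile
from pathlib import Path


def read_sam_sketch(path_of_sam: str | Path) -> list[list[str]]:
    """Split every non-empty line of a SAM file on tabs (assumed behaviour)."""
    with open(path_of_sam) as handle:
        return [line.rstrip("\n").split("\t") for line in handle if line.strip()]


# Write a tiny SAM file to a temporary directory and read it back.
with tempfile.TemporaryDirectory() as tmpdir:
    path = Path(tmpdir) / "example.sam"
    path.write_text(
        "@SQ\tSN:example\tLN:10\n"
        "match\t0\texample\t1\t60\t10M\t*\t0\t0\tACGTACGTAC\t0123456789\tcs:Z:=ACGTACGTAC\n"
    )
    records = read_sam_sketch(path)

print(records[0])  # ['@SQ', 'SN:example', 'LN:10']
```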
Read/Write JSON Line (JSONL)
midsv.write_jsonl(dict: list[dict], path_of_jsonl: str | Path)
midsv.read_jsonl(path_of_jsonl: str | Path) -> list[dict]
Since midsv.transform returns a list of dictionaries, midsv.write_jsonl can write the result to a file in JSONL format.
Conversely, midsv.read_jsonl reads a JSONL file as a list of dictionaries.
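JSONL stores one JSON object per line, so a list of dictionaries maps onto it directly. The round trip below uses only the standard library to illustrate the format; it is a sketch of the idea, not the implementation of midsv.write_jsonl or midsv.read_jsonl.

```python
import json
import tempfile
from pathlib import Path

records = [{"QNAME": "match", "RNAME": "example", "MIDSV": "M,M,M,M,M,M,M,M,M,M"}]

with tempfile.TemporaryDirectory() as tmpdir:
    path = Path(tmpdir) / "example.jsonl"
    # Write one JSON object per line, which is what defines JSONL.
    with open(path, "w") as handle:
        for record in records:
            handle.write(json.dumps(record) + "\n")
    # Read it back: each line parses independently into a dictionary.
    with open(path) as handle:
        restored = [json.loads(line) for line in handle]

print(restored == records)  # True
```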