midsv
`midsv` is a Python module to convert SAM to MIDSV format.
MIDSV (Match, Insertion, Deletion, Substitution, and inVersion) is a comma-separated format representing the difference between a reference and a query with the same length as the reference.
⚠️ MIDSV is for the target amplicon sequence (10-100 kbp). It may crash when whole chromosomes are used as reference due to running out of memory.
MIDSV provides `MIDSV`, `CSSPLIT`, and `QSCORE`.
- `MIDSV` is a simple representation focusing on mutations
- `CSSPLIT` keeps original nucleotides
- `QSCORE` provides Phred quality score on each nucleotide
MIDSV (formerly named MIDS) details are described in our paper.
Installation
From PyPI:
pip install midsv
From Bioconda:
conda install -c bioconda midsv
Usage
midsv.transform(
sam: list[list],
midsv: bool = True,
cssplit: bool = True,
qscore: bool = True) -> list[dict]
`midsv.transform()` returns a list of dictionaries including `QNAME`, `RNAME`, `MIDSV`, `CSSPLIT`, and `QSCORE`.
`MIDSV`, `CSSPLIT`, and `QSCORE` are comma-separated strings, each with as many fields as the reference sequence is long.
import midsv
# Perfect match
sam = [
['@SQ', 'SN:example', 'LN:10'],
['match', '0', 'example', '1', '60', '10M', '*', '0', '0', 'ACGTACGTAC', '0123456789', 'cs:Z:=ACGTACGTAC']
]
midsv.transform(sam)
# [{
# 'QNAME': 'match',
# 'RNAME': 'example',
# 'MIDSV': 'M,M,M,M,M,M,M,M,M,M',
# 'CSSPLIT': '=A,=C,=G,=T,=A,=C,=G,=T,=A,=C',
# 'QSCORE': '15,16,17,18,19,20,21,22,23,24'
# }]
# Insertion, deletion and substitution
sam = [
['@SQ', 'SN:example', 'LN:10'],
['indel_sub', '0', 'example', '1', '60', '5M3I1M2D2M', '*', '0', '0', 'ACGTGTTTCGT', '01234!!!56789', 'cs:Z:=ACGT*ag+ttt=C-aa=GT']
]
midsv.transform(sam)
# [{
# 'QNAME': 'indel_sub',
# 'RNAME': 'example',
# 'MIDSV': 'M,M,M,M,S,3M,D,D,M,M',
# 'CSSPLIT': '=A,=C,=G,=T,*AG,+T|+T|+T|=C,-A,-A,=G,=T',
# 'QSCORE': '15,16,17,18,19,0|0|0|20,-1,-1,21,22'
# }]
# Large deletion
sam = [
['@SQ', 'SN:example', 'LN:10'],
['large-deletion', '0', 'example', '1', '60', '2M', '*', '0', '0', 'AC', '01', 'cs:Z:=AC'],
['large-deletion', '0', 'example', '9', '60', '2M', '*', '0', '0', 'AC', '89', 'cs:Z:=AC']
]
midsv.transform(sam)
# [
# {'QNAME': 'large-deletion',
# 'RNAME': 'example',
# 'MIDSV': 'M,M,D,D,D,D,D,D,M,M',
# 'CSSPLIT': '=A,=C,N,N,N,N,N,N,=A,=C',
# 'QSCORE': '15,16,-1,-1,-1,-1,-1,-1,23,24'}
# ]
# Inversion
sam = [
['@SQ', 'SN:example', 'LN:10'],
['inversion', '0', 'example', '1', '60', '5M', '*', '0', '0', 'ACGTA', '01234', 'cs:Z:=ACGTA'],
['inversion', '16', 'example', '6', '60', '3M', '*', '0', '0', 'CGT', '567', 'cs:Z:=CGT'],
['inversion', '2048', 'example', '9', '60', '2M', '*', '0', '0', 'AC', '89', 'cs:Z:=AC']
]
midsv.transform(sam)
# [
# {'QNAME': 'inversion',
# 'RNAME': 'example',
# 'MIDSV': 'M,M,M,M,M,m,m,m,M,M',
# 'CSSPLIT': '=A,=C,=G,=T,=A,=c,=g,=t,=A,=C',
# 'QSCORE': '15,16,17,18,19,20,21,22,23,24'}
# ]
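The claim that all three columns keep the reference length can be checked without running `midsv` at all. The sketch below copies the output of the "Insertion, deletion and substitution" example above and counts comma-separated fields; `field_counts` is a hypothetical helper for illustration, not part of the `midsv` API.

```python
# Output copied verbatim from the "Insertion, deletion and substitution" example.
record = {
    "QNAME": "indel_sub",
    "RNAME": "example",
    "MIDSV": "M,M,M,M,S,3M,D,D,M,M",
    "CSSPLIT": "=A,=C,=G,=T,*AG,+T|+T|+T|=C,-A,-A,=G,=T",
    "QSCORE": "15,16,17,18,19,0|0|0|20,-1,-1,21,22",
}


def field_counts(record: dict) -> dict:
    """Count comma-separated fields in each per-position column (hypothetical helper)."""
    return {key: len(record[key].split(",")) for key in ("MIDSV", "CSSPLIT", "QSCORE")}


# The reference in the example is declared as LN:10, so each count should be 10.
print(field_counts(record))
```

The three inserted nucleotides do not add fields because they are folded into the field of the next reference position.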
Operators
MIDSV
Op | Description |
---|---|
M | Identical sequence |
[1-9][0-9]* | Insertion to the reference |
D | Deletion from the reference |
S | Substitution |
N | Unknown |
[mdsn] | Inversion |
`MIDSV` represents an insertion as an integer and appends the operator of the following reference position to it.
If a five-nucleotide insertion is followed by three matches, `MIDSV` returns `5M,M,M` (not `5,M,M,M`), because `5M,M,M` keeps the number of comma-separated fields equal to the reference sequence length.
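One way to work with this encoding is to split each field into an insertion length and an operator. The following is a minimal sketch, not part of `midsv`: `parse_midsv` and the regular expression are assumptions for illustration, and the sketch assumes that an insertion length may also precede the lowercase (inversion) operators.

```python
import re

# Assumed tag shape: an optional insertion length, then exactly one operator letter.
MIDSV_TAG = re.compile(r"([1-9][0-9]*)?([MDSNmdsn])")


def parse_midsv(midsv: str) -> list[tuple[int, str]]:
    """Return (insertion_length, operator) for each reference position (hypothetical helper)."""
    parsed = []
    for tag in midsv.split(","):
        match = MIDSV_TAG.fullmatch(tag)
        if match is None:
            raise ValueError(f"unexpected MIDSV tag: {tag!r}")
        # A missing length means no insertion precedes this position.
        parsed.append((int(match.group(1) or 0), match.group(2)))
    return parsed


print(parse_midsv("5M,M,M"))  # [(5, 'M'), (0, 'M'), (0, 'M')]
```

Because the insertion is attached to a reference position, the parsed list always has one entry per reference nucleotide.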
CSSPLIT
Op | Regex | Description |
---|---|---|
= | [ACGTN] | Identical sequence |
+ | [ACGTN] | Insertion to the reference |
- | [ACGTN] | Deletion from the reference |
* | [ACGTN][ACGTN] | Substitution |
 | [acgtn] | Inversion |
\| | | Separator of insertion sites |
`CSSPLIT` uses `|` to separate nucleotides in insertion sites.
Therefore, `+A|+C|+G|+T|=A` can be easily split into `[+A, +C, +G, +T, =A]` by `"+A|+C|+G|+T|=A".split("|")` in Python.
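Applying the same idea to a whole `CSSPLIT` string gives one token list per reference position. This is a small sketch using only `str.split`; `split_cssplit` is a hypothetical name, and the input is the `CSSPLIT` value from the usage examples.

```python
def split_cssplit(cssplit: str) -> list[list[str]]:
    """Split on ',' per reference position, then on '|' within insertion sites."""
    return [field.split("|") for field in cssplit.split(",")]


tokens = split_cssplit("=A,=C,=G,=T,*AG,+T|+T|+T|=C,-A,-A,=G,=T")
print(tokens[5])  # ['+T', '+T', '+T', '=C']
```

Positions without an insertion yield a single-element list, so downstream code can treat every position uniformly.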
QSCORE
Op | Description |
---|---|
-1 | Unknown |
\| | Separator at insertion sites |
`QSCORE` uses `-1` at deletion or unknown nucleotides.
As with `CSSPLIT`, `QSCORE` uses `|` to separate quality scores in insertion sites.
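A `QSCORE` string can be turned into integers in the same way, skipping `-1` when summarizing quality. The sketch below is illustrative only: `parse_qscore` and `phred33` are hypothetical helpers, and the Phred+33 offset is an assumption that happens to match the usage examples (the character `0` becomes 15, and `!` becomes 0).

```python
def parse_qscore(qscore: str) -> list[list[int]]:
    """One list of scores per reference position; insertion sites hold several scores."""
    return [[int(q) for q in field.split("|")] for field in qscore.split(",")]


def phred33(qual: str) -> list[int]:
    """Decode a SAM QUAL string, assuming the Phred+33 encoding."""
    return [ord(char) - 33 for char in qual]


scores = parse_qscore("15,16,17,18,19,0|0|0|20,-1,-1,21,22")
# Drop the -1 placeholders before computing any statistic.
known = [q for field in scores for q in field if q != -1]
print(scores[5], len(known))   # [0, 0, 0, 20] 11
print(phred33("0123456789"))  # [15, 16, 17, 18, 19, 20, 21, 22, 23, 24]
```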
Helper functions
Read SAM file
midsv.read_sam(path_of_sam: str | Path) -> list[list]
`midsv.read_sam` reads a SAM file into a list of lists.
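The list-of-lists shape used in the usage examples corresponds to splitting each SAM line on tabs. The sketch below illustrates that shape with plain Python; `parse_sam_lines` is a hypothetical stand-in and is not claimed to be how `midsv.read_sam` is implemented.

```python
def parse_sam_lines(lines) -> list[list[str]]:
    """Split each non-empty SAM line into its tab-separated fields (hypothetical helper)."""
    return [line.rstrip("\n").split("\t") for line in lines if line.strip()]


# The "Perfect match" example written out as SAM text.
sam_text = (
    "@SQ\tSN:example\tLN:10\n"
    "match\t0\texample\t1\t60\t10M\t*\t0\t0\tACGTACGTAC\t0123456789\tcs:Z:=ACGTACGTAC\n"
)
rows = parse_sam_lines(sam_text.splitlines())
print(rows[0])  # ['@SQ', 'SN:example', 'LN:10']
```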
Read/Write JSON Line (JSONL)
midsv.write_jsonl(dict: list[dict], path_of_jsonl: str | Path)
midsv.read_jsonl(path_of_jsonl: str | Path) -> list[dict]
Since `midsv.transform` returns a list of dictionaries, `midsv.write_jsonl` outputs it to a file in JSONL format.
Conversely, `midsv.read_jsonl` reads JSONL as a list of dictionaries.
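JSONL stores one JSON object per line, so a round trip needs only the standard library. The following sketch shows the format with hypothetical helpers `write_jsonl_sketch` and `read_jsonl_sketch`; they are assumptions for illustration and may differ from the `midsv` functions in details such as encoding or error handling.

```python
import json
import tempfile
from pathlib import Path


def write_jsonl_sketch(records: list[dict], path) -> None:
    """Write one JSON object per line."""
    with open(path, "w", encoding="utf-8") as handle:
        for record in records:
            handle.write(json.dumps(record) + "\n")


def read_jsonl_sketch(path) -> list[dict]:
    """Read one JSON object per non-empty line."""
    with open(path, encoding="utf-8") as handle:
        return [json.loads(line) for line in handle if line.strip()]


records = [{"QNAME": "match", "RNAME": "example", "MIDSV": "M,M,M,M,M,M,M,M,M,M"}]
with tempfile.TemporaryDirectory() as tmpdir:
    path = Path(tmpdir) / "example.jsonl"
    write_jsonl_sketch(records, path)
    restored = read_jsonl_sketch(path)
print(restored == records)  # True
```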