Skip to main content

Python module to convert SAM to MIDSV format.

Project description

Licence Test Python PyPI Bioconda

midsv

midsv is a Python module to convert SAM to MIDSV format.

MIDSV (Match, Insertion, Deletion, Substitution, and inVersion) is a comma-separated format representing the difference between a reference and a query with the same length as the reference.

⚠️ MIDSV is for the target amplicon sequence (10-100 kbp). It may crash when whole chromosomes are used as reference due to running out of memory.

MIDSV provides MIDSV, CSSPLIT, and QSCORE.

  • MIDSV is a simple representation focusing on mutations
  • CSSPLIT keeps original nucleotides
  • QSCORE provides Phred quality score on each nucleotide

MIDSV (formerly named MIDS) details are described in our paper.

Installation

From PyPI:

pip install midsv

From Bioconda:

conda install -c bioconda midsv

Usage

midsv.transform(
    sam: list[list],
    midsv: bool = True,
    cssplit: bool = True,
    qscore: bool = True) -> list[dict]
  • midsv.transform() returns a list of dictionaries incuding QNAME, RNAME, MIDSV, CSSPLIT, and QSCORE.
  • MIDSV, CSSPLIT, and QSCORE are comma-separated and have the same reference sequence length.
import midsv

# Perfect match

sam = [
    ['@SQ', 'SN:example', 'LN:10'],
    ['match', '0', 'example', '1', '60', '10M', '*', '0', '0', 'ACGTACGTAC', '0123456789', 'cs:Z:=ACGTACGTAC']
    ]

midsv.transform(sam)
# [{
#   'QNAME': 'control',
#   'RNAME': 'example',
#   'MIDSV': 'M,M,M,M,M,M,M,M,M,M',
#   'CSSPLIT': '=A,=C,=G,=T,=A,=C,=G,=T,=A,=C',
#   'QSCORE': '15,16,17,18,19,20,21,22,23,24'
# }]

# Insertion, deletion and substitution

sam = [
    ['@SQ', 'SN:example', 'LN:10'],
    ['indel_sub', '0', 'example', '1', '60', '5M3I1M2D2M', '*', '0', '0', 'ACGTGTTTCGT', '01234!!!56789', 'cs:Z:=ACGT*ag+ttt=C-aa=GT']
    ]

midsv.transform(sam)
# [{
#   'QNAME': 'indel_sub',
#   'RNAME': 'example',
#   'MIDSV': 'M,M,M,M,S,3M,D,D,M,M',
#   'CSSPLIT': '=A,=C,=G,=T,*AG,+T|+T|+T|=C,-A,-A,=G,=T',
#   'QSCORE': '15,16,17,18,19,0|0|0|20,-1,-1,21,22'
# }]

# Large deletion

sam = [
    ['@SQ', 'SN:example', 'LN:10'],
    ['large-deletion', '0', 'example', '1', '60', '2M', '*', '0', '0', 'AC', '01', 'cs:Z:=AC'],
    ['large-deletion', '0', 'example', '9', '60', '2M', '*', '0', '0', 'AC', '89', 'cs:Z:=AC']
    ]

midsv.transform(sam)
# [
#   {'QNAME': 'large-deletion',
#   'RNAME': 'example',
#   'MIDSV': 'M,M,D,D,D,D,D,D,M,M',
#   'CSSPLIT': '=A,=C,N,N,N,N,N,N,=A,=C',
#   'QSCORE': '15,16,-1,-1,-1,-1,-1,-1,23,24'}
# ]

# Inversion

sam = [
    ['@SQ', 'SN:example', 'LN:10'],
    ['inversion', '0', 'example', '1', '60', '5M', '*', '0', '0', 'ACGTA', '01234', 'cs:Z:=ACGTA'],
    ['inversion', '16', 'example', '6', '60', '3M', '*', '0', '0', 'CGT', '567', 'cs:Z:=CGT'],
    ['inversion', '2048', 'example', '9', '60', '2M', '*', '0', '0', 'AC', '89', 'cs:Z:=AC']
    ]

midsv.transform(sam)
# [
#   {'QNAME': 'inversion',
#   'RNAME': 'example',
#   'MIDSV': 'M,M,M,M,M,m,m,m,M,M',
#   'CSSPLIT': '=A,=C,=G,=T,=A,=c,=g,=t,=A,=C',
#   'QSCORE': '15,16,17,18,19,20,21,22,23,24'}
# ]

Operators

MIDSV

Op Description
M Identical sequence
[1-9][0-9]+ Insertion to the reference
D Deletion from the reference
S Substitution
N Unknown
[mdsn] Inversion

MIDSV represents insertion as an integer and appends the following operators.

If five insertions follow three matches, MIDSV returns 5M,M,M (not 5,M,M,M) since 5M,M,M keeps reference sequence length in a comma-separated field.

CSSPLIT

Op Regex Description
= [ACGTN] Identical sequence
+ [ACGTN] Insertion to the reference
- [ACGTN] Deletion from the reference
* [ACGTN][ACGTN] Substitution
[acgtn] Inversion
| Separater of insertion sites

CSSPLIT uses | to separate nucleotides in insertion sites.

Therefore, +A|+C|+G|+T|=A can be easily splited to [+A, +C, +G, +T, =A] by "+A|+C|+G|+T|=A".split("|") in Python.

QSCORE

Op Description
-1 Unknown
| Separator at insertion sites

QSCORE uses -1 at deletion or unknown nucleotides.

As with CSSPLIT, QSCORE uses | to separate quality scores in insertion sites.

Helper functions

Read SAM file

midsv.read_sam(path_of_sam: str | Path) -> list[list]

midsv.read_sam read SAM file into a list of lists.

Read/Write JSON Line (JSONL)

midsv.write_jsonl(dict: list[dict], path_of_jsonl: str | Path)
midsv.read_jsonl(path_of_jsonl: str | Path) -> list[dict]

Since midsv returns a list of dictionaries, midsv.write_jsonl outputs it to a file in JSONL format.

Conversely, midsv.read_jsonl reads JSONL as a list of dictionaries.

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

midsv-0.11.0.tar.gz (15.7 kB view hashes)

Uploaded Source

Built Distribution

midsv-0.11.0-py3-none-any.whl (12.6 kB view hashes)

Uploaded Python 3

Supported by

AWS AWS Cloud computing and Security Sponsor Datadog Datadog Monitoring Fastly Fastly CDN Google Google Download Analytics Microsoft Microsoft PSF Sponsor Pingdom Pingdom Monitoring Sentry Sentry Error logging StatusPage StatusPage Status page