
Python module to convert SAM to MIDSV format.

midsv

midsv is a Python module to convert SAM to MIDSV format.

MIDSV (Match, Insertion, Deletion, Substitution, and inVersion) is a comma-separated format that represents the differences between a reference and a query, using exactly one field per reference nucleotide so that the output always has the same length as the reference.

⚠️ MIDSV is intended for target amplicon sequences (10-100 kbp). It may run out of memory and crash when whole chromosomes are used as the reference.

midsv provides three output formats: MIDSV, CSSPLIT, and QSCORE.

  • MIDSV is a simple representation that focuses on mutations
  • CSSPLIT keeps the original nucleotides
  • QSCORE provides the Phred quality score of each nucleotide

The details of MIDSV (formerly named MIDS) are described in our paper.

Installation

From PyPI:

pip install midsv

From Bioconda:

conda install -c bioconda midsv

Usage

midsv.transform(
    sam: list[list],
    midsv: bool = True,
    cssplit: bool = True,
    qscore: bool = True) -> list[dict]

  • midsv.transform() returns a list of dictionaries including QNAME, RNAME, MIDSV, CSSPLIT, and QSCORE.
  • MIDSV, CSSPLIT, and QSCORE are comma-separated strings, each with as many fields as the reference sequence has nucleotides.

import midsv

# Perfect match

sam = [
    ['@SQ', 'SN:example', 'LN:10'],
    ['match', '0', 'example', '1', '60', '10M', '*', '0', '0', 'ACGTACGTAC', '0123456789', 'cs:Z:=ACGTACGTAC']
    ]

midsv.transform(sam)
# [{
#   'QNAME': 'match',
#   'RNAME': 'example',
#   'MIDSV': 'M,M,M,M,M,M,M,M,M,M',
#   'CSSPLIT': '=A,=C,=G,=T,=A,=C,=G,=T,=A,=C',
#   'QSCORE': '15,16,17,18,19,20,21,22,23,24'
# }]

# Insertion, deletion and substitution

sam = [
    ['@SQ', 'SN:example', 'LN:10'],
    ['indel_sub', '0', 'example', '1', '60', '5M3I1M2D2M', '*', '0', '0', 'ACGTGTTTCGT', '01234!!!567', 'cs:Z:=ACGT*ag+ttt=C-aa=GT']
    ]

midsv.transform(sam)
# [{
#   'QNAME': 'indel_sub',
#   'RNAME': 'example',
#   'MIDSV': 'M,M,M,M,S,3M,D,D,M,M',
#   'CSSPLIT': '=A,=C,=G,=T,*AG,+T|+T|+T|=C,-A,-A,=G,=T',
#   'QSCORE': '15,16,17,18,19,0|0|0|20,-1,-1,21,22'
# }]

# Large deletion

sam = [
    ['@SQ', 'SN:example', 'LN:10'],
    ['large-deletion', '0', 'example', '1', '60', '2M', '*', '0', '0', 'AC', '01', 'cs:Z:=AC'],
    ['large-deletion', '0', 'example', '9', '60', '2M', '*', '0', '0', 'AC', '89', 'cs:Z:=AC']
    ]

midsv.transform(sam)
# [
#   {'QNAME': 'large-deletion',
#   'RNAME': 'example',
#   'MIDSV': 'M,M,D,D,D,D,D,D,M,M',
#   'CSSPLIT': '=A,=C,N,N,N,N,N,N,=A,=C',
#   'QSCORE': '15,16,-1,-1,-1,-1,-1,-1,23,24'}
# ]

# Inversion

sam = [
    ['@SQ', 'SN:example', 'LN:10'],
    ['inversion', '0', 'example', '1', '60', '5M', '*', '0', '0', 'ACGTA', '01234', 'cs:Z:=ACGTA'],
    ['inversion', '16', 'example', '6', '60', '3M', '*', '0', '0', 'CGT', '567', 'cs:Z:=CGT'],
    ['inversion', '2048', 'example', '9', '60', '2M', '*', '0', '0', 'AC', '89', 'cs:Z:=AC']
    ]

midsv.transform(sam)
# [
#   {'QNAME': 'inversion',
#   'RNAME': 'example',
#   'MIDSV': 'M,M,M,M,M,m,m,m,M,M',
#   'CSSPLIT': '=A,=C,=G,=T,=A,=c,=g,=t,=A,=C',
#   'QSCORE': '15,16,17,18,19,20,21,22,23,24'}
# ]
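
Because each output string has one comma-separated field per reference position, the three representations can be lined up position by position. The sketch below is illustrative only: it hard-codes the documented result of the insertion, deletion and substitution example instead of calling midsv, and the helper name fields_per_position is hypothetical, not part of the library.

```python
# Output copied from the "Insertion, deletion and substitution" example.
result = {
    "QNAME": "indel_sub",
    "RNAME": "example",
    "MIDSV": "M,M,M,M,S,3M,D,D,M,M",
    "CSSPLIT": "=A,=C,=G,=T,*AG,+T|+T|+T|=C,-A,-A,=G,=T",
    "QSCORE": "15,16,17,18,19,0|0|0|20,-1,-1,21,22",
}


def fields_per_position(record: dict) -> list[tuple[str, str, str]]:
    """Hypothetical helper: one (MIDSV, CSSPLIT, QSCORE) tuple per reference position."""
    midsv_fields = record["MIDSV"].split(",")
    cssplit_fields = record["CSSPLIT"].split(",")
    qscore_fields = record["QSCORE"].split(",")
    # All three strings should contain the same number of fields (the reference length).
    if not len(midsv_fields) == len(cssplit_fields) == len(qscore_fields):
        raise ValueError("MIDSV, CSSPLIT, and QSCORE differ in length")
    return list(zip(midsv_fields, cssplit_fields, qscore_fields))


positions = fields_per_position(result)
for index, (op, cs, qual) in enumerate(positions, start=1):
    print(index, op, cs, qual)
```

The sixth position carries the whole insertion site (3M, +T|+T|+T|=C, 0|0|0|20), which is why the field count stays at 10 even though the query has 11 nucleotides.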

Operators

MIDSV

Op            Description
M             Identical sequence
[1-9][0-9]*   Insertion to the reference
D             Deletion from the reference
S             Substitution
N             Unknown
[mdsn]        Inversion

MIDSV represents an insertion as an integer (the number of inserted nucleotides) prepended to the operator that follows it.

For example, if a five-nucleotide insertion is followed by three matches, MIDSV returns 5M,M,M (not 5,M,M,M), because 5M,M,M keeps the number of comma-separated fields equal to the reference sequence length.
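
To make the insertion encoding concrete, here is a minimal sketch that splits each MIDSV field into an insertion length and an operator. The regular expression is an assumption derived from the operator table above, and parse_midsv is a hypothetical helper, not a function provided by midsv.

```python
import re

# One MIDSV field: an optional insertion length followed by a single operator.
# Lowercase operators mark positions inside an inversion.
# Assumption: this pattern is inferred from the operator table, not taken from midsv.
MIDSV_FIELD = re.compile(r"^([1-9][0-9]*)?([MDSNmdsn])$")


def parse_midsv(midsv: str) -> list[tuple[int, str]]:
    """Hypothetical helper: return (inserted_length, operator) per reference position."""
    parsed = []
    for field in midsv.split(","):
        match = MIDSV_FIELD.match(field)
        if match is None:
            raise ValueError(f"Unexpected MIDSV field: {field!r}")
        inserted = int(match.group(1)) if match.group(1) else 0
        parsed.append((inserted, match.group(2)))
    return parsed


parsed = parse_midsv("5M,M,M")
print(parsed)  # [(5, 'M'), (0, 'M'), (0, 'M')]
```

The result has three entries, one per reference position, and the insertion is recorded as a property of the position it precedes rather than as a field of its own.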

CSSPLIT

Op   Regex            Description
=    [ACGTN]          Identical sequence
+    [ACGTN]          Insertion to the reference
-    [ACGTN]          Deletion from the reference
*    [ACGTN][ACGTN]   Substitution
     [acgtn]          Inversion
|                     Separator of insertion sites

CSSPLIT uses | to separate nucleotides in insertion sites.

Therefore, +A|+C|+G|+T|=A can easily be split into ['+A', '+C', '+G', '+T', '=A'] with "+A|+C|+G|+T|=A".split("|") in Python.
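
Since CSSPLIT keeps the original nucleotides, the query sequence can be recovered from it. The sketch below assumes that a substitution token lists the reference nucleotide first and the query nucleotide second (as in the usage example, where cs:Z:*ag becomes *AG), and that a bare N contributes no query nucleotide. The function cssplit_to_query is hypothetical and not part of midsv.

```python
def cssplit_to_query(cssplit: str) -> str:
    """Hypothetical helper: rebuild the query nucleotides encoded in a CSSPLIT string."""
    query = []
    for field in cssplit.split(","):
        # An insertion site holds several tokens separated by "|".
        for token in field.split("|"):
            op = token[0]
            if op in "=+":
                query.append(token[1])  # matched or inserted nucleotide
            elif op == "*":
                query.append(token[2])  # substitution: reference base, then query base
            # "-" (deletion) and a bare "N" (unknown) add nothing to the query.
    return "".join(query)


query = cssplit_to_query("=A,=C,=G,=T,*AG,+T|+T|+T|=C,-A,-A,=G,=T")
print(query)  # ACGTGTTTCGT
```

The reconstructed string equals the SEQ field of the indel_sub read in the usage example.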

QSCORE

Op   Description
-1   Unknown
|    Separator at insertion sites

QSCORE uses -1 for deleted or unknown nucleotides.

As with CSSPLIT, QSCORE uses | to separate quality scores in insertion sites.
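
The scores in the usage examples are consistent with the standard SAM Phred+33 encoding of the QUAL field ('0' is ASCII 48, so its score is 15). The following sketch decodes a QUAL string that way and parses a QSCORE string into integers. Both function names are hypothetical and are not part of midsv.

```python
def phred33(qual: str) -> list[int]:
    """Decode a SAM QUAL string (Phred+33 ASCII) into integer quality scores."""
    return [ord(char) - 33 for char in qual]


def parse_qscore(qscore: str) -> list[list[int]]:
    """Hypothetical helper: one list of scores per reference position."""
    return [[int(score) for score in field.split("|")] for field in qscore.split(",")]


print(phred33("0123456789"))
# [15, 16, 17, 18, 19, 20, 21, 22, 23, 24]

per_position = parse_qscore("15,16,17,18,19,0|0|0|20,-1,-1,21,22")
# Positions without a query nucleotide carry the single value -1.
missing = [i for i, scores in enumerate(per_position, start=1) if scores == [-1]]
print(missing)  # [7, 8]
```

Keeping each position as a list means an insertion site such as 0|0|0|20 stays attached to its reference position, with the last value belonging to the nucleotide that follows the insertion.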

Helper functions

Read SAM file

midsv.read_sam(path_of_sam: str | Path) -> list[list]

midsv.read_sam reads a SAM file into a list of lists.
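
For readers who want to see the expected shape of that list of lists, here is a rough stand-in that splits each SAM line on tabs, which matches the structure of the sam variables in the Usage section. It is an assumption about the data layout, not the library's implementation, and read_sam_sketch is a hypothetical name.

```python
import tempfile
from pathlib import Path
from typing import Union


def read_sam_sketch(path_of_sam: Union[str, Path]) -> list[list[str]]:
    """Illustrative stand-in: split every non-empty SAM line on tabs."""
    with open(path_of_sam, encoding="utf-8") as handle:
        return [line.rstrip("\n").split("\t") for line in handle if line.strip()]


# Write the "Perfect match" example to a temporary file and read it back.
with tempfile.TemporaryDirectory() as tmp:
    sam_path = Path(tmp) / "example.sam"
    sam_path.write_text(
        "@SQ\tSN:example\tLN:10\n"
        "match\t0\texample\t1\t60\t10M\t*\t0\t0\tACGTACGTAC\t0123456789\tcs:Z:=ACGTACGTAC\n",
        encoding="utf-8",
    )
    sam = read_sam_sketch(sam_path)

print(sam[0])  # ['@SQ', 'SN:example', 'LN:10']
```

Header lines and alignment lines end up in the same list, each as a list of string fields.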

Read/Write JSON Line (JSONL)

midsv.write_jsonl(dict: list[dict], path_of_jsonl: str | Path)
midsv.read_jsonl(path_of_jsonl: str | Path) -> list[dict]

Since midsv.transform returns a list of dictionaries, midsv.write_jsonl writes that list to a file in JSONL format, one dictionary per line.

Conversely, midsv.read_jsonl reads a JSONL file into a list of dictionaries.
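
The JSONL round trip can be sketched with the standard library alone. The two functions below are hypothetical stand-ins that show the file layout (one JSON object per line); they are not the implementations shipped with midsv.

```python
import json
import tempfile
from pathlib import Path


def write_jsonl_sketch(records: list[dict], path_of_jsonl) -> None:
    """Illustrative stand-in: write one JSON object per line."""
    with open(path_of_jsonl, "w", encoding="utf-8") as handle:
        for record in records:
            handle.write(json.dumps(record) + "\n")


def read_jsonl_sketch(path_of_jsonl) -> list[dict]:
    """Illustrative stand-in: parse each non-empty line as a JSON object."""
    with open(path_of_jsonl, encoding="utf-8") as handle:
        return [json.loads(line) for line in handle if line.strip()]


# Record copied from the "Large deletion" example.
records = [{
    "QNAME": "large-deletion",
    "RNAME": "example",
    "MIDSV": "M,M,D,D,D,D,D,D,M,M",
    "CSSPLIT": "=A,=C,N,N,N,N,N,N,=A,=C",
    "QSCORE": "15,16,-1,-1,-1,-1,-1,-1,23,24",
}]

with tempfile.TemporaryDirectory() as tmp:
    jsonl_path = Path(tmp) / "example.jsonl"
    write_jsonl_sketch(records, jsonl_path)
    restored = read_jsonl_sketch(jsonl_path)

print(restored == records)  # True
```

Because every record is a flat dictionary of strings, a line-oriented format allows large results to be processed one read at a time.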
