Skip to main content

Python module to convert SAM to MIDSV format.

Project description

Licence Test Python PyPI Bioconda

midsv

midsv is a Python module that converts SAM files to MIDSV format.

MIDSV (Match, Insertion, Deletion, Substitution, and inVersion) is a comma-separated format that represents differences between a reference and a query, with the same length as the reference.

[!CAUTION] MIDSV is intended for targeted amplicon sequences (10-100 kbp).
Using whole chromosomes as references may exhaust memory and crash.

[!IMPORTANT] MIDSV requires long-format cstag tags in the SAM file.
Please use minimap2 with --cs=long option. or use cstag tool to append long-format cstag.

The output includes MIDSV and, optionally, QSCORE.

  • MIDSV preserves original nucleotides while annotating mutations.
  • QSCORE provides Phred quality scores for each nucleotide.

Details of MIDSV (formerly MIDS) are described in our paper.

🛠️Installation

From Bioconda (recommended):

conda install -c bioconda midsv

From PyPI:

pip install midsv

📜Specifications

MIDSV

Op Regex Description
= [ACGTN] Identical sequence
+ [ACGTN] Insertion to the reference
- [ACGTN] Deletion from the reference
* [ACGTN][ACGTN] Substitution
[acgtn] Inversion
| Separator for insertion sites

MIDSV uses | to separate nucleotides in insertion sites so +A|+C|+G|+T|=A can be easily split into [+A, +C, +G, +T, =A] by "+A|+C|+G|+T|=A".split("|").

QSCORE

Op Description
-1 Unknown
| Separator for insertion sites

QSCORE uses -1 for deletions or unknown nucleotides.

As with MIDSV, QSCORE uses | to separate quality scores in insertion sites.

📘Usage

midsv.transform(
    path_sam: str | Path,
    qscore: bool = False,
    keep: str | list[str] = None
) -> list[dict[str, str | int]]
  • path_sam: Path to a SAM file on disk.

  • qscore (bool, optional): Output QSCORE. Defaults to False.

  • keep: Subset of {'FLAG', 'POS', 'SEQ', 'QUAL', 'CIGAR', 'CSTAG'} to include from the SAM file. Defaults to None.

  • midsv.transform() returns a list of dictionaries containing QNAME, RNAME, MIDSV, and optionally QSCORE, plus any fields specified by keep.

  • MIDSV and QSCORE are comma-separated strings and have the same reference sequence length.

🖍️Examples

Perfect match

import midsv
from midsv.io import read_sam

# Perfect match

path_sam = "examples/example_match.sam"
print(list(read_sam(path_sam)))
# sam = [
#     ['@SQ', 'SN:example', 'LN:10'],
#     ['match', '0', 'example', '1', '60', '10M', '*', '0', '0', 'ACGTACGTAC', '0123456789', 'cs:Z:=ACGTACGTAC']
# ]

print(midsv.transform(path_sam, qscore=True))
# [{
#   'QNAME': 'control',
#   'RNAME': 'example',
#   'MIDSV': '=A,=C,=G,=T,=A,=C,=G,=T,=A,=C',
#   'QSCORE': '15,16,17,18,19,20,21,22,23,24'
# }]

Insertion, deletion, and substitution

import midsv
from midsv.io import read_sam

path_sam = "examples/example_indels.sam"
print(list(read_sam(path_sam)))
# [
#     ['@SQ', 'SN:example', 'LN:10'],
#     ['indel_sub', '0', 'example', '1', '60', '5M3I1M2D2M', '*', '0', '0', 'ACGTGTTTCGT', '01234!!!56789', 'cs:Z:=ACGT*ag+ttt=C-aa=GT']
# ]

print(midsv.transform(path_sam, qscore=True))
# [{
#   'QNAME': 'indel_sub',
#   'RNAME': 'example',
#   'MIDSV': '=A,=C,=G,=T,*AG,+T|+T|+T|=C,-A,-A,=G,=T',
#   'QSCORE': '15,16,17,18,19,0|0|0|20,-1,-1,21,22'
# }]

Large deletion

import midsv
from midsv.io import read_sam

path_sam = "examples/example_large_deletion.sam"
print(list(read_sam(path_sam)))
# [
#     ['@SQ', 'SN:example', 'LN:10'],
#     ['large-deletion', '0', 'example', '1', '60', '2M', '*', '0', '0', 'AC', '01', 'cs:Z:=AC'],
#     ['large-deletion', '0', 'example', '9', '60', '2M', '*', '0', '0', 'AC', '89', 'cs:Z:=AC']
# ]

print(midsv.transform(path_sam, qscore=True))
# [
#   {'QNAME': 'large-deletion',
#   'RNAME': 'example',
#   'MIDSV': '=A,=C,=N,=N,=N,=N,=N,=N,=A,=C',
#   'QSCORE': '15,16,-1,-1,-1,-1,-1,-1,23,24'}
# ]

Inversion

import midsv
from midsv.io import read_sam

path_sam = "examples/example_inversion.sam"
print(list(read_sam(path_sam)))
# [
#     ['@SQ', 'SN:example', 'LN:10'],
#     ['inversion', '0', 'example', '1', '60', '5M', '*', '0', '0', 'ACGTA', '01234', 'cs:Z:=ACGTA'],
#     ['inversion', '16', 'example', '6', '60', '3M', '*', '0', '0', 'CGT', '567', 'cs:Z:=CGT'],
#     ['inversion', '2048', 'example', '9', '60', '2M', '*', '0', '0', 'AC', '89', 'cs:Z:=AC']
# ]

print(midsv.transform(path_sam, qscore=True))
# [
#   {'QNAME': 'inversion',
#   'RNAME': 'example',
#   'MIDSV': '=A,=C,=G,=T,=A,=c,=g,=t,=A,=C',
#   'QSCORE': '15,16,17,18,19,20,21,22,23,24'}
# ]

🧩Helper functions

Read SAM file

midsv.io.read_sam(path_sam: str | Path) -> Iterator[list[str]]

midsv.io.read_sam reads a local SAM file into an iterator of string lists.

Read/Write JSON Line (JSONL)

midsv.io.write_jsonl(dicts: list[dict[str, str]], path_output: str | Path)

Since midsv.transform returns a list of dictionaries, midsv.io.write_jsonl outputs it to a file in JSONL format.

midsv.io.read_jsonl(path_input: str | Path) -> Iterator[dict[str, str]]

Conversely, midsv.io.read_jsonl reads JSONL as an iterator of dictionaries.

Reverse complement MIDSV

from midsv import formatter

midsv_tag = "=A,=A,-G,+T|+C|=A,=A,*AG,=C"
revcomp_tag = formatter.revcomp(midsv_tag)
print(revcomp_tag)
# =G,*TC,=T,=T,+G|+A|-C,=T,=T

midsv.formatter.revcomp returns the reverse complement of a MIDSV string. Insertions are reversed and complemented with their anchor moved to the new position, following the MIDSV specification.

Export VCF

from midsv import transform
from midsv.io import write_vcf

alignments = transform("examples/example_indels.sam", qscore=False)
write_vcf(alignments, "variants.vcf", large_sv_threshold=50)

midsv.io.write_vcf writes MIDSV output to VCF and supports insertion, deletion, substitution, large insertion, large deletion, and inversion. Insertions longer than large_sv_threshold are emitted as symbolic <INS>, large deletions (or =N padding) use <DEL>, and inversions use <INV>. The INFO field includes TYPE or SVTYPE, SVLEN, SEQ, and QNAME.

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

midsv-0.13.1.tar.gz (14.8 kB view details)

Uploaded Source

Built Distribution

If you're not sure about the file name format, learn more about wheel file names.

midsv-0.13.1-py3-none-any.whl (15.7 kB view details)

Uploaded Python 3

File details

Details for the file midsv-0.13.1.tar.gz.

File metadata

  • Download URL: midsv-0.13.1.tar.gz
  • Upload date:
  • Size: 14.8 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? Yes
  • Uploaded via: twine/6.1.0 CPython/3.13.7

File hashes

Hashes for midsv-0.13.1.tar.gz
Algorithm Hash digest
SHA256 49cba77c4fe54fe5cbeedd49d93a8d2945e4aba0e4112b9c024acf30a9494211
MD5 a51a75bcfcfa72b152dbde41268d719a
BLAKE2b-256 ed3a014eb954145b712245a069fe9d35b286c69fe70813d0c6700a7f426218f7

See more details on using hashes here.

Provenance

The following attestation bundles were made for midsv-0.13.1.tar.gz:

Publisher: deploy_pypi.yml on akikuno/midsv

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

File details

Details for the file midsv-0.13.1-py3-none-any.whl.

File metadata

  • Download URL: midsv-0.13.1-py3-none-any.whl
  • Upload date:
  • Size: 15.7 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? Yes
  • Uploaded via: twine/6.1.0 CPython/3.13.7

File hashes

Hashes for midsv-0.13.1-py3-none-any.whl
Algorithm Hash digest
SHA256 00ffc531fa73a8c205267e3a2732b03cd98e4b63926e59260b195dc8769956ff
MD5 c6326c38cc40c28f774ad6bdeb87058e
BLAKE2b-256 e1411b896bc9c249bfb9c9c39c1a73519c3dc21cd8c2303a0c760faa4f130f2e

See more details on using hashes here.

Provenance

The following attestation bundles were made for midsv-0.13.1-py3-none-any.whl:

Publisher: deploy_pypi.yml on akikuno/midsv

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

Supported by

AWS Cloud computing and Security Sponsor Datadog Monitoring Depot Continuous Integration Fastly CDN Google Download Analytics Pingdom Monitoring Sentry Error logging StatusPage Status page