Python module to convert SAM to MIDSV format.
midsv
midsv is a Python module that converts SAM files to MIDSV format.
MIDSV (Match, Insertion, Deletion, Substitution, and inVersion) is a comma-separated format that represents differences between a reference and a query, with the same length as the reference.
[!CAUTION] MIDSV is intended for targeted amplicon sequences (10-100 kbp).
Using whole chromosomes as references may exhaust memory and crash.
[!IMPORTANT] MIDSV requires long-format cstag tags in the SAM file.
Please use minimap2 with the --cs=long option, or use the cstag tool to append long-format cs tags.
The output includes MIDSV and, optionally, QSCORE.
- MIDSV preserves original nucleotides while annotating mutations.
- QSCORE provides Phred quality scores for each nucleotide.
Details of MIDSV (formerly MIDS) are described in our paper.
🛠️ Installation
From Bioconda (recommended):
conda install -c bioconda midsv
From PyPI:
pip install midsv
📜 Specifications
MIDSV
| Op | Regex | Description |
|---|---|---|
| = | [ACGTN] | Identical sequence |
| + | [ACGTN] | Insertion to the reference |
| - | [ACGTN] | Deletion from the reference |
| * | [ACGTN][ACGTN] | Substitution |
| | [acgtn] | Inversion (lowercase nucleotides) |
| \| | | Separator for insertion sites |
MIDSV uses | to separate nucleotides in insertion sites, so "+A|+C|+G|+T|=A" can be easily split into ["+A", "+C", "+G", "+T", "=A"] with "+A|+C|+G|+T|=A".split("|").
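Because a full MIDSV string is comma-separated first and |-separated second, it can be parsed in two steps. The following is a minimal sketch using only built-in string methods; the helper name split_midsv is hypothetical and not part of midsv.

```python
def split_midsv(midsv: str) -> list[list[str]]:
    """Split a MIDSV string into one list of tokens per reference position."""
    # Commas separate reference positions; "|" separates inserted bases
    # from the reference-aligned token that closes each position.
    return [position.split("|") for position in midsv.split(",")]


tokens = split_midsv("=A,+A|+C|+G|+T|=A,-G")
print(tokens)
# [['=A'], ['+A', '+C', '+G', '+T', '=A'], ['-G']]
```

The last token of each inner list is the one aligned to the reference; any tokens before it are inserted bases.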
QSCORE
| Op | Description |
|---|---|
| -1 | Unknown |
| \| | Separator for insertion sites |
QSCORE uses -1 for deletions or unknown nucleotides.
As with MIDSV, QSCORE uses | to separate quality scores in insertion sites.
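A QSCORE string can be parsed the same way and converted to integers. This is a small sketch with a hypothetical helper name (parse_qscore), assuming every field is a plain integer as in the examples in this README.

```python
def parse_qscore(qscore: str) -> list[list[int]]:
    """Return one list of integer scores per reference position."""
    return [[int(q) for q in position.split("|")] for position in qscore.split(",")]


scores = parse_qscore("15,16,0|0|0|20,-1,-1,21")
print(scores)
# [[15], [16], [0, 0, 0, 20], [-1], [-1], [21]]

# Keep only scores that belong to an observed base (drop the -1 placeholders).
known = [q for position in scores for q in position if q != -1]
print(known)
# [15, 16, 0, 0, 0, 20, 21]
```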
📘 Usage
midsv.transform(
path_sam: str | Path,
qscore: bool = False,
keep: str | list[str] = None
) -> list[dict[str, str | int]]
- path_sam: Path to a SAM file on disk.
- qscore (bool, optional): Output QSCORE. Defaults to False.
- keep: Subset of {'FLAG', 'POS', 'SEQ', 'QUAL', 'CIGAR', 'CSTAG'} to include from the SAM file. Defaults to None.
- midsv.transform() returns a list of dictionaries containing QNAME, RNAME, MIDSV, and optionally QSCORE, plus any fields specified by keep.
- MIDSV and QSCORE are comma-separated strings, each with as many entries as the reference sequence has positions.
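Since MIDSV and QSCORE share the reference length, their entries can be paired position by position. The sketch below does not call midsv; it uses a hand-written record copied from the insertion, deletion, and substitution example in this README (reference length 10).

```python
# A record shaped like one element of the list returned by midsv.transform().
record = {
    "QNAME": "indel_sub",
    "RNAME": "example",
    "MIDSV": "=A,=C,=G,=T,*AG,+T|+T|+T|=C,-A,-A,=G,=T",
    "QSCORE": "15,16,17,18,19,0|0|0|20,-1,-1,21,22",
}

midsv_positions = record["MIDSV"].split(",")
qscore_positions = record["QSCORE"].split(",")

# Both strings have one comma-separated entry per reference position.
paired = list(zip(midsv_positions, qscore_positions))
print(len(midsv_positions), len(qscore_positions))
# 10 10
print(paired[5])
# ('+T|+T|+T|=C', '0|0|0|20')
```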
🖍️ Examples
Perfect match
import midsv
from midsv.io import read_sam
# Perfect match
path_sam = "examples/example_match.sam"
print(list(read_sam(path_sam)))
# [
# ['@SQ', 'SN:example', 'LN:10'],
# ['match', '0', 'example', '1', '60', '10M', '*', '0', '0', 'ACGTACGTAC', '0123456789', 'cs:Z:=ACGTACGTAC']
# ]
print(midsv.transform(path_sam, qscore=True))
# [{
# 'QNAME': 'match',
# 'RNAME': 'example',
# 'MIDSV': '=A,=C,=G,=T,=A,=C,=G,=T,=A,=C',
# 'QSCORE': '15,16,17,18,19,20,21,22,23,24'
# }]
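The QSCORE values above are consistent with standard Phred+33 decoding of the SAM QUAL field ('0' is ASCII 48, and 48 - 33 = 15). The following sketch shows that decoding on its own; phred33 is a hypothetical helper, and this is an illustration of the encoding rather than midsv's implementation.

```python
def phred33(qual: str) -> list[int]:
    """Decode a SAM QUAL string into Phred scores (ASCII code minus 33)."""
    return [ord(ch) - 33 for ch in qual]


print(phred33("0123456789"))
# [15, 16, 17, 18, 19, 20, 21, 22, 23, 24]
print(phred33("!!!"))
# [0, 0, 0]
```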
Insertion, deletion, and substitution
import midsv
from midsv.io import read_sam
path_sam = "examples/example_indels.sam"
print(list(read_sam(path_sam)))
# [
# ['@SQ', 'SN:example', 'LN:10'],
# ['indel_sub', '0', 'example', '1', '60', '5M3I1M2D2M', '*', '0', '0', 'ACGTGTTTCGT', '01234!!!56789', 'cs:Z:=ACGT*ag+ttt=C-aa=GT']
# ]
print(midsv.transform(path_sam, qscore=True))
# [{
# 'QNAME': 'indel_sub',
# 'RNAME': 'example',
# 'MIDSV': '=A,=C,=G,=T,*AG,+T|+T|+T|=C,-A,-A,=G,=T',
# 'QSCORE': '15,16,17,18,19,0|0|0|20,-1,-1,21,22'
# }]
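One way to summarize a result like this is to count tokens by their leading operator. This is a minimal sketch with a hypothetical helper (count_ops), applied to the MIDSV string shown above.

```python
from collections import Counter


def count_ops(midsv: str) -> Counter:
    """Count MIDSV tokens by operator: '=', '+', '-', or '*'."""
    return Counter(
        token[0]  # the first character of every token is its operator
        for position in midsv.split(",")
        for token in position.split("|")
    )


ops = count_ops("=A,=C,=G,=T,*AG,+T|+T|+T|=C,-A,-A,=G,=T")
print(dict(ops))
# {'=': 7, '*': 1, '+': 3, '-': 2}
```

Inserted bases are counted individually, so the three-base insertion contributes 3 to the '+' count.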
Large deletion
import midsv
from midsv.io import read_sam
path_sam = "examples/example_large_deletion.sam"
print(list(read_sam(path_sam)))
# [
# ['@SQ', 'SN:example', 'LN:10'],
# ['large-deletion', '0', 'example', '1', '60', '2M', '*', '0', '0', 'AC', '01', 'cs:Z:=AC'],
# ['large-deletion', '0', 'example', '9', '60', '2M', '*', '0', '0', 'AC', '89', 'cs:Z:=AC']
# ]
print(midsv.transform(path_sam, qscore=True))
# [
# {'QNAME': 'large-deletion',
# 'RNAME': 'example',
# 'MIDSV': '=A,=C,=N,=N,=N,=N,=N,=N,=A,=C',
# 'QSCORE': '15,16,-1,-1,-1,-1,-1,-1,23,24'}
# ]
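In this output the region not covered by either alignment is padded with =N (and -1 in QSCORE). A short sketch for locating such runs follows; unknown_runs is a hypothetical helper, and it only looks for the uppercase =N token shown in this example.

```python
from itertools import groupby


def unknown_runs(midsv: str) -> list[tuple[int, int]]:
    """Return (1-based start, length) for each run of consecutive '=N' positions."""
    runs = []
    start = 1
    for is_unknown, group in groupby(midsv.split(","), key=lambda tok: tok == "=N"):
        length = len(list(group))
        if is_unknown:
            runs.append((start, length))
        start += length  # advance past this run, whether or not it was '=N'
    return runs


print(unknown_runs("=A,=C,=N,=N,=N,=N,=N,=N,=A,=C"))
# [(3, 6)]
```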
Inversion
import midsv
from midsv.io import read_sam
path_sam = "examples/example_inversion.sam"
print(list(read_sam(path_sam)))
# [
# ['@SQ', 'SN:example', 'LN:10'],
# ['inversion', '0', 'example', '1', '60', '5M', '*', '0', '0', 'ACGTA', '01234', 'cs:Z:=ACGTA'],
# ['inversion', '16', 'example', '6', '60', '3M', '*', '0', '0', 'CGT', '567', 'cs:Z:=CGT'],
# ['inversion', '2048', 'example', '9', '60', '2M', '*', '0', '0', 'AC', '89', 'cs:Z:=AC']
# ]
print(midsv.transform(path_sam, qscore=True))
# [
# {'QNAME': 'inversion',
# 'RNAME': 'example',
# 'MIDSV': '=A,=C,=G,=T,=A,=c,=g,=t,=A,=C',
# 'QSCORE': '15,16,17,18,19,20,21,22,23,24'}
# ]
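Because inverted bases are written in lowercase, the inverted region can be recovered from the MIDSV string alone. The sketch below assumes that any lowercase letter in a position marks it as inverted; inverted_positions is a hypothetical helper.

```python
def inverted_positions(midsv: str) -> list[int]:
    """Return 1-based reference positions whose tokens contain lowercase bases."""
    return [
        index
        for index, position in enumerate(midsv.split(","), start=1)
        if any(ch.islower() for ch in position)
    ]


print(inverted_positions("=A,=C,=G,=T,=A,=c,=g,=t,=A,=C"))
# [6, 7, 8]
```

The result matches the second alignment in the SAM input, which starts at POS 6 with a 3M CIGAR and FLAG 16.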
🧩 Helper functions
Read SAM file
midsv.io.read_sam(path_sam: str | Path) -> Iterator[list[str]]
midsv.io.read_sam reads a local SAM file into an iterator of string lists.
Read/Write JSON Line (JSONL)
midsv.io.write_jsonl(dicts: list[dict[str, str]], path_output: str | Path)
Since midsv.transform returns a list of dictionaries, midsv.io.write_jsonl outputs it to a file in JSONL format.
midsv.io.read_jsonl(path_input: str | Path) -> Iterator[dict[str, str]]
Conversely, midsv.io.read_jsonl reads JSONL as an iterator of dictionaries.
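For readers unfamiliar with JSONL, the format is one JSON object per line. The sketch below shows a round trip using only the standard library; the function names are hypothetical stand-ins, and this is not the midsv.io implementation.

```python
import json
import tempfile
from pathlib import Path


def write_jsonl_sketch(dicts, path_output):
    """Write each dictionary as one JSON object per line."""
    with open(path_output, "w", encoding="utf-8") as f:
        for d in dicts:
            f.write(json.dumps(d) + "\n")


def read_jsonl_sketch(path_input):
    """Yield one dictionary per non-empty line."""
    with open(path_input, encoding="utf-8") as f:
        for line in f:
            if line.strip():
                yield json.loads(line)


records = [{"QNAME": "match", "RNAME": "example", "MIDSV": "=A,=C"}]

with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "out.jsonl"
    write_jsonl_sketch(records, path)
    restored = list(read_jsonl_sketch(path))

print(restored == records)
# True
```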
Reverse complement MIDSV
from midsv import formatter
midsv_tag = "=A,=A,-G,+T|+C|=A,=A,*AG,=C"
revcomp_tag = formatter.revcomp(midsv_tag)
print(revcomp_tag)
# =G,*TC,=T,=T,+G|+A|-C,=T,=T
midsv.formatter.revcomp returns the reverse complement of a MIDSV string. Insertions are reversed and complemented with their anchor moved to the new position, following the MIDSV specification.
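To make the anchor movement concrete, here is a simplified sketch that reproduces the example above. It is not the library's code: revcomp_sketch is a hypothetical name, and the handling of an insertion at the very first position (which would have no anchor after reversal) is an assumption, so the sketch simply rejects that case.

```python
COMPLEMENT = str.maketrans("ACGTNacgtn", "TGCANtgcan")


def revcomp_sketch(midsv: str) -> str:
    """Reverse-complement a MIDSV string, re-anchoring insertions."""
    positions = [p.split("|") for p in midsv.split(",")]
    anchors = [p[-1] for p in positions]      # reference-aligned token
    insertions = [p[:-1] for p in positions]  # inserted bases before the anchor
    if insertions[0]:
        raise ValueError("insertion before the first position is not handled here")

    out = []
    n = len(positions)
    for i in range(n - 1, -1, -1):
        # Bases inserted before old position i+1 sit between old i and old i+1,
        # so after reversal they come right before the anchor of old position i.
        moved = insertions[i + 1] if i + 1 < n else []
        tokens = [tok.translate(COMPLEMENT) for tok in reversed(moved)]
        tokens.append(anchors[i].translate(COMPLEMENT))
        out.append("|".join(tokens))
    return ",".join(out)


print(revcomp_sketch("=A,=A,-G,+T|+C|=A,=A,*AG,=C"))
# =G,*TC,=T,=T,+G|+A|-C,=T,=T
```

Operators (=, +, -, *) are left untouched by the translation table, so only the nucleotide letters are complemented.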
Export VCF
from midsv import transform
from midsv.io import write_vcf
alignments = transform("examples/example_indels.sam", qscore=False)
write_vcf(alignments, "variants.vcf", large_sv_threshold=50)
midsv.io.write_vcf writes MIDSV output to VCF and supports insertion, deletion, substitution, large insertion, large deletion, and inversion. Insertions longer than large_sv_threshold are emitted as symbolic <INS>, large deletions (or =N padding) use <DEL>, and inversions use <INV>. The INFO field includes TYPE or SVTYPE, SVLEN, SEQ, and QNAME.