Demultiplex multi-reference BAM files into per-reference buckets and call consensus

midsplit

Demultiplex a multi-reference BAM file into per-reference buckets and call a consensus sequence for each non-empty bucket.

What it does

When reads are aligned to multiple reference sequences in a single BAM file, midsplit assigns each read (or read pair) to the reference(s) it matches best, then produces a separate BAM, consensus FASTA, and per-site statistics file for each reference that received at least one read.

The classification uses the NM (edit-distance) tag to compute a percent identity (PID) for each alignment. A read is assigned to a reference if its PID against that reference is at least --threshold (default 0.95) times the best PID that read achieves across all references, so a read can be assigned to more than one reference. Paired-end reads are treated as a unit using an overlap-aware combined percent identity, so both mates always land in the same bucket(s).
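
The threshold rule above can be sketched in a few lines of Python. This is an illustration only, not midsplit's actual code: the function names are hypothetical, and the formula used for percent identity (1 minus NM divided by the aligned length) is an assumption, since the exact definition is not given here. The overlap-aware handling of read pairs is not shown.

```python
def percent_identity(nm: int, aligned_length: int) -> float:
    """Assumed definition: fraction of aligned bases not counted by NM."""
    if aligned_length <= 0:
        raise ValueError("aligned_length must be positive")
    return max(0.0, 1.0 - nm / aligned_length)


def assign_references(pid_by_ref: dict[str, float], threshold: float = 0.95) -> set[str]:
    """Return every reference whose PID is at least threshold * best PID."""
    if not pid_by_ref:
        return set()
    best = max(pid_by_ref.values())
    cutoff = threshold * best
    return {ref for ref, pid in pid_by_ref.items() if pid >= cutoff}


# Made-up example: one read aligned to three hypothetical references.
pids = {
    "refA": percent_identity(nm=2, aligned_length=100),   # 0.98 (best)
    "refB": percent_identity(nm=5, aligned_length=100),   # 0.95
    "refC": percent_identity(nm=20, aligned_length=100),  # 0.80
}
assigned = assign_references(pids, threshold=0.95)
print(sorted(assigned))
```

With a best PID of 0.98, the cutoff is 0.95 × 0.98 = 0.931, so the read lands in the refA and refB buckets but not in refC.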

Output

For each reference that receives reads, midsplit writes:

| File | Contents |
|------|----------|
| {ID}.bam / {ID}.bam.bai | Sorted, indexed per-reference BAM |
| {ID}-consensus.fasta | Consensus sequence called by ivar |
| {ID}-per-site.tsv | Per-position depth, A/C/G/T counts, ref base, and consensus base |
| summary.txt | Run-level statistics and consensus-vs-reference comparison |

The per-site TSV has columns: site, ref_base, consensus_base, depth, A, C, G, T. When --align is used, the consensus base is mapped back to the correct reference position even when ivar has inserted or deleted bases relative to the reference.
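
As a rough illustration of how the per-site file might be consumed downstream, the sketch below reads the eight columns listed above and reports sites where the consensus differs from the reference. It assumes the file is tab-separated with a header row using exactly those column names, and that masked positions appear as N; neither detail is confirmed here. The sample rows are invented and the function names are hypothetical.

```python
import csv
import io

INT_COLUMNS = ("site", "depth", "A", "C", "G", "T")


def read_per_site(handle):
    """Parse a per-site TSV (assumed to have a header row) into dicts."""
    rows = []
    for row in csv.DictReader(handle, delimiter="\t"):
        for key in INT_COLUMNS:
            row[key] = int(row[key])
        rows.append(row)
    return rows


def differing_sites(rows):
    """Sites where the consensus base differs from the reference base.

    Positions with an N consensus are skipped (assumed to mean masked).
    """
    return [
        r["site"]
        for r in rows
        if r["consensus_base"].upper() != "N"
        and r["consensus_base"].upper() != r["ref_base"].upper()
    ]


# Invented sample data standing in for a real {ID}-per-site.tsv file.
sample = (
    "site\tref_base\tconsensus_base\tdepth\tA\tC\tG\tT\n"
    "1\tA\tA\t10\t10\t0\t0\t0\n"
    "2\tC\tT\t8\t0\t1\t0\t7\n"
    "3\tG\tN\t0\t0\t0\t0\t0\n"
)
rows = read_per_site(io.StringIO(sample))
print(differing_sites(rows))
```

For a real run, replace the io.StringIO object with an open file handle on the TSV in the output directory.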

Usage

midsplit [options] INPUT_BAM

Options

| Option | Default | Description |
|--------|---------|-------------|
| --output-dir DIR | . | Directory for all output files (created if absent) |
| --threshold FLOAT | 0.95 | Minimum fraction of best PID to assign a read |
| --reference FASTA | | Multi-reference FASTA; enables ref_base column and consensus comparison |
| --align | off | Align consensus to reference before comparison (recommended when lengths differ) |
| --aligner | mafft | Aligner for --align: mafft, needle, or edlib |
| --aligner-options OPTIONS | | Extra options forwarded to the aligner (implies --align) |
| --consensus-quality INT | 20 | Minimum base quality passed to ivar (-q) |
| --consensus-frequency-threshold FLOAT | 0.0 | Minimum frequency for ivar to call a base (-t) |
| --consensus-low-coverage INT | 0 | Depth below which ivar masks with N (-m) |
| --consensus-id TEMPLATE | | ID for the consensus sequence; use {ID} to embed the reference name |

Example

midsplit \
  --reference references/multi.fasta \
  --output-dir results/ \
  --align \
  --threshold 0.95 \
  alignments/reads-vs-multi.bam

Requirements

  • Python 3.11+
  • samtools and ivar on PATH
  • Python dependencies are managed with uv; run uv sync to install them
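
A quick pre-flight check for these requirements could look like the sketch below. It is not part of midsplit; the helper name missing_tools is hypothetical, and the script only reports what it finds.

```python
import shutil
import sys


def missing_tools(tools):
    """Return the names in `tools` that cannot be found on PATH."""
    return [name for name in tools if shutil.which(name) is None]


python_ok = sys.version_info >= (3, 11)
missing = missing_tools(["samtools", "ivar"])

print("Python 3.11+:", "yes" if python_ok else "no")
print("Missing tools:", ", ".join(missing) if missing else "none")
```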

Notes

  • Only primary and secondary alignments are used; supplementary alignments (chimeric/split reads) are skipped.
  • Reads aligned with Bowtie2 --all or -k N emit non-best hits as secondary alignments; midsplit includes these in classification so that all alignment evidence is used.
  • For circular genomes (e.g. HBV) mapped against a linearised reference, local alignment (bowtie2 --very-sensitive-local) is strongly recommended over end-to-end alignment. End-to-end mode cannot soft-clip reads that span the linearisation junction, which introduces artefactual bases near position 1 of the reference and can corrupt the consensus at those positions.
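
The first note can be expressed in terms of SAM flag bits. The sketch below uses the standard flag values from the SAM specification (0x100 secondary, 0x800 supplementary, 0x4 unmapped) to decide whether a record would count towards classification. The function name is hypothetical, and skipping unmapped records is an assumption rather than documented midsplit behaviour.

```python
FLAG_UNMAPPED = 0x4          # read has no alignment
FLAG_SECONDARY = 0x100       # non-best hit, e.g. from bowtie2 -k N or --all
FLAG_SUPPLEMENTARY = 0x800   # part of a chimeric/split alignment


def is_used_for_classification(flag: int) -> bool:
    """Keep primary and secondary alignments; drop supplementary ones."""
    if flag & FLAG_UNMAPPED:
        return False  # assumption: unmapped records carry no evidence
    return not (flag & FLAG_SUPPLEMENTARY)


for flag in (0, 16, 256, 2048, 4):
    print(flag, is_used_for_classification(flag))
```

A primary alignment has neither the secondary nor the supplementary bit set, so no separate check is needed for it.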
