Pattern classification and visualization for MitoSAlt
Project description
SAltShaker
A Python package for classifying and visualizing mitochondrial structural variants from MitoSAlt pipeline output.
Overview
SAltShaker is a Python port and extension of the original MitoSAlt (Basu et al. PLoS Genetics 2020) delplot.R visualization script. The package provides three modular commands for a flexible analysis workflow:
- Event calling: Direct Python port of the original R script's deletion/duplication classification logic
- Pattern classification: Rule-based decision tree to distinguish single and multiple type of events from background
- Visualization: Circular genome plotting based on the original R script visualization with spatial grouping and annotations
Installation
From PyPI (recommended)
pip install saltshaker
For BigWig coverage track support (optional, see Optional dependencies):
pip install saltshaker[coverage]
From GitLab package registry
If you need a pre-release or an internal build not yet on PyPI:
# pip
pip install saltshaker --index-url https://gitlab.com/api/v4/groups/16758292/-/packages/pypi/simple
# uv
uv add --index https://gitlab.com/api/v4/groups/16758292/-/packages/pypi/simple saltshaker
# Poetry
poetry source add --priority=primary dpipe https://gitlab.com/api/v4/groups/16758292/-/packages/pypi/simple
poetry add saltshaker
From source
git clone https://gitlab.com/GenomeDX/annotation/saltshaker.git
cd saltshaker
pip install -e .
Docker
# Build the image
docker build -t saltshaker .
# Run with mounted data directory
docker run -v /path/to/your/data:/data saltshaker [command] [options]
Optional dependencies
SAltShaker's core functionality (call / classify / plot) has no optional dependencies. The following extras unlock additional features:
coverage—pip install saltshaker[coverage]Enables the BigWig coverage track (saltshaker plot --coverage). Requires pyBigWig, which has a compiled C extension and is only available on Linux/macOS.
Why optional?
pyBigWigrequires a C build environment and is not available on Windows. Making it optional keeps the base install lightweight and cross-platform.
Commands
1. saltshaker call - Event calling (R script port)
Python port of the original MitoSAlt R script event calling logic. Calls detected breakpoint clusters as deletions or duplications based on replication origin overlap, following the exact algorithm from delplot.R.
Usage:
saltshaker call \
--prefix sample \ # Sample identifier
--output-dir results/ \ # Output directory
-c sample.cluster \ # cluster file from MitoSAlt
-p sample.breakpoint \ # breakpoint file from MitoSAlt
-r reference.fasta \ # mitochondrial reference genome
-g 16569 \ # genome length
--ori-h-start 16081 --ori-h-end 407 \ # heavy strand origin
--ori-l-start 5730 --ori-l-end 5763 \ # light strand origin
-H 0.01 \ # heteroplasmy threshold
-f 15 \ # flanking sequence size (bp)
--blacklist # Optional: enable with default MT blacklist, OR
--blacklist custom_bl.bed # Optional: enable with custom BED file
Outputs:
results/sample.saltshaker_call.tsv- Human-readable results matching original R script output formatresults/sample.saltshaker_call_metadata.tsv- Metadata for downstream processing (classify/plot)
Algorithm (from original R script):
The logic is a direct port of the original MitoSAlt R implementation:
- Data loading: Parses cluster and breakpoint files, merges data to link clusters with D-loop crossing information
- Deletion size calculation: Handles circular genome wraparound for events crossing position 1
- Origin-based event calling: Events overlapping replication origins (OriH or OriL) are called as duplications; non-overlapping events are deletions
- Coordinate handling: Implements the R script's coordinate swapping logic for D-loop crossing events
- Flanking sequence analysis: Uses Biostrings-equivalent pattern matching to find microhomology sequences near breakpoints
This approach identifies the arc complementary to the actual structural change, consistent with the biological interpretation that origin-overlapping events represent duplications of the non-deleted arc.
Original R script functionality preserved:
- Exact deletion/duplication calling algorithm
- D-loop crossing detection and coordinate handling
- Flanking sequence extraction and microhomology analysis
- Output TSV format and column names
- Heteroplasmy calculation and filtering
Python enhancements in call step:
- Blacklist region detection and flagging if
--blacklistflag is supplied.
2. saltshaker classify - Pattern classification
Extended analysis beyond the original R script. Performs spatial grouping and classifies the overall pattern as Single, Multiple or background based on heteroplasmy levels and spatial event distribution.
Usage:
saltshaker classify \
--prefix sample1 \ # Sample identifier (matches call output)
--input-dir results/ \ # Input directory with .saltshaker_call.tsv
--output-dir results/ \ # Output directory (default: same as input-dir)
--blacklist # Optional: enable with default MT blacklist, OR
--blacklist custom_bl.bed # Optional: enable with custom BED file
--vcf \ # Optional: also output VCF format
--chr-format \ # Optional: specify one of MT or chrM (default: chrM)
--high-het 10 \ # Optional: high heteroplasmy threshold % (default: 20)
--noise 0.3 \ # Optional: noise threshold % (default: 1.0)
--radius 600 \ # Optional: spatial clustering radius bp (default: 600)
--multiple-threshold 5 \ # Optional: event count for Multiple pattern (default: 10)
--dominant-fraction 0.5 # Optional: fraction for dominant group (default: 0.70)
Outputs:
results/sample.saltshaker_classify.txt- Detailed analysis report with classification reasoningresults/sample.saltshaker_classify_metadata.tsv- Events with spatial group assignments (for plotting)results/sample.saltshaker.vcf- events in VCF format (if--vcfspecified)
Classification criteria:
Single pattern:
- One or few high-heteroplasmy events (≥
high-het) - Dominant spatial group (≥
dominant-fractionof events) - Few total events (≤
multiple-threshold) - Consistent with pathogenic single deletion/duplication
Multiple pattern:
- Many events (>
multiple-threshold) - Dispersed spatial distribution (no dominant group)
- No high-heteroplasmy events
- Consistent with mtDNA maintenance defects
Spatial grouping:
Events within radius bp are grouped together.
3. saltshaker plot - Visualization
Generates circular genome plots based on the original R script visualization with enhanced spatial grouping and other features.
Usage:
saltshaker plot \
--prefix sample1 \
--input-dir results/ \
--output-dir results/plots/ \ # Optional: default is input-dir
--genes \ # Optional: enable with default MT genes, OR
--genes custom_genes.bed \ # Optional: enable with custom BED file
--blacklist \ # Optional: enable with default MT blacklist, OR
--blacklist custom_bl.bed \ # Optional: enable with custom BED file
--coverage sample1.bw \ # Optional: BigWig file for coverage track (requires saltshaker[coverage])
--figsize 16 10 \ # Optional: width height (default: 16 10)
--direction clockwise \ # Optional: clockwise or counterclockwise (default: counterclockwise)
--del-color red \ # Optional: red or blue (default: blue)
--dup-color blue \ # Optional: red or blue (default: red)
--scale fixed # Optional: dynamic or fixed (default: dynamic)
Output:
results/plot/sample.saltshaker.png- Circular genome visualization
Visualization features:
- Circular genome with arc-based event display
- Heteroplasmy gradient coloring for all event types (del/dup/blacklist-crossing events - BL)
- Dynamic (min-max %) or fixed (0-100%) heteroplasmy scale
- Spatial grouping for overlapping events
- Optional gene annotations track and labels
- Optional blacklist region marking: light-grey sector shading with dashed boundary lines (BL-crossing events in lime-green gradient)
- Replication origin markers (OriH / OriL): dashed dark-purple radial lines drawn from the circle centre to the genome ring, propagated automatically from
callthroughclassifytoplot - Optional BigWig coverage track: step-function ring drawn outside the gene/genome circle, percentile-normalised, with a scale indicator bracket
- Configurable colors and polar direction
Color scaling considerations
Example sample with 5-25% heteroplasmy:
Dynamic scale:
[████████████████████] ← Colors span full gradient from 5% to 25%
5% 25%
Fixed scale:
[█████░░░░░░░░░░░░░░░] ← Colors span from 0% to 100%, events appear lighter
0% 100%
Input files
Required (from MitoSAlt pipeline)
Cluster file (.cluster)
- Tab-separated clustered breakpoint data
- Generated by MitoSAlt's clustering step
- Columns: cluster ID, read counts, positions, heteroplasmy levels
Breakpoint file (.breakpoint)
- Tab-separated raw breakpoint data
- Generated by MitoSAlt's breakpoint detection step
- Columns: read names, positions, D-loop crossing flags
Reference genome (.fasta)
- Mitochondrial reference sequence
- Used for flanking sequence extraction and coordinate validation
Optional
Blacklist file (.bed)
- BED format regions to flag (e.g., artifacts, repetitive sequences)
- Format: chromosome, start position, end position, name
Genes file (.bed)
- BED format gene regions to plot around the genomic axis
- Format: chromosome, start position, end position, name, score (0), strand (+), thick start (same as start), think end (same as end), color in rgb code (e.g.
255,255,0)
Output formats
Display TSV ({prefix}.saltshaker_call.tsv)
Human-readable format matching original R script output with columns:
sample: Sample identifiercluster_id: Cluster identifieralt_reads,ref_reads: Read countsheteroplasmy: Heteroplasmy percentagedel_start_range,del_end_range: Coordinate rangesdel_size: Event size in base pairsfinal_event: Event type (del/dup)final_start,final_end: Final coordinatesblacklist_crossing: Flag for blacklist overlapseq1,seq2,seq: Flanking sequences and microhomology
Metadata files
Call metadata ({prefix}.saltshaker_call_metadata.tsv):
Internal format preserving all columns for downstream processing. Header contains genome_length, ori_h_start, ori_h_end, ori_l_start, ori_l_end so downstream steps do not need to re-specify origin coordinates.
Classify metadata ({prefix}.saltshaker_classify_metadata.tsv):
Internal format with additional group column for spatial group assignments. Origin coordinate headers are forwarded from call metadata. Used by plot command.
Analysis summary ({prefix}.saltshaker_classify.txt)
Human-readable analysis report including:
- Pattern classification (Single/Multiple) with reasoning
- Event statistics and heteroplasmy distribution
- Spatial clustering metrics
- Classification criteria scores
VCF format ({prefix}.saltshaker.vcf)
Standard VCF 4.3 format with structural variant fields:
SVTYPE: DEL or DUPEND: Variant end positionSVLEN: Variant lengthHF: Heteroplasmy fraction (0-1)GROUP: Spatial group identifierCLUSTER: Original cluster IDDLOOP: Flag for D-loop crossingBLCROSS: Flag for blacklist crossing
Circular plot ({prefix}.saltshaker.png)
Complete workflow example
Single-sample pipeline:
# Step 1: Call events from MitoSAlt output (R script port)
saltshaker call \
--prefix sample1 \
--output-dir results/ \
-c sample1.cluster -p sample1.breakpoint \
-r reference.fasta \
-g 16569 --ori-h-start 16081 --ori-h-end 407 \
--ori-l-start 5730 --ori-l-end 5763 \
--blacklist
# Step 2: Classify pattern and perform spatial grouping (extended analysis)
saltshaker classify \
--prefix sample1 \
--input-dir results/ \
--blacklist \
--vcf
# Step 3: Generate visualization (enhanced R script plotting)
saltshaker plot \
--prefix sample1 \
--input-dir results/ \
--output-dir results/plot/ \
--blacklist \
--genes
Batch processing multiple samples:
for sample in sample1 sample2 sample3; do
mkdir -p results/${sample}
# Call events
saltshaker call --prefix ${sample }--output-dir results/${sample} \
-c ${sample}.cluster -p ${sample}.breakpoint \
-r reference.fasta
-g 16569 --ori-h-start 16081 --ori-h-end 407 \
--ori-l-start 5730 --ori-l-end 5763 \
--blacklist
# Classify events
saltshaker classify --prefix ${sample} \
--input-dir results/${sample} \
--blacklist
# Plot events
saltshaker plot --prefix ${sample} \
--input-dir results/${sample} \
--blacklist \
--genes
done
Configuration
Default classification thresholds are defined in saltshaker/config.py:
# Heteroplasmy thresholds
HIGH_HET_THRESHOLD = 10.0 # High heteroplasmy threshold (%), --high-het
NOISE_THRESHOLD = 0.3 # Noise threshold (%), --noise
# Spatial clustering
CLUSTER_RADIUS = 600 # Spatial grouping radius (bp), --radius
MIN_CLUSTER_SIZE = 2 # Minimum events per cluster (not configurable via CLI)
# Pattern classification
MULTIPLE_EVENT_THRESHOLD = 5 # Event count for Multiple pattern, --multiple-threshold
DOMINANT_GROUP_FRACTION = 0.5 # Fraction for dominant group (70%), --dominant-fraction
These can be customized by modifying the configuration file or via CLI arguments (see saltshaker classify --help).
Command reference
Global options
All commands support:
-h, --help: Show help message--blacklist [FILE]: Enable blacklist regions (default: built-in MT blacklist, use as a flag; optional: custom BED file)
call command
Required:
--prefix STR: Sample prefix for output files-c, --cluster FILE: Cluster file from MitoSAlt-p, --breakpoint FILE: Breakpoint file from MitoSAlt-r, --reference FILE: Reference genome FASTA-g, --genome-length INT: Mitochondrial genome length--ori-h-start INT: Heavy strand origin start--ori-h-end INT: Heavy strand origin end--ori-l-start INT: Light strand origin start--ori-l-end INT: Light strand origin end
Optional:
--output-dir DIR: Output directory (default: .)-H, --het-limit FLOAT: Heteroplasmy threshold (default: 0.01)-f, --flank-size INT: Flanking sequence size in bp (default: 15)
classify command
Required:
--prefix STR: Sample prefix (matches call output)--input-dir DIR: Input directory containing saltshaker_call_metadata.tsv from call
Optional:
--output-dir DIR: Output directory (default: input-dir)--vcf: Also output VCF format--chr-format: Choose chromosome format for VCF output: MT or chrM (default: chrM)--high-het FLOAT: High heteroplasmy threshold % (default: 20)--noise FLOAT: Noise threshold % (default: 1)--radius INT: Spatial clustering radius bp (default: 600)--multiple-threshold INT: Event count for Multiple pattern (default: 10)--dominant-fraction FLOAT: Fraction for dominant group (default: 0.70)
plot command
Required:
--prefix STR: Sample prefix (matches classify output)--input-dir DIR: Input directory containing saltshaker_classify_metadata.tsv from classify
Optional:
--output-dir DIR: Output directory (default: input-dir)--genes [FILE]: Enable gene annotations (default: built-in MT genes; optional: custom BED file)--figsize WIDTH HEIGHT: Figure dimensions (default: 16 10)--direction STR: Plot direction - 'clockwise' or 'counterclockwise' (default: counterclockwise, MitoSAlt original)--del-color STR: Deletion color - 'red' or 'blue' (default: blue, MitoSAlt original)--dup-color STR: Duplication color - 'red' or 'blue' (default: red, MitoSAlt original)--scale STR: Heteroplasmy scale - 'dynamic' (min-max per category) or 'fixed' (0-100%) (default: dynamic)--show-origins STR: Draw replication origin markers - 'on' or 'off' (default: on); origin coordinates are propagated automatically fromcallvia the metadata file--coverage BW_FILE: Path to a BigWig (.bw) file; draws a coverage ring outside the genome circle, normalised to percentile scale. Requirespip install saltshaker[coverage].
Dependencies
Core (always installed)
pandas>=1.3.0
numpy>=1.20.0
matplotlib>=3.3.0
biopython>=1.78
Optional - coverage extra
pyBigWig # BigWig coverage track; Linux/macOS only (C extension)
Install with: pip install saltshaker[coverage]
Package structure
saltshaker/
├── __init__.py
├── __main__.py # CLI entry point with subcommands
├── config.py # Configuration and thresholds
├── types.py # Type definitions and data structures
├── event_caller.py # Event calling (R script port)
├── classifier.py # Pattern classification (single vs multiple vs background)
├── spatial.py # Spatial grouping
├── visualizer.py # Circular plotting
├── utils.py # Utility functions
│
├── assets/
│ └── logo.png
│
├── layout/ # Layout engine
│ ├── __init__.py
│ ├── engine.py # LayoutEngine class
│ └── types.py # Layout-specific types
│
├── cli/
│ ├── __init__.py
│ ├── call.py # Call subcommand
│ ├── classify.py # Classify subcommand
│ └── plot.py # Plot subcommand
│
├── io/
│ ├── __init__.py
│ ├── readers.py # File readers (blacklist, intermediate, gene annotations, BigWig)
│ ├── writers.py # TSV and summary output
│ └── vcf_writer.py # VCF format output
│
├── data/
│ ├── __init__.py # Default file paths
│ ├── gencode.v49.annotation.MT_genes.bed # Default hg38 MT gene annotations
│ └── mt_blacklist_regions.bed # Default MT blacklist regions
│
├── docs/
│ ├── classification_algorithm.md # Classification logic documentation
│ └── visualizer_description.md # Visualizer architecture documentation
│
└── tests/
├── conftest.py # Shared fixtures
├── pytest.ini
├── fixtures/
│ ├── inputs/
│ │ ├── human_mt_rCRS.fasta # Reference FASTA
│ │ ├── test_breakpoints.breakpoint # Raw breakpoints for integration tests
│ │ ├── test_clusters.cluster # Raw clusters for integration tests
│ │ ├── viz_sample_small.tsv # Small test dataset (15 events)
│ │ └── viz_sample_large.tsv # Large test dataset (80 events)
│ └── expected/
│ ├── ground_truth.json # Expected call/classify outputs
│ └── visualizer_layouts.json # Expected visualization layout characteristics
├── unit/
│ ├── test_bigwig_reader.py # BigWig reader unit tests
│ ├── test_helpers.py # Utility function tests
│ ├── test_intermediate_io.py # Intermediate file read/write roundtrip tests
│ ├── test_layout_engine.py # Layout engine unit tests
│ └── test_label_positioning.py # Label collision detection tests
└── integration/
├── test_saltshaker_output.py # End-to-end event calling tests
└── test_visualizer.py # End-to-end visualization tests
Documentation
- Classification algorithm — Detailed explanation of the Single vs Multiple pattern classification logic
- Visualizer architecture — Layout engine, visual elements, configuration reference
Project details
Download files
Download the file for your platform. If you're not sure which to choose, learn more about installing packages.
Source Distribution
Built Distribution
Filter files by name, interpreter, ABI, and platform.
If you're not sure about the file name format, learn more about wheel file names.
Copy a direct link to the current filters
File details
Details for the file saltshaker-1.1.0.tar.gz.
File metadata
- Download URL: saltshaker-1.1.0.tar.gz
- Upload date:
- Size: 103.5 kB
- Tags: Source
- Uploaded using Trusted Publishing? Yes
- Uploaded via: uv/0.11.6 {"installer":{"name":"uv","version":"0.11.6","subcommand":["publish"]},"python":null,"implementation":{"name":null,"version":null},"distro":{"name":"Debian GNU/Linux","version":"13","id":"trixie","libc":null},"system":{"name":null,"release":null},"cpu":null,"openssl_version":null,"setuptools_version":null,"rustc_version":null,"ci":true}
File hashes
| Algorithm | Hash digest | |
|---|---|---|
| SHA256 |
28e0fec85df943ae2a32fb01b69abec039e2bbc9286729068992167bdcfdff26
|
|
| MD5 |
7d5bab10393c68665c93b6e7b0c9a5f9
|
|
| BLAKE2b-256 |
fd11c07ba7ff0a09501044656a85c26cacc397f3a61d73c9a83e4e9e9ab75042
|
File details
Details for the file saltshaker-1.1.0-py3-none-any.whl.
File metadata
- Download URL: saltshaker-1.1.0-py3-none-any.whl
- Upload date:
- Size: 111.1 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? Yes
- Uploaded via: uv/0.11.6 {"installer":{"name":"uv","version":"0.11.6","subcommand":["publish"]},"python":null,"implementation":{"name":null,"version":null},"distro":{"name":"Debian GNU/Linux","version":"13","id":"trixie","libc":null},"system":{"name":null,"release":null},"cpu":null,"openssl_version":null,"setuptools_version":null,"rustc_version":null,"ci":true}
File hashes
| Algorithm | Hash digest | |
|---|---|---|
| SHA256 |
f8a214289aa0706cd59ec94e24f069d967ed0e5f047a66d431694ee56cef7498
|
|
| MD5 |
c976f7025c30a2bc44818724924fd489
|
|
| BLAKE2b-256 |
8eb1748d6e38e66d73e64c78d1f421ca42c2bad36a37314e8364c605b0c9ee6a
|