Pattern classification and visualization for MitoSAlt
SAltShaker
A Python package for classifying and visualizing mitochondrial structural variants from MitoSAlt pipeline output.
Overview
SAltShaker is a Python port and extension of the original MitoSAlt (Basu et al. PLoS Genetics 2020) delplot.R visualization script. The package provides three modular commands for a flexible analysis workflow:
- Event calling: Direct Python port of the original R script's deletion/duplication classification logic
- Pattern classification: Rule-based decision tree to distinguish single-event and multiple-event patterns from background
- Visualization: Circular genome plotting based on the original R script visualization with spatial grouping and annotations
Installation
From source
git clone https://gitlab.com/GenomeDX/annotation/saltshaker.git
cd saltshaker
pip install -e .
Docker
# Build the image
docker build -t saltshaker .
# Run with mounted data directory
docker run -v /path/to/your/data:/data saltshaker [command] [options]
Package registry
SAltShaker is published to PyPI and GitLab PyPI (https://gitlab.com/api/v4/groups/16758292/-/packages/pypi/simple).
Pip
pip install saltshaker
# or, from the GitLab registry:
pip install saltshaker --index-url https://gitlab.com/api/v4/groups/16758292/-/packages/pypi/simple
Poetry
poetry source add --priority=primary dpipe https://gitlab.com/api/v4/groups/16758292/-/packages/pypi/simple
poetry add saltshaker
uv
uv add --index https://gitlab.com/api/v4/groups/16758292/-/packages/pypi/simple saltshaker
Commands
1. saltshaker call - Event calling (R script port)
Python port of the original MitoSAlt R script event calling logic. Calls detected breakpoint clusters as deletions or duplications based on replication origin overlap, following the exact algorithm from delplot.R.
Usage:
saltshaker call \
--prefix sample \ # Sample identifier
--output-dir results/ \ # Output directory
-c sample.cluster \ # cluster file from MitoSAlt
-p sample.breakpoint \ # breakpoint file from MitoSAlt
-r reference.fasta \ # mitochondrial reference genome
-g 16569 \ # genome length
--ori-h-start 16081 --ori-h-end 407 \ # heavy strand origin
--ori-l-start 5730 --ori-l-end 5763 \ # light strand origin
-H 0.01 \ # heteroplasmy threshold
-f 15 \ # flanking sequence size (bp)
--blacklist \ # Optional: enable with default MT blacklist, OR
--blacklist custom_bl.bed # Optional: enable with custom BED file
Outputs:
- results/sample.saltshaker_call.tsv - Human-readable results matching original R script output format
- results/sample.saltshaker_call_metadata.tsv - Metadata for downstream processing (classify/plot)
Algorithm (from original R script):
The logic is a direct port of the original MitoSAlt R implementation:
- Data loading: Parses cluster and breakpoint files, merges data to link clusters with D-loop crossing information
- Deletion size calculation: Handles circular genome wraparound for events crossing position 1
- Origin-based event calling: Events overlapping replication origins (OriH or OriL) are called as duplications; non-overlapping events are deletions
- Coordinate handling: Implements the R script's coordinate swapping logic for D-loop crossing events
- Flanking sequence analysis: Uses Biostrings-equivalent pattern matching to find microhomology sequences near breakpoints
This approach identifies the arc complementary to the actual structural change, consistent with the biological interpretation that origin-overlapping events represent duplications of the non-deleted arc.
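The origin-based calling rule and the wraparound size calculation can be illustrated with a minimal sketch. This is not the package's code: the function names are hypothetical, the size convention (end minus start) is an assumption, and the D-loop coordinate swapping from delplot.R is left out. Arcs are given as (start, end) pairs, and start > end means the arc wraps past position 1.

```python
GENOME_LENGTH = 16569      # value passed via -g in the usage example
ORI_H = (16081, 407)       # heavy strand origin, wraps past position 1
ORI_L = (5730, 5763)       # light strand origin


def circular_span(start, end, genome_length):
    """Arc length from start to end, wrapping past position 1 if needed."""
    if end >= start:
        return end - start
    return genome_length - start + end


def linear_segments(start, end, genome_length):
    """Split a possibly wrapping arc into ordinary linear intervals."""
    if start <= end:
        return [(start, end)]
    return [(start, genome_length), (1, end)]


def arcs_overlap(arc_a, arc_b, genome_length):
    """True if two circular arcs share at least one position."""
    return any(
        s1 <= e2 and s2 <= e1
        for s1, e1 in linear_segments(*arc_a, genome_length)
        for s2, e2 in linear_segments(*arc_b, genome_length)
    )


def call_event(start, end, genome_length=GENOME_LENGTH):
    """Call 'dup' if the arc overlaps OriH or OriL, otherwise 'del'."""
    arc = (start, end)
    if arcs_overlap(arc, ORI_H, genome_length) or arcs_overlap(arc, ORI_L, genome_length):
        return "dup"
    return "del"


print(call_event(8470, 13447))   # arc avoids both origins
print(call_event(5000, 6000))    # arc covers OriL
print(circular_span(16000, 500, GENOME_LENGTH))
```

Splitting a wrapping arc into two linear intervals keeps the overlap test a plain interval comparison, which is easier to check by hand than modular arithmetic.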
Original R script functionality preserved:
- Exact deletion/duplication calling algorithm
- D-loop crossing detection and coordinate handling
- Flanking sequence extraction and microhomology analysis
- Output TSV format and column names
- Heteroplasmy calculation and filtering
Python enhancements in call step:
- Blacklist region detection and flagging when the --blacklist flag is supplied.
2. saltshaker classify - Pattern classification
Extended analysis beyond the original R script. Performs spatial grouping and classifies the overall pattern as Single, Multiple or background based on heteroplasmy levels and spatial event distribution.
Usage:
saltshaker classify \
--prefix sample1 \ # Sample identifier (matches call output)
--input-dir results/ \ # Input directory with .saltshaker_call.tsv
--output-dir results/ \ # Output directory (default: same as input-dir)
--blacklist \ # Optional: enable with default MT blacklist, OR
--blacklist custom_bl.bed \ # Optional: enable with custom BED file
--vcf \ # Optional: also output VCF format
--chr-format chrM \ # Optional: specify one of MT or chrM (default: chrM)
--high-het 10 \ # Optional: high heteroplasmy threshold % (default: 20)
--noise 0.3 \ # Optional: noise threshold % (default: 1.0)
--radius 600 \ # Optional: spatial clustering radius bp (default: 600)
--multiple-threshold 5 \ # Optional: event count for Multiple pattern (default: 10)
--dominant-fraction 0.5 # Optional: fraction for dominant group (default: 0.70)
Outputs:
- results/sample.saltshaker_classify.txt - Detailed analysis report with classification reasoning
- results/sample.saltshaker_classify_metadata.tsv - Events with spatial group assignments (for plotting)
- results/sample.saltshaker.vcf - Events in VCF format (if --vcf specified)
Classification criteria:
Single pattern:
- One or few high-heteroplasmy events (≥ high-het)
- Dominant spatial group (≥ dominant-fraction of events)
- Few total events (≤ multiple-threshold)
- Consistent with pathogenic single deletion/duplication
Multiple pattern:
- Many events (> multiple-threshold)
- Dispersed spatial distribution (no dominant group)
- No high-heteroplasmy events
- Consistent with mtDNA maintenance defects
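The criteria above can be read as a small decision function. The sketch below is one possible reading, not the package's classifier: the function name, the way the rules are combined, the noise filtering step, and the fallback to "Background" for samples that match neither pattern are all assumptions. Default values are the CLI defaults listed in the command reference.

```python
from collections import Counter


def classify_pattern(heteroplasmy, groups, high_het=20.0, noise=1.0,
                     multiple_threshold=10, dominant_fraction=0.70):
    """Return 'Single', 'Multiple' or 'Background' for one sample.

    heteroplasmy: heteroplasmy percentage of each event
    groups: spatial group label of each event (same order)
    """
    # Assumption: events below the noise threshold are ignored.
    kept = [(h, g) for h, g in zip(heteroplasmy, groups) if h >= noise]
    if not kept:
        return "Background"

    n_events = len(kept)
    has_high_het = any(h >= high_het for h, _ in kept)
    largest_group = Counter(g for _, g in kept).most_common(1)[0][1]
    has_dominant_group = largest_group / n_events >= dominant_fraction

    if has_high_het and has_dominant_group and n_events <= multiple_threshold:
        return "Single"
    if n_events > multiple_threshold and not has_dominant_group and not has_high_het:
        return "Multiple"
    # Assumption: anything matching neither rule set is treated as background.
    return "Background"


print(classify_pattern([45.0, 2.0], ["G1", "G1"]))
print(classify_pattern([2.0] * 12, [f"G{i}" for i in range(12)]))
```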
Spatial grouping:
Events within --radius bp of each other are grouped together.
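A minimal sketch of radius-based grouping, under stated assumptions: an event joins a group when both its start and end lie within the radius of that group's first member, and circular wraparound is ignored. The function name and the greedy strategy are illustrative and may differ from what saltshaker/spatial.py does.

```python
def group_events(events, radius=600):
    """Greedily group (start, end) events whose breakpoints are close.

    Returns a list of groups, each a list of (start, end) tuples.
    """
    groups = []
    for start, end in sorted(events):
        for group in groups:
            ref_start, ref_end = group[0]  # first member acts as the reference
            if abs(start - ref_start) <= radius and abs(end - ref_end) <= radius:
                group.append((start, end))
                break
        else:
            # No existing group was close enough, so start a new one.
            groups.append([(start, end)])
    return groups


events = [(8470, 13447), (8600, 13500), (3000, 15000)]
for index, group in enumerate(group_events(events), start=1):
    print(f"G{index}: {group}")
```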
3. saltshaker plot - Visualization
Generates circular genome plots based on the original R script visualization with enhanced spatial grouping and other features.
Usage:
saltshaker plot \
--prefix sample1 \
--input-dir results/ \
--output-dir results/plot/ \ # Optional: default is input-dir
--genes \ # Optional: enable with default MT genes, OR
--genes custom_genes.bed \ # Optional: enable with custom BED file
--blacklist \ # Optional: enable with default MT blacklist, OR
--blacklist custom_bl.bed \ # Optional: enable with custom BED file
--figsize 16 10 \ # Optional: width height (default: 16 10)
--direction clockwise \ # Optional: clockwise or counterclockwise (default: counterclockwise)
--del-color red \ # Optional: red or blue (default: blue)
--dup-color blue \ # Optional: red or blue (default: red)
--scale fixed # Optional: dynamic or fixed (default: dynamic)
Output:
results/plot/sample.saltshaker.png- Circular genome visualization
Visualization features:
- Circular genome with arc-based event display
- Heteroplasmy gradient coloring for all event types (del/dup/blacklist-crossing events - BL)
- Dynamic (min-max %) or fixed (0-100%) heteroplasmy scale
- Spatial grouping for overlapping events
- Optional gene annotations track and labels
- Optional blacklist region marking (BL-crossing events in lime-green gradient)
- Configurable colors and polar direction
Color scaling considerations
Example sample with 5-25% heteroplasmy:
Dynamic scale:
[████████████████████] ← Colors span full gradient from 5% to 25%
5% 25%
Fixed scale:
[█████░░░░░░░░░░░░░░░] ← Colors span from 0% to 100%, events appear lighter
0% 100%
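The difference between the two scales comes down to which bounds are used to normalize heteroplasmy before it is mapped onto the color gradient. The sketch below shows that arithmetic for the 5-25% example; the function names are hypothetical and the package's plotting code may normalize differently.

```python
def scale_bounds(values, scale="dynamic"):
    """Return the (low, high) heteroplasmy bounds used for color mapping."""
    if scale == "fixed":
        return 0.0, 100.0
    if scale == "dynamic":
        return min(values), max(values)
    raise ValueError(f"unknown scale: {scale!r}")


def normalize(value, low, high):
    """Map a heteroplasmy value to a gradient position in [0, 1]."""
    if high == low:
        return 1.0  # a single distinct value gets the full color
    return (value - low) / (high - low)


heteroplasmy = [5.0, 15.0, 25.0]
for scale in ("dynamic", "fixed"):
    low, high = scale_bounds(heteroplasmy, scale)
    positions = [round(normalize(h, low, high), 2) for h in heteroplasmy]
    print(scale, positions)
```

With the dynamic scale the three events land at 0.0, 0.5 and 1.0 on the gradient; with the fixed scale they land at 0.05, 0.15 and 0.25, which is why they appear lighter.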
Input files
Required (from MitoSAlt pipeline)
Cluster file (.cluster)
- Tab-separated clustered breakpoint data
- Generated by MitoSAlt's clustering step
- Columns: cluster ID, read counts, positions, heteroplasmy levels
Breakpoint file (.breakpoint)
- Tab-separated raw breakpoint data
- Generated by MitoSAlt's breakpoint detection step
- Columns: read names, positions, D-loop crossing flags
Reference genome (.fasta)
- Mitochondrial reference sequence
- Used for flanking sequence extraction and coordinate validation
Optional
Blacklist file (.bed)
- BED format regions to flag (e.g., artifacts, repetitive sequences)
- Format: chromosome, start position, end position, name
Genes file (.bed)
- BED format gene regions to plot around the genomic axis
- Format: chromosome, start position, end position, name, score (0), strand (+), thick start (same as start), thick end (same as end), color as an RGB code (e.g. 255,255,0)
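A short sketch of reading either BED file. It relies only on the standard BED convention (tab-separated, 0-based start, end-exclusive); the names BedRegion, parse_bed and bed_contains are hypothetical, the example line is illustrative, and whether SAltShaker applies the same coordinate convention internally is an assumption.

```python
from typing import NamedTuple


class BedRegion(NamedTuple):
    chrom: str
    start: int  # 0-based, inclusive
    end: int    # 0-based, exclusive
    name: str


def parse_bed(lines):
    """Parse BED lines, keeping the first four columns and ignoring the rest."""
    regions = []
    for line in lines:
        line = line.strip()
        if not line or line.startswith(("#", "track", "browser")):
            continue
        fields = line.split("\t")
        name = fields[3] if len(fields) > 3 else ""
        regions.append(BedRegion(fields[0], int(fields[1]), int(fields[2]), name))
    return regions


def bed_contains(region, position):
    """True if a 1-based genome position falls inside the BED region."""
    return region.start <= position - 1 < region.end


# Illustrative nine-column gene line in the format described above.
example = ["chrM\t3306\t4262\tMT-ND1\t0\t+\t3306\t4262\t255,255,0"]
gene = parse_bed(example)[0]
print(gene.name, bed_contains(gene, 3307), bed_contains(gene, 4263))
```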
Output formats
Display TSV ({prefix}.saltshaker_call.tsv)
Human-readable format matching original R script output with columns:
- sample: Sample identifier
- cluster_id: Cluster identifier
- alt_reads, ref_reads: Read counts
- heteroplasmy: Heteroplasmy percentage
- del_start_range, del_end_range: Coordinate ranges
- del_size: Event size in base pairs
- final_event: Event type (del/dup)
- final_start, final_end: Final coordinates
- blacklist_crossing: Flag for blacklist overlap
- seq1, seq2, seq: Flanking sequences and microhomology
Metadata files
Call metadata ({prefix}.saltshaker_call_metadata.tsv):
Internal format preserving all columns for downstream processing. Contains metadata header with genome length.
Classify metadata ({prefix}.saltshaker_classify_metadata.tsv):
Internal format with additional group column for spatial group assignments. Used by plot command.
Analysis summary ({prefix}.saltshaker_classify.txt)
Human-readable analysis report including:
- Pattern classification (Single/Multiple) with reasoning
- Event statistics and heteroplasmy distribution
- Spatial clustering metrics
- Classification criteria scores
VCF format ({prefix}.saltshaker.vcf)
Standard VCF 4.3 format with structural variant fields:
- SVTYPE: DEL or DUP
- END: Variant end position
- SVLEN: Variant length
- HF: Heteroplasmy fraction (0-1)
- GROUP: Spatial group identifier
- CLUSTER: Original cluster ID
- DLOOP: Flag for D-loop crossing
- BLCROSS: Flag for blacklist crossing
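To show how these INFO fields fit into a VCF data line, here is a hedged sketch that assembles one record. Only the INFO keys come from the list above. Everything else is an assumption made for illustration: the function name, the symbolic ALT alleles, the "N" reference base, the PASS filter, the negative SVLEN for deletions, and the four-decimal HF formatting. The actual writer in saltshaker/io/vcf_writer.py may differ.

```python
def format_vcf_record(chrom, start, end, svtype, heteroplasmy_pct,
                      group, cluster_id, dloop=False, blcross=False):
    """Build one tab-separated VCF data line for a structural variant."""
    svlen = end - start
    if svtype == "DEL":
        svlen = -svlen  # assumption: deletions carry a negative length

    info = [
        f"SVTYPE={svtype}",
        f"END={end}",
        f"SVLEN={svlen}",
        f"HF={heteroplasmy_pct / 100:.4f}",  # percentage converted to a 0-1 fraction
        f"GROUP={group}",
        f"CLUSTER={cluster_id}",
    ]
    if dloop:
        info.append("DLOOP")      # flag fields have no value
    if blcross:
        info.append("BLCROSS")

    # CHROM, POS, ID, REF, ALT, QUAL, FILTER, INFO
    columns = [chrom, str(start), ".", "N", f"<{svtype}>", ".", "PASS", ";".join(info)]
    return "\t".join(columns)


print(format_vcf_record("chrM", 8470, 13447, "DEL", 35.0, "G1", "cluster1"))
```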
Circular plot ({prefix}.saltshaker.png)
Complete workflow example
Single-sample pipeline:
# Step 1: Call events from MitoSAlt output (R script port)
saltshaker call \
--prefix sample1 \
--output-dir results/ \
-c sample1.cluster -p sample1.breakpoint \
-r reference.fasta \
-g 16569 --ori-h-start 16081 --ori-h-end 407 \
--ori-l-start 5730 --ori-l-end 5763 \
--blacklist
# Step 2: Classify pattern and perform spatial grouping (extended analysis)
saltshaker classify \
--prefix sample1 \
--input-dir results/ \
--blacklist \
--vcf
# Step 3: Generate visualization (enhanced R script plotting)
saltshaker plot \
--prefix sample1 \
--input-dir results/ \
--output-dir results/plot/ \
--blacklist \
--genes
Batch processing multiple samples:
for sample in sample1 sample2 sample3; do
mkdir -p results/${sample}
# Call events
saltshaker call --prefix ${sample} --output-dir results/${sample} \
-c ${sample}.cluster -p ${sample}.breakpoint \
-r reference.fasta \
-g 16569 --ori-h-start 16081 --ori-h-end 407 \
--ori-l-start 5730 --ori-l-end 5763 \
--blacklist
# Classify events
saltshaker classify --prefix ${sample} \
--input-dir results/${sample} \
--blacklist
# Plot events
saltshaker plot --prefix ${sample} \
--input-dir results/${sample} \
--blacklist \
--genes
done
Configuration
Default classification thresholds are defined in saltshaker/config.py:
# Heteroplasmy thresholds
HIGH_HET_THRESHOLD = 20.0 # High heteroplasmy threshold (%), --high-het
NOISE_THRESHOLD = 1.0 # Noise threshold (%), --noise
# Spatial clustering
CLUSTER_RADIUS = 600 # Spatial grouping radius (bp), --radius
MIN_CLUSTER_SIZE = 2 # Minimum events per cluster (not configurable via CLI)
# Pattern classification
MULTIPLE_EVENT_THRESHOLD = 10 # Event count for Multiple pattern, --multiple-threshold
DOMINANT_GROUP_FRACTION = 0.70 # Fraction for dominant group (70%), --dominant-fraction
These can be customized by modifying the configuration file or via CLI arguments (see saltshaker classify --help).
Command reference
Global options
All commands support:
- -h, --help: Show help message
- --blacklist [FILE]: Enable blacklist regions (default: built-in MT blacklist, use as a flag; optional: custom BED file)
call command
Required:
- --prefix STR: Sample prefix for output files
- -c, --cluster FILE: Cluster file from MitoSAlt
- -p, --breakpoint FILE: Breakpoint file from MitoSAlt
- -r, --reference FILE: Reference genome FASTA
- -g, --genome-length INT: Mitochondrial genome length
- --ori-h-start INT: Heavy strand origin start
- --ori-h-end INT: Heavy strand origin end
- --ori-l-start INT: Light strand origin start
- --ori-l-end INT: Light strand origin end
Optional:
- --output-dir DIR: Output directory (default: .)
- -H, --het-limit FLOAT: Heteroplasmy threshold (default: 0.01)
- -f, --flank-size INT: Flanking sequence size in bp (default: 15)
classify command
Required:
- --prefix STR: Sample prefix (matches call output)
- --input-dir DIR: Input directory containing saltshaker_call_metadata.tsv from call
Optional:
- --output-dir DIR: Output directory (default: input-dir)
- --vcf: Also output VCF format
- --chr-format: Choose chromosome format for VCF output: MT or chrM (default: chrM)
- --high-het FLOAT: High heteroplasmy threshold % (default: 20)
- --noise FLOAT: Noise threshold % (default: 1)
- --radius INT: Spatial clustering radius bp (default: 600)
- --multiple-threshold INT: Event count for Multiple pattern (default: 10)
- --dominant-fraction FLOAT: Fraction for dominant group (default: 0.70)
plot command
Required:
- --prefix STR: Sample prefix (matches classify output)
- --input-dir DIR: Input directory containing saltshaker_classify_metadata.tsv from classify
Optional:
- --output-dir DIR: Output directory (default: input-dir)
- --genes [FILE]: Enable gene annotations (default: built-in MT genes; optional: custom BED file)
- --figsize WIDTH HEIGHT: Figure dimensions (default: 16 10)
- --direction STR: Plot direction - 'clockwise' or 'counterclockwise' (default: counterclockwise, MitoSAlt original)
- --del-color STR: Deletion color - 'red' or 'blue' (default: blue, MitoSAlt original)
- --dup-color STR: Duplication color - 'red' or 'blue' (default: red, MitoSAlt original)
- --scale STR: Heteroplasmy scale - 'dynamic' (min-max per category) or 'fixed' (0-100%) (default: dynamic)
Dependencies
pandas>=1.3.0
numpy>=1.20.0
matplotlib>=3.3.0
biopython>=1.78
Package structure
saltshaker/
├── __init__.py
├── __main__.py # CLI entry point with subcommands
├── config.py # Configuration and thresholds
├── types.py # Type definitions and data structures
├── event_caller.py # Event calling (R script port)
├── classifier.py # Pattern classification (single vs multiple vs background)
├── spatial.py # Spatial grouping
├── visualizer.py # Circular plotting
├── utils.py # Utility functions
│
├── layout/ # Layout engine
│ ├── __init__.py
│ ├── engine.py # LayoutEngine class
│ └── types.py # Layout-specific types
├── cli/
│ ├── call.py # Call subcommand
│ ├── classify.py # Classify subcommand
│ └── plot.py # Plot subcommand
└── io/
├── readers.py # File input
├── writers.py # TSV and summary output
└── vcf_writer.py # VCF format output
└── data/
├── __init__.py # Default file paths
├── gencode.v49.annotation.MT_genes.bed # Default MT gene annotations
└── mt_blacklist_regions.bed # Default MT blacklist regions
└── tests/
├── fixtures/
│ ├── inputs/
│ │ ├── human_mt_rCRS.fasta # Reference FASTA
│ │ ├── test_breakpoints.breakpoint # Small test dataset raw breakpoints
│ │ ├── test_clusters.cluster # Small test dataset raw clusters
│ │ ├── viz_sample_small.tsv # Small test dataset (15 events)
│ │ └── viz_sample_large.tsv # Large test dataset (80 events)
│ └── expected/
│ ├── viz_sample_small.tsv # Small test dataset (15 events)
│ └── visualizer_layouts.json # Expected visualization characteristics
│
├── unit/
│ ├── test_helpers.py # Utilities tests
│ ├── test_layout_engine.py # Layout engine tests
│ └── test_label_positioning.py # Visualization unit tests
│
└── integration/
├── test_saltshaker_output.py # End-to-end event calling tests
└── test_visualizer.py # End-to-end visualization tests
docs/
└── classification_algorithm.md # Detailed classification algorithm documentation
Documentation
- Classification algorithm - Detailed explanation of the Single vs Multiple pattern classification logic