A Python tool for visualizing SNP matrices with hierarchical clustering
HeatCluster: Automated Genomic Epidemiology & Visualization
HeatCluster is a command-line bioinformatics tool designed to bridge the gap between raw genomic distance matrices and actionable epidemiological insights. HeatCluster integrates unsupervised machine learning (Hierarchical Clustering, PCA, Silhouette Analysis) with publication-quality visualization to automatically identify and classify viral or bacterial lineages.
Table of Contents
- Key Features
- Installation
- Quick Start
- Outputs
- Machine Learning Workflows
- Visualization Controls
- Supported Inputs
- Citation
Key Features
Robust Data Ingestion
HeatCluster can import the outputs of other popular bioinformatics tools, including SNP-dists, FastANI, skani, Mash, and PopPUNK.
Machine Learning & Automation
HeatCluster uses common clustering methods.
- Unsupervised Clustering: Uses fastcluster, a C++ optimized library, to perform hierarchical clustering. This keeps the computational complexity at $O(N^2)$, allowing rapid processing of large datasets that are slow to cluster with standard Python implementations.
- Automated Model Selection (Auto-K): The tool features an Auto-K module that uses Silhouette Analysis. This algorithm iteratively tests different cluster counts ($K$) and selects the count with the highest silhouette score as the estimated number of lineages.
- Feature Extraction: Users can flatten hierarchical trees into discrete, exportable clusters (CSV format) using two methods:
- K-Means Style: Dividing the tree into a fixed number of groups.
- Biological Thresholds: Defining clusters by a distance cutoff (e.g., grouping all samples that are fewer than 10 SNPs apart).
- Dimensionality Reduction: Integrates PCA (Principal Component Analysis) to generate validation scatter plots. These plots color-code samples based on the heatmap clusters, providing a secondary visual confirmation of the biological structure.
Publication-Ready Visualization
The visualization engine is built to produce figures that are ready for manuscripts and posters without post-editing.
- Dynamic Masking: Users can hide irrelevant data noise by masking values based on thresholds (e.g., "Mask all samples with <95% ANI" to focus only on species-level matches).
- High-DPI Output: Images can be generated at 300+ DPI, ensuring crisp resolution for print.
- Smart Formatting: The tool automatically scales font sizes and figure dimensions based on the sample count to prevent overlapping text and cluttered visuals.
Installation
HeatCluster requires Python 3.8 or higher.
From source:
git clone https://github.com/DrB-S/heatcluster.git
cd heatcluster
pip install .
From PyPI:
pip install heatcluster
Core Dependencies: pandas, numpy, scipy, seaborn, matplotlib, scikit-learn, fastcluster.
Quick Start Examples
1. Basic Heatmap (SNP Matrix)
Visualize a standard SNP distance matrix with a reverse-red colormap (where darker red indicates fewer SNPs/closer relation):
heatcluster -i matrix.csv -o heatmap.png --cmap Reds_r
2. ANI Visualization (Skani / FastANI)
Visualize Average Nucleotide Identity (ANI) using the skani format and a sequential colormap like viridis:
heatcluster -i skani_results.txt --format skani --cmap viridis -o ani_plot.png
3. Pipeline Analysis (No Plot)
Automatically detect clusters and export them to CSV without generating an image (useful for large datasets or pipelines):
heatcluster -i large_matrix.csv --auto-k --no-plot -l clusters.csv
Outputs
HeatCluster generates the following files:
- heatcluster_matrix.png (Default): The high-resolution clustered heatmap.
- heatcluster_sorted.csv (Default): The input distance matrix re-ordered to match the clustering tree. Useful for aligning metadata.
- heatcluster_clusters.csv (Optional): Created when using --auto-k, --cluster-k, or --cluster-t. Contains two columns: Sample and Cluster_ID.
- heatcluster_pca.png (Optional): The validation scatter plot created when --pca is used.
Machine Learning Workflows
Automated Lineage Discovery (Auto-K)
Use Case: Exploratory analysis of a new dataset where the population structure is unknown (e.g., initial screening of a hospital database).
If the number of outbreaks in the data is unknown, the Silhouette Analysis engine can determine it automatically.
heatcluster -i large_matrix.csv \
--auto-k \
--cluster-out clusters.csv \
--pca --pca-out pca_plot.png
- --auto-k: Iteratively tests $K=2..10$ and selects the cluster count with the highest Silhouette Coefficient. This identifies the most distinct natural groupings in the data.
- --cluster-out: Saves the specific samples belonging to each lineage to clusters.csv.
- --pca: Generates a validation scatter plot colored by the detected clusters. This provides a secondary visual check to ensure the detected clusters actually form distinct groups in principal component space.
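The Auto-K selection described above can be sketched with the standard library alone. This is an illustration under stated assumptions, not HeatCluster's implementation: the helper names (`mean_link`, `cut_into_k`, `mean_silhouette`, `auto_k`) are hypothetical, average linkage is an assumed choice, and the tool itself relies on fastcluster and scikit-learn rather than this naive loop.

```python
from itertools import combinations


def mean_link(dist, c1, c2):
    """Average pairwise distance between two clusters (average linkage)."""
    return sum(dist[i][j] for i in c1 for j in c2) / (len(c1) * len(c2))


def cut_into_k(dist, k):
    """Naive agglomerative clustering: merge the closest clusters until k remain."""
    clusters = [[i] for i in range(len(dist))]
    while len(clusters) > k:
        a, b = min(
            combinations(range(len(clusters)), 2),
            key=lambda pair: mean_link(dist, clusters[pair[0]], clusters[pair[1]]),
        )
        merged = clusters.pop(b)  # b > a, so index a is unaffected
        clusters[a].extend(merged)
    labels = [0] * len(dist)
    for cluster_id, members in enumerate(clusters):
        for i in members:
            labels[i] = cluster_id
    return labels


def mean_silhouette(dist, labels):
    """Mean silhouette coefficient from a precomputed distance matrix."""
    groups = {}
    for i, label in enumerate(labels):
        groups.setdefault(label, []).append(i)
    scores = []
    for i, label in enumerate(labels):
        own = groups[label]
        if len(own) == 1:
            scores.append(0.0)  # usual convention for singleton clusters
            continue
        a = sum(dist[i][j] for j in own if j != i) / (len(own) - 1)
        b = min(
            sum(dist[i][j] for j in members) / len(members)
            for other, members in groups.items()
            if other != label
        )
        scores.append((b - a) / max(a, b) if max(a, b) > 0 else 0.0)
    return sum(scores) / len(scores)


def auto_k(dist, k_min=2, k_max=10):
    """Return (best_k, best_score) over the tested range of cluster counts."""
    best_k, best_score = None, float("-inf")
    for k in range(k_min, min(k_max, len(dist) - 1) + 1):
        score = mean_silhouette(dist, cut_into_k(dist, k))
        if score > best_score:
            best_k, best_score = k, score
    return best_k, best_score


# Toy data: two tight groups of three samples, about 50 SNPs apart.
positions = [0, 1, 2, 50, 51, 52]
dist = [[abs(p - q) for q in positions] for p in positions]
best_k, best_score = auto_k(dist)
print(best_k)  # 2 for this toy matrix
```

Because the silhouette score compares within-cluster to nearest-cluster distances, a single very distant sample can dominate it, which is the situation the manual override below is meant for.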
A Note on Automated Clustering
The Auto-K (Silhouette Analysis) and PCA validation modules included in HeatCluster demonstrate how unsupervised machine learning techniques can be applied to genomic epidemiology.
While these methods are mathematically robust for identifying structure in high-dimensional data, they are not yet standard practice in clinical bioinformatics. Most outbreak detection currently relies on fixed biological thresholds (e.g., SNP cutoffs) rather than dynamic clustering coefficients.
These features are provided as experimental tools for exploratory analysis and to demonstrate the potential of integrating Scikit-learn workflows into genomic surveillance. Users should always validate "automated" clusters against biological context and established epidemiological links.
Threshold-Based Surveillance
Use Case: Routine public health surveillance where "outbreak" is defined by a strict biological rule (e.g., CDC guidelines defining a cluster as isolates within 50 SNPs).
To define an outbreak strictly as "any cluster of samples separated by fewer than 50 SNPs":
heatcluster -i snp_matrix.csv \
--cluster-t 50 \
--cluster-out outbreaks.csv
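One way to read a rule such as "fewer than 50 SNPs apart" is single-linkage grouping: two samples share a cluster whenever a chain of pairwise links below the cutoff connects them. The sketch below shows that reading with a small union-find structure. The function name `threshold_clusters` is hypothetical, and the linkage method HeatCluster applies with --cluster-t is not stated here, so treat single linkage as an assumption.

```python
def threshold_clusters(names, dist, threshold):
    """Group samples connected by chains of pairwise distances below threshold."""
    parent = list(range(len(names)))

    def find(i):
        # Follow parent pointers to the cluster root, shortening the path as we go.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i in range(len(names)):
        for j in range(i + 1, len(names)):
            if dist[i][j] < threshold:
                parent[find(i)] = find(j)

    # Number clusters 1, 2, ... in order of first appearance.
    cluster_ids = {}
    return {
        name: cluster_ids.setdefault(find(i), len(cluster_ids) + 1)
        for i, name in enumerate(names)
    }


names = ["A", "B", "C", "D"]
dist = [
    [0, 12, 60, 300],
    [12, 0, 40, 310],
    [60, 40, 0, 320],
    [300, 310, 320, 0],
]
clusters = threshold_clusters(names, dist, threshold=50)
print(clusters)  # {'A': 1, 'B': 1, 'C': 1, 'D': 2}
```

Note that A and C end up together even though they are 60 SNPs apart, because B links them. Other linkage methods (complete, average) would treat this case differently.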
Outlier Detection & Manual Override
Use Case: Refining analysis when automated metrics are skewed by a single distant outlier (e.g., one sample is 10,000 SNPs away, forcing all other distinct outbreaks into a single "cluster").
If automated metrics fail due to global outliers, specific cluster counts can be enforced based on visual inspection of the heatmap or dendrogram.
heatcluster -i matrix.csv --cluster-k 4 --cluster-out manual_clusters.csv
Visualization Controls
| Flag | Description | Example |
|---|---|---|
| --title | Set a custom plot title | --title "Salmonella Outbreak 2024" |
| --cmap | Set the matplotlib colormap | --cmap viridis |
| --no-annot | Hide numbers inside cells (clean look) | --no-annot |
| --no-plot | Skip image generation (CSV/Analysis only) | --no-plot |
| --dendrogram | Show the hierarchical tree structure | --dendrogram |
| --no-cluster | Disable clustering (simple sorting only) | --no-cluster |
| --dpi | Set image resolution | --dpi 300 |
| --hide-below | Mask values lower than X | --hide-below 95.0 |
| --hide-above | Mask values higher than X | --hide-above 50 |
| --vmin / --vmax | Force min/max values for color scale | --vmin 0 --vmax 100 |
| --width / --height | Force figure dimensions (inches) | --width 12 --height 12 |
| --font-scale | Scale text size by a factor | --font-scale 1.5 |
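The effect of --hide-below and --hide-above can be pictured as replacing out-of-range cells with missing values before the heatmap is drawn. The sketch below is only a conceptual illustration: `mask_matrix` is a hypothetical helper, and whether HeatCluster applies the cutoffs before or after the identity-to-distance conversion is not stated in this document.

```python
def mask_matrix(matrix, hide_below=None, hide_above=None):
    """Return a copy of matrix with out-of-range cells replaced by None."""

    def keep(value):
        if hide_below is not None and value < hide_below:
            return False
        if hide_above is not None and value > hide_above:
            return False
        return True

    return [[v if keep(v) else None for v in row] for row in matrix]


# Toy ANI matrix: samples 1 and 2 are the same species, sample 3 is not.
ani = [
    [100.0, 97.2, 81.5],
    [97.2, 100.0, 80.9],
    [81.5, 80.9, 100.0],
]
masked = mask_matrix(ani, hide_below=95.0)
print(masked[0])  # [100.0, 97.2, None]
```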
Supported Input Formats
HeatCluster accepts a wide variety of inputs from standard bioinformatics tools. Use the --format flag to specify the input type.
A sample file for each supported format can be found in the test directory of the repository.
A list of the general commands used to generate each test file can be found in the docs directory.
Important: Distance vs. Identity
HeatCluster requires a Distance Matrix (where 0 = Identical).
If an Identity or Similarity matrix (where 100 or 1.0 = Identical) is provided, HeatCluster will automatically invert the values.
| Format | Automatic Conversion Applied |
|---|---|
| FastANI, Skani, EzAAI | Distance = 100.0 - Identity |
| Sourmash, PyANI (Id), Dashing | Distance = 1.0 - Similarity |
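The two conversions in the table amount to a one-line rule per format family. A minimal sketch follows; the function name `to_distance` and the two set names are hypothetical, and the format strings are the --format values listed in this document.

```python
# Formats reporting identity on a 0-100 scale.
PERCENT_FORMATS = {"fastani", "skani", "ezaai"}
# Formats reporting similarity on a 0-1 scale.
FRACTION_FORMATS = {"sourmash", "pyani_identity", "dashing"}


def to_distance(value, fmt):
    """Convert an identity/similarity value to a distance (0 = identical)."""
    if fmt in PERCENT_FORMATS:
        return 100.0 - value
    if fmt in FRACTION_FORMATS:
        return 1.0 - value
    return value  # already a distance (e.g. SNP counts)


print(to_distance(98.5, "fastani"))   # 1.5
print(to_distance(0.75, "sourmash"))  # 0.25
```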
Genomic Identity (ANI / AAI)
- FastANI (--format fastani): Parses the standard 5-column tab-delimited output (query, ref, ANI, frag_map, total_frag).
- Skani (--format skani): Supports skani triangle outputs.
- EzAAI (--format ezaai): Parses the summary file (Label 1, Label 2, AAI/ANI).
- Sourmash (--format sourmash): Parses standard CSV comparison matrices.
- PyANI
  - --format pyani_identity: Parses percent identity matrices (inverts 1.0 -> 0.0).
  - --format pyani_errors: Parses the similarity error count matrix (mismatches/gaps).
SNP & Variant Calling
- SNP-dists (--format snp-dists): Default. Expects a standard square CSV/TSV matrix.
- VCF / VCF.gz (--format vcf): Calculates Manhattan Distance between samples using a ploidy-agnostic dosage approach (compatible with both haploid and diploid organisms).
  - Note: Strictly filters for biallelic sites to ensure accurate distance counts.
- NCBI Pathogen Detection (--format pathogen_detection): Parses the reference_target.SNP_distances.tsv file using the biosample accession for naming. Fills missing pairs with the maximum observed distance.
- SKA (--format ska): Parses the pairwise distance output from Split K-mer Analysis.
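The dosage idea behind the VCF parser can be sketched as follows: each genotype is reduced to its count of alternate alleles, and the distance between two samples is the sum of absolute dosage differences across sites. This is an illustration only. The helper names `dosage` and `manhattan` are hypothetical, and skipping sites with a missing call is an assumption rather than documented behavior.

```python
def dosage(genotype):
    """Count ALT alleles in a biallelic VCF GT string such as '0/1', '1|1' or '1'."""
    alleles = genotype.replace("|", "/").split("/")
    if "." in alleles:
        return None  # missing call
    return sum(1 for allele in alleles if allele != "0")


def manhattan(genotypes_a, genotypes_b):
    """Sum of absolute dosage differences over sites called in both samples."""
    total = 0
    for gt_a, gt_b in zip(genotypes_a, genotypes_b):
        d_a, d_b = dosage(gt_a), dosage(gt_b)
        if d_a is None or d_b is None:
            continue  # assumption: ignore sites with a missing genotype
        total += abs(d_a - d_b)
    return total


diploid_1 = ["0/0", "0/1", "1/1", "./."]
diploid_2 = ["0/0", "1/1", "0/0", "0/1"]
print(manhattan(diploid_1, diploid_2))  # 3

haploid_1 = ["0", "1", "1"]
haploid_2 = ["1", "1", "0"]
print(manhattan(haploid_1, haploid_2))  # 2
```

The same function handles haploid and diploid calls because it never assumes a fixed number of alleles per genotype.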
Gene Content & Allele Calling (MLST)
- Roary / Panaroo (--format gene_presence_absence): Parses the gene_presence_absence.Rtab binary matrix.
  - Metric: Calculates Jaccard Distance (0.0 = Identical gene content).
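For two samples, the Jaccard distance is one minus the number of genes present in both divided by the number present in either. A short sketch, assuming each sample is given as a 0/1 presence vector over the same ordered gene list (the function name `jaccard_distance` is hypothetical):

```python
def jaccard_distance(presence_a, presence_b):
    """Jaccard distance between two 0/1 gene presence vectors."""
    both = sum(1 for x, y in zip(presence_a, presence_b) if x and y)
    either = sum(1 for x, y in zip(presence_a, presence_b) if x or y)
    if either == 0:
        return 0.0  # neither sample carries any gene: treat as identical
    return 1.0 - both / either


sample_a = [1, 1, 0, 1]
sample_b = [1, 0, 0, 1]
print(jaccard_distance(sample_a, sample_b))  # 2 shared of 3 present, so 1 - 2/3
print(jaccard_distance(sample_a, sample_a))  # 0.0
```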
Sketching & K-mer Distance
- Mash (--format mash): Parses the standard 5-column list output.
- Dashing (--format dashing): Parses asymmetric distance matrices.
- BinDash (--format bindash): Parses the 5-column TSV output.
- KWIP (--format kwip): Parses the weighted inner-product matrix.
Phylogeny & Clusters
- Newick Trees (--format nwk): Calculates the Patristic Distance (sum of branch lengths) between all tips in the tree. Requires biopython.
- IQ-TREE (--format iqtree): Parses the .mldist file (Phylip format square matrix) generated by IQ-TREE.
- PopPUNK (--format poppunk): Parses the .dists.npy and companion .pkl files directly.
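Patristic distance is the total branch length on the path between two tips, which passes through their lowest common ancestor. The sketch below computes it on a hand-built tree stored as a child-to-parent map. It does not parse Newick (HeatCluster uses biopython for that), and the names `path_to_root` and `patristic` are hypothetical.

```python
def path_to_root(node, parents):
    """Return {ancestor: distance from node}, including the node itself at 0."""
    distances = {node: 0.0}
    travelled = 0.0
    while node in parents:
        parent, branch_length = parents[node]
        travelled += branch_length
        distances[parent] = travelled
        node = parent
    return distances


def patristic(tip_a, tip_b, parents):
    """Sum of branch lengths on the path between two tips."""
    up_a = path_to_root(tip_a, parents)
    up_b = path_to_root(tip_b, parents)
    # Among shared ancestors, the lowest common ancestor gives the smallest sum.
    return min(up_a[n] + up_b[n] for n in up_a if n in up_b)


# Equivalent to the Newick string ((A:1,B:2)X:3,C:4)root;
parents = {
    "A": ("X", 1.0),
    "B": ("X", 2.0),
    "X": ("root", 3.0),
    "C": ("root", 4.0),
}
print(patristic("A", "B", parents))  # 3.0
print(patristic("A", "C", parents))  # 8.0
```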
Generic Data
- Melted / Long-Format (--format melted)
  - Accepts any generic 3-column CSV/TSV list: SampleA,SampleB,Value.
  - The file must not have a header line.
  - This is useful for custom metrics.
  - There is one row for each sample-to-sample comparison.
  - Identical samples must have a value of 0; all other values must be positive.
Example:
A B 1
B A 1
B B 0
A A 0
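Turning a melted list like the one above into a square matrix is a matter of indexing the sample names and filling one cell per row. A minimal sketch, assuming tab-delimited input and leaving unlisted pairs as None so that gaps are visible (`melted_to_square` is a hypothetical helper, not part of HeatCluster):

```python
import csv
import io


def melted_to_square(text, delimiter="\t"):
    """Build (names, matrix) from headerless SampleA, SampleB, Value rows."""
    rows = [row for row in csv.reader(io.StringIO(text), delimiter=delimiter) if row]
    names = sorted({row[0] for row in rows} | {row[1] for row in rows})
    index = {name: i for i, name in enumerate(names)}
    # None marks pairs that never appear in the input.
    matrix = [[None] * len(names) for _ in names]
    for sample_a, sample_b, value in rows:
        matrix[index[sample_a]][index[sample_b]] = float(value)
    return names, matrix


melted = "A\tB\t1\nB\tA\t1\nB\tB\t0\nA\tA\t0\n"
names, matrix = melted_to_square(melted)
print(names)   # ['A', 'B']
print(matrix)  # [[0.0, 1.0], [1.0, 0.0]]
```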
Gallery & Examples
Visualization Controls
HeatCluster gives you fine-grained control over how your data is displayed.
| Default View | Custom Color Scale | Threshold Clustering |
|---|---|---|
| Standard Output | --vmax 200 | --cluster-t 10 |
| Standard hierarchical clustering | Adjusting contrast for high-variance data | Defining clusters by distance < 10 |
Machine Learning & Validation
Automate the discovery of lineages or validate your clusters with PCA.
| Manual K-Means | Auto-K (Silhouette Analysis) | PCA Validation |
|---|---|---|
| --cluster-k 3 | --auto-k | --pca |
| Forcing 3 specific lineages | Automatically detecting optimal clusters | Dimensionality reduction check |
Supported Input Formats
HeatCluster natively handles diverse genomic data types.
Genomic Identity (ANI) & SNP Matrices
| SNP Distance | Average Nucleotide Identity |
|---|---|
| SNP-dists | Skani / FastANI |
| Diverging colormap (RdYlBu_r) | Sequential colormap (viridis) inverted |
Phylogenetics & K-mer Sketching
| Phylogenetic Tree | PopPUNK Distances |
|---|---|
| Newick (.nwk) | PopPUNK (.npy) |
| Patristic distance from tree topology | Binary matrix visualization |
Citation
If you use HeatCluster in your research, please cite:
HeatCluster: Automated Genomic Epidemiology & Visualization. https://github.com/DrB-S/heatcluster
About this fork
This repository is a maintained fork of the original heatcluster project by Stephen Beckstrom-Sternberg. Development of the original repository paused in ~2024. This fork continues maintenance, bug fixes, packaging, and releases.
Contributing
Contributions are welcome! Please submit issues for bug reports or feature requests. Pull requests are appreciated for new parsers or plotting features.
License
MIT License. Free for academic and commercial use.