Pure-Python port of the R/CRAN package shazam -- immunoglobulin somatic hypermutation (SHM) analysis: distance-to-nearest, clonal threshold, SHM targeting models, mutation profiling and BASELINe selection analysis.
py-shazam
Pure-Python port of the R/CRAN package shazam -- Immunoglobulin Somatic Hypermutation Analysis, part of the Immcantation framework (Gupta, Vander Heiden, et al., Bioinformatics 2015; Kleinstein Lab, Yale).
pyshazam is a standalone, dependency-light implementation of shazam's
computational core: distance-to-nearest neighbour, clonal-threshold
detection, SHM targeting models, observed/expected mutation profiling and
the BASELINe antigen-driven selection framework. It does not require R.
| PyPI / import name | pyshazam |
| License | AGPL-3 (same as upstream shazam) |
| Upstream | CRAN shazam 1.3.2 |
Install
pip install pyshazam
Dependencies: numpy, scipy, pandas, matplotlib. The Hartigans'
dip-test p-value used by the GMM threshold method is an optional extra
(pip install "pyshazam[diptest]"; the quotes keep shells such as zsh from expanding the brackets).
What is ported
pyshazam is a faithful re-implementation: numerical parity with
shazam 1.3.2 is the design goal. All of shazam's bundled data
(HH_S5F, HH_S1F, MK_RS5NF, MK_RS1NF, HKL_S5F, HKL_S1F,
U5N, the IMGT region schemes and the mutation-class schemes) ships inside
the package.
- Distance-to-nearest + clonal threshold -- distToNearest, findThreshold (density and gmm methods), calcTargetingDistance, nearestDist, getDNAMatrix, getAAMatrix.
- SHM targeting models -- createSubstitutionMatrix, createMutabilityMatrix, createTargetingMatrix, createTargetingModel, extendSubstitutionMatrix, extendMutabilityMatrix, calculateMutability, symmetrize.
- Mutation analysis -- observedMutations / calcObservedMutations, expectedMutations / calcExpectedMutations, setRegionBoundaries, slideWindowSeq / slideWindowDb / slideWindowTune.
- BASELINe selection -- calcBaseline, groupBaseline, summarizeBaseline, testBaseline, createBaseline, editBaseline.
- SHM simulation -- shmulateSeq, shmulateTree.
- Consensus / collapsing -- collapseClones, consensusSequence.
- Plotting (matplotlib) -- distance/threshold histograms, mutability heatmaps, BASELINe density and summary plots, mutation-frequency plots.
Quick start
import pandas as pd
import pyshazam as sh

# AIRR-format data frame of IMGT-aligned Ig sequences
db = pd.read_csv("example_db.tsv", sep="\t")

# --- Distance to nearest + clonal threshold ---
dtn = sh.distToNearest(db, sequenceColumn="junction",
                       vCallColumn="v_call", jCallColumn="j_call",
                       model="ham", normalize="len", first=False)
thr = sh.findThreshold(dtn["dist_nearest"].dropna().values, method="density")
print("clonal threshold:", thr.threshold)

# --- Observed mutations (R/S by CDR/FWR region) ---
db = sh.observedMutations(db, regionDefinition=sh.IMGT_V, frequency=False)

# --- BASELINe selection analysis ---
clones = sh.collapseClones(db, cloneColumn="clone_id",
                           sequenceColumn="sequence_alignment",
                           germlineColumn="germline_alignment_d_mask",
                           method="mostCommon")
baseline = sh.calcBaseline(clones, sequenceColumn="clonal_sequence",
                           germlineColumn="clonal_germline",
                           testStatistic="focused",
                           regionDefinition=sh.IMGT_V,
                           targetingModel=sh.HH_S5F)
grouped = sh.groupBaseline(baseline, groupBy="sample_id")
summary = sh.summarizeBaseline(grouped, returnType="df")
print(summary)

# --- Build a custom SHM targeting model ---
model = sh.createTargetingModel(db, model="s",
                                sequenceColumn="sequence_alignment",
                                germlineColumn="germline_alignment_d_mask",
                                vCallColumn="v_call")
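
The observed-mutation step above reports replacement (R) and silent (S) mutation counts. As a rough illustration of what that classification means, the following standard-library sketch compares an observed sequence with its germline codon by codon. It is a simplified stand-in, not the package's algorithm: `count_rs` is a hypothetical helper, each mutated position is judged on its own in the germline codon context, codons containing gaps or ambiguous characters are skipped, and no CDR/FWR region breakdown is attempted.

```python
from itertools import product

# Standard genetic code, codons enumerated in TCAG order.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}


def count_rs(germline: str, observed: str) -> dict:
    """Count replacement (R) and silent (S) point mutations, codon by codon.

    Assumptions: both sequences are in frame from position 0 and equally long;
    codons with characters other than A/C/G/T (gaps, N) are ignored.
    """
    if len(germline) != len(observed):
        raise ValueError("germline and observed sequences must have equal length")

    counts = {"R": 0, "S": 0}
    usable = len(germline) - len(germline) % 3  # drop a trailing partial codon
    for start in range(0, usable, 3):
        g_codon = germline[start:start + 3].upper()
        o_codon = observed[start:start + 3].upper()
        if g_codon not in CODON_TABLE or o_codon not in CODON_TABLE:
            continue
        for offset in range(3):
            if g_codon[offset] == o_codon[offset]:
                continue
            # Apply only this one substitution to the germline codon.
            mutated = g_codon[:offset] + o_codon[offset] + g_codon[offset + 1:]
            kind = "S" if CODON_TABLE[mutated] == CODON_TABLE[g_codon] else "R"
            counts[kind] += 1
    return counts


if __name__ == "__main__":
    # GGT -> GGC keeps glycine (silent); CTG -> ATG turns leucine into methionine (replacement).
    print(count_rs("GGTCTG", "GGCATG"))
```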
R-parity
py-shazam is validated against shazam 1.3.2. The deterministic functions agree with R either exactly or within the tolerances listed below:
| Function | Agreement vs shazam 1.3.2 |
|---|---|
| observedMutations (R/S counts and frequencies) | bit-exact (max abs diff 0) |
| distToNearest (Hamming model) | bit-exact (max abs diff 0) |
| calcTargetingDistance (HH_S5F) | bit-exact (max abs diff 0) |
| createSubstitutionMatrix (1-mer) | rel-diff < 1e-15 |
| expectedMutations | rel-diff < 1e-15 |
| findThreshold density bandwidth / threshold | bit-exact (< 1e-8) |
| BASELINe selection sigma (summarizeBaseline) | rel-diff < 1e-13 |
| baselineCI confidence intervals | rel-diff < 1e-8 |
tests/test_r_parity.py regenerates the R references from
tests/r_reference_driver.R and asserts these tolerances; it is skipped
automatically when R / shazam is unavailable.
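For readers who want to see what such tolerance checks look like, here is a small standard-library sketch of the two metrics named in the table (maximum absolute difference and maximum relative difference) together with a check for whether R is on the PATH. The function names are hypothetical and this is not the code in tests/test_r_parity.py; the way the relative difference is normalised is an assumption.

```python
import shutil


def max_abs_diff(values, reference):
    """Largest absolute element-wise difference between two equal-length sequences."""
    if len(values) != len(reference):
        raise ValueError("inputs must have equal length")
    return max((abs(a - b) for a, b in zip(values, reference)), default=0.0)


def max_rel_diff(values, reference):
    """Largest relative element-wise difference.

    Assumption: each difference is divided by the larger magnitude of the pair,
    and a pair of exact zeros counts as a difference of 0.
    """
    if len(values) != len(reference):
        raise ValueError("inputs must have equal length")
    worst = 0.0
    for a, b in zip(values, reference):
        scale = max(abs(a), abs(b))
        if scale > 0:
            worst = max(worst, abs(a - b) / scale)
    return worst


def r_available() -> bool:
    """True when an Rscript executable can be found on the PATH."""
    return shutil.which("Rscript") is not None


if __name__ == "__main__":
    python_result = [0.25, 0.5, 1.0]   # illustrative values only
    r_reference = [0.25, 0.5, 1.0]
    print("max abs diff:", max_abs_diff(python_result, r_reference))
    print("max rel diff:", max_rel_diff(python_result, r_reference))
    print("R available:", r_available())
```

A test runner can use a check like `r_available()` as a skip condition so the suite still passes on machines without R.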
Citation
If you use py-shazam, please cite the original shazam package:
Gupta NT, Vander Heiden JA, Uduman M, Gadala-Maria D, Yaari G, Kleinstein SH. Change-O: a toolkit for analyzing large-scale B cell immunoglobulin repertoire sequencing data. Bioinformatics 2015.
and, for the BASELINe selection methods and the SHM targeting models:
Yaari G, Uduman M, Kleinstein SH. Quantifying selection in high-throughput immunoglobulin sequencing data sets. Nucleic Acids Research 2012; 40(17):e134.
Yaari G, Vander Heiden JA, et al. Models of somatic hypermutation targeting and substitution based on synonymous mutations from high-throughput immunoglobulin sequencing data. Frontiers in Immunology 2013; 4:358.
License
AGPL-3, the same license as the upstream shazam package. See LICENSE.