
Pure-Python port of the R/CRAN package shazam -- immunoglobulin somatic hypermutation (SHM) analysis: distance-to-nearest, clonal threshold, SHM targeting models, mutation profiling and BASELINe selection analysis.

py-shazam

Pure-Python port of the R/CRAN package shazam -- Immunoglobulin Somatic Hypermutation Analysis, part of the Immcantation framework (Gupta, Vander Heiden, et al., Bioinformatics 2015; Kleinstein Lab, Yale).

pyshazam is a standalone, dependency-light implementation of shazam's computational core: distance-to-nearest neighbour, clonal-threshold detection, SHM targeting models, observed/expected mutation profiling and the BASELINe antigen-driven selection framework. It does not require R.

  • PyPI / import name: pyshazam
  • License: AGPL-3 (same as upstream shazam)
  • Upstream: CRAN shazam 1.3.2

Install

pip install pyshazam

Dependencies: numpy, scipy, pandas, matplotlib. The Hartigans' dip-test p-value reported by the gmm threshold method needs an optional extra: pip install "pyshazam[diptest]" (the quotes stop shells such as zsh from expanding the brackets).
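
Code that should work with or without the optional extra can probe for it before asking for a dip-test p-value. The sketch below assumes the extra provides an importable top-level module named diptest; that name is inferred from the extra's name and is not confirmed by this README. The helper has_optional_module is hypothetical.

```python
import importlib.util


def has_optional_module(name: str) -> bool:
    """Return True if a top-level module called `name` is importable."""
    return importlib.util.find_spec(name) is not None


# "diptest" is an assumed module name, guessed from the name of the extra.
HAVE_DIPTEST = has_optional_module("diptest")
print("dip-test support available:", HAVE_DIPTEST)
```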

What is ported

pyshazam is a faithful re-implementation: numerical parity with shazam 1.3.2 is the design goal. All of shazam's bundled data (HH_S5F, HH_S1F, MK_RS5NF, MK_RS1NF, HKL_S5F, HKL_S1F, U5N, the IMGT region schemes and the mutation-class schemes) ships inside the package.

  • Distance-to-nearest + clonal threshold -- distToNearest, findThreshold (density and gmm methods), calcTargetingDistance, nearestDist, getDNAMatrix, getAAMatrix.
  • SHM targeting models -- createSubstitutionMatrix, createMutabilityMatrix, createTargetingMatrix, createTargetingModel, extendSubstitutionMatrix, extendMutabilityMatrix, calculateMutability, symmetrize.
  • Mutation analysis -- observedMutations / calcObservedMutations, expectedMutations / calcExpectedMutations, setRegionBoundaries, slideWindowSeq / slideWindowDb / slideWindowTune.
  • BASELINe selection -- calcBaseline, groupBaseline, summarizeBaseline, testBaseline, createBaseline, editBaseline.
  • SHM simulation -- shmulateSeq, shmulateTree.
  • Consensus / collapsing -- collapseClones, consensusSequence.
  • Plotting (matplotlib) -- distance/threshold histograms, mutability heatmaps, BASELINe density and summary plots, mutation-frequency plots.
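
To make the distance-to-nearest idea in the first bullet concrete, here is a minimal standalone sketch. It is not pyshazam's implementation: the helper names hamming_fraction and nearest_distances are hypothetical, and it assumes all sequences already share one length, whereas the library partitions sequences (for example by V call, J call and junction length) before comparing them. Each sequence gets the smallest non-zero length-normalized Hamming distance to any other sequence.

```python
import numpy as np


def hamming_fraction(a: str, b: str) -> float:
    """Length-normalized Hamming distance between two equal-length strings."""
    if len(a) != len(b):
        raise ValueError("sequences must have equal length")
    mismatches = sum(x != y for x, y in zip(a, b))
    return mismatches / len(a)


def nearest_distances(seqs: list[str]) -> np.ndarray:
    """Smallest non-zero distance from each sequence to any other sequence.

    An entry stays NaN when a sequence has no neighbour at distance > 0.
    """
    n = len(seqs)
    out = np.full(n, np.nan)
    for i in range(n):
        dists = [hamming_fraction(seqs[i], seqs[j]) for j in range(n) if j != i]
        nonzero = [d for d in dists if d > 0]
        if nonzero:
            out[i] = min(nonzero)
    return out


# Toy junction-like strings, invented for illustration only.
junctions = ["TGTGCAAGA", "TGTGCAAGG", "TGTGCTAGG", "TGTTTTTTT"]
nearest = nearest_distances(junctions)
print(nearest)
```

The pairwise loop is quadratic in the number of sequences, which is fine for a toy example but is one reason real implementations group sequences first.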

Quick start

import pandas as pd
import pyshazam as sh

# AIRR-format data frame of IMGT-aligned Ig sequences
db = pd.read_csv("example_db.tsv", sep="\t")

# --- Distance to nearest + clonal threshold ---
dtn = sh.distToNearest(db, sequenceColumn="junction",
                       vCallColumn="v_call", jCallColumn="j_call",
                       model="ham", normalize="len", first=False)
thr = sh.findThreshold(dtn["dist_nearest"].dropna().values, method="density")
print("clonal threshold:", thr.threshold)

# --- Observed mutations (R/S by CDR/FWR region) ---
db = sh.observedMutations(db, regionDefinition=sh.IMGT_V, frequency=False)

# --- BASELINe selection analysis ---
clones = sh.collapseClones(db, cloneColumn="clone_id",
                           sequenceColumn="sequence_alignment",
                           germlineColumn="germline_alignment_d_mask",
                           method="mostCommon")
baseline = sh.calcBaseline(clones, sequenceColumn="clonal_sequence",
                           germlineColumn="clonal_germline",
                           testStatistic="focused",
                           regionDefinition=sh.IMGT_V,
                           targetingModel=sh.HH_S5F)
grouped = sh.groupBaseline(baseline, groupBy="sample_id")
summary = sh.summarizeBaseline(grouped, returnType="df")
print(summary)

# --- Build a custom SHM targeting model ---
model = sh.createTargetingModel(db, model="s",
                                sequenceColumn="sequence_alignment",
                                germlineColumn="germline_alignment_d_mask",
                                vCallColumn="v_call")
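
The density method used above looks for the valley between the two modes of the distance-to-nearest distribution. The following standalone sketch illustrates that idea with scipy's gaussian_kde and its default bandwidth; it is an approximation for intuition only, since shazam's bandwidth selection and threshold search differ in detail. The function name density_valley and the simulated data are hypothetical.

```python
import numpy as np
from scipy.stats import gaussian_kde


def density_valley(distances: np.ndarray, grid_size: int = 512) -> float:
    """Location of the lowest density between the two highest KDE peaks."""
    kde = gaussian_kde(distances)  # default (Scott) bandwidth, an assumption
    grid = np.linspace(distances.min(), distances.max(), grid_size)
    dens = kde(grid)
    # Interior grid points that are higher than both neighbours.
    peaks = np.where((dens[1:-1] > dens[:-2]) & (dens[1:-1] > dens[2:]))[0] + 1
    if len(peaks) < 2:
        raise ValueError("density is not bimodal on this grid")
    # Keep the two tallest peaks, ordered left to right.
    left, right = np.sort(peaks[np.argsort(dens[peaks])[-2:]])
    return float(grid[left + np.argmin(dens[left:right + 1])])


# Simulated bimodal distances: a within-clone mode and a between-clone mode.
rng = np.random.default_rng(0)
sample = np.concatenate([rng.normal(0.05, 0.02, 300),
                         rng.normal(0.35, 0.05, 700)])
sample = sample[sample > 0]
threshold = density_valley(sample)
print("valley at", round(threshold, 3))
```

Sequence pairs closer than the valley would be treated as clonally related; pairs further apart would not.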

R-parity

py-shazam is validated against shazam 1.3.2. The deterministic functions match R exactly or to within the tolerances listed below:

| Function | Agreement vs shazam 1.3.2 |
| --- | --- |
| observedMutations (R/S counts and frequencies) | bit-exact (max abs diff 0) |
| distToNearest (Hamming model) | bit-exact (max abs diff 0) |
| calcTargetingDistance (HH_S5F) | bit-exact (max abs diff 0) |
| createSubstitutionMatrix (1-mer) | rel-diff < 1e-15 |
| expectedMutations | rel-diff < 1e-15 |
| findThreshold density bandwidth / threshold | bit-exact (< 1e-8) |
| BASELINe selection sigma (summarizeBaseline) | rel-diff < 1e-13 |
| baselineCI confidence intervals | rel-diff < 1e-8 |

tests/test_r_parity.py regenerates the R references from tests/r_reference_driver.R and asserts these tolerances; it is skipped automatically when R / shazam is unavailable.
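
The skip-when-R-is-missing pattern and a relative-difference check can be sketched as follows. This is not the content of tests/test_r_parity.py: the helper max_rel_diff, the class name and the placeholder numbers are hypothetical, and the sketch uses the standard-library unittest module only to stay dependency-free.

```python
import shutil
import unittest

import numpy as np


def max_rel_diff(py_values, r_values) -> float:
    """Largest element-wise relative difference, safe when a reference is 0."""
    py = np.asarray(py_values, dtype=float)
    r = np.asarray(r_values, dtype=float)
    denom = np.maximum(np.abs(r), np.finfo(float).tiny)
    return float(np.max(np.abs(py - r) / denom))


# Skip the whole test case when no Rscript executable is found on PATH.
@unittest.skipUnless(shutil.which("Rscript"), "R is not installed")
class ParityExample(unittest.TestCase):
    def test_tolerance(self):
        # Placeholder values; a real test would load numbers produced by R.
        r_ref = [0.125, 0.5, 2.0]
        py_out = [0.125, 0.5, 2.0]
        self.assertLess(max_rel_diff(py_out, r_ref), 1e-15)
```

Checking for the Rscript executable only shows that R is present; a real suite would also need to confirm that the shazam package itself can be loaded.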

Citation

If you use py-shazam, please cite the original shazam package:

Gupta NT, Vander Heiden JA, Uduman M, Gadala-Maria D, Yaari G, Kleinstein SH. Change-O: a toolkit for analyzing large-scale B cell immunoglobulin repertoire sequencing data. Bioinformatics 2015; 31(20):3356-3358.

and, for the BASELINe selection methods and the SHM targeting models:

Yaari G, Uduman M, Kleinstein SH. Quantifying selection in high-throughput immunoglobulin sequencing data sets. Nucleic Acids Research 2012; 40(17):e134.

Yaari G, Vander Heiden JA, et al. Models of somatic hypermutation targeting and substitution based on synonymous mutations from high-throughput immunoglobulin sequencing data. Frontiers in Immunology 2013; 4:358.

License

AGPL-3, the same license as the upstream shazam package. See LICENSE.
