A Python package for explaining biological sequence models using Shapley values and interactions.

SHAP zero: Explaining Biological Sequence Models

SHAP zero is a Python package for the amortized computation of Shapley values and interactions. It pays a one-time cost to sketch the model's Fourier transform. After that, SHAP zero explains future query sequences at near-zero marginal cost by mapping the Fourier transform to Shapley values and interactions.
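For context on why amortization matters: computed directly from their definition, exact Shapley values require enumerating every coalition of the other positions, which grows exponentially with sequence length and has to be repeated for each explained sequence. The sketch below is a brute-force baseline on a hypothetical three-player toy game, not SHAP zero's algorithm; `exact_shapley` and `toy_value` are illustrative names that are not part of the package.

```python
from itertools import combinations
from math import factorial

def exact_shapley(value_fn, n):
    """Exact Shapley values of an n-player value function by enumerating all coalitions."""
    phi = [0.0] * n
    for i in range(n):
        others = [j for j in range(n) if j != i]
        for size in range(n):
            # Shapley weight for coalitions of this size that exclude player i
            weight = factorial(size) * factorial(n - size - 1) / factorial(n)
            for subset in combinations(others, size):
                s = frozenset(subset)
                phi[i] += weight * (value_fn(s | {i}) - value_fn(s))
    return phi

def toy_value(coalition):
    """Hypothetical toy game: additive weights plus one pairwise interaction."""
    weights = {0: 2.0, 1: -1.0, 2: 0.5}
    total = sum(weights[j] for j in coalition)
    if 0 in coalition and 1 in coalition:
        total += 3.0  # the interaction is split evenly between players 0 and 1
    return total

phi = exact_shapley(toy_value, n=3)
print(phi)
```

Each call to `exact_shapley` evaluates the value function on the order of n times 2^(n-1) coalitions, which is the per-query cost that an amortized approach aims to avoid.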

Installation

shapzero is designed to work with Python 3.10 and above. Installation can be done via pip:

pip install shapzero

Quickstart

Initialize the explainer with your model using shapzero.init and compute the Fourier transform using compute_fourier_transform. From there, you can compute SHAP values and interactions using explain.

import shapzero
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import Ridge

# Train example model
X, y = shapzero.load_dna_example()
model = Pipeline([
    ('poly_features', PolynomialFeatures(degree=2, interaction_only=True)),
    ('linear_regression', Ridge(alpha=0.5))
])
model.fit(X, y)

# Set up SHAP zero explainer
q = 4   # alphabet size (q=4 nucleotides for DNA and RNA, q=20 amino acids for proteins)
n = 10  # sequence length
output_directory = "shapzero_output"  # placeholder: any directory for results and checkpoints
explainer = shapzero.init(
    q=q,
    n=n,
    model=model,
    exp_dir=output_directory
)
# pay one-time cost to compute the Fourier transform
explainer.compute_fourier_transform(
    budget=30000, verbose=True
)
>> ----------
>> R^2 is 0.96
>> There are 20 1-order interactions.
>> There are 208 2-order interactions.
>> There are 1 0-order interactions.
>> ----------

# Explain sequences using SHAP values
seqs = shapzero.load_dna_sequences_to_explain() # list of strings
print(seqs)
>> ['ACTCTTGAGG', 'TATATCTGTG', 'GATGTATAGG'...
shap = explainer.explain(seqs, explanation='shap_value') # list of SHAP values
print(shap[0])
>> {(0,): 1.3241669688536364,   # SHAP value of the 1st nucleotide
>>  (1,): 0.4545280155565195,     
>>  (2,): -3.6661905864093125, 
>>  ...}
# plot and save SHAP values
explainer.plot()
explainer.save()

Plot of SHAP values over DNA sequences
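The printed output above suggests that each explanation is a dictionary keyed by position tuples. Assuming that format, a small helper can rank positions by the magnitude of their SHAP value. `rank_positions` is a hypothetical helper, not part of shapzero, and the values are the first three entries printed above.

```python
# First three entries of the quickstart output shown above
shap_values = {
    (0,): 1.3241669688536364,
    (1,): 0.4545280155565195,
    (2,): -3.6661905864093125,
}

def rank_positions(shap_dict):
    """Return (position, value) pairs sorted by decreasing absolute SHAP value."""
    flat = [(key[0], value) for key, value in shap_dict.items()]
    return sorted(flat, key=lambda item: abs(item[1]), reverse=True)

ranked = rank_positions(shap_values)
for position, value in ranked:
    print(f"position {position}: {value:+.3f}")
```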

# Explain sequences using Shapley interactions
interactions = explainer.explain(seqs, explanation='interaction')  # list of interactions
print(interactions[0])
>> {(0, 7): 2.867415008537887,   # interaction between the 1st and 8th nucleotides
>>  (6, 7): -1.2684576082389891,
>>  (4, 5): 0.4493051300654991,
>>  ...}
# plot and save interactions
explainer.plot()
explainer.save()

Plot of Shapley interactions over DNA sequences
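Assuming the interaction output is a dictionary keyed by tuples of positions, as the printed example suggests, pairwise interactions can be arranged into a symmetric matrix for custom analysis. `to_pairwise_matrix` is a hypothetical helper, not part of shapzero. The sketch skips any keys that are not pairs, on the assumption that other interaction orders may also be present.

```python
import numpy as np

n = 10  # sequence length used in the quickstart
interaction_values = {
    (0, 7): 2.867415008537887,
    (6, 7): -1.2684576082389891,
    (4, 5): 0.4493051300654991,
}

def to_pairwise_matrix(interaction_dict, n):
    """Place 2nd-order interactions into a symmetric (n, n) matrix; other orders are skipped."""
    matrix = np.zeros((n, n))
    for positions, value in interaction_dict.items():
        if len(positions) != 2:
            continue
        i, j = positions
        matrix[i, j] = value
        matrix[j, i] = value
    return matrix

matrix = to_pairwise_matrix(interaction_values, n)
# Total absolute interaction strength that each position takes part in
strength = np.abs(matrix).sum(axis=1)
print(int(np.argmax(strength)))
```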

Load a previously computed Fourier transform

If you previously ran explainer.compute_fourier_transform(), SHAP zero automatically saved the Fourier transform to output_directory/fourier_transform.pickle. To resume explaining from that checkpoint, pass the path to the saved Fourier transform into shapzero.init.

explainer = shapzero.init(
    q=q,
    n=n,
    fourier_transform=f"{output_directory}/fourier_transform.pickle",
    exp_dir=output_directory
)
# explain using the pre-computed Fourier transform! 
shap = explainer.explain(seqs, explanation='shap_value')
explainer.plot()
explainer.save()
interactions = explainer.explain(seqs, explanation='interaction')
explainer.plot()
explainer.save()
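A script that may run either before or after the one-time cost has been paid can check for the checkpoint file first. The sketch below only builds the keyword arguments for shapzero.init, using the parameter names shown above. `init_kwargs` is a hypothetical helper, and it assumes the checkpoint always lives at fourier_transform.pickle inside the output directory.

```python
from pathlib import Path

def init_kwargs(output_directory, q, n, model=None):
    """Build keyword arguments for shapzero.init, preferring a saved Fourier transform."""
    checkpoint = Path(output_directory) / "fourier_transform.pickle"
    kwargs = {"q": q, "n": n, "exp_dir": str(output_directory)}
    if checkpoint.is_file():
        # Reuse the previously computed Fourier transform
        kwargs["fourier_transform"] = str(checkpoint)
    elif model is not None:
        # No checkpoint yet: the caller still has to run compute_fourier_transform
        kwargs["model"] = model
    else:
        raise ValueError("no saved Fourier transform found and no model given")
    return kwargs
```

The result would be passed as shapzero.init(**init_kwargs(output_directory, q, n, model)). When the returned dictionary contains "model" rather than "fourier_transform", compute_fourier_transform still needs to be called before explaining.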

What types of models is SHAP zero compatible with?

SHAP zero aims to be compatible with most biological sequence models out of the box! SHAP zero will automatically detect what type of model you have (e.g. PyTorch, sklearn, XGBoost, etc.) and attempt to query it. To streamline the process, we ask that your model take either one-hot inputs (with input dimension $q \times n$) or $q$-ary inputs.

By default, SHAP zero uses the following $q$-ary encoding scheme:

DNA_ENCODING = {'A': 0, 'C': 1, 'G': 2, 'T': 3}
RNA_ENCODING = {'A': 0, 'C': 1, 'G': 2, 'U': 3}
PROTEIN_ENCODING = {
    'A': 0, 'C': 1, 'D': 2, 'E': 3, 'F': 4, 'G': 5, 'H': 6, 'I': 7, 'K': 8,
    'L': 9, 'M': 10, 'N': 11, 'P': 12, 'Q': 13, 'R': 14, 'S': 15, 'T': 16,
    'V': 17, 'W': 18, 'Y': 19
}

For example, if our one-hot DNA model takes in as an input [[1, 0, 0, 0], [0, 0, 1, 0], [0, 1, 0, 0], [0, 0, 0, 1]], SHAP zero will attempt to query the sequence 'AGCT'. If our $q$-ary protein model takes as an input [18, 2, 5, 15], SHAP zero will attempt to query the sequence 'WDGS'.
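The two input formats in this example can be reproduced with a few lines of numpy. This is a sketch for illustration only: `to_qary`, `to_one_hot`, and `to_strings` are hypothetical helpers rather than shapzero functions, and the one-hot layout of one row per position follows the example above.

```python
import numpy as np

DNA_ENCODING = {'A': 0, 'C': 1, 'G': 2, 'T': 3}

def to_qary(seqs, encoding):
    """Encode equal-length strings as an integer array of shape (num_samples, n)."""
    return np.array([[encoding[ch] for ch in seq] for seq in seqs], dtype=int)

def to_one_hot(qary, q):
    """Expand a (num_samples, n) q-ary array into (num_samples, n, q) one-hot form."""
    return np.eye(q, dtype=int)[qary]

def to_strings(qary, encoding):
    """Invert to_qary using the reverse of the encoding table."""
    decoding = {index: ch for ch, index in encoding.items()}
    return [''.join(decoding[int(v)] for v in row) for row in qary]

qary = to_qary(['AGCT'], DNA_ENCODING)
one_hot = to_one_hot(qary, q=4)
print(qary[0].tolist())     # q-ary form of 'AGCT'
print(one_hot[0].tolist())  # one row per position, one column per nucleotide
```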

What if my model uses a different input scheme/uses a unique architecture?

To stay compatible with as many biological sequence models as possible, SHAP zero also accepts user-written functions. The function should take a 2D $q$-ary numpy array of shape (num_samples, n) and return a 1D numpy array of shape (num_samples,). Alternatively, the function can take a list of num_samples sequences, where each sequence is a string of length n.
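Before handing a function to SHAP zero, it can help to confirm that it honors this shape contract. The following sketch calls a function on random $q$-ary inputs and checks the output shape. `check_qary_contract` is a hypothetical helper, not part of shapzero, and it covers only the array interface.

```python
import numpy as np

def check_qary_contract(fn, q, n, num_samples=8, seed=0):
    """Call fn on random q-ary inputs and verify it returns shape (num_samples,)."""
    rng = np.random.default_rng(seed)
    X = rng.integers(0, q, size=(num_samples, n))  # entries in {0, ..., q-1}
    y = np.asarray(fn(X))
    if y.shape != (num_samples,):
        raise ValueError(f"expected output shape {(num_samples,)}, got {y.shape}")
    if not np.all(np.isfinite(y)):
        raise ValueError("model returned non-finite scores")
    return y

# A function that sums the q-ary symbols of each sequence satisfies the contract
y = check_qary_contract(lambda X: X.sum(axis=1), q=4, n=10)
print(y.shape)
```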

Examples of possible functions:

1. Mathematical functions

def model(X):
    """
    Computes y = 5 * X[:, 0] - 2 * X[:, 3] + X[:, 1] * X[:, 4]
    """
    y = 5 * X[:, 0] - 2 * X[:, 3] + X[:, 1] * X[:, 4]
    return y
explainer = shapzero.init(
    q=q,
    n=n,
    exp_dir=output_dir,
    model=model
)

2. Models with pre-defined initializations

import numpy as np
import shapzero

# Assume 'load_model' and 'compute_model_scores' are functions from an external library
def load_model(model_path):
    ...
def compute_model_scores(model, samples, context_data):
    # This is a placeholder for the function that gets predictions.
    # It might take the model, the new sequences (samples), and other contextual data.
    ...

# Define a wrapper that will interface with SHAP zero
class ModelScorer:
    def __init__(self, model_path, context_data=None):
        """
        Initializes the scorer by loading the model and storing any
        contextual data needed for predictions.

        Args:
            model_path (str): Path to the pre-trained model.
            context_data (dict, optional): A dictionary of any other data the
                model needs outside of just the length-n sequence.
        """
        self.model = load_model(model_path)
        self.context_data = context_data if context_data is not None else {}

    def predict(self, samples_numpy_array):
        """
        This is the method that will be passed to SHAP zero.
        It takes a 2D q-ary numpy array and returns a 1D numpy array of scores.
        """
        # This function calls your underlying model's prediction logic,
        # passing along the model object, the new samples, and any other
        # contextual data that was stored during initialization.
        scores = compute_model_scores(
            model=self.model,
            samples=samples_numpy_array,
            context_data=self.context_data
        )
        return np.array(scores)

model_path = "model"
context_data = {
    "target_sequence": "ACGTACGT",
    "positions_of_interest": [2, 3, 6]
}
scorer = ModelScorer(model_path=model_path, context_data=context_data)
# pass scorer.predict into SHAP zero! 
explainer = shapzero.init(
    q=q,
    n=n,
    exp_dir=output_dir,
    model=scorer.predict  # pass the wrapper here
)

3. Models with three-dimensional inputs (e.g., CNNs)

import numpy as np
import torch
import shapzero

def load_cnn_model(model_path, q, n):
    ...

def compute_cnn_scores(model, qary_numpy_array, q, n):
    """
    Handles the data conversion from q-ary to a model that takes as an input (batch, q, n)
    """
    num_samples = qary_numpy_array.shape[0]
    # 1. One-hot encode the (batch, n) q-ary data to (batch, n, q)
    one_hot = np.zeros((num_samples, n, q))
    one_hot[np.arange(num_samples)[:, None], np.arange(n), qary_numpy_array] = 1
    # 2. Transpose from (batch, n, q) to (batch, q, n)
    one_hot_transposed = np.transpose(one_hot, (0, 2, 1))
    input_tensor = torch.from_numpy(one_hot_transposed).float()
    with torch.no_grad():
        output_tensor = model(input_tensor)
    return output_tensor.numpy().flatten()

# Define a wrapper
class ModelScorer:
    def __init__(self, model_path, q, n):
        self.model = load_cnn_model(model_path, q, n)
        self.q = q
        self.n = n

    def predict(self, samples_numpy_array):
        scores = compute_cnn_scores(
            model=self.model,
            qary_numpy_array=samples_numpy_array,
            q=self.q,
            n=self.n
        )
        return np.array(scores)

model_path = "model"
scorer = ModelScorer(model_path=model_path, q=q, n=n)
explainer = shapzero.init(
    q=q,
    n=n,
    exp_dir=output_dir,
    model=scorer.predict  # pass the wrapper here
)
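The three examples above all consume $q$-ary arrays. The alternative interface described earlier, a function over a list of strings, might look like the toy scorer below. `gc_content` is a hypothetical function for illustration. It would presumably be passed as model=gc_content in the same way as the functions above, but how SHAP zero chooses between array and string inputs is not covered here.

```python
import numpy as np

def gc_content(seqs):
    """Score each DNA string by its fraction of G and C nucleotides."""
    return np.array([(seq.count('G') + seq.count('C')) / len(seq) for seq in seqs])

# One score per input sequence, returned as a 1D array of shape (num_samples,)
scores = gc_content(['ACTCTTGAGG', 'TATATCTGTG', 'GATGTATAGG'])
print(scores.shape)
```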

Citation

If you use shapzero and enjoy it, please consider citing our paper! SHAP zero was recently accepted into NeurIPS 2025, and we look forward to the great discussions!

@inproceedings{tsui2025shapzero,
  title={{SHAP} zero Explains Biological Sequence Models with Near-zero Marginal Cost for Future Queries},
  author={Tsui, Darin and Musharaf, Aryan and Erginbas, Yigit E. and Kang, Justin S. and Aghazadeh, Amirali},
  booktitle={Advances in Neural Information Processing Systems (Accepted)},
  year={2025}
}

Release history

0.0.4 (this version): Sep 27, 2025
0.0.3: Sep 26, 2025
0.0.2: Sep 20, 2025
0.0.1: Sep 20, 2025

Download files

Source Distribution

shapzero-0.0.4.tar.gz (66.2 kB)

Uploaded Sep 27, 2025 Source

Built Distribution

shapzero-0.0.4-py3-none-any.whl (68.4 kB)

Uploaded Sep 27, 2025 Python 3

File details

Details for the file shapzero-0.0.4.tar.gz.

File metadata

Download URL: shapzero-0.0.4.tar.gz
Upload date: Sep 27, 2025
Size: 66.2 kB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.2.0 CPython/3.10.0

File hashes

Hashes for shapzero-0.0.4.tar.gz
Algorithm	Hash digest
SHA256	`2f6995cf87d38619b0c834365a2685a2c765f88ae2809d381bad7e79fc599d6f`
MD5	`07c6a993abf90950b484481ee6d0b716`
BLAKE2b-256	`5445b1b42ad59a7b198d326f4bc7565d21897006e696bca1b4ca10469df4759f`

File details

Details for the file shapzero-0.0.4-py3-none-any.whl.

File metadata

Download URL: shapzero-0.0.4-py3-none-any.whl
Upload date: Sep 27, 2025
Size: 68.4 kB
Tags: Python 3
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.2.0 CPython/3.10.0

File hashes

Hashes for shapzero-0.0.4-py3-none-any.whl
Algorithm	Hash digest
SHA256	`ae2b437e495e3b4a8d365870960405d4ba14fb3c97d2cf628b1099abe1656af5`
MD5	`5922d2beab813cbb49684a5b2158ed2d`
BLAKE2b-256	`d62d74bd3f29dc561504c0fd885e94a9ffed84e716feb685a4b2e92e61a26061`
