shparkley

Scaling Shapley Value computation using Spark

These details have not been verified by PyPI

Project links

Homepage

Project description

Shparkley is a PySpark implementation of Shapley values that uses a Monte Carlo approximation algorithm.

Given a dataset and a machine learning model, Shparkley computes the Shapley value of every feature for a given feature vector. Shparkley also handles training weights and is model-agnostic.
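To make the Monte Carlo idea concrete, here is a minimal single-machine sketch of a permutation-sampling estimator for one feature. It is not Shparkley's implementation: the function name `monte_carlo_shapley`, its parameters, and the toy `linear_model` are all assumptions made for illustration, and training weights are left out.

```python
import random
from typing import Callable, Dict, Sequence

Row = Dict[str, float]


def monte_carlo_shapley(
    predict: Callable[[Row], float],
    instance: Row,
    background: Sequence[Row],
    feature: str,
    n_iter: int = 1000,
    seed: int = 0,
) -> float:
    """Estimate the Shapley value of `feature` for `instance` (illustrative only)."""
    rng = random.Random(seed)
    features = list(instance)
    total = 0.0
    for _ in range(n_iter):
        # Draw a random background row and a random feature ordering.
        z = rng.choice(background)
        order = features[:]
        rng.shuffle(order)
        preceding = set(order[: order.index(feature)])

        # Features that precede `feature` in the ordering come from the
        # instance; the remaining ones come from the background row.
        with_feature = {}
        for name in features:
            if name in preceding or name == feature:
                with_feature[name] = instance[name]
            else:
                with_feature[name] = z[name]

        # Same row, except `feature` itself takes the background value.
        without_feature = dict(with_feature)
        without_feature[feature] = z[feature]

        # The marginal contribution of `feature` for this draw.
        total += predict(with_feature) - predict(without_feature)
    return total / n_iter


def linear_model(row: Row) -> float:
    # Hypothetical toy model used only to exercise the sketch.
    return 2.0 * row["a"] + 3.0 * row["b"]


row_to_explain = {"a": 1.0, "b": 2.0}
background_rows = [{"a": 0.0, "b": 0.0}]
phi_a = monte_carlo_shapley(linear_model, row_to_explain, background_rows, "a")
phi_b = monte_carlo_shapley(linear_model, row_to_explain, background_rows, "b")
```

For a linear model and a single all-zero background row, each draw contributes exactly the coefficient times the feature value, so the estimate has no sampling noise. With a real model and dataset the estimate varies from run to run, and the variation shrinks as `n_iter` grows.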

Installation

pip install shparkley

Requirements

You must have Apache Spark installed on your machine/cluster.

Example Usage

from typing import Any, Dict, List, Set

import pandas as pd

from affirm.model_interpretation.shparkley.spark_shapley import (
    compute_shapley_for_sample,
    ShparkleyModel,
)

class MyShparkleyModel(ShparkleyModel):
    """
    You need to wrap your model with the ShparkleyModel interface.
    """
    def get_required_features(self):
        # type: () -> Set[str]
        """
        Needs to return a set of feature names for the model.
        """
        return {'feature-1', 'feature-2', 'feature-3'}

    def predict(self, feature_matrix):
        # type: (List[Dict[str, Any]]) -> List[float]
        """
        Wrapper function to convert the feature matrix into an acceptable format for your model.
        This function should return the predicted probabilities.
        The feature_matrix is a list of feature dictionaries.
        Each dictionary has a mapping from the feature name to the value.
        :return: Model predictions for all feature vectors
        """
        # Convert the feature matrix into an appropriate form for your model object.
        pd_df = pd.DataFrame.from_dict(feature_matrix)
        preds = self._model.my_predict(pd_df)
        return preds

row = dataset.filter(dataset.row_id == 'xxxx').rdd.first()
shparkley_wrapped_model = MyShparkleyModel(my_model)

# You need to sample your dataset based on convergence criteria.
# More samples result in more accurate Shapley values.
# Repartitioning and caching the sampled dataframe will speed up computation.
sampled_df = training_df.sample(0.1, True).repartition(75).cache()

shapley_scores_by_feature = compute_shapley_for_sample(
    df=sampled_df,
    model=shparkley_wrapped_model,
    row_to_investigate=row,
    weight_col_name='training_weight_column_name'
)
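The example does not say which convergence criterion to use. One simple option, sketched below, is to compute the scores at two sample sizes and check that no feature's score moved by more than a tolerance. The sketch assumes the result of `compute_shapley_for_sample` can be treated as a mapping from feature name to float. The helpers `max_abs_change` and `has_converged`, the tolerance, and the scores are all hypothetical.

```python
from typing import Mapping


def max_abs_change(previous: Mapping[str, float], current: Mapping[str, float]) -> float:
    """Largest absolute per-feature difference between two runs."""
    if set(previous) != set(current):
        raise ValueError("both runs must cover the same features")
    return max((abs(current[name] - previous[name]) for name in current), default=0.0)


def has_converged(
    previous: Mapping[str, float],
    current: Mapping[str, float],
    tolerance: float = 0.01,
) -> bool:
    """True when no feature's score changed by more than `tolerance`."""
    return max_abs_change(previous, current) <= tolerance


# Hypothetical scores from a smaller and a larger sample of the training data.
scores_small_sample = {"feature-1": 0.120, "feature-2": -0.045, "feature-3": 0.010}
scores_large_sample = {"feature-1": 0.118, "feature-2": -0.050, "feature-3": 0.012}

converged = has_converged(scores_small_sample, scores_large_sample, tolerance=0.01)
```

If the check fails, increase the sample fraction and compare again. Choose the tolerance relative to the scale of your model's predictions.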

Project details


Release history

1.0.1 (Nov 6, 2020)
1.0.0 (Nov 5, 2020)
0.0.4 (May 28, 2020, this version)
0.0.3 (May 21, 2020)
0.0.2 (May 21, 2020)
0.0.1 (May 21, 2020)

Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

shparkley-0.0.4.tar.gz (5.8 kB)

Uploaded May 28, 2020 Source

Built Distribution

shparkley-0.0.4-py3-none-any.whl (8.3 kB)

Uploaded May 28, 2020 Python 3

Hashes for shparkley-0.0.4.tar.gz

Algorithm	Hash digest
SHA256	`bacca63d9444d4c00c704d151c1ce458134ce85b43c5677bdba578b9fb305de6`
MD5	`148d7d3af4267c4fadf6f56cbdd572d8`
BLAKE2b-256	`9de562dcfdd8df832a8cba91088311336476be35a329242b9bba719e0cf998ef`

Hashes for shparkley-0.0.4-py3-none-any.whl

Algorithm	Hash digest
SHA256	`7170032835767befb954bb9afc59c15524eb66a977de45debf2f0b2e29f142fe`
MD5	`d5333ec49e22db7f35369a0671747ee5`
BLAKE2b-256	`591ec31e33ac581f59c59a685676bfe39e1a622202071ed071dcdde3e0e8e11a`
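To check a downloaded file against the SHA256 digests listed above, you can hash it locally with Python's standard `hashlib` module. This is a minimal sketch: the helper name `sha256_of_file` and the file path in the usage note are assumptions, so adjust the path to wherever you saved the file.

```python
import hashlib


def sha256_of_file(path: str, chunk_size: int = 65536) -> str:
    """Return the hex SHA256 digest of the file at `path`, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        # Read fixed-size chunks so large files do not need to fit in memory.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()
```

For example, `sha256_of_file("shparkley-0.0.4.tar.gz")` should equal the SHA256 value listed for the source distribution. If it does not, the file is not the one described here.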