LEMON: Explainable Entity Matching

Illustration of LEMON

LEMON is an explainability method that addresses the unique challenges of explaining entity matching models.

Installation

pip install lemon-explain

or

pip install lemon-explain[storage]  # Save and load explanations
pip install lemon-explain[matchers] # To run matchers in lemon.utils
pip install lemon-explain[all]      # All dependencies

Usage

import lemon


# You need a matcher that follows this api:
def predict_proba(records_a, records_b, record_id_pairs):
    ... # predict probabilities / confidence scores
    return proba

exp = lemon.explain(records_a, records_b, record_id_pairs, predict_proba)

# exp can be visualized in a Jupyter notebook or saved to a json file
exp.save("explanation.json")
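
As a minimal sketch of a matcher that follows the API above, the block below scores pairs by Jaccard token overlap. The data and the names `_tokens` and `jaccard_predict_proba` are hypothetical and only illustrate the expected inputs and outputs; a real matcher would usually be a trained model.

```python
import pandas as pd

# Toy data: two sources with the same attributes (all values are made up).
records_a = pd.DataFrame(
    {"name": ["apple iphone 12", "dell xps 13"], "price": ["799", "999"]},
    index=[0, 1],
)
records_b = pd.DataFrame(
    {"name": ["iphone 12 by apple", "hp spectre"], "price": ["799", "1099"]},
    index=[10, 11],
)
# Pairs to score: "a.rid" and "b.rid" reference the indexes above.
record_id_pairs = pd.DataFrame({"a.rid": [0, 1], "b.rid": [10, 11]})


def _tokens(record: pd.Series) -> set:
    """Lowercased whitespace tokens of all non-missing attribute values."""
    text = " ".join(str(v) for v in record.values if pd.notna(v))
    return set(text.lower().split())


def jaccard_predict_proba(records_a, records_b, record_id_pairs):
    """Hypothetical matcher: Jaccard token overlap as a score in [0, 1]."""
    scores = []
    for a_id, b_id in zip(record_id_pairs["a.rid"], record_id_pairs["b.rid"]):
        ta, tb = _tokens(records_a.loc[a_id]), _tokens(records_b.loc[b_id])
        union = ta | tb
        scores.append(len(ta & tb) / len(union) if union else 0.0)
    return scores  # one score per pair, in the order of record_id_pairs


proba = jaccard_predict_proba(records_a, records_b, record_id_pairs)
# With lemon installed, the same objects could be passed on:
# exp = lemon.explain(records_a, records_b, record_id_pairs, jaccard_predict_proba)
```

The function returns a plain list, which is one of the array-like types the API accepts.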

See the example notebook

Example of explanation from LEMON

Documentation

lemon.explain()

lemon.explain(
    records_a: pd.DataFrame,
    records_b: pd.DataFrame,
    record_id_pairs: pd.DataFrame,
    predict_proba: Callable,
    *,
    num_features: int = 5,
    dual_explanation: bool = True,
    estimate_potential: bool = True,
    granularity: str = "counterfactual",
    num_samples: int = None,
    token_representation: str = "record-bow",
    token_patterns: Union[str, List[str], Dict] = "[^ ]+",
    explain_attrs: bool = False,
    attribution_method: str = "lime",
    show_progress: bool = True,
    random_state: Union[int, np.random.Generator, None] = 0,
    return_dict: bool = None,
) -> Union[MatchingAttributionExplanation, Dict[any, MatchingAttributionExplanation]]:

Parameters

  • records_a : pd.DataFrame
    • Records from data source a.
  • records_b : pd.DataFrame
    • Records from data source b.
  • record_id_pairs : pd.DataFrame
    • Which record pairs to explain. Must be a pd.DataFrame with columns "a.rid" and "b.rid" that reference the index of records_a and records_b respectively.
  • predict_proba : Callable
    • Matcher function that predicts the probability of a match. Must accept three arguments: records_a, records_b, and record_id_pairs. Should return an array-like (list, np.ndarray, pd.Series, ...) of floats between 0 and 1, one per record pair in record_id_pairs and in the same order, giving the predicted probability that the pair is a match.
  • num_features : int, default = 5
    • The number of features to select for the explanation.
  • dual_explanation : bool, default = True
    • Whether to use dual explanations or not.
  • estimate_potential : bool, default = True
    • Whether to estimate potential or not.
  • granularity : {"tokens", "attributes", "counterfactual"}, default = "counterfactual"
    • The granularity of the explanation. For more info on "counterfactual" granularity see our paper.
  • num_samples : int, default = None
    • The number of neighborhood samples to use. If None, a heuristic picks the number of samples automatically.
  • token_representation : {"independent", "shared-bow", "record-bow"}, default = "record-bow"
    • Which token representation to use.
      • independent: All tokens are unique.
      • shared-bow: Bag-of-words representation shared across both records
      • record-bow: Bag-of-words representation per individual record
  • token_patterns : str, List[str], or Dict, default = "[^ ]+"
    • Regex patterns for valid tokens in strings. A single string is interpreted as a regex pattern, and all strings are tokenized into non-overlapping matches of this pattern. A list of patterns tokenizes into non-overlapping matches of any of the patterns. For fine-grained control of how different parts of records are tokenized, provide a dictionary with keys in the format ("a" or "b", attribute_name, "attr" or "val") and values that are lists of token regex patterns.
  • explain_attrs : bool, default = False
    • Whether to explain attribute names or not. If True, predict_proba should accept the keyword argument attr_strings, a list that specifies what strings to use as attributes for each prediction. Each list element is in the format {("a" or "b", record_column_name): attr_string}.
  • attribution_method : {"lime", "shap"}, default = "lime"
    • Which underlying method to use for contribution estimation. Note that in order to use shap, estimate_potential must be False and the shap package must be installed.
  • show_progress : bool, default = True
    • Whether to show progress or not. This is passed to predict_proba if it accepts this parameter.
  • return_dict : bool, default = None
    • If True, a dictionary of explanations is returned, keyed by the labels in the index of record_id_pairs. If False, a single explanation is returned (an exception is raised if len(record_id_pairs) > 1). If None, a single explanation is returned if len(record_id_pairs) == 1 and a dictionary otherwise.
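
To make the token_patterns description concrete, here is a small sketch of tokenizing a string into non-overlapping regex matches using only Python's `re` module. The helper `tokenize` is hypothetical and is not LEMON's implementation. Treating a list of patterns as a regex alternation is an assumption about how "matches of any pattern" could be realized.

```python
import re
from typing import List, Union


def tokenize(text: str, token_patterns: Union[str, List[str]] = "[^ ]+") -> List[str]:
    """Split text into non-overlapping matches of one or more regex patterns."""
    if isinstance(token_patterns, str):
        token_patterns = [token_patterns]
    # Assumption: a list of patterns behaves like a regex alternation.
    combined = "|".join(f"(?:{p})" for p in token_patterns)
    return [m.group(0) for m in re.finditer(combined, text)]


# Default pattern: every run of non-space characters is one token.
default_tokens = tokenize("Apple iPhone-12, 64GB")
# Two patterns: runs of letters and runs of digits become separate tokens.
fine_tokens = tokenize("Apple iPhone-12, 64GB", [r"[A-Za-z]+", r"\d+"])
```

With the default pattern, punctuation stays attached to its token ("iPhone-12,"), while the two-pattern variant drops punctuation and splits "64GB" into "64" and "GB". Finer patterns give the explanation more, smaller units to attribute to.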

Returns

A lemon.MatchingAttributionExplanation instance or a Dict[any, lemon.MatchingAttributionExplanation], depending on the return_dict parameter.
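
The rule for which return type to expect can be sketched as a small standalone function. This mirrors the return_dict description and is not code from the library. The name `returns_dict` is hypothetical, the None case assumes a single explanation is returned exactly when there is one record pair, and the exception type is an assumption, since the description only says that an exception is raised.

```python
from typing import Optional


def returns_dict(return_dict: Optional[bool], num_pairs: int) -> bool:
    """Mirror the documented rule for whether a dictionary is returned."""
    if return_dict is None:
        # Automatic mode: a single explanation only for exactly one pair.
        return num_pairs != 1
    if not return_dict and num_pairs > 1:
        # ValueError is an assumption; the exception type is not documented.
        raise ValueError("return_dict=False requires a single record pair")
    return return_dict
```

When explaining a variable number of pairs, passing return_dict=True keeps the return type stable, so calling code does not need to branch on it.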

lemon.MatchingAttributionExplanation

Attributes

  • record_pair : pd.DataFrame
  • string_representation : Dict[Tuple, Union[None, str, TokenizedString]]
  • attributions : List[Attribution]
  • prediction_score : float
  • dual : bool
  • metadata : Dict[str, any]

Methods

  • save(path: str = None) -> Optional[Dict]
    • Save the explanation to a json file. If path is not specified a json-serializable dictionary will be returned. Requires pyarrow to be installed (pip install lemon-explain[storage]).
  • static load(path: Union[str, Dict]) -> MatchingAttributionExplanation
    • Load an explanation from a json file. Instead of a path, one can provide a json-serializable dictionary. Requires pyarrow to be installed (pip install lemon-explain[storage]).

lemon.Attribution

Attributes

  • weight: float
  • potential: Optional[float]
  • positions: List[Union[Tuple[str, str, str, Optional[int]]]]
  • name: Optional[str]
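
As a sketch of how a list of attributions might be inspected, the block below ranks them by absolute weight. `AttributionStandIn` is a stand-in that only copies the attribute names listed above and is not the library class. The example position tuples, such as ("a", "price", "val", 0), are assumed values for illustration.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class AttributionStandIn:
    """Stand-in with the documented attribute names; not the library class."""
    weight: float
    potential: Optional[float]
    positions: List[Tuple[str, str, str, Optional[int]]]
    name: Optional[str] = None


def rank_by_abs_weight(attributions, k=3):
    """Return the k attributions with the largest absolute weight."""
    return sorted(attributions, key=lambda a: abs(a.weight), reverse=True)[:k]


# Made-up attributions; weights and positions are illustrative only.
example = [
    AttributionStandIn(0.10, None, [("a", "price", "val", 0)]),
    AttributionStandIn(-0.45, 0.2, [("b", "name", "val", 1)]),
    AttributionStandIn(0.30, None, [("a", "name", "val", 2)]),
]
ranked = rank_by_abs_weight(example, k=2)
```

Because the helper only reads `.weight`, it should work the same way on the attributions attribute of an explanation, assuming those objects expose the attributes listed above.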
