LEMON: Explainable Entity Matching

Illustration of LEMON

LEMON is an explainability method that addresses the unique challenges of explaining entity matching models.

Installation

pip install lemon-explain

or

pip install lemon-explain[storage]  # Save and load explanations
pip install lemon-explain[matchers] # To run matchers in lemon.utils
pip install lemon-explain[all]      # All dependencies

Usage

import lemon


# You need a matcher that follows this api:
def predict_proba(records_a, records_b, record_id_pairs):
    ... # predict probabilities / confidence scores
    return proba

exp = lemon.explain(records_a, records_b, record_id_pairs, predict_proba)

# exp can be visualized in a Jupyter notebook or saved to a json file
exp.save("explanation.json")
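As a rough illustration of the matcher API above, here is a minimal sketch of a toy predict_proba that scores each pair by the Jaccard similarity of its whitespace tokens. The example records, the "name" column, and the scoring rule are assumptions made for illustration only; any real matcher with the same three-argument signature can be used instead.

```python
import pandas as pd


def predict_proba(records_a, records_b, record_id_pairs):
    """Toy matcher (illustrative only): Jaccard similarity of lowercase tokens."""
    proba = []
    for a_id, b_id in zip(record_id_pairs["a.rid"], record_id_pairs["b.rid"]):
        # Concatenate all attribute values of each record into one token set.
        tokens_a = set(" ".join(records_a.loc[a_id].astype(str)).lower().split())
        tokens_b = set(" ".join(records_b.loc[b_id].astype(str)).lower().split())
        union = tokens_a | tokens_b
        proba.append(len(tokens_a & tokens_b) / len(union) if union else 0.0)
    return proba  # one score in [0, 1] per row of record_id_pairs, in order


# Hypothetical example data; the index values are the record ids.
records_a = pd.DataFrame({"name": ["apple iphone 12", "sony tv"]}, index=[0, 1])
records_b = pd.DataFrame({"name": ["iphone 12 apple", "lg monitor"]}, index=[10, 11])

# "a.rid" and "b.rid" reference the index of records_a and records_b.
record_id_pairs = pd.DataFrame({"a.rid": [0, 1], "b.rid": [10, 11]})

scores = predict_proba(records_a, records_b, record_id_pairs)
print(scores)

# With lemon installed, the same objects could then be passed on:
# exp = lemon.explain(records_a, records_b, record_id_pairs, predict_proba)
```

The scores are returned in the same order as the rows of record_id_pairs, which is what the API expects.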

See the example notebook

Example of explanation from LEMON

Documentation

lemon.explain()

lemon.explain(
    records_a: pd.DataFrame,
    records_b: pd.DataFrame,
    record_id_pairs: pd.DataFrame,
    predict_proba: Callable,
    *,
    num_features: int = 5,
    dual_explanation: bool = True,
    estimate_potential: bool = True,
    granularity: str = "counterfactual",
    num_samples: int = None,
    token_representation: str = "record-bow",
    token_patterns: Union[str, List[str], Dict] = "[^ ]+",
    explain_attrs: bool = False,
    attribution_method: str = "lime",
    show_progress: bool = True,
    random_state: Union[int, np.random.Generator, None] = 0,
    return_dict: bool = None,
) -> Union[MatchingAttributionExplanation, Dict[any, MatchingAttributionExplanation]]:

Parameters

  • records_a : pd.DataFrame
    • Records from data source a.
  • records_b : pd.DataFrame
    • Records from data source b.
  • record_id_pairs : pd.DataFrame
    • Which record pairs to explain. Must be a pd.DataFrame with columns "a.rid" and "b.rid" that reference the index of records_a and records_b respectively.
  • predict_proba : Callable
    • Matcher function that predicts the probability of match. Must accept three arguments: records_a, records_b, and record_id_pairs. Should return array-like (list, np.ndarray, pd.Series, ...) of floats between 0 and 1 - the predicted probability that a record pair is a match - for all record pairs described by record_id_pairs in the same order.
  • num_features : int, default = 5
    • The number of features to select for the explanation.
  • dual_explanation : bool, default = True
    • Whether to use dual explanations or not.
  • estimate_potential : bool, default = True
    • Whether to estimate potential or not.
  • granularity : {"tokens", "attributes", "counterfactual"}, default = "counterfactual"
    • The granularity of the explanation. For more info on "counterfactual" granularity see our paper.
  • num_samples : int, default = None
    • The number of neighborhood samples to use. If None a heuristic will automatically pick the number of samples.
  • token_representation : {"independent", "shared-bow", "record-bow"}, default = "record-bow"
    • Which token representation to use.
      • independent: All tokens are unique.
      • shared-bow: Bag-of-words representation shared across both records.
      • record-bow: Bag-of-words representation per individual record.
  • token_patterns : str, List[str], or Dict, default = "[^ ]+"
    • Regex patterns for valid tokens in strings. A single string is interpreted as a regex pattern, and all strings are tokenized into non-overlapping matches of this pattern. A list of patterns tokenizes strings into non-overlapping matches of any of the patterns. For fine-grained control of how different parts of records are tokenized, provide a dictionary with keys of the form ("a" or "b", attribute_name, "attr" or "val") and values that are lists of token regex patterns.
  • explain_attrs : bool, default = False
    • Whether to explain attribute names or not. If True, predict_proba should accept the keyword argument attr_strings - a list that specifies what strings to use as attributes for each prediction. Each list element is of the form {("a" or "b", record_column_name): attr_string}.
  • attribution_method : {"lime", "shap"}, default = "lime"
    • Which underlying method to use for contribution estimation. Note that in order to use shap, estimate_potential must be False and the shap package must be installed.
  • show_progress : bool, default = True
    • Whether to show progress or not. This is passed to predict_proba if it accepts this parameter.
  • return_dict : bool, default = None
    • If True, a dictionary of explanations is returned where the keys are labels from the index of record_id_pairs. If False, a single explanation is returned (an exception is raised if len(record_id_pairs) > 1). If None, a single explanation is returned if len(record_id_pairs) == 1 and a dictionary otherwise.
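
To make the token_patterns description more concrete, the sketch below shows how "non-overlapping matches of a pattern" behaves using Python's standard re module. This is not LEMON's internal tokenizer: the tokenize helper is hypothetical, and combining a list of patterns by alternation is an assumption about how "matches of any pattern" could be realized. The attribute name "title" in the dictionary example is also made up.

```python
import re
from typing import List, Union


def tokenize(text: str, patterns: Union[str, List[str]] = "[^ ]+") -> List[str]:
    """Hypothetical helper: split text into non-overlapping regex matches."""
    if isinstance(patterns, str):
        patterns = [patterns]
    # Assumption: a list of patterns is treated as "match any of them".
    combined = "|".join(f"(?:{p})" for p in patterns)
    return [m.group(0) for m in re.finditer(combined, text)]


# Default pattern: every run of non-space characters is one token.
default_tokens = tokenize("Sony DSC-W800 20.1MP")
print(default_tokens)  # ['Sony', 'DSC-W800', '20.1MP']

# Two patterns: decimal numbers and alphabetic words become separate tokens.
split_tokens = tokenize("Sony DSC-W800 20.1MP", [r"[0-9]+(?:\.[0-9]+)?", r"[A-Za-z]+"])
print(split_tokens)  # ['Sony', 'DSC', 'W', '800', '20.1', 'MP']

# Shape of a per-part dictionary, following the documented key format
# ("a" or "b", attribute_name, "attr" or "val"); "title" is a made-up attribute.
token_patterns_by_part = {
    ("a", "title", "val"): ["[^ ]+"],
    ("b", "title", "val"): ["[0-9]+", "[A-Za-z]+"],
}
```

Finer patterns yield more, smaller tokens, which changes the units that a token-level explanation can attribute weight to.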

Returns

A lemon.MatchingAttributionExplanation instance or a Dict[any, lemon.MatchingAttributionExplanation], depending on the return_dict parameter.
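
The return rule can be sketched in plain Python as follows. This is only an illustration of the documented behavior, not the library's code: select_return_value is a hypothetical helper, the explanations are stand-in strings, and raising for an empty input when return_dict is False is an assumption.

```python
from typing import Any, Dict, Optional


def select_return_value(explanations: Dict[Any, Any], return_dict: Optional[bool] = None):
    """Hypothetical helper mirroring the documented return_dict rule."""
    if return_dict is None:
        # Automatic mode: a single pair yields a single explanation.
        return_dict = len(explanations) != 1
    if return_dict:
        return explanations
    if len(explanations) != 1:
        # Documented for more than one pair; the empty case is an assumption.
        raise ValueError("return_dict=False requires exactly one record pair")
    return next(iter(explanations.values()))


single = select_return_value({"pair-0": "explanation-0"})
many = select_return_value({"pair-0": "explanation-0", "pair-1": "explanation-1"})
forced = select_return_value({"pair-0": "explanation-0"}, return_dict=True)
print(single, many, forced, sep="\n")
```

Code that explains a variable number of pairs may prefer return_dict=True so that the result always has the same type.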

lemon.MatchingAttributionExplanation

Attributes

  • record_pair : pd.DataFrame
  • string_representation : Dict[Tuple, Union[None, str, TokenizedString]]
  • attributions : List[Attribution]
  • prediction_score : float
  • dual : bool
  • metadata : Dict[str, any]

Methods

  • save(path: str = None) -> Optional[Dict]
    • Save the explanation to a json file. If path is not specified a json-serializable dictionary will be returned. Requires pyarrow to be installed (pip install lemon-explain[storage]).
  • static load(path: Union[str, Dict]) -> MatchingAttributionExplanation
    • Load an explanation from a json file. Instead of a path, a json-serializable dictionary can be provided. Requires pyarrow to be installed (pip install lemon-explain[storage]).

lemon.Attribution

Attributes

  • weight: float
  • potential: Optional[float]
  • positions: List[Union[Tuple[str, str, str, Optional[int]]]]
  • name: Optional[str]
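
As a small sketch of working with these attributes, the snippet below ranks attributions by the magnitude of their weight. AttributionStandIn is a hypothetical stand-in that only mirrors the attribute names listed above so that the example runs without lemon installed. The example values, and the reading of each position as (source, attribute name, "attr" or "val", optional token index), are assumptions.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class AttributionStandIn:
    """Hypothetical stand-in with the same attribute names as lemon.Attribution."""
    weight: float
    potential: Optional[float]
    positions: List[Tuple[str, str, str, Optional[int]]]
    name: Optional[str] = None


def rank_by_weight(attributions):
    """Order attributions from largest to smallest absolute weight."""
    return sorted(attributions, key=lambda a: abs(a.weight), reverse=True)


# Made-up values for illustration only.
attributions = [
    AttributionStandIn(0.10, None, [("a", "title", "val", 0)]),
    AttributionStandIn(-0.45, 0.30, [("b", "title", "val", 2)]),
    AttributionStandIn(0.25, None, [("a", "price", "val", None)]),
]

ranked = rank_by_weight(attributions)
for attribution in ranked:
    print(f"{attribution.weight:+.2f}", attribution.positions)
```

With a real explanation, the same ranking could presumably be applied to its attributions list, since only the weight attribute is used.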
