LEMON: Explainable Entity Matching

These details have not been verified by PyPI

Project links

Project description

LEMON: Explainable Entity Matching

Illustration of LEMON

LEMON is an explainability method that addresses the unique challenges of explaining entity matching models.

Installation

pip install lemon-explain

pip install lemon-explain[storage]  # Save and load explanations
pip install lemon-explain[matchers] # To run matchers in lemon.utils
pip install lemon-explain[all]      # All dependencies

Usage

import lemon


# You need a matcher that follows this api:
def predict_proba(records_a, records_b, record_id_pairs):
    ... # predict probabilities / confidence scores
    return proba

exp = lemon.explain(records_a, records_b, record_id_pairs, predict_proba)

# exp can be visualized in a Jupyter notebook or saved to a json file
exp.save("explanation.json")

See the example notebook

Example of explanation from LEMON

Documentation

lemon.explain()

lemon.explain(
    records_a: pd.DataFrame,
    records_b: pd.DataFrame,
    record_id_pairs: pd.DataFrame,
    predict_proba: Callable,
    *,
    num_features: int = 5,
    dual_explanation: bool = True,
    estimate_potential: bool = True,
    granularity: str = "counterfactual",
    num_samples: int = None,
    token_representation: str = "record-bow",
    token_patterns: Union[str, List[str], Dict] = "[^ ]+",
    explain_attrs: bool = False,
    attribution_method: str = "lime",
    show_progress: bool = True,
    random_state: Union[int, np.random.Generator, None] = 0,
    return_dict: bool = None,
) -> Union[MatchingAttributionExplanation, Dict[any, MatchingAttributionExplanation]]:

Parameters

records_a : pd.DataFrame
- Records from data source a.
records_b : pd.DataFrame
- Records from data source b.
record_id_pairs : pd.DataFrame
- Which record pairs to explain. Must be a pd.DataFrame with columns "a.rid" and "b.rid" that reference the index of records_a and records_b respectively.
predict_proba : Callable
- Matcher function that predicts the probability of match. Must accept three arguments: records_a, records_b, and record_id_pairs. Should return array-like (list, np.ndarray, pd.Series, ...) of floats between 0 and 1 - the predicted probability that a record pair is a match - for all record pairs described by record_id_pairs in the same order.
num_features : int, default = 5
- The number of features to select for the explanation.
dual_explanation : bool, default = True
- Whether to use dual explanations or not.
estimate_potential : bool, default = True
- Whether to estimate potential or not.
granularity : {"tokens", "attributes", "counterfactual"}, default = "counterfactual"
- The granularity of the explanation. For more info on "counterfactual" granularity see our paper.
num_samples : int, default = None
- The number of neighborhood samples to use. If None a heuristic will automatically pick the number of samples.
token_representation : {"independent", "shared-bow", "record-bow"}, default = "record-bow"
- Which token representation to use.
  - independent: All tokens are unique.
  - shared-bow: Bag-of-words representation shared across both records
  - record-bow: Bag-of-words representation per individual record
token_patterns : str, List[str], or Dict, default = "[^ ]+"
- Regex patterns for valid tokens in strings. A single string will be interpreted as a regex pattern and all strings will be tokenized into non-overlapping matches of this pattern. You can specify a list of patterns to tokenize into non-overlapping matches of any pattern. For fine-grained control of how different parts of records are tokenized you can provide a dictionary with keys on the format ("a" or "b", attribute_name, "attr" or "val") and values that are lists of token regex patterns.
explain_attrs : bool, default = False
- Whether to explain attribution names or not. If True, predict_proba should accept the keyword argument attr_strings - a list that specifies what strings to use as attributes for each prediction. Each list element is on the format {("a" or "b", record_column_name): attr_string}.
attribution_method : {"lime", "shap"}, default = False
- Which underlying method to use contribution estimation. Note that in order to use shap estimate_potential must be False and the shap package must be installed.
show_progress : bool, default = True
- Whether to show progress or not. This is passed to predict_proba if it accepts this parameter.
return_dict : bool, default = None
- If True a dictionary of explanations will be returned where the keys are labels from the index of record_id_pairs. If False a single explanation will be returned (an exception is raised if len(record_id_pairs) > 1). If None it will return a single explanation if len(record_id_pairs) and a dictionary otherwise.

Returns

lemon.MatchingAttributionExplanation isntance or an Dict[any, lemon.MatchingAttributionExplanation], depending on the input to the return_dict parameter.

lemon.MatchingAttributionExplanation

Attributes

record_pair : pd.DataFrame
string_representation : Dict[Tuple, Union[None, str, TokenizedString]],
attributions : List[Attribution],
prediction_score : float
dual : bool
metadata : Dict[str, any]

Methods

save(path: str = None) -> Optional[Dict]
- Save the explanation to a json file. If path is not specified a json-serializable dictionary will be returned. Requires pyarrow to be installed (pip install lemon-explain[storage]).
static load(path: Union[str, Dict]) -> MatchingAttributionExplanation
- Load an explanation from a json file. Instead of a path, one can instead provide a json-serializable dictionary. Requires pyarrow to be installed (pip install lemon-explain[storage]).

lemon.Attribution

Attributes

weight: float
potential: Optional[float]
positions: List[Union[Tuple[str, str, str, Optional[int]]]]
name: Optional[str]

Project details

These details have not been verified by PyPI

Project links

Release history Release notifications | RSS feed

This version

0.1.1

Apr 6, 2022

0.1.0

Oct 1, 2021

Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

lemon-explain-0.1.1.tar.gz (25.5 kB view details)

Uploaded Apr 6, 2022 Source

Built Distribution

lemon_explain-0.1.1-py3-none-any.whl (26.4 kB view details)

Uploaded Apr 6, 2022 Python 3

File details

Details for the file lemon-explain-0.1.1.tar.gz.

File metadata

Download URL: lemon-explain-0.1.1.tar.gz
Upload date: Apr 6, 2022
Size: 25.5 kB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: poetry/1.1.13 CPython/3.8.10 Linux/5.11.0-43-generic

File hashes

Hashes for lemon-explain-0.1.1.tar.gz
Algorithm	Hash digest
SHA256	`d5f61d4b8c1f405f976e99915b1fc14a555e2160ebedc1722e2993cf47407079`
MD5	`4ca1c931b15b66cb5ce254426ea5c155`
BLAKE2b-256	`926014a8ba32b18d7e8769a3e2b0fb887bf037fde3010a97445bd19943bc6f69`

See more details on using hashes here.

File details

Details for the file lemon_explain-0.1.1-py3-none-any.whl.

File metadata

Download URL: lemon_explain-0.1.1-py3-none-any.whl
Upload date: Apr 6, 2022
Size: 26.4 kB
Tags: Python 3
Uploaded using Trusted Publishing? No
Uploaded via: poetry/1.1.13 CPython/3.8.10 Linux/5.11.0-43-generic

File hashes

Hashes for lemon_explain-0.1.1-py3-none-any.whl
Algorithm	Hash digest
SHA256	`2cf4a0425b572fd9a71b27670b1bd58aa0a0df21afb6bef1a3382193b96b5775`
MD5	`1f59f2087742998e1f338f2a67eccb0e`
BLAKE2b-256	`631431361a8f784e071aabbfd13515d828f2dab4a64c4a87d3459b3ebd327931`