LEMON: Explainable Entity Matching
Project description
LEMON: Explainable Entity Matching
LEMON is an explainability method that addresses the unique challenges of explaining entity matching models.
Installation
pip install lemon-explain
or
pip install lemon-explain[storage] # Save and load explanations
pip install lemon-explain[matchers] # To run matchers in lemon.utils
pip install lemon-explain[all] # All dependencies
Usage
import lemon
# You need a matcher that follows this api:
def predict_proba(records_a, records_b, record_id_pairs):
... # predict probabilities / confidence scores
return proba
exp = lemon.explain(records_a, records_b, record_id_pairs, predict_proba)
# exp can be visualized in a Jupyter notebook or saved to a json file
exp.save("explanation.json")
Documentation
lemon.explain()
lemon.explain(
records_a: pd.DataFrame,
records_b: pd.DataFrame,
record_id_pairs: pd.DataFrame,
predict_proba: Callable,
*,
num_features: int = 5,
dual_explanation: bool = True,
estimate_potential: bool = True,
granularity: str = "counterfactual",
num_samples: int = None,
token_representation: str = "record-bow",
token_patterns: Union[str, List[str], Dict] = "[^ ]+",
explain_attrs: bool = False,
attribution_method: str = "lime",
show_progress: bool = True,
random_state: Union[int, np.random.Generator, None] = 0,
return_dict: bool = None,
) -> Union[MatchingAttributionExplanation, Dict[any, MatchingAttributionExplanation]]:
Parameters
- records_a : pd.DataFrame
- Records from data source a.
- records_b : pd.DataFrame
- Records from data source b.
- record_id_pairs : pd.DataFrame
- Which record pairs to explain.
Must be a pd.DataFrame with columns
"a.rid"
and"b.rid"
that reference the index ofrecords_a
andrecords_b
respectively.
- Which record pairs to explain.
Must be a pd.DataFrame with columns
- predict_proba : Callable
- Matcher function that predicts the probability of match.
Must accept three arguments:
records_a
,records_b
, andrecord_id_pairs
. Should return array-like (list, np.ndarray, pd.Series, ...) of floats between 0 and 1 - the predicted probability that a record pair is a match - for all record pairs described byrecord_id_pairs
in the same order.
- Matcher function that predicts the probability of match.
Must accept three arguments:
- num_features : int, default = 5
- The number of features to select for the explanation.
- dual_explanation : bool, default = True
- Whether to use dual explanations or not.
- estimate_potential : bool, default = True
- Whether to estimate potential or not.
- granularity : {"tokens", "attributes", "counterfactual"}, default = "counterfactual"
- The granularity of the explanation.
For more info on
"counterfactual"
granularity see our paper.
- The granularity of the explanation.
For more info on
- num_samples : int, default = None
- The number of neighborhood samples to use. If None a heuristic will automatically pick the number of samples.
- token_representation : {"independent", "shared-bow", "record-bow"}, default = "record-bow"
- Which token representation to use.
- independent: All tokens are unique.
- shared-bow: Bag-of-words representation shared across both records
- record-bow: Bag-of-words representation per individual record
- Which token representation to use.
- token_patterns : str, List[str], or Dict, default =
"[^ ]+"
- Regex patterns for valid tokens in strings.
A single string will be interpreted as a regex pattern and all strings will be tokenized into non-overlapping matches of this pattern.
You can specify a list of patterns to tokenize into non-overlapping matches of any pattern.
For fine-grained control of how different parts of records are tokenized you can provide a dictionary with keys on the format
("a" or "b", attribute_name, "attr" or "val")
and values that are lists of token regex patterns.
- Regex patterns for valid tokens in strings.
A single string will be interpreted as a regex pattern and all strings will be tokenized into non-overlapping matches of this pattern.
You can specify a list of patterns to tokenize into non-overlapping matches of any pattern.
For fine-grained control of how different parts of records are tokenized you can provide a dictionary with keys on the format
- explain_attrs : bool, default = False
- Whether to explain attribution names or not.
If True,
predict_proba
should accept the keyword argumentattr_strings
- a list that specifies what strings to use as attributes for each prediction. Each list element is on the format {("a" or "b", record_column_name): attr_string}.
- Whether to explain attribution names or not.
If True,
- attribution_method : {"lime", "shap"}, default = False
- Which underlying method to use contribution estimation.
Note that in order to use shap
estimate_potential
must be False and the shap package must be installed.
- Which underlying method to use contribution estimation.
Note that in order to use shap
- show_progress : bool, default = True
- Whether to show progress or not. This is passed to
predict_proba
if it accepts this parameter.
- Whether to show progress or not. This is passed to
- return_dict : bool, default = None
- If True a dictionary of explanations will be returned where the keys are labels from the index of
record_id_pairs
. If False a single explanation will be returned (an exception is raised iflen(record_id_pairs) > 1
). If None it will return a single explanation iflen(record_id_pairs)
and a dictionary otherwise.
- If True a dictionary of explanations will be returned where the keys are labels from the index of
Returns
lemon.MatchingAttributionExplanation
isntance or an Dict[any, lemon.MatchingAttributionExplanation]
,
depending on the input to the return_dict
parameter.
lemon.MatchingAttributionExplanation
Attributes
- record_pair : pd.DataFrame
- string_representation : Dict[Tuple, Union[None, str, TokenizedString]],
- attributions : List[Attribution],
- prediction_score : float
- dual : bool
- metadata : Dict[str, any]
Methods
- save(path: str = None) -> Optional[Dict]
- Save the explanation to a json file.
If path is not specified a json-serializable dictionary will be returned.
Requires pyarrow to be installed (
pip install lemon-explain[storage]
).
- Save the explanation to a json file.
If path is not specified a json-serializable dictionary will be returned.
Requires pyarrow to be installed (
- static load(path: Union[str, Dict]) -> MatchingAttributionExplanation
- Load an explanation from a json file.
Instead of a path, one can instead provide a json-serializable dictionary.
Requires pyarrow to be installed (
pip install lemon-explain[storage]
).
- Load an explanation from a json file.
Instead of a path, one can instead provide a json-serializable dictionary.
Requires pyarrow to be installed (
lemon.Attribution
Attributes
- weight: float
- potential: Optional[float]
- positions: List[Union[Tuple[str, str, str, Optional[int]]]]
- name: Optional[str]
Project details
Release history Release notifications | RSS feed
Download files
Download the file for your platform. If you're not sure which to choose, learn more about installing packages.
Source Distribution
lemon-explain-0.1.0.tar.gz
(25.5 kB
view details)
Built Distribution
File details
Details for the file lemon-explain-0.1.0.tar.gz
.
File metadata
- Download URL: lemon-explain-0.1.0.tar.gz
- Upload date:
- Size: 25.5 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: poetry/1.1.8 CPython/3.8.10 Linux/5.11.0-27-generic
File hashes
Algorithm | Hash digest | |
---|---|---|
SHA256 | 66f5e58cd04420ca68cd8751ccc3f0b79ba81000181b3ce1f0a7a3f49f09094d |
|
MD5 | c95ca9ca95bd7b67ec6e24e316e0fe96 |
|
BLAKE2b-256 | 3b6bc0783c38c4f9ec7f9115f3d4d5ec22489c1395238499156407c1b5be7976 |
File details
Details for the file lemon_explain-0.1.0-py3-none-any.whl
.
File metadata
- Download URL: lemon_explain-0.1.0-py3-none-any.whl
- Upload date:
- Size: 26.3 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: poetry/1.1.8 CPython/3.8.10 Linux/5.11.0-27-generic
File hashes
Algorithm | Hash digest | |
---|---|---|
SHA256 | 97b6ecaedeafd485131027d9c113e6e09e738eb442b8cc7be6fdb03915e0fe6d |
|
MD5 | 0d35e7a17342884a264b18af7060f197 |
|
BLAKE2b-256 | 390fce3d5f08700f678bc2bb30e51f733419446696ad3c1485fc3d3527c550ea |