A Python package for explaining text similarity
Explaining Similarity
A package for explaining and exploring semantic similarity through the eyes of text embedding models.
Installation
Install the package with pip:
pip install xplainsim
To obtain attributions for an off-the-shelf transformer:
from xplain.attribution import ModelFactory
print(ModelFactory.show_options()) # shows available model names, use in build below
model = ModelFactory.build("huggingface_id") # e.g., sentence-transformers/all-mpnet-base-v2
texta = 'The dog runs after the kitten in the yard.'
textb = 'Outside in the garden the cat is chased by the dog.'
A, tokens_a, tokens_b = model.explain_similarity(texta, textb, move_to_cpu=True, sim_measure='cos')
# Optional: postprocess the attributions to discretize them and/or merge subtokens into original tokens.
A, tokens_a, tokens_b = model.postprocess_attributions(A, tokens_a, tokens_b, sparsification_method="FlowAlign")
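Once the attributions are computed, a natural next step is to list the token pairs that contribute most. The sketch below assumes that A can be indexed as A[i][j], with rows following tokens_a and columns following tokens_b. That layout is an assumption, so check the shape of A first and convert it to nested lists if needed. The helper top_token_pairs and the toy values are hypothetical and not part of xplainsim.

```python
def top_token_pairs(attributions, tokens_a, tokens_b, k=3):
    """Return the k (token_a, token_b, score) triples with the largest scores.

    Assumes attributions[i][j] is the score for tokens_a[i] and tokens_b[j].
    """
    scored = [
        (attributions[i][j], tok_a, tok_b)
        for i, tok_a in enumerate(tokens_a)
        for j, tok_b in enumerate(tokens_b)
    ]
    # Sort by score only, highest first.
    scored.sort(key=lambda item: item[0], reverse=True)
    return [(tok_a, tok_b, score) for score, tok_a, tok_b in scored[:k]]


# Hypothetical toy values, not real model output.
toy_A = [[0.9, 0.1],
         [0.2, 0.7]]
toy_pairs = top_token_pairs(toy_A, ["dog", "kitten"], ["dog", "cat"], k=2)
print(toy_pairs)
```

With real output, pass the postprocessed A together with tokens_a and tokens_b in place of the toy values.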
Space partitioning
Idea
The idea is as follows: you have a set of interpretable measures (my_metrics) and want them to be reflected in sub-embeddings (features) without disturbing the overall similarity too much.
from sentence_transformers import InputExample
from xplain.spaceshaping import PartitionedSentenceTransformer
# we need some document pairs; they need not be paraphrases or similar, any documents will do
list_with_strings, other_list_with_strings = ["abc", ...], ["xyz", ...]
examples = []
# compute the training/partitioning target
# my_metrics is your own list of metric objects, each with a .name and a .score(x, y) method
for x, y in zip(list_with_strings, other_list_with_strings):
    similarities = []
    for metric in my_metrics:
        similarities.append(metric.score(x, y))
    examples.append(InputExample(texts=[x, y], label=similarities))
# instantiate the model and train; here we use 16 dimensions to express each metric
pt = PartitionedSentenceTransformer(feature_names=[metric.name for metric in my_metrics],
                                    feature_dims=[16] * len(my_metrics))
pt.train(examples)
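To illustrate what "sub-embeddings" means here, the following sketch computes one cosine similarity per feature slice plus a residual and a global score. It assumes that the features occupy consecutive slices at the start of the embedding and that the remaining dimensions form the residual. This layout is an assumption for illustration only, and the helper names (cosine, partition_similarities) are hypothetical rather than part of the package.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def partition_similarities(emb_a, emb_b, feature_names, feature_dims):
    """Score each assumed feature slice separately, then the residual and the whole vector."""
    scores = {}
    start = 0
    for name, dim in zip(feature_names, feature_dims):
        scores[name] = cosine(emb_a[start:start + dim], emb_b[start:start + dim])
        start += dim
    # Dimensions not claimed by any feature form the residual part.
    scores["residual"] = cosine(emb_a[start:], emb_b[start:])
    scores["global"] = cosine(emb_a, emb_b)
    return scores


# Hypothetical 6-dimensional embeddings: 2 dims per feature, 2 residual dims.
toy_emb_a = [1, 0, 0, 1, 1, 1]
toy_emb_b = [1, 0, 1, 0, 1, 1]
toy_scores = partition_similarities(toy_emb_a, toy_emb_b, ["bow", "ner"], [2, 2])
print(toy_scores)
```

In the toy vectors the two texts agree on the first slice, disagree on the second, and agree on the residual, so the per-feature scores show where a moderate global similarity comes from.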
Space Partitioning Example
Here's a very simple example for training and inferring with a custom model.
Needed: a training target, i.e., a list of numbers for every input text pair. These numbers can be fine-grained interpretable measurements, and they are used to structure the embedding space. In this example, we build a model that reflects superficial semantic similarity in one part of its embedding, similarity of named entities in another, and "deep" semantic similarity in the rest. Concretely, we partition the embedding into three features/parts:
- Bag-of-words: Learns to reflect bag-of-words distance
- Named entity similarity: Learns to reflect similarity of named entities
- (Not explicitly trained): Residual features for capturing the semantic similarity that makes for "the rest"
Note that this is only toy code and the training uses little data; however, the feature partitioning will already have some effect.
from scipy.stats import pearsonr
from xplain.spaceshaping import PartitionedSentenceTransformer
from sentence_transformers import InputExample
from datasets import load_dataset
import spacy
# requires the spaCy model: python -m spacy download en_core_web_sm
nlp = spacy.load("en_core_web_sm")
# let's first load a toy train dataset of sentence pairs
ds = load_dataset("mteb/stsbenchmark-sts")
some_pairs = list(zip([dic["sentence1"] for dic in ds["train"]], [dic["sentence2"] for dic in ds["train"]]))
# dev dataset of sentence pairs
some_pairs_dev = list(zip([dic["sentence1"] for dic in ds["validation"]], [dic["sentence2"] for dic in ds["validation"]]))
# let's build the target metrics that should be reflected within the embedding space
def bow_sim(x1, x2):
    # Jaccard overlap of whitespace-separated tokens
    x1 = set(x1.split())
    x2 = set(x2.split())
    inter = x1.intersection(x2)
    union = x1.union(x2)
    if not union:
        # both texts are empty: treat them as identical and avoid dividing by zero
        return 1.0
    return len(inter) / len(union)

def ner_sim(doc1, doc2):
    # bag-of-words overlap of the named entities found by spaCy
    x1_ner = " ".join([ne.text for ne in doc1.ents])
    x2_ner = " ".join([ne.text for ne in doc2.ents])
    if not x1_ner and not x2_ner:
        return 1.0
    return bow_sim(x1_ner, x2_ner)
docs1, docs2 = [nlp(x) for x, _ in some_pairs], [nlp(y) for _, y in some_pairs]
target = [[bow_sim(x1, x2), ner_sim(docs1[i], docs2[i])] for i, (x1, x2) in enumerate(some_pairs)]
some_examples = [InputExample(texts=[x1, x2], label=target[i]) for (i, (x1, x2)) in enumerate(some_pairs)]
docs1_dev, docs2_dev = [nlp(x) for x, _ in some_pairs_dev], [nlp(y) for _, y in some_pairs_dev]
target_dev = [[bow_sim(x1, x2), ner_sim(docs1_dev[i], docs2_dev[i])] for i, (x1, x2) in enumerate(some_pairs_dev)]
some_examples_dev = [InputExample(texts=[x1, x2], label=target_dev[i]) for (i, (x1, x2)) in enumerate(some_pairs_dev)]
# init model
pt = PartitionedSentenceTransformer(feature_names=["bow", "ner"], feature_dims=[32, 32])
explanations = pt.explain_similarity([x for x, y in some_pairs_dev], [y for x, y in some_pairs_dev])
# eval correlation to custom metric before training
print(pearsonr([x.label[0] for x in some_examples_dev], [dic["bow"] for dic in explanations]))
print(pearsonr([x.label[1] for x in some_examples_dev], [dic["ner"] for dic in explanations]))
# print a toy example before training
print(pt.explain_similarity(["The kitten drinks milk"], ["A cat slurps something"]))
# train
pt.train(some_examples, some_examples_dev)
# eval correlation to custom metric after training
explanations = pt.explain_similarity([x for x, y in some_pairs_dev], [y for x, y in some_pairs_dev])
print(pearsonr([x.label[0] for x in some_examples_dev], [dic["bow"] for dic in explanations]))
print(pearsonr([x.label[1] for x in some_examples_dev], [dic["ner"] for dic in explanations]))
# print a toy example after training
print(pt.explain_similarity(["The kitten drinks milk"], ["A cat slurps something"]))
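To make the training target concrete, here is a worked example of the two-number label for a single sentence pair. It is a minimal sketch: token_jaccard is a hypothetical stand-in for the bag-of-words metric above, and the entity strings are written by hand as an assumption, whereas the example above extracts them with spaCy.

```python
def token_jaccard(text_a, text_b):
    """Jaccard overlap of whitespace-separated tokens; 1.0 if both texts are empty."""
    tokens_a, tokens_b = set(text_a.split()), set(text_b.split())
    union = tokens_a | tokens_b
    if not union:
        return 1.0
    return len(tokens_a & tokens_b) / len(union)


sentence_a = "Obama visits Berlin"
sentence_b = "Obama visits Paris"

# Bag-of-words target: 2 shared tokens (Obama, visits) out of 4 distinct tokens.
bow_target = token_jaccard(sentence_a, sentence_b)

# Named-entity target: entity strings are hand-written here (assumption),
# 1 shared token (Obama) out of 3 distinct tokens.
ner_target = token_jaccard("Obama Berlin", "Obama Paris")

# This list plays the role of the label of one training example.
label = [bow_target, ner_target]
print(label)
```

The order of the numbers in the label matters: it has to match the order of feature_names passed to the model.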
Symbolic
AMR Parsing and Multi-Subgraph Metric
The approach consists of roughly two steps:
- Parse each input text into an Abstract Meaning Representation (AMR) graph
- Match these meaning graphs with graph similarity metrics, both overall and on aspectual subgraphs as elicited in AMR (e.g., Agent, Patient, Negation, ...)
from xplain.symbolic.model import AMRSimilarity
explainer = AMRSimilarity()
sents1 = ["Barack Obama holds a talk"]
sents2 = ["Hillary Clinton holds a talk"]
exp = explainer.explain_similarity(sents1, sents2)
print(exp)
This prints a JSON dictionary with aspectual graph matching scores.
To also return the graphs and aspectual subgraphs, use return_graphs=True in explain_similarity.
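The idea of aspectual matching can be illustrated with a deliberately simplified sketch. It represents each graph as a hand-written set of (source, relation, target) triples and scores overlap with a plain Jaccard measure, once on the full graph and once per relation. This is a stand-in technique for illustration: the package's parser, its actual graph metrics, and its output format are not shown here, and all names and triples below are hypothetical.

```python
def triple_overlap(triples_a, triples_b):
    """Jaccard overlap between two sets of triples; 1.0 if both are empty."""
    union = triples_a | triples_b
    if not union:
        return 1.0
    return len(triples_a & triples_b) / len(union)


def aspect_subgraph(triples, relation):
    """Keep only the triples whose relation label matches the given aspect."""
    return {triple for triple in triples if triple[1] == relation}


# Hand-written, simplified graphs (assumption): real AMR graphs are richer.
graph_a = {("hold", ":ARG0", "Barack Obama"), ("hold", ":ARG1", "talk")}
graph_b = {("hold", ":ARG0", "Hillary Clinton"), ("hold", ":ARG1", "talk")}

amr_scores = {"global": triple_overlap(graph_a, graph_b)}
for aspect, relation in {"agent": ":ARG0", "patient": ":ARG1"}.items():
    amr_scores[aspect] = triple_overlap(
        aspect_subgraph(graph_a, relation),
        aspect_subgraph(graph_b, relation),
    )
print(amr_scores)
```

In this toy case the two sentences differ only in who holds the talk, so the agent aspect scores lowest while the patient aspect matches fully.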
FAQ
Citation
Source Distribution
File details
Details for the file xplainsim-0.0.2.tar.gz.
File metadata
- Download URL: xplainsim-0.0.2.tar.gz
- Upload date:
- Size: 33.1 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/4.0.2 CPython/3.8.8
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 1e4eca14e556b92c9b4d0e126cf69b6d08527f8ee1a247c90a8f36d055bb7dcb |
| MD5 | 2a8b623cdb249b4c6c7be6a84580e2f5 |
| BLAKE2b-256 | 35e432a1bb40719cb0be11987be1e82cc0644fb9743478aa1b510ba537a788e8 |