Skip to main content

A Python package for explaining text similarity

Project description

Explaining Similarity

A package for explaining and exploring semantic similarity through the eyes of text embedding models.

Overview of Repository / Table of Contents

Installation

To obtain attributions for an off-the-shelf transformer

from xplain.attribution import ModelFactory
print(ModelFactory.show_options()) # shows available model names, use in build below
model = ModelFactory.build("huggingface_id") # e.g sentence-transformers/all-mpnet-base-v2
texta = 'The dog runs after the kitten in the yard.'
textb = 'Outside in the garden the cat is chased by the dog.'
A, tokens_a, tokens_b = model.explain_similarity(texta, textb, move_to_cpu=True, sim_measure='cos')
A, tokens_a, tokens_b = model.postprocess_attributions(A, tokens_a, tokens_b, sparsification_method="FlowAlign")
# Last line is optional, postprocess attributions to discretize and/or merge subtokens into original tokens.

Space partitioning

Idea

The idea is as follows: You have a bunch of interpreatble measures (my_metrics) and wish that these are reflected within sub-embeddings (features), while not disturbing the overall similarity too much.

from sentence_transformers import InputExample
from xplain.spaceshaping import PartitionedSentenceTransformer

# need some documents pairs, don't need to be paraphrases, or similar, just some documents
list_with_strings, other_list_with_strings = ["abc",....], ["xyz",...]
examples = []

# compute the training/partitioning target
for x, y in zip(list_with_strings, other_list_with_strings):
	similarities = []
	for metric in my_metrics:
		similarities.append(metric.score(x, y))
	examples.append(InputExample(texts=[x, y], label=similarities))

# instantiate model and train, here we use 16 dimensions to express each metric
pt = PartitionedSentenceTransformer(feature_names=[metric.name for metric in my_metrics], 
                                    feature_dims=[16]*len(my_metrics))
pt.train(examples)

Space Paritioning Example

Here's a very simple example for training and inferring with a custom model.

Needed: A training target. For every input text pair, a list with numbers. These numbers can be fine-grained interpretable measurements. They are then used to structure the embedding space. In this example, we would like to build a model that reflects superficial semantic similarity in one part of its embedding, similarity of named entities in another, and "deep" semantic similarity in the other. Concretely, we paritition the embedding into three features/parts

  1. Bag-of-words: Learns to reflect bag-of-words distance
  2. Named entity similarity: Learns to reflect similarity of named entities
  3. (Not explicitly trained): Residual features for capturing the semantic similarity that makes for "the rest"

Note that this is only a toy code, and the training happens on little data, however, the feature paritioning will already have some effect.

from scipy.stats import pearsonr
from xplain.spaceshaping import PartitionedSentenceTransformer
from sentence_transformers import InputExample
from datasets import load_dataset
import spacy
nlp=spacy.load("en_core_web_sm")

# let's first load a toy train dataset of sentence pairs
ds = load_dataset("mteb/stsbenchmark-sts")
some_pairs = list(zip([dic["sentence1"] for dic in ds["train"]], [dic["sentence2"] for dic in ds["train"]]))

# dev dataset of sentence pairs
some_pairs_dev = list(zip([dic["sentence1"] for dic in ds["validation"]], [dic["sentence2"] for dic in ds["validation"]]))

# let's build our target metrics that should be reflected within the embedding space,
def bow_sim(x1, x2):
	x1 = set(x1.split())
	x2 = set(x2.split())
	inter = x1.intersection(x2)
	union = x1.union(x2)
	return len(inter) / len(union)

def ner_sim(doc1, doc2):
	x1_ner = " ".join([ne.text for ne in doc1.ents])
	x2_ner = " ".join([ne.text for ne in doc2.ents])
	if not x1_ner and not x2_ner:
		return 1.0
	return bow_sim(x1_ner, x2_ner)

docs1, docs2 = [nlp(x) for x, _ in some_pairs], [nlp(y) for _, y in some_pairs]
target = [[bow_sim(x1, x2), ner_sim(docs1[i], docs2[i])] for i, (x1, x2) in enumerate(some_pairs)]
some_examples = [InputExample(texts=[x1, x2], label=target[i]) for (i, (x1, x2)) in enumerate(some_pairs)]

docs1_dev, docs2_dev = [nlp(x) for x, _ in some_pairs_dev], [nlp(y) for _, y in some_pairs_dev]
target_dev = [[bow_sim(x1, x2), ner_sim(docs1_dev[i], docs2_dev[i])] for i, (x1, x2) in enumerate(some_pairs_dev)]
some_examples_dev = [InputExample(texts=[x1, x2], label=target_dev[i]) for (i, (x1, x2)) in enumerate(some_pairs_dev)]

# init model
pt = PartitionedSentenceTransformer(feature_names=["bow", "ner"], feature_dims=[32, 32])
json = pt.explain_similarity([x for x, y in some_pairs_dev], [y for x, y in some_pairs_dev])

# eval correlation to custom metric before training
print(pearsonr([x.label[0] for x in some_examples_dev], [dic["bow"] for dic in json]))
print(pearsonr([x.label[1] for x in some_examples_dev], [dic["ner"] for dic in json]))

# print a toy example before training
print(pt.explain_similarity(["The kitten drinks milk"], ["A cat slurps something"]))

# train
pt.train(some_examples, some_examples_dev)

# eval correlation to custom metric after train
json = pt.explain_similarity([x for x, y in some_pairs_dev], [y for x, y in some_pairs_dev])
print(pearsonr([x.label[0] for x in some_examples_dev], [dic["bow"] for dic in json]))
print(pearsonr([x.label[1] for x in some_examples_dev], [dic["ner"] for dic in json]))

# print a toy example after training
print(pt.explain_similarity(["The kitten drinks milk"], ["A cat slurps something"]))

Symbolic

AMR Parsing and Multi-Subgraph Metric

The approch consists roughly in two steps:

  1. Parse each input text to an Abstract Meaning Representation Graph
  2. Match those Meaning Graphs with Graph Similarity Metrics, also with regard to aspectual subgraphs as elicited in AMR (e.g., Agent, Patient, Negation,...)
from xplain.symbolic.model import AMRSimilarity
explainer = AMRSimilarity()
sents1 = ["Barack Obama holds a talk"]
sents2 = ["Hillary Clinton holds a talk"]
exp = explainer.explain_similarity(sents1, sents2)
print(exp)

This will print a json dictionary with aspectual graph matching scores. To also return the graphs and aspectual subgraphs, use return_graphs=True in explain_similarity.

FAQ

Citation

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

xplainsim-0.0.1.tar.gz (33.0 kB view details)

Uploaded Source

File details

Details for the file xplainsim-0.0.1.tar.gz.

File metadata

  • Download URL: xplainsim-0.0.1.tar.gz
  • Upload date:
  • Size: 33.0 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/4.0.2 CPython/3.8.8

File hashes

Hashes for xplainsim-0.0.1.tar.gz
Algorithm Hash digest
SHA256 5d3c25e8d23791fe85d9b7d819d8e4cb9dacbd49e52d84394da2b0a5fa9f9bf2
MD5 9a9f4504e448c556f50aa22abef7b9d2
BLAKE2b-256 cee656353763b64c309fc0d685bdaac392cab37387328a0e40cb2a73242f51d0

See more details on using hashes here.

Supported by

AWS Cloud computing and Security Sponsor Datadog Monitoring Depot Continuous Integration Fastly CDN Google Download Analytics Pingdom Monitoring Sentry Error logging StatusPage Status page