A Python package for explaining text similarity

Explaining Similarity

A package for explaining and exploring semantic similarity through the eyes of text embedding models.

Overview of Repository / Table of Contents

Installation
Attributions
Space Shaping
- Idea
- Toy Example
Symbolic
- AMR parsing and multi-subgraph metric
FAQ
Citation

Installation

To obtain attributions for an off-the-shelf transformer

from xplain.attribution import ModelFactory
print(ModelFactory.show_options()) # shows available model names, use in build below
model = ModelFactory.build("huggingface_id") # e.g., sentence-transformers/all-mpnet-base-v2
texta = 'The dog runs after the kitten in the yard.'
textb = 'Outside in the garden the cat is chased by the dog.'
A, tokens_a, tokens_b = model.explain_similarity(texta, textb, move_to_cpu=True, sim_measure='cos')
# Optional: postprocess the attributions to discretize them and/or merge subtokens back into the original tokens.
A, tokens_a, tokens_b = model.postprocess_attributions(A, tokens_a, tokens_b, sparsification_method="FlowAlign")
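
For orientation, the sketch below shows one way to read the result. It assumes (this is an assumption, not something stated above) that A behaves like a matrix with one row per entry of tokens_a and one column per entry of tokens_b, so that each cell is the contribution of one token pair to the similarity. The numbers and the helper top_token_pairs are made up for illustration and are not part of the package.

```python
def top_token_pairs(A, tokens_a, tokens_b, k=3):
    """Return the k (score, token_a, token_b) triples with the largest scores."""
    pairs = [
        (A[i][j], tok_a, tok_b)
        for i, tok_a in enumerate(tokens_a)
        for j, tok_b in enumerate(tokens_b)
    ]
    pairs.sort(key=lambda p: p[0], reverse=True)
    return pairs[:k]

# Hypothetical attribution matrix: rows follow tokens_a, columns follow tokens_b
tokens_a = ["dog", "runs", "kitten"]
tokens_b = ["cat", "chased", "dog"]
A = [
    [0.03, 0.01, 0.35],
    [0.01, 0.15, 0.02],
    [0.25, 0.01, 0.02],
]

for score, tok_a, tok_b in top_token_pairs(A, tokens_a, tokens_b):
    print(f"{tok_a} <-> {tok_b}: {score:.2f}")
```

If the real return value is a tensor or array, converting it to nested lists first (for example with .tolist()) should let the same helper be reused.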

Space partitioning

Idea

The idea is as follows: you have a set of interpretable measures (my_metrics) and want them to be reflected in sub-embeddings (features) without disturbing the overall similarity too much.

from sentence_transformers import InputExample
from xplain.spaceshaping import PartitionedSentenceTransformer

# my_metrics: your own list of metric objects, each exposing .name and .score(x, y)
# we need some document pairs; they don't have to be paraphrases or otherwise similar, any documents will do
list_with_strings, other_list_with_strings = ["abc", ...], ["xyz", ...]
examples = []

# compute the training/partitioning target
for x, y in zip(list_with_strings, other_list_with_strings):
    similarities = []
    for metric in my_metrics:
        similarities.append(metric.score(x, y))
    examples.append(InputExample(texts=[x, y], label=similarities))

# instantiate model and train, here we use 16 dimensions to express each metric
pt = PartitionedSentenceTransformer(feature_names=[metric.name for metric in my_metrics], 
                                    feature_dims=[16]*len(my_metrics))
pt.train(examples)
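
To make the notion of sub-embeddings concrete, here is a minimal sketch in plain Python. It assumes that the features occupy contiguous slices at the start of the embedding vector and that whatever is left over acts as a residual. This layout and the helper names split_embedding and cosine are illustrative assumptions, not the package's internals.

```python
import math

def split_embedding(vec, feature_names, feature_dims):
    """Cut a flat embedding into named contiguous slices plus a residual."""
    parts, start = {}, 0
    for name, dim in zip(feature_names, feature_dims):
        parts[name] = vec[start:start + dim]
        start += dim
    parts["residual"] = vec[start:]
    return parts

def cosine(u, v):
    """Cosine similarity of two equally long vectors (0.0 if one is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

# Two hypothetical 6-dimensional embeddings: 2 dims per metric, 2 residual dims
emb_x = [1.0, 0.0, 0.0, 1.0, 0.5, 0.5]
emb_y = [1.0, 0.0, 1.0, 0.0, 0.5, 0.5]

px = split_embedding(emb_x, ["metric_a", "metric_b"], [2, 2])
py = split_embedding(emb_y, ["metric_a", "metric_b"], [2, 2])

# One similarity score per feature slice
scores = {name: cosine(px[name], py[name]) for name in px}
print(scores)
```

Under this reading, training pushes the similarity of each named slice toward the value of its metric, while the residual dimensions remain free to carry the rest of the overall similarity.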

Space Partitioning Example

Here's a very simple example for training and inferring with a custom model.

Needed: a training target, i.e., a list of numbers for every input text pair. These numbers can be fine-grained interpretable measurements and are then used to structure the embedding space. In this example, we would like to build a model that reflects superficial semantic similarity in one part of its embedding, similarity of named entities in another, and "deep" semantic similarity in the rest. Concretely, we partition the embedding into three features/parts:

- Bag-of-words: Learns to reflect bag-of-words distance
- Named entity similarity: Learns to reflect similarity of named entities
- (Not explicitly trained): Residual features for capturing the semantic similarity that makes up "the rest"

Note that this is only toy code and training happens on little data; however, the feature partitioning will already have some effect.

from scipy.stats import pearsonr
from xplain.spaceshaping import PartitionedSentenceTransformer
from sentence_transformers import InputExample
from datasets import load_dataset
import spacy
nlp=spacy.load("en_core_web_sm")

# let's first load a toy train dataset of sentence pairs
ds = load_dataset("mteb/stsbenchmark-sts")
some_pairs = list(zip([dic["sentence1"] for dic in ds["train"]], [dic["sentence2"] for dic in ds["train"]]))

# dev dataset of sentence pairs
some_pairs_dev = list(zip([dic["sentence1"] for dic in ds["validation"]], [dic["sentence2"] for dic in ds["validation"]]))

# let's build the target metrics that should be reflected within the embedding space
def bow_sim(x1, x2):
    # Jaccard overlap of the whitespace-token sets of two strings
    x1 = set(x1.split())
    x2 = set(x2.split())
    inter = x1.intersection(x2)
    union = x1.union(x2)
    return len(inter) / len(union)

def ner_sim(doc1, doc2):
    # same overlap, computed over the named entities of two spaCy docs
    x1_ner = " ".join([ne.text for ne in doc1.ents])
    x2_ner = " ".join([ne.text for ne in doc2.ents])
    if not x1_ner and not x2_ner:
        # neither text contains named entities: treat them as identical
        return 1.0
    return bow_sim(x1_ner, x2_ner)

docs1, docs2 = [nlp(x) for x, _ in some_pairs], [nlp(y) for _, y in some_pairs]
target = [[bow_sim(x1, x2), ner_sim(docs1[i], docs2[i])] for i, (x1, x2) in enumerate(some_pairs)]
some_examples = [InputExample(texts=[x1, x2], label=target[i]) for (i, (x1, x2)) in enumerate(some_pairs)]

docs1_dev, docs2_dev = [nlp(x) for x, _ in some_pairs_dev], [nlp(y) for _, y in some_pairs_dev]
target_dev = [[bow_sim(x1, x2), ner_sim(docs1_dev[i], docs2_dev[i])] for i, (x1, x2) in enumerate(some_pairs_dev)]
some_examples_dev = [InputExample(texts=[x1, x2], label=target_dev[i]) for (i, (x1, x2)) in enumerate(some_pairs_dev)]

# init model
pt = PartitionedSentenceTransformer(feature_names=["bow", "ner"], feature_dims=[32, 32])
explanations = pt.explain_similarity([x for x, y in some_pairs_dev], [y for x, y in some_pairs_dev])

# eval correlation to custom metric before training
print(pearsonr([x.label[0] for x in some_examples_dev], [dic["bow"] for dic in explanations]))
print(pearsonr([x.label[1] for x in some_examples_dev], [dic["ner"] for dic in explanations]))

# print a toy example before training
print(pt.explain_similarity(["The kitten drinks milk"], ["A cat slurps something"]))

# train
pt.train(some_examples, some_examples_dev)

# eval correlation to custom metric after training
explanations = pt.explain_similarity([x for x, y in some_pairs_dev], [y for x, y in some_pairs_dev])
print(pearsonr([x.label[0] for x in some_examples_dev], [dic["bow"] for dic in explanations]))
print(pearsonr([x.label[1] for x in some_examples_dev], [dic["ner"] for dic in explanations]))

# print a toy example after training
print(pt.explain_similarity(["The kitten drinks milk"], ["A cat slurps something"]))
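
As a sanity check on the training targets: the bag-of-words target computed by bow_sim above is the Jaccard overlap of the two whitespace-token sets. The snippet below restates it in standalone form and works through one pair by hand. The function name jaccard and the guard for two empty inputs are additions made for this sketch (the guard mirrors the choice made in ner_sim).

```python
def jaccard(text1, text2):
    """Jaccard overlap of the whitespace-separated token sets of two strings."""
    s1, s2 = set(text1.split()), set(text2.split())
    union = s1 | s2
    if not union:
        return 1.0  # assumption: two empty texts count as identical
    return len(s1 & s2) / len(union)

# Shared tokens: {"the", "milk"}; the union holds 6 distinct tokens, so the score is 2/6
score = jaccard("the kitten drinks milk", "the cat slurps milk")
print(round(score, 3))
```

Because the score only looks at surface tokens, "kitten" and "cat" contribute nothing here, which is exactly the kind of superficial similarity the bag-of-words feature is meant to isolate.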

Symbolic

AMR Parsing and Multi-Subgraph Metric

The approach consists roughly of two steps:

- Parse each input text into an Abstract Meaning Representation (AMR) graph
- Match these meaning graphs with graph similarity metrics, also with regard to aspectual subgraphs as elicited in AMR (e.g., Agent, Patient, Negation, ...)

from xplain.symbolic.model import AMRSimilarity
explainer = AMRSimilarity()
sents1 = ["Barack Obama holds a talk"]
sents2 = ["Hillary Clinton holds a talk"]
exp = explainer.explain_similarity(sents1, sents2)
print(exp)

This will print a JSON dictionary with aspectual graph matching scores. To also return the graphs and aspectual subgraphs, pass return_graphs=True to explain_similarity.
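
To illustrate what an aspectual score could mean, the toy sketch below represents each meaning graph as a set of (source, relation, target) triples and computes an F1 overlap, once over the full graph and once over the triples of a single relation. This is a deliberate simplification: real graph metrics also have to align variables between the two graphs. The hand-written triples, the mapping of :ARG0 to Agent and :ARG1 to Patient, and the helper names are all assumptions made for this illustration; none of it is output of or code from the package.

```python
def triple_f1(g1, g2):
    """F1 overlap between two sets of (source, relation, target) triples."""
    if not g1 and not g2:
        return 1.0
    matched = len(g1 & g2)
    if matched == 0:
        return 0.0
    precision, recall = matched / len(g1), matched / len(g2)
    return 2 * precision * recall / (precision + recall)

def aspect(graph, relation):
    """Keep only the triples that carry the given relation."""
    return {t for t in graph if t[1] == relation}

# Hand-written, heavily simplified graphs for the two example sentences
g_obama = {("hold-01", ":ARG0", "Barack_Obama"), ("hold-01", ":ARG1", "talk")}
g_clinton = {("hold-01", ":ARG0", "Hillary_Clinton"), ("hold-01", ":ARG1", "talk")}

overall = triple_f1(g_obama, g_clinton)
agent = triple_f1(aspect(g_obama, ":ARG0"), aspect(g_clinton, ":ARG0"))
patient = triple_f1(aspect(g_obama, ":ARG1"), aspect(g_clinton, ":ARG1"))
print({"overall": overall, "agent": agent, "patient": patient})
```

In this toy setting the two sentences agree on what is held but not on who holds it, so the per-aspect scores separate a difference that a single overall score would blur.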

FAQ

Citation

Release history

- 0.9.4: Mar 20, 2026
- 0.9.3: Mar 19, 2026
- 0.9.2: Mar 11, 2026
- 0.9.1: Mar 9, 2026
- 0.9: Mar 9, 2026
- 0.0.2: Mar 27, 2025
- 0.0.1: Mar 26, 2025 (this version)

Download files

Source Distribution

xplainsim-0.0.1.tar.gz (33.0 kB)

Uploaded Mar 26, 2025 Source

File details

Details for the file xplainsim-0.0.1.tar.gz.

File metadata

Download URL: xplainsim-0.0.1.tar.gz
Upload date: Mar 26, 2025
Size: 33.0 kB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: twine/4.0.2 CPython/3.8.8

File hashes

Hashes for xplainsim-0.0.1.tar.gz
Algorithm	Hash digest
SHA256	`5d3c25e8d23791fe85d9b7d819d8e4cb9dacbd49e52d84394da2b0a5fa9f9bf2`
MD5	`9a9f4504e448c556f50aa22abef7b9d2`
BLAKE2b-256	`cee656353763b64c309fc0d685bdaac392cab37387328a0e40cb2a73242f51d0`
