A Python package for explaining text similarity
XPLAINSIM: A Toolkit for Explaining Text Similarity
A research toolkit for decomposing and explaining text similarity across neural, structured, and symbolic levels. It's designed for interpretability research, controlled embedding and metric alignment, and hybrid neural-symbolic text analysis.
The toolkit is modular: each explanation paradigm can be used independently or combined in hybrid setups.
Conceptual Overview
XPLAINSIM currently provides three complementary explanation paradigms:
| Module | Explanation Level | What it Does |
|---|---|---|
| Attribution | Token level | Explain which tokens drive similarity |
| SpaceShaping | Embedding space | Shape features to encode custom aspects |
| Symbolic | Graph level | Explain which semantic roles/aspects align |
Installation
You can install via pip:

```shell
pip install xplainsim
```

That's it. Only the Symbolic module with the default parser requires a small extra installation step (see the Symbolic section below).
Attributions
Idea
Token-level attribution decomposes embedding similarity into fine-grained token interactions between two texts.
Given a neural embedding model and two texts we trace the similarity back to interactions of individual input tokens.
The explanation is a matrix over the tokens from each input (the sum of this matrix approximates the similarity of the embeddings).
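The decomposition intuition can be seen in a toy setting: with mean pooling and a plain dot product, the similarity of two pooled embeddings is exactly the sum of a matrix of scaled token-pair dot products. The library's attributions generalize this to real transformer encoders and cosine similarity; this sketch is only for intuition:

```python
import numpy as np

rng = np.random.default_rng(0)
Ta = rng.normal(size=(4, 8))  # toy token embeddings for text A (4 tokens)
Tb = rng.normal(size=(5, 8))  # toy token embeddings for text B (5 tokens)

# mean-pooled "sentence embeddings"
a, b = Ta.mean(axis=0), Tb.mean(axis=0)

# token-pair contribution matrix: A[i, j] is the scaled dot product of token i and token j
A = Ta @ Tb.T / (Ta.shape[0] * Tb.shape[0])

print(np.isclose(A.sum(), a @ b))  # True: the matrix sums to the pooled similarity
```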
Example
Show Currently Available Models
```python
from xplain.attribution import ModelFactory

print(ModelFactory.show_options())  # shows available model names; use one in build() below
```
Compute Attributions
```python
from xplain.attribution import ModelFactory

model = ModelFactory.build("sentence-transformers/all-mpnet-base-v2")  # see ModelFactory.show_options() for others
texta = 'The dog runs after the kitten in the yard.'
textb = 'Outside in the garden the cat is chased by the dog.'
A, tokens_a, tokens_b = model.explain_similarity(texta, textb, move_to_cpu=True, sim_measure='cosine')
```
Example output structure:
- `A`: token-level contribution matrix
- `tokens_a`: token list for text A
- `tokens_b`: token list for text B
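Once you have `A`, you can inspect which token pairs contribute most to the similarity. A minimal sketch using a toy matrix and token lists in place of real model output (the real `A` may be a torch tensor, hence the `np.asarray` conversion):

```python
import numpy as np

# stand-ins for real explain_similarity output
A = np.array([[0.05, 0.40, 0.01],
              [0.30, 0.02, 0.03]])
tokens_a = ["dog", "runs"]
tokens_b = ["hund", "rennt", "im"]

A = np.asarray(A)  # real attributions may be a torch tensor; convert first
top_pairs = sorted(((tokens_a[i], tokens_b[j], float(A[i, j]))
                    for i in range(A.shape[0]) for j in range(A.shape[1])),
                   key=lambda t: t[2], reverse=True)
print(top_pairs[:3])  # strongest token interactions first
```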
Expansion: Token Alignment
```python
# same as above, then
A, tokens_a, tokens_b = model.postprocess_attributions(A, tokens_a, tokens_b, sparsification_method="FlowAlign")
```
Expansion: Cross-Linguality
```python
from xplain.attribution import ModelFactory

model = ModelFactory.build("Alibaba-NLP/gte-multilingual-base")  # see ModelFactory.show_options() for others
texta = 'The dog runs after the kitten in the yard.'
textb = 'Im Garten rennt der Hund der Katze hinterher.'
A, tokens_a, tokens_b = model.explain_similarity(texta, textb, move_to_cpu=True, sim_measure='cosine')
```
Space Shaping
Idea
Space Shaping enforces interpretable structure inside embedding spaces.
Instead of learning a monolithic embedding, the vector is partitioned into dedicated subspaces, each trained to reflect a predefined interpretable metric (e.g., bag-of-words overlap, named entity similarity, sentiment, etc.).
This enables:
- Controllable similarity decomposition
- Feature-aligned embeddings
- Hybrid symbolic–neural objectives
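Conceptually, each subspace of a partitioned embedding can be compared on its own. The sketch below only illustrates the idea of slicing a vector into named subspaces and scoring each separately; it is not the library's internal implementation:

```python
import numpy as np

def sliced_cosines(u, v, dims):
    """Cosine similarity per consecutive subspace; dims like [32, 32, ...]."""
    out, start = [], 0
    for d in dims:
        a, b = u[start:start + d], v[start:start + d]
        out.append(float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b))))
        start += d
    return out

u = np.array([1.0, 0.0, 0.0, 1.0])
v = np.array([1.0, 0.0, 1.0, 1.0])
print(sliced_cosines(u, v, [2, 2]))  # one similarity score per subspace
```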
```python
from sentence_transformers import InputExample
from xplain.spaceshaping import PartitionedSentenceTransformer

examples = []
# compute the training/partitioning target
for x, y in zip(list_with_strings, other_list_with_strings):
    similarities = []
    # metrics/aspects that should be reflected in the embedding space
    for metric in my_metrics:
        similarities.append(metric.score(x, y))
    examples.append(InputExample(texts=[x, y], label=similarities))

# instantiate model and train
pt = PartitionedSentenceTransformer(feature_names, feature_dims)
pt.train_model(examples)
```
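The loop above assumes each entry of `my_metrics` exposes a `score(x, y)` method returning a similarity in [0, 1]. A minimal example of such a metric (a toy Jaccard word-overlap metric; any object with this interface would do):

```python
class JaccardMetric:
    """Toy metric with the score(x, y) interface assumed above."""

    def score(self, x, y):
        a, b = set(x.lower().split()), set(y.lower().split())
        if not a and not b:
            return 1.0
        return len(a & b) / len(a | b)

my_metrics = [JaccardMetric()]
print(my_metrics[0].score("the cat sleeps", "the cat runs"))  # 0.5 (2 shared of 4 words)
```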
Space Partitioning Example
Here's a very simple example for training and inferring with a custom model.
Concretely, we partition the embedding into three features/parts
- Bag-of-words: Learns to reflect bag-of-words distance
- Named entity similarity: Learns to reflect similarity of named entities
- Residual (not explicitly trained): features that capture the remaining semantic similarity, i.e. "the rest"
Note that this is only toy code and training uses little data; even so, the feature partitioning will already have some effect.
```python
from scipy.stats import pearsonr
from xplain.spaceshaping import PartitionedSentenceTransformer
from sentence_transformers import InputExample
from datasets import load_dataset

# We will later use this to create a custom "Named Entity" metric
import spacy
nlp = spacy.load("en_core_web_sm")

# let's first load a toy train dataset of sentence pairs
ds = load_dataset("mteb/stsbenchmark-sts")
some_pairs = list(zip([dic["sentence1"] for dic in ds["train"]], [dic["sentence2"] for dic in ds["train"]]))

# dev dataset of sentence pairs
some_pairs_dev = list(zip([dic["sentence1"] for dic in ds["validation"]], [dic["sentence2"] for dic in ds["validation"]]))

# let's build our target metrics that should be reflected within the embedding space
def bow_sim(x1, x2):
    x1, x2 = set(x1.split()), set(x2.split())
    inter, union = x1.intersection(x2), x1.union(x2)
    return len(inter) / len(union)

def ner_sim(doc1, doc2):
    x1_ner = " ".join([ne.text for ne in doc1.ents])
    x2_ner = " ".join([ne.text for ne in doc2.ents])
    if not x1_ner and not x2_ner:
        return 1.0
    return bow_sim(x1_ner, x2_ner)

# we create training examples
docs1, docs2 = [nlp(x) for x, _ in some_pairs], [nlp(y) for _, y in some_pairs]
target = [[bow_sim(x1, x2), ner_sim(docs1[i], docs2[i])] for i, (x1, x2) in enumerate(some_pairs)]
some_examples = [InputExample(texts=[x1, x2], label=target[i]) for i, (x1, x2) in enumerate(some_pairs)]

# some development examples
docs1_dev, docs2_dev = [nlp(x) for x, _ in some_pairs_dev], [nlp(y) for _, y in some_pairs_dev]
target_dev = [[bow_sim(x1, x2), ner_sim(docs1_dev[i], docs2_dev[i])] for i, (x1, x2) in enumerate(some_pairs_dev)]
some_examples_dev = [InputExample(texts=[x1, x2], label=target_dev[i]) for i, (x1, x2) in enumerate(some_pairs_dev)]

# initialize model
pt = PartitionedSentenceTransformer(feature_names=["bow", "ner"], feature_dims=[32, 32])

# explanations can be computed before training, but they are meaningless; we only keep them for later comparison
decomposed_predictions = pt.explain_similarity([x for x, y in some_pairs_dev], [y for x, y in some_pairs_dev])

def feature_correlation(feature_name, preds):
    return pearsonr([dic[feature_name] for dic in preds],
                    [ex.label[pt.feature_names.index(feature_name)] for ex in some_examples_dev])[0]

pearsonr_before_training = [feature_correlation(name, decomposed_predictions) for name in pt.feature_names]

# print a toy example before training
print("Text before training:", pt.explain_similarity(["The kitten drinks milk"], ["A cat slurps something"]))

# train
pt.train_model(some_examples, some_examples_dev)

# evaluate correlation to the custom metrics after training
decomposed_predictions = pt.explain_similarity([x for x, y in some_pairs_dev], [y for x, y in some_pairs_dev])
pearsonr_after_training = [feature_correlation(name, decomposed_predictions) for name in pt.feature_names]
for index, pr in enumerate(pearsonr_after_training):
    print(f"Correlation delta for {pt.feature_names[index]}: {pr - pearsonr_before_training[index]}")

# print a toy example after training
print("Text after training:", pt.explain_similarity(["The kitten drinks milk"], ["A cat slurps something"]))
```
Symbolic
Idea
Unlike pure neural similarity, this approach decomposes similarity along semantic roles (Agent, Patient, Negation, etc.), enabling aspect-level semantic comparison.
This is based on comparing AMR graphs of texts. Abstract Meaning Representation (AMR) encodes sentence meaning as a graph of concepts and semantic roles.
Using the Symbolic module with the default parser requires a few extra installation steps:

```shell
pip install amrlib
xplain-install-amr
pip install "transformers[torch]==4.49.0"
```

The last line ensures that an older transformers version (transformers<5) is installed, as the default AMR parser is not yet compatible with version 5.
Example
Explaining Similarity
The approach consists roughly in two steps:
- Parse each input text to an AMR Graph that expresses the text semantics in a symbolic way
- Match those Meaning Graphs with Graph Similarity Metrics to elicit meaning similarity aspects (e.g., Agent, Patient, Negation,...)
```python
from xplain.symbolic.model import AMRSimilarity

explainer = AMRSimilarity()
sents1 = ["Barack Obama holds a talk"]
sents2 = ["Hillary Clinton holds a talk"]
exp = explainer.explain_similarity(sents1, sents2)
print(exp)
```
This prints a JSON-style dictionary with aspectual graph matching scores.
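The exact keys depend on the metric configuration; assuming a flat mapping from aspect name to score, the output can be post-processed, e.g. to rank aspects by alignment. The aspect names and values below are purely illustrative stand-ins, not real output:

```python
# purely illustrative stand-in for the returned dictionary
aspect_scores = {"AGENT": 0.55, "PATIENT": 0.92, "NEGATION": 1.00}

ranked = sorted(aspect_scores.items(), key=lambda kv: kv[1])
print("least aligned aspect:", ranked[0][0])  # AGENT in this toy dictionary
```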
Return AMR Graphs and Graph Alignments
To also return the graphs and aspectual subgraphs (including node alignments), pass `return_graphs=True` to `explain_similarity`.
FAQ
Citation