A Python package for explaining text similarity

XPLAINSIM: A Toolkit for Explaining Text Similarity

A research toolkit for decomposing and explaining text similarity across neural, structured, and symbolic levels. It's designed for interpretability research, controlled embedding and metric alignment, and hybrid neural-symbolic text analysis.

The toolkit is modular: each explanation paradigm can be used independently or combined in hybrid setups.

Conceptual Overview

XPLAINSIM currently provides three complementary explanation paradigms:

Module | Explanation Level | What it Does
Attribution | Token level | Explain which tokens drive similarity
SpaceShaping | Embedding space | Shape features to encode custom aspects
Symbolic | Graph level | Explain which semantic roles/aspects align

Installation

You can install via pip with:

pip install xplainsim

That's it. Only the Symbolic module, when used with its default parser, requires one small extra installation step (see the installation note in the Symbolic section).

Attributions

Idea

Token-level attribution decomposes embedding similarity into fine-grained token interactions between two texts.

Given a neural embedding model and two texts, we trace the similarity back to interactions between individual input tokens.

The explanation is a matrix indexed by the tokens of the two inputs; the sum of its entries approximates the similarity of the embeddings.
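To make the "matrix that sums to the similarity" idea concrete, here is a minimal sketch for a deliberately simplified setting: mean-pooled token vectors compared with a plain dot product. In that special case the decomposition is exact. This is not how XPLAINSIM computes attributions for real transformer models; the helper names and the random vectors are purely illustrative assumptions.

```python
import random

def dot(u, v):
    """Plain dot product of two equally long vectors."""
    return sum(x * y for x, y in zip(u, v))

def mean_pool(token_vectors):
    """Average a list of token vectors into a single text embedding."""
    n, dim = len(token_vectors), len(token_vectors[0])
    return [sum(vec[d] for vec in token_vectors) / n for d in range(dim)]

def attribution_matrix(vecs_a, vecs_b):
    """Entry (i, j) is the share of the similarity owed to token i of A and token j of B."""
    n, m = len(vecs_a), len(vecs_b)
    return [[dot(va, vb) / (n * m) for vb in vecs_b] for va in vecs_a]

random.seed(0)
dim = 8
# Hypothetical token vectors: 3 tokens for text A, 5 tokens for text B
vecs_a = [[random.gauss(0, 1) for _ in range(dim)] for _ in range(3)]
vecs_b = [[random.gauss(0, 1) for _ in range(dim)] for _ in range(5)]

similarity = dot(mean_pool(vecs_a), mean_pool(vecs_b))
A = attribution_matrix(vecs_a, vecs_b)
total = sum(sum(row) for row in A)

# The matrix entries add up to the embedding similarity (up to floating-point error)
print(abs(similarity - total) < 1e-9)
```

For deep models the token interactions are no longer a simple product of input vectors, which is why the sum of the matrix only approximates the similarity there.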

Example

Show Currently Available Models

from xplain.attribution import ModelFactory
print(ModelFactory.show_options()) # shows available model names, use in build below

Compute Attributions

from xplain.attribution import ModelFactory
model = ModelFactory.build("sentence-transformers/all-mpnet-base-v2") # use print(ModelFactory.show_options()) to show others
texta = 'The dog runs after the kitten in the yard.'
textb = 'Outside in the garden the cat is chased by the dog.'
A, tokens_a, tokens_b = model.explain_similarity(texta, textb, move_to_cpu=True, sim_measure='cosine')

Example output structure:

  • A: token-level contribution matrix
  • tokens_a: token list for text A
  • tokens_b: token list for text B
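A common next step is to list the strongest token pairs. The sketch below assumes that A has one row per entry of tokens_a and one column per entry of tokens_b, and that it is either a nested list or an object with a tolist() method (as NumPy arrays and CPU tensors have). The function name top_token_pairs and the toy numbers are made up for illustration and are not part of the package.

```python
def top_token_pairs(A, tokens_a, tokens_b, k=3):
    """Return the k (score, token_a, token_b) triples with the highest contribution."""
    rows = A.tolist() if hasattr(A, "tolist") else A
    scored = [
        (rows[i][j], tokens_a[i], tokens_b[j])
        for i in range(len(tokens_a))
        for j in range(len(tokens_b))
    ]
    scored.sort(key=lambda item: item[0], reverse=True)
    return scored[:k]

# Toy values for illustration only, not real model output
toy_tokens_a = ["dog", "runs", "kitten"]
toy_tokens_b = ["cat", "chased", "dog"]
toy_A = [
    [0.02, 0.01, 0.30],
    [0.01, 0.12, 0.02],
    [0.25, 0.03, 0.01],
]

for score, tok_a, tok_b in top_token_pairs(toy_A, toy_tokens_a, toy_tokens_b, k=2):
    print(f"{tok_a} <-> {tok_b}: {score:.2f}")
# dog <-> dog: 0.30
# kitten <-> cat: 0.25
```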

Expansion: Token Alignment

# same as above, then
A, tokens_a, tokens_b = model.postprocess_attributions(A, tokens_a, tokens_b, sparsification_method="FlowAlign")

Expansion: Cross-Linguality

from xplain.attribution import ModelFactory
model = ModelFactory.build("Alibaba-NLP/gte-multilingual-base") # use print(ModelFactory.show_options()) to show others
texta = 'The dog runs after the kitten in the yard.'
textb = 'Im Garten rennt der Hund der Katze hinterher.'
A, tokens_a, tokens_b = model.explain_similarity(texta, textb, move_to_cpu=True, sim_measure='cosine')

Space Shaping

Idea

Space Shaping enforces interpretable structure inside embedding spaces.

Instead of learning a monolithic embedding, the vector is partitioned into dedicated subspaces, each trained to reflect a predefined interpretable metric (e.g., bag-of-words overlap, named entity similarity, sentiment, etc.).

This enables:

  • Controllable similarity decomposition
  • Feature-aligned embeddings
  • Hybrid symbolic–neural objectives

from sentence_transformers import InputExample
from xplain.spaceshaping import PartitionedSentenceTransformer

examples = []

# compute the training/partitioning target
for x, y in zip(list_with_strings, other_list_with_strings):
    similarities = []
    # Metrics/aspects that should be reflected in the embedding space
    for metric in my_metrics:
        similarities.append(metric.score(x, y))
    examples.append(InputExample(texts=[x, y], label=similarities))

# instantiate model and train
pt = PartitionedSentenceTransformer(feature_names, feature_dims)

pt.train_model(examples)
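The partitioning idea can be illustrated without any model: slice two embeddings into named subspaces and compare each slice separately. The sketch below assumes that the subspaces occupy consecutive dimensions in the order given by feature_names and that all remaining dimensions form a residual part. Whether PartitionedSentenceTransformer lays out its vector exactly like this is an assumption, and the keys "residual" and "global" are hypothetical names chosen for this example.

```python
import math

def cosine(u, v):
    """Cosine similarity; returns 0.0 if either vector has zero norm."""
    norm = math.sqrt(sum(x * x for x in u)) * math.sqrt(sum(y * y for y in v))
    return sum(x * y for x, y in zip(u, v)) / norm if norm else 0.0

def partitioned_similarity(u, v, feature_names, feature_dims):
    """Compare two vectors subspace by subspace, plus residual and whole-vector scores."""
    scores, start = {}, 0
    for name, dim in zip(feature_names, feature_dims):
        scores[name] = cosine(u[start:start + dim], v[start:start + dim])
        start += dim
    if start < len(u):
        # Dimensions not assigned to any named feature
        scores["residual"] = cosine(u[start:], v[start:])
    scores["global"] = cosine(u, v)
    return scores

# Tiny hand-made vectors: 2 dims for "bow", 2 dims for "ner", 2 residual dims
u = [1, 0, 0, 1, 1, 1]
v = [1, 0, 1, 0, 1, 1]
scores = partitioned_similarity(u, v, ["bow", "ner"], [2, 2])
print(scores)
```

Here the "bow" slices are identical and the "ner" slices are orthogonal, so the two texts would count as similar in one aspect and dissimilar in the other, even though the whole-vector score is a single number in between.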

Space Partitioning Example

Here's a very simple example for training and inferring with a custom model.

Concretely, we partition the embedding into three features/parts

  1. Bag-of-words: Learns to reflect bag-of-words distance
  2. Named entity similarity: Learns to reflect similarity of named entities
  3. Residual (not explicitly trained): Remaining features that capture the semantic similarity not covered by the two features above

Note that this is only toy code and the training uses little data. Still, the feature partitioning will already have some effect.

from scipy.stats import pearsonr
from xplain.spaceshaping import PartitionedSentenceTransformer
from sentence_transformers import InputExample
from datasets import load_dataset

# We will later use this to create a custom "Named Entity" metric
import spacy
nlp = spacy.load("en_core_web_sm")

# let's first load a toy train dataset of sentence pairs
ds = load_dataset("mteb/stsbenchmark-sts")
some_pairs = list(zip([dic["sentence1"] for dic in ds["train"]], [dic["sentence2"] for dic in ds["train"]]))

# dev dataset of sentence pairs
some_pairs_dev = list(zip([dic["sentence1"] for dic in ds["validation"]], [dic["sentence2"] for dic in ds["validation"]]))

# let's build the target metrics that should be reflected within the embedding space
def bow_sim(x1, x2):
    # Jaccard overlap of whitespace-separated tokens
    x1, x2 = set(x1.split()), set(x2.split())
    inter, union = x1.intersection(x2), x1.union(x2)
    # two empty inputs count as identical (avoids division by zero)
    return len(inter) / len(union) if union else 1.0

def ner_sim(doc1, doc2):
    # same overlap, restricted to the named entities found by spaCy
    x1_ner = " ".join([ne.text for ne in doc1.ents])
    x2_ner = " ".join([ne.text for ne in doc2.ents])
    if not x1_ner and not x2_ner:
        return 1.0
    return bow_sim(x1_ner, x2_ner)

# we create training examples
docs1, docs2 = [nlp(x) for x, _ in some_pairs], [nlp(y) for _, y in some_pairs]
target = [[bow_sim(x1, x2), ner_sim(docs1[i], docs2[i])] for i, (x1, x2) in enumerate(some_pairs)]
some_examples = [InputExample(texts=[x1, x2], label=target[i]) for (i, (x1, x2)) in enumerate(some_pairs)]

# some development examples
docs1_dev, docs2_dev = [nlp(x) for x, _ in some_pairs_dev], [nlp(y) for _, y in some_pairs_dev]
target_dev = [[bow_sim(x1, x2), ner_sim(docs1_dev[i], docs2_dev[i])] for i, (x1, x2) in enumerate(some_pairs_dev)]
some_examples_dev = [InputExample(texts=[x1, x2], label=target_dev[i]) for (i, (x1, x2)) in enumerate(some_pairs_dev)]

# initialize model
pt = PartitionedSentenceTransformer(feature_names=["bow", "ner"], feature_dims=[32, 32])

# explanations can be computed before training; they are not meaningful yet and only serve as a baseline for later comparison
decomposed_predictions = pt.explain_similarity([x for x, y in some_pairs_dev], [y for x, y in some_pairs_dev])

def feature_correlation(feature_name, preds):
    return pearsonr([dic[feature_name] for dic in preds],
                    [ex.label[pt.feature_names.index(feature_name)] for ex in some_examples_dev])[0]

pearsonr_before_training = [feature_correlation(name, decomposed_predictions) for name in pt.feature_names]

# print a toy example before training
print("Text before training:", pt.explain_similarity(["The kitten drinks milk"], ["A cat slurps something"]))

# train
pt.train_model(some_examples, some_examples_dev)

# eval correlation to custom metric after train
decomposed_predictions = pt.explain_similarity([x for x, y in some_pairs_dev], [y for x, y in some_pairs_dev])
pearsonr_after_training = [feature_correlation(name, decomposed_predictions) for name in pt.feature_names]

for index, pr in enumerate(pearsonr_after_training):
    print(f"Correlation for {pt.feature_names[index]} delta: {pr - pearsonr_before_training[index]}")

# print a toy example after training
print("Text after training:", pt.explain_similarity(["The kitten drinks milk"], ["A cat slurps something"]))
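
The bag-of-words target in the example above is the Jaccard index over whitespace-separated tokens: shared tokens divided by all distinct tokens. The standalone sketch below repeats that function, with a guard for two empty inputs, and works through two small cases. The example sentences are chosen only for illustration.

```python
def bow_sim(x1, x2):
    """Jaccard index over whitespace tokens (case-sensitive)."""
    x1, x2 = set(x1.split()), set(x2.split())
    inter, union = x1.intersection(x2), x1.union(x2)
    # Two empty inputs are treated as identical to avoid dividing by zero
    return len(inter) / len(union) if union else 1.0

# Shared: {"a", "dog"}; all distinct tokens: {"a", "dog", "runs", "sleeps"}
print(bow_sim("a dog runs", "a dog sleeps"))  # 0.5

# No token is shared, so the overlap is zero even though the meaning is close
print(bow_sim("The kitten drinks milk", "A cat slurps something"))  # 0.0
```

The second case shows why a residual subspace is useful: surface overlap is zero here, so any remaining similarity has to be captured elsewhere in the embedding.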

Symbolic

Idea

Unlike pure neural similarity, this approach decomposes similarity along semantic roles (Agent, Patient, Negation, etc.), enabling aspect-level semantic comparison.

This is based on comparing AMR graphs of texts. Abstract Meaning Representation (AMR) encodes sentence meaning as a graph of concepts and semantic roles.

Installation note:

Using the Symbolic module with the default parser requires one small extra installation step:

xplain-install-amr

Also make sure that transformers<5 is installed for this, since the default AMR parser is not yet compatible with version 5.

Example

Explaining Similarity

The approach roughly consists of two steps:

  1. Parse each input text into an AMR graph that expresses the text's semantics symbolically
  2. Match these meaning graphs with graph similarity metrics to elicit aspects of meaning similarity (e.g., Agent, Patient, Negation, ...)

from xplain.symbolic.model import AMRSimilarity
explainer = AMRSimilarity()
sents1 = ["Barack Obama holds a talk"]
sents2 = ["Hillary Clinton holds a talk"]
exp = explainer.explain_similarity(sents1, sents2)
print(exp)

This will print a JSON dictionary with aspectual graph matching scores.
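
To give an intuition for aspect-level matching, the sketch below compares two tiny hand-written graphs role by role. It is a strong simplification and not the metric used by AMRSimilarity: the triples are written by hand rather than produced by a parser, they use plain concept labels instead of AMR variables and sense-tagged frames, and the per-role Jaccard score is a stand-in chosen for illustration. The function name role_overlap is hypothetical.

```python
from collections import defaultdict

# Hand-written, simplified (source, role, target) triples; NOT parser output
graph_1 = {("hold", "ARG0", "Barack Obama"), ("hold", "ARG1", "talk")}
graph_2 = {("hold", "ARG0", "Hillary Clinton"), ("hold", "ARG1", "talk")}

def role_overlap(triples_a, triples_b):
    """Per semantic role, the Jaccard overlap of the (source, target) pairs."""
    by_role_a, by_role_b = defaultdict(set), defaultdict(set)
    for src, role, tgt in triples_a:
        by_role_a[role].add((src, tgt))
    for src, role, tgt in triples_b:
        by_role_b[role].add((src, tgt))
    scores = {}
    for role in sorted(set(by_role_a) | set(by_role_b)):
        a, b = by_role_a[role], by_role_b[role]
        scores[role] = len(a & b) / len(a | b)
    return scores

scores = role_overlap(graph_1, graph_2)
print(scores)  # {'ARG0': 0.0, 'ARG1': 1.0}
```

The two sentences agree on what is held (ARG1) but differ in who holds it (ARG0), which is the kind of distinction an aspect-level explanation is meant to surface.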

Return AMR Graphs and Graph Alignments

To also return the graphs and aspectual subgraphs (including node alignments), use return_graphs=True in explain_similarity.
