AlignUnformEval

This is a Python tool to evaluate the alignment and uniformity of sentence embeddings, as done in the SimCSE paper.

The SimCSE paper explains alignment and uniformity as follows:

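Given a distribution of positive pairs p_pos, alignment calculates the expected distance between the embeddings of the paired instances (assuming representations are already normalized):

$$\ell_{\mathrm{align}} \triangleq \mathbb{E}_{(x, x^{+}) \sim p_{\mathrm{pos}}} \left\| f(x) - f(x^{+}) \right\|^{2}$$

On the other hand, uniformity measures how well the embeddings are uniformly distributed:

$$\ell_{\mathrm{uniform}} \triangleq \log \mathbb{E}_{x, y \overset{i.i.d.}{\sim} p_{\mathrm{data}}} e^{-2 \left\| f(x) - f(y) \right\|^{2}}$$
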
> where p_data denotes the data distribution.
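
As a rough illustration of the two quantities described above, the following NumPy sketch computes alignment as the mean squared distance between normalized paired embeddings, and uniformity as the log of the mean of exp(-2 * squared distance) over pairs of embeddings. It is not taken from this library: the function names `alignment` and `uniformity` are hypothetical, and averaging over all distinct pairs in a batch is an assumption about how the expectation is estimated.

```python
import numpy as np


def l2_normalize(x, eps=1e-12):
    """Scale each row of x to unit L2 norm (eps guards against zero rows)."""
    norms = np.linalg.norm(x, axis=1, keepdims=True)
    return x / np.maximum(norms, eps)


def alignment(x, y):
    """Mean squared Euclidean distance between paired rows of x and y.

    x[i] and y[i] are assumed to be embeddings of a positive pair.
    """
    x, y = l2_normalize(x), l2_normalize(y)
    return float(np.mean(np.sum((x - y) ** 2, axis=1)))


def uniformity(x, t=2.0):
    """Log of the mean of exp(-t * squared distance) over distinct row pairs.

    Assumption: the expectation is estimated over all pairs (i, j) with i < j.
    """
    x = l2_normalize(x)
    sq_norms = np.sum(x ** 2, axis=1)
    # Pairwise squared distances via ||a||^2 + ||b||^2 - 2 a.b
    sq_dists = sq_norms[:, None] + sq_norms[None, :] - 2.0 * (x @ x.T)
    upper = np.triu_indices(len(x), k=1)
    # Clip tiny negative values caused by floating-point rounding
    pair_sq_dists = np.maximum(sq_dists[upper], 0.0)
    return float(np.log(np.mean(np.exp(-t * pair_sq_dists))))
```

Lower values are better for both quantities: identical positive pairs give an alignment of 0, and embeddings that are spread further apart give a more negative uniformity.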

Install

by pip

pip install alignunformeval

by source

pip install git+https://github.com/akiFQC/AlignUnformEval

Usage

You can easily evaluate alignment and uniformity with this library.
This is a minimal example that evaluates alignment and uniformity on the STS Benchmark.

from alignunformeval import STSBEval

evaluator = STSBEval(sentence_encoder)
# sentence_encoder is a callable from List[str] to numpy.array. The output numpy.array must be [dimension_of_sentence_vector].
result = evaluator.eval_summary()
# result = {"alignment": value_of_alignment, "uniformity": value_of_uniformity}

STSBEval takes a callable whose input is a list of str and whose output is an n-dimensional numpy.array.
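
To make the expected interface concrete, here is a toy encoder that maps a list of strings to a NumPy array using hashed bag-of-words counts. It is only a stand-in for a real sentence embedding model: the name `toy_sentence_encoder`, the `dim` parameter, and the output shape of one row per input sentence, `(len(sentences), dim)`, are assumptions made for illustration and are not confirmed by this library's documentation.

```python
import hashlib
from typing import List

import numpy as np


def toy_sentence_encoder(sentences: List[str], dim: int = 64) -> np.ndarray:
    """Hypothetical encoder: hashed bag-of-words, one unit-norm row per sentence."""
    vectors = np.zeros((len(sentences), dim), dtype=np.float64)
    for row, sentence in enumerate(sentences):
        for token in sentence.lower().split():
            # Hash each token to a stable bucket index in [0, dim)
            digest = hashlib.md5(token.encode("utf-8")).digest()
            index = int.from_bytes(digest[:4], "little") % dim
            vectors[row, index] += 1.0
    # Normalize rows to unit length; empty sentences stay as zero vectors
    norms = np.linalg.norm(vectors, axis=1, keepdims=True)
    return vectors / np.maximum(norms, 1e-12)
```

If the shape assumption holds, a callable like this could be passed in place of `sentence_encoder`, for example `STSBEval(toy_sentence_encoder)`. In practice you would wrap a trained model in the same way.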

Dataset

STS Benchmark

This dataset (specifically, sts-dev.csv) was used in the SimCSE paper. In the paper, the similarity score threshold was set at 4.0; pairs of sentences whose similarity score is higher than 4.0 are used to evaluate alignment. You can set a different threshold, as in the following example.

from alignunformeval import STSBEval
# sentence_encoder : some function from List[str] to np.array[dimension_of_sentence_vector]
evaluator = STSBEval(sentence_encoder, threshold=3.0) # set threshold at 3.0
result = evaluator.eval_summary()

Please see test/test_stsb.py if you want more details.
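
The effect of the threshold can be sketched independently of the library: sentence pairs are kept as positive pairs for the alignment evaluation only when their similarity score is higher than the threshold. The helper name `select_positive_pairs` and the example rows below are made up for illustration and are not taken from sts-dev.csv. The strict comparison follows the wording above, but whether the library compares strictly is an assumption.

```python
from typing import List, Tuple

ScoredPair = Tuple[str, str, float]


def select_positive_pairs(
    rows: List[ScoredPair], threshold: float = 4.0
) -> List[Tuple[str, str]]:
    """Keep the sentence pairs whose similarity score is higher than threshold."""
    return [(s1, s2) for s1, s2, score in rows if score > threshold]


# Illustrative rows only (sentence 1, sentence 2, similarity score on a 0-5 scale)
example_rows: List[ScoredPair] = [
    ("A man is playing a guitar.", "A man plays the guitar.", 4.8),
    ("A woman is slicing an onion.", "A woman is cutting a vegetable.", 3.4),
    ("A dog runs on the beach.", "The stock market fell today.", 0.2),
]

strict_pairs = select_positive_pairs(example_rows, threshold=4.0)
loose_pairs = select_positive_pairs(example_rows, threshold=3.0)
```

Lowering the threshold admits more, but less closely paraphrased, pairs into the alignment computation.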

Tokyo Metropolitan University Paraphrase Corpus (TMUP)

Tokyo Metropolitan University Paraphrase Corpus (TMUP) is a Japanese paraphrase dataset.

from alignunformeval import TMUPEval
# sentence_encoder : some function from List[str] to np.array[dimension_of_sentence_vector]
evaluator = TMUPEval(sentence_encoder)
result = evaluator.eval_summary()

License

The license of this tool follows the license of each dataset. Please read the documentation of the datasets you use.

Reference

  1. https://arxiv.org/pdf/2104.08821.pdf
  2. https://ixa2.si.ehu.eus/stswiki/index.php/STSbenchmark
  3. https://github.com/tmu-nlp/paraphrase-corpus
