This is a Python tool to evaluate the alignment and uniformity of sentence embeddings
AlignUnformEval
This is a Python tool to evaluate the alignment and uniformity of sentence embeddings, as done in the SimCSE paper.
The SimCSE paper explains alignment and uniformity as follows:
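Given a distribution of positive pairs p_pos, alignment calculates the expected distance between the embeddings of the paired instances (assuming representations are already normalized):

$$\ell_{\mathrm{align}} \triangleq \mathbb{E}_{(x, x^{+}) \sim p_{\mathrm{pos}}} \left\| f(x) - f(x^{+}) \right\|^{2}$$

On the other hand, uniformity measures how well the embeddings are uniformly distributed, where p_data denotes the data distribution:

$$\ell_{\mathrm{uniform}} \triangleq \log \mathbb{E}_{x, y \,\overset{\mathrm{i.i.d.}}{\sim}\, p_{\mathrm{data}}} \, e^{-2 \left\| f(x) - f(y) \right\|^{2}}$$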
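The two quantities described above can be sketched in NumPy. This is a minimal illustration, not this library's implementation: it follows the SimCSE definitions (mean squared Euclidean distance over positive pairs for alignment, and the log of the mean of exp(-2 * squared distance) over distinct pairs for uniformity). The function names `l2_normalize`, `alignment`, and `uniformity` are hypothetical, and inputs are assumed to be 2-D arrays with one embedding per row.

```python
import numpy as np


def l2_normalize(x: np.ndarray) -> np.ndarray:
    """Scale each row of x to unit L2 norm."""
    return x / np.linalg.norm(x, axis=1, keepdims=True)


def alignment(x: np.ndarray, y: np.ndarray) -> float:
    """Mean squared distance between paired rows of x and y (positive pairs)."""
    return float(np.mean(np.sum((x - y) ** 2, axis=1)))


def uniformity(x: np.ndarray, t: float = 2.0) -> float:
    """Log of the mean Gaussian potential over all distinct pairs of rows of x."""
    sq_norms = np.sum(x ** 2, axis=1)
    # Pairwise squared distances: ||a||^2 + ||b||^2 - 2 * a.b
    sq_dists = sq_norms[:, None] + sq_norms[None, :] - 2.0 * (x @ x.T)
    upper = np.triu_indices(x.shape[0], k=1)  # count each unordered pair once
    sq = np.maximum(sq_dists[upper], 0.0)  # clip tiny negatives caused by rounding
    return float(np.log(np.mean(np.exp(-t * sq))))
```

Lower values are better for both metrics: alignment is 0 when every positive pair maps to the same vector, and uniformity decreases as the embeddings spread out over the unit sphere.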
Install
by pip
pip install alignunformeval
by source
pip install git+https://github.com/akiFQC/AlignUnformEval
Usage
You can easily evaluate alignment and uniformity with this library.
This is a minimal example that evaluates alignment and uniformity on STS Benchmark.
from alignunformeval import STSBEval
# sentence_encoder is a callable from List[str] to numpy.array. The output numpy.array must be [dimension_of_sentence_vector].
evaluator = STSBEval(sentence_encoder)
result = evaluator.eval_summary()
# result = {"alignment": value_of_alignment, "uniformity": value_of_uniformity}
STSBEval takes a callable whose input is a list of str and whose output is an n-dimensional numpy.array.
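To make the expected callable concrete, here is a toy encoder with that kind of signature. It is a sketch under stated assumptions, not part of this library: the name `toy_sentence_encoder`, the hashed bag-of-words scheme, and the `dim` parameter are all hypothetical, and the choice to return one row per input sentence is an assumption (check test/test_stsb.py for the exact shape the evaluators expect). A real setup would wrap a trained sentence embedding model instead.

```python
import zlib
from typing import List

import numpy as np


def toy_sentence_encoder(sentences: List[str], dim: int = 64) -> np.ndarray:
    """Hypothetical encoder: hashed bag-of-words counts, L2-normalized per row."""
    vectors = np.zeros((len(sentences), dim), dtype=np.float64)
    for row, sentence in enumerate(sentences):
        for token in sentence.lower().split():
            # crc32 is deterministic across runs, unlike the built-in hash()
            vectors[row, zlib.crc32(token.encode("utf-8")) % dim] += 1.0
    norms = np.linalg.norm(vectors, axis=1, keepdims=True)
    # Guard against division by zero for empty sentences (rows stay all-zero)
    return vectors / np.maximum(norms, 1e-12)
```

Normalizing inside the encoder matches the assumption in the metric definitions that representations are already normalized.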
Dataset
STS Benchmark
This dataset (specifically sts-dev.csv) was used in the SimCSE paper. In the paper, the threshold for the similarity score was set at 4.0; pairs of sentences whose similarity score is higher than 4.0 are used to evaluate alignment. You can set a different threshold, as in the following example.
from alignunformeval import STSBEval
# sentence_encoder: a function from List[str] to np.array[dimension_of_sentence_vector]
evaluator = STSBEval(sentence_encoder, threshold=3.0) # set threshold at 3.0
result = evaluator.eval_summary()
Please see test/test_stsb.py if you want more details.
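The effect of the threshold can be illustrated with a small filter over scored sentence pairs. This sketch is not the library's internal code: the `rows` data and the function name `select_positive_pairs` are hypothetical, and the only behavior taken from the description above is that pairs with a score strictly higher than the threshold count as positive pairs.

```python
from typing import List, Tuple

# Hypothetical (sentence1, sentence2, similarity score) rows in the style of STS Benchmark.
rows: List[Tuple[str, str, float]] = [
    ("A man is playing a guitar.", "A man plays the guitar.", 4.8),
    ("A dog runs in a field.", "A cat sleeps on a sofa.", 0.6),
    ("Two kids are playing soccer.", "Children play football.", 3.4),
]


def select_positive_pairs(
    rows: List[Tuple[str, str, float]], threshold: float = 4.0
) -> List[Tuple[str, str]]:
    """Keep sentence pairs whose score is strictly higher than the threshold."""
    return [(s1, s2) for s1, s2, score in rows if score > threshold]
```

Lowering the threshold admits more, but less closely matched, pairs into the alignment computation, so alignment values obtained with different thresholds are not directly comparable.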
Tokyo Metropolitan University Paraphrase Corpus (TMUP)
Tokyo Metropolitan University Paraphrase Corpus (TMUP) is a Japanese paraphrase dataset.
from alignunformeval import TMUPEval
# sentence_encoder: a function from List[str] to np.array[dimension_of_sentence_vector]
evaluator = TMUPEval(sentence_encoder)
result = evaluator.eval_summary()
License
The license of this tool follows the license of each dataset. Please read the documentation of the datasets you use.
Reference