AlignUnformEval
This is a Python tool for evaluating the alignment and uniformity of sentence embeddings, following the SimCSE paper.
The SimCSE paper explains alignment and uniformity as follows:
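Given a distribution of positive pairs p_pos, alignment calculates expected distance between embeddings of the paired instances (assuming representations are already normalized):

$$\ell_{\mathrm{align}} \triangleq \mathbb{E}_{(x, x^{+}) \sim p_{\mathrm{pos}}} \left\| f(x) - f(x^{+}) \right\|^{2}$$

On the other hand, uniformity measures how well the embeddings are uniformly distributed:

$$\ell_{\mathrm{uniform}} \triangleq \log \mathbb{E}_{x, y \overset{\mathrm{i.i.d.}}{\sim} p_{\mathrm{data}}} e^{-2 \left\| f(x) - f(y) \right\|^{2}}$$

where p_data denotes the data distribution.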
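To make the two quantities concrete, here is a minimal NumPy sketch of alignment and uniformity as defined in the SimCSE paper (mean squared distance between paired embeddings, and the log of the mean Gaussian potential over all distinct pairs). It is an illustration only, not necessarily how this library computes them; the function names `l2_normalize`, `alignment`, and `uniformity` and the parameters `alpha` and `t` are assumptions introduced for this example.

```python
import numpy as np


def l2_normalize(x: np.ndarray) -> np.ndarray:
    """Scale each row to unit length, since both metrics assume normalized embeddings."""
    norms = np.linalg.norm(x, axis=1, keepdims=True)
    return x / np.clip(norms, 1e-12, None)


def alignment(x: np.ndarray, y: np.ndarray, alpha: float = 2.0) -> float:
    """Mean of ||f(x) - f(x+)||^alpha, where row i of x is paired with row i of y."""
    x, y = l2_normalize(x), l2_normalize(y)
    return float(np.mean(np.linalg.norm(x - y, axis=1) ** alpha))


def uniformity(x: np.ndarray, t: float = 2.0) -> float:
    """Log of the mean of exp(-t * ||f(x) - f(y)||^2) over all distinct pairs of rows."""
    x = l2_normalize(x)
    sq_norms = np.sum(x * x, axis=1)
    # Pairwise squared Euclidean distances via ||a||^2 + ||b||^2 - 2 a.b
    sq_dists = sq_norms[:, None] + sq_norms[None, :] - 2.0 * (x @ x.T)
    upper = np.triu_indices(x.shape[0], k=1)  # each unordered pair once
    sq = np.clip(sq_dists[upper], 0.0, None)  # guard against tiny negative values
    return float(np.log(np.mean(np.exp(-t * sq))))
```

Both values are lower-is-better: alignment approaches 0 when paired sentences map to nearly the same vector, and uniformity becomes more negative as embeddings spread out over the unit sphere.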
Install
by pip
pip install alignunformeval
by source
pip install git+https://github.com/akiFQC/AlignUnformEval
Usage
You can easily evaluate alignment and uniformity with this library.
This is a minimal example that evaluates alignment and uniformity on the STS Benchmark.
from alignunformeval import STSBEval

# sentence_encoder is a callable from List[str] to numpy.array.
# The output numpy.array must be [dimension_of_sentence_vector].
evaluator = STSBEval(sentence_encoder)
result = evaluator.eval_summary()
# result = {"alignment": value_of_alignment, "uniformity": value_of_uniformity}
STSBEval takes a callable whose input is a list of str and whose output is an n-dimensional numpy.array.
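For readers who want something to plug in while experimenting, the following is a hypothetical stand-in encoder built only on NumPy and the standard library. It hashes whitespace-separated tokens into a fixed-size bag-of-words vector and normalizes each row. The name `toy_sentence_encoder`, the `dim` parameter, and the assumption that the output has one row per input sentence (shape `(len(sentences), dim)`) are not taken from this project; check test/test_stsb.py for the shape the evaluators actually expect.

```python
import hashlib
from typing import List

import numpy as np


def toy_sentence_encoder(sentences: List[str], dim: int = 64) -> np.ndarray:
    """Hypothetical encoder: hashed bag-of-words, one L2-normalized row per sentence.

    Assumption (not confirmed by the project): output shape is (len(sentences), dim).
    """
    vectors = np.zeros((len(sentences), dim), dtype=np.float64)
    for row, sentence in enumerate(sentences):
        for token in sentence.lower().split():
            # A stable hash keeps the mapping identical across runs and processes.
            digest = hashlib.sha256(token.encode("utf-8")).digest()
            index = int.from_bytes(digest[:4], "little") % dim
            vectors[row, index] += 1.0
    norms = np.linalg.norm(vectors, axis=1, keepdims=True)
    # Sentences with no tokens stay as all-zero rows instead of dividing by zero.
    return vectors / np.clip(norms, 1e-12, None)
```

If the shape assumption holds, a function like this could be passed wherever `sentence_encoder` appears in the examples in this document. A real model's encode function wrapped to return a numpy.array would take its place in practice.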
Dataset
STS Benchmark
This dataset (specifically sts-dev.csv) was used in the SimCSE paper. In the paper, the similarity-score threshold was set at 4.0: sentence pairs whose similarity score is higher than 4.0 are used to evaluate alignment. You can set a different threshold, as in the following example.
from alignunformeval import STSBEval
# sentence_encoder: a function from List[str] to np.array[dimension_of_sentence_vector]
evaluator = STSBEval(sentence_encoder, threshold=3.0) # set threshold at 3.0
result = evaluator.eval_summary()
Please see test/test_stsb.py if you want more details.
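The effect of the threshold can be pictured as a simple filter over scored sentence pairs: only pairs scoring above it count as positive pairs for the alignment metric. The sketch below illustrates that idea in plain Python. The helper `select_positive_pairs`, the `ScoredPair` alias, and the in-memory tuple format are assumptions made for this example and are not the library's internals.

```python
from typing import List, Tuple

# Hypothetical record format: (sentence1, sentence2, similarity_score)
ScoredPair = Tuple[str, str, float]


def select_positive_pairs(
    pairs: List[ScoredPair], threshold: float = 4.0
) -> List[Tuple[str, str]]:
    """Keep the sentence pairs whose similarity score is strictly above the threshold."""
    return [(s1, s2) for s1, s2, score in pairs if score > threshold]
```

Lowering the threshold, for example from 4.0 to 3.0, admits more loosely related pairs into the positive set. That typically changes the alignment value, so results are only comparable when computed with the same threshold.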
Tokyo Metropolitan University Paraphrase Corpus (TMUP)
Tokyo Metropolitan University Paraphrase Corpus (TMUP) is a Japanese paraphrase dataset.
from alignunformeval import TMUPEval
# sentence_encoder: a function from List[str] to np.array[dimension_of_sentence_vector]
evaluator = TMUPEval(sentence_encoder)
result = evaluator.eval_summary()
License
The license of this tool follows that of each dataset. Please read the documentation of the datasets you use.