vec2best: A Unified Framework for Intrinsic Evaluation of Word-Embedding Algorithms
Description
vec2best is a Python library that provides a framework for evaluating word embeddings, trained with various methods and hyper-parameters, on a range of tasks from the literature. For each model, the tool yields a holistic evaluation metric called the $PCE$ (Principal Component Evaluation).
vec2best implements the state-of-the-art intrinsic evaluation tasks of word similarity, word analogy, concept categorization, and outlier detection over the benchmarks listed in the following table.
Task | Evaluation | Metric | Benchmark |
---|---|---|---|
Similarity | Spearman correlation | Cosine similarity | WS353, RG65, RW, MEN, MTurk287, SimLex999, MC30, MTurk771, YP130, Verb143, SimVerb3500, SemEval17, WS353REL, WS353SIM |
Analogy | Accuracy | 3CosAdd, 3CosMul | Google, MSR |
Analogy | Spearman correlation | 3CosAdd | SemEval2012 |
Categorization | Purity | Clustering | AP, BLESS, BM (battig), ESSLI 1a, ESSLI 2b, ESSLI 2c |
Outlier detection | Accuracy | Compactness score | 8-8-8, WordSim500 |
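To make the first two rows of the table concrete, the sketch below shows how a similarity benchmark can be scored with the Spearman correlation between cosine similarities and human ratings, and how an analogy question can be answered with 3CosAdd. It is a minimal pure-Python illustration under stated assumptions, not the code used by vec2best: the function names (`cosine`, `average_ranks`, `spearman`, `three_cos_add`) and the toy vectors are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def average_ranks(values):
    """Rank values from 1 to n, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        while (end + 1 < len(order)
               and values[order[end + 1]] == values[order[start]]):
            end += 1
        shared = (start + end) / 2 + 1  # mean of 1-based positions start..end
        for k in range(start, end + 1):
            ranks[order[k]] = shared
        start = end + 1
    return ranks


def spearman(xs, ys):
    """Spearman correlation: Pearson correlation of the two rank vectors."""
    rx, ry = average_ranks(xs), average_ranks(ys)
    mx, my = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = math.sqrt(sum((a - mx) ** 2 for a in rx))
    sy = math.sqrt(sum((b - my) ** 2 for b in ry))
    return cov / (sx * sy)


def three_cos_add(emb, a, b, c):
    """Answer 'a is to b as c is to ?' with the 3CosAdd objective.

    Picks the word d (other than a, b, c) that maximises
    cos(d, b) - cos(d, a) + cos(d, c).
    """
    def score(word):
        d = emb[word]
        return cosine(d, emb[b]) - cosine(d, emb[a]) + cosine(d, emb[c])

    candidates = [w for w in emb if w not in (a, b, c)]
    return max(candidates, key=score)


# Toy embeddings, invented purely for illustration.
toy = {
    "king": [0.9, 0.8, 0.1],
    "queen": [0.9, 0.1, 0.8],
    "man": [0.1, 0.9, 0.1],
    "woman": [0.1, 0.1, 0.9],
    "apple": [0.3, 0.3, 0.3],
}

# Similarity task: compare model similarities with (made-up) human ratings.
pairs = [("king", "queen"), ("man", "woman"), ("king", "apple")]
human = [8.5, 8.0, 2.0]
model = [cosine(toy[w1], toy[w2]) for w1, w2 in pairs]
print("Spearman:", spearman(human, model))

# Analogy task: man is to woman as king is to ?
print("3CosAdd answer:", three_cos_add(toy, "man", "woman", "king"))
```

A real benchmark run would apply the same scoring to every pair or question in the dataset and would need a policy for out-of-vocabulary words, which this sketch omits.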
Requirements
- Python 3.6
- scikit-learn
- six
- word-embeddings-benchmarks
The package also relies on modified versions of existing repositories for outlier detection.
Installation
vec2best can be installed through pip (the Python package manager):

    pip install vec2best
Usage
To compute the $PCE$, call the function `compute_pce(path_to_model)`. The only parameter you need to set is the path to the folder containing the embedding models (in .vec or .txt format) that you want to evaluate.
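For readers unfamiliar with .vec and .txt embedding files, the sketch below parses the plain-text layout commonly used for them: one word per line followed by its vector components, optionally preceded by a header line giving the vocabulary size and dimension. That vec2best expects exactly this layout is an assumption, and `load_text_embeddings` is a hypothetical helper, not part of the library.

```python
import io


def load_text_embeddings(handle):
    """Parse embeddings from a text stream with one 'word v1 v2 ...' per line.

    An optional first line of the form '<vocab_size> <dimension>' (the
    word2vec text header) is skipped when present.
    """
    vectors = {}
    for line_no, line in enumerate(handle):
        parts = line.rstrip().split(" ")
        if line_no == 0 and len(parts) == 2 and all(p.isdigit() for p in parts):
            continue  # header line, not a word vector
        if len(parts) < 2:
            continue  # blank or malformed line
        vectors[parts[0]] = [float(x) for x in parts[1:]]
    return vectors


# Usage with an in-memory file; a real file would be opened with
# open(path, encoding="utf-8") instead.
sample = io.StringIO("2 3\ncat 0.1 0.2 0.3\ndog 0.4 0.5 0.6\n")
embeddings = load_text_embeddings(sample)
print(sorted(embeddings))
```

Checking that every model in the folder parses this way before running a long evaluation can save time, since a malformed file would otherwise only surface partway through.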
The function `compute_pce(path_to_model)` has seven other parameters (`categorization=True`, `similarity=True`, `analogy=True`, `outlier_detection=True`, `pce_min=True`, `pce_max=True`, `pce_mean=True`), all set to `True` by default, so the output consists of the evaluation of the models over the four tasks and over $PCE^{MIN}$, $PCE^{MAX}$, and $PCE^{MEAN}$. Setting some of these parameters to `False` restricts the $PCE$ to a subset of the tasks, or computes only one or two of the three types of $PCE$.
The output is saved in the folder results/pce. The output on the screen shows the percentage of variance explained by the first principal component and the top 3 models according to the chosen $PCE$.
See the following example:
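    from vec2best import compute_pce

    # Folder containing the embedding models (.vec or .txt) to evaluate
    path_to_model = 'data/example_models'

    # Evaluate categorization and similarity only, and report only PCE min
    compute_pce(path_to_model, analogy=False, outlier_detection=False,
                pce_max=False, pce_mean=False)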
The output will look like:
    PCE min - percentage of explained variance: 0.95
                                     categorization  similarity  PCE_min
    example_models/ft_0_5_50_5.vec             0.38        0.29     1.00
    example_models/glove_5_50_5.vec            0.41        0.25     0.94
    example_models/wv2_model_11.vec            0.24        0.17     0.34
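The "percentage of explained variance" reported in the output is the share of the total variance in the task scores that the first principal component captures. The sketch below illustrates that quantity on a small score matrix, using the sample covariance matrix and power iteration in plain Python. It is an illustration of the general PCA notion only: how vec2best preprocesses the scores and derives the $PCE$ values from the component is not shown here, and `first_pc_explained_variance` is a hypothetical name.

```python
import math


def first_pc_explained_variance(rows, iterations=200):
    """Share of total variance captured by the first principal component.

    rows: list of equal-length score lists, one list per model.
    """
    n, d = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(d)]
    centered = [[r[j] - means[j] for j in range(d)] for r in rows]
    # Sample covariance matrix (d x d) of the centred scores.
    cov = [[sum(c[i] * c[j] for c in centered) / (n - 1) for j in range(d)]
           for i in range(d)]
    total = sum(cov[i][i] for i in range(d))  # trace = total variance
    if total == 0:
        raise ValueError("scores have zero variance")
    # Power iteration converges to the dominant eigenvector of cov.
    # The uneven start vector avoids the symmetric case where [1, 1, ...]
    # is orthogonal to that eigenvector.
    vec = [1.0 / (i + 1) for i in range(d)]
    for _ in range(iterations):
        nxt = [sum(cov[i][j] * vec[j] for j in range(d)) for i in range(d)]
        norm = math.sqrt(sum(x * x for x in nxt))
        if norm == 0:
            break
        vec = [x / norm for x in nxt]
    # Rayleigh quotient: the top eigenvalue, normalised by the vector length.
    top = sum(vec[i] * sum(cov[i][j] * vec[j] for j in range(d))
              for i in range(d))
    top /= sum(x * x for x in vec)
    return top / total


# Rounded categorization and similarity scores from the example output.
scores = [[0.38, 0.29], [0.41, 0.25], [0.24, 0.17]]
print(round(first_pc_explained_variance(scores), 2))
```

A value close to 1 means the tasks largely agree on how the models rank, so a single component summarises them well; a value near `1 / number_of_tasks` means the tasks carry mostly independent information.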