
Heterogeneous, Task- and Domain-Specific Benchmark for Unsupervised Sentence Embeddings, used in the TSDAE paper.

Project description

Unsupervised Sentence Embedding Benchmark (USEB)

This repository hosts the data and the evaluation script for reproducing the results reported in the paper "TSDAE: Using Transformer-based Sequential Denoising Auto-Encoder for Unsupervised Sentence Embedding Learning". The benchmark (USEB) contains four heterogeneous, task- and domain-specific datasets: AskUbuntu, CQADupStack, TwitterPara and SciDocs. For details, please refer to the paper.

Install

pip install useb  # Or git clone and pip install .
python -m useb.downloading all  # Download both training and evaluation data

Usage & Example

After downloading the data, one can either run

python -m useb.examples.eval_sbert

to evaluate an SBERT model on all the datasets (this takes about 8 minutes on a GPU), or run the equivalent code below:

from useb import run
from sentence_transformers import SentenceTransformer  # sentence-transformers provides SOTA sentence-embedding methods; TSDAE is also integrated into it.
import torch

sbert = SentenceTransformer('bert-base-nli-mean-tokens')  # Build an SBERT model

# The only thing needed for the evaluation: a function mapping a list of sentences into a batch of vectors (torch.Tensor)
@torch.no_grad()
def semb_fn(sentences) -> torch.Tensor:
    return torch.Tensor(sbert.encode(sentences, show_progress_bar=False))

results, results_main_metric = run(
    semb_fn_askubuntu=semb_fn,
    semb_fn_cqadupstack=semb_fn,
    semb_fn_twitterpara=semb_fn,
    semb_fn_scidocs=semb_fn,
    eval_type='test',
    data_eval_path='data-eval'  # Path to the downloaded data-eval folder
)

assert round(results_main_metric['avg'], 1) == 47.6
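
run returns two dictionaries: results with the detailed per-dataset, per-metric scores, and results_main_metric with the main-metric scores that the average above is computed from. A minimal sketch for inspecting them (only the 'avg' key is confirmed by the assertion above; the remaining keys are assumed to follow the dataset names):

# Sketch: inspect the returned scores. 'avg' appears in the assertion above;
# the remaining keys are assumed to correspond to the four datasets.
for name, score in results_main_metric.items():
    print(name, score)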

Evaluation on a single dataset is also supported (see useb/examples/eval_sbert_askubuntu.py):

python -m useb.examples.eval_sbert_askubuntu
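
The script calls the single-dataset entry point from Python. A sketch reusing the semb_fn defined above; run_on and its signature are assumptions based on the example script, so consult useb/examples/eval_sbert_askubuntu.py if the API differs:

from useb import run_on

# Evaluate only on AskUbuntu with the same embedding function as above
result = run_on(
    'askubuntu',
    semb_fn=semb_fn,
    eval_type='test',
    data_eval_path='data-eval'
)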

Data Organization

.
├── data-eval  # For evaluation usage. One can refer to ./unsupse_benchmark/evaluators to learn how these data are loaded.
│   ├── askubuntu
│   │   ├── dev.txt
│   │   ├── test.txt
│   │   └── text_tokenized.txt
│   ├── cqadupstack
│   │   ├── corpus.json
│   │   └── retrieval_split.json
│   ├── scidocs
│   │   ├── cite
│   │   │   ├── test.qrel
│   │   │   └── val.qrel
│   │   ├── cocite
│   │   │   ├── test.qrel
│   │   │   └── val.qrel
│   │   ├── coread
│   │   │   ├── test.qrel
│   │   │   └── val.qrel
│   │   ├── coview
│   │   │   ├── test.qrel
│   │   │   └── val.qrel
│   │   └── data.json
│   └── twitterpara
│       ├── Twitter_URL_Corpus_test.txt
│       ├── test.data
│       └── test.label
├── data-train  # For training usage.
│   ├── askubuntu
│   │   ├── supervised  # For supervised training. *.org and *.para are parallel files; aligned lines compose a gold relevant sentence pair (to work with MultipleNegativesRankingLoss in the SBERT repo).
│   │   │   ├── train.org
│   │   │   └── train.para
│   │   └── unsupervised  # For unsupervised training. Each line is a sentence.
│   │       └── train.txt
│   ├── cqadupstack
│   │   ├── supervised
│   │   │   ├── train.org
│   │   │   └── train.para
│   │   └── unsupervised
│   │       └── train.txt
│   ├── scidocs
│   │   ├── supervised
│   │   │   ├── train.org
│   │   │   └── train.para
│   │   └── unsupervised
│   │       └── train.txt
│   └── twitter  # For supervised training on TwitterPara, the float labels are also available (to work with CosineSimilarityLoss in the SBERT repo). As reported in the paper, using the float labels can achieve higher performance.
│       ├── supervised
│       │   ├── train.lbl
│       │   ├── train.org
│       │   ├── train.para
│       │   ├── train.s1
│       │   └── train.s2
│       └── unsupervised
│           └── train.txt
└── tree.txt
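
For example, the supervised *.org/*.para pairs can be fed directly into sentence-transformers training. A minimal sketch, assuming plain UTF-8 files with one sentence per line; load_parallel_pairs is a hypothetical helper (not part of useb), while InputExample is standard sentence-transformers API:

from sentence_transformers import InputExample

def load_parallel_pairs(org_path, para_path):  # hypothetical helper, not part of useb
    # Aligned lines of *.org and *.para compose a gold relevant sentence pair,
    # as expected by MultipleNegativesRankingLoss.
    with open(org_path, encoding='utf-8') as f_org, open(para_path, encoding='utf-8') as f_para:
        return [InputExample(texts=[org.strip(), para.strip()])
                for org, para in zip(f_org, f_para)]

train_examples = load_parallel_pairs(
    'data-train/askubuntu/supervised/train.org',
    'data-train/askubuntu/supervised/train.para'
)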

Project details


Download files

Download the file for your platform.

Source Distribution

useb-0.0.1.tar.gz (13.6 kB)

Uploaded Source

Built Distribution

useb-0.0.1-py3-none-any.whl (18.6 kB)

Uploaded Python 3

File details

Details for the file useb-0.0.1.tar.gz.

File metadata

  • Download URL: useb-0.0.1.tar.gz
  • Upload date:
  • Size: 13.6 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/3.4.1 importlib_metadata/4.0.1 pkginfo/1.7.0 requests/2.25.1 requests-toolbelt/0.9.1 tqdm/4.60.0 CPython/3.6.13

File hashes

Hashes for useb-0.0.1.tar.gz
  • SHA256: 650feaa003cee1c08f8722290fdf36448475b074e889770fb6846e5c165c15df
  • MD5: 13a6ca72340fe06473684539c3265df6
  • BLAKE2b-256: 1505f1a746aa4f18dfd08caf5ea6e61730489f5b3eb44a79ae938ce07f8e2b61


File details

Details for the file useb-0.0.1-py3-none-any.whl.

File metadata

  • Download URL: useb-0.0.1-py3-none-any.whl
  • Upload date:
  • Size: 18.6 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/3.4.1 importlib_metadata/4.0.1 pkginfo/1.7.0 requests/2.25.1 requests-toolbelt/0.9.1 tqdm/4.60.0 CPython/3.6.13

File hashes

Hashes for useb-0.0.1-py3-none-any.whl
  • SHA256: 64626124c44fa4bc8c60225ed15e1c48b433b98d94583a8830fa46a85670f461
  • MD5: 6e5afea65abacf936724c785aa87cd60
  • BLAKE2b-256: 984d712a770803ff4d3e18bf75177ccb1b2313072d062c0a524ca51d91d5ac3e

