sentence-embedding-evaluation-german
Sentence embedding evaluation for German.
This library is inspired by SentEval but focuses on German-language downstream tasks.
Downstream tasks
task | type | properties | #train | #test | target | info |
---|---|---|---|---|---|---|
TOXIC | toxic comments | Facebook comments | 3244 | 944 | binary {0,1} | GermEval 2021, comments subtask 1 |
ENGAGE | engaging comments | Facebook comments | 3244 | 944 | binary {0,1} | GermEval 2021, comments subtask 2 |
FCLAIM | fact-claiming comments | Facebook comments | 3244 | 944 | binary {0,1} | GermEval 2021, comments subtask 3 |
VMWE | verbal idioms | newspaper | 6652 | 1447 | binary (figuratively, literally) | GermEval 2021, verbal idioms |
OL19-A | offensive language | tweets | 3980 | 3031 | binary {0,1} | GermEval 2019 |
OL19-B | offensive language, fine-grained | tweets | 3980 | 3031 | 4 catg. (profanity, insult, abuse, other) | GermEval 2019 |
OL19-C | explicit vs. implicit offense | tweets | 1921 | 930 | binary (explicit, implicit) | GermEval 2019 |
OL18-A | offensive language | tweets | 5009 | 3398 | binary {0,1} | GermEval 2018 |
OL18-B | offensive language, fine-grained | tweets | 5009 | 3398 | 4 catg. (profanity, insult, abuse, other) | GermEval 2018 |
ABSD-1 | relevance classification | 'Deutsche Bahn' customer feedback, lang:de-DE | 19432 | 2555 | binary | GermEval 2017 |
ABSD-2 | sentiment analysis | 'Deutsche Bahn' customer feedback, lang:de-DE | 19432 | 2555 | 3 catg. (pos., neg., neutral) | GermEval 2017 |
ABSD-3 | aspect categories | 'Deutsche Bahn' customer feedback, lang:de-DE | 19432 | 2555 | 20 catg. | GermEval 2017 |
MIO-S | sentiment analysis | 'Der Standard' newspaper article web comments, lang:de-AT | 1799 | 1800 | 3 catg. | One Million Posts Corpus |
MIO-O | off-topic comments | 'Der Standard' newspaper article web comments, lang:de-AT | 1799 | 1800 | binary | One Million Posts Corpus |
MIO-I | inappropriate comments | 'Der Standard' newspaper article web comments, lang:de-AT | 1799 | 1800 | binary | One Million Posts Corpus |
MIO-D | discriminating comments | 'Der Standard' newspaper article web comments, lang:de-AT | 1799 | 1800 | binary | One Million Posts Corpus |
MIO-F | feedback comments | 'Der Standard' newspaper article web comments, lang:de-AT | 3019 | 3019 | binary | One Million Posts Corpus |
MIO-P | personal story comments | 'Der Standard' newspaper article web comments, lang:de-AT | 4668 | 4668 | binary | One Million Posts Corpus |
MIO-A | argumentative comments | 'Der Standard' newspaper article web comments, lang:de-AT | 1799 | 1800 | binary | One Million Posts Corpus |
SBCH-L | Swiss German detection | 'chatmania' app comments, lang:gsw | 748 | 748 | binary | SB-CH Corpus |
SBCH-S | sentiment analysis | 'chatmania' app comments, only comments labelled as Swiss German are included, lang:gsw | 394 | 394 | 3 catg. | SB-CH Corpus |
ARCHI | Swiss German dialect classification | lang:gsw | 18809 | 4743 | 4 catg. | ArchiMob |
LSDC | Lower Saxon dialect classification | lang:nds | 74140 | 8602 | 14 catg. | LSDC |
Download datasets
bash download-datasets.sh
Usage example
from typing import List
import sentence_embedding_evaluation_german as seeg
import torch
# (1) Instantiate your Embedding model
emb_dim = 512
vocab_sz = 128
emb = torch.randn((vocab_sz, emb_dim), requires_grad=False)
emb = torch.nn.Embedding.from_pretrained(emb)
assert not emb.weight.requires_grad
# (2) Specify the preprocessing
def preprocesser(batch: List[str], params: dict = None) -> List[List[float]]:
    """ Specify your embedding or pretrained encoder here

    Parameters:
    -----------
    batch : List[str]
        A list of sentences as strings
    params : dict
        The params dictionary

    Returns:
    --------
    List[List[float]]
        A list of embedding vectors
    """
    features = []
    for sent in batch:
        # map each character to a pseudo token ID in [0, 127]
        ids = torch.tensor([ord(c) % 128 for c in sent])
        # average the character embeddings into one sentence vector
        h = emb(ids)
        features.append(h.mean(dim=0))
    features = torch.stack(features, dim=0)
    return features
# (3) Training settings
params = {
'datafolder': '../datasets',
'batch_size': 128,
'num_epochs': 20,
# 'early_stopping': True,
# 'split_ratio': 0.2, # if early_stopping=True
# 'patience': 5, # if early_stopping=True
}
# (4) Specify downstream tasks
downstream_tasks = ['FCLAIM', 'VMWE', 'OL19-C', 'ABSD-2', 'MIO-P', 'ARCHI', 'LSDC']
# (5) Run experiments
results = seeg.evaluate(downstream_tasks, preprocesser, **params)
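The random character embedding above is only a stand-in: any function with the same `(batch, params)` signature can serve as the `preprocesser`. As a further illustration, here is a minimal, model-free sketch that hashes character trigrams into fixed-size count vectors; the function name `hash_ngram_preprocesser` and the choices `dim=1024`, `n=3` are illustrative assumptions, not part of the library's API.

```python
import hashlib
from typing import List

def hash_ngram_preprocesser(batch: List[str], params: dict = None,
                            dim: int = 1024, n: int = 3) -> List[List[float]]:
    """Bag-of-character-trigrams baseline via feature hashing."""
    features = []
    for sent in batch:
        vec = [0.0] * dim
        # pad the sentence so boundary characters form trigrams too
        padded = f"#{sent}#"
        for i in range(max(1, len(padded) - n + 1)):
            gram = padded[i:i + n]
            # hash each trigram to a stable bucket index and count it
            idx = int(hashlib.md5(gram.encode("utf-8")).hexdigest(), 16) % dim
            vec[idx] += 1.0
        features.append(vec)
    return features
```

Such a baseline is useful for sanity-checking the evaluation pipeline before plugging in a trained sentence encoder.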
Appendix
Installation
The sentence-embedding-evaluation-german git repo is available as a PyPI package:

pip install sentence-embedding-evaluation-german

Alternatively, install directly from the git repo:

pip install git+ssh://git@github.com/ulf1/sentence-embedding-evaluation-german.git
Install a virtual environment
python3.7 -m venv .venv
source .venv/bin/activate
pip install --upgrade pip
pip install -r requirements.txt --no-cache-dir
pip install -r requirements-dev.txt --no-cache-dir
pip install -r requirements-demo.txt --no-cache-dir
(If your git repo is stored in a folder with whitespaces, then don't use the subfolder .venv. Use an absolute path without whitespaces.)
Python commands
- Jupyter for the examples:
jupyter lab
- Check syntax:
flake8 --ignore=F401 --exclude=$(grep -v '^#' .gitignore | xargs | sed -e 's/ /,/g')
Publish
pandoc README.md --from markdown --to rst -s -o README.rst
python setup.py sdist
twine upload -r pypi dist/*
Clean up
find . -type f -name "*.pyc" | xargs rm
find . -type d -name "__pycache__" | xargs rm -r
rm -r .pytest_cache
rm -r .venv
Support
Please open an issue for support.
Contributing
Please contribute using Github Flow. Create a branch, add commits, and open a pull request.
Hashes for sentence-embedding-evaluation-german-0.1.0.tar.gz

Algorithm | Hash digest |
---|---|
SHA256 | b7a95247c0d048da28c4abcaa02050070df841030c35967cde1a4e2a2004f47c |
MD5 | 748a49e8f111ab3155a96604af0bb535 |
BLAKE2b-256 | 396222f16b4a26f9935748fc585aecf5ebf7c9a572134446f1295cba7bee9776 |