sentence-embedding-evaluation-german
Sentence embedding evaluation for German.
This library is inspired by SentEval but focuses on German-language downstream tasks.
Downstream tasks
task | type | properties | #train | #test | target | info |
---|---|---|---|---|---|---|
TOXIC | toxic comments | Facebook comments | 3244 | 944 | binary {0,1} | GermEval 2021, comments subtask 1 |
ENGAGE | engaging comments | Facebook comments | 3244 | 944 | binary {0,1} | GermEval 2021, comments subtask 2 |
FCLAIM | fact-claiming comments | Facebook comments | 3244 | 944 | binary {0,1} | GermEval 2021, comments subtask 3 |
VMWE | verbal idioms | newspaper | 6652 | 1447 | binary (figuratively, literally) | GermEval 2021, verbal idioms |
OL19-A | offensive language | tweets | 3980 | 3031 | binary {0,1} | GermEval 2019 |
OL19-B | offensive language, fine-grained | tweets | 3980 | 3031 | 4 catg. (profanity, insult, abuse, other) | GermEval 2019 |
OL19-C | explicit vs. implicit offense | tweets | 1921 | 930 | binary (explicit, implicit) | GermEval 2019 |
OL18-A | offensive language | tweets | 5009 | 3398 | binary {0,1} | GermEval 2018 |
OL18-B | offensive language, fine-grained | tweets | 5009 | 3398 | 4 catg. (profanity, insult, abuse, other) | GermEval 2018 |
ABSD-1 | relevance classification | 'Deutsche Bahn' customer feedback, lang:de-DE | 19432 | 2555 | binary | GermEval 2017 |
ABSD-2 | sentiment analysis | 'Deutsche Bahn' customer feedback, lang:de-DE | 19432 | 2555 | 3 catg. (pos., neg., neutral) | GermEval 2017 |
ABSD-3 | aspect categories | 'Deutsche Bahn' customer feedback, lang:de-DE | 19432 | 2555 | 20 catg. | GermEval 2017 |
MIO-S | sentiment analysis | 'Der Standard' newspaper article web comments, lang:de-AT | 1799 | 1800 | 3 catg. | One Million Posts Corpus |
MIO-O | off-topic comments | 'Der Standard' newspaper article web comments, lang:de-AT | 1799 | 1800 | binary | One Million Posts Corpus |
MIO-I | inappropriate comments | 'Der Standard' newspaper article web comments, lang:de-AT | 1799 | 1800 | binary | One Million Posts Corpus |
MIO-D | discriminating comments | 'Der Standard' newspaper article web comments, lang:de-AT | 1799 | 1800 | binary | One Million Posts Corpus |
MIO-F | feedback comments | 'Der Standard' newspaper article web comments, lang:de-AT | 3019 | 3019 | binary | One Million Posts Corpus |
MIO-P | personal story comments | 'Der Standard' newspaper article web comments, lang:de-AT | 4668 | 4668 | binary | One Million Posts Corpus |
MIO-A | argumentative comments | 'Der Standard' newspaper article web comments, lang:de-AT | 1799 | 1800 | binary | One Million Posts Corpus |
SBCH-L | Swiss German detection | 'chatmania' app comments, lang:gsw | 748 | 748 | binary | SB-CH Corpus |
SBCH-S | sentiment analysis | 'chatmania' app comments, only comments labelled as Swiss German are included, lang:gsw | 394 | 394 | 3 catg. | SB-CH Corpus |
ARCHI | Swiss German dialect classification | lang:gsw | 18809 | 4743 | 4 catg. | ArchiMob |
LSDC | Lower Saxon dialect classification | lang:nds | 74140 | 8602 | 14 catg. | LSDC |
Download datasets
bash download-datasets.sh
Usage example
from typing import List
import sentence_embedding_evaluation_german as seeg
import torch
# (1) Instantiate your Embedding model
emb_dim = 512
vocab_sz = 128
emb = torch.randn((vocab_sz, emb_dim), requires_grad=False)
emb = torch.nn.Embedding.from_pretrained(emb)
assert not emb.weight.requires_grad
# (2) Specify the preprocessing
def preprocesser(batch: List[str], params: dict = None) -> List[List[float]]:
    """ Specify your embedding or pretrained encoder here

    Parameters:
    -----------
    batch : List[str]
        A list of sentences as strings
    params : dict
        The params dictionary

    Returns:
    --------
    List[List[float]]
        A list of embedding vectors
    """
    features = []
    for sent in batch:
        # map each character to a pseudo token ID in [0, 127]
        ids = torch.tensor([ord(c) % 128 for c in sent])
        # average the character embeddings into one sentence vector
        h = emb(ids)
        features.append(h.mean(dim=0))
    features = torch.stack(features, dim=0)
    return features
# (3) Training settings
params = {
'datafolder': '../datasets',
'batch_size': 128,
'num_epochs': 20,
# 'early_stopping': True,
# 'split_ratio': 0.2, # if early_stopping=True
# 'patience': 5, # if early_stopping=True
}
# (4) Specify downstream tasks
downstream_tasks = ['FCLAIM', 'VMWE', 'OL19-C', 'ABSD-2', 'MIO-P', 'ARCHI', 'LSDC']
# (5) Run experiments
results = seeg.evaluate(downstream_tasks, preprocesser, **params)
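The random character embedding above is only a stand-in: any function with the same `(batch, params)` signature can serve as the `preprocesser`. As a further illustration, here is a minimal, model-free sketch that hashes character trigrams into fixed-size count vectors; the function name `hash_ngram_preprocesser` and the choices `dim=1024`, `n=3` are illustrative assumptions, not part of the library's API.

```python
import hashlib
from typing import List

def hash_ngram_preprocesser(batch: List[str], params: dict = None,
                            dim: int = 1024, n: int = 3) -> List[List[float]]:
    """Bag-of-character-trigrams baseline via feature hashing."""
    features = []
    for sent in batch:
        vec = [0.0] * dim
        # pad the sentence so boundary characters form trigrams too
        padded = f"#{sent}#"
        for i in range(max(1, len(padded) - n + 1)):
            gram = padded[i:i + n]
            # hash each trigram to a stable bucket index and count it
            idx = int(hashlib.md5(gram.encode("utf-8")).hexdigest(), 16) % dim
            vec[idx] += 1.0
        features.append(vec)
    return features
```

Such a baseline is useful for sanity-checking the evaluation pipeline before plugging in a trained sentence encoder.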
Appendix
Installation
The sentence-embedding-evaluation-german git repo is available as a PyPI package:

pip install sentence-embedding-evaluation-german

Alternatively, install directly from the git repo:

pip install git+ssh://git@github.com/ulf1/sentence-embedding-evaluation-german.git
Install a virtual environment
python3.7 -m venv .venv
source .venv/bin/activate
pip install --upgrade pip
pip install -r requirements.txt --no-cache-dir
pip install -r requirements-dev.txt --no-cache-dir
pip install -r requirements-demo.txt --no-cache-dir
(If your git repo is stored in a folder with whitespaces, then don't use the subfolder .venv. Use an absolute path without whitespaces.)
Python commands
- Jupyter for the examples:
jupyter lab
- Check syntax:
flake8 --ignore=F401 --exclude=$(grep -v '^#' .gitignore | xargs | sed -e 's/ /,/g')
Publish
pandoc README.md --from markdown --to rst -s -o README.rst
python setup.py sdist
twine upload -r pypi dist/*
Clean up
find . -type f -name "*.pyc" | xargs rm
find . -type d -name "__pycache__" | xargs rm -r
rm -r .pytest_cache
rm -r .venv
Support
Please open an issue for support.
Contributing
Please contribute using Github Flow. Create a branch, add commits, and open a pull request.
Hashes for sentence-embedding-evaluation-german-0.1.0.tar.gz

Algorithm | Hash digest |
---|---|
SHA256 | b7a95247c0d048da28c4abcaa02050070df841030c35967cde1a4e2a2004f47c |
MD5 | 748a49e8f111ab3155a96604af0bb535 |
BLAKE2b-256 | 396222f16b4a26f9935748fc585aecf5ebf7c9a572134446f1295cba7bee9776 |