Project description
Evaluation of pretrained language models on mono- or multilingual language tasks.
Maintainers
- Dan Saattrup Nielsen (@saattrupdan, dan.nielsen@alexandra.dk)
- Kenneth Enevoldsen (@KennethEnevoldsen, kenneth.enevoldsen@cas.au.dk)
Installation
To install the package, simply run the following command in your favorite terminal:
$ pip install scandeval
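If you want to verify the installation, a quick sanity check is to query the installed version from Python using only the standard library. This is just an illustrative sketch; the printed version will match whichever release you installed (8.2.1 at the time of writing):
>>> from importlib.metadata import version
>>> version("scandeval")
'8.2.1'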
Quickstart
Benchmarking from the Command Line
The easiest way to benchmark pretrained models is via the command line interface. After installing the package, you can benchmark your favorite model like so:
$ scandeval --model-id <model-id>
Here model-id is the Hugging Face model ID, which can be found on the Hugging Face Hub. By default this will benchmark the model on all eligible datasets. If you want to benchmark on a specific dataset, this can be done via the --dataset flag. For instance, the following will evaluate the model on the AngryTweets dataset:
$ scandeval --model-id <model-id> --dataset angry-tweets
We can also filter by language. To benchmark all Danish models on all Danish datasets, say, this can be done using the --language flag, like so:
$ scandeval --language da
Multiple models, datasets and/or languages can be specified by simply passing the corresponding arguments multiple times. Here is an example with two models:
$ scandeval --model-id <model-id1> --model-id <model-id2> --dataset angry-tweets
The specific model revision to use can also be specified by appending '@' and the revision to the model ID:
$ scandeval --model-id <model-id>@<commit>
The revision can be a branch name, a tag name, or a commit ID, and it defaults to 'main', which is the latest version.
See all the arguments and options available for the scandeval command by typing:
$ scandeval --help
Benchmarking from a Script
In a script, the syntax is similar to the command line interface. You simply initialise an object of the Benchmarker class and call this benchmark object with your favorite models and/or datasets:
>>> from scandeval import Benchmarker
>>> benchmark = Benchmarker()
>>> benchmark('<model-id>')
To benchmark on a specific dataset, you simply specify it as the second argument, shown here with the AngryTweets dataset again:
>>> benchmark('<model-id>', 'angry-tweets')
If you want to benchmark a subset of all the models on the Hugging Face Hub, you can specify several parameters in the Benchmarker initializer to narrow down the list of models to the ones you care about. As a simple example, the following would benchmark all the Nynorsk models on Nynorsk datasets:
>>> benchmark = Benchmarker(language='nn')
>>> benchmark()
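To give a rough idea of how these pieces fit together in a script, here is a minimal sketch that combines the language filter with a specific dataset and then reads the scores back in. The model ID is a placeholder, and the results-file name scandeval_benchmark_results.jsonl is an assumption about where the benchmark scores are written; adjust the path if your version stores results elsewhere.

import json
from pathlib import Path

from scandeval import Benchmarker

# Benchmark a (placeholder) model on the Danish AngryTweets dataset.
benchmark = Benchmarker(language="da")
benchmark("<model-id>", "angry-tweets")

# Assumption: scores are appended as JSON lines to a file in the current
# working directory; the exact file name may differ between versions.
results_file = Path("scandeval_benchmark_results.jsonl")
if results_file.exists():
    for line in results_file.read_text().splitlines():
        print(json.loads(line))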
Citing ScandEval
If you want to cite the framework, feel free to use this:
@inproceedings{nielsen2023scandeval,
  title={ScandEval: A Benchmark for Scandinavian Natural Language Processing},
  author={Nielsen, Dan Saattrup},
  booktitle={The 24th Nordic Conference on Computational Linguistics},
  year={2023}
}
Remarks
The image used in the logo has been created by the amazing Scandinavia and the World team. Go check them out!
Project structure
.
├── .github
│ └── workflows
│ └── ci.yaml
├── .gitignore
├── .pre-commit-config.yaml
├── CHANGELOG.md
├── LICENSE
├── README.md
├── docs
├── gfx
│ └── scandeval.png
├── makefile
├── poetry.lock
├── poetry.toml
├── pyproject.toml
├── src
│ ├── scandeval
│ │ ├── __init__.py
│ │ ├── benchmark_config_factory.py
│ │ ├── benchmark_dataset.py
│ │ ├── benchmarker.py
│ │ ├── callbacks.py
│ │ ├── cli.py
│ │ ├── config.py
│ │ ├── dataset_configs.py
│ │ ├── dataset_factory.py
│ │ ├── dataset_tasks.py
│ │ ├── enums.py
│ │ ├── exceptions.py
│ │ ├── finetuning.py
│ │ ├── generation.py
│ │ ├── languages.py
│ │ ├── model_config.py
│ │ ├── model_loading.py
│ │ ├── model_setups
│ │ │ ├── __init__.py
│ │ │ ├── fresh.py
│ │ │ ├── hf.py
│ │ │ ├── local.py
│ │ │ ├── openai.py
│ │ │ └── utils.py
│ │ ├── named_entity_recognition.py
│ │ ├── openai_models.py
│ │ ├── protocols.py
│ │ ├── question_answering.py
│ │ ├── question_answering_trainer.py
│ │ ├── scores.py
│ │ ├── sequence_classification.py
│ │ ├── speed_benchmark.py
│ │ ├── text_to_text.py
│ │ ├── types.py
│ │ └── utils.py
│ └── scripts
│ ├── create_angry_tweets.py
│ ├── create_dane.py
│ ├── create_mim_gold_ner.py
│ ├── create_mlsum.py
│ ├── create_no_sammendrag.py
│ ├── create_nordjylland_news.py
│ ├── create_norec.py
│ ├── create_norne.py
│ ├── create_rrn.py
│ ├── create_scala.py
│ ├── create_scandiqa.py
│ ├── create_suc3.py
│ ├── create_swedn.py
│ ├── create_swerec.py
│ ├── create_wiki_lingua_nl.py
│ ├── create_wikiann_fo.py
│ ├── fill_in_missing_model_metadata.py
│ ├── fix_dot_env_file.py
│ ├── load_ud_pos.py
│ └── versioning.py
└── tests
├── __init__.py
├── conftest.py
├── test_benchmark_config_factory.py
├── test_benchmark_dataset.py
├── test_benchmarker.py
├── test_callbacks.py
├── test_cli.py
├── test_config.py
├── test_dataset_configs.py
├── test_dataset_factory.py
├── test_dataset_tasks.py
├── test_enums.py
├── test_exceptions.py
├── test_languages.py
├── test_model_config.py
├── test_model_loading.py
├── test_named_entity_recognition.py
├── test_openai_models.py
├── test_question_answering.py
├── test_question_answering_trainer.py
├── test_scores.py
├── test_sequence_classification.py
├── test_speed_benchmark.py
├── test_types.py
└── test_utils.py
Hashes for scandeval-8.2.1-py3-none-any.whl

Algorithm | Hash digest
---|---
SHA256 | 84f800fac7e74c19460a94e4043faef6e3c3507ec9a02297a5f9ae22c00c2b8a
MD5 | 3a2fd80114cdbfe57d5d0122d2b292ab
BLAKE2b-256 | 27ea0ad6424d59b5bd201ad8a3273fb976beee2bb1c394aa7f7d35ac5efb669a