Interpretable Evaluation for Natural Language Processing
ExplainaBoard: An Explainable Leaderboard for NLP
Introduction
ExplainaBoard is an interpretable, interactive, and reliable leaderboard with seven (so far) new features (F) compared with generic leaderboards.
- F1: Single-system Analysis: What is a system good or bad at?
- F2: Pairwise Analysis: Where is one system better (worse) than another?
- F3: Data Bias Analysis: What are the characteristics of different evaluated datasets?
- F5: Common Errors: What are the common mistakes that the top-5 systems make?
- F6: Fine-grained Errors: Where will errors occur? (See the bucketed-analysis sketch below.)
- F7: System Combination: Is there potential complementarity between different systems?
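
To make F1 and F6 concrete, here is a minimal sketch (not the toolkit's actual code) of the kind of bucketed evaluation they rely on: test examples are grouped by an interpretable attribute such as sentence length, and each bucket is scored separately, which shows where a system's errors concentrate. The example format, the attribute name `sent_len`, and the function `bucketed_accuracy` are all hypothetical.

```python
# Minimal sketch of bucketed evaluation (hypothetical, not the toolkit's code):
# split test examples into buckets by an attribute, then score each bucket.
from collections import defaultdict

def bucketed_accuracy(examples, attribute, edges):
    """examples: dicts with 'gold', 'pred', and attribute values (assumed format)."""
    buckets = defaultdict(list)
    for ex in examples:
        # Assign each example to the first bucket whose interval covers it.
        for lo, hi in edges:
            if lo <= ex[attribute] < hi:
                buckets[(lo, hi)].append(ex["gold"] == ex["pred"])
                break
    return {b: sum(hits) / len(hits) for b, hits in buckets.items()}

examples = [
    {"gold": "pos", "pred": "pos", "sent_len": 7},
    {"gold": "neg", "pred": "pos", "sent_len": 23},
    {"gold": "neg", "pred": "neg", "sent_len": 41},
]
print(bucketed_accuracy(examples, "sent_len", [(0, 10), (10, 30), (30, 100)]))
# {(0, 10): 1.0, (10, 30): 0.0, (30, 100): 1.0} — short vs. long inputs scored separately
```

ExplainaBoard applies the same idea with task-specific attributes (the "Attribute" column in the table below).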
Website
We deploy ExplainaBoard as a web toolkit, which currently includes 9 NLP tasks, 40 datasets, and 300 systems. Detailed information is as follows.
Task | Sub-task | Dataset | Model | Attribute |
---|---|---|---|---|
Text Classification | Sentiment | 8 | 40 | 2 |
 | Topics | 4 | 18 | 2 |
 | Intention | 1 | 3 | 2 |
Text-Span Classification | Aspect Sentiment | 4 | 20 | 4 |
Text pair Classification | NLI | 2 | 6 | 7 |
Sequence Labeling | NER | 3 | 74 | 9 |
 | POS | 3 | 14 | 4 |
 | Chunking | 3 | 14 | 9 |
 | CWS | 7 | 64 | 7 |
Structure Prediction | Semantic Parsing | 4 | 12 | 4 |
Text Generation | Summarization | 2 | 36 | 7 |
Download System Outputs
We have not released datasets or corresponding system outputs that require licenses. If you have the licenses, please fill in this form and we will send them to you privately. (The format of the outputs is described here.) If these system outputs are useful for you, you can cite our work.
Test Your Results
```
pip install -r requirements.txt
```
Description of Each Directory
- `task-[task_name]`: fine-grained analysis for each task, which generates fine-grained analysis results in JSON format. For example, `task-mlqa` calculates fine-grained F1 scores for different systems and writes the corresponding JSON files to `task-mlqa/output/`.
- `meta-eval`: a controller that starts the fine-grained analysis of all tasks and processes the output JSON files.
  - Calculate fine-grained results for all tasks with `./meta-eval/run-allTasks.sh`:
    ```
    cd ./meta-eval/
    ./run-allTasks.sh
    ```
  - Merge the JSON files of all tasks into a CSV file, which is useful for a further SQL import, with `./meta-eval/genCSV/json2csv.py` (see the sketch after this list):
    ```
    cd ./meta-eval/genCSV/
    python json2csv.py > explainaboard.csv
    ```
- `src`: stores some auxiliary code.
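
As an illustration of the JSON-to-CSV merge step, here is a minimal sketch. It is not the repository's actual `json2csv.py`: the file layout (one JSON file of bucket results per task under `task-*/output/`) and the field names `task`, `bucket`, and `score` are assumptions.

```python
# Hypothetical sketch of merging per-task JSON results into one CSV
# (not the repository's json2csv.py; file layout and field names are assumed).
import csv
import glob
import json
import sys

writer = csv.writer(sys.stdout)
writer.writerow(["task", "bucket", "score"])  # header row for the SQL import
for path in glob.glob("task-*/output/*.json"):
    with open(path) as f:
        results = json.load(f)  # assumed shape: {"task": ..., "buckets": {name: score}}
    for bucket, score in results["buckets"].items():
        writer.writerow([results["task"], bucket, score])
```

Running it with stdout redirected to a file mirrors the `python json2csv.py > explainaboard.csv` invocation above.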
Submit Your Results
You can submit your system's output via this form, following the format description.
Acknowledgement
We thank all the authors who shared their system outputs with us: Ikuya Yamada, Stefan Schweter, Colin Raffel, Yang Liu, and Li Dong. We also thank Vijay Viswanathan, Yiran Chen, and Hiroaki Hayashi for useful discussions and feedback about ExplainaBoard.