Annotation error detection and correction
nessie is a package for annotation error detection. It can be used to automatically detect errors in annotated corpora so that human annotators can concentrate on a subset to correct, instead of needing to look at each and every instance.
💡 Please also refer to our additional documentation! It contains detailed explanations and code examples.
Contact person: Jan-Christoph Klie
https://www.ukp.tu-darmstadt.de
https://www.tu-darmstadt.de
Don't hesitate to report an issue if something is broken (and it shouldn't be) or if you have further questions.
⚠️ This repository contains experimental software and is published for the sole purpose of giving additional background details on the respective publication.
Please use the following citation when using our software:
```bibtex
@misc{https://doi.org/10.48550/arxiv.2206.02280,
  doi       = {10.48550/ARXIV.2206.02280},
  url       = {https://arxiv.org/abs/2206.02280},
  author    = {Klie, Jan-Christoph and Webber, Bonnie and Gurevych, Iryna},
  title     = {Annotation Error Detection: Analyzing the Past and Present for a More Coherent Future},
  publisher = {arXiv},
  year      = {2022}
}
```
Installation
```bash
pip install nessie
```
This installs the package with default dependencies and a CPU-only build of PyTorch. If you want to use your own PyTorch version (e.g., with CUDA enabled), you need to install it manually afterwards. The same goes for faiss-gpu: if you need it, install it manually after installing nessie.
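For example, to replace the CPU-only build with a CUDA-enabled one (the exact index URL depends on your CUDA version; `cu118` below is only one possibility, see https://pytorch.org/get-started/locally/ for the command matching your setup):

```bash
# Swap in a CUDA-enabled PyTorch build (example: CUDA 11.8).
pip install torch --index-url https://download.pytorch.org/whl/cu118
# Optional: faiss with GPU support for the nearest-neighbor-based detectors.
pip install faiss-gpu
```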
Basic Usage
Given annotated data, this package can be used to find potential errors. For instance, Retag (that is, training a model, letting it predict on your data, and then flagging instances where the model predictions disagree with the given labels) can be used as follows:
```python
from nessie.dataloader import load_example_text_classification_data
from nessie.detectors import Retag
from nessie.helper import CrossValidationHelper
from nessie.models.text import DummyTextClassifier

# Load a small example dataset together with its (noisy) labels.
text_data = load_example_text_classification_data().subset(100)

# Obtain out-of-fold predictions via 10-fold cross-validation.
cv = CrossValidationHelper(n_splits=10)
tc_result = cv.run(text_data.texts, text_data.noisy_labels, DummyTextClassifier())

# Flag instances whose model prediction disagrees with the given label.
detector = Retag()
flags = detector.score(text_data.noisy_labels, tc_result.predictions)
```
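The returned flags line up one-to-one with the input instances. Assuming they come back as a boolean numpy array (check the detector's return type in the documentation), you can use them to pull out the items a human should re-check:

```python
import numpy as np

# Sketch: select flagged instances for manual review
# (assumes `flags` is a boolean array aligned with the inputs).
texts = np.asarray(text_data.texts, dtype=object)
labels = np.asarray(text_data.noisy_labels, dtype=object)
print(f"{int(np.sum(flags))} of {len(flags)} instances flagged")
for text, label in zip(texts[flags], labels[flags]):
    print(f"{label}\t{text}")
```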
Methods
We implement a wide range of annotation error detection methods. These are divided into two categories: flaggers and scorers. Flaggers give a binary judgement on whether an instance is considered wrong; scorers give a certainty estimate of how likely it is that an instance is wrong (see the sketch after the tables below).
Flagger
Abbreviation | Method | Text | Token | Span | Proposed by
---|---|---|---|---|---
CL | Confident Learning | ✓ | ✓ | ✓ | Northcutt (2021)
CS | Curriculum Spotter | ✓ | | | Amiri (2018)
DE | Diverse Ensemble | ✓ | ✓ | ✓ | Loftsson (2009)
IRT | Item Response Theory | ✓ | ✓ | ✓ | Rodriguez (2021)
LA | Label Aggregation | ✓ | ✓ | ✓ | Amiri (2018)
LS | Leitner Spotter | ✓ | | | Amiri (2018)
PE | Projection Ensemble | ✓ | ✓ | ✓ | Reiss (2020)
RE | Retag | ✓ | ✓ | ✓ | van Halteren (2000)
VN | Variation n-Grams | | ✓ | ✓ | Dickinson (2003)
Scorer
Abbreviation | Method | Text | Token | Span | Proposed by
---|---|---|---|---|---
BC | Borda Count | ✓ | ✓ | ✓ | Larson (2020)
CU | Classification Uncertainty | ✓ | ✓ | ✓ | Hendrycks (2017)
DM | Data Map Confidence | ✓ | ✓ | ✓ | Swayamdipta (2020)
DU | Dropout Uncertainty | ✓ | ✓ | ✓ | Amiri (2018)
KNN | k-Nearest Neighbor Entropy | ✓ | ✓ | ✓ | Grivas (2020)
LE | Label Entropy | | ✓ | ✓ | Hollenstein (2016)
MD | Mean Distance | ✓ | ✓ | ✓ | Larson (2019)
PM | Prediction Margin | ✓ | ✓ | ✓ | Dligach (2011)
WD | Weighted Discrepancy | | ✓ | ✓ | Hollenstein (2016)
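Scorers plug into the same workflow; instead of binary flags, they return one score per instance that can be used to rank items for review. The following is a minimal sketch using Classification Uncertainty; we assume here that the detector consumes the noisy labels together with the class probabilities and the label encoder from the cross-validation result, so please check the exact signature in the documentation:

```python
from nessie.detectors import ClassificationUncertainty

# Reuses `text_data` and `tc_result` from the Basic Usage example above.
detector = ClassificationUncertainty()
scores = detector.score(
    labels=text_data.noisy_labels,
    probabilities=tc_result.probabilities,
    le=tc_result.le,  # label encoder mapping labels to probability columns
)

# Higher scores indicate likelier annotation errors; rank for review.
ranking = scores.argsort()[::-1]
```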
Models
Model-based annotation error detection methods need trained models to obtain predictions or probabilities. We already implement the most common models, ready to use. You can add your own models by implementing the respective abstract class, TextClassifier or SequenceTagger.
We provide the following models:
Text classification
Class name | Description
---|---
FastTextTextClassifier | fastText
FlairTextClassifier | Flair
LgbmTextClassifier | LightGBM with handcrafted features
LgbmTextClassifier | LightGBM with S-BERT features
MaxEntTextClassifier | Logistic Regression with handcrafted features
MaxEntTextClassifier | Logistic Regression with S-BERT features
TransformerTextClassifier | Transformers
You can easily add your own sklearn classifiers by subclassing SklearnTextClassifier, like the following:

```python
from sklearn.linear_model import LogisticRegression

class MaxEntTextClassifier(SklearnTextClassifier):
    def __init__(self, embedder: SentenceEmbedder, max_iter=10000):
        # Wrap a scikit-learn estimator factory together with a sentence embedder.
        super().__init__(lambda: LogisticRegression(max_iter=max_iter, random_state=RANDOM_STATE), embedder)
```
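The custom classifier then drops into the same cross-validation loop as the built-in models. A short sketch, where `my_embedder` stands in for a hypothetical, already constructed SentenceEmbedder instance (which implementations are available depends on your setup; see the documentation):

```python
# `my_embedder` is a placeholder for a concrete SentenceEmbedder.
model = MaxEntTextClassifier(embedder=my_embedder, max_iter=10000)
result = cv.run(text_data.texts, text_data.noisy_labels, model)
```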
Sequence Classification
Class name | Description |
---|---
FlairSequenceTagger | Flair |
CrfSequenceTagger | CRF with handcrafted features |
MaxEntSequenceTagger | Maximum entropy sequence tagger
TransformerSequenceTagger | Transformer |
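The same workflow carries over to token-labeled data, where each sentence brings its own list of labels. A minimal sketch under the assumption that the example loader, the ragged cross-validation helper, and the dummy tagger are named as in our documentation (load_example_token_labeling_data, run_for_ragged, and DummySequenceTagger; verify these names against your installed version):

```python
from nessie.dataloader import load_example_token_labeling_data
from nessie.helper import CrossValidationHelper
from nessie.models.tagging import DummySequenceTagger

# Token-labeled data is "ragged": one list of labels per sentence.
token_data = load_example_token_labeling_data().subset(100)

cv = CrossValidationHelper(n_splits=10)
result = cv.run_for_ragged(token_data.sentences, token_data.noisy_labels, DummySequenceTagger())

# Detectors such as Retag are then applied to the flattened labels and
# predictions, analogously to the text classification example above.
```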
Development
We use flit for dependency management and packaging. Follow their documentation to install it. Then you can run

```bash
flit install -s
```

to download the dependencies and install nessie into its own environment. To install your own PyTorch with CUDA support, you can run

```bash
make force-cuda113
```

or install it manually into that environment. You can format the code via

```bash
make format
```

which should be run before every commit.
Bibliography
Amiri, Hadi, Timothy Miller, and Guergana Savova. 2018. "Spotting Spurious Data with Neural Networks." Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), 2006-16. New Orleans, Louisiana.
Dligach, Dmitriy, and Martha Palmer. 2011. "Reducing the Need for Double Annotation." Proceedings of the 5th Linguistic Annotation Workshop, 65-73. Portland, Oregon, USA.
Grivas, Andreas, Beatrice Alex, Claire Grover, Richard Tobin, and William Whiteley. 2020. "Not a Cute Stroke: Analysis of Rule- and Neural Network-based Information Extraction Systems for Brain Radiology Reports." Proceedings of the 11th International Workshop on Health Text Mining and Information Analysis, 24-37. Online.
Hendrycks, Dan, and Kevin Gimpel. 2017. "A Baseline for Detecting Misclassified and Out-of-Distribution Examples in Neural Networks." Proceedings of International Conference on Learning Representations, 1-12.
Hollenstein, Nora, Nathan Schneider, and Bonnie Webber. 2016. "Inconsistency Detection in Semantic Annotation." Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC’16), 3986-90. Portorož, Slovenia.
Larson, Stefan, Anish Mahendran, Andrew Lee, Jonathan K. Kummerfeld, Parker Hill, Michael A. Laurenzano, Johann Hauswald, Lingjia Tang, and Jason Mars. 2019. "Outlier Detection for Improved Data Quality and Diversity in Dialog Systems." Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), 517-27. Minneapolis, Minnesota.
Loftsson, Hrafn. 2009. "Correcting a POS-Tagged Corpus Using Three Complementary Methods." Proceedings of the 12th Conference of the European Chapter of the ACL (EACL 2009), 523-31. Athens, Greece.
Northcutt, Curtis, Lu Jiang, and Isaac Chuang. 2021. "Confident Learning: Estimating Uncertainty in Dataset Labels." Journal of Artificial Intelligence Research 70 (April): 1373-1411.
Reiss, Frederick, Hong Xu, Bryan Cutler, Karthik Muthuraman, and Zachary Eichenberger. 2020. "Identifying Incorrect Labels in the CoNLL-2003 Corpus." Proceedings of the 24th Conference on Computational Natural Language Learning, 215-26. Online.
Rodriguez, Pedro, Joe Barrow, Alexander Miserlis Hoyle, John P. Lalor, Robin Jia, and Jordan Boyd-Graber. 2021. "Evaluation Examples Are Not Equally Informative: How Should That Change NLP Leaderboards?" Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), 4486-4503. Online.
Swayamdipta, Swabha, Roy Schwartz, Nicholas Lourie, Yizhong Wang, Hannaneh Hajishirzi, Noah A. Smith, and Yejin Choi. 2020. "Dataset Cartography: Mapping and Diagnosing Datasets with Training Dynamics." Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), 9275-93. Online.
van Halteren, Hans. 2000. "The Detection of Inconsistency in Manually Tagged Text." Proceedings of the COLING-2000 Workshop on Linguistically Interpreted Corpora, 48-55. Luxembourg.