A Command-Line Tool for Inter-Annotator Agreement Evaluation
Project description
IAA-Eval: A Command-Line Tool for Inter-Annotator Agreement Evaluation
Introduction
Inter-Annotator Agreement (IAA) is a crucial method for evaluating the quality of annotations , as it measures the reliability and reproducibility of the annotation process [1-2]. A high IAA suggests that the annotations are consistent and reliable, while a low IAA indicates potential problems with the annotation guidelines or the annotators' understanding of the data [1]. IAA is not merely an indicator of reliability; it's a tool for understanding sources of variation in the data and for improving the annotations [1].
Here are the key interests of inter-annotator agreement and different ways to calculate it, according to the sources:
- Reliability Assessment: The main purpose of IAA is to determine if the annotation process is reliable, meaning it produces consistent results. If annotators disagree, this highlights a problem in the process or in how guidelines are interpreted. Reliable annotation is a necessary but not sufficient condition for correct annotations [1-2].
- Ambiguity Identification: Disagreements between annotators reveal ambiguities or difficulties in the data or annotation guidelines. This helps to identify areas that require clarification and improvement [2, 3].
- Improvement of Annotation Guidelines: Analyzing disagreements can lead to adjustments and refinements of the annotation guidelines, thus enhancing the consistency of annotations. The process of improving guidelines is iterative, with regular reliability tests and adjustments until an acceptable level of agreement is achieved [2 3].
- Understanding Data Variation: IAA helps to identify parts of the data or types of annotations that are more difficult or less reliable than others. By segmenting the data and calculating agreement for each part, a better understanding of the annotation complexity can be achieved [1].
- Annotator Selection: IAA analysis can help identify annotators who are better suited to a specific annotation task. Some annotators may be more consistent in certain types of annotations than others [2].
- Measuring Impact of Semantic Context: By introducing semantic contexts during annotation, for example, for hate speech, IAA helps measure the impact of this information on convergence between annotators, especially between those belonging to the target population and those who do not [4].
Different methods exist for calculating IAA:
- Raw Agreement (Observed Agreement): This is the simplest measure, which calculates the percentage of items on which the annotators agree. However, it doesn't account for agreement that could be obtained by chance [1].
- Chance-Corrected Measures: These measures are used to correct the raw agreement by taking into account the probability of random agreement. The most common include:
- Cohen's Kappa: This measure is often used for categorical annotations between two annotators. It calculates the observed agreement beyond the agreement expected by chance, taking into account potential annotator biases. Several sources recommend choosing a metric based on the annotation task [1].
- Fleiss' Kappa: This coefficient is used to measure agreement between multiple annotators and is particularly suitable for classification evaluation [5].
- Krippendorff's Alpha (α): Similar to Fleiss' Kappa but more flexible as it can take into account different levels of disagreement and is suitable for incomplete data or with a variable number of annotators [6]. The sources show a range of alpha scores based on different annotation tasks, with some scores showing substantial agreement [1].
- F-measure: This measure calculates agreement using precision, recall, and their harmonic mean. It is particularly useful for tasks such as named entity recognition or classification, where the ability of a model to identify correct entities and its ability to identify all relevant entities must be balanced.
- Boundary-Weighted Fleiss' Kappa (BWFK): This method is used to evaluate agreement in tissue segmentation, particularly on tissue boundaries. It reduces the impact of minor disagreements that often occur along these boundaries [10]. This method can be used only on binary segmentation masks.
- Distance-Based Cell Agreement Algorithm (DBCAA): An algorithm that measures agreement in cell detection without relying on ground truth [10].
- Intersection over Union (IoU)-Based Measures: These measures are used for regional segmentation but are limited for assessments involving more than two observers [1].
- Intra-class correlation coefficient (ICC): This coefficient measures the degree of agreement between multiple annotators by comparing the variance between annotators to the total variance [11].
It is important to note that the choice of IAA measure depends on the nature of the data and the annotation task [1]. Additionally, IAA should not be considered an end in itself, but rather as a tool to understand the data and improve the annotation process. Disagreements between annotators can provide valuable insights [1, 9].
IAA-Eval: Features and Usage
IAA-Eval is designed to be a user-friendly CLI tool with the following key features:
- Data Loading and Preprocessing: Reads annotation data from various file formats, including CSV and JSON. Supports different annotation types (categorical, numerical, etc.)
- Test Data Note: For testing purposes, the file
Tests/Assets/Reviews_annotated.csvhas been generated using the program available at LLM-Tests, which provides sentiment annotations for Amazon reviews. - IAA Calculation: Computes various IAA metrics, including Cohen's Kappa, Fleiss' Kappa, Krippendorff's Alpha, and F-measure. Enables pairwise and overall IAA calculation [1].
- Data Filtering: Allows for data filtering by annotator or item to assess agreement subsets.
- Visualization: Generates graphs and tables to present IAA results clearly, aiding in interpretation. Provides options for saving these visualizations to files.
- Disagreement Analysis: Identifies and analyzes specific disagreements between annotators, including detailed reports and statistics. Allows for exporting disagreements for further analysis.
- Configuration Options: Provides flexible options for input data formats, IAA metrics, output directories, and more.
Basic Usage
The basic command for running IAA-Eval is:
python iaa_eval.py input_file.csv [options]
Common Options:
-
Input/Output:
input_file.csv: Path to the CSV file containing annotation data--output <file>: Path to save the results (default: print to console)--output-format <format>: Format for output: text, csv, json, html, console (default: text)
-
Agreement Measures:
--all: Calculate all applicable agreement measures--raw: Calculate raw agreement (percentage agreement)--cohen-kappa: Calculate Cohen's Kappa--fleiss-kappa: Calculate Fleiss' Kappa--krippendorff-alpha: Calculate Krippendorff's Alpha--f-measure: Calculate F-measure--icc: Calculate Intraclass Correlation Coefficient--bwfk: Calculate Boundary-Weighted Fleiss' Kappa--dbcaa: Calculate Distance-Based Cell Agreement Algorithm--iou: Calculate Intersection over Union
-
Advanced Options:
--confidence-interval <value>: Confidence level (0-1) for intervals (default: 0.95)--confidence-method <method>: Method for confidence intervals: bootstrap, normal, wilson, agresti-coull (default: wilson)--bootstrap-samples <n>: Number of bootstrap samples (default: 1000)--positive-class <class>: Specify positive class for F-measure--distance-threshold <value>: Distance threshold for DBCAA (default: 10.0)--bwfk-width <n>: Width parameter for BWFK (default: 5)--icc-form <form>: ICC form to use: 1,1|2,1|3,1|1,k|2,k|3,k (default: 2,1)
-
Logging:
-v <level>: Set verbosity level (0=error, 1=warning, 2=info, 3=debug)
Examples:
# Calculate all agreement measures
python iaa_eval.py annotations.csv --all
# Calculate specific measures with confidence intervals
python iaa_eval.py annotations.csv --cohen-kappa --fleiss-kappa --confidence-interval 0.95
# Calculate ICC with specific form and save results to CSV
python iaa_eval.py annotations.csv --icc --icc-form 3,1 --output results.csv --output-format csv
# Calculate F-measure with a specific positive class
python iaa_eval.py annotations.csv --f-measure --positive-class 1
# Get detailed help and options
python iaa_eval.py --help
python iaa_eval.py --show-options
For a complete overview of all options and their descriptions, use:
python iaa_eval.py --show-options
Data Format
The tool accepts CSV files with the following format:
- Each row represents an annotated item (e.g., a review)
- Columns should include scores from each annotator with the suffix
_score - For example:
Annotator1_score,Annotator2_score, etc.
Example of accepted CSV format:
id,text,Annotator1_score,Annotator2_score,Annotator3_score
1,"Text to annotate 1",5,4,5
2,"Text to annotate 2",3,3,4
3,"Text to annotate 3",1,2,1
Missing values are allowed and will be ignored when calculating agreement.
Core Modules
raw_agreement.py
This module calculates raw agreement between annotators. It provides the following functionalities:
calculate_pairwise(): Calculates pairwise agreement between all annotatorscalculate_overall(): Calculates overall agreement across all annotatorsget_agreement_statistics(): Provides agreement statistics (overall, average, min, max)
Agreement is calculated as the proportion of items for which annotators gave the same score.
dataPreparation.py
This module handles loading and preprocessing annotation data.
confident_interval.py
This module calculates confidence intervals for agreement measures. Confidence intervals provide a range of values that likely contain the true agreement value, helping to assess the reliability of the calculated agreement.
The module supports several methods for calculating confidence intervals:
-
Bootstrap Method: This non-parametric approach generates multiple resampled datasets from the original data, calculates agreement for each sample, and determines the confidence interval from the distribution of these values. It's robust and doesn't require assumptions about the underlying distribution.
-
Normal Approximation: This method assumes that the sampling distribution of agreement follows a normal distribution. It uses the standard error of the agreement measure to calculate confidence intervals. This approach is computationally efficient but assumes normality.
-
Wilson Score Interval: Particularly useful for proportions (like agreement scores), this method provides better coverage than normal approximation, especially for extreme agreement values (near 0 or 1) or small sample sizes.
-
Agresti-Coull Interval: An improved version of the Wilson Score method that adds "pseudo-observations" to the sample, resulting in intervals with better coverage properties.
The choice of method depends on factors such as sample size, agreement values, and computational constraints:
- For small samples or extreme agreement values, bootstrap or Wilson Score methods are recommended
- For large samples with moderate agreement values, normal approximation is computationally efficient
- When in doubt, the bootstrap method provides robust estimates without distributional assumptions
The confidence level (typically 95%) indicates the probability that the true agreement value falls within the calculated interval.
Advanced Features
• Semantic Information: The application can potentially incorporate semantic information to analyze how it affects IAA, building on research which suggests that providing background semantics increases inter-annotator agreement [1-2].
• Support for diverse data types: The application can be extended to support data coming from different annotation tasks (e.g., segmentation, cell detection), and metrics (e.g., BWFK, DBCAA) [2].
• Annotation variation analysis : The application can incorporate a tool to analyze how the introduction of additional information influences agreement, for example, by calculating changes in the metrics before and after using semantics [1-2].
• Integration with other tools: The application can easily export results in common data format, for use in external data analysis tools.
Bibliography
[1] Artstein, R. (2017). Inter-annotator Agreement. In: Ide, N., Pustejovsky, J. (eds) Handbook of Linguistic Annotation. Springer, Dordrecht. https://doi.org/10.1007/978-94-024-0881-2_11
[2] Vámos, Csilla et al. 'Ontology of Active and Passive Environmental Exposure'. 1 Jan. 2024 : 1733 – 1761
[3] Cheng, Xiang, Raveesh Mayya, and João Sedoc. "From Human Annotation to LLMs: SILICON Annotation Workflow for Management Research." arXiv preprint arXiv:2412.14461 (2024).
[4] Reyero Lobo, Paula, et al. "Enhancing Hate Speech Annotations with Background Semantics." ECAI 2024 . IOS Press, 2024. 3923-3930.
[5] McHugh ML. Interrater reliability: the kappa statistic. Biochem Med (Zagreb). 2012;22(3):276-82. PMID: 23092060; PMCID: PMC3900052.
[6] Hayes, A.F., Krippendorff, K.: Answering the call for a standard reliability measure for coding data. Commun. Methods Meas. 1(1), 77–89 (2007)
[7] Zhang, Ziqi, Sam Chapman, and Fabio Ciravegna. "A methodology towards effective and efficient manual document annotation: addressing annotator discrepancy and annotation quality." Knowledge Engineering and Management by the Masses: 17th International Conference, EKAW 2010, Lisbon, Portugal, October 11-15, 2010. Proceedings 17 . Springer Berlin Heidelberg, 2010.
[8] Reyero Lobo, Paula, et al. "Enhancing Hate Speech Annotations with Background Semantics." ECAI 2024 . IOS Press, 2024. 3923-3930.
[9] Krippendorff, K.: Reliability in content analysis: some common misconceptions and recom-mendations. Hum. Commun. Res. 30(3), 411–433 (2004)
[10] Capar, Abdulkerim & Ekinci, Dursun & Ertano, Mucahit & Niazi, M. & Balaban, Erva & Aloglu, Ibrahim & Dogan, Meryem & Su, Ziyu & Aker, Fugen & Gurcan, Metin. (2024). An interpretable framework for inter-observer agreement measurements in TILs scoring on histopathological breast images: A proof-of-principle study. PLOS ONE. 19. 10.1371/journal.pone.0314450.
[11] Koo TK, Li MY. A Guideline of Selecting and Reporting Intraclass Correlation Coefficients for Reliability Research. J Chiropr Med. 2016 Jun;15(2):155-63. doi: 10.1016/j.jcm.2016.02.012. Epub 2016 Mar 31. Erratum in: J Chiropr Med. 2017 Dec;16(4):346. doi: 10.1016/j.jcm.2017.10.001. PMID: 27330520; PMCID: PMC4913118.
Contributing
We welcome contributions to AnnotationQuality! Whether you're fixing bugs, adding new features, improving documentation, or reporting issues, your help is appreciated. Here's how you can contribute:
Getting Started
-
Fork the Repository
- Fork the repository on GitHub
- Clone your fork locally:
git clone https://github.com/yourusername/AnnotationQuality.git - Add the upstream repository:
git remote add upstream https://github.com/Wameuh/AnnotationQuality.git
-
Set Up Development Environment
- Create a virtual environment:
python -m venv venv - Activate the virtual environment:
- Windows:
.\venv\Scripts\activate - Unix/MacOS:
source venv/bin/activate
- Windows:
- Install dependencies:
pip install -r requirements.txt - Install development dependencies:
pip install -r requirements-dev.txt
- Create a virtual environment:
Development Guidelines
-
Code Style
- Follow PEP 8 guidelines
- Use 4 spaces for indentation
- Limit line length to 79 characters
- Use descriptive variable and function names in snake_case
- Use PascalCase for class names
- Add docstrings following Google Python Style Guide
-
Testing
- Write tests for all new functionality
- Aim for 100% test coverage
- Include both unit and integration tests
- Run tests using:
pytest # Run all tests pytest Tests/UnitTests/test_file.py -v # Test specific file pytest --cov=src --cov-report=term-missing # Check coverage
-
Documentation
- Update documentation for any changes
- Include docstrings for all functions and classes
- Update README.md if adding new features
- Add examples for new functionality
-
Version Control
- Create a new branch for each feature/fix
- Follow the GitHub Flow branching strategy
- Use Conventional Commits for commit messages
- Keep commits focused and atomic
Submitting Changes
-
Before Submitting
- Ensure all tests pass
- Run linting checks:
flake8 - Update documentation if needed
- Add your changes to CHANGELOG.md
-
Pull Request Process
- Push your changes to your fork
- Create a Pull Request (PR) to the main repository
- Fill out the PR template completely
- Link any related issues
- Wait for review and address any feedback
-
Code Review
- All contributions require review
- Address review comments promptly
- Keep discussions professional and constructive
- Be open to suggestions and improvements
Security
- Report security vulnerabilities privately
- Use HTTPS for secure connections
- Never commit sensitive information
- Use environment variables for secrets
Need Help?
- Check existing issues and documentation
- Ask for clarification on unclear tasks
- Join our community discussions
- Reach out to maintainers
Thank you for contributing to AnnotationQuality! Your efforts help make this tool better for everyone.
License
This project is licensed under the MIT license.
Available Agreement Measures
The following agreement measures are available in IAA-Eval:
-
Raw Agreement (Percentage Agreement): The simplest measure, calculating the percentage of items on which annotators agree. While easy to interpret, it doesn't account for chance agreement.
-
Cohen's Kappa: Designed for two annotators, this measure calculates agreement while accounting for chance agreement. Particularly useful for categorical annotations.
-
Fleiss' Kappa: An extension of Cohen's Kappa for three or more annotators. Suitable for measuring agreement in classification tasks with multiple annotators.
-
Krippendorff's Alpha: A versatile measure that handles missing data and works with any number of annotators. It can accommodate different types of data (nominal, ordinal, interval, ratio).
-
F-measure: Calculates agreement using precision, recall, and their harmonic mean. Particularly useful for tasks like named entity recognition or classification where both precision and recall are important.
-
Intraclass Correlation Coefficient (ICC): Measures the degree of correlation and agreement between measurements made by different annotators. Supports different forms (1,1 | 2,1 | 3,1 | 1,k | 2,k | 3,k) for various study designs.
-
Boundary-Weighted Fleiss' Kappa (BWFK): Specialized measure for evaluating agreement in binary segmentation tasks, particularly useful for tissue boundaries where minor disagreements are common.
-
Distance-Based Cell Agreement Algorithm (DBCAA): Designed specifically for cell annotations, this algorithm measures agreement based on spatial relationships without requiring ground truth.
-
Intersection over Union (IoU): Measures agreement in segmentation tasks by calculating the overlap between annotated regions. Particularly useful for evaluating spatial annotations.
Each measure can be calculated with confidence intervals using various methods (bootstrap, normal approximation, Wilson score, or Agresti-Coull), providing statistical bounds for the agreement values.
Project details
Release history Release notifications | RSS feed
Download files
Download the file for your platform. If you're not sure which to choose, learn more about installing packages.
Source Distribution
Built Distribution
Filter files by name, interpreter, ABI, and platform.
If you're not sure about the file name format, learn more about wheel file names.
Copy a direct link to the current filters
File details
Details for the file iaa_eval-0.1.0.tar.gz.
File metadata
- Download URL: iaa_eval-0.1.0.tar.gz
- Upload date:
- Size: 28.6 kB
- Tags: Source
- Uploaded using Trusted Publishing? Yes
- Uploaded via: twine/6.1.0 CPython/3.12.9
File hashes
| Algorithm | Hash digest | |
|---|---|---|
| SHA256 |
b05fc683ded5a38c6fcfbb854325e2d70fc4c35d3133c2a69016baf855324c05
|
|
| MD5 |
b39fe359d153fb3f51b4806e6bd850a8
|
|
| BLAKE2b-256 |
04820e544ef697f51526ec5358a341b37ab8e14b8fff5863f85b8b948f3ed09b
|
Provenance
The following attestation bundles were made for iaa_eval-0.1.0.tar.gz:
Publisher:
publish.yml on Wameuh/AnnotationQuality
-
Statement:
-
Statement type:
https://in-toto.io/Statement/v1 -
Predicate type:
https://docs.pypi.org/attestations/publish/v1 -
Subject name:
iaa_eval-0.1.0.tar.gz -
Subject digest:
b05fc683ded5a38c6fcfbb854325e2d70fc4c35d3133c2a69016baf855324c05 - Sigstore transparency entry: 195046844
- Sigstore integration time:
-
Permalink:
Wameuh/AnnotationQuality@afbd1c0aa0505579c1312366b5136cc3c2a56a19 -
Branch / Tag:
refs/tags/v0.1.0 - Owner: https://github.com/Wameuh
-
Access:
public
-
Token Issuer:
https://token.actions.githubusercontent.com -
Runner Environment:
github-hosted -
Publication workflow:
publish.yml@afbd1c0aa0505579c1312366b5136cc3c2a56a19 -
Trigger Event:
release
-
Statement type:
File details
Details for the file iaa_eval-0.1.0-py3-none-any.whl.
File metadata
- Download URL: iaa_eval-0.1.0-py3-none-any.whl
- Upload date:
- Size: 23.1 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? Yes
- Uploaded via: twine/6.1.0 CPython/3.12.9
File hashes
| Algorithm | Hash digest | |
|---|---|---|
| SHA256 |
245d693caca1f6c5e757507cca69da0d6cff6c78e272a1829a3108da5b3a6fe3
|
|
| MD5 |
740c30272f21bffb5a74355dbc0c1d2a
|
|
| BLAKE2b-256 |
69b58390fc6a04dab6ce28a8aed5b753bfd1980ed33e278d4ba901af580a2d0a
|
Provenance
The following attestation bundles were made for iaa_eval-0.1.0-py3-none-any.whl:
Publisher:
publish.yml on Wameuh/AnnotationQuality
-
Statement:
-
Statement type:
https://in-toto.io/Statement/v1 -
Predicate type:
https://docs.pypi.org/attestations/publish/v1 -
Subject name:
iaa_eval-0.1.0-py3-none-any.whl -
Subject digest:
245d693caca1f6c5e757507cca69da0d6cff6c78e272a1829a3108da5b3a6fe3 - Sigstore transparency entry: 195046848
- Sigstore integration time:
-
Permalink:
Wameuh/AnnotationQuality@afbd1c0aa0505579c1312366b5136cc3c2a56a19 -
Branch / Tag:
refs/tags/v0.1.0 - Owner: https://github.com/Wameuh
-
Access:
public
-
Token Issuer:
https://token.actions.githubusercontent.com -
Runner Environment:
github-hosted -
Publication workflow:
publish.yml@afbd1c0aa0505579c1312366b5136cc3c2a56a19 -
Trigger Event:
release
-
Statement type: