Automated Audio Captioning metrics with PyTorch.
This package is a tool to evaluate sentences produced by automatic audio or image captioning models. The results of BLEU [1], ROUGE-L [2], METEOR [3], CIDEr [4], SPICE [5] and SPIDEr [6] are consistent with https://github.com/audio-captioning/caption-evaluation-tools.
Installation
Install the pip package:
```bash
pip install git+https://github.com/Labbeti/aac-metrics
```
Download the external code needed for METEOR, SPICE and PTBTokenizer:
```bash
aac-metrics-download
```
Examples
Evaluate all metrics
```python
from aac_metrics import aac_evaluate

candidates = ["a man is speaking", ...]
mult_references = [["a man speaks.", "someone speaks.", "a man is speaking while a bird is chirping in the background"], ...]

global_scores, _ = aac_evaluate(candidates, mult_references)
print(global_scores)
# dict containing the score of each aac metric: "bleu_1", "bleu_2", "bleu_3", "bleu_4", "rouge_l", "meteor", "cider_d", "spice", "spider"
# {"bleu_1": tensor(0.7), "bleu_2": ..., ...}
```
Evaluate a specific metric
```python
from aac_metrics.functional import coco_cider_d

candidates = [...]
mult_references = [[...], ...]

global_scores, local_scores = coco_cider_d(candidates, mult_references)
print(global_scores)
# {"cider_d": tensor(0.1)}
print(local_scores)
# {"cider_d": tensor([0.9, ...])}
```
Experimental SPIDEr-max metric
```python
from aac_metrics.functional import spider_max

mult_candidates = [[...], ...]
mult_references = [[...], ...]

global_scores, local_scores = spider_max(mult_candidates, mult_references)
print(global_scores)
# {"spider": tensor(0.1)}
print(local_scores)
# {"spider": tensor([0.9, ...])}
```
Requirements
Python packages
These Python requirements are installed automatically when the package is installed with pip:

- torch >= 1.10.1
- numpy >= 1.21.2
- pyyaml >= 6.0
- tqdm >= 4.64.0
External requirements
- `java` >= 1.8 is required to compute METEOR, SPICE and to use the PTBTokenizer. Most of the related functions accept a `java_path` argument to specify the path to the java executable (see the sketch below).
- The `unzip` command is required to extract the SPICE zipped files.
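A minimal sketch of overriding the java path, assuming the functional module exposes a `coco_meteor` function alongside `coco_cider_d` (the function name and the path below are assumptions for illustration):

```python
from aac_metrics.functional import coco_meteor  # assumed functional entry point for METEOR

candidates = ["a man is speaking"]
mult_references = [["a man speaks.", "someone speaks."]]

# Point the metric to a specific Java >= 1.8 binary via the java_path argument.
global_scores, local_scores = coco_meteor(
    candidates,
    mult_references,
    java_path="/usr/bin/java",  # hypothetical path to the java executable
)
```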
Metrics
COCO metrics
Metric | Origin | Range | Short description |
---|---|---|---|
BLEU [1] | machine translation | [0, 1] | Precision of n-grams |
ROUGE-L [2] | text summarization | [0, 1] | Longest common subsequence |
METEOR [3] | machine translation | [0, 1] | Alignment-based harmonic mean of unigram precision and recall |
CIDEr [4] | image captioning | [0, 10] | Cosine-similarity of TF-IDF |
SPICE [5] | image captioning | [0, 1] | FScore of semantic graph |
SPIDEr [6] | image captioning | [0, 5.5] | Mean of CIDEr and SPICE |
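SPIDEr is the mean of CIDEr-D and SPICE, which is where its [0, 5.5] range comes from. A worked illustration (not the package's implementation):

```python
# Illustration only: SPIDEr averages CIDEr-D (range [0, 10]) and SPICE (range [0, 1]).
def spider_from_components(cider_d: float, spice: float) -> float:
    return 0.5 * (cider_d + spice)

print(spider_from_components(10.0, 1.0))  # 5.5, the upper bound of the SPIDEr range
print(spider_from_components(1.2, 0.2))   # 0.7
```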
Other metrics
Metric | Origin | Range | Short description |
---|---|---|---|
SPIDEr-max | audio captioning | [0, 5.5] | Max of the SPIDEr scores over multiple candidates |
References
[1] K. Papineni, S. Roukos, T. Ward, and W.-J. Zhu, “BLEU: a method for automatic evaluation of machine translation,” in Proceedings of the 40th Annual Meeting on Association for Computational Linguistics - ACL ’02. Philadelphia, Pennsylvania: Association for Computational Linguistics, 2001, p. 311. [Online]. Available: http://portal.acm.org/citation.cfm?doid=1073083.1073135
[2] C.-Y. Lin, “ROUGE: A package for automatic evaluation of summaries,” in Text Summarization Branches Out. Barcelona, Spain: Association for Computational Linguistics, Jul. 2004, pp. 74–81. [Online]. Available: https://aclanthology.org/W04-1013
[3] M. Denkowski and A. Lavie, “Meteor Universal: Language Specific Translation Evaluation for Any Target Language,” in Proceedings of the Ninth Workshop on Statistical Machine Translation. Baltimore, Maryland, USA: Association for Computational Linguistics, 2014, pp. 376–380. [Online]. Available: http://aclweb.org/anthology/W14-3348
[4] R. Vedantam, C. L. Zitnick, and D. Parikh, “CIDEr: Consensus-based Image Description Evaluation,” arXiv:1411.5726 [cs], Jun. 2015, arXiv: 1411.5726. [Online]. Available: http://arxiv.org/abs/1411.5726
[5] P. Anderson, B. Fernando, M. Johnson, and S. Gould, “SPICE: Semantic Propositional Image Caption Evaluation,” arXiv:1607.08822 [cs], Jul. 2016, arXiv: 1607.08822. [Online]. Available: http://arxiv.org/abs/1607.08822
[6] S. Liu, Z. Zhu, N. Ye, S. Guadarrama, and K. Murphy, “Improved Image Captioning via Policy Gradient optimization of SPIDEr,” 2017 IEEE International Conference on Computer Vision (ICCV), pp. 873–881, Oct. 2017, arXiv: 1612.00370. [Online]. Available: http://arxiv.org/abs/1612.00370
Cite the aac-metrics package
The associated paper has been accepted, but it will only be published after the DCASE2022 workshop.
If you use this code, you can cite it with the following temporary citation:
```bibtex
@inproceedings{Labbe2022,
    author = "Etienne Labbe and Thomas Pellegrini and Julien Pinquier",
    title  = "IS MY AUTOMATIC AUDIO CAPTIONING SYSTEM SO BAD? SPIDEr-max: A METRIC TO CONSIDER SEVERAL CAPTION CANDIDATES",
    month  = "November",
    year   = "2022",
}
```