Automated Audio Captioning metrics with PyTorch.
This package is a tool to evaluate sentences produced by automatic models that caption images or audio. The results of BLEU [1], ROUGE-L [2], METEOR [3], CIDEr [4], SPICE [5] and SPIDEr [6] are consistent with https://github.com/audio-captioning/caption-evaluation-tools.
Why use this package?
- Easy installation with pip
- Consistent with audio caption metrics code https://github.com/audio-captioning/caption-evaluation-tools
- Provides functions and classes to compute metrics separately
- Provides the SPIDEr-max metric described in the DCASE paper [7].
Installation
Install the pip package:
pip install aac-metrics
Download the external code needed for METEOR, SPICE and PTBTokenizer:
aac-metrics-download
Note: The external code for SPICE, METEOR and PTBTokenizer is stored in the cache directory (default: $HOME/aac-metrics-cache/).
Metrics
AAC metrics
Metric | Origin | Range | Short description |
---|---|---|---|
BLEU [1] | machine translation | [0, 1] | Precision of n-grams |
ROUGE-L [2] | machine translation | [0, 1] | FScore of the longest common subsequence |
METEOR [3] | machine translation | [0, 1] | Cosine-similarity of frequencies |
CIDEr-D [4] | image captioning | [0, 10] | Cosine-similarity of TF-IDF |
SPICE [5] | image captioning | [0, 1] | FScore of semantic graph |
SPIDEr [6] | image captioning | [0, 5.5] | Mean of CIDEr-D and SPICE |
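As the table shows, SPIDEr is the mean of CIDEr-D and SPICE, which is also where its [0, 5.5] upper bound comes from. A minimal illustration in plain Python (not part of the package API):
# SPIDEr is the arithmetic mean of CIDEr-D and SPICE (see table above); illustrative only.
def spider_score(cider_d: float, spice: float) -> float:
    return 0.5 * (cider_d + spice)
# Maximum possible value: (10 + 1) / 2 = 5.5, hence the [0, 5.5] range.
assert spider_score(10.0, 1.0) == 5.5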
Other metrics
Metric | Origin | Range | Short description |
---|---|---|---|
SPIDEr-max [7] | audio captioning | [0, 5.5] | Max of SPIDEr scores over multiple candidates |
Usage
Evaluate AAC metrics
The full evaluation process to compute AAC metrics can be done with the aac_metrics.aac_evaluate function.
from aac_metrics import aac_evaluate
candidates: list[str] = ["a man is speaking", ...]
mult_references: list[list[str]] = [["a man speaks.", "someone speaks.", "a man is speaking while a bird is chirping in the background"], ...]
global_scores, _ = aac_evaluate(candidates, mult_references)
print(global_scores)
# dict containing the score of each aac metric: "bleu_1", "bleu_2", "bleu_3", "bleu_4", "rouge_l", "meteor", "cider_d", "spice", "spider"
# {"bleu_1": tensor(0.7), "bleu_2": ..., ...}
Evaluate a specific metric
Evaluating a specific metric can be done with the aac_metrics.functional.<metric_name>.<metric_name> function. Unlike aac_evaluate, these functions do not run the PTBTokenizer tokenization, but you can apply it beforehand with the preprocess_mono_sents and preprocess_mult_sents functions.
from aac_metrics.functional import coco_cider_d
from aac_metrics.utils.tokenization import preprocess_mono_sents, preprocess_mult_sents
candidates: list[str] = ["a man is speaking", ...]
mult_references: list[list[str]] = [["a man speaks.", "someone speaks.", "a man is speaking while a bird is chirping in the background"], ...]
candidates = preprocess_mono_sents(candidates)
mult_references = preprocess_mult_sents(mult_references)
global_scores, local_scores = coco_cider_d(candidates, mult_references)
print(global_scores)
# {"cider_d": tensor(0.1)}
print(local_scores)
# {"cider_d": tensor([0.9, ...])}
Each metric also exists as a Python class, e.g. aac_metrics.classes.coco_cider_d.CocoCIDErD.
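A minimal sketch of the class-based API, assuming the class instance is callable with the same (candidates, mult_references) arguments as its functional counterpart:
from aac_metrics.classes.coco_cider_d import CocoCIDErD

cider_d = CocoCIDErD()
# Assumption: the instance behaves like the coco_cider_d function shown above.
global_scores, local_scores = cider_d(candidates, mult_references)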
SPIDEr-max
SPIDEr-max [7] is a metric based on SPIDEr that takes into account multiple candidates for the same audio. It computes the maximum of the SPIDEr scores over the candidates, to compensate for SPIDEr's high sensitivity to the frequency of the words generated by the model.
SPIDEr-max: why?
The SPIDEr metric used in audio captioning is highly sensitive to the frequencies of the words used.
Here are a few examples of candidates and references for two different audio files, with their associated SPIDEr scores:
Candidates | SPIDEr |
---|---|
heavy rain is falling on a roof | 0.562 |
heavy rain is falling on a tin roof | 0.930 |
a heavy rain is falling on a roof | 0.594 |
a heavy rain is falling on the ground | 0.335 |
a heavy rain is falling on the roof | 0.594 |
References |
---|
heavy rain falls loudly onto a structure with a thin roof |
heavy rainfall falling onto a thin structure with a thin roof |
it is raining hard and the rain hits a tin roof |
rain that is pouring down very hard outside |
the hard rain is noisy as it hits a tin roof |
(References for the Clotho development-testing file named "rain.wav")
Candidates | SPIDEr |
---|---|
a woman speaks and a sheep bleats | 0.190 |
a woman speaks and a goat bleats | 1.259 |
a man speaks and a sheep bleats | 0.344 |
an adult male speaks and a sheep bleats | 0.231 |
an adult male is speaking and a sheep bleats | 0.189 |
References |
---|
a man speaking and laughing followed by a goat bleat |
a man is speaking in high tone while a goat is bleating one time |
a man speaks followed by a goat bleat |
a person speaks and a goat bleats |
a man is talking and snickering followed by a goat bleating |
(References for an AudioCaps testing file (id: "jid4t-FzUn0"))
Even with very similar candidates, the SPIDEr scores vary drastically. To address this issue, we propose the SPIDEr-max metric, which takes the maximum value over several candidates for the same audio.
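For the first example above ("rain.wav"), SPIDEr-max simply keeps the best of the five candidate scores:
# SPIDEr scores of the five candidates from the first table above.
spider_scores = [0.562, 0.930, 0.594, 0.335, 0.594]
print(max(spider_scores))  # 0.930, the score of "heavy rain is falling on a tin roof"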
SPIDEr-max: usage
Its usage is very similar to the other captioning metrics, with the main difference that it takes a list of multiple candidates per audio as input.
from aac_metrics.functional import spider_max
from aac_metrics.utils.tokenization import preprocess_mult_sents
mult_candidates: list[list[str]] = [["a man is speaking", "maybe someone speaking"], ...]
mult_references: list[list[str]] = [["a man speaks.", "someone speaks.", "a man is speaking while a bird is chirping in the background"], ...]
mult_candidates = preprocess_mult_sents(mult_candidates)
mult_references = preprocess_mult_sents(mult_references)
global_scores, local_scores = spider_max(mult_candidates, mult_references)
print(global_scores)
# {"spider": tensor(0.1), ...}
print(local_scores)
# {"spider": tensor([0.9, ...]), ...}
Requirements
Python packages
The requirements are automatically installed when using pip install on this repository.
- torch >= 1.10.1
- numpy >= 1.21.2
- pyyaml >= 6.0
- tqdm >= 4.64.0
External requirements
- java >= 1.8 is required to compute METEOR, SPICE and to use the PTBTokenizer. Most of these functions accept a java_path argument to specify the Java executable (see the sketch after this list).
- unzip command to extract the SPICE zipped files.
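As an example, here is a hedged sketch of overriding the Java executable; the function name coco_meteor is assumed by analogy with coco_cider_d, and the path below is a placeholder:
from aac_metrics.functional import coco_meteor  # assumed name, analogous to coco_cider_d
from aac_metrics.utils.tokenization import preprocess_mono_sents, preprocess_mult_sents

candidates = preprocess_mono_sents(["a man is speaking"])
mult_references = preprocess_mult_sents([["a man speaks.", "someone speaks."]])

# "/usr/bin/java" is a placeholder; point java_path at any Java >= 1.8 executable.
global_scores, _ = coco_meteor(candidates, mult_references, java_path="/usr/bin/java")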
Additional notes
CIDEr or CIDEr-D?
The CIDEr [4] metric differs from CIDEr-D in that it applies a stemmer to each word before computing the n-grams of the sentences. In AAC, only CIDEr-D is reported and used in SPIDEr, but some papers call it "CIDEr".
Is torchmetrics needed for this package?
No. But if torchmetrics is installed, all metric classes will inherit from the torchmetrics.Metric base class.
torchmetrics is optional because most of the metrics do not use PyTorch tensors to compute scores, and numpy arrays or strings cannot be added to the states of torchmetrics.Metric.
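A quick way to observe this behaviour (the output depends on whether torchmetrics is installed in your environment):
from aac_metrics.classes.coco_cider_d import CocoCIDErD

try:
    from torchmetrics import Metric
    # Per the note above, the metric classes should subclass torchmetrics.Metric when it is available.
    print(issubclass(CocoCIDErD, Metric))
except ImportError:
    print("torchmetrics is not installed; metric classes use their own base class")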
References
[1] K. Papineni, S. Roukos, T. Ward, and W.-J. Zhu, “BLEU: a method for automatic evaluation of machine translation,” in Proceedings of the 40th Annual Meeting on Association for Computational Linguistics - ACL ’02. Philadelphia, Pennsylvania: Association for Computational Linguistics, 2001, p. 311. [Online]. Available: http://portal.acm.org/citation.cfm?doid=1073083.1073135
[2] C.-Y. Lin, “ROUGE: A package for automatic evaluation of summaries,” in Text Summarization Branches Out. Barcelona, Spain: Association for Computational Linguistics, Jul. 2004, pp. 74–81. [Online]. Available: https://aclanthology.org/W04-1013
[3] M. Denkowski and A. Lavie, “Meteor Universal: Language Specific Translation Evaluation for Any Target Language,” in Proceedings of the Ninth Workshop on Statistical Machine Translation. Baltimore, Maryland, USA: Association for Computational Linguistics, 2014, pp. 376–380. [Online]. Available: http://aclweb.org/anthology/W14-3348
[4] R. Vedantam, C. L. Zitnick, and D. Parikh, “CIDEr: Consensus-based Image Description Evaluation,” arXiv:1411.5726 [cs], Jun. 2015, arXiv: 1411.5726. [Online]. Available: http://arxiv.org/abs/1411.5726
[5] P. Anderson, B. Fernando, M. Johnson, and S. Gould, “SPICE: Semantic Propositional Image Caption Evaluation,” arXiv:1607.08822 [cs], Jul. 2016, arXiv: 1607.08822. [Online]. Available: http://arxiv.org/abs/1607.08822
[6] S. Liu, Z. Zhu, N. Ye, S. Guadarrama, and K. Murphy, “Improved Image Captioning via Policy Gradient optimization of SPIDEr,” 2017 IEEE International Conference on Computer Vision (ICCV), pp. 873–881, Oct. 2017, arXiv: 1612.00370. [Online]. Available: http://arxiv.org/abs/1612.00370
Note: the following reference is temporary:
[7] E. Labbe, T. Pellegrini, J. Pinquier, "IS MY AUTOMATIC AUDIO CAPTIONING SYSTEM SO BAD? SPIDEr-max: A METRIC TO CONSIDER SEVERAL CAPTION CANDIDATES", DCASE2022 Workshop.
Cite the aac-metrics package
The associated paper has been accepted but it will be published after the DCASE2022 workshop.
If you use this code, you can cite it with the following temporary citation:
@inproceedings{Labbe2022,
author = "Etienne Labbe and Thomas Pellegrini and Julien Pinquier",
title = "IS MY AUTOMATIC AUDIO CAPTIONING SYSTEM SO BAD? SPIDEr-max: A METRIC TO CONSIDER SEVERAL CAPTION CANDIDATES",
month = "November",
year = "2022",
}