mbrs is a library for minimum Bayes risk (MBR) decoding.
Paper | Reference docs | Citation
Installation
You can install mbrs from PyPI:
pip install mbrs
For development, it can be installed from source:
git clone https://github.com/naist-nlp/mbrs.git
cd mbrs/
pip install ./
Quick start
mbrs provides two interfaces: command-line interface (CLI) and Python API.
Command-line interface
The command-line interface runs MBR decoding from the command line. Before running MBR decoding, you can generate hypothesis sentences with mbrs-generate:
mbrs-generate \
sources.txt \
--output hypotheses.txt \
--lang_pair en-de \
--model facebook/m2m100_418M \
--num_candidates 1024 \
--sampling eps --epsilon 0.02 \
--batch_size 8 --sampling_size 8 --fp16 \
--report_format rounded_outline
Beam search can also be used by replacing --sampling eps --epsilon 0.02 with --beam_size 10.
Next, MBR decoding and other decoding methods can be executed with mbrs-decode. This example regards the hypothesis set as the pseudo-reference set.
mbrs-decode \
hypotheses.txt \
--num_candidates 1024 \
--nbest 1 \
--source sources.txt \
--references hypotheses.txt \
--output translations.txt \
--report report.txt --report_format rounded_outline \
--decoder mbr \
--metric comet \
--metric.model Unbabel/wmt22-comet-da \
--metric.batch_size 64 --metric.fp16 true
You can pass the arguments via a configuration YAML file with the --config_path option. See the docs for details.
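As a sketch, such a config file might simply mirror the CLI flags shown above. Note that the key names here are assumed from the CLI options, not taken from the docs, so check the documentation for the authoritative schema:

```yaml
# Hypothetical --config_path file; keys are assumed to mirror mbrs-decode flags.
num_candidates: 1024
nbest: 1
decoder: mbr
metric: comet
```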
Finally, you can evaluate the score with mbrs-score:
mbrs-score \
hypotheses.txt \
--sources sources.txt \
--references hypotheses.txt \
--format json \
--metric bleurt \
--metric.batch_size 64 --metric.fp16 true
Python API
This is an example of COMET-MBR via the Python API.
from mbrs.metrics import MetricCOMET
from mbrs.decoders import DecoderMBR
SOURCE = "ありがとう"
HYPOTHESES = ["Thanks", "Thank you", "Thank you so much", "Thank you.", "thank you"]
# Setup COMET.
metric_cfg = MetricCOMET.Config(
    model="Unbabel/wmt22-comet-da",
    batch_size=64,
    fp16=True,
)
metric = MetricCOMET(metric_cfg)
# Setup MBR decoding.
decoder_cfg = DecoderMBR.Config()
decoder = DecoderMBR(decoder_cfg, metric)
# Decode by COMET-MBR.
# This example regards the hypotheses themselves as the pseudo-references.
# Args: (hypotheses, pseudo-references, source)
output = decoder.decode(HYPOTHESES, HYPOTHESES, source=SOURCE, nbest=1)
print(f"Selected index: {output.idx}")
print(f"Output sentence: {output.sentence}")
print(f"Expected score: {output.score}")
List of implemented methods
Metrics
Currently, the following metrics are supported:
- BLEU (Papineni et al., 2002): bleu
- TER (Snover et al., 2006): ter
- chrF (Popović et al., 2015): chrf
- COMET (Rei et al., 2020): comet
- COMETkiwi (Rei et al., 2022): cometkiwi
- XCOMET (Guerreiro et al., 2023): xcomet
- BLEURT (Sellam et al., 2020): bleurt (thanks to @lucadiliello)
Decoders
The following decoding methods are implemented:
- N-best reranking: rerank
- MBR decoding: mbr
Specifically, the following methods of MBR decoding are included:
- Expectation estimation:
  - Monte Carlo estimation (Eikema and Aziz, 2020; Eikema and Aziz, 2022)
  - Model-based estimation (Jinnai et al., 2024): --reference_lprobs option
- Efficient methods:
  - Confidence-based pruning (Cheng and Vlachos, 2023): pruning_mbr
  - Reference aggregation (DeNero et al., 2009; Vamvas and Sennrich, 2024): aggregate_mbr
    - N-gram aggregation on BLEU (DeNero et al., 2009)
    - N-gram aggregation on chrF (Vamvas and Sennrich, 2024)
    - Embedding aggregation on COMET (Vamvas and Sennrich, 2024; Deguchi et al., 2024)
  - Centroid-based MBR (Deguchi et al., 2024): centroid_mbr
  - Probabilistic MBR (Trabelsi et al., 2024): probabilistic_mbr
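To illustrate the idea behind reference aggregation: rather than scoring every hypothesis against every pseudo-reference (a quadratic number of metric calls), the references are first averaged into a single representation, and each hypothesis is scored once against that average (a linear number of calls). This standalone sketch (not mbrs's API) uses toy 2-D embeddings and cosine similarity in place of a real embedding metric such as COMET:

```python
# Sketch of embedding aggregation: average the reference embeddings into a
# centroid, then score each hypothesis once against the centroid instead of
# against every reference individually.
import math

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def aggregate_mbr_select(hyp_embs):
    # Hypotheses double as pseudo-references, as in the examples above.
    dim = len(hyp_embs[0])
    centroid = [sum(e[i] for e in hyp_embs) / len(hyp_embs) for i in range(dim)]
    scores = [cosine(h, centroid) for h in hyp_embs]
    return max(range(len(scores)), key=scores.__getitem__)

# Toy embeddings: the first two hypotheses are similar, the third is an outlier.
embs = [[1.0, 0.1], [0.9, 0.2], [0.1, 1.0]]
best = aggregate_mbr_select(embs)
```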
Selectors
The final output list is selected according to these selectors:
- N-best selection: nbest
- Diverse selection (Jinnai et al., 2024): diverse
Related projects
- mbr
  - Highly integrated with Hugging Face transformers by customizing the generate() method of the model implementation.
  - If you are looking for an MBR decoding library that is fully integrated into transformers, this might be a good choice.
  - Our mbrs works standalone; thus, not only transformers but also fairseq or LLM outputs obtained via an API can be used.
Citation
If you use this software, please cite:
@misc{deguchi-2024-mbrs,
title={mbrs: A Library for Minimum Bayes Risk Decoding},
author={Hiroyuki Deguchi and Yusuke Sakai and Hidetaka Kamigaito and Taro Watanabe},
year={2024},
eprint={2408.04167},
archivePrefix={arXiv},
primaryClass={cs.CL},
url={https://arxiv.org/abs/2408.04167},
}
License
This library is mainly developed by Hiroyuki Deguchi and published under the MIT license.