
High-quality Machine Translation Evaluation





NEWS: We have released the CometKiwi-XL (3.5B) and -XXL (10.7B) QE models. These were the best-performing QE models in the WMT23 QE shared task. Please check all available models here

Quick Installation

COMET requires Python 3.8 or above. Simple installation from PyPI:

pip install --upgrade pip  # ensures that pip is current 
pip install unbabel-comet

Note: To use some COMET models, such as Unbabel/wmt22-cometkiwi-da, you must acknowledge their license on the Hugging Face Hub and log in to the Hugging Face Hub.
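If you have not logged in on your machine before, one way to do this is with the Hugging Face CLI (a minimal sketch; the exact flow may vary with your huggingface_hub version):

huggingface-cli login  # paste a Hugging Face access token when prompted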

To develop locally, run the following commands:

git clone https://github.com/Unbabel/COMET
cd COMET
pip install poetry
poetry install

For development, you can run the CLI tools directly, e.g.,

PYTHONPATH=. ./comet/cli/score.py

Scoring MT outputs:

CLI Usage:

Test examples:

echo -e "10 到 15 分钟可以送到吗\nPode ser entregue dentro de 10 a 15 minutos?" >> src.txt
echo -e "Can I receive my food in 10 to 15 minutes?\nCan it be delivered in 10 to 15 minutes?" >> hyp1.txt
echo -e "Can it be delivered within 10 to 15 minutes?\nCan you send it for 10 to 15 minutes?" >> hyp2.txt
echo -e "Can it be delivered between 10 to 15 minutes?\nCan it be delivered between 10 to 15 minutes?" >> ref.txt

Basic scoring command:

comet-score -s src.txt -t hyp1.txt -r ref.txt

You can set the number of GPUs using --gpus (0 to test on CPU).
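For example, to force CPU execution for a quick sanity check:

comet-score -s src.txt -t hyp1.txt -r ref.txt --gpus 0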

Scoring multiple systems:

comet-score -s src.txt -t hyp1.txt hyp2.txt -r ref.txt

WMT test sets via SacreBLEU:

comet-score -d wmt22:en-de -t PATH/TO/TRANSLATIONS

If you are only interested in a system-level score use the following command:

comet-score -s src.txt -t hyp1.txt -r ref.txt --quiet --only_system

Reference-free evaluation:

comet-score -s src.txt -t hyp1.txt --model Unbabel/wmt23-cometkiwi-da-xl

Note: To use the Unbabel/wmt23-cometkiwi-da-xl you first have to acknowledge its license on Hugging Face Hub.

Comparing multiple systems:

When comparing multiple MT systems, we encourage you to run the comet-compare command to get statistical significance with paired t-tests and bootstrap resampling (Koehn, 2004).

comet-compare -s src.de -t hyp1.en hyp2.en hyp3.en -r ref.en

Minimum Bayes Risk Decoding:

The MBR command allows you to rank translations and select the best one according to COMET metrics. For more details you can read our paper on Quality-Aware Decoding for Neural Machine Translation.

comet-mbr -s [SOURCE].txt -t [MT_SAMPLES].txt --num_sample [X] -o [OUTPUT_FILE].txt

If you are working with a very large candidate list, you can use the --rerank_top_k flag to prune to the top-K most promising candidates according to a reference-free metric.

Example for a candidate list of 1000 samples:

comet-mbr -s [SOURCE].txt -t [MT_SAMPLES].txt -o [OUTPUT_FILE].txt --num_sample 1000 --rerank_top_k 100 --gpus 4 --qe_model Unbabel/wmt23-cometkiwi-da-xl

Your source and samples files should be formatted in this way.
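As an illustration only (this layout is an assumption based on the --num_sample flag, so double-check it against the linked example): the source file holds one source segment per line, and the samples file lists the candidates for each source on consecutive lines, num_sample candidates per source.

src.txt (one source per line):
    source segment 1
    source segment 2

samples.txt (with --num_sample 3, candidates for each source on consecutive lines):
    candidate 1 for segment 1
    candidate 2 for segment 1
    candidate 3 for segment 1
    candidate 1 for segment 2
    candidate 2 for segment 2
    candidate 3 for segment 2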

COMET Models

Within COMET, there are several evaluation models available. You can refer to the MODELS page for a comprehensive list of all available models. Here is a concise list of the main reference-based and reference-free models:

  • Default Model: Unbabel/wmt22-comet-da - This model employs a reference-based regression approach and is built upon the XLM-R architecture. It has been trained on direct assessments from WMT17 to WMT20 and provides scores ranging from 0 to 1, where 1 signifies a perfect translation.
  • Reference-free Model: Unbabel/wmt23-cometkiwi-da-xl - This reference-free model adopts a regression approach and is built on top of the XLM-R XL architecture. It has undergone training on direct assessments from WMT17 to WMT20, as well as direct assessments from the WMT23 QE shared task. Similar to the default model, it also generates scores ranging from 0 to 1. This model has 3.5 billion parameters and requires a minimum of 15GB of GPU memory. For a more lightweight evaluation, please consult Unbabel/wmt22-cometkiwi-da, and if you seek the best overall performance consider Unbabel/wmt23-cometkiwi-da-xxl which requires a minimum of 44GB GPU memory.

If you intend to compare your results with papers published before 2022, it's likely that they used older evaluation models. In such cases, please refer to Unbabel/wmt20-comet-da and Unbabel/wmt20-comet-qe-da, which were the primary checkpoints used in previous versions (<2.0) of COMET.

Also, the UniTE metric, developed by the NLP2CT Lab at the University of Macau and Alibaba Group, can be used directly through COMET; check here for more details.

Interpreting Scores:

When using COMET to evaluate machine translation, it's important to understand how to interpret the scores it produces.

In general, COMET models are trained to predict quality scores for translations. These scores are typically normalized using a z-score transformation to account for individual differences among annotators. While the raw score itself does not have a direct interpretation, it is useful for ranking translations and systems according to their quality.

However, for the latest COMET models like Unbabel/wmt22-comet-da, we have introduced a new training approach that scales the scores between 0 and 1. This makes it easier to interpret the scores: a score close to 1 indicates a high-quality translation, while a score close to 0 indicates a translation that is no better than random chance.

It's worth noting that when using COMET to compare the performance of two different translation systems, it's important to run the comet-compare command to obtain statistical significance measures. This command compares the output of two systems using a statistical hypothesis test, providing an estimate of the probability that the observed difference in scores between the systems is due to chance. This is an important step to ensure that any differences in scores between systems are statistically significant.

Overall, the added interpretability of scores in the latest COMET models, combined with the ability to assess statistical significance between systems using comet-compare, make COMET a valuable tool for evaluating machine translation.

Languages Covered:

All the above-mentioned models are built on top of XLM-R, which covers the following languages:

Afrikaans, Albanian, Amharic, Arabic, Armenian, Assamese, Azerbaijani, Basque, Belarusian, Bengali, Bengali Romanized, Bosnian, Breton, Bulgarian, Burmese, Catalan, Chinese (Simplified), Chinese (Traditional), Croatian, Czech, Danish, Dutch, English, Esperanto, Estonian, Filipino, Finnish, French, Galician, Georgian, German, Greek, Gujarati, Hausa, Hebrew, Hindi, Hindi Romanized, Hungarian, Icelandic, Indonesian, Irish, Italian, Japanese, Javanese, Kannada, Kazakh, Khmer, Korean, Kurdish (Kurmanji), Kyrgyz, Lao, Latin, Latvian, Lithuanian, Macedonian, Malagasy, Malay, Malayalam, Marathi, Mongolian, Nepali, Norwegian, Oriya, Oromo, Pashto, Persian, Polish, Portuguese, Punjabi, Romanian, Russian, Sanskrit, Scottish Gaelic, Serbian, Sindhi, Sinhala, Slovak, Slovenian, Somali, Spanish, Sundanese, Swahili, Swedish, Tamil, Tamil Romanized, Telugu, Telugu Romanized, Thai, Turkish, Ukrainian, Urdu, Urdu Romanized, Uyghur, Uzbek, Vietnamese, Welsh, Western Frisian, Xhosa, Yiddish.

Thus, results for language pairs containing uncovered languages are unreliable!

Scoring within Python:

from comet import download_model, load_from_checkpoint

model_path = download_model("Unbabel/wmt22-comet-da")
model = load_from_checkpoint(model_path)
data = [
    {
        "src": "10 到 15 分钟可以送到吗",
        "mt": "Can I receive my food in 10 to 15 minutes?",
        "ref": "Can it be delivered between 10 to 15 minutes?"
    },
    {
        "src": "Pode ser entregue dentro de 10 a 15 minutos?",
        "mt": "Can you send it for 10 to 15 minutes?",
        "ref": "Can it be delivered between 10 to 15 minutes?"
    }
]
model_output = model.predict(data, batch_size=8, gpus=1)
print(model_output)
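In COMET 2.x the object returned by predict exposes both segment-level and corpus-level scores; a minimal sketch of reading them (attribute names assumed from the 2.x API, so verify against your installed version):

print(model_output.scores)        # list with one score per input triplet
print(model_output.system_score)  # corpus-level score (average over segments)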

Train your own Metric:

Instead of using pretrained models, you can train your own model with the following command:

comet-train --cfg configs/models/{your_model_config}.yaml

You can then use your own metric to score:

comet-score -s src.de -t hyp1.en -r ref.en --model PATH/TO/CHECKPOINT

You can also upload your model to the Hugging Face Hub, using Unbabel/wmt22-comet-da as an example. You can then use your model directly from the hub.
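The Python API shown in "Scoring within Python" also works with a locally trained checkpoint; a minimal sketch, assuming comet-train produced a Lightning checkpoint file (the path below is illustrative):

from comet import load_from_checkpoint

model = load_from_checkpoint("PATH/TO/CHECKPOINT/model.ckpt")  # checkpoint from comet-train
model_output = model.predict(data, batch_size=8, gpus=1)       # same data format as above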

unittest:

In order to run the toolkit tests you must run the following commands:

poetry run coverage run --source=comet -m unittest discover
poetry run coverage report -m # Expected coverage 80%

Note: Testing on CPU takes a long time

Publications

If you use COMET please cite our work and don't forget to say which model you used!
