
High-quality Machine Translation Evaluation Tool


🚀ReMedy: Machine Translation Evaluation via Reward Modeling

Learning High-Quality Machine Translation Evaluation from Human Preferences with Reward Modeling



✨ About ReMedy

ReMedy is a new state-of-the-art machine translation (MT) evaluation framework that reframes the task as reward modeling rather than direct regression. Instead of relying on noisy human scores, ReMedy learns from pairwise human preferences, leading to better alignment with human judgments.

  • 📈 State-of-the-art accuracy on WMT22–24 (39 language pairs, 111 systems)
  • ⚖️ Segment- and system-level evaluation, outperforming GPT-4, PaLM-540B, Finetuned-PaLM2, MetricX-13B, and XCOMET
  • 🔍 More robust on low-quality and out-of-domain translations (ACES, MSLC benchmarks)
  • 🧠 Can be used as a reward model in RLHF pipelines to improve MT systems

ReMedy demonstrates that reward modeling with pairwise preferences offers a more reliable and human-aligned approach for MT evaluation.
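The core idea can be illustrated with the standard Bradley–Terry reward-modeling objective: the loss is small only when the model scores the human-preferred translation above the dispreferred one. This is a generic sketch of the pairwise-preference loss, not ReMedy's actual training code:

```python
import math

def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))

def pairwise_preference_loss(r_chosen: float, r_rejected: float) -> float:
    """Bradley-Terry negative log-likelihood: -log sigmoid(r_chosen - r_rejected).

    The reward model is penalized unless it assigns a higher reward to the
    translation humans preferred than to the one they rejected.
    """
    return -math.log(sigmoid(r_chosen - r_rejected))

# The loss shrinks monotonically as the reward margin between the
# preferred and dispreferred translation grows:
for margin in (0.0, 1.0, 3.0):
    print(round(pairwise_preference_loss(margin, 0.0), 4))
```

Training on such pairs sidesteps the noise in absolute human scores: only the relative ordering of two translations has to be predicted.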




📦 Quick Installation

ReMedy requires Python ≥ 3.12 and leverages vLLM for fast inference.

✅ Recommended: Install via pip

pip install --upgrade pip
pip install remedy-mt-eval

🛠️ Install from Source

git clone https://github.com/Smu-Tan/Remedy
cd Remedy
pip install -e .

📜 Install via Poetry

git clone https://github.com/Smu-Tan/Remedy
cd Remedy
poetry install

⚙️ Requirements

  • Python ≥ 3.12
  • transformers ≥ 4.51.1
  • vllm ≥ 0.8.5
  • torch ≥ 2.6.0
  • (See pyproject.toml for full dependencies)
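As a quick sanity check, the minimum versions above can be verified programmatically. A small standard-library sketch (the package names and minimum versions are taken from the list above):

```python
import re
from importlib.metadata import PackageNotFoundError, version

# Minimum versions from the requirements above (see pyproject.toml for the rest).
REQUIRED = {"transformers": "4.51.1", "vllm": "0.8.5", "torch": "2.6.0"}

def parse_version(v: str) -> tuple[int, ...]:
    """Turn '4.51.1' into (4, 51, 1); trailing suffixes like 'rc1' are dropped."""
    parts: list[int] = []
    for piece in v.split("."):
        m = re.match(r"\d+", piece)
        if not m:
            break
        parts.append(int(m.group()))
    return tuple(parts)

def check_requirements() -> dict[str, bool]:
    """Map each required package to whether an adequate version is installed."""
    status: dict[str, bool] = {}
    for pkg, minimum in REQUIRED.items():
        try:
            status[pkg] = parse_version(version(pkg)) >= parse_version(minimum)
        except PackageNotFoundError:
            status[pkg] = False
    return status
```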

🚀 Usage

💾 Download ReMedy Models

Before running an evaluation, download a model from the HuggingFace Hub:

HF_HUB_ENABLE_HF_TRANSFER=1 huggingface-cli download ShaomuTan/ReMedy-9B-22 --local-dir Models/ReMedy-9B-22

You can replace ReMedy-9B-22 with other variants like ReMedy-9B-23.


🔹 Basic Usage

remedy-score \
    --model ShaomuTan/ReMedy-9B-22 \
    --src_file testcase/en.src \
    --mt_file testcase/en-de.hyp \
    --ref_file testcase/de.ref \
    --src_lang en --tgt_lang de \
    --cache_dir Models \
    --save_dir testcase \
    --num_gpus 4 \
    --calibrate

🔹 Reference-Free Mode (Quality Estimation)

remedy-score \
    --model ShaomuTan/ReMedy-9B-22 \
    --src_file testcase/en.src \
    --mt_file testcase/en-de.hyp \
    --no_ref \
    --src_lang en --tgt_lang de \
    --cache_dir Models \
    --save_dir testcase/QE \
    --num_gpus 4 \
    --calibrate

📄 Output Files

  • src-tgt_raw_scores.txt
  • src-tgt_sigmoid_scores.txt
  • src-tgt_calibration_scores.txt
  • src-tgt_detailed_results.tsv
  • src-tgt_result.json

Inspired by SacreBLEU, ReMedy provides JSON-style results to ensure transparency and comparability.

📘 Example JSON Output
{
  "metric_name": "remedy-9B-22",
  "raw_score": 4.502863049214531,
  "sigmoid_score": 0.9613502018042875,
  "calibration_score": 0.9029647169507162,
  "calibration_temp": 1.7999999999999998,
  "signature": "metric_name:remedy-9B-22|lp:en-de|ref:yes|version:0.1.1",
  "language_pair": "en-de",
  "source_language": "en",
  "target_language": "de",
  "segments": 2037,
  "version": "0.1.1",
  "args": {
    "src_file": "testcase/en.src",
    "mt_file": "testcase/en-de.hyp",
    "src_lang": "en",
    "tgt_lang": "de",
    "model": "Models/remedy-9B-22",
    "cache_dir": "Models",
    "save_dir": "testcase",
    "ref_file": "testcase/de.ref",
    "no_ref": false,
    "calibrate": true,
    "num_gpus": 4,
    "num_seqs": 256,
    "max_length": 4096,
    "enable_truncate": false,
    "version": false,
    "list_languages": false
  }
}
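The result JSON is easy to consume programmatically. A small sketch that loads a src-tgt_result.json file and prints a SacreBLEU-style one-line summary (field names taken from the example above):

```python
import json
from pathlib import Path

def load_result(path: str) -> dict:
    """Load a src-tgt_result.json file produced by remedy-score."""
    return json.loads(Path(path).read_text(encoding="utf-8"))

def summarize(result: dict) -> str:
    """One-line summary in the spirit of SacreBLEU signatures."""
    return (f"{result['metric_name']} {result['language_pair']} "
            f"calibration={result['calibration_score']:.4f} "
            f"segments={result['segments']}")

# Hypothetical usage:
# print(summarize(load_result("testcase/en-de_result.json")))
```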

⚙️ Full Argument List


🔸 Required

--src_file           # Path to source file
--mt_file            # Path to MT output file
--src_lang           # Source language code
--tgt_lang           # Target language code
--model              # Model path or HuggingFace ID
--save_dir           # Output directory

🔸 Optional

--ref_file           # Reference file path
--no_ref             # Reference-free mode
--cache_dir          # Cache directory
--calibrate          # Enable calibration
--num_gpus           # Number of GPUs
--num_seqs           # Number of sequences (default: 256)
--max_length         # Max token length (default: 4096)
--enable_truncate    # Truncate sequences
--version            # Print version
--list_languages     # List supported languages

🧠 Model Variants

Model         Size  Base Model  Ref/QE  Download
ReMedy-9B-22  9B    Gemma-2-9B  Both    🤗 HuggingFace
ReMedy-9B-23  9B    Gemma-2-9B  Both    🤗 HuggingFace
ReMedy-9B-24  9B    Gemma-2-9B  Both    🤗 HuggingFace

More variants coming soon...


🔁 Reproducing WMT Results

Follow the steps below to reproduce the WMT22–24 evaluation.

1. Clone ReMedy repo

git clone https://github.com/Smu-Tan/Remedy
cd Remedy

2. Install mt-metrics-eval

# Install MTME and download WMT data
git clone https://github.com/google-research/mt-metrics-eval.git
cd mt-metrics-eval
pip install .
cd ..
python3 -m mt_metrics_eval.mtme --download

3. Run ReMedy on WMT data

sbatch wmt/wmt22.sh
sbatch wmt/wmt23.sh
sbatch wmt/wmt24.sh

Results are directly comparable with the metrics reported in the WMT metrics shared tasks.


📚 Citation

If you use ReMedy, please cite the following paper:

@article{tan2024remedy,
  title={ReMedy: Learning Machine Translation Evaluation from Human Preferences with Reward Modeling},
  author={Tan, Shaomu and Monz, Christof},
  journal={arXiv preprint},
  year={2024}
}


Download files

Download the file for your platform.

Source Distribution
  • remedy_mt_eval-0.1.3.tar.gz (28.1 kB)

Built Distribution
  • remedy_mt_eval-0.1.3-py3-none-any.whl (34.3 kB)

File details

Details for the file remedy_mt_eval-0.1.3.tar.gz.

  • Download URL: remedy_mt_eval-0.1.3.tar.gz
  • Size: 28.1 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.2.0 CPython/3.12.7

Hashes for remedy_mt_eval-0.1.3.tar.gz
  • SHA256: fa5b74c0d8a3b746a118ca2b29206860cd2675830a95fd78595f4251926abbcb
  • MD5: fdda9da59ae212d2c9330051348fcc5c
  • BLAKE2b-256: d84a3669d515809ffe3818357deb7ad634a7439a4d6bcbd8b2d85979c10a8ce5

File details

Details for the file remedy_mt_eval-0.1.3-py3-none-any.whl.

  • Download URL: remedy_mt_eval-0.1.3-py3-none-any.whl
  • Size: 34.3 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.2.0 CPython/3.12.7

Hashes for remedy_mt_eval-0.1.3-py3-none-any.whl
  • SHA256: 3024ca11b393a7e434bce3e363e910392b0031e69193f572f8709981f6ed816e
  • MD5: fd4aaf1f6631e14187dfd97040903133
  • BLAKE2b-256: f05a9e6e3d8927c881fced8bdf4dfe124c343b5b57fa4fb434b1383b08024119
