A multi-metric translation evaluation tool

📊 MultiMetric-Eval

A multi-metric translation evaluation tool that computes BLEU, chrF++, COMET, and BLEURT with a single line of code. It accepts both text and speech input.

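🚀 Installation

# Base install (BLEU + chrF++)
pip install multimetric-eval

# Optional dependencies, installed as needed
pip install multimetric-eval[comet]     # + COMET metric
pip install multimetric-eval[whisper]   # + speech-to-text
pip install multimetric-eval[all]       # everything

BLEURT needs extra steps (it must be installed from the GitHub source):

pip install multimetric-eval[bleurt]    # installs TensorFlow
pip install git+https://github.com/google-research/bleurt.git  # installs BLEURT
# You also need to download a BLEURT model checkpoint (e.g. BLEURT-20) and pass its location as bleurt_path in your code

⚠️ Environment note: BLEURT depends on TensorFlow and COMET depends on PyTorch. To avoid GPU memory conflicts, the tool automatically forces BLEURT to run on the CPU, while COMET and Whisper run on the GPU.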

📖 Quick start

from multimetric_eval import ModelEvaluator

# Initialize (the COMET model is downloaded automatically on first use)
evaluator = ModelEvaluator()

# Evaluate
results = evaluator.evaluate(
    hypothesis=["The cat sits on the mat."],
    reference=["The cat is sitting on the mat."],
    source=["猫坐在垫子上。"]
)

print(results)
# {'sacreBLEU': 45.23, 'chrF++': 62.15, 'COMET': 0.8523}

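🎯 Three evaluation modes

The tool accepts two kinds of input, text (target_text) and speech (target_speech), which can be used separately or together:

| Mode | Input | Output metrics |
|------|-------|----------------|
| Text only | `target_text` | `sacreBLEU`, `chrF++`, `COMET`, `BLEURT` |
| Speech only | `target_speech` | `sacreBLEU_ASR`, `chrF++_ASR`, `COMET_ASR`, `BLEURT_ASR` |
| Dual mode | Both together | All of the above |

Text-only evaluation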

results = evaluator.evaluate_all(
    reference=["Reference translation."],
    source=["源文本。"],
    target_text=["My translation."],
)
# {'sacreBLEU': 45.2, 'chrF++': 62.1, 'COMET': 0.85, 'hypothesis_text': [...]}

Speech-only evaluation

evaluator = ModelEvaluator(use_comet=True, use_whisper=True)

results = evaluator.evaluate_all(
    reference=["Reference translation."],
    source=["源文本。"],
    target_speech="./my_audio/",
)
# {'sacreBLEU_ASR': 38.1, 'chrF++_ASR': 55.3, 'COMET_ASR': 0.78, 'hypothesis_ASR': [...]}

Dual-mode evaluation (text and speech together)

evaluator = ModelEvaluator(use_comet=True, use_whisper=True)

results = evaluator.evaluate_all(
    reference=["Reference translation."],
    source=["源文本。"],
    target_text=["My translation."],
    target_speech="./my_audio/",
)
# {
#     'sacreBLEU': 45.2,        'sacreBLEU_ASR': 38.1,
#     'chrF++': 62.1,           'chrF++_ASR': 55.3,
#     'COMET': 0.85,            'COMET_ASR': 0.78,
#     'hypothesis_text': [...],
#     'hypothesis_ASR': [...],
# }

📌 When both are supplied, target_text and target_speech must be two forms of the same batch of samples, and their counts must match.
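
The count requirement can be checked before any model is loaded. The following is a minimal sketch, not part of the package: `check_paired_inputs` is a hypothetical helper, and it assumes the audio folder is matched only against the formats this README lists (.wav, .mp3, .flac).

```python
from pathlib import Path

# Audio formats listed in this README.
AUDIO_EXTENSIONS = {".wav", ".mp3", ".flac"}


def check_paired_inputs(target_text, audio_folder):
    """Return the sorted audio files, or raise ValueError on a count mismatch.

    Hypothetical helper for illustration; not part of multimetric_eval.
    """
    audio_files = sorted(
        p for p in Path(audio_folder).iterdir()
        if p.suffix.lower() in AUDIO_EXTENSIONS
    )
    if len(audio_files) != len(target_text):
        raise ValueError(
            f"{len(target_text)} text hypotheses but {len(audio_files)} audio files"
        )
    return audio_files
```

Calling it with the same `target_text` list and folder that you intend to pass to `evaluate_all` surfaces a mismatch early, before Whisper or COMET is initialized.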

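📁 Using the built-in dataset

from multimetric_eval import ModelEvaluator, load_dataset

# Load the built-in dataset (downloaded automatically to ./datasets/)
# If the download fails because of network problems, fetch it manually:
# https://github.com/sjtuayj/MultiMetric-Eval/releases/download/v0.1.0/zh-en-littleprince.zip
# Then unzip it and place the zh-en-littleprince folder under ./datasets/
dataset = load_dataset("zh-en-littleprince")

Option 1: pass a list of texts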

evaluator = ModelEvaluator(use_comet=True)

results = evaluator.evaluate_dataset(
    dataset=dataset,
    target_text=["Translation 1", "Translation 2", ...],
)

Option 2: pass a JSON / TXT file

results = evaluator.evaluate_dataset(
    dataset=dataset,
    target_text="translations.json",  # or "translations.txt"
)

Option 3: pass an audio folder

evaluator = ModelEvaluator(use_comet=True, use_whisper=True)

results = evaluator.evaluate_dataset(
    dataset=dataset,
    target_speech="./my_audio/",
)

Option 4: pass both text and speech

evaluator = ModelEvaluator(use_comet=True, use_whisper=True)

results = evaluator.evaluate_dataset(
    dataset=dataset,
    target_text=["Translation 1", "Translation 2", ...],
    target_speech="./my_audio/",
)
# Returns two sets of metrics: sacreBLEU / sacreBLEU_ASR, chrF++ / chrF++_ASR, ...

📂 Using a custom dataset

from multimetric_eval import ModelEvaluator

evaluator = ModelEvaluator(use_comet=True)

reference = ["Reference 1", "Reference 2"]
source = ["源文本1", "源文本2"]  # required by COMET

Text-only evaluation

# Pass a list
results = evaluator.evaluate(
    hypothesis=["Translation 1", "Translation 2"],
    reference=reference,
    source=source,
)

# Pass a file
results = evaluator.evaluate_file(
    hypothesis_file="translations.json",  # or .txt
    reference=reference,
    source=source,
)

Speech-only evaluation

evaluator = ModelEvaluator(use_comet=True, use_whisper=True)

results = evaluator.evaluate_audio_folder(
    audio_folder="./my_audio/",
    reference=reference,
    source=source,
)

Dual-mode evaluation (unified interface)

evaluator = ModelEvaluator(use_comet=True, use_whisper=True)

results = evaluator.evaluate_all(
    reference=reference,
    source=source,
    target_text=["Translation 1", "Translation 2"],  # or a file path
    target_speech="./my_audio/",
)

📄 Input file formats

JSON files (all three layouts are supported)

Format 1: dictionary

{
    "hypothesis": [
        "Translation sentence 1.",
        "Translation sentence 2."
    ]
}

Format 2: array of objects

[
    {"id": "001", "hypothesis": "Translation sentence 1."},
    {"id": "002", "hypothesis": "Translation sentence 2."}
]

Format 3: plain array of strings

[
    "Translation sentence 1.",
    "Translation sentence 2."
]

TXT files

One sentence per line; blank lines are ignored automatically:

Translation sentence 1.
Translation sentence 2.
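
The three JSON layouts and the TXT layout all reduce to a plain list of strings. A minimal sketch of such a normalization step follows. `load_hypotheses` is a hypothetical helper written for illustration; the package's own loader may behave differently, for example in how it handles malformed files.

```python
import json
from pathlib import Path


def load_hypotheses(path):
    """Return a list of hypothesis strings from a .json or .txt file.

    Hypothetical helper for illustration; not part of multimetric_eval.
    """
    path = Path(path)
    text = path.read_text(encoding="utf-8")

    if path.suffix.lower() == ".txt":
        # One sentence per line; blank lines are skipped.
        return [line.strip() for line in text.splitlines() if line.strip()]

    data = json.loads(text)
    if isinstance(data, dict):
        # Format 1: {"hypothesis": [...]}
        return list(data["hypothesis"])
    if data and isinstance(data[0], dict):
        # Format 2: [{"id": ..., "hypothesis": ...}, ...]
        return [item["hypothesis"] for item in data]
    # Format 3: ["...", "..."]
    return list(data)
```

All four file layouts shown in this section yield the same two-element list when passed through this function.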

Audio folder

my_audio/
├── 001.wav
├── 002.wav
├── 003.mp3
└── 004.flac

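Supported formats: .wav, .mp3, .flac
Sort order: files are sorted automatically by file name, so make sure that order matches the order of the reference translations
Naming tip: use numeric prefixes such as 001.wav, 002.wav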
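
The naming tip matters because sorting by file name usually means comparing names as strings. The sketch below assumes plain lexicographic ordering, which this README does not spell out, and shows how unpadded numbers end up out of order. `padded_name` is a hypothetical helper for generating safe names.

```python
def padded_name(index, extension=".wav", width=3):
    """Build a zero-padded file name such as 001.wav (hypothetical helper)."""
    return f"{index:0{width}d}{extension}"


unpadded = ["10.wav", "2.wav", "1.wav"]
padded = [padded_name(10), padded_name(2), padded_name(1)]

# String comparison puts "10.wav" before "2.wav".
print(sorted(unpadded))  # ['1.wav', '10.wav', '2.wav']

# Zero-padded names sort in numeric order.
print(sorted(padded))    # ['001.wav', '002.wav', '010.wav']
```

A width of 3 covers up to 999 files; larger collections need a wider prefix.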

⚙️ Configuration

Evaluator parameters

evaluator = ModelEvaluator(
    use_comet=True,                        # enable COMET (requires source)
    use_bleurt=False,                      # enable BLEURT
    use_whisper=False,                     # enable speech-to-text
    comet_model="Unbabel/wmt22-comet-da",  # COMET model
    whisper_model="medium",                # tiny/base/small/medium/large
    bleurt_path=None,                      # path to the BLEURT model
    device=None,                           # cuda/cuda:0/cuda:1/cpu; auto-detected by default
)

| Parameter | Type | Default | Description |
|-----------|------|---------|-------------|
| `use_comet` | bool | `True` | Enable the COMET metric |
| `use_bleurt` | bool | `False` | Enable the BLEURT metric |
| `use_whisper` | bool | `False` | Enable speech-to-text |
| `comet_model` | str | `"Unbabel/wmt22-comet-da"` | COMET model name |
| `whisper_model` | str | `"medium"` | Whisper model size |
| `bleurt_path` | str | `None` | Local path to the BLEURT model |
| `device` | str | `None` | Compute device; `cuda:N` selects a specific GPU |

Dataset parameters

dataset = load_dataset(
    name="zh-en-littleprince",   # dataset name
    cache_dir="./datasets",      # cache directory
    force_download=False,        # force a fresh download
)

🖥️ Selecting a GPU

Option 1: in code

# Use GPU 0
evaluator = ModelEvaluator(device="cuda:0")

# Use GPU 3
evaluator = ModelEvaluator(device="cuda:3")

# Force CPU
evaluator = ModelEvaluator(device="cpu")

# Automatic selection (default)
evaluator = ModelEvaluator()

Option 2: environment variable on the command line

# Use GPU 2 only
CUDA_VISIBLE_DEVICES=2 python my_eval.py

# Use GPUs 0 and 1
CUDA_VISIBLE_DEVICES=0,1 python my_eval.py

# Disable GPUs and force CPU
CUDA_VISIBLE_DEVICES="" python my_eval.py

Option 3: argparse in a script

import argparse
from multimetric_eval import ModelEvaluator

parser = argparse.ArgumentParser()
parser.add_argument("--device", type=str, default=None, help="Device to use, e.g. cuda:0, cuda:1, cpu")
args = parser.parse_args()

evaluator = ModelEvaluator(device=args.device)

python my_eval.py --device cuda:2
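
One detail is worth keeping in mind when combining the environment variable with the other two options: CUDA renumbers the visible devices from zero inside the process, so after `CUDA_VISIBLE_DEVICES=2` the only valid device string is `cuda:0`. The sketch below illustrates that mapping. `visible_to_physical` is a hypothetical helper for illustration only, and it assumes a comma-separated list of integer GPU ids.

```python
def visible_to_physical(cuda_visible_devices, logical_index):
    """Return the physical GPU id behind cuda:<logical_index>.

    Hypothetical helper; raises IndexError when the logical index
    is not visible to the process.
    """
    ids = [part.strip() for part in cuda_visible_devices.split(",") if part.strip()]
    if not 0 <= logical_index < len(ids):
        raise IndexError(
            f"cuda:{logical_index} is not visible with CUDA_VISIBLE_DEVICES={cuda_visible_devices!r}"
        )
    return int(ids[logical_index])


print(visible_to_physical("2", 0))    # 2
print(visible_to_physical("0,1", 1))  # 1
print(visible_to_physical("3,1", 0))  # 3
```

In other words, `CUDA_VISIBLE_DEVICES=2 python my_eval.py --device cuda:2` asks for a device that does not exist in that process; use `--device cuda:0` or omit the flag.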

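📊 Supported metrics

| Metric | Description | Needs source | Extra install |
|--------|-------------|--------------|---------------|
| sacreBLEU | Standard BLEU score | ❌ | ❌ |
| chrF++ | Character-level F-score | ❌ | ❌ |
| COMET | Neural evaluation metric | ✅ | `pip install unbabel-comet` |
| BLEURT | Google BLEURT | ❌ | `pip install tensorflow` + BLEURT source + model files |

When speech input (ASR) is involved, each of the metrics above is additionally reported with an _ASR suffix.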
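
For readers unfamiliar with what a character-level F-score measures, here is a deliberately simplified sketch. It is not sacreBLEU's implementation and will not reproduce its numbers: it averages a per-order F-beta over character n-grams only, whereas chrF++ additionally counts word n-grams. The function names are hypothetical, and `max_n=6` with `beta=2.0` are chosen to mirror commonly used chrF settings.

```python
from collections import Counter


def char_ngrams(text, n):
    """Count character n-grams of length n after removing spaces."""
    chars = text.replace(" ", "")
    return Counter(chars[i:i + n] for i in range(len(chars) - n + 1))


def simple_chrf(hypothesis, reference, max_n=6, beta=2.0):
    """Average character n-gram F-beta over n = 1..max_n, scaled to 0-100."""
    scores = []
    for n in range(1, max_n + 1):
        hyp = char_ngrams(hypothesis, n)
        ref = char_ngrams(reference, n)
        if not hyp or not ref:
            continue  # one of the strings is too short for this order
        overlap = sum((hyp & ref).values())  # clipped n-gram matches
        precision = overlap / sum(hyp.values())
        recall = overlap / sum(ref.values())
        if precision + recall == 0:
            scores.append(0.0)
            continue
        scores.append(
            (1 + beta**2) * precision * recall / (beta**2 * precision + recall)
        )
    return 100 * sum(scores) / len(scores) if scores else 0.0
```

With `beta=2.0` recall weighs more than precision, so a hypothesis that omits reference content is penalized more than one that adds extra characters.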

📤 Output

Text-only input

{
    "sacreBLEU": 45.23,
    "chrF++": 62.15,
    "COMET": 0.8523,          # when use_comet=True
    "BLEURT": 0.7234,         # when use_bleurt=True
    "hypothesis_text": [...], # returned by evaluate_all / evaluate_dataset
}

Speech-only input

{
    "sacreBLEU_ASR": 38.12,
    "chrF++_ASR": 55.30,
    "COMET_ASR": 0.7823,
    "BLEURT_ASR": 0.6534,
    "hypothesis_ASR": [...],  # Whisper transcriptions
}

Dual-mode input

{
    "sacreBLEU": 45.23,        "sacreBLEU_ASR": 38.12,
    "chrF++": 62.15,           "chrF++_ASR": 55.30,
    "COMET": 0.8523,           "COMET_ASR": 0.7823,
    "BLEURT": 0.7234,          "BLEURT_ASR": 0.6534,
    "hypothesis_text": [...],
    "hypothesis_ASR": [...],
}
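
Because the dual-mode result is a flat dictionary, pairing each text metric with its `_ASR` counterpart takes only a few lines. The sketch below is an illustration built on the key names shown above; `asr_gap` is a hypothetical helper, not a function of the package, and the sample values are the placeholder numbers from this README.

```python
def asr_gap(results):
    """Return {metric: text_score - asr_score} for metrics present in both forms."""
    gaps = {}
    for key, value in results.items():
        asr_key = f"{key}_ASR"
        if asr_key in results and isinstance(value, (int, float)):
            gaps[key] = round(value - results[asr_key], 4)
    return gaps


example = {
    "sacreBLEU": 45.23, "sacreBLEU_ASR": 38.12,
    "chrF++": 62.15, "chrF++_ASR": 55.30,
    "COMET": 0.8523, "COMET_ASR": 0.7823,
    "hypothesis_text": ["..."],
    "hypothesis_ASR": ["..."],
}
print(asr_gap(example))
```

The list-valued `hypothesis_text` and `hypothesis_ASR` entries are skipped because neither has a matching `<key>_ASR` partner.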

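📋 API summary

| Method | Purpose | Input |
|--------|---------|-------|
| `evaluate()` | Text-only evaluation | `hypothesis` list |
| `evaluate_file()` | Evaluation from a file | JSON / TXT file path |
| `evaluate_audio_folder()` | Speech-only evaluation | Audio folder path |
| `evaluate_all()` | Unified interface (custom data) | `target_text` and/or `target_speech` |
| `evaluate_dataset()` | Unified interface (built-in dataset) | `target_text` and/or `target_speech` |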

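🔧 Advanced usage

Using a context manager (frees GPU memory automatically)

from multimetric_eval import ModelEvaluator

with ModelEvaluator(use_comet=True) as evaluator:
    results = evaluator.evaluate(
        hypothesis=["Translation"],
        reference=["Reference"],
        source=["源文本"],
    )
# GPU memory is released automatically when the with block exits

Creating a custom dataset from a local JSON file

from multimetric_eval import create_dataset_from_json

# Expected layout of my_data.json: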
# [
#     {"id": "001", "source_text": "源文本1", "reference_text": "Ref 1"},
#     {"id": "002", "source_text": "源文本2", "reference_text": "Ref 2"}
# ]

dataset = create_dataset_from_json("./my_data.json")

results = evaluator.evaluate_dataset(
    dataset=dataset,
    target_text=["Translation 1", "Translation 2"],
)
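
If your sources and references live in two Python lists, a file in the layout shown above can be produced with the standard library. This is a sketch under assumptions: `write_dataset_json` is a hypothetical helper, and the three-digit `id` scheme simply copies the example rather than a documented requirement.

```python
import json


def write_dataset_json(path, sources, references):
    """Write records with id / source_text / reference_text and return them."""
    if len(sources) != len(references):
        raise ValueError("sources and references must have the same length")
    records = [
        {"id": f"{i:03d}", "source_text": src, "reference_text": ref}
        for i, (src, ref) in enumerate(zip(sources, references), start=1)
    ]
    with open(path, "w", encoding="utf-8") as f:
        # ensure_ascii=False keeps Chinese source text readable in the file.
        json.dump(records, f, ensure_ascii=False, indent=4)
    return records
```

The resulting file can then be passed to `create_dataset_from_json` as in the example above.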

Listing available datasets

from multimetric_eval import list_datasets, get_dataset_info

print(list_datasets())
# ['zh-en-littleprince']

info = get_dataset_info("zh-en-littleprince")
print(info)
# {
#     'name': 'zh-en-littleprince',
#     'is_downloaded': True,
#     'num_samples': 54,
#     'audio_complete': True
# }

Backward compatibility (legacy parameters)

# The legacy spellings below still work
results = evaluator.evaluate_dataset(
    dataset=dataset,
    hypothesis=["Translation 1", ...],   # equivalent to target_text
)

results = evaluator.evaluate_dataset(
    dataset=dataset,
    audio_folder="./my_audio/",          # equivalent to target_speech
)
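
For context, keyword aliasing of this kind is often implemented by renaming legacy keys before the real work starts. The sketch below shows one common pattern. It is not taken from the package's source, and `normalize_kwargs` and `LEGACY_ALIASES` are hypothetical names.

```python
# Legacy keyword -> current keyword, as described in this README.
LEGACY_ALIASES = {
    "hypothesis": "target_text",
    "audio_folder": "target_speech",
}


def normalize_kwargs(**kwargs):
    """Rename legacy keywords and reject calls that pass both spellings."""
    normalized = {}
    for name, value in kwargs.items():
        new_name = LEGACY_ALIASES.get(name, name)
        if new_name in normalized:
            raise TypeError(f"got multiple values for {new_name!r}")
        normalized[new_name] = value
    return normalized


print(normalize_kwargs(hypothesis=["Translation 1"]))
# {'target_text': ['Translation 1']}
```

Rejecting a call that supplies both the old and the new name avoids silently preferring one of them.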

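❓ FAQ

Q: The COMET score shows -1.0?

A: Make sure you passed the source argument; COMET needs the source text.

Q: CUDA out of memory?

A: Use the context manager, or call evaluator.cleanup() manually, to free GPU memory. You can also pick an idle GPU with device="cuda:N".

Q: How do I use only the basic metrics?

A: Set use_comet=False to compute only sacreBLEU and chrF++; no model download is needed.

Q: The audio files are in the wrong order?

A: Name the files with numeric prefixes such as 001.wav, 002.wav so that they sort correctly.

Q: Do BLEURT and COMET conflict in the same environment?

A: The tool automatically forces BLEURT (TensorFlow) onto the CPU and runs COMET (PyTorch) on the GPU, so no manual handling is needed.

Q: The COMET download fails on a server in mainland China?

A: Add this at the very top of your code:

import os
os.environ["HF_ENDPOINT"] = "https://hf-mirror.com"

Q: How do I choose which GPU to use?

A: Set device="cuda:N" in code, or run CUDA_VISIBLE_DEVICES=N python script.py on the command line.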

📜 License

MIT License

🤝 Contributing

Issues and pull requests are welcome!

GitHub: https://github.com/sjtuayj/MultiMetric-Eval


---

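### Main changes compared with the previous version

| Item | Change |
|------|------|
| **Version** | `0.1.0` → `0.2.0` |
| **pyproject.toml** | Added `readme`, `license`, `authors`, `keywords`, `classifiers`, `[project.urls]`, and the `bleurt` optional dependency |
| **README addition: three evaluation modes** | Describes the inputs and outputs of text-only / speech-only / dual mode in detail |
| **README addition: GPU selection** | Three ways: in code, environment variable, argparse |
| **README addition: `_ASR` suffix** | The output section shows the return format for all three modes |
| **README addition: API summary table** | Compares the purpose and input of the 5 public methods |
| **README addition: backward compatibility** | The legacy parameters `hypothesis` / `audio_folder` still work |
| **README addition: FAQ** | BLEURT conflicts, China mirror, GPU selection, and other common questions |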
