eval-mm is a tool for evaluating Multi-Modal Large Language Models.
llm-jp-eval-mm
llm-jp-eval-mm is a lightweight framework for evaluating visual-language models across various benchmark tasks, mainly focusing on Japanese tasks.
Getting Started
You can install llm-jp-eval-mm from GitHub or via PyPI.
- Option 1: Clone from GitHub (Recommended)
git clone git@github.com:llm-jp/llm-jp-eval-mm.git
cd llm-jp-eval-mm
uv sync
- Option 2: Install via PyPI
pip install eval_mm
To use LLM-as-a-Judge, configure your OpenAI API keys in a .env file:
- For Azure: set AZURE_OPENAI_ENDPOINT and AZURE_OPENAI_KEY
- For OpenAI: set OPENAI_API_KEY
If you are not using LLM-as-a-Judge, you can set these variables to any placeholder value in the .env file to bypass the error.
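For reference, these keys can be checked at runtime with `os.getenv`. A minimal sketch (the helper function is hypothetical; the variable names follow the list above):

```python
import os

def get_judge_credentials() -> dict:
    """Collect the LLM-as-a-Judge credentials from the environment.

    Returns whichever of the Azure / OpenAI variables are set; unset
    variables come back as None.
    """
    return {
        "AZURE_OPENAI_ENDPOINT": os.getenv("AZURE_OPENAI_ENDPOINT"),
        "AZURE_OPENAI_KEY": os.getenv("AZURE_OPENAI_KEY"),
        "OPENAI_API_KEY": os.getenv("OPENAI_API_KEY"),
    }
```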
Usage
To evaluate a model on a task, run the following command:
uv sync --group normal
uv run --group normal python examples/sample.py \
--model_id llava-hf/llava-1.5-7b-hf \
--task_id japanese-heron-bench \
--result_dir result \
--metrics heron-bench \
--judge_model gpt-4o-2024-11-20 \
--overwrite
The evaluation results will be saved in the result directory:
result
├── japanese-heron-bench
│ ├── llava-hf
│ │ ├── llava-1.5-7b-hf
│ │ │ ├── evaluation.jsonl
│ │ │ └── prediction.jsonl
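The prediction.jsonl and evaluation.jsonl files are plain JSON Lines, so they can also be inspected without the framework. A sketch assuming one JSON object per line (the exact field names vary by task):

```python
import json
from pathlib import Path

def load_jsonl(path) -> list:
    """Read a JSON Lines file into a list of dicts, skipping blank lines."""
    records = []
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if line:
                records.append(json.loads(line))
    return records
```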
To evaluate multiple models on multiple tasks, please check eval_all.sh.
Hello World Example
You can integrate llm-jp-eval-mm into your own code. Here's an example:
from PIL import Image
from eval_mm import TaskRegistry, ScorerRegistry, ScorerConfig
class MockVLM:
    def generate(self, images: list[Image.Image], text: str) -> str:
        return "宮崎駿"
task = TaskRegistry.load_task("japanese-heron-bench")
example = task.dataset[0]
input_text = task.doc_to_text(example)
images = task.doc_to_visual(example)
reference = task.doc_to_answer(example)
model = MockVLM()
prediction = model.generate(images, input_text)
scorer = ScorerRegistry.load_scorer(
    "rougel",
    ScorerConfig(docs=task.dataset),
)
result = scorer.aggregate(scorer.score([reference], [prediction]))
print(result)
# AggregateOutput(overall_score=5.128205128205128, details={'rougel': 5.128205128205128})
Leaderboard
To generate a leaderboard from your evaluation results, run:
python scripts/make_leaderboard.py --result_dir result
This will create a leaderboard.md file with your model performance:
| Model | Heron/LLM | JVB-ItW/LLM | JVB-ItW/Rouge |
|---|---|---|---|
| llm-jp/llm-jp-3-vila-14b | 68.03 | 4.08 | 52.4 |
| Qwen/Qwen2.5-VL-7B-Instruct | 70.29 | 4.28 | 29.63 |
| google/gemma-3-27b-it | 69.15 | 4.36 | 30.89 |
| microsoft/Phi-4-multimodal-instruct | 45.52 | 3.2 | 26.8 |
| gpt-4o-2024-11-20 | 93.7 | 4.44 | 32.2 |
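As an illustration of the table format above, a minimal helper that renders scores as a Markdown table (this function and its input shape are hypothetical, not part of the framework; the real table is produced by scripts/make_leaderboard.py):

```python
def make_markdown_table(scores: dict, metrics: list) -> str:
    """Render {model: {metric: score}} as a Markdown table.

    Missing metrics are shown as "-".
    """
    lines = [
        "| Model | " + " | ".join(metrics) + " |",
        "|---|" + "---|" * len(metrics),
    ]
    for model, row in scores.items():
        cells = " | ".join(str(row.get(m, "-")) for m in metrics)
        lines.append(f"| {model} | {cells} |")
    return "\n".join(lines)
```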
The official leaderboard is available on the project website.
Supported Tasks
Japanese Tasks:
- Japanese Heron Bench
- JA-VG-VQA500
- JA-VLM-Bench-In-the-Wild
- JA-Multi-Image-VQA
- JDocQA
- JMMMU
- JIC-VQA
- MECHA-ja
- CC-OCR (multi_lan_ocr split, ja subset)
- CVQA (ja subset)
English Tasks:
Managing Dependencies
We use uv’s dependency groups to manage each model’s dependencies.
For example, to use llm-jp/llm-jp-3-vila-14b, run:
uv sync --group vilaja
uv run --group vilaja python examples/VILA_ja.py
See eval_all.sh for the complete list of model dependencies.
When adding a new group, remember to configure it as conflicting with incompatible groups (uv's conflict settings in pyproject.toml).
Browse Predictions with Streamlit
uv run streamlit run scripts/browse_prediction.py -- --task_id japanese-heron-bench --result_dir result --model_list llava-hf/llava-1.5-7b-hf
Development
Adding a new task
To add a new task, implement the Task class in src/eval_mm/tasks/task.py.
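A sketch of the expected shape, based on the methods used in the Hello World example above (doc_to_text, doc_to_visual, doc_to_answer); the actual Task base class in src/eval_mm/tasks/task.py may require additional hooks:

```python
class MyTask:
    """Minimal task skeleton; the real Task base class may define more hooks."""

    def __init__(self):
        # Load or download the benchmark dataset here; a toy in-memory
        # dataset stands in for a real one.
        self.dataset = [{"question": "What is shown?", "image": None, "answer": "a cat"}]

    def doc_to_text(self, doc: dict) -> str:
        """Return the text prompt for one example."""
        return doc["question"]

    def doc_to_visual(self, doc: dict) -> list:
        """Return the images for one example."""
        return [doc["image"]] if doc["image"] is not None else []

    def doc_to_answer(self, doc: dict) -> str:
        """Return the reference answer for one example."""
        return doc["answer"]
```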
Adding a new metric
To add a new metric, implement the Scorer class in src/eval_mm/metrics/scorer.py.
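The score/aggregate split follows the Hello World example above. A sketch of a hypothetical exact-match scorer (the real Scorer interface in src/eval_mm/metrics/scorer.py may differ):

```python
class ExactMatchScorer:
    """Hypothetical scorer: 1.0 when prediction equals reference, else 0.0."""

    def score(self, refs: list, preds: list) -> list:
        """Per-example scores for paired references and predictions."""
        return [1.0 if r == p else 0.0 for r, p in zip(refs, preds)]

    def aggregate(self, scores: list) -> float:
        """Mean of the per-example scores (0.0 for an empty list)."""
        return sum(scores) / len(scores) if scores else 0.0
```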
Adding a new model
To add a new model, implement the VLM class in examples/base_vlm.py.
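The interface mirrors the MockVLM shown earlier: a generate method that takes a list of images and a text prompt and returns a string. A placeholder sketch (the echoing behavior is obviously a stand-in for real inference):

```python
class EchoVLM:
    """Placeholder VLM that ignores the image contents and echoes the prompt."""

    def generate(self, images: list, text: str) -> str:
        # A real implementation would run model inference here.
        return f"[{len(images)} image(s)] {text}"
```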
Adding a new dependency
Install a new dependency using the following command:
uv add <package_name>
uv add --group <group_name> <package_name>
Testing
Run the following commands to test tasks, metrics, and models:
bash test.sh
bash test_model.sh
Formatting and Linting
Ensure code consistency with:
uv run ruff format src
uv run ruff check --fix src
Releasing to PyPI
To release a new version:
git tag -a v0.x.x -m "version 0.x.x"
git push origin --tags
Updating the Website
For website updates, see github_pages/README.md.
To update leaderboard data:
python scripts/make_leaderboard.py --update_pages
Acknowledgements
- Heron: We refer to the Heron code for the evaluation of the Japanese Heron Bench task.
- lmms-eval: We refer to the lmms-eval code for the evaluation of the JMMMU and MMMU tasks.
We also thank the developers of the evaluation datasets for their hard work.