
llm-jp-eval-mm


llm-jp-eval-mm is a lightweight framework for evaluating visual-language models across various benchmark tasks, mainly focusing on Japanese tasks.

Overview of llm-jp-eval-mm

Getting Started

You can install llm-jp-eval-mm from GitHub or via PyPI.

  • Option 1: Clone from GitHub (Recommended)
git clone git@github.com:llm-jp/llm-jp-eval-mm.git
cd llm-jp-eval-mm
uv sync
  • Option 2: Install via PyPI
pip install eval_mm

To use LLM-as-a-Judge, configure your OpenAI API keys in a .env file:

  • For Azure: Set AZURE_OPENAI_ENDPOINT and AZURE_OPENAI_KEY
  • For OpenAI: Set OPENAI_API_KEY

If you are not using LLM-as-a-Judge, you can set these variables to any dummy value in the .env file to bypass the missing-key error.
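For example, a minimal .env might look like the following (all values shown are placeholders; use whichever backend you have access to):

```ini
# .env — values below are placeholders
OPENAI_API_KEY=your-api-key

# Or, for Azure:
# AZURE_OPENAI_ENDPOINT=https://your-resource.openai.azure.com/
# AZURE_OPENAI_KEY=your-azure-key
```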

Usage

To evaluate a model on a task, run the following command:

uv sync --group normal
uv run --group normal python examples/sample.py \
  --model_id llava-hf/llava-1.5-7b-hf \
  --task_id japanese-heron-bench  \
  --result_dir result  \
  --metrics heron-bench \
  --judge_model gpt-4o-2024-11-20 \
  --overwrite

The evaluation results will be saved in the result directory:

result
├── japanese-heron-bench
│   ├── llava-hf
│   │   ├── llava-1.5-7b-hf
│   │   │   ├── evaluation.jsonl
│   │   │   └── prediction.jsonl

To evaluate multiple models on multiple tasks, please check eval_all.sh.
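eval_all.sh contains the maintained model/task matrix. As a rough sketch, a loop over models and tasks could look like the following (the model and task ids are examples taken from this README, and in practice each model may require a different dependency group; this sketch only prints the commands it would run):

```shell
# Sketch of a model x task evaluation matrix; see eval_all.sh for the real list.
MODELS="llava-hf/llava-1.5-7b-hf Qwen/Qwen2.5-VL-7B-Instruct"
TASKS="japanese-heron-bench"
for model in $MODELS; do
  for task in $TASKS; do
    # Each model may need its own uv dependency group instead of "normal".
    cmd="uv run --group normal python examples/sample.py --model_id $model --task_id $task --result_dir result --metrics heron-bench"
    echo "$cmd"
  done
done
```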

Hello World Example

You can integrate llm-jp-eval-mm into your own code. Here's an example:

from PIL import Image
from eval_mm import TaskRegistry, ScorerRegistry, ScorerConfig

class MockVLM:
    def generate(self, images: list[Image.Image], text: str) -> str:
        return "宮崎駿"

task = TaskRegistry.load_task("japanese-heron-bench")
example = task.dataset[0]

input_text = task.doc_to_text(example)
images = task.doc_to_visual(example)
reference = task.doc_to_answer(example)

model = MockVLM()
prediction = model.generate(images, input_text)

scorer = ScorerRegistry.load_scorer(
    "rougel",
    ScorerConfig(docs=task.dataset)
)
result = scorer.aggregate(scorer.score([reference], [prediction]))
print(result)
# AggregateOutput(overall_score=5.128205128205128, details={'rougel': 5.128205128205128})

Leaderboard

To generate a leaderboard from your evaluation results, run:

python scripts/make_leaderboard.py --result_dir result

This will create a leaderboard.md file with your model performance:

| Model | Heron/LLM | JVB-ItW/LLM | JVB-ItW/Rouge |
|---|---|---|---|
| llm-jp/llm-jp-3-vila-14b | 68.03 | 4.08 | 52.4 |
| Qwen/Qwen2.5-VL-7B-Instruct | 70.29 | 4.28 | 29.63 |
| google/gemma-3-27b-it | 69.15 | 4.36 | 30.89 |
| microsoft/Phi-4-multimodal-instruct | 45.52 | 3.2 | 26.8 |
| gpt-4o-2024-11-20 | 93.7 | 4.44 | 32.2 |

The official leaderboard is available on the project website.

Supported Tasks

Japanese Tasks:

English Tasks:

Managing Dependencies

We use uv’s dependency groups to manage each model’s dependencies.

For example, to use llm-jp/llm-jp-3-vila-14b, run:

uv sync --group vilaja
uv run --group vilaja python examples/VILA_ja.py

See eval_all.sh for the complete list of model dependencies.

When adding a new group, remember to declare it as conflicting with the other model groups so uv does not try to resolve them together.
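In recent uv versions, conflicting groups are declared under tool.uv in pyproject.toml; a hypothetical fragment using two group names from this README might look like:

```toml
[tool.uv]
conflicts = [
    [{ group = "normal" }, { group = "vilaja" }],
]
```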

Browse Predictions with Streamlit

uv run streamlit run scripts/browse_prediction.py -- --task_id japanese-heron-bench --result_dir result --model_list llava-hf/llava-1.5-7b-hf


Development

Adding a new task

To add a new task, implement the Task class in src/eval_mm/tasks/task.py.
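The authoritative interface lives in src/eval_mm/tasks/task.py. Judging from the Hello World example above, a task exposes a dataset plus doc_to_text / doc_to_visual / doc_to_answer accessors. A standalone sketch (not using the real base class, so details may differ):

```python
# Hypothetical Task-like class; the real base class is defined in
# src/eval_mm/tasks/task.py and may differ. Method names follow the
# Hello World example.
class MyTask:
    def __init__(self):
        # A real task would load its benchmark dataset here.
        self.dataset = [
            {"question": "これは何ですか？", "image": None, "answer": "猫"},
        ]

    def doc_to_text(self, doc) -> str:
        # Build the prompt shown to the model.
        return doc["question"]

    def doc_to_visual(self, doc) -> list:
        # Return the images attached to this example (possibly empty).
        return [doc["image"]] if doc["image"] is not None else []

    def doc_to_answer(self, doc) -> str:
        # Reference answer used by the scorers.
        return doc["answer"]

task = MyTask()
example = task.dataset[0]
print(task.doc_to_text(example))  # これは何ですか？
```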

Adding a new metric

To add a new metric, implement the Scorer class in src/eval_mm/metrics/scorer.py.
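The real base class is in src/eval_mm/metrics/scorer.py. Mirroring the score/aggregate interface used in the Hello World example, a hypothetical exact-match scorer could look like this (standalone sketch; the actual Scorer API may differ):

```python
from dataclasses import dataclass

# Minimal stand-in for the library's aggregate result type.
@dataclass
class AggregateOutput:
    overall_score: float
    details: dict

class ExactMatchScorer:
    def score(self, refs: list[str], preds: list[str]) -> list[int]:
        # 1 if the prediction matches the reference exactly, else 0.
        return [int(r == p) for r, p in zip(refs, preds)]

    def aggregate(self, scores: list[int]) -> AggregateOutput:
        mean = sum(scores) / len(scores) if scores else 0.0
        return AggregateOutput(mean, {"exact-match": mean})

scorer = ExactMatchScorer()
out = scorer.aggregate(scorer.score(["宮崎駿", "東京"], ["宮崎駿", "大阪"]))
print(out.overall_score)  # 0.5
```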

Adding a new model

To add a new model, implement the VLM class in examples/base_vlm.py.

Adding a new dependency

Install a new dependency using the following command:

uv add <package_name>
uv add --group <group_name> <package_name>

Testing

Run the following commands to test tasks, metrics, and models:

bash test.sh
bash test_model.sh

Formatting and Linting

Ensure code consistency with:

uv run ruff format src
uv run ruff check --fix src

Releasing to PyPI

To release a new version:

git tag -a v0.x.x -m "version 0.x.x"
git push origin --tags

Updating the Website

For website updates, see github_pages/README.md.

To update leaderboard data:

python scripts/make_leaderboard.py --update_pages

Acknowledgements

  • Heron: We refer to the Heron code for the evaluation of the Japanese Heron Bench task.
  • lmms-eval: We refer to the lmms-eval code for the evaluation of the JMMMU and MMMU tasks.

We also thank the developers of the evaluation datasets for their hard work.
