
A comprehensive benchmark for hallucinations in multimodal large language models.


Measuring Epistemic Humility in Multimodal Large Language Models


📦 Installation

Install the latest release from PyPI:

pip install HumbleBench

🚀 Quickstart (Python API)

The following snippet demonstrates a minimal example to evaluate your model on HumbleBench.

from HumbleBench import download_dataset, evaluate
from HumbleBench.utils.entity import DataLoader

# Download the HumbleBench dataset
dataset = download_dataset()

# Prepare data loader (batch_size=16, no-noise images)
data = DataLoader(dataset=dataset,
                  batch_size=16,
                  use_noise_image=False,  # For HumbleBench-GN, set this to True
                  nota_only=False)        # For HumbleBench-E, set this to True

# Run inference
results = []
for batch in data:
    # Replace the next line with your model's inference method
    predictions = your_model.infer(batch)
    # predictions should be a list of dicts that mirror the batch entries,
    # each with an added 'prediction' key
    results.extend(predictions)

# Compute evaluation metrics
metrics = evaluate(
    input_data=results,
    model_name_or_path='YourModel',
    use_noise_image=False,  # For HumbleBench-GN, set this to True
    nota_only=False         # For HumbleBench-E, set this to True
)
print(metrics)
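Your model's `infer` can be anything that preserves the batch records and adds a `prediction` key. As a sanity check, a hypothetical stand-in (not part of HumbleBench; the field values below are illustrative) might look like:

```python
import random

# Hypothetical stand-in for your_model.infer (not part of HumbleBench):
# returns a copy of each record with an added 'prediction' key.
def dummy_infer(batch):
    return [{**record, "prediction": random.choice("ABCDE")} for record in batch]

batch = [{"question_id": "q1", "question": "What object is shown?",
          "type": "Object", "file_name": "images/q1.jpg", "label": "E"}]
results = dummy_infer(batch)
print(results[0]["prediction"])  # one of 'A'..'E'
```

A list in this shape can be passed directly to `evaluate` as `input_data`.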

If you prefer to reproduce the published results, load one of our provided JSONL files (under results/common, results/noise_image, or results/nota_only):

from HumbleBench.utils.io import load_jsonl
from HumbleBench import evaluate

path = 'results/common/Model_Name/Model_Name.jsonl'
data = load_jsonl(path)
metrics = evaluate(
    input_data=data,
    model_name_or_path='Model_Name',
    use_noise_image=False,  # For HumbleBench-GN, set this to True
    nota_only=False,        # For HumbleBench-E, set this to True
)
print(metrics)
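`evaluate` is the authoritative metric implementation; purely for intuition, a plain accuracy over records in this format reduces to comparing the `prediction` field against `label` (the records below are made up for illustration):

```python
# Illustrative only: plain accuracy over prediction/label pairs.
# evaluate() computes the official HumbleBench metrics.
records = [
    {"question_id": "q1", "label": "A", "prediction": "A"},
    {"question_id": "q2", "label": "E", "prediction": "B"},
    {"question_id": "q3", "label": "C", "prediction": "C"},
    {"question_id": "q4", "label": "D", "prediction": "E"},
]
accuracy = sum(r["prediction"] == r["label"] for r in records) / len(records)
print(accuracy)  # 0.5
```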

🧩 Advanced Usage: Command-Line Interface

⚠️WARNING⚠️: If you want to use our implemented models, please make sure you install each model's requirements yourself. We manage Python environments with Conda, so you may need to change env_name to the name of your own environment.

HumbleBench provides a unified CLI for seamless integration with any implementation of our model interface.

1. Clone the Repository

git clone git@github.com:maifoundations/HumbleBench.git
cd HumbleBench

2. Implement the Model Interface

Create a subclass of MultiModalModelInterface and define the infer method:

# my_model.py
from typing import Dict, List

from HumbleBench.models.base import register_model, MultiModalModelInterface

@register_model("YourModel")
class YourModel(MultiModalModelInterface):
    def __init__(self, model_name_or_path, **kwargs):
        super().__init__(model_name_or_path, **kwargs)
        # Load your model and processor here
        # Example:
        # self.model = ...
        # self.processor = ...

    def infer(self, batch: List[Dict]) -> List[Dict]:
        """
        Args:
            batch: List of dicts with keys:
                - label: one of 'A', 'B', 'C', 'D', 'E'
                - question: str
                - type: 'Object'/'Attribute'/'Relation'/...
                - file_name: path to image file
                - question_id: unique identifier
        Returns:
            List of dicts with an added 'prediction' key (str).
        """
        # Your inference code here
        return predictions
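`@register_model` presumably maps the name you pass to your class so the CLI can instantiate it by its `--model` name. A self-contained sketch of such a decorator-based registry (HumbleBench's actual implementation may differ):

```python
# Minimal decorator-based model registry, sketched for illustration;
# HumbleBench's real register_model may differ in its details.
MODEL_REGISTRY = {}

def register_model(name):
    def decorator(cls):
        MODEL_REGISTRY[name] = cls  # map CLI name -> class
        return cls
    return decorator

@register_model("YourModel")
class YourModel:
    def __init__(self, model_name_or_path, **kwargs):
        self.model_name_or_path = model_name_or_path

# The CLI can then look the class up by name and instantiate it
# with the params from configs/models.yaml:
model = MODEL_REGISTRY["YourModel"](model_name_or_path="/path/to/your/checkpoint")
```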

3. Configure Your Model

Edit configs/models.yaml to register your model and specify its weights:

models:
  YourModel:
    params:
      model_name_or_path: "/path/to/your/checkpoint"

4. Run Evaluation from the Shell

#!/bin/bash
export CUDA_VISIBLE_DEVICES=0,1,2,3

python main.py \
    --model "YourModel" \
    --config configs/models.yaml \
    --batch_size 16 \
    --log_dir results/common
    # Optionally append --use-noise and/or --nota-only
  • --model: Name registered via @register_model
  • --config: Path to your models.yaml
  • --batch_size: Inference batch size
  • --log_dir: Directory to save logs and results
  • --use-noise: Optional flag to assess HumbleBench-GN
  • --nota-only: Optional flag to assess HumbleBench-E

5. Contribute to HumbleBench!

🙇🏾🙇🏾🙇🏾

We have implemented many popular models in the models directory, along with corresponding shell scripts (including support for noise-image experiments) in the shell directory. If you’d like to add your own model to HumbleBench, feel free to open a Pull Request — we’ll review and merge it as soon as possible.

📮 Contact

For bug reports or feature requests, please open an issue or email us at bingkuitong@gmail.com.
