
A comprehensive benchmark for hallucinations in multimodal large language models.


Measuring Epistemic Humility in Multimodal Large Language Models


📦 Installation

Install the latest release from PyPI:

pip install HumbleBench

🚀 Quickstart (Python API)

The following snippet demonstrates a minimal example to evaluate your model on HumbleBench.

from HumbleBench import download_dataset, evaluate
from HumbleBench.utils.entity import DataLoader

# Download the HumbleBench dataset
dataset = download_dataset()

# Prepare data loader (batch_size=16, no-noise images)
data = DataLoader(dataset=dataset,
                  batch_size=16,
                  use_noise_image=False,  # For HumbleBench-GN, set this to True
                  nota_only=False)        # For HumbleBench-E, set this to True

# Run inference
results = []
for batch in data:
    # Replace the next line with your model's inference method
    predictions = your_model.infer(batch)
    # predictions should be a list of dicts that mirror the batch entries,
    # each with an added 'prediction' key
    results.extend(predictions)

# Compute evaluation metrics
metrics = evaluate(
    input_data=results,
    model_name_or_path='YourModel',
    use_noise_image=False,  # For HumbleBench-GN, set this to True
    nota_only=False         # For HumbleBench-E, set this to True
)
print(metrics)
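Your model's `infer` can be anything that preserves the batch records and adds a `prediction` key. As a sanity check, a hypothetical stand-in (not part of HumbleBench; the field values below are illustrative) might look like:

```python
import random

# Hypothetical stand-in for your_model.infer (not part of HumbleBench):
# returns a copy of each record with an added 'prediction' key.
def dummy_infer(batch):
    return [{**record, "prediction": random.choice("ABCDE")} for record in batch]

batch = [{"question_id": "q1", "question": "What object is shown?",
          "type": "Object", "file_name": "images/q1.jpg", "label": "E"}]
results = dummy_infer(batch)
print(results[0]["prediction"])  # one of 'A'..'E'
```

A list in this shape can be passed directly to `evaluate` as `input_data`.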

If you prefer to reproduce the published results, load one of our provided JSONL files (under results/common, results/noise_image, or results/nota_only):

from HumbleBench.utils.io import load_jsonl
from HumbleBench import evaluate

path = 'results/common/Model_Name/Model_Name.jsonl'
data = load_jsonl(path)
metrics = evaluate(
    input_data=data,
    model_name_or_path='Model_Name',
    use_noise_image=False,  # For HumbleBench-GN, set this to True
    nota_only=False,        # For HumbleBench-E, set this to True
)
print(metrics)
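`evaluate` is the authoritative metric implementation; purely for intuition, a plain accuracy over records in this format reduces to comparing the `prediction` field against `label` (the records below are made up for illustration):

```python
# Illustrative only: plain accuracy over prediction/label pairs.
# evaluate() computes the official HumbleBench metrics.
records = [
    {"question_id": "q1", "label": "A", "prediction": "A"},
    {"question_id": "q2", "label": "E", "prediction": "B"},
    {"question_id": "q3", "label": "C", "prediction": "C"},
    {"question_id": "q4", "label": "D", "prediction": "E"},
]
accuracy = sum(r["prediction"] == r["label"] for r in records) / len(records)
print(accuracy)  # 0.5
```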

🧩 Advanced Usage: Command-Line Interface

⚠️WARNING⚠️: If you want to use our implemented models, please make sure you install each model's requirements yourself. We manage Python environments with Conda, so you may need to change env_name to the name of your own environment.

HumbleBench provides a unified CLI for seamless integration with any implementation of our model interface.

1. Clone the Repository

git clone git@github.com:maifoundations/HumbleBench.git
cd HumbleBench

2. Implement the Model Interface

Create a subclass of MultiModalModelInterface and define the infer method:

# my_model.py
from typing import Dict, List

from HumbleBench.models.base import register_model, MultiModalModelInterface

@register_model("YourModel")
class YourModel(MultiModalModelInterface):
    def __init__(self, model_name_or_path, **kwargs):
        super().__init__(model_name_or_path, **kwargs)
        # Load your model and processor here
        # Example:
        # self.model = ...
        # self.processor = ...

    def infer(self, batch: List[Dict]) -> List[Dict]:
        """
        Args:
            batch: List of dicts with keys:
                - label: one of 'A', 'B', 'C', 'D', 'E'
                - question: str
                - type: 'Object'/'Attribute'/'Relation'/...
                - file_name: path to image file
                - question_id: unique identifier
        Returns:
            List of dicts with an added 'prediction' key (str).
        """
        # Your inference code here
        return predictions
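`@register_model` presumably maps the name you pass to your class so the CLI can instantiate it by its `--model` name. A self-contained sketch of such a decorator-based registry (HumbleBench's actual implementation may differ):

```python
# Minimal decorator-based model registry, sketched for illustration;
# HumbleBench's real register_model may differ in its details.
MODEL_REGISTRY = {}

def register_model(name):
    def decorator(cls):
        MODEL_REGISTRY[name] = cls  # map CLI name -> class
        return cls
    return decorator

@register_model("YourModel")
class YourModel:
    def __init__(self, model_name_or_path, **kwargs):
        self.model_name_or_path = model_name_or_path

# The CLI can then look the class up by name and instantiate it
# with the params from configs/models.yaml:
model = MODEL_REGISTRY["YourModel"](model_name_or_path="/path/to/your/checkpoint")
```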

3. Configure Your Model

Edit configs/models.yaml to register your model and specify its weights:

models:
  YourModel:
    params:
      model_name_or_path: "/path/to/your/checkpoint"

4. Run Evaluation from the Shell

#!/bin/bash
export CUDA_VISIBLE_DEVICES=0,1,2,3

python main.py \
    --model "YourModel" \
    --config configs/models.yaml \
    --batch_size 16 \
    --log_dir results/common
    # Optionally append --use-noise and/or --nota-only
  • --model: Name registered via @register_model
  • --config: Path to your models.yaml
  • --batch_size: Inference batch size
  • --log_dir: Directory to save logs and results
  • --use-noise: Optional flag to assess HumbleBench-GN
  • --nota-only: Optional flag to assess HumbleBench-E

5. Contribute to HumbleBench!

🙇🏾🙇🏾🙇🏾

We have implemented many popular models in the models directory, along with corresponding shell scripts (including support for noise-image experiments) in the shell directory. If you’d like to add your own model to HumbleBench, feel free to open a Pull Request — we’ll review and merge it as soon as possible.

📮 Contact

For bug reports or feature requests, please open an issue or email us at bingkuitong@gmail.com.
