Token-set choice audit for POPE VLM hallucination evaluation

Token-Set Choice Confounds POPE

A systematic audit of yes/no token extraction in VLM hallucination evaluation. Companion code, predictions, and figures for the paper Token-Set Choice Confounds POPE: A Systematic Audit of Yes/No Extraction in VLM Hallucination Evaluation (preprint, 2026).

Install

The audit utilities ship as a lightweight package on PyPI:

pip install pope-audit

from pope_audit import YES_TOKEN_IDS, NO_TOKEN_IDS, compute_f1

# The eight-token rule for LLaMA-family VLMs on POPE (see below):
YES_TOKEN_IDS  # [3582, 8241, 4874, 3869]  -> yes, Yes, ' yes', ' Yes'
NO_TOKEN_IDS   # [1217, 3782,  694, 1939]  -> no,  No,  ' no',  ' No'

pope-audit itself depends only on transformers, sentencepiece, and Pillow. The GPU stack (torch, bitsandbytes, ...) is intentionally not a hard dependency, because the right torch wheel is platform/CUDA-specific; install it from pytorch.org first if you want to run the model-loading paths. For the full research stack used to reproduce the paper, install the extra and clone this repo (see Reproducing the headline numbers):

pip install "pope-audit[research]"

The 9,000 per-question prediction records, diagnostics, and POPE question manifests that back every number in the paper are published as a companion dataset on the Hugging Face Hub: 🤗 kesav2k04/pope-audit-records.

from datasets import load_dataset

pope = load_dataset("kesav2k04/pope-audit-records", "pope_questions", split="adversarial")

TL;DR

POPE's reference evaluator reads the model's first generated token by checking two specific vocabulary IDs: 3582 (yes) and 1217 (no). In 9,000 greedy-decode runs of LLaVA-1.5-7B across the three POPE splits, those two IDs never appeared. The model instead emits ' Yes' (token 3869) and ' No' (token 1939), because after the prompt suffix ASSISTANT: the SentencePiece tokenizer produces the word-initial, leading-space variant of the answer. Reading the wrong IDs shifts the LLaVA-1.5-7B adversarial F1 by 6.13 points (0.7608 vs. 0.8221), more than the headline gain claimed by several recent inference-time methods.
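
To make the failure mode concrete, here is a toy sketch with invented logit values. It assumes the two-token readout compares the logits at IDs 3582 and 1217 directly; the helper name read_answer is hypothetical and is not part of pope-audit.

```python
# Token-ID sets quoted in this README; the logit values below are invented.
YES_IDS_2TOK, NO_IDS_2TOK = [3582], [1217]
YES_IDS_8TOK = [3582, 8241, 4874, 3869]
NO_IDS_8TOK = [1217, 3782, 694, 1939]


def read_answer(logits, yes_ids, no_ids):
    """Return 'yes' if the best yes-token logit beats the best no-token logit."""
    yes_score = max(logits[i] for i in yes_ids)
    no_score = max(logits[i] for i in no_ids)
    return "yes" if yes_score > no_score else "no"


# Hypothetical first-token logits: the model clearly prefers ' Yes' (3869),
# while the bare 'yes' / 'no' IDs carry only low, noisy scores.
toy_logits = {3582: -2.1, 8241: 0.3, 4874: 1.0, 3869: 9.4,
              1217: -1.7, 3782: 0.1, 694: 0.8, 1939: 6.2}

two_tok = read_answer(toy_logits, YES_IDS_2TOK, NO_IDS_2TOK)
eight_tok = read_answer(toy_logits, YES_IDS_8TOK, NO_IDS_8TOK)
greedy_token = max(toy_logits, key=toy_logits.get)

print(two_tok, eight_tok, greedy_token)  # no yes 3869
```

In this toy case the two-token readout reports "no" even though greedy decoding would emit ' Yes', which is the kind of disagreement the audit measures.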

This repository contains the evaluation pipeline, the 9,000 prediction records, the diagnostic script, and the LaTeX / HTML versions of the paper.

Headline numbers (all from `experiments/`)

Split	Two-token F1	Eight-token F1	Gap	Yes-rate (8-tok)	Recall (8-tok)
Adversarial	0.7608	0.8221	+0.0613	0.452	0.7827
Popular	0.7961	0.8498	+0.0537	0.434	0.7940
Random	0.8397	0.8713	+0.0316	0.411	0.7940

The corrected eight-token baseline of 0.8221 on the adversarial split is stable: in the full 3,000-question run it converges to 0.8219 by question 1,000 and stays within 0.820 to 0.825 through question 3,000.
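
A stability check of this kind can be sketched as a cumulative F1 over the per-question records. The snippet below is a minimal illustration on a six-question toy list; the function name running_f1 and the "yes"/"no" string encoding are assumptions, not the schema of the published records.

```python
def running_f1(predictions, labels):
    """Yield cumulative F1 (positive class = 'yes') after each question."""
    tp = fp = fn = 0
    for pred, label in zip(predictions, labels):
        if pred == "yes" and label == "yes":
            tp += 1
        elif pred == "yes" and label == "no":
            fp += 1
        elif pred == "no" and label == "yes":
            fn += 1
        denom = 2 * tp + fp + fn
        # F1 is undefined before any positive prediction or label; report 0.0.
        yield 2 * tp / denom if denom else 0.0


# Toy data standing in for 3,000 real prediction records.
preds = ["yes", "no", "yes", "yes", "no", "no"]
labels = ["yes", "yes", "no", "yes", "no", "no"]
curve = list(running_f1(preds, labels))

print([round(v, 3) for v in curve])  # [1.0, 0.667, 0.5, 0.667, 0.667, 0.667]
```

On real records, the claim above corresponds to the tail of this curve staying inside a narrow band once enough questions have been seen.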

The eight-token rule

For LLaMA-family VLMs evaluated on POPE under the prompt template USER: <image>\n{question}\nASSISTANT:, we recommend:

YES_TOKEN_IDS = [3582, 8241, 4874, 3869]   # yes, Yes, ' yes', ' Yes'
NO_TOKEN_IDS  = [1217, 3782,  694, 1939]   # no,  No,  ' no',  ' No'

# Decision rule (matches greedy decoding by construction)
yes_score = max(logits[i] for i in YES_TOKEN_IDS)
no_score  = max(logits[i] for i in NO_TOKEN_IDS)
prediction = "yes" if yes_score > no_score else "no"

Token-set choice is context-dependent. For any new model-template combination we recommend running the diagnostic in scripts/run_pope_2tok_baseline.py on a 100-question sample first; if the argmax token of any question is not in YES_TOKEN_IDS or NO_TOKEN_IDS, add it before trusting the rule.
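
The coverage check described above can be sketched in a few lines. This is not the logic of scripts/run_pope_2tok_baseline.py; it assumes you have already collected the first-token argmax ID for each sampled question, and the helper name uncovered_argmax_tokens is hypothetical.

```python
from collections import Counter

YES_TOKEN_IDS = [3582, 8241, 4874, 3869]
NO_TOKEN_IDS = [1217, 3782, 694, 1939]


def uncovered_argmax_tokens(argmax_ids, yes_ids=YES_TOKEN_IDS, no_ids=NO_TOKEN_IDS):
    """Count first-token argmax IDs that fall outside the yes/no token sets."""
    known = set(yes_ids) | set(no_ids)
    return Counter(t for t in argmax_ids if t not in known)


# Hypothetical sample: mostly ' Yes' / ' No', plus one arbitrary unexpected ID.
sample = [3869, 1939, 3869, 450, 1939]
missing = uncovered_argmax_tokens(sample)

print(dict(missing))  # {450: 1}
```

An empty result suggests the token sets cover the sample; any ID that shows up should be decoded and inspected before deciding which set, if either, it belongs to.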

Repository layout

.
|-- README.md                  this file
|-- CITATION.cff               machine-readable citation metadata
|-- LICENSE                    MIT
|-- pyproject.toml             installable package metadata (pip install -e .)
|-- MANIFEST.in                sdist contents control
|-- requirements.txt           pinned Python dependencies
|-- environment.yml            conda environment
|-- .gitignore                 excludes model weights, caches, build, secrets
|-- pytest.ini                 test framework configuration
|
|-- .github/                   CI/CD workflows
|   `-- workflows/tests.yml    automated evaluation validation
|
|-- tests/                     mathematical validation suite
|   |-- test_evaluation.py     tensor-level parsing and metric robustness
|   `-- test_pope_loader.py    end-to-end dataset I/O integrity
|
|-- src/pope_audit/            the installable library (pip package `pope-audit`)
|   |-- __init__.py            public API (lazy torch import)
|   |-- evaluate.py            compute_f1 / compute_accuracy / load_pope
|   |-- pope_loader.py         load POPE splits
|   |-- ugaa_hook.py           UGAA v5 + _get_yes_no_logits + token-ID constants
|   `-- clip_l_grounding.py    CLIP-L per-patch similarity module
|
|-- scripts/                   runnable drivers + data downloaders
|   |-- run_pope_2tok_baseline.py   diagnostic + 2-tok vs 8-tok comparison
|   |-- run_pope_eval_full.py       full 3,000q POPE eval driver
|   |-- run_ablation_a.py           nine inference-time correction methods
|   |-- run_full_diagnostic_3000q.py  3,000q per-question diagnostic
|   |-- run_cross_model_token_audit.py  second-model token audit
|   |-- run_multi_model_audit.py    four-readout audit across VLMs
|   `-- download_pope_full.py       fetch POPE questions + images
|
|-- analysis/                  paper-claim audit, stats, and figure scripts
|   |-- fact_audit.py
|   |-- diagnostic_stats.py
|   |-- string_parse_equivalence.py
|   `-- latency_microbench.py
|
|-- experiments/               JSON outputs from every run
|   |-- pope_adversarial_2tok_vs_8tok.json    main result, adv. split
|   |-- pope_popular_2tok_vs_8tok.json
|   |-- pope_random_2tok_vs_8tok.json
|   |-- ugaa_full_adversarial_3000q_diagnostic.json   3,000q diagnostic (sec. 5.3)
|   |-- cross_model/           LLaVA-1.6-Mistral per-question logs
|   `-- multi_model/           recorded metrics for the four extra VLMs
|
|-- datasets/                  POPE questions and image paths (images gitignored)
|-- docs/                      local + cloud setup notes
|-- references/                other-paper bibtex + survey CSV
|-- paper/                     paper-side audit docs (LaTeX sources gitignored)
|
`-- archive/                   initial UGAA research, not needed for reproduction
    |-- src_legacy/            older / experimental scripts (preserved verbatim)
    |-- experiments_legacy/    superseded prediction logs
    |-- notebooks/             cloud runners (run_on_cloud.ipynb reproduces Table 6)
    `-- misc/                  one-off utilities

Reproducing the headline numbers

The full pipeline runs on an 8 GB GPU in roughly 80 minutes for all 9,000 questions.

0. Environment

python -m venv .venv
.venv\Scripts\activate          # Windows
# source .venv/bin/activate     # macOS / Linux

pip install -r requirements.txt

# (optional) install the audit utilities as an editable package so the
# scripts can `import pope_audit` from any working directory:
pip install -e .

Or, with conda:

conda env create -f environment.yml
conda activate ugaa

Notes on the two development machines we used are in docs/local_setup_notes.md.

1. Download LLaVA-1.5-7B weights

We used the HuggingFace checkpoint llava-hf/llava-1.5-7b-hf, quantized to 4-bit NF4 through BitsAndBytes. The default --model-path llava-hf/llava-1.5-7b-hf flag downloads it to the HuggingFace cache on first use. Use --cache-dir <dir> to control where it lives.

2. Reproduce the protocol comparison (Table 2 in the paper)

python scripts/run_pope_2tok_baseline.py --split adversarial \
    --model-path llava-hf/llava-1.5-7b-hf \
    --data-dir datasets/pope --output-dir experiments

python scripts/run_pope_2tok_baseline.py --split popular  --output-dir experiments
python scripts/run_pope_2tok_baseline.py --split random   --output-dir experiments

Expected output files and metrics (eight-token rule, full 3,000-question splits):

File	F1	Precision	Recall	TP	TN	FP	FN
`experiments/pope_adversarial_2tok_vs_8tok.json`	0.8221	0.8658	0.7827	1174	1318	182	326
`experiments/pope_popular_2tok_vs_8tok.json`	0.8498	0.9140	0.7940	1191	1388	112	309
`experiments/pope_random_2tok_vs_8tok.json`	0.8713	0.9652	0.7940	1191	1457	43	309
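
The confusion counts in the table are enough to recompute every reported metric, which is a quick way to sanity-check a rerun. A minimal sketch follows; pope_metrics is a hypothetical helper written for this example, not the package's compute_f1.

```python
def pope_metrics(tp, tn, fp, fn):
    """Derive POPE metrics from confusion counts (positive class = 'yes')."""
    total = tp + tn + fp + fn
    return {
        "f1": 2 * tp / (2 * tp + fp + fn),
        "precision": tp / (tp + fp),
        "recall": tp / (tp + fn),
        "yes_rate": (tp + fp) / total,
    }


# Eight-token counts copied from the table above.
COUNTS = {
    "adversarial": dict(tp=1174, tn=1318, fp=182, fn=326),
    "popular": dict(tp=1191, tn=1388, fp=112, fn=309),
    "random": dict(tp=1191, tn=1457, fp=43, fn=309),
}

results = {split: pope_metrics(**c) for split, c in COUNTS.items()}
for split, m in results.items():
    print(split, {k: round(v, 4) for k, v in m.items()})
```

The yes_rate field is (TP + FP) / total, which matches the Yes-rate column in the headline table.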

3. Reproduce the nine inference-time corrections (Table 4)

python scripts/run_ablation_a.py --variant all    --beta 1.0 \
    --model-path llava-hf/llava-1.5-7b-hf \
    --dataset datasets/pope/pope_sample_100.json \
    --output-dir experiments

# Full 3,000-question runs for the headline correction methods:
python scripts/run_pope_eval_full.py --split adversarial --baseline \
    --output-dir experiments
python scripts/run_pope_eval_full.py --split adversarial \
    --variant clip_certainty --beta 1.0 --output-dir experiments
python scripts/run_pope_eval_full.py --split adversarial \
    --variant clip_certainty --beta 1.5 --output-dir experiments

Expected per-method summary files in experiments/:

File	F1	Precision	Recall
`pope_full_adversarial_baseline_summary.json`	0.8221	0.8658	0.7827
`pope_full_adversarial_beta1.0_summary.json`	0.8164	0.9023	0.7453
`pope_full_adversarial_clip_b1.0_summary.json`	0.8171	0.9072	0.7433
`pope_full_adversarial_clip_b1.5_summary.json`	0.8104	0.9278	0.7193

4. Multi-model validation (four additional VLMs)

To test whether the token-set confound is specific to LLaVA-1.5 or a property of the readout protocol itself, run the same four-readout audit (legacy_2tok, legacy_8tok, dynamic_single, string_parse) on other VLMs with scripts/run_multi_model_audit.py. Each model is run on all three POPE splits at the full 3,000 questions per split.

# LLaVA-1.6-Mistral (local, 4-bit) -- one split shown; repeat for popular, random
python scripts/run_multi_model_audit.py --model llava16_mistral \
    --split adversarial --samples 3000 \
    --data-dir datasets/pope --output-dir experiments/cross_model \
    --device cuda --quantize 4bit

# InstructBLIP / mPLUG-Owl2 / Qwen2-VL run the same way on a T4 (Kaggle/Colab);
# see archive/notebooks/run_on_cloud.ipynb for the turnkey cloud runner.
python scripts/run_multi_model_audit.py --model instructblip --split adversarial --samples 3000 --output-dir experiments/multi_model --device cuda
python scripts/run_multi_model_audit.py --model mplug_owl2   --split adversarial --samples 3000 --output-dir experiments/multi_model --device cuda
python scripts/run_multi_model_audit.py --model qwen2_vl     --split adversarial --samples 3000 --output-dir experiments/multi_model --device cuda --prompt-mode native_chat_template

Measured POPE F1 (3,000 questions per split; LLaVA-1.6-Mistral on a local RTX 3070 Ti; InstructBLIP, mPLUG-Owl2, and Qwen2-VL on a Kaggle T4). Bold marks the F1 a careful practitioner would report for each model:

Model (tokenizer)	Split	legacy_2tok	legacy_8tok	dynamic_single	string_parse
LLaVA-1.5-7B (LLaMA-2)	adversarial	0.7608	0.8221	0.8221	0.8221
LLaVA-1.5-7B (LLaMA-2)	popular	0.7961	0.8498	0.8498	0.8498
LLaVA-1.5-7B (LLaMA-2)	random	0.8397	0.8713	0.8713	0.8713
LLaVA-1.6-Mistral-7B (Mistral)	adversarial	0.6667	0.0000	0.8521	0.8521
LLaVA-1.6-Mistral-7B (Mistral)	popular	0.6667	0.0000	0.8953	0.8953
LLaVA-1.6-Mistral-7B (Mistral)	random	0.6667	0.0000	0.9163	0.9163
InstructBLIP-Vicuna-7B (LLaMA)	adversarial	0.7074	0.8183	0.8183	0.6667
InstructBLIP-Vicuna-7B (LLaMA)	popular	0.7277	0.8553	0.8553	0.6667
InstructBLIP-Vicuna-7B (LLaMA)	random	0.7613	0.8815	0.8815	0.6667
mPLUG-Owl2-LLaMA2-7B (LLaMA-2)	adversarial	0.7754	0.8001	0.8001	0.8001
mPLUG-Owl2-LLaMA2-7B (LLaMA-2)	popular	0.8060	0.8312	0.8312	0.8312
mPLUG-Owl2-LLaMA2-7B (LLaMA-2)	random	0.8466	0.8707	0.8707	0.8707
Qwen2-VL-7B-Instruct (Qwen)	adversarial	0.6674	0.0000	0.8457	0.8457
Qwen2-VL-7B-Instruct (Qwen)	popular	0.6671	0.0000	0.8578	0.8578
Qwen2-VL-7B-Instruct (Qwen)	random	0.6667	0.0000	0.8671	0.8671

Two patterns stand out in the table. First, the legacy LLaVA-1.5 eight-token readout collapses to F1 0.0000 on the two models with disjoint vocabularies (LLaVA-1.6-Mistral, Qwen2-VL), while the tokenizer-derived dynamic_single readout stays between 0.80 and 0.92 on every model and split. Second, string_parse matches dynamic_single on every model except InstructBLIP, where the free-text parse did not isolate the answer token and collapsed to an all-yes prediction (F1 0.6667); the single-token logit readout is therefore the robust default. Corresponding artifacts:

experiments/cross_model/llava16_mistral_pope_{adversarial,popular,random}_3000q_paper_template_token_audit.json (full per-question records)
experiments/multi_model/multi_model_results.json (recorded metrics for the four additional models; the three cloud runs are reproducible with archive/notebooks/run_on_cloud.ipynb)
The original 500-question two-prompt-template comparison remains in experiments/cross_model/llava-hf_llava-v1.6-mistral-7b-hf_pope_adversarial_500q_*_token_audit.json
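
The recurring 0.6667 and 0.0000 entries in the table are the values F1 takes for degenerate predictors on a balanced split. The arithmetic can be sketched as follows, assuming 1,500 yes and 1,500 no questions per split (which is what TP + FN and TN + FP sum to in the step 2 table); the helper name f1_from_counts is hypothetical.

```python
def f1_from_counts(tp, fp, fn):
    """F1 for the positive ('yes') class; 0.0 when there are no positives at all."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


N_YES, N_NO = 1500, 1500  # assumed per-split label balance

# Always answering "yes": every positive is found, every negative is a false positive.
all_yes = f1_from_counts(tp=N_YES, fp=N_NO, fn=0)
# Never answering "yes": no true positives, every positive is missed.
all_no = f1_from_counts(tp=0, fp=0, fn=N_YES)

print(round(all_yes, 4), round(all_no, 4))  # 0.6667 0.0
```

An F1 of exactly 0.6667 is therefore consistent with an all-yes readout, and 0.0000 with a readout that never predicts yes; confirming that for a given cell requires the per-question records.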

For a CPU-only sanity check that just verifies the dynamic single-token IDs derived from the second model's tokenizer:

python scripts/run_cross_model_token_audit.py \
    --model-path llava-hf/llava-v1.6-mistral-7b-hf \
    --dry-run-tokenizer --cache-dir ./hf_cache

See experiments/cross_model/README.md for the full description of every JSON field and what to look at.

5. Verify the paper's numeric claims and submission gate

python analysis/fact_audit.py        # phase 1+2 paper-claim audit, phase 3 strict gate
python analysis/diagnostic_stats.py  # Mann-Whitney U on the 100q diagnostic

fact_audit.py exits with code 0 if every claim matches the JSON ground truth and the paper sources contain no placeholders, anonymous-author text (in the arXiv version), TODO/FIXME, em dashes, corrupted characters, or off-by-one Table 7 entries.
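
For readers who want a similar gate in their own paper repository, the forbidden-pattern part of such an audit can be sketched as below. This is an illustration only and not the code in analysis/fact_audit.py; the rule names, the scan_source helper, and the non-zero exit code on failure are assumptions.

```python
import re

# Hypothetical rule set covering three of the checks named above.
FORBIDDEN = {
    "todo_fixme": re.compile(r"\b(TODO|FIXME)\b"),
    "em_dash": re.compile("\u2014"),
    "replacement_char": re.compile("\ufffd"),  # typical sign of corrupted text
}


def scan_source(text):
    """Return (line_number, rule_name) pairs for every forbidden pattern hit."""
    hits = []
    for lineno, line in enumerate(text.splitlines(), start=1):
        for name, pattern in FORBIDDEN.items():
            if pattern.search(line):
                hits.append((lineno, name))
    return hits


sample = "Results are final.\nTODO: rerun split\nGap is 6.13 points \u2014 large."
hits = scan_source(sample)
exit_code = 0 if not hits else 1  # assumed convention: non-zero on any hit

print(hits, exit_code)  # [(2, 'todo_fixme'), (3, 'em_dash')] 1
```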

Evaluation Integrity & Testing Architecture

pope-audit ships a test suite that exercises the logit-extraction path described in the paper. It covers:

Logit-level integration testing: rather than asserting string equality on decoded text, the suite builds small PyTorch tensors that mirror the tuple structure Hugging Face returns with return_dict_in_generate=True. This checks that the internal extraction layer (_get_yes_no_logits) pools the eight token indices correctly inside a torch.no_grad() context.
End-to-end I/O robustness: uses pytest tmp_path fixtures to validate JSONL split detection, label parsing, and metric computation across community-schema edge cases.
Package isolation: local path mapping is configured in pytest.ini. CI runs automatically through GitHub Actions on Ubuntu runners and installs runtime dependencies such as torch at test time, so the lightweight pyproject.toml keeps torch out of its hard dependencies.

To run the full 8-test verification sweep locally:

pytest tests/ -v

Determinism

All runs use:

torch.manual_seed(42)
transformers==4.40.1
bitsandbytes 4-bit NF4 quantization, fp16 compute dtype
model.generate(max_new_tokens=1, output_scores=True) with greedy decoding

Rerunning the same script on the same hardware produces identical predictions. Different GPUs may differ in the third decimal of individual logits but not in the argmax token, so the eight-token F1 is stable across hardware.

Hardware footprint

Hardware	Throughput	Total time for 9,000 questions
NVIDIA RTX 3070 Ti, 8 GB VRAM	0.508 s/q	76.3 min
NVIDIA RTX 3050, 4 GB VRAM	(verification only)	partial

Cloud Hardware

Kaggle Tesla T4x2 GPU (mPLUG-Owl2, InstructBLIP, Qwen2-VL)
Approximately 27,000 POPE evaluations across three models and three splits

Cloud Accounts

Experiments were distributed across multiple Kaggle sessions to accommodate GPU quota limits.

Citing this work

If you find the audit useful, please cite via CITATION.cff or the BibTeX entry below.

@misc{xxxx2026tokenset,
  title  = {Token-Set Choice Confounds {POPE}:
            A Systematic Audit of Yes/No Extraction
            in {VLM} Hallucination Evaluation},
  author = {Jayakumar, Kesav Kumar and Thilak, Karthigeyan},
  year   = {2026},
  note   = {Preprint, arXiv identifier to be added on submission.}
}

License

MIT, see LICENSE. POPE images are used under the COCO Terms of Use. LLaVA-1.5 weights are used under their respective license (Apache 2.0). CLIP weights are used under MIT.

Acknowledgments

We used Claude (Anthropic) as a coding assistant for experimental script development and figure rendering. All experimental design, analysis, and writing are our own. This work received no external funding and was conducted independently during the authors' undergraduate studies.

Release history

0.1.1 (this version): Jun 15, 2026
0.1.0: Jun 14, 2026

Download files

Source distribution: pope_audit-0.1.1.tar.gz (31.8 kB), uploaded Jun 15, 2026
Built distribution: pope_audit-0.1.1-py3-none-any.whl (21.7 kB), uploaded Jun 15, 2026, Python 3

File details

Details for the file pope_audit-0.1.1.tar.gz.

File metadata

Download URL: pope_audit-0.1.1.tar.gz
Upload date: Jun 15, 2026
Size: 31.8 kB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.2.0 CPython/3.14.2

File hashes

Hashes for pope_audit-0.1.1.tar.gz
Algorithm	Hash digest
SHA256	`3b618b3ea1480d6fc549f86075f57a632575b1c5fa5f94e0698de00a54050be1`
MD5	`2004044f92a85b29be6a6a16af7cfc8c`
BLAKE2b-256	`79e62eec47e352f99d1685c87837a677a8f6bef5864225318e5886b7e1966e18`

File details

Details for the file pope_audit-0.1.1-py3-none-any.whl.

File metadata

Download URL: pope_audit-0.1.1-py3-none-any.whl
Upload date: Jun 15, 2026
Size: 21.7 kB
Tags: Python 3
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.2.0 CPython/3.14.2

File hashes

Hashes for pope_audit-0.1.1-py3-none-any.whl
Algorithm	Hash digest
SHA256	`7734c3c28849df2f7e598188796641e3e9c974cf2eb7ca61609000b654d18cf6`
MD5	`3c5c07e8e29f5714720346d83aa41a8a`
BLAKE2b-256	`c43482b5f6283e52556d713c64b1282ecb00fd7cc28901f18f5ec699456c1573`
