Diagnostic toolkit for black-box evaluation of automated peer-review systems.

Where Do LLMs Go Wrong? Diagnosing Automated Peer Review

A pip-installable diagnostic toolkit for black-box evaluation of automated peer-review systems under controlled aspect-guided perturbations.

Use it as a community evaluation tool: run any automated review system on paired original/perturbed paper, review, or rebuttal inputs; export scores or decisions; then generate aspect-level reports measuring sensitivity to soundness, presentation, contribution, tone, factuality, completeness, and recommendation perturbations.

Companion code, prompts, examples, and reproducibility notes for the CIKM 2025 paper:

Where Do LLMs Go Wrong? Diagnosing Automated Peer Review via Aspect-Guided Multi-Level Perturbation
Jiatao Li, Yanheng Li, Xinyu Hu, Mingqi Gao, Xiaojun Wan. CIKM 2025.
DOI: https://doi.org/10.1145/3746252.3761274

If this repository helps your research, please cite the paper. Copy-paste BibTeX is below and in CITATION.bib; GitHub citation metadata is in CITATION.cff.

30-second quickstart

Install the diagnostic CLI from PyPI and run the toy report. This path needs no API keys, GPUs, model downloads, or companion dataset.

python -m pip install ai-reviewer-diagnostics
ai-reviewer-diagnostics --demo --output-md outputs/demo_diagnostic_report.md

To install the latest GitHub version before it reaches PyPI:

python -m pip install "git+https://github.com/JiataoLi/where-do-llms-go-wrong.git"

Expected package demo output:

Compared 1 condition pair(s).
Wrote outputs/demo_diagnostic_report.md

For a repo checkout:

git clone https://github.com/JiataoLi/where-do-llms-go-wrong
cd where-do-llms-go-wrong
make quickstart
make demo-report

Expected checkout quickstart output:

AI-reviewer diagnostic release quickstart: OK
Validated 1 chat example(s), 3 OpenReview note(s).
Prompt rows: base=9, perturb=7.
Wrote outputs/quickstart/quickstart_summary.json

make quickstart validates the repo layout, example schemas, prompt files, and citation metadata, then writes a tiny demo artifact under outputs/quickstart/. make demo-report exercises the same packaged report engine exposed as ai-reviewer-diagnostics. These are format/schema demos, not model results.

What you can reuse

| Goal | Start here | Requires |
| --- | --- | --- |
| Check that the repo is healthy | `make quickstart` | Python only |
| Run the API-free code smoke test | `uv sync && uv run make smoke-test` | lightweight Python deps |
| Generate a toy diagnostic report | `make demo-report` or `ai-reviewer-diagnostics ...` | Python only |
| Reuse the prompt templates | `prompts/` | none |
| Run API-based model inference | `scripts/run_openrouter.py`, `scripts/run_gemini.py` | API key |
| Run local model inference | `scripts/run_vllm.py` | GPU + vLLM |
| Evaluate a new review system's outputs | `ai-reviewer-diagnostics` / `ai_reviewer_diagnostics` | JSONL outputs with shared `id` fields |
| Inspect released artifacts | `docs/DATA.md` + Hugging Face dataset | `huggingface_hub` |
| Recreate analysis tables/figures | `analysis/` + `docs/REPRODUCIBILITY.md` | dataset + analysis deps |

Command map

| Command | Purpose |
| --- | --- |
| `ai-reviewer-diagnostics --demo --output-md outputs/demo.md` | Verify the pip-installed CLI with bundled toy fixtures. |
| `ai-reviewer-diagnostics --baseline base.jsonl --perturbed pert.jsonl --condition paper/soundness --output-md report.md` | Diagnose one baseline/perturbed pair from your own system. |
| `ai-reviewer-diagnostics --scores-dir ai-reviewer-diagnostic-data/data/annotation_scores --output-md report.md` | Summarize all paired score files in the released dataset format. |
| `make quickstart` | Check a repo checkout without installing dependencies. |
| `make smoke-test` | Run the API-free repository test suite. |

Integrate your own review system

The toolkit only requires JSONL outputs with shared id values. Start with docs/INTEGRATIONS.md for schema examples, custom score fields, directory mode, and common pitfalls.
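
Because the pairing relies on shared id values, it can help to check that your two files line up before generating a report. The following is a minimal pre-flight sketch using only the standard library. The helper names (`load_ids`, `find_duplicates`, `compare_ids`) are hypothetical and not part of the package, and the only schema assumption is that each JSONL line is an object with an `id` field.

```python
import json
from collections import Counter
from pathlib import Path


def load_ids(path):
    """Return the `id` values found in a JSONL file, skipping blank lines."""
    ids = []
    with Path(path).open(encoding="utf-8") as handle:
        for line_number, line in enumerate(handle, start=1):
            if not line.strip():
                continue
            record = json.loads(line)
            if "id" not in record:
                raise ValueError(f"{path}:{line_number} has no 'id' field")
            ids.append(record["id"])
    return ids


def find_duplicates(ids):
    """Return the ids that occur more than once, in sorted order."""
    return sorted(i for i, count in Counter(ids).items() if count > 1)


def compare_ids(baseline_path, perturbed_path):
    """Summarize how the ids of a baseline file and a perturbed file overlap."""
    baseline = load_ids(baseline_path)
    perturbed = load_ids(perturbed_path)
    base_set, pert_set = set(baseline), set(perturbed)
    return {
        "shared": sorted(base_set & pert_set),
        "only_baseline": sorted(base_set - pert_set),
        "only_perturbed": sorted(pert_set - base_set),
        "duplicate_baseline": find_duplicates(baseline),
        "duplicate_perturbed": find_duplicates(perturbed),
    }
```

Calling `compare_ids("base.jsonl", "pert.jsonl")` returns a dictionary you can inspect; non-empty `only_*` or `duplicate_*` lists suggest the two runs are not cleanly paired.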

Dependencies

Dependencies and package metadata are managed in pyproject.toml. The install exposes two console commands, ai-reviewer-diagnostics and the shorter alias ai-reviewer-report. Analysis and vLLM dependencies are optional extras.

uv sync                    # default runtime dependencies
uv sync --extra analysis   # pandas/numpy/scipy plotting stack
uv sync --extra vllm       # optional local GPU inference stack

If you do not use uv, the pip-compatible fallback is:

python -m pip install -e .
python -m pip install -e ".[analysis]"
python -m pip install -e ".[vllm]"
ai-reviewer-diagnostics --help

API-free smoke test

uv sync
uv run make smoke-test

This compiles Python files, validates example JSON, checks all inference runners in --validate-only mode, and runs the OpenReview-cleaner fixture. Generated files go under outputs/ and can be removed with:

make clean

Dataset

Large artifacts are hosted separately on Hugging Face so the GitHub repo stays lightweight:

https://huggingface.co/datasets/jiataoli/ai-reviewer-diagnostic-data

Download and inspect:

uv run hf download jiataoli/ai-reviewer-diagnostic-data \
  --repo-type dataset \
  --local-dir ai-reviewer-diagnostic-data
uv run python scripts/summarize_release_data.py --data-dir ai-reviewer-diagnostic-data/data

The summary should start with the file count, total size, file types, JSONL row counts, and largest files. See docs/DATA.md for schema and naming notes.
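
For readers who want to sanity-check a download without the repo checkout, the kind of summary described above can be approximated with the standard library. This is an illustrative sketch, not the logic of scripts/summarize_release_data.py; the function name `summarize_directory` and the returned keys are assumptions made for the example.

```python
from collections import Counter
from pathlib import Path


def summarize_directory(data_dir):
    """Collect file count, total size, extension counts, and JSONL row counts."""
    files = sorted(p for p in Path(data_dir).rglob("*") if p.is_file())
    jsonl_rows = {}
    for path in files:
        if path.suffix == ".jsonl":
            with path.open(encoding="utf-8") as handle:
                # Count non-blank lines; each one is expected to be a record.
                jsonl_rows[str(path.relative_to(data_dir))] = sum(
                    1 for line in handle if line.strip()
                )
    by_size = sorted(files, key=lambda p: p.stat().st_size, reverse=True)
    return {
        "file_count": len(files),
        "total_bytes": sum(p.stat().st_size for p in files),
        "file_types": dict(Counter(p.suffix or "(none)" for p in files)),
        "jsonl_rows": jsonl_rows,
        "largest": [
            (str(p.relative_to(data_dir)), p.stat().st_size) for p in by_size[:5]
        ],
    }
```

For example, `summarize_directory("ai-reviewer-diagnostic-data/data")` would walk the directory created by the download command above.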

Diagnostic toolkit workflow

To evaluate a new automated review system, export its baseline and perturbed outputs as JSONL with a shared id field and score/decision fields:

{"id":"paper_001","overall_score":8,"soundness_score":4,"final_decision":"Accept as Poster"}
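
If your review system runs in Python, a small writer is enough to produce files in this shape. The sketch below is only illustrative: `write_jsonl` is a hypothetical helper, the field names are copied from the example record above, and the perturbed values (including the "Reject" decision) are made up for the example.

```python
import json
from pathlib import Path


def write_jsonl(records, path):
    """Write one JSON object per line, creating parent directories as needed."""
    path = Path(path)
    path.parent.mkdir(parents=True, exist_ok=True)
    with path.open("w", encoding="utf-8") as handle:
        for record in records:
            if "id" not in record:
                raise ValueError("every record needs an 'id' field")
            handle.write(json.dumps(record, ensure_ascii=False) + "\n")
    return path


# Hypothetical outputs from the same system on the original and perturbed
# inputs. The two runs reuse the same id so they can be paired later.
baseline = [
    {"id": "paper_001", "overall_score": 8, "soundness_score": 4,
     "final_decision": "Accept as Poster"},
]
perturbed = [
    {"id": "paper_001", "overall_score": 6, "soundness_score": 2,
     "final_decision": "Reject"},
]
```

Calling `write_jsonl(baseline, "outputs/my_system_baseline.jsonl")` and `write_jsonl(perturbed, "outputs/my_system_soundness_perturbed.jsonl")` would produce the two input files for a baseline/perturbed comparison.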

Then generate an aspect-level report:

make demo-report
# or, for your own system outputs:
uv run ai-reviewer-diagnostics \
  --baseline outputs/my_system_baseline.jsonl \
  --perturbed outputs/my_system_soundness_perturbed.jsonl \
  --condition paper/soundness \
  --output-md reports/my_system_soundness_report.md \
  --output-json reports/my_system_soundness_report.json

The report summarizes score deltas, decision-change rates, and top decision transitions. If you use the public dataset score-file naming convention, run directory mode:

uv run ai-reviewer-diagnostics \
  --scores-dir ai-reviewer-diagnostic-data/data/annotation_scores \
  --output-md reports/released_scores_report.md
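
To make the report quantities concrete, here is a rough sketch of how score deltas, a decision-change rate, and decision transitions could be computed from paired records. The definitions are assumptions for illustration (delta as perturbed minus baseline, change rate as the share of paired ids whose decision differs), and `diagnose_pair` is a hypothetical function; the packaged report engine may define or aggregate these differently.

```python
from collections import Counter
from statistics import mean


def diagnose_pair(baseline, perturbed, score_field="overall_score",
                  decision_field="final_decision"):
    """Compare records that share an id across a baseline and a perturbed run."""
    perturbed_by_id = {record["id"]: record for record in perturbed}
    deltas = []
    transitions = Counter()
    for base in baseline:
        pert = perturbed_by_id.get(base["id"])
        if pert is None:
            continue  # ids missing from the perturbed run are skipped here
        deltas.append(pert[score_field] - base[score_field])
        transitions[(base[decision_field], pert[decision_field])] += 1
    paired = len(deltas)
    changed = sum(
        count for (before, after), count in transitions.items() if before != after
    )
    return {
        "paired": paired,
        "mean_score_delta": mean(deltas) if deltas else None,
        "decision_change_rate": changed / paired if paired else None,
        "top_transitions": transitions.most_common(3),
    }
```

A negative mean delta under this convention means the system scored the perturbed inputs lower than the originals.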

Common commands

OpenAI-compatible / OpenRouter inference

export OPENROUTER_API_KEY="your-openrouter-api-key"  # placeholder; use your own key
uv run python scripts/run_openrouter.py \
  --input examples/example.json \
  --output outputs/model_outputs.jsonl \
  --model mistralai/mistral-small-3.1-24b-instruct \
  --base-url https://openrouter.ai/api/v1 \
  --api-key-env OPENROUTER_API_KEY \
  --workers 1

Gemini inference

export GEMINI_API_KEY="your-gemini-api-key"  # placeholder; use your own key
uv run python scripts/run_gemini.py \
  --input examples/example.json \
  --output outputs/gemini_outputs.jsonl \
  --model gemini-2.0-flash \
  --workers 1

Optional local vLLM inference

vllm is intentionally kept out of the default install because it depends on your CUDA, PyTorch, and GPU setup.

uv sync --extra vllm
uv run python scripts/run_vllm.py \
  --input examples/example.json \
  --output outputs/vllm_outputs.jsonl \
  --model-path Qwen/Qwen2.5-72B-Instruct \
  --tensor-parallel-size 8 \
  --limit 1

Clean an OpenReview export

uv run python scripts/clean_openreview.py \
  --input examples/openreview_comments_minimal.json \
  --output outputs/openreview_conversations.json \
  --forum-id forum_example \
  --print-text

For your own data, replace examples/openreview_comments_minimal.json with an OpenReview comments export.

Repository layout

ai_reviewer_diagnostics/ # pip-installable diagnostic report package
scripts/              # wrappers and reusable CLIs: quickstart, inference, preprocessing, data summary
analysis/             # analysis scripts for released annotation-score artifacts
examples/             # tiny runnable fixtures for quickstart, smoke tests, and report generation
prompts/              # curated machine-readable prompt templates
  base_prompt.jsonl
  perturb_prompt.jsonl
data/README.md        # pointer to the external Hugging Face dataset
docs/                 # getting-started, data, and reproducibility notes
paper/README.md       # DOI, ACM PDF link, and citation pointer
CITATION.bib          # BibTeX citation
CITATION.cff          # GitHub citation metadata
CONTRIBUTING.md       # community contribution guide
MANIFEST.md           # full release inventory

Reproducibility path

The release is organized in tiers, from a Python-only check to full analysis, so users can get value from the public code, prompts, examples, and released artifacts at whichever level they need:

Immediate check: make quickstart validates layout and schemas with Python only.
Code smoke test: make smoke-test checks scripts without API calls or GPUs.
Artifact inspection: download the Hugging Face dataset and run summarize_release_data.py.
Model inference: run OpenRouter, Gemini, or vLLM wrappers on your own prompt batches.
Diagnostic report: compare a system's baseline and perturbed outputs with ai-reviewer-diagnostics.
Analysis: use analysis/ scripts on downloaded score artifacts.

See docs/GETTING_STARTED.md and docs/REPRODUCIBILITY.md for the longer guide.

Community contributions

Bug reports, integration requests, and metric ideas are welcome. Use the GitHub issue templates for reproducible CLI bugs, new automated-review-system integrations, or report-metric proposals. Pull requests should keep the default install lightweight and API-free smoke tests passing; see CONTRIBUTING.md.

Citation

@inproceedings{li2025where,
  title     = {Where Do LLMs Go Wrong? Diagnosing Automated Peer Review via Aspect-Guided Multi-Level Perturbation},
  author    = {Li, Jiatao and Li, Yanheng and Hu, Xinyu and Gao, Mingqi and Wan, Xiaojun},
  booktitle = {Proceedings of the 34th ACM International Conference on Information and Knowledge Management (CIKM '25)},
  year      = {2025},
  publisher = {ACM},
  doi       = {10.1145/3746252.3761274},
  url       = {https://doi.org/10.1145/3746252.3761274}
}

Documentation map

docs/GETTING_STARTED.md: shortest path for a new user.
docs/DATA.md: dataset contents, schemas, naming glossary, and rights notes.
docs/REPRODUCIBILITY.md: tiered reproduction guide.
docs/INTEGRATIONS.md: connect a new automated-review system.
scripts/README.md: CLI map and examples.
analysis/README.md: analysis script guide.
prompts/README.md: prompt reuse notes.
CONTRIBUTING.md: issue/PR workflow for community contributions.
MANIFEST.md: release inventory.

Release status

The GitHub repository contains code, curated prompts, docs, examples, citation metadata, and pointers to the paper and data. The ACM paper PDF is linked from paper/README.md rather than redistributed here. Code is released under the MIT license (see LICENSE); the dataset is hosted separately on Hugging Face under its dataset-card terms.

Contact

Open a GitHub issue in the public repository for questions, reuse requests, or reproduction problems.
