Diagnostic toolkit for black-box evaluation of automated peer-review systems.
Where Do LLMs Go Wrong? Diagnosing Automated Peer Review
A pip-installable diagnostic toolkit for black-box evaluation of automated peer-review systems under controlled aspect-guided perturbations.
Use it as a community evaluation tool: run any automated review system on paired original/perturbed paper, review, or rebuttal inputs; export scores or decisions; then generate aspect-level reports measuring sensitivity to soundness, presentation, contribution, tone, factuality, completeness, and recommendation perturbations.
Companion code, prompts, examples, and reproducibility notes for the CIKM 2025 paper:
Where Do LLMs Go Wrong? Diagnosing Automated Peer Review via Aspect-Guided Multi-Level Perturbation
Jiatao Li, Yanheng Li, Xinyu Hu, Mingqi Gao, Xiaojun Wan. CIKM 2025.
DOI: https://doi.org/10.1145/3746252.3761274
If this repository helps your research, please cite the paper. Copy-paste BibTeX is below and in CITATION.bib; GitHub citation metadata is in CITATION.cff.
30-second quickstart
Install the diagnostic CLI from PyPI and run the toy report. This path needs no API keys, GPUs, model downloads, or companion dataset.
python -m pip install ai-reviewer-diagnostics
ai-reviewer-diagnostics --demo --output-md outputs/demo_diagnostic_report.md
If you want the latest GitHub version before a PyPI release catches up:
python -m pip install "git+https://github.com/JiataoLi/where-do-llms-go-wrong.git"
Expected package demo output:
Compared 1 condition pair(s).
Wrote outputs/demo_diagnostic_report.md
For a repo checkout:
git clone https://github.com/JiataoLi/where-do-llms-go-wrong
cd where-do-llms-go-wrong
make quickstart
make demo-report
Expected checkout quickstart output:
AI-reviewer diagnostic release quickstart: OK
Validated 1 chat example(s), 3 OpenReview note(s).
Prompt rows: base=9, perturb=7.
Wrote outputs/quickstart/quickstart_summary.json
make quickstart validates the repo layout, example schemas, prompt files, and citation metadata, then writes a tiny demo artifact under outputs/quickstart/. make demo-report exercises the same packaged report engine exposed as ai-reviewer-diagnostics. These are format/schema demos, not model results.
What you can reuse
| Goal | Start here | Requires |
|---|---|---|
| Check that the repo is healthy | make quickstart | Python only |
| Run the API-free code smoke test | uv sync && uv run make smoke-test | lightweight Python deps |
| Generate a toy diagnostic report | make demo-report or ai-reviewer-diagnostics ... | Python only |
| Reuse the prompt templates | prompts/ | none |
| Run API-based model inference | scripts/run_openrouter.py, scripts/run_gemini.py | API key |
| Run local model inference | scripts/run_vllm.py | GPU + vLLM |
| Evaluate a new review system's outputs | ai-reviewer-diagnostics / ai_reviewer_diagnostics | JSONL outputs with shared id fields |
| Inspect released artifacts | docs/DATA.md + Hugging Face dataset | huggingface_hub |
| Recreate analysis tables/figures | analysis/ + docs/REPRODUCIBILITY.md | dataset + analysis deps |
Command map
| Command | Purpose |
|---|---|
| ai-reviewer-diagnostics --demo --output-md outputs/demo.md | Verify the pip-installed CLI with bundled toy fixtures. |
| ai-reviewer-diagnostics --baseline base.jsonl --perturbed pert.jsonl --condition paper/soundness --output-md report.md | Diagnose one baseline/perturbed pair from your own system. |
| ai-reviewer-diagnostics --scores-dir ai-reviewer-diagnostic-data/data/annotation_scores --output-md report.md | Summarize all paired score files in the released dataset format. |
| make quickstart | Check a repo checkout without installing dependencies. |
| make smoke-test | Run the API-free repository test suite. |
Integrate your own review system
The toolkit only requires JSONL outputs with shared id values. Start with docs/INTEGRATIONS.md for schema examples, custom score fields, directory mode, and common pitfalls.
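Before generating a report, it can help to confirm that the two files really do share id values. The following is a minimal sketch of such a check, not part of the toolkit; the helper names load_ids and compare_ids are hypothetical, and it assumes each line is a JSON object with a top-level id key.

```python
import json
from pathlib import Path


def load_ids(path):
    """Return the set of `id` values in a JSONL file (one JSON object per line)."""
    ids = set()
    with Path(path).open(encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if not line:  # tolerate blank lines
                continue
            ids.add(json.loads(line)["id"])
    return ids


def compare_ids(baseline_path, perturbed_path):
    """Report which ids are shared and which appear in only one of the files."""
    baseline = load_ids(baseline_path)
    perturbed = load_ids(perturbed_path)
    return {
        "shared": sorted(baseline & perturbed),
        "baseline_only": sorted(baseline - perturbed),
        "perturbed_only": sorted(perturbed - baseline),
    }
```

Non-empty baseline_only or perturbed_only lists suggest that some records cannot be paired and are worth inspecting before running the CLI.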
Dependencies
Dependencies and package metadata are managed in pyproject.toml. The install exposes two console commands, ai-reviewer-diagnostics and the shorter alias ai-reviewer-report. Analysis and vLLM dependencies are optional extras.
uv sync # default runtime dependencies
uv sync --extra analysis # pandas/numpy/scipy plotting stack
uv sync --extra vllm # optional local GPU inference stack
If you do not use uv, the pip-compatible fallback is:
python -m pip install -e .
python -m pip install -e ".[analysis]"
python -m pip install -e ".[vllm]"
ai-reviewer-diagnostics --help
API-free smoke test
uv sync
uv run make smoke-test
This compiles Python files, validates example JSON, checks all inference runners in --validate-only mode, and runs the OpenReview-cleaner fixture. Generated files go under outputs/ and can be removed with:
make clean
Dataset
Large artifacts are hosted separately on Hugging Face so the GitHub repo stays lightweight:
https://huggingface.co/datasets/jiataoli/ai-reviewer-diagnostic-data
Download and inspect:
uv run hf download jiataoli/ai-reviewer-diagnostic-data \
--repo-type dataset \
--local-dir ai-reviewer-diagnostic-data
uv run python scripts/summarize_release_data.py --data-dir ai-reviewer-diagnostic-data/data
Expected summary starts with file count, total size, file types, JSONL row counts, and largest files. See docs/DATA.md for schema and naming notes.
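For a rough idea of what such a summary covers, here is an illustrative standard-library sketch that collects the same kinds of figures (file count, total size, file types, JSONL row counts, largest files) for any local directory. It is an assumption-laden stand-in, not the implementation of scripts/summarize_release_data.py; the function name summarize_directory and its return keys are hypothetical.

```python
from collections import Counter
from pathlib import Path


def summarize_directory(data_dir, top_n=3):
    """Collect basic statistics about the files under data_dir (hypothetical helper)."""
    root = Path(data_dir)
    files = [p for p in root.rglob("*") if p.is_file()]
    sizes = {p: p.stat().st_size for p in files}

    # Count non-blank lines in every JSONL file.
    jsonl_rows = {}
    for p in files:
        if p.suffix == ".jsonl":
            with p.open(encoding="utf-8") as handle:
                jsonl_rows[str(p.relative_to(root))] = sum(1 for line in handle if line.strip())

    largest = sorted(files, key=lambda p: sizes[p], reverse=True)[:top_n]
    return {
        "file_count": len(files),
        "total_bytes": sum(sizes.values()),
        "file_types": dict(Counter(p.suffix or "(none)" for p in files)),
        "jsonl_rows": jsonl_rows,
        "largest_files": [(str(p.relative_to(root)), sizes[p]) for p in largest],
    }
```

Calling summarize_directory("ai-reviewer-diagnostic-data/data") after the download returns a dictionary that can be printed or dumped to JSON.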
Diagnostic toolkit workflow
To evaluate a new automated review system, export its baseline and perturbed outputs as JSONL with a shared id field and score/decision fields:
{"id":"paper_001","overall_score":8,"soundness_score":4,"final_decision":"Accept as Poster"}
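A minimal sketch of exporting such records from Python follows. The field names come from the example record above; the helper write_jsonl is hypothetical, and the perturbed values (including the "Reject" label) are made up for illustration rather than taken from any model run.

```python
import json
from pathlib import Path

# Baseline record copied from the example above.
BASELINE = [
    {"id": "paper_001", "overall_score": 8, "soundness_score": 4, "final_decision": "Accept as Poster"},
]
# Illustrative perturbed record: same id, made-up scores and decision.
PERTURBED = [
    {"id": "paper_001", "overall_score": 6, "soundness_score": 2, "final_decision": "Reject"},
]


def write_jsonl(records, path):
    """Write one JSON object per line, creating parent directories if needed."""
    path = Path(path)
    path.parent.mkdir(parents=True, exist_ok=True)
    with path.open("w", encoding="utf-8") as handle:
        for record in records:
            handle.write(json.dumps(record, ensure_ascii=False) + "\n")
    return path
```

For example, write_jsonl(BASELINE, "outputs/my_system_baseline.jsonl") and write_jsonl(PERTURBED, "outputs/my_system_soundness_perturbed.jsonl") would produce a pair of files whose records share the same id.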
Then generate an aspect-level report:
make demo-report
# or, for your own system outputs:
uv run ai-reviewer-diagnostics \
--baseline outputs/my_system_baseline.jsonl \
--perturbed outputs/my_system_soundness_perturbed.jsonl \
--condition paper/soundness \
--output-md reports/my_system_soundness_report.md \
--output-json reports/my_system_soundness_report.json
The report summarizes score deltas, decision-change rates, and top decision transitions. If you use the public dataset score-file naming convention, run directory mode:
uv run ai-reviewer-diagnostics \
--scores-dir ai-reviewer-diagnostic-data/data/annotation_scores \
--output-md reports/released_scores_report.md
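To make the reported quantities concrete, the sketch below computes a mean score delta, a decision-change rate, and the most frequent decision transitions for one baseline/perturbed pair. It illustrates the general idea only and may differ from what ai-reviewer-diagnostics computes; the function names, the default field names, and the returned keys are assumptions.

```python
import json
from collections import Counter
from pathlib import Path
from statistics import mean


def read_jsonl(path):
    """Load a JSONL file into a dict keyed by each record's `id`."""
    records = {}
    with Path(path).open(encoding="utf-8") as handle:
        for line in handle:
            if line.strip():
                record = json.loads(line)
                records[record["id"]] = record
    return records


def summarize_pair(baseline, perturbed, score_field="overall_score",
                   decision_field="final_decision", top_k=3):
    """Compare records that share an id; inputs are dicts as returned by read_jsonl."""
    shared = sorted(set(baseline) & set(perturbed))
    if not shared:
        raise ValueError("no shared ids between baseline and perturbed outputs")

    # Score delta is perturbed minus baseline, so negative means the score dropped.
    deltas = [perturbed[i][score_field] - baseline[i][score_field] for i in shared]

    # Count only transitions where the decision actually changed.
    transitions = Counter(
        (baseline[i][decision_field], perturbed[i][decision_field])
        for i in shared
        if baseline[i][decision_field] != perturbed[i][decision_field]
    )
    return {
        "n_pairs": len(shared),
        "mean_score_delta": mean(deltas),
        "decision_change_rate": sum(transitions.values()) / len(shared),
        "top_transitions": transitions.most_common(top_k),
    }
```

A typical call would be summarize_pair(read_jsonl("outputs/my_system_baseline.jsonl"), read_jsonl("outputs/my_system_soundness_perturbed.jsonl")). Records whose id appears in only one file are ignored by this sketch.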
Common commands
OpenAI-compatible / OpenRouter inference
export OPENROUTER_API_KEY=***
uv run python scripts/run_openrouter.py \
--input examples/example.json \
--output outputs/model_outputs.jsonl \
--model mistralai/mistral-small-3.1-24b-instruct \
--base-url https://openrouter.ai/api/v1 \
--api-key-env OPENROUTER_API_KEY \
--workers 1
Gemini inference
export GEMINI_API_KEY=***
uv run python scripts/run_gemini.py \
--input examples/example.json \
--output outputs/gemini_outputs.jsonl \
--model gemini-2.0-flash \
--workers 1
Optional local vLLM inference
vllm is intentionally kept out of the default install because it depends on your CUDA, PyTorch, and GPU setup.
uv sync --extra vllm
uv run python scripts/run_vllm.py \
--input examples/example.json \
--output outputs/vllm_outputs.jsonl \
--model-path Qwen/Qwen2.5-72B-Instruct \
--tensor-parallel-size 8 \
--limit 1
Clean an OpenReview export
uv run python scripts/clean_openreview.py \
--input examples/openreview_comments_minimal.json \
--output outputs/openreview_conversations.json \
--forum-id forum_example \
--print-text
For your own data, replace examples/openreview_comments_minimal.json with an OpenReview comments export.
Repository layout
ai_reviewer_diagnostics/ # pip-installable diagnostic report package
scripts/ # wrappers and reusable CLIs: quickstart, inference, preprocessing, data summary
analysis/ # analysis scripts for released annotation-score artifacts
examples/ # tiny runnable fixtures for quickstart, smoke tests, and report generation
prompts/ # curated machine-readable prompt templates
  base_prompt.jsonl
  perturb_prompt.jsonl
data/README.md # pointer to the external Hugging Face dataset
docs/ # getting-started, data, and reproducibility notes
paper/README.md # DOI, ACM PDF link, and citation pointer
CITATION.bib # BibTeX citation
CITATION.cff # GitHub citation metadata
CONTRIBUTING.md # community contribution guide
MANIFEST.md # full release inventory
Reproducibility path
The release is organized in tiers so users can get value from the public code, prompts, examples, and released artifacts:
- Immediate check: make quickstart validates layout and schemas with Python only.
- Code smoke test: make smoke-test checks scripts without API calls or GPUs.
- Artifact inspection: download the Hugging Face dataset and run summarize_release_data.py.
- Model inference: run OpenRouter, Gemini, or vLLM wrappers on your own prompt batches.
- Diagnostic report: compare a system's baseline and perturbed outputs with ai-reviewer-diagnostics.
- Analysis: use analysis/ scripts on downloaded score artifacts.
See docs/GETTING_STARTED.md and docs/REPRODUCIBILITY.md for the longer guide.
Community contributions
Bug reports, integration requests, and metric ideas are welcome. Use the GitHub issue templates for reproducible CLI bugs, new automated-review-system integrations, or report-metric proposals. Pull requests should keep the default install lightweight and API-free smoke tests passing; see CONTRIBUTING.md.
Citation
@inproceedings{li2025where,
title = {Where Do LLMs Go Wrong? Diagnosing Automated Peer Review via Aspect-Guided Multi-Level Perturbation},
author = {Li, Jiatao and Li, Yanheng and Hu, Xinyu and Gao, Mingqi and Wan, Xiaojun},
booktitle = {Proceedings of the 34th ACM International Conference on Information and Knowledge Management (CIKM '25)},
year = {2025},
publisher = {ACM},
doi = {10.1145/3746252.3761274},
url = {https://doi.org/10.1145/3746252.3761274}
}
Documentation map
- docs/GETTING_STARTED.md: shortest path for a new user.
- docs/DATA.md: dataset contents, schemas, naming glossary, and rights notes.
- docs/REPRODUCIBILITY.md: tiered reproduction guide.
- docs/INTEGRATIONS.md: connect a new automated-review system.
- scripts/README.md: CLI map and examples.
- analysis/README.md: analysis script guide.
- prompts/README.md: prompt reuse notes.
- CONTRIBUTING.md: issue/PR workflow for community contributions.
- MANIFEST.md: release inventory.
Release status
The GitHub repository contains code, curated prompts, docs, examples, citation metadata, and pointers to the paper and data. The ACM paper PDF is linked from paper/README.md rather than redistributed here. The code is MIT-licensed (see LICENSE); the dataset is hosted separately on Hugging Face under its dataset-card terms.
Contact
Open a GitHub issue in the public repository for questions, reuse requests, or reproduction problems.