Skip to main content

Diagnostic toolkit for black-box evaluation of automated peer-review systems.

Project description

Where Do LLMs Go Wrong? Diagnosing Automated Peer Review

CI PyPI Paper Dataset License: MIT Python

A pip-installable diagnostic toolkit for black-box evaluation of automated peer-review systems under controlled aspect-guided perturbations.

Use it as a community evaluation tool: run any automated review system on paired original/perturbed paper, review, or rebuttal inputs; export scores or decisions; then generate aspect-level reports measuring sensitivity to soundness, presentation, contribution, tone, factuality, completeness, and recommendation perturbations.

AI reviewer diagnostics workflow

Companion code, prompts, examples, and reproducibility notes for the CIKM 2025 paper:

Where Do LLMs Go Wrong? Diagnosing Automated Peer Review via Aspect-Guided Multi-Level Perturbation
Jiatao Li, Yanheng Li, Xinyu Hu, Mingqi Gao, Xiaojun Wan. CIKM 2025.
DOI: https://doi.org/10.1145/3746252.3761274

If this repository helps your research, please cite the paper. Copy-paste BibTeX is below and in CITATION.bib; GitHub citation metadata is in CITATION.cff.

30-second quickstart

Install the diagnostic CLI from PyPI and run the toy report. This path needs no API keys, GPUs, model downloads, or companion dataset.

python -m pip install ai-reviewer-diagnostics
ai-reviewer-diagnostics --demo --output-md outputs/demo_diagnostic_report.md

If you want the latest GitHub version before a PyPI release catches up:

python -m pip install "git+https://github.com/JiataoLi/where-do-llms-go-wrong.git"

Expected package demo output:

Compared 1 condition pair(s).
Wrote outputs/demo_diagnostic_report.md

For a repo checkout:

git clone https://github.com/JiataoLi/where-do-llms-go-wrong
cd where-do-llms-go-wrong
make quickstart
make demo-report

Expected checkout quickstart output:

AI-reviewer diagnostic release quickstart: OK
Validated 1 chat example(s), 3 OpenReview note(s).
Prompt rows: base=9, perturb=7.
Wrote outputs/quickstart/quickstart_summary.json

make quickstart validates the repo layout, example schemas, prompt files, and citation metadata, then writes a tiny demo artifact under outputs/quickstart/. make demo-report exercises the same packaged report engine exposed as ai-reviewer-diagnostics. These are format/schema demos, not model results.

What you can reuse

Goal Start here Requires
Check that the repo is healthy make quickstart Python only
Run the API-free code smoke test uv sync && uv run make smoke-test lightweight Python deps
Generate a toy diagnostic report make demo-report or ai-reviewer-diagnostics ... Python only
Reuse the prompt templates prompts/ none
Run API-based model inference scripts/run_openrouter.py, scripts/run_gemini.py API key
Run local model inference scripts/run_vllm.py GPU + vLLM
Evaluate a new review system's outputs ai-reviewer-diagnostics / ai_reviewer_diagnostics JSONL outputs with shared id fields
Inspect released artifacts docs/DATA.md + Hugging Face dataset huggingface_hub
Recreate analysis tables/figures analysis/ + docs/REPRODUCIBILITY.md dataset + analysis deps

Command map

Command Purpose
ai-reviewer-diagnostics --demo --output-md outputs/demo.md Verify the pip-installed CLI with bundled toy fixtures.
ai-reviewer-diagnostics --baseline base.jsonl --perturbed pert.jsonl --condition paper/soundness --output-md report.md Diagnose one baseline/perturbed pair from your own system.
ai-reviewer-diagnostics --scores-dir ai-reviewer-diagnostic-data/data/annotation_scores --output-md report.md Summarize all paired score files in the released dataset format.
make quickstart Check a repo checkout without installing dependencies.
make smoke-test Run the API-free repository test suite.

Integrate your own review system

The toolkit only requires JSONL outputs with shared id values. Start with docs/INTEGRATIONS.md for schema examples, custom score fields, directory mode, and common pitfalls.

Dependencies

Dependencies and package metadata are managed in pyproject.toml. The install exposes two console commands, ai-reviewer-diagnostics and the shorter alias ai-reviewer-report. Analysis and vLLM dependencies are optional extras.

uv sync                    # default runtime dependencies
uv sync --extra analysis   # pandas/numpy/scipy plotting stack
uv sync --extra vllm       # optional local GPU inference stack

If you do not use uv, the pip-compatible fallback is:

python -m pip install -e .
python -m pip install -e ".[analysis]"
python -m pip install -e ".[vllm]"
ai-reviewer-diagnostics --help

API-free smoke test

uv sync
uv run make smoke-test

This compiles Python files, validates example JSON, checks all inference runners in --validate-only mode, and runs the OpenReview-cleaner fixture. Generated files go under outputs/ and can be removed with:

make clean

Dataset

Large artifacts are hosted separately on Hugging Face so the GitHub repo stays lightweight:

https://huggingface.co/datasets/jiataoli/ai-reviewer-diagnostic-data

Download and inspect:

uv run hf download jiataoli/ai-reviewer-diagnostic-data \
  --repo-type dataset \
  --local-dir ai-reviewer-diagnostic-data
uv run python scripts/summarize_release_data.py --data-dir ai-reviewer-diagnostic-data/data

Expected summary starts with file count, total size, file types, JSONL row counts, and largest files. See docs/DATA.md for schema and naming notes.

Diagnostic toolkit workflow

To evaluate a new automated review system, export its baseline and perturbed outputs as JSONL with a shared id field and score/decision fields:

{"id":"paper_001","overall_score":8,"soundness_score":4,"final_decision":"Accept as Poster"}

Then generate an aspect-level report:

make demo-report
# or, for your own system outputs:
uv run ai-reviewer-diagnostics \
  --baseline outputs/my_system_baseline.jsonl \
  --perturbed outputs/my_system_soundness_perturbed.jsonl \
  --condition paper/soundness \
  --output-md reports/my_system_soundness_report.md \
  --output-json reports/my_system_soundness_report.json

The report summarizes score deltas, decision-change rates, and top decision transitions. If you use the public dataset score-file naming convention, run directory mode:

uv run ai-reviewer-diagnostics \
  --scores-dir ai-reviewer-diagnostic-data/data/annotation_scores \
  --output-md reports/released_scores_report.md

Common commands

OpenAI-compatible / OpenRouter inference

export OPENROUTER_API_KEY=***
uv run python scripts/run_openrouter.py \
  --input examples/example.json \
  --output outputs/model_outputs.jsonl \
  --model mistralai/mistral-small-3.1-24b-instruct \
  --base-url https://openrouter.ai/api/v1 \
  --api-key-env OPENROUTER_API_KEY \
  --workers 1

Gemini inference

export GEMINI_API_KEY=***
uv run python scripts/run_gemini.py \
  --input examples/example.json \
  --output outputs/gemini_outputs.jsonl \
  --model gemini-2.0-flash \
  --workers 1

Optional local vLLM inference

vllm is intentionally kept out of the default install because it depends on your CUDA, PyTorch, and GPU setup.

uv sync --extra vllm
uv run python scripts/run_vllm.py \
  --input examples/example.json \
  --output outputs/vllm_outputs.jsonl \
  --model-path Qwen/Qwen2.5-72B-Instruct \
  --tensor-parallel-size 8 \
  --limit 1

Clean an OpenReview export

uv run python scripts/clean_openreview.py \
  --input examples/openreview_comments_minimal.json \
  --output outputs/openreview_conversations.json \
  --forum-id forum_example \
  --print-text

For your own data, replace examples/openreview_comments_minimal.json with an OpenReview comments export.

Repository layout

ai_reviewer_diagnostics/ # pip-installable diagnostic report package
scripts/              # wrappers and reusable CLIs: quickstart, inference, preprocessing, data summary
analysis/             # analysis scripts for released annotation-score artifacts
examples/             # tiny runnable fixtures for quickstart, smoke tests, and report generation
prompts/              # curated machine-readable prompt templates
  base_prompt.jsonl
  perturb_prompt.jsonl
data/README.md        # pointer to the external Hugging Face dataset
docs/                 # getting-started, data, and reproducibility notes
paper/README.md       # DOI, ACM PDF link, and citation pointer
CITATION.bib          # BibTeX citation
CITATION.cff          # GitHub citation metadata
CONTRIBUTING.md       # community contribution guide
MANIFEST.md           # full release inventory

Reproducibility path

The release is organized in tiers so users can get value from the public code, prompts, examples, and released artifacts:

  1. Immediate check: make quickstart validates layout and schemas with Python only.
  2. Code smoke test: make smoke-test checks scripts without API calls or GPUs.
  3. Artifact inspection: download the Hugging Face dataset and run summarize_release_data.py.
  4. Model inference: run OpenRouter, Gemini, or vLLM wrappers on your own prompt batches.
  5. Diagnostic report: compare a system's baseline and perturbed outputs with ai-reviewer-diagnostics.
  6. Analysis: use analysis/ scripts on downloaded score artifacts.

See docs/GETTING_STARTED.md and docs/REPRODUCIBILITY.md for the longer guide.

Community contributions

Bug reports, integration requests, and metric ideas are welcome. Use the GitHub issue templates for reproducible CLI bugs, new automated-review-system integrations, or report-metric proposals. Pull requests should keep the default install lightweight and API-free smoke tests passing; see CONTRIBUTING.md.

Citation

@inproceedings{li2025where,
  title     = {Where Do LLMs Go Wrong? Diagnosing Automated Peer Review via Aspect-Guided Multi-Level Perturbation},
  author    = {Li, Jiatao and Li, Yanheng and Hu, Xinyu and Gao, Mingqi and Wan, Xiaojun},
  booktitle = {Proceedings of the 34th ACM International Conference on Information and Knowledge Management (CIKM '25)},
  year      = {2025},
  publisher = {ACM},
  doi       = {10.1145/3746252.3761274},
  url       = {https://doi.org/10.1145/3746252.3761274}
}

Documentation map

Release status

The GitHub repository contains code, curated prompts, docs, examples, citation metadata, and pointers to the paper/data. The ACM paper PDF is linked from paper/README.md rather than redistributed here. Code is MIT licensed in LICENSE; the dataset is hosted separately on Hugging Face under its dataset-card terms.

Contact

Open a GitHub issue in the public repository for questions, reuse requests, or reproduction problems.

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

ai_reviewer_diagnostics-0.1.0.tar.gz (16.4 kB view details)

Uploaded Source

Built Distribution

If you're not sure about the file name format, learn more about wheel file names.

ai_reviewer_diagnostics-0.1.0-py3-none-any.whl (13.4 kB view details)

Uploaded Python 3

File details

Details for the file ai_reviewer_diagnostics-0.1.0.tar.gz.

File metadata

  • Download URL: ai_reviewer_diagnostics-0.1.0.tar.gz
  • Upload date:
  • Size: 16.4 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.2.0 CPython/3.11.5

File hashes

Hashes for ai_reviewer_diagnostics-0.1.0.tar.gz
Algorithm Hash digest
SHA256 e41989413b26ee438025fcaa1aa94cf270f2808dfdf6f26b0af4f181c5ba6007
MD5 a5097d7a3415adebf700b9147fd3878b
BLAKE2b-256 ed69c413775c1b2c6654e995814c9afef82b34cb821442088b1fa0607348664d

See more details on using hashes here.

File details

Details for the file ai_reviewer_diagnostics-0.1.0-py3-none-any.whl.

File metadata

File hashes

Hashes for ai_reviewer_diagnostics-0.1.0-py3-none-any.whl
Algorithm Hash digest
SHA256 c1b48d3823781a98a4a8c74b8d585ef0ec9076995a9159834b350730db91601d
MD5 2b7fa2e77c66fab7591598132dc834bd
BLAKE2b-256 126186bb9632b7d4ce2cf1c56e9a8a7a7436c8a70c82b70564f948ff06e64d74

See more details on using hashes here.

Supported by

AWS Cloud computing and Security Sponsor Datadog Monitoring Depot Continuous Integration Fastly CDN Google Download Analytics Pingdom Monitoring Sentry Error logging StatusPage Status page