
Diagnostic toolkit for black-box evaluation of automated peer-review systems.


Where Do LLMs Go Wrong? Diagnosing Automated Peer Review


The official repository for our CIKM 2025 paper, "Where Do LLMs Go Wrong? Diagnosing Automated Peer Review via Aspect-Guided Multi-Level Perturbation".

The repository also provides a pip-installable diagnostic toolkit for black-box evaluation of automated peer-review systems. Run any automated review system on paired original and perturbed inputs (papers, reviews, or rebuttals), export its scores or decisions, then generate aspect-level reports that measure sensitivity to perturbations of soundness, presentation, contribution, tone, factuality, completeness, and recommendation.


Companion code, prompts, examples, and reproducibility notes for the CIKM 2025 paper:

Where Do LLMs Go Wrong? Diagnosing Automated Peer Review via Aspect-Guided Multi-Level Perturbation
Jiatao Li, Yanheng Li, Xinyu Hu, Mingqi Gao, Xiaojun Wan. CIKM 2025.
DOI: https://doi.org/10.1145/3746252.3761274

If this repository helps your research, please cite the paper. Copy-paste BibTeX is below and in CITATION.bib; GitHub citation metadata is in CITATION.cff.

30-second quickstart

Install the diagnostic CLI from PyPI and run the toy report. This path needs no API keys, GPUs, model downloads, or companion dataset.

python -m pip install ai-reviewer-diagnostics
ai-reviewer-diagnostics --demo --output-md outputs/demo_diagnostic_report.md

To install the latest GitHub version before the next PyPI release is published:

python -m pip install "git+https://github.com/leejamesss/where-do-llms-go-wrong.git"

Expected package demo output:

Compared 1 condition pair(s).
Wrote outputs/demo_diagnostic_report.md

For a repo checkout:

git clone https://github.com/leejamesss/where-do-llms-go-wrong
cd where-do-llms-go-wrong
make quickstart
make demo-report

Expected checkout quickstart output:

AI-reviewer diagnostic release quickstart: OK
Validated 1 chat example(s), 3 OpenReview note(s).
Prompt rows: base=9, perturb=7.
Wrote outputs/quickstart/quickstart_summary.json

make quickstart validates the repo layout, example schemas, prompt files, and citation metadata, then writes a tiny demo artifact under outputs/quickstart/. make demo-report exercises the same packaged report engine exposed as ai-reviewer-diagnostics. These are format/schema demos, not model results.

What you can reuse

| Goal | Start here | Requires |
| --- | --- | --- |
| Check that the repo is healthy | make quickstart | Python only |
| Run the API-free code smoke test | uv sync && uv run make smoke-test | lightweight Python deps |
| Generate a toy diagnostic report | make demo-report or ai-reviewer-diagnostics ... | Python only |
| Reuse the prompt templates | prompts/ | none |
| Run API-based model inference | scripts/run_openrouter.py, scripts/run_gemini.py | API key |
| Run local model inference | scripts/run_vllm.py | GPU + vLLM |
| Evaluate a new review system's outputs | ai-reviewer-diagnostics / ai_reviewer_diagnostics | JSONL outputs with shared id fields |
| Inspect released artifacts | docs/DATA.md + Hugging Face dataset | huggingface_hub |
| Recreate analysis tables/figures | analysis/ + docs/REPRODUCIBILITY.md | dataset + analysis deps |

Command map

| Command | Purpose |
| --- | --- |
| ai-reviewer-diagnostics --demo --output-md outputs/demo.md | Verify the pip-installed CLI with bundled toy fixtures. |
| ai-reviewer-diagnostics --baseline base.jsonl --perturbed pert.jsonl --condition paper/soundness --output-md report.md | Diagnose one baseline/perturbed pair from your own system. |
| ai-reviewer-diagnostics --scores-dir ai-reviewer-diagnostic-data/data/annotation_scores --output-md report.md | Summarize all paired score files in the released dataset format. |
| make quickstart | Check a repo checkout without installing dependencies. |
| make smoke-test | Run the API-free repository test suite. |
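
If you need one report per perturbation condition, the single-pair command can be assembled programmatically. The following is a minimal sketch, not part of the package: the helper names build_report_command and build_batch are hypothetical, and the file names are placeholders. Only the flags --baseline, --perturbed, --condition, and --output-md and the condition label paper/soundness come from the commands listed here.

```python
from pathlib import Path


def build_report_command(baseline, perturbed, condition, output_md):
    """Return the argv list for one baseline/perturbed comparison."""
    return [
        "ai-reviewer-diagnostics",
        "--baseline", str(baseline),
        "--perturbed", str(perturbed),
        "--condition", condition,
        "--output-md", str(output_md),
    ]


def build_batch(baseline, perturbed_by_condition, report_dir):
    """Build one command per condition; report names replace '/' with '_'."""
    report_dir = Path(report_dir)
    commands = []
    for condition, perturbed in sorted(perturbed_by_condition.items()):
        report = report_dir / f"{condition.replace('/', '_')}.md"
        commands.append(
            build_report_command(baseline, perturbed, condition, report)
        )
    return commands


if __name__ == "__main__":
    # Placeholder file names; add further conditions as your system exports them.
    batch = build_batch(
        "base.jsonl",
        {"paper/soundness": "pert_soundness.jsonl"},
        "reports",
    )
    for argv in batch:
        # Print only; pass argv to subprocess.run(argv, check=True) to execute.
        print(" ".join(argv))
```

Building argv lists instead of shell strings avoids quoting problems when paths contain spaces.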

Integrate your own review system

The toolkit only requires JSONL outputs with shared id values. Start with docs/INTEGRATIONS.md for schema examples, custom score fields, directory mode, and common pitfalls.
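
Before generating a report, it can help to confirm that the two files really share id values. The sketch below is an illustrative pre-check written for this README, not the toolkit's own validation; load_ids and compare_ids are hypothetical helper names, and how the CLI itself treats unpaired or duplicated ids is not specified here.

```python
import json


def load_ids(path):
    """Return the id values from a JSONL file, skipping blank lines."""
    ids = []
    with open(path, encoding="utf-8") as handle:
        for line_number, line in enumerate(handle, start=1):
            if not line.strip():
                continue
            record = json.loads(line)
            if "id" not in record:
                raise ValueError(f"{path}:{line_number} has no 'id' field")
            ids.append(record["id"])
    return ids


def compare_ids(baseline_ids, perturbed_ids):
    """Summarize which ids are paired, unpaired, or duplicated."""
    base, pert = set(baseline_ids), set(perturbed_ids)
    duplicates = {
        value
        for ids in (baseline_ids, perturbed_ids)
        for value in ids
        if ids.count(value) > 1
    }
    return {
        "paired": sorted(base & pert),
        "only_baseline": sorted(base - pert),
        "only_perturbed": sorted(pert - base),
        "duplicates": sorted(duplicates),
    }
```

A typical call is compare_ids(load_ids("base.jsonl"), load_ids("pert.jsonl")); an empty "paired" list usually means the two exports used different id schemes.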

Dependencies

Dependencies and package metadata are managed in pyproject.toml. The install exposes two console commands, ai-reviewer-diagnostics and the shorter alias ai-reviewer-report. Analysis and vLLM dependencies are optional extras.

uv sync                    # default runtime dependencies
uv sync --extra analysis   # pandas/numpy/scipy plotting stack
uv sync --extra vllm       # optional local GPU inference stack

If you do not use uv, the pip-compatible fallback is:

python -m pip install -e .
python -m pip install -e ".[analysis]"
python -m pip install -e ".[vllm]"
ai-reviewer-diagnostics --help

API-free smoke test

uv sync
uv run make smoke-test

This compiles Python files, validates example JSON, checks all inference runners in --validate-only mode, and runs the OpenReview-cleaner fixture. Generated files go under outputs/ and can be removed with:

make clean

Dataset

The core paired pre-/post-perturbation content artifacts are hosted separately on Hugging Face so the GitHub repo stays lightweight:

https://huggingface.co/datasets/leejamesssss/ai-reviewer-diagnostic-data

Download and inspect:

uv run hf download leejamesssss/ai-reviewer-diagnostic-data \
  --repo-type dataset \
  --local-dir ai-reviewer-diagnostic-data
uv run python scripts/summarize_release_data.py --data-dir ai-reviewer-diagnostic-data/data

The summary starts with the file count, total size, file types, JSONL row counts, and largest files. In the current release, content_pairs/ is the core data contribution and holds the before/after perturbation pairs, while annotation_scores/ contains our experiment score outputs and summary tables. See docs/DATA.md for schema and naming notes.
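
For a quick look at a downloaded copy without the repo scripts, the same kinds of figures can be gathered with the standard library. This is a rough sketch under stated assumptions, not the implementation of scripts/summarize_release_data.py: summarize_dir is a hypothetical name, and the output keys are chosen for illustration only.

```python
from pathlib import Path


def summarize_dir(data_dir, top_n=3):
    """Collect file count, total bytes, per-suffix counts, and JSONL row counts."""
    root = Path(data_dir)
    files = sorted(p for p in root.rglob("*") if p.is_file())
    by_suffix = {}
    jsonl_rows = {}
    for path in files:
        suffix = path.suffix or "(none)"
        by_suffix[suffix] = by_suffix.get(suffix, 0) + 1
        if path.suffix == ".jsonl":
            # Count non-blank lines as rows.
            with path.open(encoding="utf-8") as handle:
                rows = sum(1 for line in handle if line.strip())
            jsonl_rows[path.relative_to(root).as_posix()] = rows
    largest = sorted(files, key=lambda p: p.stat().st_size, reverse=True)
    return {
        "file_count": len(files),
        "total_bytes": sum(p.stat().st_size for p in files),
        "by_suffix": by_suffix,
        "jsonl_rows": jsonl_rows,
        "largest": [p.relative_to(root).as_posix() for p in largest[:top_n]],
    }
```

Calling summarize_dir("ai-reviewer-diagnostic-data/data") returns a plain dictionary that can be printed or dumped to JSON.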

Diagnostic toolkit workflow

To evaluate a new automated review system, export its baseline and perturbed outputs as JSONL with a shared id field and score/decision fields:

{"id":"paper_001","overall_score":8,"soundness_score":4,"final_decision":"Accept as Poster"}

Then generate an aspect-level report:

make demo-report
# or, for your own system outputs:
uv run ai-reviewer-diagnostics \
  --baseline outputs/my_system_baseline.jsonl \
  --perturbed outputs/my_system_soundness_perturbed.jsonl \
  --condition paper/soundness \
  --output-md reports/my_system_soundness_report.md \
  --output-json reports/my_system_soundness_report.json

The report summarizes score deltas, decision-change rates, and top decision transitions. If you use the public dataset score-file naming convention, run directory mode:

uv run ai-reviewer-diagnostics \
  --scores-dir ai-reviewer-diagnostic-data/data/annotation_scores \
  --output-md reports/released_scores_report.md
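
To make the reported quantities concrete, the sketch below computes a mean score delta, a decision-change rate, and the most frequent decision transitions for records paired by id. It is an illustration only: the function name diagnose, the sign convention (perturbed minus baseline), and the output keys are assumptions, and the packaged report may define or aggregate these metrics differently.

```python
from collections import Counter
from statistics import mean


def diagnose(baseline, perturbed,
             score_field="overall_score", decision_field="final_decision"):
    """Compare records paired by id; assumes delta = perturbed - baseline."""
    base_by_id = {record["id"]: record for record in baseline}
    pairs = [
        (base_by_id[record["id"]], record)
        for record in perturbed
        if record["id"] in base_by_id
    ]
    if not pairs:
        raise ValueError("no shared id values between the two inputs")
    deltas = [pert[score_field] - base[score_field] for base, pert in pairs]
    # Count only pairs whose decision actually changed.
    transitions = Counter(
        (base[decision_field], pert[decision_field])
        for base, pert in pairs
        if base[decision_field] != pert[decision_field]
    )
    return {
        "n_pairs": len(pairs),
        "mean_delta": mean(deltas),
        "decision_change_rate": sum(transitions.values()) / len(pairs),
        "top_transitions": transitions.most_common(3),
    }


if __name__ == "__main__":
    # Made-up toy records, not results from the paper or dataset.
    base = [
        {"id": "a", "overall_score": 8, "final_decision": "Accept"},
        {"id": "b", "overall_score": 6, "final_decision": "Reject"},
    ]
    pert = [
        {"id": "a", "overall_score": 6, "final_decision": "Reject"},
        {"id": "b", "overall_score": 6, "final_decision": "Reject"},
    ]
    print(diagnose(base, pert))
```

Under this convention, a negative mean delta means the perturbed inputs received lower scores than the originals.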

Common commands

OpenAI-compatible / OpenRouter inference

export OPENROUTER_API_KEY=***
uv run python scripts/run_openrouter.py \
  --input examples/example.json \
  --output outputs/model_outputs.jsonl \
  --model mistralai/mistral-small-3.1-24b-instruct \
  --base-url https://openrouter.ai/api/v1 \
  --api-key-env OPENROUTER_API_KEY \
  --workers 1

Gemini inference

export GEMINI_API_KEY=***
uv run python scripts/run_gemini.py \
  --input examples/example.json \
  --output outputs/gemini_outputs.jsonl \
  --model gemini-2.0-flash \
  --workers 1

Optional local vLLM inference

vllm is intentionally kept out of the default install because it depends on your CUDA, PyTorch, and GPU setup.

uv sync --extra vllm
uv run python scripts/run_vllm.py \
  --input examples/example.json \
  --output outputs/vllm_outputs.jsonl \
  --model-path Qwen/Qwen2.5-72B-Instruct \
  --tensor-parallel-size 8 \
  --limit 1

Clean an OpenReview export

uv run python scripts/clean_openreview.py \
  --input examples/openreview_comments_minimal.json \
  --output outputs/openreview_conversations.json \
  --forum-id forum_example \
  --print-text

For your own data, replace examples/openreview_comments_minimal.json with an OpenReview comments export.

Repository layout

ai_reviewer_diagnostics/ # pip-installable diagnostic report package
scripts/              # wrappers and reusable CLIs: quickstart, inference, preprocessing, data summary
analysis/             # analysis scripts for released experiment score outputs
examples/             # tiny runnable fixtures for quickstart, smoke tests, and report generation
prompts/              # curated machine-readable prompt templates
  base_prompt.jsonl
  perturb_prompt.jsonl
data/README.md        # pointer to the external Hugging Face dataset
docs/                 # getting-started, data, and reproducibility notes
paper/README.md       # DOI, ACM PDF link, and citation pointer
CITATION.bib          # BibTeX citation
CITATION.cff          # GitHub citation metadata
CONTRIBUTING.md       # community contribution guide
MANIFEST.md           # full release inventory

Reproducibility path

The release is organized in tiers so users can get value from the public code, prompts, examples, and released artifacts:

  1. Immediate check: make quickstart validates layout and schemas with Python only.
  2. Code smoke test: make smoke-test checks scripts without API calls or GPUs.
  3. Artifact inspection: download the Hugging Face dataset and inspect content_pairs/ first; this is the reusable before/after perturbation benchmark. perturbed_contents/ keeps perturbed-only artifacts for alignment with the original experiments.
  4. Model inference: run OpenRouter, Gemini, or vLLM wrappers on your own prompt batches.
  5. Diagnostic report: compare a system's baseline and perturbed outputs with ai-reviewer-diagnostics.
  6. Analysis: use analysis/ scripts on downloaded experiment score artifacts when reproducing our reported score analyses.

See docs/GETTING_STARTED.md and docs/REPRODUCIBILITY.md for the longer guide.

Community contributions

Bug reports, integration requests, and metric ideas are welcome. Use the GitHub issue templates for reproducible CLI bugs, new automated-review-system integrations, or report-metric proposals. Pull requests should keep the default install lightweight and API-free smoke tests passing; see CONTRIBUTING.md.

Citation

@inproceedings{li2025where,
  title     = {Where Do LLMs Go Wrong? Diagnosing Automated Peer Review via Aspect-Guided Multi-Level Perturbation},
  author    = {Li, Jiatao and Li, Yanheng and Hu, Xinyu and Gao, Mingqi and Wan, Xiaojun},
  booktitle = {Proceedings of the 34th ACM International Conference on Information and Knowledge Management (CIKM '25)},
  year      = {2025},
  publisher = {ACM},
  doi       = {10.1145/3746252.3761274},
  url       = {https://doi.org/10.1145/3746252.3761274}
}

Release status

The GitHub repository contains code, curated prompts, docs, examples, citation metadata, and pointers to the paper and data. The ACM paper PDF is linked from paper/README.md rather than redistributed here. The code is released under the MIT license (see LICENSE); the dataset is hosted separately on Hugging Face under the terms of its dataset card.

Contact

Open a GitHub issue in the public repository for questions, reuse requests, or reproduction problems.
