Diagnostic CLI for evaluating automated peer-review systems with aspect-guided perturbation data.

Where Do LLMs Go Wrong? Diagnosing Automated Peer Review

License: MIT

Official repository for the CIKM 2025 paper “Where Do LLMs Go Wrong? Diagnosing Automated Peer Review via Aspect-Guided Multi-Level Perturbation.”

Use this repo to evaluate automated peer-review systems on paired original/perturbed inputs and generate aspect-level diagnostic reports. The data lives on Hugging Face; the installable reporting CLI lives on PyPI.

Official resources

Resource | Link | Use
Paper | ACM DOI | CIKM 2025 publication
Dataset | PKU-ONELab/ai-reviewer-diagnostic-data | paired perturbation benchmark and released score artifacts
Package | ai-reviewer-diagnostics on PyPI | installable diagnostic-report CLI
Code | PKU-ONELab/where-do-llms-go-wrong | scripts, prompts, docs, and reproducibility workflow

Start fast

Install the CLI and run the bundled demo. No dataset, API key, GPU, or model download is needed.

python -m pip install ai-reviewer-diagnostics
ai-reviewer-diagnostics --demo --output-md outputs/demo_diagnostic_report.md

Expected output:

Compared 1 condition pair(s).
Wrote outputs/demo_diagnostic_report.md

For a repo checkout:

git clone https://github.com/PKU-ONELab/where-do-llms-go-wrong
cd where-do-llms-go-wrong
make quickstart
make demo-report

What is included

Need | Use
Diagnostic report CLI | ai-reviewer-diagnostics / ai-reviewer-report
Main paired perturbation data | HF dataset → data/content_pairs/*.jsonl
Released score artifacts | HF dataset → data/annotation_scores/*.jsonl
Prompt templates | prompts/
API / local inference wrappers | scripts/
Paper-analysis scripts | analysis/
Reproduction notes | docs/REPRODUCIBILITY.md

Use the dataset

The primary dataset consists of before/after perturbation pairs:

from datasets import load_dataset
pairs = load_dataset("PKU-ONELab/ai-reviewer-diagnostic-data", split="train")
print(pairs[0].keys())  # id, source, aspect, content_before, content_after
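
To see what a perturbation actually changed, the two text fields of a pair can be diffed. The following is a minimal sketch using the standard-library difflib module. It assumes content_before and content_after are plain strings; the example record and the helper name perturbation_diff are hypothetical and are not part of the released data or package.

```python
import difflib


def perturbation_diff(record, context=0):
    """Return unified-diff lines between content_before and content_after."""
    before = record["content_before"].splitlines()
    after = record["content_after"].splitlines()
    return list(
        difflib.unified_diff(
            before, after, fromfile="before", tofile="after", n=context, lineterm=""
        )
    )


# Hypothetical record that mirrors the documented field names only.
example = {
    "id": "demo_001",
    "source": "paper",
    "aspect": "soundness",
    "content_before": "We evaluate on three datasets.\nResults are averaged over five seeds.",
    "content_after": "We evaluate on three datasets.\nResults are from a single run.",
}

diff_lines = perturbation_diff(example)

# Keep only added/removed lines, skipping the "---"/"+++" file headers.
changed = [
    line
    for line in diff_lines
    if line.startswith(("+", "-")) and not line.startswith(("+++", "---"))
]
print("\n".join(diff_lines))
```

With a record from the real dataset, pass pairs[0] (or any other row) in place of the example dictionary.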

Download the full HF repo when you need manifests or score artifacts:

hf download PKU-ONELab/ai-reviewer-diagnostic-data \
  --repo-type dataset \
  --local-dir ai-reviewer-diagnostic-data
python scripts/summarize_release_data.py --data-dir ai-reviewer-diagnostic-data/data

content_pairs/ is the canonical benchmark surface. annotation_scores/ is for reproducing or auditing the paper’s released scoring outputs. The duplicated perturbed-only view is intentionally excluded because content_pairs.content_after already contains it.
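
Once the repo is downloaded, the pair files can also be read without the datasets library. The sketch below assumes each file under data/content_pairs/ is JSONL with one record per line and the source and aspect fields shown earlier; the helper names iter_jsonl and count_pairs and the two demo records are hypothetical.

```python
import json
import tempfile
from collections import Counter
from pathlib import Path


def iter_jsonl(path):
    """Yield one dict per non-empty line of a JSONL file."""
    with Path(path).open(encoding="utf-8") as fh:
        for line in fh:
            if line.strip():
                yield json.loads(line)


def count_pairs(pairs_dir):
    """Count records per (source, aspect) across every *.jsonl file in a directory."""
    counts = Counter()
    for path in sorted(Path(pairs_dir).glob("*.jsonl")):
        for record in iter_jsonl(path):
            counts[(record.get("source"), record.get("aspect"))] += 1
    return counts


# Demo on a throwaway directory with two made-up records.
with tempfile.TemporaryDirectory() as tmp:
    rows = [
        {"id": "a", "source": "paper", "aspect": "soundness",
         "content_before": "x", "content_after": "y"},
        {"id": "b", "source": "paper", "aspect": "soundness",
         "content_before": "x", "content_after": "z"},
    ]
    demo_file = Path(tmp) / "demo.jsonl"
    demo_file.write_text(
        "\n".join(json.dumps(r) for r in rows) + "\n", encoding="utf-8"
    )
    demo_counts = count_pairs(tmp)

print(demo_counts)
```

For the real data, point count_pairs at ai-reviewer-diagnostic-data/data/content_pairs after the download step above.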

Evaluate your own review system

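Export your system's outputs on the original (baseline) and perturbed inputs as two JSONL files, one record per line. Records in the two files are paired by shared id values and may carry any score or decision fields: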

{"id":"paper_001","overall_score":8,"soundness_score":4,"final_decision":"Accept as Poster"}
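
A minimal sketch of producing such a pair of files from Python follows. The field names come from the example record above; the helper names write_jsonl and unmatched_ids, the second decision label, and the score values are hypothetical. Checking for unmatched ids before running the CLI is a suggested precaution, not a documented requirement.

```python
import json
import tempfile
from pathlib import Path


def write_jsonl(records, path):
    """Write an iterable of dicts as JSONL, creating parent directories."""
    path = Path(path)
    path.parent.mkdir(parents=True, exist_ok=True)
    with path.open("w", encoding="utf-8") as fh:
        for record in records:
            fh.write(json.dumps(record, ensure_ascii=False) + "\n")
    return path


def unmatched_ids(baseline, perturbed):
    """Return ids that appear in only one of the two record lists."""
    base_ids = {r["id"] for r in baseline}
    pert_ids = {r["id"] for r in perturbed}
    return sorted(base_ids ^ pert_ids)


# Hypothetical outputs of a review system on original and perturbed inputs.
baseline = [{"id": "paper_001", "overall_score": 8, "soundness_score": 4,
             "final_decision": "Accept as Poster"}]
perturbed = [{"id": "paper_001", "overall_score": 5, "soundness_score": 2,
              "final_decision": "Reject"}]

missing = unmatched_ids(baseline, perturbed)

with tempfile.TemporaryDirectory() as tmp:
    base_path = write_jsonl(baseline, Path(tmp) / "outputs" / "my_system_baseline.jsonl")
    pert_path = write_jsonl(perturbed, Path(tmp) / "outputs" / "my_system_soundness_perturbed.jsonl")
    written_lines = base_path.read_text(encoding="utf-8").splitlines()

print(missing, len(written_lines))
```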

Then run:

ai-reviewer-diagnostics \
  --baseline outputs/my_system_baseline.jsonl \
  --perturbed outputs/my_system_soundness_perturbed.jsonl \
  --condition paper/soundness \
  --output-md reports/my_system_soundness_report.md \
  --output-json reports/my_system_soundness_report.json

Directory mode works for released score files:

ai-reviewer-diagnostics \
  --scores-dir ai-reviewer-diagnostic-data/data/annotation_scores \
  --output-md reports/released_scores_report.md

The report summarizes score deltas, decision-change rates, and top decision transitions. See docs/INTEGRATIONS.md for schemas and custom fields.
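
To make the three summary quantities concrete, here is an illustrative sketch of how a mean score delta, a decision-change rate, and decision transitions could be computed from paired records. It is not the package's implementation and the CLI may define these differently; the function name summarize_pair, its parameters, and the sample records are assumptions.

```python
from collections import Counter
from statistics import mean


def summarize_pair(baseline, perturbed,
                   score_field="overall_score", decision_field="final_decision"):
    """Pair records by id, then summarize score deltas and decision changes."""
    perturbed_by_id = {r["id"]: r for r in perturbed}
    deltas = []
    transitions = Counter()
    n_decisions = 0
    for base in baseline:
        pert = perturbed_by_id.get(base["id"])
        if pert is None:
            continue  # no perturbed counterpart for this id
        if score_field in base and score_field in pert:
            deltas.append(pert[score_field] - base[score_field])
        if decision_field in base and decision_field in pert:
            n_decisions += 1
            if base[decision_field] != pert[decision_field]:
                transitions[(base[decision_field], pert[decision_field])] += 1
    changed = sum(transitions.values())
    return {
        "mean_delta": mean(deltas) if deltas else None,
        "decision_change_rate": changed / n_decisions if n_decisions else None,
        "top_transitions": transitions.most_common(3),
    }


# Hypothetical paired outputs; scores and decision labels are made up.
baseline = [
    {"id": "paper_001", "overall_score": 8, "final_decision": "Accept as Poster"},
    {"id": "paper_002", "overall_score": 6, "final_decision": "Reject"},
]
perturbed = [
    {"id": "paper_001", "overall_score": 5, "final_decision": "Reject"},
    {"id": "paper_002", "overall_score": 6, "final_decision": "Reject"},
]

summary = summarize_pair(baseline, perturbed)
print(summary)
```

A negative mean delta here means the system scored the perturbed inputs lower on average than the originals.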

Development commands

uv sync                    # default runtime deps
uv sync --extra analysis   # analysis deps
uv sync --extra vllm       # optional local GPU inference deps
uv run make smoke-test     # API-free checks
make clean                 # remove generated outputs

Inference wrappers:

python scripts/run_openrouter.py --input examples/example.json --output outputs/openrouter.jsonl --model <model> --api-key-env OPENROUTER_API_KEY
python scripts/run_gemini.py --input examples/example.json --output outputs/gemini.jsonl --model gemini-2.0-flash
python scripts/run_vllm.py --input examples/example.json --output outputs/vllm.jsonl --model-path <hf-or-local-model>

Repository map

ai_reviewer_diagnostics/  # pip package and diagnostic report CLI
scripts/                  # inference, preprocessing, quickstart, data-summary CLIs
analysis/                 # scripts for released score artifacts
examples/                 # tiny fixtures for demos/tests
prompts/                  # prompt templates
data/README.md            # pointer to HF dataset
docs/                     # data, integrations, reproducibility
paper/README.md           # DOI / citation pointer

Citation

@inproceedings{li2025where,
  title     = {Where Do LLMs Go Wrong? Diagnosing Automated Peer Review via Aspect-Guided Multi-Level Perturbation},
  author    = {Li, Jiatao and Li, Yanheng and Hu, Xinyu and Gao, Mingqi and Wan, Xiaojun},
  booktitle = {Proceedings of the 34th ACM International Conference on Information and Knowledge Management (CIKM '25)},
  year      = {2025},
  publisher = {ACM},
  doi       = {10.1145/3746252.3761274},
  url       = {https://doi.org/10.1145/3746252.3761274}
}

Docs: GETTING_STARTED, DATA, REPRODUCIBILITY, INTEGRATIONS, CONTRIBUTING.
