
Diagnostic toolkit for black-box evaluation of automated peer-review systems.


Where Do LLMs Go Wrong? Diagnosing Automated Peer Review

License: MIT

Official repository for the CIKM 2025 paper “Where Do LLMs Go Wrong? Diagnosing Automated Peer Review via Aspect-Guided Multi-Level Perturbation.”

Use this repo to evaluate automated peer-review systems on paired original/perturbed inputs and generate aspect-level diagnostic reports. The data lives on Hugging Face; the installable reporting CLI lives on PyPI.


Start fast

Install the CLI and run the bundled demo. No dataset, API key, GPU, or model download is needed.

python -m pip install ai-reviewer-diagnostics
ai-reviewer-diagnostics --demo --output-md outputs/demo_diagnostic_report.md

Expected output:

Compared 1 condition pair(s).
Wrote outputs/demo_diagnostic_report.md

For a repo checkout:

git clone https://github.com/leejamesss/where-do-llms-go-wrong
cd where-do-llms-go-wrong
make quickstart
make demo-report

What is included

Need                              Use
Diagnostic report CLI             ai-reviewer-diagnostics / ai-reviewer-report
Main paired perturbation data     HF dataset → data/content_pairs/*.jsonl
Released score artifacts          HF dataset → data/annotation_scores/*.jsonl
Prompt templates                  prompts/
API / local inference wrappers    scripts/
Paper-analysis scripts            analysis/
Reproduction notes                docs/REPRODUCIBILITY.md

Use the dataset

The primary dataset is before/after perturbation pairs:

from datasets import load_dataset
pairs = load_dataset("leejamesssss/ai-reviewer-diagnostic-data", split="train")
print(pairs[0].keys())  # id, source, aspect, content_before, content_after

Download the full HF repo when you need manifests or score artifacts:

hf download leejamesssss/ai-reviewer-diagnostic-data \
  --repo-type dataset \
  --local-dir ai-reviewer-diagnostic-data
python scripts/summarize_release_data.py --data-dir ai-reviewer-diagnostic-data/data

content_pairs/ is the canonical benchmark surface. annotation_scores/ is for reproducing or auditing the paper’s released scoring outputs. The duplicated perturbed-only view is intentionally excluded because content_pairs.content_after already contains it.
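
To get a feel for the pair records before downloading anything, here is a minimal offline sketch. It assumes each record is a dict with the fields printed above (id, source, aspect, content_before, content_after). The sample rows, the source and aspect values, and the helper names count_by_condition and changed_pairs are made up for illustration and are not part of the released tooling.

```python
from collections import Counter

# Hypothetical records that mimic the documented fields; the real values
# in the dataset may differ.
sample_pairs = [
    {"id": "p1", "source": "paper", "aspect": "soundness",
     "content_before": "We prove convergence.",
     "content_after": "We assume convergence."},
    {"id": "p2", "source": "paper", "aspect": "soundness",
     "content_before": "Results hold on five datasets.",
     "content_after": "Results hold on one dataset."},
    {"id": "p3", "source": "review", "aspect": "clarity",
     "content_before": "The paper is well written.",
     "content_after": "The paper is well written."},
]


def count_by_condition(pairs):
    """Count pairs per (source, aspect) combination."""
    return Counter((row["source"], row["aspect"]) for row in pairs)


def changed_pairs(pairs):
    """Keep only rows whose perturbed text differs from the original."""
    return [row for row in pairs if row["content_before"] != row["content_after"]]


counts = count_by_condition(sample_pairs)
changed_ids = [row["id"] for row in changed_pairs(sample_pairs)]
print(counts)
print(changed_ids)
```

The same two helpers should also work on rows loaded with load_dataset, since iterating a Hugging Face dataset split yields plain dicts.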

Evaluate your own review system

Export your system's baseline and perturbed outputs as two JSONL files, one record per line. Records in both files share id values and carry any score/decision fields:

{"id":"paper_001","overall_score":8,"soundness_score":4,"final_decision":"Accept as Poster"}
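
A minimal sketch of producing the two files and checking that their id values line up before running the CLI. The helper names (write_jsonl, read_ids, unmatched_ids) and the second sample record are hypothetical. How the CLI itself treats unmatched ids is not stated here, so this check is only a precaution.

```python
import json
import tempfile
from pathlib import Path


def write_jsonl(path, records):
    """Write one JSON object per line, creating parent folders as needed."""
    path = Path(path)
    path.parent.mkdir(parents=True, exist_ok=True)
    with path.open("w", encoding="utf-8") as fh:
        for record in records:
            fh.write(json.dumps(record) + "\n")


def read_ids(path):
    """Collect the id of every non-empty line in a JSONL file."""
    with Path(path).open(encoding="utf-8") as fh:
        return {json.loads(line)["id"] for line in fh if line.strip()}


def unmatched_ids(baseline_path, perturbed_path):
    """Return ids that appear in only one of the two files."""
    return sorted(read_ids(baseline_path) ^ read_ids(perturbed_path))


# Hypothetical system outputs; only paper_001 has a perturbed counterpart.
baseline = [
    {"id": "paper_001", "overall_score": 8, "soundness_score": 4,
     "final_decision": "Accept as Poster"},
    {"id": "paper_002", "overall_score": 5, "soundness_score": 3,
     "final_decision": "Reject"},
]
perturbed = [
    {"id": "paper_001", "overall_score": 6, "soundness_score": 2,
     "final_decision": "Reject"},
]

with tempfile.TemporaryDirectory() as tmp:
    base_path = Path(tmp) / "outputs" / "my_system_baseline.jsonl"
    pert_path = Path(tmp) / "outputs" / "my_system_soundness_perturbed.jsonl"
    write_jsonl(base_path, baseline)
    write_jsonl(pert_path, perturbed)
    missing = unmatched_ids(base_path, pert_path)

print(missing)  # ['paper_002']
```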

Then run:

ai-reviewer-diagnostics \
  --baseline outputs/my_system_baseline.jsonl \
  --perturbed outputs/my_system_soundness_perturbed.jsonl \
  --condition paper/soundness \
  --output-md reports/my_system_soundness_report.md \
  --output-json reports/my_system_soundness_report.json

Directory mode works for released score files:

ai-reviewer-diagnostics \
  --scores-dir ai-reviewer-diagnostic-data/data/annotation_scores \
  --output-md reports/released_scores_report.md

The report summarizes score deltas, decision-change rates, and top decision transitions. See docs/INTEGRATIONS.md for schemas and custom fields.
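
To make the three report quantities concrete, the sketch below computes a mean score delta, a decision-change rate, and the most common decision transitions from paired records. It illustrates the arithmetic only and is not the CLI's implementation. The function name diagnose, its parameters, and the sample records are assumptions, and the real report may define or aggregate these quantities differently.

```python
from collections import Counter


def diagnose(baseline, perturbed,
             score_field="overall_score", decision_field="final_decision"):
    """Summarize paired records that share an id (illustrative only)."""
    base_by_id = {row["id"]: row for row in baseline}
    deltas = []
    transitions = Counter()
    for row in perturbed:
        base = base_by_id.get(row["id"])
        if base is None:
            continue  # skip perturbed rows with no baseline counterpart
        deltas.append(row[score_field] - base[score_field])
        if row[decision_field] != base[decision_field]:
            transitions[(base[decision_field], row[decision_field])] += 1
    if not deltas:
        raise ValueError("no records with matching ids")
    n_pairs = len(deltas)
    return {
        "n_pairs": n_pairs,
        "mean_score_delta": sum(deltas) / n_pairs,
        "decision_change_rate": sum(transitions.values()) / n_pairs,
        "top_transitions": transitions.most_common(3),
    }


# Hypothetical outputs from a review system under one perturbation.
baseline = [
    {"id": "paper_001", "overall_score": 8, "final_decision": "Accept as Poster"},
    {"id": "paper_002", "overall_score": 6, "final_decision": "Reject"},
]
perturbed = [
    {"id": "paper_001", "overall_score": 5, "final_decision": "Reject"},
    {"id": "paper_002", "overall_score": 6, "final_decision": "Reject"},
]

summary = diagnose(baseline, perturbed)
print(summary["mean_score_delta"])      # -1.5
print(summary["decision_change_rate"])  # 0.5
print(summary["top_transitions"])       # [(('Accept as Poster', 'Reject'), 1)]
```

A negative mean delta together with a non-zero decision-change rate means the system's scores dropped and some decisions flipped once the input was perturbed.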

Development commands

uv sync                    # default runtime deps
uv sync --extra analysis   # analysis deps
uv sync --extra vllm       # optional local GPU inference deps
uv run make smoke-test     # API-free checks
make clean                 # remove generated outputs

Inference wrappers:

python scripts/run_openrouter.py --input examples/example.json --output outputs/openrouter.jsonl --model <model> --api-key-env OPENROUTER_API_KEY
python scripts/run_gemini.py --input examples/example.json --output outputs/gemini.jsonl --model gemini-2.0-flash
python scripts/run_vllm.py --input examples/example.json --output outputs/vllm.jsonl --model-path <hf-or-local-model>

Repository map

ai_reviewer_diagnostics/  # pip package and diagnostic report CLI
scripts/                  # inference, preprocessing, quickstart, data-summary CLIs
analysis/                 # scripts for released score artifacts
examples/                 # tiny fixtures for demos/tests
prompts/                  # prompt templates
data/README.md            # pointer to HF dataset
docs/                     # data, integrations, reproducibility
paper/README.md           # DOI / citation pointer

Citation

@inproceedings{li2025where,
  title     = {Where Do LLMs Go Wrong? Diagnosing Automated Peer Review via Aspect-Guided Multi-Level Perturbation},
  author    = {Li, Jiatao and Li, Yanheng and Hu, Xinyu and Gao, Mingqi and Wan, Xiaojun},
  booktitle = {Proceedings of the 34th ACM International Conference on Information and Knowledge Management (CIKM '25)},
  year      = {2025},
  publisher = {ACM},
  doi       = {10.1145/3746252.3761274},
  url       = {https://doi.org/10.1145/3746252.3761274}
}

Docs: GETTING_STARTED, DATA, REPRODUCIBILITY, INTEGRATIONS, CONTRIBUTING.
