Diagnostic toolkit for black-box evaluation of automated peer-review systems.

Where Do LLMs Go Wrong? Diagnosing Automated Peer Review

Official repository for the CIKM 2025 paper “Where Do LLMs Go Wrong? Diagnosing Automated Peer Review via Aspect-Guided Multi-Level Perturbation.”

Use this repo to evaluate automated peer-review systems on paired original/perturbed inputs and generate aspect-level diagnostic reports. The data lives on Hugging Face; the installable reporting CLI lives on PyPI.

Start fast

Install the CLI and run the bundled demo. No dataset, API key, GPU, or model download is needed.

python -m pip install ai-reviewer-diagnostics
ai-reviewer-diagnostics --demo --output-md outputs/demo_diagnostic_report.md

Expected output:

Compared 1 condition pair(s).
Wrote outputs/demo_diagnostic_report.md

For a repo checkout:

git clone https://github.com/PKU-ONELab/where-do-llms-go-wrong
cd where-do-llms-go-wrong
make quickstart
make demo-report

What is included

Need	Use
Diagnostic report CLI	`ai-reviewer-diagnostics` / `ai-reviewer-report`
Main paired perturbation data	HF dataset → `data/content_pairs/*.jsonl`
Released score artifacts	HF dataset → `data/annotation_scores/*.jsonl`
Prompt templates	`prompts/`
API / local inference wrappers	`scripts/`
Paper-analysis scripts	`analysis/`
Reproduction notes	`docs/REPRODUCIBILITY.md`

Use the dataset

The primary dataset is before/after perturbation pairs:

from datasets import load_dataset
pairs = load_dataset("leejamesssss/ai-reviewer-diagnostic-data", split="train")
print(pairs[0].keys())  # id, source, aspect, content_before, content_after
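
To inspect what a perturbation changed in a single pair, a line-level diff of `content_before` against `content_after` is often enough. The sketch below uses only the standard library and a hand-written record with the fields listed above. The record's values and the `diff_pair` helper are hypothetical and only illustrate the idea; they are not part of the released package.

```python
import difflib


def diff_pair(pair: dict) -> list[str]:
    """Return unified-diff lines between the original and perturbed text."""
    before = pair["content_before"].splitlines()
    after = pair["content_after"].splitlines()
    return list(
        difflib.unified_diff(before, after, fromfile="before", tofile="after", lineterm="")
    )


# Hypothetical record shaped like one row of the pairs dataset.
example = {
    "id": "paper_001",
    "source": "paper",
    "aspect": "soundness",
    "content_before": "We evaluate on three datasets.\nResults are significant (p < 0.01).",
    "content_after": "We evaluate on three datasets.\nResults are not tested for significance.",
}

for line in diff_pair(example):
    print(line)
```

Lines starting with `-` come from the original text and lines starting with `+` from the perturbed text, so the output shows exactly which sentences the perturbation touched.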

Download the full HF repo when you need manifests or score artifacts:

hf download leejamesssss/ai-reviewer-diagnostic-data \
  --repo-type dataset \
  --local-dir ai-reviewer-diagnostic-data
python scripts/summarize_release_data.py --data-dir ai-reviewer-diagnostic-data/data

`data/content_pairs/` is the canonical benchmark data. `data/annotation_scores/` is for reproducing or auditing the paper’s released scoring outputs. A separate perturbed-only view is intentionally excluded, because the `content_after` field of each content pair already contains the perturbed text.

Evaluate your own review system

Export your system's baseline and perturbed outputs as two JSONL files whose records share `id` values, and include any score or decision fields you want compared:

{"id":"paper_001","overall_score":8,"soundness_score":4,"final_decision":"Accept as Poster"}
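
One way to produce such files from Python is to serialize one JSON object per line. This is a minimal sketch, assuming your system's results are already in memory as dicts. The `write_jsonl` helper, the perturbed record's values, and the `Reject` label are hypothetical; the output paths simply mirror the ones used in the command that follows.

```python
import json
from pathlib import Path


def write_jsonl(records, path):
    """Write one JSON object per line, creating parent directories as needed."""
    path = Path(path)
    path.parent.mkdir(parents=True, exist_ok=True)
    with path.open("w", encoding="utf-8") as f:
        for record in records:
            f.write(json.dumps(record, ensure_ascii=False) + "\n")


# The same paper id appears in both files so the two runs can be paired.
baseline = [
    {"id": "paper_001", "overall_score": 8, "soundness_score": 4, "final_decision": "Accept as Poster"},
]
perturbed = [
    {"id": "paper_001", "overall_score": 5, "soundness_score": 2, "final_decision": "Reject"},
]

write_jsonl(baseline, "outputs/my_system_baseline.jsonl")
write_jsonl(perturbed, "outputs/my_system_soundness_perturbed.jsonl")
```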

Then run:

ai-reviewer-diagnostics \
  --baseline outputs/my_system_baseline.jsonl \
  --perturbed outputs/my_system_soundness_perturbed.jsonl \
  --condition paper/soundness \
  --output-md reports/my_system_soundness_report.md \
  --output-json reports/my_system_soundness_report.json

Directory mode works for released score files:

ai-reviewer-diagnostics \
  --scores-dir ai-reviewer-diagnostic-data/data/annotation_scores \
  --output-md reports/released_scores_report.md

The report summarizes score deltas, decision-change rates, and top decision transitions. See docs/INTEGRATIONS.md for schemas and custom fields.
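
To make those three summaries concrete, the sketch below computes them for a pair of in-memory record lists. It is an illustration only, not the CLI's implementation: the `summarize` helper, its default field names (taken from the example record above), and the `Reject` label are assumptions, and the real report may define or aggregate these quantities differently.

```python
from collections import Counter
from statistics import mean


def summarize(baseline, perturbed, score_field="overall_score", decision_field="final_decision"):
    """Pair records by id, then summarize score deltas and decision changes."""
    perturbed_by_id = {r["id"]: r for r in perturbed}
    pairs = [(b, perturbed_by_id[b["id"]]) for b in baseline if b["id"] in perturbed_by_id]
    if not pairs:
        raise ValueError("no shared ids between baseline and perturbed records")

    # Score delta is perturbed minus baseline, so a drop is negative.
    deltas = [p[score_field] - b[score_field] for b, p in pairs]
    # Count only the pairs whose decision actually changed.
    transitions = Counter(
        (b[decision_field], p[decision_field])
        for b, p in pairs
        if b[decision_field] != p[decision_field]
    )
    return {
        "n_pairs": len(pairs),
        "mean_score_delta": mean(deltas),
        "decision_change_rate": sum(transitions.values()) / len(pairs),
        "top_transitions": transitions.most_common(3),
    }


# Hypothetical outputs for two papers.
baseline = [
    {"id": "paper_001", "overall_score": 8, "final_decision": "Accept as Poster"},
    {"id": "paper_002", "overall_score": 6, "final_decision": "Accept as Poster"},
]
perturbed = [
    {"id": "paper_001", "overall_score": 5, "final_decision": "Reject"},
    {"id": "paper_002", "overall_score": 6, "final_decision": "Accept as Poster"},
]

print(summarize(baseline, perturbed))
```

A system that ignores the perturbation would show a mean delta near zero and a decision-change rate near zero; whether that is good or bad depends on whether the perturbation should have affected the review.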

Development commands

uv sync                    # default runtime deps
uv sync --extra analysis   # analysis deps
uv sync --extra vllm       # optional local GPU inference deps
uv run make smoke-test     # API-free checks
make clean                 # remove generated outputs

Inference wrappers:

python scripts/run_openrouter.py --input examples/example.json --output outputs/openrouter.jsonl --model <model> --api-key-env OPENROUTER_API_KEY
python scripts/run_gemini.py --input examples/example.json --output outputs/gemini.jsonl --model gemini-2.0-flash
python scripts/run_vllm.py --input examples/example.json --output outputs/vllm.jsonl --model-path <hf-or-local-model>

Repository map

ai_reviewer_diagnostics/  # pip package and diagnostic report CLI
scripts/                  # inference, preprocessing, quickstart, data-summary CLIs
analysis/                 # scripts for released score artifacts
examples/                 # tiny fixtures for demos/tests
prompts/                  # prompt templates
data/README.md            # pointer to HF dataset
docs/                     # data, integrations, reproducibility
paper/README.md           # DOI / citation pointer

Citation

@inproceedings{li2025where,
  title     = {Where Do LLMs Go Wrong? Diagnosing Automated Peer Review via Aspect-Guided Multi-Level Perturbation},
  author    = {Li, Jiatao and Li, Yanheng and Hu, Xinyu and Gao, Mingqi and Wan, Xiaojun},
  booktitle = {Proceedings of the 34th ACM International Conference on Information and Knowledge Management (CIKM '25)},
  year      = {2025},
  publisher = {ACM},
  doi       = {10.1145/3746252.3761274},
  url       = {https://doi.org/10.1145/3746252.3761274}
}

Docs: GETTING_STARTED, DATA, REPRODUCIBILITY, INTEGRATIONS, CONTRIBUTING.

Release history

0.1.5: May 12, 2026 (this version)
0.1.4: May 11, 2026
0.1.3: May 11, 2026
0.1.2: May 11, 2026
0.1.1: May 11, 2026
0.1.0: May 9, 2026

Download files

Source Distribution

ai_reviewer_diagnostics-0.1.4.tar.gz (12.1 kB)

Uploaded May 11, 2026 Source

Built Distribution

ai_reviewer_diagnostics-0.1.4-py3-none-any.whl (11.2 kB)

Uploaded May 11, 2026 Python 3

File details

Details for the file ai_reviewer_diagnostics-0.1.4.tar.gz.

File metadata

Download URL: ai_reviewer_diagnostics-0.1.4.tar.gz
Upload date: May 11, 2026
Size: 12.1 kB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.2.0 CPython/3.11.5

File hashes

Hashes for ai_reviewer_diagnostics-0.1.4.tar.gz
Algorithm	Hash digest
SHA256	`19170f0dc76444e4d4e200b97a93e6cda09e2046afcae98d17ba8366dc3bbd80`
MD5	`ad101ea9fb5b817c9e65baca64aa92f7`
BLAKE2b-256	`4af6a108fbd53bda80ddb4edf87c6f4f06ef52de83ee84739fd77616144979a0`

File details

Details for the file ai_reviewer_diagnostics-0.1.4-py3-none-any.whl.

File metadata

Download URL: ai_reviewer_diagnostics-0.1.4-py3-none-any.whl
Upload date: May 11, 2026
Size: 11.2 kB
Tags: Python 3
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.2.0 CPython/3.11.5

File hashes

Hashes for ai_reviewer_diagnostics-0.1.4-py3-none-any.whl
Algorithm	Hash digest
SHA256	`63fd520485da9370443a6667e3375cc3205dadc457fba9cf5d78b8d027dd3a33`
MD5	`9d2819a72ec0cbe0ec57b191a90573dc`
BLAKE2b-256	`3a295163cda814d066b3587a35d25f8c71d323c21941af2065f3f77ae7e49cc1`
