A comprehensive tool for validating reference accuracy in academic papers
RefChecker
Validate reference accuracy in academic papers. Useful for authors checking bibliographies and reviewers ensuring citations are authentic. RefChecker verifies citations against Semantic Scholar, OpenAlex, and CrossRef.
Built by Mark Russinovich with AI assistants (Cursor, GitHub Copilot, Claude Code). Watch the deep dive video.
Contents
- Quick Start
- Features
- Sample Output
- Install
- Run
- Output
- Configure
- Docker
- Local Database
- Testing
- License
Quick Start
Web UI (Docker)
docker run -p 8000:8000 ghcr.io/markrussinovich/refchecker:latest
Open http://localhost:8000 in your browser.
Web UI (pip)
pip install academic-refchecker[llm,webui]
refchecker-webui
CLI (pip)
pip install academic-refchecker[llm]
academic-refchecker --paper 1706.03762
academic-refchecker --paper /path/to/paper.pdf
Performance: Set SEMANTIC_SCHOLAR_API_KEY for 1-2s per reference vs 5-10s without.
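To see what that difference means for a whole bibliography, the per-reference figures can simply be multiplied out. The sketch below is plain arithmetic on the ranges quoted above, not a measurement; `estimate_seconds` is a hypothetical helper and is not part of RefChecker.

```python
def estimate_seconds(num_refs, per_ref_range):
    """Return (low, high) total seconds for checking num_refs references."""
    low, high = per_ref_range
    return num_refs * low, num_refs * high


# Per-reference ranges quoted above, in seconds.
WITH_KEY = (1, 2)
WITHOUT_KEY = (5, 10)

refs = 45  # an example bibliography size
print(estimate_seconds(refs, WITH_KEY))     # (45, 90)
print(estimate_seconds(refs, WITHOUT_KEY))  # (225, 450)
```

For a 45-reference paper this works out to roughly 1-1.5 minutes with a key versus about 4-7.5 minutes without.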
Features
- Multiple formats: ArXiv papers, PDFs, LaTeX, text files
- LLM-powered extraction: OpenAI, Anthropic, Google, Azure, vLLM
- Multi-source verification: Semantic Scholar, OpenAlex, CrossRef
- Comprehensive checks: Titles, authors, years, venues, DOIs, ArXiv IDs
- Smart matching: Handles formatting variations (BERT vs B-ERT, pre-trained vs pretrained)
- Detailed reports: Errors, warnings, corrected references
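The idea behind the smart matching bullet can be illustrated with a small normalization step. This is a minimal sketch of one common approach (lowercase, drop punctuation, collapse whitespace), not RefChecker's actual matching code; `normalize_title` and `titles_match` are hypothetical names.

```python
import re


def normalize_title(text):
    """Lowercase, strip punctuation (including hyphens), collapse whitespace."""
    text = text.lower()
    text = re.sub(r"[^\w\s]", "", text)  # "pre-trained" -> "pretrained"
    return " ".join(text.split())


def titles_match(cited, actual):
    """Treat two titles as equal if their normalized forms are identical."""
    return normalize_title(cited) == normalize_title(actual)


print(titles_match("B-ERT: pre-trained models", "BERT: Pretrained Models"))  # True
print(titles_match("BERT", "RoBERTa"))                                        # False
```

Exact comparison after normalization is the simplest option; a real matcher would likely add fuzzy scoring on top to tolerate typos and reordered words.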
Sample Output
CLI
📄 Processing: Attention Is All You Need
URL: https://arxiv.org/abs/1706.03762
[1/45] Neural machine translation in linear time
Nal Kalchbrenner et al. | 2017
⚠️ Warning: Year mismatch: cited '2017', actual '2016'
[2/45] Effective approaches to attention-based neural machine translation
Minh-Thang Luong et al. | 2015
❌ Error: First author mismatch: cited 'Minh-Thang Luong', actual 'Thang Luong'
[3/45] Deep Residual Learning for Image Recognition
Kaiming He et al. | 2016 | https://doi.org/10.1109/CVPR.2016.91
❌ Error: DOI mismatch: cited '10.1109/CVPR.2016.91', actual '10.1109/CVPR.2016.90'
============================================================
📋 SUMMARY
📚 Total references processed: 68
❌ Total errors: 55 ⚠️ Total warnings: 16 ❓ Unverified: 15
Install
PyPI (Recommended)
pip install academic-refchecker[llm,webui] # Web UI + CLI + LLM providers
pip install academic-refchecker # CLI only
From Source (Development)
git clone https://github.com/markrussinovich/refchecker.git && cd refchecker
python -m venv .venv && source .venv/bin/activate
pip install -e ".[llm,webui]"
Requirements: Python 3.7+ (3.10+ recommended). Node.js 18+ is only needed for Web UI development.
Run
Web UI
The Web UI shows live progress, history, and export (including corrected values).
refchecker-webui --port 8000
Development (frontend)
cd web-ui
npm install
npm start
Open http://localhost:5173.
Alternative (separate servers):
# Terminal 1
python -m uvicorn backend.main:app --reload --port 8000
# Terminal 2
cd web-ui
npm run dev
Verify the backend is running:
curl http://localhost:8000/
Web UI documentation: see web-ui/README.md.
CLI
# ArXiv (ID or URL)
academic-refchecker --paper 1706.03762
academic-refchecker --paper https://arxiv.org/abs/1706.03762
# Local files
academic-refchecker --paper paper.pdf
academic-refchecker --paper paper.tex
academic-refchecker --paper paper.txt
academic-refchecker --paper refs.bib
# Faster/offline verification (local DB)
academic-refchecker --paper paper.pdf --db-path semantic_scholar_db/semantic_scholar.db
# Save results
academic-refchecker --paper 1706.03762 --output-file errors.txt
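When checking many papers from a script, it can help to assemble the command as an argument list rather than a string. The sketch below uses only the `--paper`, `--db-path`, and `--output-file` flags shown above; `build_command` is a hypothetical helper, and the code only builds and prints the command, it does not run it.

```python
import shlex


def build_command(paper, db_path=None, output_file=None):
    """Build an academic-refchecker argument list from the documented flags."""
    cmd = ["academic-refchecker", "--paper", paper]
    if db_path:
        cmd += ["--db-path", db_path]
    if output_file:
        cmd += ["--output-file", output_file]
    return cmd


cmd = build_command("1706.03762", output_file="errors.txt")
print(shlex.join(cmd))
# academic-refchecker --paper 1706.03762 --output-file errors.txt
```

The resulting list can be passed to `subprocess.run(cmd, check=True)`; using a list avoids shell quoting problems with file paths that contain spaces.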
Output
RefChecker reports these result types:
| Type | Description | Examples |
|---|---|---|
| ❌ Error | Critical issues needing correction | Author/title/DOI mismatches, incorrect ArXiv IDs |
| ⚠️ Warning | Minor issues to review | Year differences, venue variations |
| ℹ️ Suggestion | Recommended improvements | Add missing ArXiv/DOI URLs, small metadata fixes |
| ❓ Unverified | Could not verify against any source | Rare publications, preprints |
Verified references include discovered URLs (Semantic Scholar, ArXiv, DOI). Suggestions are non-blocking improvements.
Detailed examples
❌ Error: First author mismatch: cited 'T. Xie', actual 'Zhao Xu'
❌ Error: DOI mismatch: cited '10.5555/3295222.3295349', actual '10.48550/arXiv.1706.03762'
⚠️ Warning: Year mismatch: cited '2024', actual '2023'
ℹ️ Suggestion: Add ArXiv URL https://arxiv.org/abs/1706.03762
❓ Could not verify: Llama guard (M. A. Research, 2024)
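Mismatch lines like the ones above follow a regular enough shape to be parsed for further processing. The following is a sketch that assumes the line format matches these examples exactly; the format is not a documented interface and may differ between versions. `parse_mismatch` is a hypothetical helper.

```python
import re

# Matches e.g. "Error: DOI mismatch: cited 'X', actual 'Y'" (assumed format).
MISMATCH = re.compile(
    r"(?P<level>Error|Warning): (?P<field>.+?) mismatch: "
    r"cited '(?P<cited>.*)', actual '(?P<actual>.*)'$"
)


def parse_mismatch(line):
    """Return a dict with level, field, cited, actual, or None if no match."""
    match = MISMATCH.search(line.strip())
    return match.groupdict() if match else None


print(parse_mismatch("⚠️ Warning: Year mismatch: cited '2024', actual '2023'"))
# {'level': 'Warning', 'field': 'Year', 'cited': '2024', 'actual': '2023'}
```

Suggestion and unverified lines do not contain a cited/actual pair, so the function returns `None` for them.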
Configure
LLM
LLM-powered extraction improves accuracy with complex bibliographies. Claude Sonnet 4 performs best; GPT-4o may hallucinate DOIs.
| Provider | Env Variable | Example Model |
|---|---|---|
| Anthropic | ANTHROPIC_API_KEY | claude-sonnet-4-20250514 |
| OpenAI | OPENAI_API_KEY | gpt-4o |
| Google | GOOGLE_API_KEY | gemini-2.5-flash |
| Azure | AZURE_OPENAI_API_KEY | gpt-4 |
| vLLM | (local) | meta-llama/Llama-3.1-8B-Instruct |
export ANTHROPIC_API_KEY=your_key
academic-refchecker --paper 1706.03762 --llm-provider anthropic
academic-refchecker --paper paper.pdf --llm-provider openai --llm-model gpt-4o
academic-refchecker --paper paper.pdf --llm-provider vllm --llm-model meta-llama/Llama-3.1-8B-Instruct
Local models (vLLM)
There is no separate “GPU Docker image”. For local inference, install the vLLM extra and run an OpenAI-compatible vLLM server:
pip install "academic-refchecker[vllm]"
python scripts/start_vllm_server.py --model meta-llama/Llama-3.1-8B-Instruct --port 8001
academic-refchecker --paper paper.pdf --llm-provider vllm --llm-endpoint http://localhost:8001/v1
Command Line
--paper PAPER # ArXiv ID, URL, or file path
--llm-provider PROVIDER # openai, anthropic, google, azure, vllm
--llm-model MODEL # Override default model
--db-path PATH # Local database for offline verification
--output-file [PATH] # Save results (default: reference_errors.txt)
--debug # Verbose output
Environment Variables
# LLM
export REFCHECKER_LLM_PROVIDER=anthropic
export ANTHROPIC_API_KEY=your_key # Also: OPENAI_API_KEY, GOOGLE_API_KEY
# Performance
export SEMANTIC_SCHOLAR_API_KEY=your_key # Higher rate limits / faster verification
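Before starting a long run, a script can check that the key for the chosen provider is actually set. This sketch mirrors the provider-to-variable mapping from the LLM table in this README; `PROVIDER_KEYS` and `missing_key` are hypothetical names, and RefChecker may perform its own validation differently.

```python
import os

# Provider -> required environment variable (None means no key is needed).
PROVIDER_KEYS = {
    "anthropic": "ANTHROPIC_API_KEY",
    "openai": "OPENAI_API_KEY",
    "google": "GOOGLE_API_KEY",
    "azure": "AZURE_OPENAI_API_KEY",
    "vllm": None,  # local inference
}


def missing_key(provider, env=os.environ):
    """Return the name of the unset variable for provider, or None if OK."""
    var = PROVIDER_KEYS[provider]
    if var is None or env.get(var):
        return None
    return var


provider = os.environ.get("REFCHECKER_LLM_PROVIDER", "anthropic")
needed = missing_key(provider)
if needed:
    print(f"Set {needed} before using --llm-provider {provider}")
```

Passing `env` as a parameter keeps the check easy to test with a plain dictionary instead of the real environment.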
Docker
Pre-built images are published to GitHub Container Registry.
docker run -p 8000:8000 \
-e ANTHROPIC_API_KEY=your_key \
-v refchecker-data:/app/data \
ghcr.io/markrussinovich/refchecker:latest
Docker Compose:
git clone https://github.com/markrussinovich/refchecker.git && cd refchecker
cp .env.example .env # Add your API keys
docker compose up -d
| Tag | Description | Arch | Size |
|---|---|---|---|
| latest | RefChecker (Web UI + API-based LLM support) | amd64, arm64 | ~800MB |
Local Database
For offline verification or faster processing:
python scripts/download_db.py \
--field "computer science" \
--start-year 2020 --end-year 2024
academic-refchecker --paper paper.pdf --db-path semantic_scholar_db/semantic_scholar.db
Testing
490+ tests covering unit, integration, and end-to-end scenarios.
pytest tests/ # All tests
pytest tests/unit/ # Unit only
pytest --cov=src tests/ # With coverage
See tests/README.md for details.
License
MIT License - see LICENSE.