RWCheck: A command-line interface for fast Retraction Watch checks
RWCheck — Fast retraction screening for DOIs, PubMed IDs, and BibTeX references
Check DOIs, PubMed IDs, and .bib files against the Retraction Watch dataset.
rwcheck ingests the Retraction Watch data into a local SQLite database for O(log n) lookups, exposes a FastAPI REST API, and provides a CLI for interactive and batch queries — no external database required.
Live API: https://rwcheck.khanlab.bio
Features
- REST API — OpenAPI docs, rate limiting, 5-min cache, daily auto-update.
- CLI — single DOI/PMID lookup, batch from file, BibTeX screening.
- SQLite-backed — fast indexed lookup; no Postgres or Redis required.
- Python API — import and call directly; no server needed.
- Auto-updates — API rebuilds the DB every 24 h; the CLI update command pulls and verifies the latest CSV.
- Reproducible — every response includes dataset version (SHA-256), row count, and build timestamp.
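To illustrate why an indexed SQLite table gives fast lookups, here is a minimal sketch using only the standard library. The table name, column names, and index name are hypothetical and chosen for illustration; they are not taken from rwcheck's actual schema.

```python
import sqlite3

# Hypothetical schema: table, column, and index names are illustrative only.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE retractions ("
    " record_id INTEGER PRIMARY KEY,"
    " original_paper_doi TEXT,"
    " retraction_nature TEXT)"
)
# A B-tree index lets SQLite find a DOI without scanning every row.
conn.execute("CREATE INDEX idx_doi ON retractions (original_paper_doi)")
conn.executemany(
    "INSERT INTO retractions VALUES (?, ?, ?)",
    [
        (42, "10.1038/nature12345", "Retraction"),
        (43, "10.9999/other", "Correction"),
    ],
)

QUERY = (
    "SELECT record_id, retraction_nature FROM retractions "
    "WHERE original_paper_doi = ?"
)


def lookup_doi(connection, doi):
    """Return all (record_id, retraction_nature) rows matching a DOI."""
    return connection.execute(QUERY, (doi,)).fetchall()


rows = lookup_doi(conn, "10.1038/nature12345")
# The query plan's last column describes how SQLite resolves the query.
plan = conn.execute("EXPLAIN QUERY PLAN " + QUERY, ("x",)).fetchall()
plan_detail = plan[0][-1]
print(rows)
print(plan_detail)
```

The query plan should mention the index, which confirms the lookup is a search rather than a full table scan.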
Quickstart
1. Install
git clone https://github.com/khan-lab/rwcheck.git
cd rwcheck
pip install -e ".[dev]" # Python 3.10+
2. Build the local database
make build-db-online # download latest CSV from GitLab and build (~20 s, ~69 k rows)
Or from a CSV you already have:
make build-db # uses retraction_watch.csv in the current directory
3. Check a DOI
rwcheck doi 10.1038/nature12345
rwcheck doi "https://doi.org/10.1038/nature12345" # URL prefix is stripped
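The prefix stripping mentioned above could look roughly like the sketch below. `normalize_doi` is a hypothetical helper, not rwcheck's own function, and the set of prefixes it handles (plus the lowercasing step) is an assumption.

```python
import re

# Assumed prefixes: doi.org / dx.doi.org URLs and a leading "doi:" label.
_DOI_PREFIX = re.compile(
    r"^(?:https?://(?:dx\.)?doi\.org/|doi:\s*)", re.IGNORECASE
)


def normalize_doi(raw: str) -> str:
    """Strip whitespace and a URL or 'doi:' prefix, then lowercase.

    Lowercasing is an assumption; DOIs are case-insensitive, so it makes
    string comparison against a stored value predictable.
    """
    return _DOI_PREFIX.sub("", raw.strip()).lower()


print(normalize_doi("https://doi.org/10.1038/nature12345"))
```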
4. Check a PubMed ID
rwcheck pmid 12345678
5. Batch check from a file
# Plain text (one DOI per line)
rwcheck batch-doi papers.txt
rwcheck batch-doi papers.txt --out tsv > results.tsv
rwcheck batch-doi papers.txt --out json | jq '.results[] | select(.matched)'
# CSV file (specify column with --col)
rwcheck batch-doi references.csv --col doi
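For readers preparing input files, the sketch below shows one way a named column can be pulled out of a CSV with the standard library. `read_doi_column` is a hypothetical helper written for illustration; how rwcheck itself reads the `--col` column is not documented here.

```python
import csv
import io


def read_doi_column(handle, column="doi"):
    """Return the non-empty values of one named CSV column."""
    reader = csv.DictReader(handle)
    if reader.fieldnames is None or column not in reader.fieldnames:
        raise KeyError(f"column {column!r} not found in CSV header")
    values = []
    for row in reader:
        value = (row.get(column) or "").strip()
        if value:  # skip rows where the column is blank
            values.append(value)
    return values


sample = io.StringIO(
    "title,doi\n"
    "Paper A,10.1038/nature12345\n"
    "Paper B,\n"
    "Paper C,10.9999/test\n"
)
dois = read_doi_column(sample, column="doi")
print(dois)
```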
6. Check a BibTeX file
rwcheck batch-bib refs.bib
Extracts DOIs and PubMed IDs from every entry and queries them against the local database. Three report files are written next to the input:
| File | Contents |
|---|---|
| refs_rwcheck.md | Human-readable Markdown: summary table + retracted entries |
| refs_rwcheck.json | Machine-readable JSON: full match data |
| refs_rwcheck.html | Self-contained HTML report: styled, collapsible retracted entries |
# Write reports to a specific directory
rwcheck batch-bib refs.bib --report-dir ./reports/
# Use the live API instead of the local DB
rwcheck batch-bib refs.bib --api https://rwcheck.khanlab.bio
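The identifier extraction step can be approximated with a regular expression, as in the sketch below. This is a simplified illustration, not rwcheck's parser: `extract_identifiers` is a hypothetical name, and it assumes each `doi` or `pmid` field sits on its own line with a brace- or quote-delimited value.

```python
import re

# Matches lines such as:  doi = {10.1038/nature12345},  or  pmid = "12345678"
_FIELD = re.compile(
    r'^\s*(doi|pmid)\s*=\s*[{"]\s*([^}"]+?)\s*[}"]',
    re.IGNORECASE | re.MULTILINE,
)


def extract_identifiers(bibtex: str):
    """Return (dois, pmids) found in a BibTeX string."""
    dois, pmids = [], []
    for name, value in _FIELD.findall(bibtex):
        if name.lower() == "doi":
            dois.append(value)
        elif value.isdigit():  # ignore malformed PMIDs
            pmids.append(int(value))
    return dois, pmids


sample_bib = """
@article{smith2020,
  title = {Example},
  doi = {10.1038/nature12345},
  pmid = {12345678}
}
@article{lee2021,
  doi = "10.9999/test",
}
"""
dois, pmids = extract_identifiers(sample_bib)
print(dois, pmids)
```

A full BibTeX parser handles nested braces, string macros, and multi-line values, which this regex does not.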
7. Update the database
rwcheck update # download latest CSV; skip if unchanged
rwcheck update --force # force rebuild regardless
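The "skip if unchanged" behavior can be pictured as a content-hash comparison. The sketch below assumes the dataset version is a truncated SHA-256 of the CSV bytes; the 16-character length and both function names are assumptions for illustration, not confirmed details of rwcheck.

```python
import hashlib


def dataset_version(csv_bytes: bytes, length: int = 16) -> str:
    """Hex SHA-256 of the CSV, truncated (the length is an assumption)."""
    return hashlib.sha256(csv_bytes).hexdigest()[:length]


def needs_rebuild(csv_bytes: bytes, current_version: str, force: bool = False) -> bool:
    """Rebuild when forced or when the downloaded CSV hashes differently."""
    return force or dataset_version(csv_bytes) != current_version


old_csv = b"Record ID,OriginalPaperDOI\n42,10.1038/nature12345\n"
new_csv = old_csv + b"43,10.9999/test\n"
stored = dataset_version(old_csv)

print(needs_rebuild(old_csv, stored))              # same bytes, no rebuild
print(needs_rebuild(new_csv, stored))              # changed bytes, rebuild
print(needs_rebuild(old_csv, stored, force=True))  # forced rebuild
```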
REST API
A public instance is running at https://rwcheck.khanlab.bio. Interactive docs (Swagger UI) are available at https://rwcheck.khanlab.bio/docs.
Run locally
make api
# → http://127.0.0.1:8000
# Docs: http://127.0.0.1:8000/docs
The server downloads the latest Retraction Watch CSV on startup and every 24 hours thereafter.
Endpoints
| Method | Path | Description |
|---|---|---|
| GET | / | Landing page with live DOI checker and dataset stats |
| GET | /meta | Dataset metadata (version, row count, build time) |
| GET | /stats | Aggregate statistics (totals, by year, top journals, by country) |
| GET | /check/doi/{doi} | Look up a DOI (slashes in DOIs are supported) |
| GET | /check/pmid/{pmid} | Look up a PubMed ID |
| POST | /check/batch | Batch lookup (up to 500 items) |
| POST | /check/bib | Upload a .bib file; returns retracted/clean summary |
| GET | /health | Health check |
| GET | /docs | Swagger UI |
Examples
# Dataset metadata
curl https://rwcheck.khanlab.bio/meta
# DOI lookup
curl "https://rwcheck.khanlab.bio/check/doi/10.1038/nature12345"
# PubMed ID lookup
curl "https://rwcheck.khanlab.bio/check/pmid/12345678"
# Batch lookup
curl -X POST https://rwcheck.khanlab.bio/check/batch \
-H "Content-Type: application/json" \
-d '{"dois": ["10.1038/nature12345", "10.9999/test"], "pmids": [12345678]}'
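Because the batch endpoint accepts up to 500 items, longer lists need to be split before posting. The sketch below builds JSON bodies in the same shape as the curl example. `batch_payloads` is a hypothetical helper, and it assumes the 500-item limit counts DOIs and PMIDs together.

```python
import json

MAX_BATCH = 500  # limit stated for /check/batch


def batch_payloads(dois, pmids, limit=MAX_BATCH):
    """Yield JSON request bodies, each holding at most `limit` items.

    Assumption: the limit applies to DOIs and PMIDs combined.
    """
    items = [("dois", d) for d in dois] + [("pmids", p) for p in pmids]
    for start in range(0, len(items), limit):
        payload = {"dois": [], "pmids": []}
        for key, value in items[start:start + limit]:
            payload[key].append(value)
        yield json.dumps(payload)


many_dois = [f"10.9999/test{i}" for i in range(600)]
bodies = [json.loads(b) for b in batch_payloads(many_dois, [12345678, 23456789])]
print(len(bodies), len(bodies[0]["dois"]), len(bodies[1]["dois"]))
```

Each yielded string can be sent as the request body with any HTTP client, using the `Content-Type: application/json` header shown above.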
Response format
{
  "query": "10.1038/nature12345",
  "matched": true,
  "matches": [
    {
      "record_id": 42,
      "title": "Example retracted paper",
      "journal": "Nature",
      "retraction_nature": "Retraction",
      "reason": "Falsification/Fabrication of Data;",
      "retraction_date": "2022-03-15",
      "original_paper_doi": "10.1038/nature12345",
      "retraction_doi": "10.1038/nature12345retract",
      "original_paper_pmid": 12345678
    }
  ],
  "meta": {
    "dataset_version": "a1b2c3d4e5f6a7b8",
    "built_at": "2024-11-01T12:00:00+00:00",
    "row_count": "68999",
    "source_url": "https://gitlab.com/crossref/retraction-watch-data/-/raw/main/retraction_watch.csv"
  }
}
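A response in this shape can be turned into a one-line summary with a few lines of Python. The sketch below uses a trimmed copy of the sample response. `summarize` is a hypothetical helper, and splitting `reason` on semicolons is an assumption based on the trailing `;` in the sample.

```python
import json

# Trimmed copy of the documented response shape.
SAMPLE = """
{
  "query": "10.1038/nature12345",
  "matched": true,
  "matches": [
    {
      "retraction_nature": "Retraction",
      "reason": "Falsification/Fabrication of Data;",
      "retraction_date": "2022-03-15"
    }
  ]
}
"""


def summarize(response: dict) -> str:
    """Build a one-line summary from a lookup response."""
    if not response["matched"]:
        return f"{response['query']}: no match"
    parts = []
    for match in response["matches"]:
        # Assumption: multiple reasons are separated by semicolons.
        reasons = [r.strip() for r in match["reason"].split(";") if r.strip()]
        parts.append(
            f"{match['retraction_nature']} on {match['retraction_date']} "
            f"({', '.join(reasons)})"
        )
    return f"{response['query']}: " + "; ".join(parts)


summary = summarize(json.loads(SAMPLE))
print(summary)
```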
Python API
Use rwcheck directly from Python without starting the HTTP server.
from rwcheck import check_doi, check_pmid, check_batch
# Single DOI lookup
result = check_doi("10.1038/nature12345", db_path="data/rw.sqlite")
if result["matched"]:
    m = result["matches"][0]
    print(m["retraction_nature"], m["retraction_date"])
# Single PMID lookup
result = check_pmid(12345678, db_path="data/rw.sqlite")
# Batch lookup
import json
raw = check_batch(
    dois=["10.1038/nature12345", "10.9999/test"],
    pmids=[12345678],
    db_path="data/rw.sqlite",
)
retracted = [r for r in json.loads(raw)["results"] if r["matched"]]
Set RW_DB_PATH to omit db_path in every call:
import os, rwcheck
os.environ["RW_DB_PATH"] = "data/rw.sqlite"
result = rwcheck.check_doi("10.1038/nature12345")
Return shapes
| Function | Returns | Keys |
|---|---|---|
| check_doi(doi) | dict | query, matched, matches, meta |
| check_pmid(pmid) | dict | query, matched, matches, meta |
| check_batch(dois, pmids) | str (JSON) | results (list), meta |
Each item in matches contains: record_id, title, journal, retraction_nature, retraction_date, reason, original_paper_doi, retraction_doi, original_paper_pmid, country, paywalled, and more.
Docker
Docker images are published to the GitHub Container Registry:
| Image | Description |
|---|---|
| ghcr.io/khan-lab/rwcheck | CLI tool |
| ghcr.io/khan-lab/rwcheck-api | REST API |
CLI image
# Pull and run
docker run --rm -v "$(pwd)/data:/app/data" ghcr.io/khan-lab/rwcheck doi 10.1038/nature12345
# Build locally
make docker-build
docker run --rm -v "$(pwd)/data:/app/data" rwcheck:latest doi 10.1038/nature12345
API image
# Pull and run
docker run --rm -p 8000:8000 -v "$(pwd)/data:/app/data" ghcr.io/khan-lab/rwcheck-api
# Build locally
make docker-build-api
make docker-run # equivalent: docker run -p 8000:8000 -v ./data:/app/data rwcheck-api:latest
Production deployment (Docker Compose + Caddy)
See DEPLOY.md for full EC2 setup instructions with Caddy reverse proxy, automatic HTTPS, and persistent volumes.
CLI Reference
Usage: rwcheck [OPTIONS] COMMAND [ARGS]...

  Check DOIs/PMIDs against the Retraction Watch dataset.

Commands:
  doi         Check a single DOI.
  pmid        Check a single PubMed ID.
  batch-doi   Batch-check DOIs from a text or CSV file.
  batch-pmid  Batch-check PMIDs from a text or CSV file.
  batch-bib   Check all references in a BibTeX file; write JSON/Markdown/HTML reports.
  update      Download the latest dataset and rebuild the local DB.

Options:
  --version  Show version and exit.
  --help     Show this message and exit.
Common options
| Option | Description |
|---|---|
| --db PATH | Path to local SQLite DB (default: data/rw.sqlite) |
| --api URL | Use remote API instead of local DB |
| --json | Output raw JSON (single-item commands) |
| --out json\|tsv\|table | Output format for batch commands |
| --col NAME | CSV column name for batch commands |
| --report-dir DIR | Directory for batch-bib report files |
| --force | Force DB rebuild even if unchanged |
Environment variables
| Variable | Default | Description |
|---|---|---|
| RW_DB_PATH | data/rw.sqlite | SQLite database path |
| RW_CSV_URL | GitLab raw URL | Retraction Watch CSV source |
| RATE_LIMIT | 60/minute | API rate limit per IP (slowapi) |
| UPDATE_INTERVAL_HOURS | 24 | Hours between automatic DB updates |
| PUBLIC_HOST | http://localhost:8000 | Base URL shown in API responses and landing page |
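The sketch below shows how settings like these are typically read with their documented defaults. The `Settings` class and `load_settings` function are hypothetical names for illustration and do not describe rwcheck's internals. `RW_CSV_URL` is left out because its default is only described, not spelled out, in the table.

```python
import os
from dataclasses import dataclass


@dataclass(frozen=True)
class Settings:
    db_path: str
    rate_limit: str
    update_interval_hours: float
    public_host: str


def load_settings(env=None) -> Settings:
    """Read settings from a mapping (os.environ by default) with fallbacks."""
    env = os.environ if env is None else env
    return Settings(
        db_path=env.get("RW_DB_PATH", "data/rw.sqlite"),
        rate_limit=env.get("RATE_LIMIT", "60/minute"),
        update_interval_hours=float(env.get("UPDATE_INTERVAL_HOURS", "24")),
        public_host=env.get("PUBLIC_HOST", "http://localhost:8000"),
    )


defaults = load_settings({})
custom = load_settings({"RW_DB_PATH": "/srv/rw.sqlite", "UPDATE_INTERVAL_HOURS": "6"})
print(defaults)
print(custom)
```

Passing a plain dictionary instead of `os.environ` keeps the function easy to test without touching the real environment.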
Development
make install # pip install -e ".[dev]"
make test # pytest
make lint # ruff + mypy
make fmt # ruff format + fix
make test-cov # pytest with coverage report
Data source
The Retraction Watch dataset is maintained by the Center for Scientific Integrity and distributed via CrossRef on GitLab. Please review their terms of use before deploying publicly.