Check DOIs/PMIDs against the Retraction Watch dataset (SQLite-backed, REST API + CLI)
RWCheck
RWCheck is a CLI and REST API for Fast Retraction Screening of DOIs, PubMed IDs, and BibTeX References
Check DOIs, PubMed IDs, and .bib files against the Retraction Watch dataset.
rwcheck ingests the Retraction Watch Data into a local SQLite database for O(log n) indexed lookups, exposes a FastAPI REST API, and provides a CLI for interactive and batch queries — all with no external database services required.
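To illustrate why an indexed SQLite lookup stays fast, here is a minimal sketch using the standard-library `sqlite3` module. The table name `retractions`, the index name `idx_doi`, and the reduced column set are assumptions made for this example; the actual schema built by rwcheck may differ.

```python
import sqlite3

# Hypothetical schema: the real rwcheck table and column names may differ.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE retractions ("
    "record_id INTEGER PRIMARY KEY, original_paper_doi TEXT, title TEXT)"
)
# A B-tree index on the DOI column lets SQLite search instead of scanning.
conn.execute("CREATE INDEX idx_doi ON retractions (original_paper_doi)")
conn.executemany(
    "INSERT INTO retractions VALUES (?, ?, ?)",
    [(1, "10.1000/a", "Paper A"), (2, "10.1000/b", "Paper B")],
)


def lookup_doi(doi: str) -> list[tuple]:
    """Return (record_id, title) rows for a DOI, compared in lowercase."""
    cur = conn.execute(
        "SELECT record_id, title FROM retractions WHERE original_paper_doi = ?",
        (doi.lower(),),
    )
    return cur.fetchall()


# EXPLAIN QUERY PLAN reports whether the index is used for this query.
plan = conn.execute(
    "EXPLAIN QUERY PLAN SELECT * FROM retractions WHERE original_paper_doi = ?",
    ("10.1000/a",),
).fetchall()
print(lookup_doi("10.1000/A"))
print(plan[-1][-1])
```

The query plan line should mention `idx_doi`, which indicates a B-tree search rather than a full table scan.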
Features
- SQLite-backed: fast indexed lookup; no Postgres/Redis required.
- Python API (`rwcheck`): import and call directly from Python; no server needed.
- REST API (`rw_api/`): OpenAPI docs, rate limiting, 5-min cache, daily auto-update.
- CLI (`rwcheck`): single DOI/PMID, batch from file, offline or API mode.
- Automatic updates: API rebuilds the DB every 24 h; CLI `update` command pulls and hashes the latest CSV.
- Reproducible: every response includes dataset version (SHA-256), row count, and build timestamp.
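The 5-minute response cache listed above could be approximated with a small time-to-live cache. This is only a sketch of the general technique: the `TTLCache` class and its injectable `clock` parameter are hypothetical and are not taken from the rwcheck source, which may rely on a library instead.

```python
import time
from typing import Any, Callable, Optional


class TTLCache:
    """Tiny in-memory cache whose entries expire after ttl_seconds."""

    def __init__(
        self,
        ttl_seconds: float = 300.0,  # 5 minutes, matching the feature list
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self.ttl = ttl_seconds
        self.clock = clock
        self._store: dict[str, tuple[float, Any]] = {}

    def get(self, key: str) -> Optional[Any]:
        """Return the cached value, or None if missing or expired."""
        entry = self._store.get(key)
        if entry is None:
            return None
        expires_at, value = entry
        if self.clock() >= expires_at:
            # Drop stale entries lazily on access.
            del self._store[key]
            return None
        return value

    def set(self, key: str, value: Any) -> None:
        """Store a value together with its expiry time."""
        self._store[key] = (self.clock() + self.ttl, value)


cache = TTLCache()
cache.set("doi:10.1038/nature12345", {"matched": True})
print(cache.get("doi:10.1038/nature12345"))
```

Passing the clock in as a parameter keeps the expiry logic testable without sleeping.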
Quickstart
1. Install
# Clone and install in editable mode (Python 3.10+)
git clone https://github.com/khan-lab/rwcheck.git
cd rwcheck
pip install -e ".[dev]"
2. Build the local database
From the local CSV (if you already downloaded retraction_watch.csv):
make build-db
# or explicitly:
python scripts/build_db.py --csv retraction_watch.csv --db data/rw.sqlite
Download the latest CSV from GitLab and build:
make build-db-online
# or:
python scripts/build_db.py --url
The build takes ~20 s on a modern laptop for ~69 k rows.
3. Check a DOI
rwcheck doi 10.1038/nature12345
rwcheck doi "https://doi.org/10.1038/nature12345" # URL prefix is stripped
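Stripping the URL prefix can be sketched as below. The function name `normalize_doi`, the set of accepted prefixes, and the lowercasing step are assumptions for illustration (DOIs are case-insensitive, so lowercasing is a common convention); rwcheck's own normalization may handle more or fewer forms.

```python
import re

# Prefixes commonly seen in front of a bare DOI (assumed set, not exhaustive).
_DOI_PREFIX = re.compile(r"^(?:https?://(?:dx\.)?doi\.org/|doi:\s*)", re.IGNORECASE)


def normalize_doi(raw: str) -> str:
    """Strip whitespace and a leading resolver URL or 'doi:' label, then lowercase."""
    cleaned = raw.strip()
    cleaned = _DOI_PREFIX.sub("", cleaned)
    return cleaned.lower()


print(normalize_doi("https://doi.org/10.1038/nature12345"))
```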
4. Check a PubMed ID
rwcheck pmid 12345678
5. Batch check from a file
Plain text (one DOI per line):
rwcheck batch-doi papers.txt
rwcheck batch-doi papers.txt --out tsv > results.tsv
rwcheck batch-doi papers.txt --out json | jq '.results[] | select(.matched)'
CSV file (specify column with --col):
rwcheck batch-doi references.csv --col doi
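Reading identifiers from a named CSV column, as `--col` does, might look roughly like this with the standard `csv` module. The helper name `read_column` is hypothetical, and the choice to skip empty cells is an assumption for the sketch.

```python
import csv
import io


def read_column(text: str, column: str) -> list[str]:
    """Return the non-empty, stripped values of one named CSV column."""
    reader = csv.DictReader(io.StringIO(text))
    if reader.fieldnames is None or column not in reader.fieldnames:
        raise KeyError(f"column {column!r} not found")
    # Short rows yield None for missing cells, so guard before stripping.
    return [
        row[column].strip()
        for row in reader
        if row[column] and row[column].strip()
    ]


sample = "title,doi\nA,10.1/a\nB,\nC, 10.1/c \n"
print(read_column(sample, "doi"))
```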
6. Check a BibTeX file
rwcheck batch-bib refs.bib
This parses every entry in the .bib file, extracts DOIs (from the doi field, or a url field containing doi.org), and PubMed IDs (from pmid, or eprint+eprinttype=pubmed), then queries them all against the local database.
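The field precedence described above can be sketched for a single, already-parsed entry. Parsing the .bib file itself is out of scope here; the `extract_ids` helper, the lowercase field keys, and the regular expression are assumptions for illustration rather than rwcheck's actual code.

```python
import re
from typing import Optional

# Matches a DOI embedded in a doi.org URL (assumed pattern).
_DOI_IN_URL = re.compile(r"doi\.org/(10\.\S+)", re.IGNORECASE)


def extract_ids(fields: dict[str, str]) -> tuple[Optional[str], Optional[int]]:
    """Return (doi, pmid) for one BibTeX entry given as a dict of its fields."""
    # Prefer the explicit doi field; fall back to a doi.org URL.
    doi = fields.get("doi", "").strip() or None
    if doi is None:
        match = _DOI_IN_URL.search(fields.get("url", ""))
        if match:
            doi = match.group(1)

    # Prefer the pmid field; fall back to eprint when eprinttype is pubmed.
    pmid_text = fields.get("pmid", "").strip()
    if not pmid_text and fields.get("eprinttype", "").strip().lower() == "pubmed":
        pmid_text = fields.get("eprint", "").strip()
    pmid = int(pmid_text) if pmid_text.isdigit() else None
    return doi, pmid


entry = {"url": "https://doi.org/10.1/b", "eprint": "123", "eprinttype": "pubmed"}
print(extract_ids(entry))
```

Entries for which both values are `None` correspond to the "Unchecked (no DOI/PMID)" count in the summary.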
Report files are written next to the input file:
| File | Contents |
|---|---|
| `refs_rwcheck.md` | Human-readable Markdown: summary table, retracted entries with details, clean list |
| `refs_rwcheck.json` | Machine-readable JSON: full match data, suitable for further processing |
| `refs_rwcheck.html` | Self-contained HTML report: styled, browser-viewable, collapsible retracted entries |
# Write reports to a specific directory
rwcheck batch-bib refs.bib --report-dir ./reports/
# Use the remote API instead of the local DB
rwcheck batch-bib refs.bib --api http://localhost:8000
Example output (stdout):
Total references 42
Retracted 3
Clean (not found) 37
Unchecked (no DOI/PMID) 2
⚠ Retracted entries:
✗ [smith2020] Smith et al. 2020 — Retraction | Nature
Reports written:
Markdown → refs_rwcheck.md
JSON → refs_rwcheck.json
7. Update the database
rwcheck update # downloads latest CSV; skips if unchanged
rwcheck update --force # force rebuild regardless
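The "skips if unchanged" behaviour can be pictured as a hash comparison between the freshly downloaded CSV and the digest recorded at the last build. The function names below are hypothetical, and how rwcheck actually stores the previous digest is not shown in this document.

```python
import hashlib
from typing import Optional


def csv_digest(data: bytes) -> str:
    """Return the SHA-256 hex digest of the raw CSV bytes."""
    return hashlib.sha256(data).hexdigest()


def needs_rebuild(new_csv: bytes, stored_digest: Optional[str], force: bool = False) -> bool:
    """Rebuild when forced, when no digest is recorded, or when content changed."""
    if force or stored_digest is None:
        return True
    return csv_digest(new_csv) != stored_digest


old = csv_digest(b"Record ID,Title\n1,Example\n")
print(needs_rebuild(b"Record ID,Title\n1,Example\n", old))         # unchanged content
print(needs_rebuild(b"Record ID,Title\n1,Example\n", old, True))   # forced rebuild
```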
REST API
Start the server
make api
# → http://127.0.0.1:8000
# Docs: http://127.0.0.1:8000/docs
The server automatically downloads the latest Retraction Watch CSV on startup and every 24 hours thereafter.
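One way to reason about the 24-hour schedule is to compare the dataset's `built_at` timestamp with the current time. This is a sketch only: the `update_due` helper is hypothetical, and the server may use a background timer rather than a timestamp check.

```python
from datetime import datetime, timedelta, timezone


def update_due(built_at: str, now: datetime, interval_hours: float = 24.0) -> bool:
    """Return True when at least interval_hours have passed since built_at.

    built_at is an ISO 8601 string with a UTC offset, as in the /meta response;
    now must be timezone-aware so the subtraction is well defined.
    """
    built = datetime.fromisoformat(built_at)
    return now - built >= timedelta(hours=interval_hours)


print(update_due("2024-11-01T12:00:00+00:00", datetime.now(timezone.utc)))
```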
Endpoints
| Method | Path | Description |
|---|---|---|
| GET | `/meta` | Dataset metadata (version, row count, build time) |
| GET | `/stats` | Aggregate statistics (totals, by year, top journals, by country) |
| GET | `/check/doi/{doi}` | Look up a DOI (slashes in DOIs are supported) |
| GET | `/check/pmid/{pmid}` | Look up a PubMed ID |
| POST | `/check/batch` | Batch lookup (up to 500 items) |
| POST | `/check/bib` | Upload a .bib file; returns retracted/clean summary |
| GET | `/health` | Health check |
| GET | `/docs` | Swagger UI |
Examples
# Dataset info
curl http://localhost:8000/meta
# DOI lookup
curl "http://localhost:8000/check/doi/10.1038/nature12345"
# PubMed ID lookup
curl "http://localhost:8000/check/pmid/12345678"
# Batch
curl -X POST http://localhost:8000/check/batch \
-H "Content-Type: application/json" \
-d '{"dois": ["10.1038/nature12345", "10.9999/test"], "pmids": [12345678]}'
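Because the batch endpoint accepts up to 500 items, a client with a longer list has to split it. The sketch below builds JSON request bodies in the same shape as the curl example. It assumes the limit counts DOIs and PMIDs together, which this document does not state, and the helper name `batch_payloads` is hypothetical.

```python
import json
from typing import Iterator

MAX_BATCH = 500  # limit from the endpoint table; assumed to cover DOIs + PMIDs combined


def batch_payloads(
    dois: list[str], pmids: list[int], limit: int = MAX_BATCH
) -> Iterator[str]:
    """Yield JSON bodies for POST /check/batch, each holding at most `limit` items."""
    items = [("dois", d) for d in dois] + [("pmids", p) for p in pmids]
    for start in range(0, len(items), limit):
        body: dict[str, list] = {"dois": [], "pmids": []}
        for kind, value in items[start:start + limit]:
            body[kind].append(value)
        yield json.dumps(body)


payloads = list(batch_payloads([f"10.1000/{i}" for i in range(600)], [12345678]))
print(len(payloads))
```

Each yielded string can be sent as the request body with `Content-Type: application/json`.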
Response format
{
  "query": "10.1038/nature12345",
  "matched": true,
  "matches": [
    {
      "record_id": 42,
      "title": "Example retracted paper",
      "journal": "Nature",
      "retraction_nature": "Retraction",
      "reason": "Falsification/Fabrication of Data;",
      "retraction_date": "2022-03-15",
      "original_paper_doi": "10.1038/nature12345",
      "retraction_doi": "10.1038/nature12345retract",
      "original_paper_pmid": 12345678
    }
  ],
  "meta": {
    "dataset_version": "a1b2c3d4e5f6a7b8",
    "built_at": "2024-11-01T12:00:00+00:00",
    "row_count": "68999",
    "source_url": "https://gitlab.com/crossref/retraction-watch-data/-/raw/main/retraction_watch.csv"
  }
}
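A client consuming this response might reduce it to a one-line summary. The sketch below relies only on the keys shown in the example response; the `summarize` helper itself is hypothetical.

```python
import json


def summarize(response_text: str) -> str:
    """Turn a single-lookup JSON response into a short human-readable line."""
    data = json.loads(response_text)
    if not data["matched"]:
        return f"{data['query']}: no match"
    parts = [
        f"{m['retraction_nature']} ({m['retraction_date']})"
        for m in data["matches"]
    ]
    return f"{data['query']}: " + "; ".join(parts)


example = json.dumps(
    {
        "query": "10.1038/nature12345",
        "matched": True,
        "matches": [
            {"retraction_nature": "Retraction", "retraction_date": "2022-03-15"}
        ],
        "meta": {},
    }
)
print(summarize(example))
```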
Python API
Use rwcheck directly from Python without starting the HTTP server.
from rwcheck import check_doi, check_pmid, check_batch
# Single DOI lookup: returns dict
result = check_doi("10.1038/nature12345", db_path="data/rw.sqlite")
if result["matched"]:
    m = result["matches"][0]
    print(m["retraction_nature"], m["retraction_date"])
# Single PMID lookup: returns dict
result = check_pmid(12345678, db_path="data/rw.sqlite")
# Batch lookup: returns JSON string, so decode it before use
import json
raw = check_batch(
    dois=["10.1038/nature12345", "10.9999/test"],
    pmids=[12345678],
    db_path="data/rw.sqlite",
)
data = json.loads(raw)
# Keep only the queries that matched a retraction record
retracted = [r for r in data["results"] if r["matched"]]
If the RW_DB_PATH environment variable is set, db_path can be omitted:
import os, rwcheck
os.environ["RW_DB_PATH"] = "data/rw.sqlite"
result = rwcheck.check_doi("10.1038/nature12345")
Return shapes
| Function | Returns | Keys |
|---|---|---|
| `check_doi(doi)` | `dict` | `query`, `matched`, `matches`, `meta` |
| `check_pmid(pmid)` | `dict` | `query`, `matched`, `matches`, `meta` |
| `check_batch(dois, pmids)` | `str` (JSON) | `results` (list), `meta` |
Each item in matches / results[].matches contains: record_id, title, journal, retraction_nature, retraction_date, reason, original_paper_doi, retraction_doi, original_paper_pmid, country, paywalled, and more.
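Since `check_batch` returns a JSON string, a caller may want to flatten it into rows for a spreadsheet. The following sketch assumes each item in `results` carries `query`, `matched`, and `matches` like a single-lookup result; the `results_to_tsv` helper and the chosen columns are hypothetical.

```python
import csv
import io
import json


def results_to_tsv(raw: str) -> str:
    """Flatten a batch JSON string into TSV, one row per match or per unmatched query."""
    data = json.loads(raw)
    buf = io.StringIO()
    writer = csv.writer(buf, delimiter="\t", lineterminator="\n")
    writer.writerow(["query", "matched", "retraction_nature", "retraction_date"])
    for item in data["results"]:
        if item["matches"]:
            for m in item["matches"]:
                writer.writerow(
                    [
                        item["query"],
                        item["matched"],
                        m.get("retraction_nature", ""),
                        m.get("retraction_date", ""),
                    ]
                )
        else:
            # Unmatched queries still get a row so nothing is silently dropped.
            writer.writerow([item["query"], item["matched"], "", ""])
    return buf.getvalue()


raw = json.dumps(
    {
        "results": [
            {
                "query": "10.1/a",
                "matched": True,
                "matches": [
                    {"retraction_nature": "Retraction", "retraction_date": "2022-03-15"}
                ],
            },
            {"query": "10.1/b", "matched": False, "matches": []},
        ],
        "meta": {},
    }
)
print(results_to_tsv(raw))
```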
Docker
# Build image
make docker-build
# Run (mounts ./data for persistent SQLite DB)
make docker-run
Or directly:
docker build -t rwcheck .
docker run -p 8000:8000 -v "$(pwd)/data:/app/data" rwcheck
CLI Reference
Usage: rwcheck [OPTIONS] COMMAND [ARGS]...

  Check DOIs/PMIDs against the Retraction Watch dataset.

Commands:
  doi         Check a single DOI.
  pmid        Check a single PubMed ID.
  batch-doi   Batch-check DOIs from a text or CSV file.
  batch-pmid  Batch-check PMIDs from a text or CSV file.
  batch-bib   Check all references in a BibTeX file; write JSON + Markdown report.
  update      Download the latest dataset and rebuild the local DB.

Options:
  --version  Show version and exit.
  --help     Show this message and exit.
Common options
| Option | Description |
|---|---|
| `--db PATH` | Path to local SQLite DB (default: data/rw.sqlite) |
| `--api URL` | Use remote API instead of local DB |
| `--json` | Output raw JSON (single-item commands) |
| `--out json\|tsv\|table` | Output format for batch commands |
| `--col NAME` | CSV column name for batch commands |
| `--report-dir DIR` | Directory for batch-bib report files |
| `--force` | Force DB rebuild even if unchanged |
Environment variables (API + Python API)
| Variable | Default | Description |
|---|---|---|
| `RW_DB_PATH` | `data/rw.sqlite` | SQLite database path (used by API server and Python API) |
| `RW_CSV_URL` | GitLab raw URL | Retraction Watch CSV source |
| `RATE_LIMIT` | `60/minute` | slowapi rate limit per IP |
| `UPDATE_INTERVAL_HOURS` | `24` | Hours between auto-updates |
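Resolving these variables with their documented defaults could look like the sketch below. The `Settings` dataclass and `load_settings` function are hypothetical names, not part of rwcheck, and `RW_CSV_URL` is left out because its default URL is not spelled out in the table.

```python
import os
from dataclasses import dataclass
from typing import Mapping


@dataclass(frozen=True)
class Settings:
    """Resolved configuration values (hypothetical container for this sketch)."""

    db_path: str
    rate_limit: str
    update_interval_hours: float


def load_settings(env: Mapping[str, str] = os.environ) -> Settings:
    """Read the documented variables, falling back to the defaults in the table."""
    return Settings(
        db_path=env.get("RW_DB_PATH", "data/rw.sqlite"),
        rate_limit=env.get("RATE_LIMIT", "60/minute"),
        update_interval_hours=float(env.get("UPDATE_INTERVAL_HOURS", "24")),
    )


print(load_settings({"UPDATE_INTERVAL_HOURS": "6"}))
```

Accepting any mapping instead of reading `os.environ` directly makes the defaults easy to check in isolation.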
Development
make install # pip install -e ".[dev]"
make test # pytest
make lint # ruff + mypy
make fmt # ruff format + fix
make test-cov # pytest with coverage report
Data source
The Retraction Watch dataset is maintained by the Center for Scientific Integrity and distributed via Crossref on GitLab. Please review their terms of use before deploying publicly.