Catch ghost citations — cross-check a bibliography's claimed author/year against CrossRef
ghostcite
Catch ghost citations — right DOI, wrong author.
ghostcite is a deterministic, no-LLM command-line tool that cross-checks a
bibliography's claimed author and year against CrossRef's canonical record for
each DOI. It catches the dominant ghost-citation failure mode — a reference whose
cited authorship doesn't match the paper the DOI actually points to — and flags
retracted or expression-of-concern works along the way.
The problem
LLM-assisted writing (and plain copy-paste drift) routinely produces references
that look right but attribute the cited DOI to the wrong authors or year. A
manuscript cites "Li et al. 2024," but DOI 10.3390/plants13060869 is actually
Chen et al. A reviewer catches it; an automated check catches it first.
Does the metadata you wrote for this citation match what CrossRef says the DOI actually is?
No model, no API key, no download — just CrossRef's REST API and a comparison.
Install
pip install ghostcite # into the current environment
pipx install ghostcite # isolated CLI install (recommended)
uv tool install ghostcite # if you use uv
Usage
ghostcite refs.bib # check a BibTeX file (or .md / DOI list)
ghostcite refs.bib --cross-check pubmed # corroborate against PubMed
ghostcite refs.bib --json # machine-readable output (for CI)
ghostcite refs.bib --fail-on author,year,retraction # tune the CI gate
cat refs.bib | ghostcite - # read from stdin
Input format is auto-detected (BibTeX, Markdown reference list, or bare DOI list);
override with --format {auto,bibtex,markdown,doi}.
Real example — refs.bib cites "Li (2024)" for a DOI CrossRef says is Chen:
$ ghostcite refs.bib
ghostcite: 1 entries, 1 with DOIs
✗ A L1 Li (2024) → DOI resolves to Chen (2024) — possibly wrong DOI [10.3390/plants13060869]
1 A
$ echo $?
1
All flags & the anatomy of a finding
✗ A L1 Li (2024) → DOI resolves to Chen (2024)… [10.3390/plants13060869]
│ │ │  │           │                            │
│ │ │  │           │                            └─ DOI that was checked
│ │ │  │           └─ what CrossRef actually records
│ │ │  └─ what you cited (claimed first author + year)
│ │ └─ source line in your bibliography
│ └─ tier: A author · B year · C cosmetic · R retraction · U unresolvable
└─ glyph: ✗ fails CI · ⚠ retraction · · informational
- --cross-check pubmed: adds PubMed/NCBI as a second source of truth. When PubMed backs CrossRef, a finding is annotated ↳ corroborated by PubMed; when PubMed instead agrees with what you cited, it's flagged as a CrossRef↔PubMed conflict (the tier is kept so you don't silently trust either source). PubMed can also raise a finding CrossRef missed, or supply a record for a DOI absent from CrossRef. Optional --ncbi-email / --ncbi-api-key (or NCBI_EMAIL / NCBI_API_KEY) follow NCBI E-utilities etiquette and unlock a higher rate limit; neither is required.
- --max-rps <n>: caps outbound requests per second. ghostcite already self-throttles to CrossRef's advertised rate limit (read from the response headers); --max-rps lets you be more conservative (the stricter of the two wins).
- --color {auto,always,never}: colorizes the tier glyphs. auto (default) colorizes only on a TTY. NO_COLOR is honored and wins even over always. --json output is never colorized.
- stdin (-): pass - as the filename to read from stdin, e.g. cat refs.bib | ghostcite - or ghostcite - --format doi < dois.txt.
- --dry-run: parse + classify + count only, no network.
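The "stricter of the two wins" rule for --max-rps can be sketched as below. This is a minimal illustration, not ghostcite's actual code: the function names are hypothetical, and the header names X-Rate-Limit-Limit and X-Rate-Limit-Interval are assumed from CrossRef's public REST API documentation rather than confirmed by this README.

```python
from typing import Mapping, Optional


def advertised_rps(headers: Mapping[str, str]) -> Optional[float]:
    """Requests per second advertised by the server, or None if absent.

    Assumes headers shaped like "X-Rate-Limit-Limit: 50" and
    "X-Rate-Limit-Interval: 1s" (an assumption for this sketch).
    """
    limit = headers.get("X-Rate-Limit-Limit")
    if limit is None:
        return None
    interval = headers.get("X-Rate-Limit-Interval", "1s")
    # "1s" -> 1.0; a bare "s" falls back to one second.
    seconds = float(interval.rstrip("s") or 1)
    return float(limit) / seconds


def effective_rps(
    headers: Mapping[str, str], max_rps: Optional[float] = None
) -> Optional[float]:
    """Pick the stricter (lower) of the advertised limit and the user cap."""
    candidates = [r for r in (advertised_rps(headers), max_rps) if r is not None]
    return min(candidates) if candidates else None
```

A user cap above the advertised limit has no effect in this sketch, which matches the README's statement that --max-rps can only make the tool more conservative.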
See examples/ for ready-to-run sample inputs and captured output.
How it works
flowchart TD
A["Citation: claimed author + year (+ DOI)"] --> B{"Has DOI?"}
B -- yes --> C["GET CrossRef /works/{DOI}"]
B -- no --> D["CrossRef bibliographic search<br/>(low-confidence)"]
C --> E{"DOI resolves?"}
E -- no --> U["Tier U — unresolvable"]
E -- yes --> F["Compare claimed vs. canonical record"]
D --> F
F --> G{"First-author surname matches?"}
G -- no --> TA["Tier A — author mismatch"]
G -- yes --> H{"Year matches?"}
H -- no --> TB["Tier B — year mismatch"]
H -- yes --> OK["OK"]
C --> R{"Retracted / expression of concern?"}
R -- yes --> TR["Tier R — retraction (orthogonal)"]
F -. "--cross-check pubmed" .-> P["PubMed second opinion"]
No language model is involved at any step. ghostcite resolves each DOI at CrossRef
(and optionally PubMed), then does a pure, deterministic comparison of the claimed
first-author surname (Unicode-folded, punctuation-stripped) and year against the
canonical record, plus a retraction / expression-of-concern check. Only the HTTP
client touches the network, via CrossRef's polite pool (a descriptive User-Agent
with the project URL, never a personal email).
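The comparison step described above might look roughly like this. It is a sketch under stated assumptions, not ghostcite's implementation: fold_surname, canonical_first_author_and_year, and matches are hypothetical names, and the record shape (an author list with family fields, an issued object with date-parts) is assumed from the message object returned by CrossRef's /works/{DOI} endpoint.

```python
import unicodedata
from typing import Any, Mapping, Optional, Tuple


def fold_surname(name: str) -> str:
    """Unicode-fold and strip punctuation: 'Bürger' and 'Burger' compare equal."""
    decomposed = unicodedata.normalize("NFKD", name)
    no_marks = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    # Keep only letters and digits; this sketch also drops spaces and hyphens.
    return "".join(ch for ch in no_marks.casefold() if ch.isalnum())


def canonical_first_author_and_year(
    message: Mapping[str, Any],
) -> Tuple[Optional[str], Optional[int]]:
    """Pull first-author surname and year out of a CrossRef-style record."""
    authors = message.get("author") or []
    family = authors[0].get("family") if authors else None
    date_parts = (message.get("issued") or {}).get("date-parts") or [[None]]
    year = date_parts[0][0] if date_parts[0] else None
    return family, year


def matches(
    claimed_surname: str, claimed_year: int, message: Mapping[str, Any]
) -> Tuple[bool, bool]:
    """Return (author_matches, year_matches) for one citation."""
    family, year = canonical_first_author_and_year(message)
    author_ok = family is not None and fold_surname(family) == fold_surname(
        claimed_surname
    )
    return author_ok, year == claimed_year


# Hand-written record for illustration only, not a real API response.
record = {"author": [{"family": "Chen"}], "issued": {"date-parts": [[2024]]}}
print(matches("Li", 2024, record))
```

Because the comparison is a pure function of the claimed values and the fetched record, the same inputs always give the same verdict, which is what makes the check deterministic.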
Severity tiers, input formats & exit codes
| Tier | Meaning | Fails CI? |
|---|---|---|
| A | author-mismatch — claimed first author isn't in CrossRef's authors | Yes |
| B | year-mismatch — author matches, claimed year differs | Yes |
| C | cosmetic — matches only after diacritic/initials fold (Bürger≈Burger) | No (info) |
| R | retraction / expression-of-concern per CrossRef | Yes (fires regardless of A/B/C) |
| U | unresolvable — DOI 404s, or no-DOI entry search was inconclusive | No (warn) |
| OK | first author + year match | — |
When the claimed title also diverges strongly from CrossRef's title, a Tier A finding is annotated "possibly wrong DOI entirely" to distinguish a wrong-author citation from a wrong-DOI one.
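Read as a decision procedure, the tier table could be sketched as follows. The Comparison dataclass and classify function are hypothetical names introduced for illustration. The sketch assumes that at most one of A, B, or C fires per citation and that R is reported alongside it; the README confirms only that R fires regardless of the other tiers.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class Comparison:
    """Outcome of comparing one citation with its canonical record."""

    resolved: bool                # did the DOI (or search) yield a record?
    author_exact: bool = False    # surname matches as written
    author_folded: bool = False   # surname matches after diacritic folding
    year_match: bool = False
    retracted: bool = False       # retraction or expression of concern
    title_diverges: bool = False  # claimed title strongly differs


def classify(c: Comparison) -> List[Tuple[str, str]]:
    """Map a comparison to (tier, note) findings, following the tier table."""
    if not c.resolved:
        return [("U", "unresolvable")]
    findings: List[Tuple[str, str]] = []
    if not c.author_folded:
        note = "author mismatch"
        if c.title_diverges:
            note += " (possibly wrong DOI entirely)"
        findings.append(("A", note))
    elif not c.year_match:
        findings.append(("B", "year mismatch"))
    elif not c.author_exact:
        findings.append(("C", "cosmetic"))
    if c.retracted:
        # Orthogonal to A/B/C: reported in addition to any other finding.
        findings.append(("R", "retraction / expression of concern"))
    return findings or [("OK", "first author + year match")]
```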
| Format | Detection | Yields claimed author/year? |
|---|---|---|
| BibTeX | @article{…} / @…{…} entries | Yes (author, year, doi, title) |
| Markdown | bullet refs - **AuthorList (YYYY).** … 10.x … | Yes |
| DOI list | newline-delimited bare DOIs / doi: / DOI URLs | No (lookup + retraction sweep only) |
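The detection rules in the format table could be approximated with a few regular expressions. This is only a guess at the kind of heuristic involved: detect_format, is_bare_doi, the patterns, and the "unknown" fallback are all assumptions of this sketch, not ghostcite's actual detection logic.

```python
import re

# Loose DOI shape: "10.", a registrant code, a slash, then a suffix.
DOI_RE = re.compile(r"10\.\d{4,9}/\S+")
# Optional "doi:" or DOI-URL prefix in front of a bare DOI.
PREFIX_RE = re.compile(r"^(?:doi:\s*|https?://(?:dx\.)?doi\.org/)", re.IGNORECASE)
BIBTEX_RE = re.compile(r"@\w+\s*\{")
# Markdown bullet reference such as "- **AuthorList (YYYY).** ..."
MARKDOWN_RE = re.compile(r"[-*]\s+\*\*.+\(\d{4}\)")


def is_bare_doi(line: str) -> bool:
    """True if the line is nothing but a DOI, optionally prefixed."""
    return DOI_RE.fullmatch(PREFIX_RE.sub("", line)) is not None


def detect_format(text: str) -> str:
    """Guess 'bibtex', 'markdown', or 'doi'; 'unknown' if nothing fits."""
    lines = [ln.strip() for ln in text.splitlines() if ln.strip()]
    if any(BIBTEX_RE.match(ln) for ln in lines):
        return "bibtex"
    if any(MARKDOWN_RE.match(ln) for ln in lines):
        return "markdown"
    if lines and all(is_bare_doi(ln) for ln in lines):
        return "doi"
    return "unknown"
```

When a heuristic like this guesses wrong, the documented --format flag is the way to override it.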
| Exit code | Meaning |
|---|---|
| 0 | clean: no findings at or above the fail threshold |
| 1 | findings present at/above the threshold |
| 2 | tool error (network down, unparseable input, …) |
--fail-on (default author,year,retraction) selects which tiers force exit 1;
--fail-on none runs as a passive reporter. Tiers C and U never force exit 1.
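The gate described above can be sketched as a small pure function. The names FAILABLE and exit_code are hypothetical, and the mapping of --fail-on names to tier letters is inferred from the tier table. Exit code 2 (tool error) is outside the scope of this sketch.

```python
from typing import Iterable

# --fail-on names mapped to the tier letters they gate (inferred mapping).
FAILABLE = {"author": "A", "year": "B", "retraction": "R"}


def exit_code(
    found_tiers: Iterable[str], fail_on: str = "author,year,retraction"
) -> int:
    """Return 1 if any found tier is selected by fail_on, else 0."""
    if fail_on.strip() == "none":
        return 0  # passive reporter: findings never fail the run
    names = (name.strip() for name in fail_on.split(","))
    gate = {FAILABLE[name] for name in names if name in FAILABLE}
    # Tiers C and U are never in the gate, so they cannot force exit 1.
    return 1 if gate & set(found_tiers) else 0
```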
Use it in CI
A clean run is quiet and exits 0.
Drop in the composite GitHub Action:
- uses: musharna/ghostcite@v1
  with:
    paths: paper/refs.bib
    fail-on: "author,year,retraction"
…or the pre-commit hook:
repos:
  - repo: https://github.com/musharna/ghostcite
    rev: v0.1.0
    hooks:
      - id: ghostcite
        args: [paper/references.bib, --fail-on, "author,year,retraction"]
Either way, a finding at or above the --fail-on threshold returns a non-zero
exit, blocking the merge or commit before submission.
Scope & limitations
ghostcite checks metadata correctness (does the DOI's record match what you
wrote), not claim support (does the source actually say what your prose claims —
a separate, LLM-based concern). It does no auto-fixing and no citation-style
linting. CrossRef is the source of truth; --cross-check pubmed adds PubMed as an
optional second opinion.
- CrossRef stores particle surnames inconsistently (van der Berg vs Berg), so a correctly cited prefixed surname can occasionally produce a Tier A false positive.
- No-DOI entries are resolved by best-effort bibliographic search and flagged low-confidence; treat those as hints, not verdicts.
- Some preprints, datasets, and protocols carry no author metadata in CrossRef and surface as Tier U rather than a mismatch.
Related work & FAQ
ghostcite's niche is deterministic, no-LLM, CLI-first checking focused on the byline-mismatch failure mode (right DOI, wrong author/year) plus retraction flagging — built to run unattended in CI.
| Tool | What it does | How ghostcite differs |
|---|---|---|
| RefChecker | LLM-powered web-search reference validator | ghostcite is no-LLM, deterministic, and CI-safe (no model, no API key) |
| claude-skill-citation-checker | A Claude Code skill for an LLM agent | ghostcite is a standalone CLI + Action — no agent or LLM host needed |
| BibTeX Verifier | In-browser BibTeX checker | ghostcite is scriptable from the CLI and also flags retractions |
| CERCA | Java / AGPL citation checker | ghostcite is Python / MIT / pip install-able |
| scite Reference Check | Commercial, PDF-oriented, retraction focus | ghostcite is free / open-source, BibTeX-native, and catches byline mismatch |
| doimgr | Formats and manages DOIs (doesn't validate) | ghostcite verifies byline and retraction status, not just formatting |
Does it call an LLM? No — a deterministic comparison of the metadata you wrote against CrossRef's (and optionally PubMed's) canonical record. No model, no prompt, no API key required.
Will it hit rate limits? It self-throttles to CrossRef's advertised rate limit
(read from the live response headers); use --max-rps to be more conservative.
Does it catch fabricated DOIs? Indirectly — a DOI that 404s at CrossRef surfaces as Tier U. The core check is byline-vs-DOI consistency, so it catches the common case of a real DOI attached to the wrong citation.
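For readers curious what a 404 lookup involves, here is a minimal standard-library sketch of resolving one DOI against CrossRef's public works endpoint. The helper names, the placeholder User-Agent string, and the choice to return None on a 404 are assumptions of this sketch, not ghostcite's internals.

```python
import json
import urllib.error
import urllib.parse
import urllib.request
from typing import Any, Dict, Optional

CROSSREF_WORKS = "https://api.crossref.org/works/"


def works_url(doi: str) -> str:
    """Build the lookup URL, percent-encoding anything unsafe in the DOI."""
    return CROSSREF_WORKS + urllib.parse.quote(doi, safe="/")


def fetch_record(
    doi: str,
    user_agent: str = "example-checker/0.0 (https://example.org)",  # placeholder
    timeout: float = 10.0,
) -> Optional[Dict[str, Any]]:
    """Return CrossRef's record for the DOI, or None if it does not exist."""
    request = urllib.request.Request(
        works_url(doi), headers={"User-Agent": user_agent}
    )
    try:
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return json.load(response)["message"]
    except urllib.error.HTTPError as err:
        if err.code == 404:
            return None  # unknown DOI: what the README reports as Tier U
        raise  # other HTTP failures are tool errors, not findings
```

Note that a None result only shows CrossRef has no record for that DOI; the DOI may still be registered with another agency, which is why the answer above says "indirectly".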
License
MIT — see LICENSE.