Deterministic bibliography verifier for .bib files and PDFs

Reason this release was yanked:

Incorrect README

citation-checker

A deterministic bibliography verifier for .bib files and PDFs. Checks whether cited works exist in authoritative academic databases and whether core metadata (title, authors, year) actually matches what you cited — no LLMs involved.

How It Works

Each bibliography entry is verified through a priority-ordered strategy chain:

  1. DOI → CrossRef — direct lookup by DOI; the most reliable path
  2. arXiv eprint → arXiv Atom XML API — for preprints with an eprint field
  3. Title + author → CrossRef bibliographic search — for entries without a DOI
  4. Title → OpenAlex — fallback when CrossRef search finds nothing
  5. Title + author → Semantic Scholar — additional fallback with strong ML/CS venue coverage (PMLR, NeurIPS, ICML, etc.)
  6. URL → web title extraction — for news and media sources (Bloomberg, NYT, Reuters, etc.)
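
The chain above can be sketched as an ordered list of strategies tried in turn until one returns a record. This is an illustration only, not the tool's actual code: the `Entry` class, the stub lookups, and every strategy name except `doi_crossref` (which appears in the JSON report sample) are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Entry:
    """Hypothetical container for the fields a bibliography entry may carry."""
    title: Optional[str] = None
    authors: tuple = ()
    doi: Optional[str] = None
    eprint: Optional[str] = None
    url: Optional[str] = None


def miss(entry):
    """Stub lookup that finds nothing; a real one would query the service."""
    return None


def hit(entry):
    """Stub lookup that pretends the service returned a matching record."""
    return {"title": entry.title}


# (name, applies-to-entry predicate, lookup) in priority order.
# The stubs are wired so that the first three strategies miss.
STRATEGIES = [
    ("doi_crossref", lambda e: e.doi is not None, miss),
    ("arxiv_eprint", lambda e: e.eprint is not None, miss),
    ("crossref_search", lambda e: bool(e.title and e.authors), miss),
    ("openalex_title", lambda e: bool(e.title), hit),
    ("semantic_scholar", lambda e: bool(e.title and e.authors), hit),
    ("url_title", lambda e: e.url is not None, hit),
]


def run_chain(entry, strategies=STRATEGIES):
    """Try each applicable strategy in order; return the first (name, record)."""
    for name, applies, lookup in strategies:
        if not applies(entry):
            continue  # e.g. skip the DOI lookup when the entry has no DOI
        record = lookup(entry)
        if record is not None:
            return name, record
    return None, None  # nothing matched; the caller would report NOT_FOUND
```

With these stubs, an entry that has a title and authors but no DOI falls through the CrossRef search and is resolved by the OpenAlex step.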

All entries are checked concurrently with per-host rate limiting to stay within API guidelines.
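
One way to combine bounded concurrency with per-host rate limiting is a semaphore plus a per-host minimum interval. The sketch below uses only `asyncio` from the standard library; the `HostLimiter` class, the `check_all` function, and the interval value are assumptions for illustration and do not describe how citation-checker implements its limiter.

```python
import asyncio
import time
from urllib.parse import urlparse


class HostLimiter:
    """Enforce a minimum delay between requests to the same host."""

    def __init__(self, min_interval: float):
        self.min_interval = min_interval
        self._locks = {}  # host -> asyncio.Lock
        self._last = {}   # host -> monotonic time of the last request

    async def wait(self, url: str) -> None:
        host = urlparse(url).netloc
        lock = self._locks.setdefault(host, asyncio.Lock())
        async with lock:  # serialize callers that target the same host
            last = self._last.get(host)
            if last is not None:
                delay = last + self.min_interval - time.monotonic()
                if delay > 0:
                    await asyncio.sleep(delay)
            self._last[host] = time.monotonic()


async def check_all(urls, limiter, concurrency=10):
    """Run all checks concurrently, capped by a semaphore."""
    sem = asyncio.Semaphore(concurrency)
    done = []

    async def check(url):
        async with sem:
            await limiter.wait(url)
            done.append(url)  # a real check would issue the HTTP request here

    await asyncio.gather(*(check(u) for u in urls))
    return done
```

Requests to different hosts proceed in parallel, while two requests to the same host are spaced at least `min_interval` seconds apart.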

Installation

Requires Python ≥ 3.11.

pip install -e .

Dependencies: bibtexparser, pymupdf, httpx, rapidfuzz, rich

Quick Start

# Check a .bib file
citation-checker refs.bib --mailto you@email.com

# Check a PDF's bibliography
citation-checker paper.pdf --mailto you@email.com

# Save a JSON report and show fuzzy match scores
citation-checker refs.bib --mailto you@email.com --output report.json --show-scores

# Show what the database actually found (DB title + authors columns)
citation-checker refs.bib --show-remote

# Only show problems
citation-checker refs.bib --filter-status MISMATCH NOT_FOUND ERROR

# Check specific cite keys
citation-checker refs.bib --filter-keys Vaswani17 LeCun89

Tip: Pass --mailto you@email.com to join the CrossRef polite pool and get higher rate limits.
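
For context, the CrossRef REST API accepts a `mailto` query parameter as one way of identifying a polite-pool client. The helper below only builds such a URL and makes no request; `crossref_doi_url` is a hypothetical name, and whether citation-checker sends the address as a query parameter or in a header is not stated here.

```python
from typing import Optional
from urllib.parse import quote, urlencode

CROSSREF_WORKS = "https://api.crossref.org/works"


def crossref_doi_url(doi: str, mailto: Optional[str] = None) -> str:
    """Build a CrossRef works URL for a DOI, optionally tagged with mailto."""
    url = f"{CROSSREF_WORKS}/{quote(doi, safe='/')}"
    if mailto:
        url += "?" + urlencode({"mailto": mailto})  # '@' is percent-encoded
    return url


print(crossref_doi_url("10.1000/xyz123", "you@email.com"))
```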

CLI Reference

| Flag | Default | Description |
|---|---|---|
| bib_file | required | Path to a .bib or .pdf file |
| --output, -o PATH | | Write a JSON report to this file |
| --mailto EMAIL | | Your email for the CrossRef polite pool |
| --openalex-key KEY | | OpenAlex API key for higher rate limits |
| --timeout SECS | 10.0 | Per-request HTTP timeout |
| --retries N | 3 | Max retries per request |
| --concurrency N | 10 | Max simultaneous entry checks |
| --no-check-urls | off | Disable supplementary URL reachability checks |
| --show-scores | off | Show title/author fuzzy scores in the table |
| --show-remote | off | Show the title and authors returned by the matched database |
| --filter-keys KEY... | | Only check these cite keys |
| --filter-status S... | | Only display entries with these statuses |
| --quiet | off | Print summary only; suppress table |
| --json-only | off | No terminal output; write JSON only (requires --output) |
| --verbose | off | Enable debug logging to stderr |

Exit codes: 0 = all OK · 1 = MISMATCH or ERROR found · 2 = parse/file error · 3 = config error
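
Read literally, the line above means that only MISMATCH and ERROR results turn a completed run into exit code 1. A minimal sketch of that mapping follows; `exit_code_for` is a hypothetical helper, and treating NOT_FOUND as non-failing is an assumption drawn from the wording above, not confirmed behaviour. Codes 2 and 3 cover failures before any entry is checked and are not modelled.

```python
FAILING_STATUSES = {"MISMATCH", "ERROR"}


def exit_code_for(statuses):
    """Return 1 if any entry status is MISMATCH or ERROR, otherwise 0."""
    return 1 if any(s in FAILING_STATUSES for s in statuses) else 0


print(exit_code_for(["VERIFIED", "GREY_LITERATURE"]))
print(exit_code_for(["VERIFIED", "MISMATCH"]))
```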

Verification Statuses

| Status | Meaning |
|---|---|
| VERIFIED | Found in an external database; title, authors, and year match |
| MISMATCH | Found, but one or more core fields differ significantly |
| NOT_FOUND | Not found in any database |
| GREY_LITERATURE | Software, dataset, government report, or news article; not expected in academic DBs |
| UNVERIFIABLE | Too little metadata (no title or authors) to search with |
| ERROR | Network or API failure for this entry |

Fuzzy Matching

Field comparison uses RapidFuzz:

  • Title: fuzz.ratio ≥ 85 after NFKD normalization and LaTeX artifact stripping
  • Authors: per-author best-match token_sort_ratio ≥ 80 — handles "Last, First" vs "First Last" ordering; abbreviated first names (e.g., "R. Smith" vs "Robert Smith") are tolerated as soft warnings
  • Year: exact integer match; a mismatch always forces MISMATCH regardless of other scores

Author scores between 55 and 80 produce a warning but do not by themselves trigger MISMATCH.
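
The rules above can be sketched with the standard library alone. Note the substitutions: `difflib.SequenceMatcher` stands in for RapidFuzz's `fuzz.ratio` (both give a 0 to 100 similarity, but the scores are not identical), and the LaTeX stripping is a simplified guess at what "artifact stripping" covers. Treating an author score below 55 or a title score below 85 as MISMATCH is also an assumption; the text above only states the warning band and the year rule.

```python
import re
import unicodedata
from difflib import SequenceMatcher

TITLE_THRESHOLD = 85.0
AUTHOR_THRESHOLD = 80.0
AUTHOR_WARN_FLOOR = 55.0


def normalize_title(text: str) -> str:
    """Strip simple LaTeX artifacts, NFKD-normalize, and drop accents."""
    text = re.sub(r"\\[a-zA-Z]+\s*|\\.", "", text)  # \emph, \textbf, \" ...
    text = re.sub(r"[{}$]", "", text)               # braces and math delimiters
    text = text.replace("~", " ")                   # non-breaking ties
    text = unicodedata.normalize("NFKD", text)
    text = "".join(ch for ch in text if not unicodedata.combining(ch))
    return re.sub(r"\s+", " ", text).strip().lower()


def similarity(a: str, b: str) -> float:
    """0-100 similarity; a stand-in for rapidfuzz.fuzz.ratio."""
    return 100.0 * SequenceMatcher(None, a, b).ratio()


def classify(title_score: float, author_score: float, year_match: bool):
    """Return (status, warnings) under the assumptions stated above."""
    warnings = []
    if not year_match:
        return "MISMATCH", warnings  # a year mismatch always wins
    if title_score < TITLE_THRESHOLD or author_score < AUTHOR_WARN_FLOOR:
        return "MISMATCH", warnings
    if author_score < AUTHOR_THRESHOLD:
        warnings.append("low author score")  # soft warning only
    return "VERIFIED", warnings
```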

PDF Input

When given a .pdf file, citation-checker extracts the bibliography section using PyMuPDF and auto-detects the reference list format:

  • Numbered ([1] Author, A. Title. Venue, year.) — brackets or parenthesised numbers
  • Author–year (Surname, A. (year). Title. Venue.) — common in economics and some CS venues
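
Format detection of this kind can be approximated by counting which pattern matches more reference lines. The regular expressions and the `detect_format` function below are illustrative assumptions, not the patterns the tool uses, and the sample lines are placeholders in the two shapes listed above.

```python
import re

# "[12] ..." or "(12) ..." at the start of a line.
NUMBERED = re.compile(r"^\s*(?:\[\d+\]|\(\d+\))\s+")
# "Surname, A. (2015)." style: starts with a letter, year in parentheses.
AUTHOR_YEAR = re.compile(r"^\s*[^\W\d_][^()]*?\((?:19|20)\d{2}[a-z]?\)\.")


def detect_format(lines):
    """Return 'numbered', 'author_year', or None if neither pattern matches."""
    numbered = sum(bool(NUMBERED.match(line)) for line in lines)
    author_year = sum(bool(AUTHOR_YEAR.match(line)) for line in lines)
    if numbered == 0 and author_year == 0:
        return None
    return "numbered" if numbered >= author_year else "author_year"


print(detect_format(["[1] Vaswani, A. Title. Venue, 2017."]))
print(detect_format(["Chin, A. (2015). Title. Venue."]))
```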

Cite keys are derived from the first author's surname + year (e.g., Vaswani2017, LeCun1989). When two entries share the same base key, a letter suffix is appended to the second and later occurrences (Chin2015, Chin2015a).
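
A short sketch of this key scheme, assuming suffixes continue alphabetically (a, b, c, ...) for the third and later duplicates, which the example above implies but does not show. `derive_keys` is a hypothetical helper taking (surname, year) pairs.

```python
import string


def derive_keys(entries):
    """Map (surname, year) pairs to cite keys, suffixing repeated base keys."""
    seen = {}  # base key -> number of times seen so far
    keys = []
    for surname, year in entries:
        base = f"{surname}{year}"
        count = seen.get(base, 0)
        seen[base] = count + 1
        if count == 0:
            keys.append(base)  # first occurrence keeps the bare key
        else:
            # Second occurrence gets 'a', third 'b', and so on.
            # This sketch supports at most 27 entries per base key.
            keys.append(base + string.ascii_lowercase[count - 1])
    return keys


print(derive_keys([("Chin", 2015), ("Vaswani", 2017), ("Chin", 2015)]))
```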

Extracted entries go through the same verification pipeline as .bib entries. No DOI or arXiv eprint is assumed unless one is found in the text.

Grey Literature

Entries whose URL points to a code or data hosting site (GitHub, Zenodo, Hugging Face), a government or national lab site (nrel.gov, epa.gov, eia.gov, etc.), or a corporate technical resource are automatically classified as GREY_LITERATURE and excluded from academic database searches, since they are not expected to appear in CrossRef or Semantic Scholar.

Entries whose URL points to a supported news or media domain (Bloomberg, Financial Times, NYT, Reuters, WSJ, The Guardian, BBC, Wired, MIT Technology Review, and more) are verified by fetching the page and comparing the article title. A match counts as VERIFIED.
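
The routing described in this section amounts to a hostname lookup. The sketch below assumes short illustrative domain sets; the tool's real lists are longer and are not reproduced here, and the `route` function and its return labels are hypothetical.

```python
from urllib.parse import urlparse

# Illustrative subsets only; the actual supported lists are not shown here.
GREY_HOSTS = {"github.com", "zenodo.org", "huggingface.co",
              "nrel.gov", "epa.gov", "eia.gov"}
NEWS_HOSTS = {"bloomberg.com", "ft.com", "nytimes.com", "reuters.com",
              "wsj.com", "theguardian.com", "bbc.com", "wired.com",
              "technologyreview.com"}


def _matches(host, domains):
    """True if host equals a listed domain or is a subdomain of one."""
    return any(host == d or host.endswith("." + d) for d in domains)


def route(url):
    """Decide how an entry with this URL would be handled."""
    host = (urlparse(url).hostname or "").lower()
    if _matches(host, GREY_HOSTS):
        return "grey_literature"   # skip academic database searches
    if _matches(host, NEWS_HOSTS):
        return "news_title_check"  # fetch the page and compare titles
    return "academic_search"


print(route("https://www.reuters.com/technology/some-article"))
```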

JSON Report

{
  "meta": {
    "tool_version": "1.0.0",
    "generated_at": "...",
    "bib_file": "refs.bib",
    "total_entries": 218,
    "elapsed_seconds": 61.4,
    "thresholds": { "title_score": 85.0, "author_score": 80.0 },
    "counts": { "VERIFIED": 190, "MISMATCH": 5, "NOT_FOUND": 8, ... }
  },
  "results": [
    {
      "cite_key": "Vaswani17",
      "entry_type": "article",
      "status": "VERIFIED",
      "strategy": "doi_crossref",
      "local":  { "title": "...", "authors": [...], "year": 2017, "doi": "...", "url": null, "eprint": null },
      "remote": { "title": "...", "authors": [...], "year": 2017, "source": "crossref" },
      "scores": { "title_score": 100.0, "author_score": 95.3, "year_match": true },
      "url_reachable": null,
      "error_message": null,
      "warnings": []
    }
  ]
}
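
A report in this shape is easy to post-process. The sketch below pulls out the entries that need attention; the two-entry `SAMPLE` is made-up data with most fields trimmed, and `problem_entries` is a hypothetical helper, not part of the package.

```python
import json

# Made-up report in the shape shown above (fields trimmed for brevity).
SAMPLE = json.loads("""
{
  "meta": {"bib_file": "refs.bib", "total_entries": 2},
  "results": [
    {"cite_key": "Vaswani17", "status": "VERIFIED", "warnings": []},
    {"cite_key": "LeCun89", "status": "MISMATCH", "warnings": []}
  ]
}
""")


def problem_entries(report, statuses=("MISMATCH", "NOT_FOUND", "ERROR")):
    """Return (cite_key, status) pairs for entries with a problem status."""
    return [
        (r["cite_key"], r["status"])
        for r in report["results"]
        if r["status"] in statuses
    ]


print(problem_entries(SAMPLE))
```

For a real run, load the file written by --output with `json.load` instead of using `SAMPLE`.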

Running Tests

pip install -e ".[dev]"
pytest tests/ -v

Tests use respx to mock all HTTP calls — no real API requests are made during testing.
