Deterministic bibliography verifier for .bib files and PDFs

citation-checker

A deterministic bibliography verifier for .bib files and PDFs. Checks whether cited works exist in authoritative academic databases and whether core metadata (title, authors, year, venue) actually matches the given citation.

How It Works

Each bibliography entry is verified through a priority-ordered strategy chain:

  1. DOI → CrossRef — direct lookup by DOI; the most reliable path
  2. arXiv eprint → arXiv Atom XML API — for preprints with an eprint field
  3. Title + author → CrossRef bibliographic search — for entries without a DOI
  4. Title → OpenAlex — fallback when CrossRef search finds nothing
  5. Title + author → Semantic Scholar — additional fallback with strong ML/CS venue coverage (PMLR, NeurIPS, ICML, etc.)
  6. URL → web title extraction — for news and media sources (Bloomberg, NYT, Reuters, etc.)
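The priority ordering above can be sketched as a list of lookup functions tried in turn, stopping at the first one that returns a record. This is a minimal illustration, not the tool's actual code: the `Entry` class, `run_chain`, and the three stand-in lookups are hypothetical, and only the strategy name `doi_crossref` appears in the JSON report shown later on this page.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class Entry:
    """Hypothetical, simplified bibliography entry."""
    title: str = ""
    doi: Optional[str] = None
    eprint: Optional[str] = None


# A strategy returns a remote record (a plain dict here) or None on a miss.
Strategy = Callable[[Entry], Optional[dict]]


def run_chain(entry: Entry, chain: list) -> tuple:
    """Try each (name, strategy) pair in priority order; stop at the first hit."""
    for name, strategy in chain:
        record = strategy(entry)
        if record is not None:
            return name, record
    return None, None


# Stand-ins for the real lookups (assumptions; no network access here).
def by_doi(entry: Entry) -> Optional[dict]:
    return {"source": "crossref"} if entry.doi else None


def by_eprint(entry: Entry) -> Optional[dict]:
    return {"source": "arxiv"} if entry.eprint else None


def by_title(entry: Entry) -> Optional[dict]:
    return {"source": "openalex"} if entry.title else None


# Names other than "doi_crossref" are made up for this sketch.
CHAIN = [
    ("doi_crossref", by_doi),
    ("arxiv_eprint", by_eprint),
    ("title_openalex", by_title),
]
```

Keeping the chain as data makes the priority order explicit and easy to reorder or extend with further fallbacks.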

All entries are checked concurrently with per-host rate limiting to stay within API guidelines.
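One way to combine concurrent checks with per-host rate limiting is an asyncio semaphore for the overall concurrency cap plus a small limiter that spaces out requests to each host. The sketch below is an assumption about how such a scheme could look, not a description of the tool's internals; `HostLimiter` and `check_all` are hypothetical names, and the default of 10 mirrors the `--concurrency` default in the CLI reference.

```python
import asyncio
import time
from collections import defaultdict


class HostLimiter:
    """Enforce a minimum delay between requests to the same host."""

    def __init__(self, min_interval: float) -> None:
        self.min_interval = min_interval
        self._locks = defaultdict(asyncio.Lock)   # one lock per host
        self._next_ok = defaultdict(float)        # earliest allowed time per host

    async def wait(self, host: str) -> None:
        async with self._locks[host]:
            delay = self._next_ok[host] - time.monotonic()
            if delay > 0:
                await asyncio.sleep(delay)
            self._next_ok[host] = time.monotonic() + self.min_interval


async def check_all(entries, check_one, concurrency: int = 10):
    """Run check_one over all entries, at most `concurrency` at a time."""
    sem = asyncio.Semaphore(concurrency)

    async def guarded(entry):
        async with sem:
            return await check_one(entry)

    # gather preserves input order in its result list.
    return await asyncio.gather(*(guarded(e) for e in entries))
```

A `check_one` coroutine would call `await limiter.wait(host)` before each HTTP request, so entries that hit different hosts do not slow each other down.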

Installation

Requires Python ≥ 3.11.

pip install citation-checker

Dependencies: bibtexparser, pymupdf, httpx, rapidfuzz, rich

Quick Start

# Check a .bib file
citation-checker refs.bib --mailto you@email.com

# Check a PDF's bibliography
citation-checker paper.pdf --mailto you@email.com

# Save a JSON report and show fuzzy match scores
citation-checker refs.bib --mailto you@email.com --output report.json --show-scores

# Show what the database actually found (DB title + authors columns)
citation-checker refs.bib --show-remote

# Only show problems
citation-checker refs.bib --filter-status MISMATCH NOT_FOUND ERROR

# Check specific cite keys
citation-checker refs.bib --filter-keys Vaswani17 LeCun89

Tip: Pass --mailto you@email.com to join the CrossRef polite pool and get higher rate limits.

CLI Reference

Flag                  Default   Description
bib_file              required  Path to a .bib or .pdf file
--output, -o PATH     -         Write a JSON report to this file
--mailto EMAIL        -         Your email for the CrossRef polite pool
--openalex-key KEY    -         OpenAlex API key for higher rate limits
--timeout SECS        10.0      Per-request HTTP timeout, in seconds
--retries N           3         Max retries per request
--concurrency N       10        Max simultaneous entry checks
--no-check-urls       off       Disable supplementary URL reachability checks
--show-scores         off       Show title/author fuzzy scores in the table
--show-remote         off       Show the title and authors returned by the matched database
--filter-keys KEY...  -         Only check these cite keys
--filter-status S...  -         Only display entries with these statuses
--quiet               off       Print summary only; suppress table
--json-only           off       No terminal output; write JSON only (requires --output)
--verbose             off       Enable debug logging to stderr

Exit codes: 0 = all OK · 1 = MISMATCH or ERROR found · 2 = parse/file error · 3 = config error
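For scripted use, such as a CI step, the flags and exit codes above could be wrapped as follows. This is a sketch under assumptions: `build_command` and `run_check` are hypothetical helpers, they use only flags listed in the table, and `run_check` needs citation-checker installed and on PATH.

```python
import subprocess

# Meanings copied from the exit-code list above.
EXIT_MEANINGS = {
    0: "all OK",
    1: "MISMATCH or ERROR found",
    2: "parse/file error",
    3: "config error",
}


def build_command(path, mailto=None, output=None, statuses=()):
    """Assemble the argument list for one citation-checker run."""
    cmd = ["citation-checker", path]
    if mailto:
        cmd += ["--mailto", mailto]
    if output:
        cmd += ["--output", output]
    if statuses:
        cmd += ["--filter-status", *statuses]
    return cmd


def run_check(path, **options):
    """Run the CLI and return (exit_code, meaning)."""
    code = subprocess.run(build_command(path, **options)).returncode
    return code, EXIT_MEANINGS.get(code, "unknown exit code")
```

Passing the arguments as a list avoids shell quoting problems with file names that contain spaces.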

Verification Statuses

Status           Meaning
VERIFIED         Found in an external database; title, authors, and year match
MISMATCH         Found, but one or more core fields differ significantly
NOT_FOUND        Not found in any database
GREY_LITERATURE  Software, dataset, government report, or news article; not expected in academic DBs
UNVERIFIABLE     Too little metadata (no title or authors) to search with
ERROR            Network or API failure for this entry

Fuzzy Matching

Field comparison uses RapidFuzz:

  • Title: fuzz.ratio ≥ 85 after NFKD normalization and LaTeX artifact stripping
  • Authors: per-author best-match token_sort_ratio ≥ 80 — handles "Last, First" vs "First Last" ordering; abbreviated first names (e.g., "R. Smith" vs "Robert Smith") are tolerated as soft warnings
  • Year: exact integer match; a mismatch always forces MISMATCH regardless of other scores

Author scores between 55 and 80 produce a warning but do not by themselves trigger MISMATCH.
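The rules above can be illustrated with a small normalization helper and a decision function that takes already-computed scores. This is a sketch, not the tool's implementation. Several details are assumptions: the exact LaTeX stripping rules, dropping combining marks after NFKD, lowercasing, and treating an author score below 55 as a MISMATCH (the text only says that scores between 55 and 80 do not trigger one by themselves). In practice the scores would come from RapidFuzz, for example `fuzz.ratio` for titles and `fuzz.token_sort_ratio` for authors.

```python
import re
import unicodedata

TITLE_THRESHOLD = 85.0    # from the text above
AUTHOR_THRESHOLD = 80.0   # from the text above
AUTHOR_WARN_FLOOR = 55.0  # lower edge of the warning band


def normalize_title(title: str) -> str:
    """NFKD-normalize, strip simple LaTeX artifacts, collapse spaces, lowercase."""
    text = unicodedata.normalize("NFKD", title)
    # Assumption: drop combining marks so accented and plain forms compare equal.
    text = "".join(ch for ch in text if not unicodedata.combining(ch))
    text = text.replace("~", " ")                 # LaTeX non-breaking space
    text = re.sub(r"\\[a-zA-Z]+\s*", "", text)    # commands such as \emph
    text = re.sub(r"[{}$\\]", "", text)           # braces, math delimiters, stray backslashes
    return re.sub(r"\s+", " ", text).strip().lower()


def decide(title_score: float, author_score: float, year_match: bool):
    """Return (status, warnings) for one matched entry."""
    if not year_match:
        # A year mismatch forces MISMATCH regardless of other scores.
        return "MISMATCH", ["year differs"]
    if title_score < TITLE_THRESHOLD:
        return "MISMATCH", ["title score below threshold"]
    if author_score < AUTHOR_WARN_FLOOR:
        # Assumption: scores below the warning band count as a mismatch.
        return "MISMATCH", ["author score below warning band"]
    warnings = []
    if author_score < AUTHOR_THRESHOLD:
        warnings.append("author score in warning band")
    return "VERIFIED", warnings
```

Separating normalization from the decision keeps the thresholds testable without any network access or fuzzy-matching library.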

PDF Input

When given a .pdf file, citation-checker extracts the bibliography section using PyMuPDF and auto-detects the reference list format, for example:

  • Numbered ([1] Author, A. Title. Venue, year.) — brackets or parenthesised numbers
  • Author–year (Surname, A. (year). Title. Venue.) — common in economics and some CS venues
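Format detection of this kind can be approximated by counting which pattern matches more reference lines. The regular expressions and the `detect_format` helper below are illustrative assumptions that cover only the two formats listed above; the tool's actual heuristics are not documented here.

```python
import re

# "[1] ..." or "(1) ..." at the start of a reference.
NUMBERED = re.compile(r"^\s*(?:\[\d+\]|\(\d+\))\s+\S")
# "Surname, A. (1996)." at the start of a reference; the line must begin with a letter.
AUTHOR_YEAR = re.compile(r"^\s*[^\W\d_][^()]*?\((?:19|20)\d{2}[a-z]?\)\.")


def detect_format(lines):
    """Return 'numbered', 'author_year', or 'unknown' by majority vote."""
    numbered = sum(bool(NUMBERED.match(line)) for line in lines)
    author_year = sum(bool(AUTHOR_YEAR.match(line)) for line in lines)
    if numbered == 0 and author_year == 0:
        return "unknown"
    return "numbered" if numbered >= author_year else "author_year"
```

Voting over all lines rather than inspecting only the first one makes the guess more robust to a stray heading or a wrapped line.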

Cite keys are derived from the first author's surname and the year (e.g., Wallace1996, Fisher2009). When two entries share the same base key, a letter suffix is appended to the second and later occurrences (Vaswani2017, Vaswani2017a).
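The key scheme can be sketched as below. The helper names are hypothetical, and two details are assumptions: non-letter characters are dropped from the surname, and a third duplicate continues with the suffix "b".

```python
import re
from collections import Counter
from string import ascii_lowercase


def base_key(surname: str, year: int) -> str:
    """First-author surname (letters only) followed by the year."""
    return re.sub(r"[^A-Za-z]", "", surname) + str(year)


def assign_keys(entries):
    """Map (surname, year) pairs to unique cite keys, in input order."""
    seen = Counter()
    keys = []
    for surname, year in entries:
        base = base_key(surname, year)
        n = seen[base]
        # The first occurrence keeps the bare key; later ones get a, b, c, ...
        # (this sketch supports at most 26 duplicates of one base key).
        suffix = "" if n == 0 else ascii_lowercase[n - 1]
        keys.append(base + suffix)
        seen[base] += 1
    return keys
```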

Extracted entries go through the same verification pipeline as .bib entries. No DOI or arXiv eprint is assumed unless one is found in the text.

Grey Literature

Entries on code/data hosting sites (GitHub, Zenodo, Hugging Face), government and national lab sites (nlr.gov, epa.gov, eia.gov, etc.), or corporate technical resources are classified as GREY_LITERATURE and skipped in academic database searches — they are not expected to appear in CrossRef or Semantic Scholar.

Entries whose URL points to a supported news or media domain (Bloomberg, Financial Times, NYT, Reuters, WSJ, The Guardian, BBC, Wired, MIT Technology Review, and more) are verified by fetching the page and comparing the article title. A match counts as VERIFIED.
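Routing by URL could look roughly like the following. The domain sets are a small illustrative subset built from the sites named above, with the news domain names filled in as assumptions; `classify_url` and its return labels are hypothetical and do not reflect the tool's real lists.

```python
from urllib.parse import urlparse

CODE_DATA_HOSTS = {"github.com", "zenodo.org", "huggingface.co"}
GOV_HOSTS = {"nlr.gov", "epa.gov", "eia.gov"}
NEWS_HOSTS = {
    "bloomberg.com", "ft.com", "nytimes.com", "reuters.com", "wsj.com",
    "theguardian.com", "bbc.com", "wired.com", "technologyreview.com",
}


def _matches(host: str, domains: set) -> bool:
    """True if host equals a listed domain or is a subdomain of one."""
    return any(host == d or host.endswith("." + d) for d in domains)


def classify_url(url: str) -> str:
    """Return 'news', 'grey_literature', or 'academic_search' for a URL."""
    host = (urlparse(url).hostname or "").lower()
    if _matches(host, NEWS_HOSTS):
        return "news"                # verified by comparing the page title
    if _matches(host, CODE_DATA_HOSTS | GOV_HOSTS):
        return "grey_literature"     # skipped in academic database searches
    return "academic_search"
```

Matching on the parsed hostname, rather than on a substring of the URL, avoids false positives such as a path that merely contains "github.com".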

JSON Report

{
  "meta": {
    "tool_version": "1.0.0",
    "generated_at": "...",
    "bib_file": "refs.bib",
    "total_entries": 218,
    "elapsed_seconds": 61.4,
    "thresholds": { "title_score": 85.0, "author_score": 80.0 },
    "counts": { "VERIFIED": 190, "MISMATCH": 5, "NOT_FOUND": 8, ... }
  },
  "results": [
    {
      "cite_key": "Vaswani17",
      "entry_type": "article",
      "status": "VERIFIED",
      "strategy": "doi_crossref",
      "local":  { "title": "...", "authors": [...], "year": 2017, "doi": "...", "url": null, "eprint": null },
      "remote": { "title": "...", "authors": [...], "year": 2017, "source": "crossref" },
      "scores": { "title_score": 100.0, "author_score": 95.3, "year_match": true },
      "url_reachable": null,
      "error_message": null,
      "warnings": []
    }
  ]
}
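A report in this shape can be post-processed with a few lines of Python. The sketch below assumes only the `results`, `cite_key`, and `status` fields shown above; `load_report` and `summarize` are hypothetical helper names, and the choice of which statuses count as problems is an assumption taken from the Quick Start example.

```python
import json
from collections import Counter

PROBLEM_STATUSES = {"MISMATCH", "NOT_FOUND", "ERROR"}


def load_report(path: str) -> dict:
    """Read a JSON report written with --output."""
    with open(path, encoding="utf-8") as fh:
        return json.load(fh)


def summarize(report: dict):
    """Return (status counts, list of (cite_key, status) for problem entries)."""
    results = report.get("results", [])
    counts = Counter(r["status"] for r in results)
    problems = [
        (r["cite_key"], r["status"])
        for r in results
        if r["status"] in PROBLEM_STATUSES
    ]
    return counts, problems
```

For example, `summarize(load_report("report.json"))` would give the per-status counts and the cite keys that need attention.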
