Check bibliography for hallucinations

hallubib

Check bibliography references for hallucinations. Parses .bib and .tex files, verifies each reference against online sources (OpenAlex, Semantic Scholar, Crossref, arXiv, DOI resolution), and categorizes them by confidence.

Installation

# With uv (recommended)
uv tool install hallubib

# Or with pip
pip install hallubib

For development:

git clone https://github.com/endremborza/hallubib
cd hallubib
uv sync

Usage

# Quick summary (default)
hallubib references.bib

# Detailed markdown report
hallubib paper.tex --output=md

# HTML report (opens in browser)
hallubib references.bib --output=html

# Clear the cache
hallubib --clear-cache

Output modes

| Flag | Description |
| --- | --- |
| --output=stdout (default) | Summary counts per category |
| --output=md | Detailed markdown breakdown to stdout |
| --output=html | Styled HTML report, opened in default browser |

How it works

1. Parse

The parser module handles two formats:

  • .bib files: Full BibTeX parsing with LaTeX accent normalization
  • .tex files: Extracts \bibitem entries from thebibliography environments using heuristic text parsing
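As a rough illustration of the .tex path, the sketch below pulls \bibitem entries out of a thebibliography environment with a regular expression. The names `BIBITEM_RE` and `extract_bibitems` are hypothetical and this is not hallubib's actual parser, which applies further heuristics to split the free text into fields.

```python
import re

# Hypothetical helper for illustration; not hallubib's real parser.
# Matches \bibitem{key} or \bibitem[label]{key}, then lazily captures the
# free text up to the next \bibitem or the end of the environment.
BIBITEM_RE = re.compile(
    r"\\bibitem(?:\[[^\]]*\])?\{([^}]+)\}(.*?)(?=\\bibitem|\\end\{thebibliography\})",
    re.DOTALL,
)


def extract_bibitems(tex: str) -> dict[str, str]:
    """Map each bibitem key to its whitespace-collapsed free text."""
    return {
        key.strip(): " ".join(body.split())
        for key, body in BIBITEM_RE.findall(tex)
    }


sample = r"""
\begin{thebibliography}{9}
\bibitem{knuth84} D. Knuth. Literate programming.
  \emph{The Computer Journal}, 27(2), 1984.
\bibitem[Lamport(1994)]{lamport94} L. Lamport. \LaTeX: A Document
  Preparation System. Addison-Wesley, 1994.
\end{thebibliography}
"""

entries = extract_bibitems(sample)
for key, text in entries.items():
    print(key, "->", text)
```

A regex like this only isolates the raw entry text; turning it into authors, title, and year is the heuristic part.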

2. Verify

Each reference is checked against online sources in this order:

  1. DOI validation: If a DOI is present, verify it resolves via doi.org
  2. OpenAlex lookup: Search by DOI (fast path) or by title (full-text search with ±1 year filter)
  3. arXiv search: For arXiv-linked papers or as fallback when OpenAlex yields nothing
  4. Crossref + Semantic Scholar fallback: If the reference is still not verified or auto-correctable, query both sources for broader coverage (especially of older papers without DOIs)
  5. Wider search: If still unknown, retry OpenAlex without year filter
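The ordered fallback above can be pictured as a chain of lookups tried one after another. The sketch below is a simplification under stated assumptions: `first_match` and the `*_stub` functions are hypothetical, the stubs stand in for real HTTP calls, and the chain stops at the first record returned, whereas the real flow also considers how good the match is before falling through.

```python
from typing import Callable, Optional

Reference = dict[str, str]
Lookup = Callable[[Reference], Optional[dict]]


def first_match(
    ref: Reference, lookups: list[tuple[str, Lookup]]
) -> tuple[Optional[str], Optional[dict]]:
    """Try each source in order; return the first (source name, record) found."""
    for name, lookup in lookups:
        record = lookup(ref)
        if record is not None:
            return name, record
    return None, None


# Stand-in lookups (assumptions): real ones would query each service's API.
def openalex_stub(ref: Reference) -> Optional[dict]:
    return None  # pretend OpenAlex found nothing


def arxiv_stub(ref: Reference) -> Optional[dict]:
    # Pretend arXiv knows any reference that carries an arXiv identifier.
    return {"title": ref["title"], "year": "2017"} if "arxiv" in ref else None


def crossref_stub(ref: Reference) -> Optional[dict]:
    return None


chain = [("openalex", openalex_stub), ("arxiv", arxiv_stub), ("crossref", crossref_stub)]
source, record = first_match(
    {"title": "Attention Is All You Need", "arxiv": "1706.03762"}, chain
)
print(source, record)
```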

URL-only references (GitHub repos, websites) are validated for reachability instead of bibliographic matching.

API calls run concurrently (thread pool) for speed.
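A minimal sketch of concurrent checking with the standard library's thread pool follows. `check_reference` is a hypothetical placeholder for a network-bound lookup, and the default of 6 workers mirrors the cap mentioned under Known Limitations; how hallubib structures its workers internally is not shown here.

```python
from concurrent.futures import ThreadPoolExecutor


def check_reference(key: str) -> tuple[str, str]:
    # Placeholder for a network-bound lookup (assumption); returns (key, status).
    status = "verified" if key.endswith("ok") else "unknown"
    return key, status


def check_all(keys: list[str], max_workers: int = 6) -> dict[str, str]:
    """Run the checks in a thread pool; pool.map keeps the input order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return dict(pool.map(check_reference, keys))


results = check_all(["a_ok", "b_missing", "c_ok"])
print(results)
```

Threads suit this workload because the time is spent waiting on HTTP responses rather than on computation.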

3. Categorize

Each reference is assigned one of five statuses:

| Status | Meaning |
| --- | --- |
| Unknown | No plausible match found online |
| Needs attention | Partial match: ambiguous, may be wrong edition or different paper |
| Auto-correctable | Match found but some fields differ (e.g., volume, year, journal name) |
| URL reference | Not a traditional article; URL validated for reachability |
| Verified | Match found; all fields consistent or only missing optional info (DOI, issue number) |

Output is ordered most-problematic-first for easy triage.
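One way to get a most-problematic-first ordering is to give each status a severity rank and sort on it. The sketch below assumes the table's row order is the triage order; the `Status` enum and `triage_order` are illustrative names, not hallubib's API.

```python
from enum import IntEnum


class Status(IntEnum):
    # Lower value = more problematic (assumed to follow the table's row order).
    UNKNOWN = 0
    NEEDS_ATTENTION = 1
    AUTO_CORRECTABLE = 2
    URL_REFERENCE = 3
    VERIFIED = 4


def triage_order(results: dict[str, Status]) -> list[str]:
    """Sort citation keys by severity, breaking ties alphabetically."""
    return [key for key, _ in sorted(results.items(), key=lambda kv: (kv[1], kv[0]))]


ordered = triage_order(
    {
        "smith20": Status.VERIFIED,
        "doe19": Status.UNKNOWN,
        "repo": Status.URL_REFERENCE,
        "lee18": Status.NEEDS_ATTENTION,
    }
)
print(ordered)
```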

Matching uses:

  • Title similarity (normalized, accent-stripped, fuzzy matching)
  • First-author last name matching
  • Year tolerance (±1 year for preprint/publication date differences)
  • Journal name fuzzy matching with 41K+ abbreviation database
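The title and year criteria can be sketched with the standard library alone. This is a simplified stand-in, not the tool's matcher: `normalize_title`, `title_similarity`, and `years_compatible` are hypothetical names, only Unicode accents are stripped (LaTeX accent macros are not handled), and `difflib.SequenceMatcher` is one possible similarity measure among several.

```python
import re
import unicodedata
from difflib import SequenceMatcher


def normalize_title(title: str) -> str:
    """Strip accents, lowercase, and collapse punctuation to single spaces."""
    decomposed = unicodedata.normalize("NFKD", title)
    # Drop combining marks so that e.g. "é" compares equal to "e".
    unaccented = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return re.sub(r"[^a-z0-9]+", " ", unaccented.lower()).strip()


def title_similarity(a: str, b: str) -> float:
    """Similarity in [0, 1] between two normalized titles."""
    return SequenceMatcher(None, normalize_title(a), normalize_title(b)).ratio()


def years_compatible(local: int, online: int, tolerance: int = 1) -> bool:
    """Accept small gaps such as preprint vs. publication year."""
    return abs(local - online) <= tolerance


print(normalize_title("Schrödinger's Cat: A Review"))
print(title_similarity("Café Society", "Cafe society"))
print(years_compatible(2019, 2020), years_compatible(2018, 2020))
```

In practice a similarity threshold would be chosen empirically; none is suggested here because the tool's own threshold is not documented.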

Field differences are classified as:

  • Corrections: local value conflicts with online value
  • Supplements: local value missing, online has it (shown as (missing))
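The corrections/supplements split can be illustrated with a small comparison function. `diff_fields` is a hypothetical name, the comparison is a plain case-insensitive string match (the tool's own comparison is fuzzier), and the example entry, including its DOI, is made up.

```python
def diff_fields(
    local: dict[str, str], online: dict[str, str]
) -> tuple[dict[str, tuple[str, str]], dict[str, tuple[str, str]]]:
    """Split differences into corrections (conflict) and supplements (missing locally)."""
    corrections: dict[str, tuple[str, str]] = {}
    supplements: dict[str, tuple[str, str]] = {}
    for field, online_value in online.items():
        local_value = local.get(field, "").strip()
        if not local_value:
            supplements[field] = ("(missing)", online_value)
        elif local_value.casefold() != online_value.strip().casefold():
            corrections[field] = (local_value, online_value)
    return corrections, supplements


# Made-up entry for illustration only.
local_entry = {"title": "An Example Paper", "year": "2019", "volume": "12"}
online_entry = {
    "title": "An example paper",
    "year": "2020",
    "volume": "12",
    "doi": "10.1234/example",
}

corrections, supplements = diff_fields(local_entry, online_entry)
print("corrections:", corrections)
print("supplements:", supplements)
```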

4. Output

  • stdout: Compact counts, one line per category
  • markdown: Grouped by status, with per-reference match details and field diffs
  • html: Color-coded cards with dark/light mode support, no external dependencies

Year discrepancies

When the local year differs from the online record by exactly 1 year, the tool notes this as a potential online-first vs. print publication difference. This is common: a paper may be published online in December 2019 but appear in the January 2020 print issue.

Known examples from test data:

  • VOSviewer (doi:10.1007/s11192-009-0146-3): DOI landing page shows 2010, OpenAlex records 2009
  • CiteSpace II (doi:10.1002/asi.20317): published 2006, OpenAlex records 2005
  • Gusenbauer (pubmed:31614060): published 2020, online-first 2019

These references are accepted as auto-correctable rather than flagged as errors, with the year discrepancy noted in the output.
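The one-year rule amounts to a small check on the gap between the two years. The function name and message wording below are assumptions for illustration; the tool's actual output text may differ.

```python
from typing import Optional


def year_note(local_year: int, online_year: int) -> Optional[str]:
    """Return a note describing the year gap, or None when the years agree."""
    gap = abs(local_year - online_year)
    if gap == 0:
        return None
    if gap == 1:
        # A one-year gap is treated as a likely online-first vs. print difference.
        return (
            f"year differs by 1 ({local_year} vs {online_year}): "
            "possibly online-first vs. print"
        )
    return f"year mismatch ({local_year} vs {online_year})"


print(year_note(2010, 2009))
print(year_note(2010, 2010))
print(year_note(2010, 2005))
```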

Journal abbreviation database

The tool ships with a 41K+ journal abbreviation database (hallubib/data/journal_abbrevs.csv.gz) sourced from JabRef's open abbreviation lists. This enables fuzzy matching between abbreviated and full journal names.

To rebuild the database:

python scripts/build_journal_abbrevs.py
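To show how an abbreviation table can support journal matching, the sketch below loads a gzip-compressed CSV and compares names after expanding abbreviations. The two-column layout (full name, abbreviation) is an assumption about the file format, the helper names are hypothetical, and the lookup here is exact after normalization rather than fuzzy. The demo data is built in memory instead of reading the shipped file.

```python
import csv
import gzip
import io


def _norm(name: str) -> str:
    """Lowercase, drop periods, and collapse whitespace."""
    return " ".join(name.lower().replace(".", " ").split())


def load_abbrevs(raw_gz: bytes) -> dict[str, str]:
    """Map normalized abbreviation -> normalized full name.

    Assumes rows of the form: full name, abbreviation.
    """
    text = gzip.decompress(raw_gz).decode("utf-8")
    table: dict[str, str] = {}
    for row in csv.reader(io.StringIO(text)):
        if len(row) >= 2:
            full, abbrev = row[0], row[1]
            table[_norm(abbrev)] = _norm(full)
    return table


def same_journal(a: str, b: str, table: dict[str, str]) -> bool:
    """Compare two journal names after expanding known abbreviations."""
    na, nb = _norm(a), _norm(b)
    return table.get(na, na) == table.get(nb, nb)


demo = gzip.compress(
    b"American Economic Review,Am. Econ. Rev.\nScientometrics,Scientometrics\n"
)
table = load_abbrevs(demo)
print(same_journal("Am. Econ. Rev.", "American Economic Review", table))
```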

Caching

API responses are cached in ~/.cache/hallubib/ (respects $XDG_CACHE_HOME) with a 30-day TTL. This avoids redundant network requests across runs.

To clear the cache manually:

hallubib --clear-cache
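The cache location and expiry rule described above can be sketched as follows. The function names are hypothetical and the on-disk layout inside the directory is not shown; only the $XDG_CACHE_HOME fallback and the 30-day TTL come from the description.

```python
import os
import time
from pathlib import Path
from typing import Mapping, Optional

TTL_SECONDS = 30 * 24 * 3600  # 30 days


def cache_dir(env: Optional[Mapping[str, str]] = None) -> Path:
    """Resolve the cache directory, honoring $XDG_CACHE_HOME when it is set."""
    env = os.environ if env is None else env
    base = env.get("XDG_CACHE_HOME") or str(Path.home() / ".cache")
    return Path(base) / "hallubib"


def is_fresh(stored_at: float, now: Optional[float] = None) -> bool:
    """True while a cached entry is younger than the TTL."""
    now = time.time() if now is None else now
    return (now - stored_at) < TTL_SECONDS


print(cache_dir({"XDG_CACHE_HOME": "/tmp/xdg"}))
print(is_fresh(stored_at=0.0, now=TTL_SECONDS - 1))
```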

Dependencies

Only one runtime dependency:

  • requests — HTTP client for API calls

Features

  • Parses both .bib (structured BibTeX) and .tex (\bibitem free-text) formats
  • Verifies against OpenAlex, Semantic Scholar, Crossref, and arXiv with DOI cross-validation
  • Crossref and Semantic Scholar fallback, plus a wider OpenAlex search without the year filter, for papers not found initially
  • URL-only reference detection with reachability validation (GitHub, websites)
  • GitHub repository and arXiv detection as extensible special cases (special.py)
  • Concurrent API lookups via thread pool
  • Disk caching with configurable TTL
  • Three output formats: terminal summary, markdown, styled HTML
  • HTML report with dark/light mode support
  • LaTeX accent/unicode normalization for author and title comparison
  • 41K+ journal abbreviation database from JabRef
  • Field diffs classified as corrections vs. supplements
  • Year discrepancy detection (online-first vs. print)

Known Limitations & TODOs

  • "et al." handling in verification: When a .bib entry ends its author list with "and others", only the listed authors are compared. The matcher should weight the first author more heavily in these cases (partially implemented).
  • Minor misspellings in names: Author name comparison strips accents and compares last names, but does not do fuzzy/edit-distance matching on names. A Levenshtein threshold could catch Thomson vs Thompson.
  • Auto-apply corrections: Add a --fix flag that writes corrected entries back to the .bib file.
  • Rate limiting: API sources are polled concurrently with a thread pool cap of 6. For very large bibliographies (100+ entries), more sophisticated rate limiting or backoff may be needed.
  • \cite{} extraction from .tex: Currently only \bibitem entries in thebibliography environments are parsed. Support for \cite{key} + external .bib file resolution is not yet implemented.
  • BibTeX output mode: Generate a corrected .bib file with suggested fixes applied.
  • Hard-to-find papers: Some papers remain hard to find across all sources. In test data, mongell91 (Mongell & Roth, "Sorority rush as a two-sided matching mechanism", Am. Econ. Rev. 1991) could not be matched by any source.
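The edit-distance idea from the name-misspelling item above could look like the sketch below. This is not implemented in the tool; `levenshtein` and `names_close` are hypothetical names, and the threshold of 1 is only an illustrative choice.

```python
def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance (insert, delete, substitute)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(
                min(
                    prev[j] + 1,               # deletion
                    curr[j - 1] + 1,           # insertion
                    prev[j - 1] + (ca != cb),  # substitution (free if equal)
                )
            )
        prev = curr
    return prev[-1]


def names_close(a: str, b: str, max_distance: int = 1) -> bool:
    """Treat last names within max_distance edits as the same person."""
    return levenshtein(a.casefold(), b.casefold()) <= max_distance


print(levenshtein("Thomson", "Thompson"))
print(names_close("Thomson", "Thompson"), names_close("Smith", "Jones"))
```

A small threshold keeps false positives down on short names, where a single edit can turn one common surname into another.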

Possible future sources

Additional APIs that could improve coverage further:

| Source | Notes |
| --- | --- |
| DBLP | Free, no auth. CS-only (~6M entries). Useful if targeting CS bibliographies. |
| PubMed / NCBI E-utilities | Free (3 requests/s without an API key, 10 requests/s with one). Biomedical only. |
| OpenCitations | Free, fully open. Citation graph metadata, less useful for discovery by title. |
| Scopus | Broad coverage (~90M records), but requires institutional API key. |
| Google Scholar | Best coverage overall, but no API; scraping violates TOS. |
| JSTOR | No free public lookup API. Data for Research (DfR) is bulk-download only; XML Gateway requires institutional license. |

Running tests

uv run pytest                          # offline tests
uv run pytest -m network               # include network integration tests

License

MIT
