Check bibliography for hallucinations
hallubib
Check bibliography references for hallucinations. Parses .bib and .tex files, verifies each reference against online sources (OpenAlex, Semantic Scholar, Crossref, arXiv, DOI resolution), and categorizes them by confidence.
Installation
# With uv (recommended)
uv tool install hallubib
# Or with pip
pip install hallubib
For development:
git clone https://github.com/endremborza/hallubib
cd hallubib
uv sync
Usage
# Quick summary (default)
hallubib references.bib
# Detailed markdown report
hallubib paper.tex --output=md
# HTML report (opens in browser)
hallubib references.bib --output=html
# Clear the cache
hallubib --clear-cache
Output modes
| Flag | Description |
|---|---|
| --output=stdout | (default) Summary counts per category |
| --output=md | Detailed markdown breakdown to stdout |
| --output=html | Styled HTML report, opened in default browser |
How it works
1. Parse
The parser module handles two formats:
- .bib files: Full BibTeX parsing with LaTeX accent normalization
- .tex files: Extracts \bibitem entries from thebibliography environments using heuristic text parsing
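As a rough illustration of the .tex path, the sketch below pulls \bibitem keys and their free text out of a thebibliography environment with regular expressions. The function name `extract_bibitems` and both patterns are assumptions made for this example; they are not hallubib's actual parser, which may use different heuristics.

```python
import re

# Hypothetical patterns for illustration only, not hallubib's real parser.
BIB_ENV = re.compile(
    r"\\begin\{thebibliography\}.*?\\end\{thebibliography\}", re.DOTALL
)
# Matches \bibitem{key} and \bibitem[label]{key}; only the key is captured.
BIBITEM = re.compile(r"\\bibitem(?:\[[^\]]*\])?\{([^}]*)\}")


def extract_bibitems(tex: str) -> dict[str, str]:
    """Map each bibitem key to its whitespace-collapsed free-text body."""
    entries: dict[str, str] = {}
    for env in BIB_ENV.findall(tex):
        # Drop the closing \end{...} so it does not leak into the last entry.
        body = env.rsplit(r"\end{thebibliography}", 1)[0]
        # With one capture group, split yields [preamble, key1, text1, key2, ...].
        parts = BIBITEM.split(body)
        for key, text in zip(parts[1::2], parts[2::2]):
            entries[key.strip()] = " ".join(text.split())
    return entries
```

The free text still has to be split into authors, title, and year afterwards, which is where the heuristic part of the parsing comes in.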
2. Verify
Each reference is checked against online sources in this order:
- DOI validation: If a DOI is present, verify it resolves via doi.org
- OpenAlex lookup: Search by DOI (fast path) or by title (full-text search with ±1 year filter)
- arXiv search: For arXiv-linked papers or as fallback when OpenAlex yields nothing
- Crossref + Semantic Scholar fallback: If not yet verified/auto-correctable, search both for broader coverage (especially older papers without DOIs)
- Wider search: If still unknown, retry OpenAlex without year filter
URL-only references (GitHub repos, websites) are validated for reachability instead of bibliographic matching.
API calls run concurrently (thread pool) for speed.
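The ordered fallback and the concurrent execution described above can be sketched as follows. This is a minimal illustration under assumptions: `verify_one`, `verify_all`, and the `(name, lookup)` source pairs are hypothetical names, and real lookups would be HTTP calls rather than the plain callables used here. The default of 6 workers mirrors the thread pool cap mentioned under Known Limitations.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Optional

# A lookup takes a parsed reference and returns a match dict or None.
Lookup = Callable[[dict], Optional[dict]]


def verify_one(ref: dict, sources: list[tuple[str, Lookup]]) -> tuple[str, Optional[dict]]:
    """Try each source in order; return (source_name, match) for the first hit."""
    for name, lookup in sources:
        try:
            match = lookup(ref)
        except Exception:
            match = None  # treat a failing source as "no result" and move on
        if match is not None:
            return name, match
    return "none", None


def verify_all(refs: list[dict], sources: list[tuple[str, Lookup]], max_workers: int = 6):
    """Verify references concurrently; results keep the input order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(lambda r: verify_one(r, sources), refs))
```

Because each reference walks the source list independently, a slow or failing source only delays that one reference rather than the whole run.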
3. Categorize
Each reference is assigned one of five statuses:
| Status | Meaning |
|---|---|
| Unknown | No plausible match found online |
| Needs attention | Partial match — ambiguous, may be wrong edition or different paper |
| Auto-correctable | Match found but some fields differ (e.g., volume, year, journal name) |
| URL reference | Not a traditional article — URL validated for reachability |
| Verified | Match found; all fields consistent or only missing optional info (DOI, issue number) |
Output is ordered most-problematic-first for easy triage.
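One simple way to get a most-problematic-first ordering is to give each status a numeric severity rank and sort on it. The sketch below assumes the table order above is also the triage order; the `Status` enum and `triage_order` helper are hypothetical names, not part of hallubib's public API.

```python
from enum import IntEnum


class Status(IntEnum):
    # Lower value = more problematic (assumed to follow the table order).
    UNKNOWN = 0
    NEEDS_ATTENTION = 1
    AUTO_CORRECTABLE = 2
    URL_REFERENCE = 3
    VERIFIED = 4


def triage_order(results: list[tuple[str, Status]]) -> list[tuple[str, Status]]:
    """Sort (key, status) pairs so the most problematic come first.

    sorted() is stable, so entries with the same status keep their
    original bibliography order.
    """
    return sorted(results, key=lambda item: item[1])
```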
Matching uses:
- Title similarity (normalized, accent-stripped, fuzzy matching)
- First-author last name matching
- Year tolerance (±1 year for preprint/publication date differences)
- Journal name fuzzy matching with 41K+ abbreviation database
Field differences are classified as:
- Corrections: local value conflicts with online value
- Supplements: local value missing, online has it (shown as (missing))
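The normalization and the corrections/supplements split can be sketched with the standard library alone. This is an illustration under assumptions: `normalize`, `title_similarity`, and `diff_fields` are hypothetical helpers, and `difflib.SequenceMatcher` stands in for whatever fuzzy matcher hallubib actually uses.

```python
import re
import unicodedata
from difflib import SequenceMatcher


def normalize(text: str) -> str:
    """Strip accents, lowercase, and reduce punctuation to single spaces."""
    decomposed = unicodedata.normalize("NFKD", text)
    stripped = "".join(c for c in decomposed if not unicodedata.combining(c))
    return " ".join(re.sub(r"[^a-z0-9]+", " ", stripped.lower()).split())


def title_similarity(a: str, b: str) -> float:
    """Similarity ratio in [0, 1] between two normalized titles."""
    return SequenceMatcher(None, normalize(a), normalize(b)).ratio()


def diff_fields(local: dict, online: dict) -> tuple[dict, dict]:
    """Split field differences into corrections and supplements."""
    corrections, supplements = {}, {}
    for field, online_value in online.items():
        local_value = local.get(field)
        if not local_value:
            # Local value missing, online has it.
            supplements[field] = online_value
        elif normalize(str(local_value)) != normalize(str(online_value)):
            # Local value conflicts with the online record.
            corrections[field] = (local_value, online_value)
    return corrections, supplements
```

Comparing normalized values means that differences in case, accents, or punctuation alone are not reported as corrections.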
4. Output
- stdout: Compact counts, one line per category
- markdown: Grouped by status, with per-reference match details and field diffs
- html: Color-coded cards with dark/light mode support, no external dependencies
Year discrepancies
When the local year differs from the online record by exactly 1 year, the tool notes this as a potential online-first vs. print publication difference. This is common: a paper may be published online in December 2019 but appear in the January 2020 print issue.
Known examples from test data:
- VOSviewer (doi:10.1007/s11192-009-0146-3): DOI landing page shows 2010, OpenAlex records 2009
- CiteSpace II (doi:10.1002/asi.20317): published 2006, OpenAlex records 2005
- Gusenbauer (pubmed:31614060): published 2020, online-first 2019
These references are accepted as auto-correctable rather than flagged as errors, with the year discrepancy noted in the output.
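The one-year rule is small enough to state directly in code. The helper below is a hypothetical sketch of that rule, with made-up return labels; it is not taken from hallubib's source.

```python
def compare_years(local_year: int, online_year: int) -> str:
    """Classify a year difference between the local entry and the online record."""
    gap = abs(local_year - online_year)
    if gap == 0:
        return "match"
    if gap == 1:
        # Likely online-first vs. print publication; note it, do not reject.
        return "online-first-or-print"
    return "mismatch"
```

For the VOSviewer example above, `compare_years(2010, 2009)` falls into the one-year case.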
Journal abbreviation database
The tool ships with a 41K+ journal abbreviation database (hallubib/data/journal_abbrevs.csv.gz) sourced from JabRef's open abbreviation lists. This enables fuzzy matching between abbreviated and full journal names.
To rebuild the database:
python scripts/build_journal_abbrevs.py
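To show how an abbreviation table enables matching between abbreviated and full names, here is a minimal sketch. It assumes a gzip-compressed CSV with two columns (full name, abbreviation); the real layout of journal_abbrevs.csv.gz may differ, and `load_abbrevs` and `same_journal` are hypothetical names.

```python
import csv
import gzip


def _key(name: str) -> str:
    """Canonical lookup key: lowercase letters and digits only."""
    return "".join(ch for ch in name.lower() if ch.isalnum())


def load_abbrevs(path: str) -> dict[str, str]:
    """Map both the abbreviation and the full name to one canonical key.

    Assumed row layout: full name, abbreviation.
    """
    table: dict[str, str] = {}
    with gzip.open(path, "rt", encoding="utf-8", newline="") as fh:
        for row in csv.reader(fh):
            if len(row) < 2:
                continue  # skip malformed rows
            full, abbrev = row[0], row[1]
            table[_key(full)] = _key(full)
            table[_key(abbrev)] = _key(full)
    return table


def same_journal(a: str, b: str, table: dict[str, str]) -> bool:
    """True if both names resolve to the same canonical journal key."""
    ka, kb = _key(a), _key(b)
    return table.get(ka, ka) == table.get(kb, kb)
```

Names that are not in the table fall back to their own normalized key, so two identical unknown journal names still compare equal.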
Caching
API responses are cached in ~/.cache/hallubib/ (respects $XDG_CACHE_HOME) with a 30-day TTL. This avoids redundant network requests across runs.
hallubib --clear-cache
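The cache location and expiry rule described above can be sketched as follows. This is an assumption-laden illustration: `cache_dir` and `is_fresh` are hypothetical helpers, and using file modification time for the TTL check is one possible approach, not necessarily the one hallubib takes.

```python
import os
import time
from pathlib import Path
from typing import Mapping, Optional

TTL_SECONDS = 30 * 24 * 60 * 60  # 30 days


def cache_dir(env: Optional[Mapping[str, str]] = None) -> Path:
    """Return $XDG_CACHE_HOME/hallubib, falling back to ~/.cache/hallubib."""
    env = os.environ if env is None else env
    base = env.get("XDG_CACHE_HOME") or str(Path.home() / ".cache")
    return Path(base) / "hallubib"


def is_fresh(path: Path, now: Optional[float] = None, ttl: float = TTL_SECONDS) -> bool:
    """True if the cached file exists and is younger than the TTL."""
    now = time.time() if now is None else now
    return path.exists() and (now - path.stat().st_mtime) < ttl
```

Passing `env` and `now` explicitly keeps the helpers easy to test without touching the real environment or clock.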
Dependencies
Only one runtime dependency:
- requests: HTTP client for API calls
Features
- Parses both .bib (structured BibTeX) and .tex (\bibitem free-text) formats
- Verifies against OpenAlex, Semantic Scholar, Crossref, and arXiv with DOI cross-validation
- Crossref fallback and wider search for papers not found initially
- URL-only reference detection with reachability validation (GitHub, websites)
- GitHub repository and arXiv detection as extensible special cases (special.py)
- Concurrent API lookups via thread pool
- Disk caching with configurable TTL
- Three output formats: terminal summary, markdown, styled HTML
- HTML report with dark/light mode support
- LaTeX accent/unicode normalization for author and title comparison
- 41K+ journal abbreviation database from JabRef
- Field diffs classified as corrections vs. supplements
- Year discrepancy detection (online-first vs. print)
Known Limitations & TODOs
- "et al." handling in verification: When a .bib entry uses "and others", only the listed authors are compared. The matcher should weight first-author more heavily in these cases (partially implemented).
- Minor misspellings in names: Author name comparison strips accents and compares last names, but does not do fuzzy/edit-distance matching on names. A Levenshtein threshold could catch Thomson vs Thompson.
- Auto-apply corrections: Add a --fix flag that writes corrected entries back to the .bib file.
- Rate limiting: API sources are polled concurrently with a thread pool cap of 6. For very large bibliographies (100+ entries), more sophisticated rate limiting or backoff may be needed.
- \cite{} extraction from .tex: Currently only \bibitem entries in thebibliography environments are parsed. Support for \cite{key} + external .bib file resolution is not yet implemented.
- BibTeX output mode: Generate a corrected .bib file with suggested fixes applied.
- Hard-to-find papers: Some papers remain hard to find across all sources. In test data, mongell91 (Mongell & Roth, "Sorority rush as a two-sided matching mechanism", Am. Econ. Rev. 1991) could not be matched by any source.
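For the name-misspelling item, a Levenshtein threshold could look roughly like the sketch below. It is not implemented in hallubib; `levenshtein` and `names_close` are hypothetical helpers, and the threshold of 1 is an arbitrary choice for illustration.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance (insert, delete, substitute) via row-by-row dynamic programming."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(
                prev[j] + 1,               # delete ca
                cur[j - 1] + 1,            # insert cb
                prev[j - 1] + (ca != cb),  # substitute, free if equal
            ))
        prev = cur
    return prev[-1]


def names_close(a: str, b: str, max_distance: int = 1) -> bool:
    """True if two last names differ by at most max_distance edits."""
    return levenshtein(a.lower(), b.lower()) <= max_distance
```

A threshold this tight catches Thomson vs Thompson but would also merge distinct short names that differ by one letter, so it would likely need to scale with name length.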
Possible future sources
Additional APIs that could improve coverage further:
| Source | Notes |
|---|---|
| DBLP | Free, no auth. CS-only (~6M entries). Useful if targeting CS bibliographies. |
| PubMed / NCBI E-utilities | Free (3 requests/second without an API key, 10 with one). Biomedical only. |
| OpenCitations | Free, fully open. Citation graph metadata, less useful for discovery by title. |
| Scopus | Broad coverage (~90M records), but requires institutional API key. |
| Google Scholar | Best coverage overall, but no API — scraping violates TOS. |
| JSTOR | No free public lookup API. Data for Research (DfR) is bulk-download only; XML Gateway requires institutional license. |
Running tests
uv run pytest # offline tests
uv run pytest -m network # include network integration tests
License
MIT