Verify .bib file citations against academic databases (Semantic Scholar, DBLP, Open Library)
HaRC - Hallucinated Reference Checker
A Python library that verifies BibTeX entries against Semantic Scholar. It identifies citations in your .bib files that may be incorrect, misspelled, or absent from the academic literature.
Installation
uv sync
Quick Start
Python API
from reference_checker import check_citations
# Check a .bib file - returns entries that weren't found/verified
not_found = check_citations("references.bib")
for result in not_found:
    print(f"{result.entry.key}: {result.message}")
Command Line
# Basic usage
uv run harc references.bib
# With verbose output
uv run harc references.bib --verbose
# Custom author match threshold
uv run harc references.bib --threshold 0.7
# With Semantic Scholar API key (for higher rate limits)
uv run harc references.bib --api-key YOUR_API_KEY
How It Works
- Parse - Reads your .bib file and extracts entries with normalized author names
- Search - Queries Semantic Scholar for each paper by title
- Match - Compares authors using fuzzy matching to handle name variations
- Report - Returns entries that couldn't be verified
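The Search step can be pictured as building one title query per entry. The sketch below only constructs the request URL and headers and sends nothing. It assumes the public Semantic Scholar Graph API paper search endpoint and its `x-api-key` header; whether HaRC uses this exact endpoint or these fields is an assumption, and `build_search_request` is a hypothetical helper, not part of the library.

```python
from __future__ import annotations

from urllib.parse import urlencode

# Assumption: the public Graph API search endpoint; HaRC's client may differ.
SEARCH_URL = "https://api.semanticscholar.org/graph/v1/paper/search"


def build_search_request(title: str, api_key: str | None = None) -> tuple[str, dict[str, str]]:
    """Return (url, headers) for a title search. Hypothetical helper."""
    params = {"query": title, "fields": "title,authors,year", "limit": 1}
    headers = {"x-api-key": api_key} if api_key else {}
    return f"{SEARCH_URL}?{urlencode(params)}", headers


url, headers = build_search_request("Attention Is All You Need")
print(url)
print(headers)
```

Keeping request construction separate from sending makes the query easy to inspect and test without touching the network.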
A paper is considered "found" when:
- Semantic Scholar returns a result for the title
- Author match score meets the threshold (default: 60%)
- Year matches within tolerance (default: ±1 year)
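The criteria above can be read as a single decision applied to the best search hit. This is a minimal sketch, not the library's code: `is_verified` is a hypothetical name, and the handling of a missing or non-numeric year (the year check is skipped) is an assumption.

```python
from __future__ import annotations


def is_verified(
    author_score: float,
    bib_year: str | None,
    found_year: int | None,
    author_threshold: float = 0.6,
    year_tolerance: int = 1,
) -> bool:
    """Apply the author-score and year checks to one search hit."""
    if author_score < author_threshold:
        return False
    # Assumption: a missing or non-numeric year on either side skips the year check.
    if bib_year is None or found_year is None or not bib_year.strip().isdigit():
        return True
    return abs(int(bib_year) - found_year) <= year_tolerance


print(is_verified(0.85, "2023", 2022))  # author score and year both pass
print(is_verified(0.85, "2023", 2020))  # year is outside the tolerance
print(is_verified(0.40, "2023", 2023))  # author score is below the threshold
```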
API Reference
check_citations()
def check_citations(
    bib_file: str,
    author_threshold: float = 0.6,
    year_tolerance: int = 1,
    api_key: str | None = None,
    verbose: bool = False,
) -> list[CheckResult]:
Parameters:
- bib_file - Path to the .bib file
- author_threshold - Minimum author match score (0.0-1.0) to consider verified
- year_tolerance - Allowed year difference between bib entry and found paper
- api_key - Optional Semantic Scholar API key for higher rate limits
- verbose - Print progress information
Returns: List of CheckResult objects for entries that were NOT found/verified.
CheckResult
@dataclass
class CheckResult:
    entry: BibEntry             # The original bib entry
    found: bool                 # Whether the paper was verified
    matched_paper: dict | None  # Semantic Scholar data if found
    author_match_score: float   # 0.0-1.0 confidence score
    message: str                # Human-readable status
BibEntry
@dataclass
class BibEntry:
    key: str            # BibTeX key (e.g., "smith2023")
    title: str          # Paper title
    authors: list[str]  # Normalized author names
    year: str | None    # Publication year
    raw_entry: dict     # Original bibtexparser dict
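One way to work with these fields is to split unverified results into entries with no search hit and entries that had a hit but failed a check. The sketch below redefines the two dataclasses as local stand-ins so that it runs without the package installed. The sample entries and messages are made up, `triage` is a hypothetical helper, and it assumes that `matched_paper` is `None` exactly when the search returned nothing.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass
class BibEntry:  # local stand-in mirroring the documented fields
    key: str
    title: str
    authors: list[str]
    year: str | None
    raw_entry: dict


@dataclass
class CheckResult:  # local stand-in mirroring the documented fields
    entry: BibEntry
    found: bool
    matched_paper: dict | None
    author_match_score: float
    message: str


def triage(results: list[CheckResult]) -> tuple[list[CheckResult], list[CheckResult]]:
    """Split results into (no search hit, hit that failed a check)."""
    # Assumption: matched_paper is None only when nothing was returned.
    missing = [r for r in results if r.matched_paper is None]
    mismatched = [r for r in results if r.matched_paper is not None]
    return missing, mismatched


# Made-up sample data for illustration only.
results = [
    CheckResult(BibEntry("smith2023", "A Made-Up Title", ["john smith"], "2023", {}),
                False, None, 0.0, "not found"),
    CheckResult(BibEntry("doe2021", "Another Made-Up Title", ["jane doe"], "2021", {}),
                False, {"title": "Another Made-Up Title", "year": 2021}, 0.33,
                "author mismatch"),
]

missing, mismatched = triage(results)
for r in missing:
    print(f"{r.entry.key}: no match ({r.message})")
for r in mismatched:
    print(f"{r.entry.key}: score {r.author_match_score:.2f} ({r.message})")
```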
Author Matching
The library handles common author name variations:
- Last, First format: "Smith, John" → "john smith"
- First Last format: "John Smith" → "john smith"
- Initials: "J. Smith" matches "John Smith"
- Protected names: "{van der Berg}, Jan" → "jan van der berg"
Fuzzy matching accounts for minor spelling differences and missing middle names.
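The normalization and matching rules above can be approximated with the standard library alone. This is a sketch under stated assumptions, not the code in matcher.py: the function names, the 0.85 similarity cutoff, the use of `difflib.SequenceMatcher`, and the scoring rule (fraction of bib authors with a match) are all illustrative choices.

```python
from difflib import SequenceMatcher


def normalize_author(name: str) -> str:
    """Strip BibTeX braces, flip 'Last, First', lowercase, collapse spaces."""
    name = name.replace("{", "").replace("}", "")
    if "," in name:
        last, first = (part.strip() for part in name.split(",", 1))
        name = f"{first} {last}"
    return " ".join(name.lower().split())


def names_match(a: str, b: str, min_ratio: float = 0.85) -> bool:
    """Compare last tokens fuzzily; let a one-letter first name match by initial."""
    ta = normalize_author(a).replace(".", "").split()
    tb = normalize_author(b).replace(".", "").split()
    if not ta or not tb:
        return False
    if SequenceMatcher(None, ta[-1], tb[-1]).ratio() < min_ratio:
        return False
    first_a, first_b = ta[0], tb[0]
    if len(first_a) == 1 or len(first_b) == 1:
        return first_a[0] == first_b[0]
    return SequenceMatcher(None, first_a, first_b).ratio() >= min_ratio


def author_match_score(bib_authors: list[str], found_authors: list[str]) -> float:
    """Assumed scoring rule: share of bib authors matched by any found author."""
    if not bib_authors:
        return 0.0
    hits = sum(any(names_match(b, f) for f in found_authors) for b in bib_authors)
    return hits / len(bib_authors)


print(normalize_author("{van der Berg}, Jan"))
print(names_match("J. Smith", "John Smith"))
print(author_match_score(["Smith, John", "Doe, Jane"], ["J. Smith", "Alice Jones"]))
```

Middle names are ignored here because only the first and last tokens are compared, which is one simple way to tolerate "John A. Smith" against "John Smith".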
Rate Limiting
Semantic Scholar has rate limits for API access:
- Unauthenticated: Shared pool, may experience throttling
- Authenticated: Higher limits with API key
The library includes:
- Automatic delay between requests (3 seconds default)
- Exponential backoff on rate limit errors
- Up to 3 retries per request
For production use, request an API key from Semantic Scholar.
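The retry behaviour described above follows the usual exponential backoff pattern. The sketch below shows that pattern in isolation. `RateLimitError` and `call_with_backoff` are hypothetical names, and starting the backoff at 3 seconds and doubling it on each retry is an assumption; the library's actual schedule may differ. The `sleep` parameter is injectable so the example runs instantly.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


class RateLimitError(Exception):
    """Hypothetical stand-in for an HTTP 429 response."""


def call_with_backoff(
    request: Callable[[], T],
    max_retries: int = 3,
    base_delay: float = 3.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call request(), retrying on RateLimitError with doubling delays."""
    attempt = 0
    while True:
        try:
            return request()
        except RateLimitError:
            if attempt == max_retries:
                raise  # retries exhausted, surface the error
            sleep(base_delay * 2 ** attempt)  # 3s, 6s, 12s with the defaults
            attempt += 1


# Demo: a request that is rate limited twice, then succeeds.
calls = {"n": 0}
delays: list[float] = []


def flaky() -> str:
    calls["n"] += 1
    if calls["n"] <= 2:
        raise RateLimitError()
    return "ok"


result = call_with_backoff(flaky, sleep=delays.append)
print(result, delays)  # ok [3.0, 6.0]
```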
Development
Run Tests
uv sync --extra dev
uv run pytest tests/ -v
Project Structure
reference_checker/
├── __init__.py # Main API
├── models.py # Data classes
├── parser.py # BibTeX parsing
├── matcher.py # Author name matching
├── scholar.py # Semantic Scholar client
└── cli.py # Command-line interface
License
MIT
File details
Details for the file harcx-0.1.0.tar.gz.
File metadata
- Download URL: harcx-0.1.0.tar.gz
- Upload date:
- Size: 16.3 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.12.4
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | b9f682cd8be6674c2c9028856d35283857ee423249528f514df496244fe010b6 |
| MD5 | 24f5d6b7da9998c1c32b15232d26a932 |
| BLAKE2b-256 | 0e39e16cf011e89ef934fc50452ae36502d6dad9a8303513b18a370bf7fe5e2d |
File details
Details for the file harcx-0.1.0-py3-none-any.whl.
File metadata
- Download URL: harcx-0.1.0-py3-none-any.whl
- Upload date:
- Size: 18.2 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.12.4
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 15a9f830b32d567922242e14ee5f89a57f6a6bd869de7597b386d0594cf1fc97 |
| MD5 | c3f7c873905427b81ddf00ee8e69c279 |
| BLAKE2b-256 | 32f0b714d606cfcf3be6353ddcb56599a901a0316dc3bd910621bd9f7da9d17e |