Verify .bib file citations against academic databases (Semantic Scholar, DBLP, Open Library)

HaRC - Hallucinated Reference Checker

A Python library that verifies BibTeX entries against Semantic Scholar. It flags citations in your .bib files that may be incorrect, misspelled, or absent from the academic literature.

Installation

uv sync

Quick Start

Python API

from reference_checker import check_citations

# Check a .bib file - returns entries that weren't found/verified
not_found = check_citations("references.bib")

for result in not_found:
    print(f"{result.entry.key}: {result.message}")

Command Line

# Basic usage
uv run harc references.bib

# With verbose output
uv run harc references.bib --verbose

# Custom author match threshold
uv run harc references.bib --threshold 0.7

# With Semantic Scholar API key (for higher rate limits)
uv run harc references.bib --api-key YOUR_API_KEY

How It Works

  1. Parse - Reads your .bib file and extracts entries with normalized author names
  2. Search - Queries Semantic Scholar for each paper by title
  3. Match - Compares authors using fuzzy matching to handle name variations
  4. Report - Returns entries that couldn't be verified

A paper is considered "found" when all of the following hold:

  • Semantic Scholar returns a result for the title
  • Author match score meets the threshold (default: 60%)
  • Year matches within tolerance (default: ±1 year)
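The three criteria above can be read as a single decision function. The following is a minimal sketch, not the library's actual code: `Candidate` and `is_verified` are hypothetical names, and skipping the year check when a year is missing or unparseable is an assumption.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass
class Candidate:
    """Hypothetical stand-in for one search result; not a library type."""
    author_score: float   # 0.0-1.0 author match score
    year: int | None      # publication year reported by the database


def is_verified(
    candidate: Candidate | None,
    bib_year: str | None,
    author_threshold: float = 0.6,
    year_tolerance: int = 1,
) -> bool:
    # Criterion 1: the title search returned a result at all.
    if candidate is None:
        return False
    # Criterion 2: the author match score meets the threshold.
    if candidate.author_score < author_threshold:
        return False
    # Criterion 3: the years agree within the tolerance.
    # Assumption: a missing or unparseable year skips this check.
    if bib_year is None or candidate.year is None:
        return True
    try:
        return abs(int(bib_year) - candidate.year) <= year_tolerance
    except ValueError:
        return True
```

With the defaults, an entry dated 2023 would accept a match dated 2022 through 2024, provided the author score is at least 0.6.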

API Reference

check_citations()

def check_citations(
    bib_file: str,
    author_threshold: float = 0.6,
    year_tolerance: int = 1,
    api_key: str | None = None,
    verbose: bool = False,
) -> list[CheckResult]:

Parameters:

  • bib_file - Path to the .bib file
  • author_threshold - Minimum author match score (0.0-1.0) to consider verified
  • year_tolerance - Allowed year difference between bib entry and found paper
  • api_key - Optional Semantic Scholar API key for higher rate limits
  • verbose - Print progress information

Returns: List of CheckResult objects for entries that were NOT found/verified.

CheckResult

@dataclass
class CheckResult:
    entry: BibEntry           # The original bib entry
    found: bool               # Whether the paper was verified
    matched_paper: dict | None  # Semantic Scholar data if found
    author_match_score: float # 0.0-1.0 confidence score
    message: str              # Human-readable status

BibEntry

@dataclass
class BibEntry:
    key: str              # BibTeX key (e.g., "smith2023")
    title: str            # Paper title
    authors: list[str]    # Normalized author names
    year: str | None      # Publication year
    raw_entry: dict       # Original bibtexparser dict
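As an illustration of how these two data classes fit together, here is a small sketch that turns a list of results into report lines. The classes below are local stand-ins that mirror the fields documented above so the example runs on its own; `format_report` is a hypothetical helper, not part of the library.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class BibEntry:
    # Stand-in mirroring the documented fields.
    key: str
    title: str
    authors: list[str]
    year: str | None
    raw_entry: dict = field(default_factory=dict)


@dataclass
class CheckResult:
    # Stand-in mirroring the documented fields.
    entry: BibEntry
    found: bool
    matched_paper: dict | None
    author_match_score: float
    message: str


def format_report(results: list[CheckResult]) -> list[str]:
    """Hypothetical helper: one line per result, lowest author score first."""
    lines = []
    for r in sorted(results, key=lambda r: r.author_match_score):
        year = r.entry.year or "n.d."
        lines.append(
            f"{r.entry.key} ({year}): {r.message} "
            f"[author score {r.author_match_score:.2f}]"
        )
    return lines
```

Sorting by score puts the entries with no plausible match at the top, which are usually the ones worth checking by hand first.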

Author Matching

The library handles common author name variations:

  • Last, First format: "Smith, John" → "john smith"
  • First Last format: "John Smith" → "john smith"
  • Initials: "J. Smith" matches "John Smith"
  • Protected names: "{van der Berg}, Jan" → "jan van der berg"

Fuzzy matching accounts for minor spelling differences and missing middle names.
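To make the idea concrete, the sketch below normalizes names and scores author lists using the standard library's `difflib`. It is only one plausible approach: the function names, the 0.8 similarity cutoff, and the scoring rule (fraction of bib authors with a match) are all assumptions, and the library's real matcher may work differently.

```python
from difflib import SequenceMatcher


def normalize_name(name: str) -> str:
    """Strip BibTeX braces, flip 'Last, First', lowercase, drop periods."""
    name = name.replace("{", "").replace("}", "")
    if "," in name:
        last, first = name.split(",", 1)
        name = f"{first} {last}"
    return " ".join(name.replace(".", " ").lower().split())


def names_match(a: str, b: str, cutoff: float = 0.8) -> bool:
    """Compare first and last tokens only, so middle names are ignored."""
    pa, pb = normalize_name(a).split(), normalize_name(b).split()
    if not pa or not pb:
        return False
    # Last names must be similar (tolerates minor spelling differences).
    if SequenceMatcher(None, pa[-1], pb[-1]).ratio() < cutoff:
        return False
    first_a, first_b = pa[0], pb[0]
    # An initial matches any first name starting with the same letter.
    if len(first_a) == 1 or len(first_b) == 1:
        return first_a[0] == first_b[0]
    return SequenceMatcher(None, first_a, first_b).ratio() >= cutoff


def author_match_score(bib_authors: list[str], found_authors: list[str]) -> float:
    """Fraction of bib authors that match some author of the found paper."""
    if not bib_authors:
        return 0.0
    hits = sum(
        any(names_match(b, f) for f in found_authors) for b in bib_authors
    )
    return hits / len(bib_authors)
```

Under this sketch, a bib entry listing "Smith, John" and "Doe, Jane" checked against a paper by "J. Smith" and "Alice Jones" scores 0.5, which would fall below the default 60% threshold.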

Rate Limiting

Semantic Scholar has rate limits for API access:

  • Unauthenticated: Shared pool, may experience throttling
  • Authenticated: Higher limits with API key

The library includes:

  • Automatic delay between requests (3 seconds default)
  • Exponential backoff on rate limit errors
  • Up to 3 retries per request

For production use, request an API key from Semantic Scholar.
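The retry behaviour described above can be sketched as follows. This is an illustration under stated assumptions, not the library's client: `RateLimitError` and `call_with_backoff` are hypothetical names, and doubling the delay from a 3-second base is an assumed schedule.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


class RateLimitError(Exception):
    """Hypothetical error raised when the API signals a rate limit."""


def call_with_backoff(
    request: Callable[[], T],
    max_retries: int = 3,
    base_delay: float = 3.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call request(), retrying on rate-limit errors with doubling delays."""
    for attempt in range(max_retries + 1):
        try:
            return request()
        except RateLimitError:
            # Give up once the retry budget is spent.
            if attempt == max_retries:
                raise
            # Assumed schedule: 3 s, 6 s, 12 s for the default settings.
            sleep(base_delay * 2 ** attempt)
    raise AssertionError("unreachable")
```

Passing `sleep` as a parameter keeps the function easy to test without real waiting, for example by supplying a list's `append` method to record the delays.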

Development

Run Tests

uv sync --extra dev
uv run pytest tests/ -v

Project Structure

reference_checker/
├── __init__.py      # Main API
├── models.py        # Data classes
├── parser.py        # BibTeX parsing
├── matcher.py       # Author name matching
├── scholar.py       # Semantic Scholar client
└── cli.py           # Command-line interface

License

MIT
