Verify .bib file citations against academic databases (Semantic Scholar, DBLP, Google Scholar, Open Library)

HaRC - Hallucinated Reference Checker

Python 3.10+ | License: MIT

Verify BibTeX citations against academic databases. Catches fake, misspelled, or incorrect references in your .bib files before submission.

Features

Source            Lookup Methods                Entry Types
Semantic Scholar  DOI, arXiv ID, title search   Papers
DBLP              Title search                  Papers
Google Scholar    Title search                  Papers
Open Library      ISBN, title search            Books

Additional capabilities:

  • Fuzzy author matching - Handles name variations, initials, and spelling differences
  • URL verification - Checks reachability and title matching for web citations
  • Smart fallback - Tries multiple databases until a valid match is found
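
To give a feel for what fuzzy author matching involves, here is a minimal sketch built on the standard library's difflib. It is not HaRC's actual implementation: the function names, the 0.7/0.3 weighting, and the 0.8 per-name cutoff are all assumptions chosen for illustration.

```python
import unicodedata
from difflib import SequenceMatcher


def normalize_name(name: str) -> str:
    """Lowercase, strip accents and punctuation, and reorder 'Last, First' to 'first last'."""
    if "," in name:
        last, _, first = name.partition(",")
        name = f"{first} {last}"
    decomposed = unicodedata.normalize("NFKD", name)
    no_accents = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    cleaned = "".join(ch if ch.isalnum() or ch.isspace() else " " for ch in no_accents)
    return " ".join(cleaned.lower().split())


def name_similarity(a: str, b: str) -> float:
    """Score two names in [0, 1]; a single-letter initial may stand for a full given name."""
    ta, tb = normalize_name(a).split(), normalize_name(b).split()
    if not ta or not tb:
        return 0.0
    # Treat the last token as the surname and compare it fuzzily.
    surname = SequenceMatcher(None, ta[-1], tb[-1]).ratio()
    given_a, given_b = ta[:-1], tb[:-1]
    if not given_a or not given_b:
        return surname
    ga, gb = given_a[0], given_b[0]
    if len(ga) == 1 or len(gb) == 1:
        # An initial matches when the first letters agree.
        given = 1.0 if ga[0] == gb[0] else 0.0
    else:
        given = SequenceMatcher(None, ga, gb).ratio()
    # Hypothetical weighting: the surname counts more than the given name.
    return 0.7 * surname + 0.3 * given


def author_match_score(bib_authors: list[str], db_authors: list[str]) -> float:
    """Fraction of bib authors that have a database counterpart scoring >= 0.8."""
    if not bib_authors:
        return 0.0
    matched = sum(
        1
        for a in bib_authors
        if any(name_similarity(a, b) >= 0.8 for b in db_authors)
    )
    return matched / len(bib_authors)
```

Under these assumptions, "Goodfellow, Ian" and "I. Goodfellow" both match "Ian Goodfellow", while an unrelated name contributes nothing to the score.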

Installation

# Using uv (recommended)
uv add harcx

# Using pip
pip install harcx

Quick Start

# Basic usage
harcx references.bib

# Also verify URL citations
harcx references.bib --check-urls

# Quiet mode (errors only)
harcx references.bib -q

CLI Reference

harcx [OPTIONS] BIB_FILE

Options:
  -q, --quiet              Suppress progress output
  --threshold FLOAT        Author match threshold (0.0-1.0, default: 0.6)
  --api-key KEY            Semantic Scholar API key for higher rate limits
  --check-urls             Verify URL citations for reachability
  --title-threshold FLOAT  URL title match threshold (0.0-1.0, default: 0.6)
  -h, --help               Show help message

Example Output

Parsed 50 entries from references.bib
[1/50] Checking (article): smith2023
    Trying arXiv ID: 2301.12345
  Found (author match: 1.00)
[2/50] Checking (book): goodfellow2016deep
    Trying Open Library title search
  Found (author match: 0.75)
[3/50] Checking (article): suspicious2023
    Trying Semantic Scholar title search
    Trying DBLP title search
    Trying Google Scholar title search
  ISSUE: Not found in Semantic Scholar, DBLP, or Google Scholar

============================================================
Found 1 entries requiring attention:
============================================================

[suspicious2023]
  Title: This Paper Does Not Exist
  Bib Authors: Suspicious Author
  Year: 2023
  Issue: Not found in Semantic Scholar, DBLP, or Google Scholar

Python API

from reference_checker import check_citations, check_web_citations

# Check citations - returns entries that weren't verified
issues = check_citations("references.bib")

for result in issues:
    print(f"{result.entry.key}: {result.message}")

# Check URL citations
url_issues = check_web_citations("references.bib")

for result in url_issues:
    print(f"{result.entry.key}: {result.url} - {result.message}")
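
One way to use the returned list is as a pre-submission gate in a script or CI job. The sketch below relies only on the `result.entry.key` and `result.message` attributes shown above; `format_report` and `main` are hypothetical helper names, not part of the package.

```python
def format_report(issues) -> tuple[list[str], int]:
    """Turn a list of results into printable lines plus a shell-style exit code."""
    lines = [f"{r.entry.key}: {r.message}" for r in issues]
    # Exit code 1 signals that at least one citation needs attention.
    return lines, (1 if lines else 0)


def main(bib_file: str) -> int:
    """Hypothetical entry point; imports lazily so the helper above works on its own."""
    from reference_checker import check_citations

    lines, exit_code = format_report(check_citations(bib_file))
    for line in lines:
        print(line)
    return exit_code
```

A wrapper script could then call `sys.exit(main("references.bib"))` so that a build fails whenever unverified entries remain.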

Function Signatures

def check_citations(
    bib_file: str,
    author_threshold: float = 0.6,  # Minimum author match score
    year_tolerance: int = 1,         # Allowed year difference (±)
    api_key: str | None = None,      # Semantic Scholar API key
    verbose: bool = False,           # Print progress
) -> list[CheckResult]

def check_web_citations(
    bib_file: str,
    title_threshold: float = 0.6,    # Minimum title match score
    verbose: bool = False,           # Print progress
) -> list[WebCheckResult]

How It Works

┌─────────────┐     ┌──────────────┐     ┌─────────────┐     ┌──────────┐
│  Parse .bib │ ──▶ │    Lookup    │ ──▶ │ Fuzzy Match │ ──▶ │  Report  │
│    file     │     │  (DOI/title) │     │   Authors   │     │  Issues  │
└─────────────┘     └──────────────┘     └─────────────┘     └──────────┘

Lookup Order (Papers):

  1. DOI lookup (Semantic Scholar)
  2. arXiv ID lookup (Semantic Scholar)
  3. Title search (Semantic Scholar → DBLP → Google Scholar)

Lookup Order (Books):

  1. ISBN lookup (Open Library)
  2. Title search (Open Library → Semantic Scholar → DBLP → Google Scholar)
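
The two lookup orders above can be sketched as an ordered list of steps that is walked until one yields an acceptable match. This is an illustration only: the entry fields (`type`, `doi`, `arxiv`, `isbn`), the source keys, and the function names are assumptions, and the lookups are passed in as plain callables rather than real API clients.

```python
from typing import Callable, Optional

Record = dict  # hypothetical: whatever a database lookup returns
Lookup = Callable[[dict], Optional[Record]]


def build_steps(entry: dict, sources: dict[str, Lookup]) -> list[tuple[str, Lookup]]:
    """Return (label, lookup) pairs in the documented order for this entry."""
    steps: list[tuple[str, Lookup]] = []
    if entry.get("type") == "book":
        # Books: ISBN first, then title search starting with Open Library.
        if entry.get("isbn"):
            steps.append(("openlibrary_isbn", sources["openlibrary_isbn"]))
        title_order = ["openlibrary_title", "s2_title", "dblp_title", "scholar_title"]
    else:
        # Papers: DOI, then arXiv ID, then title search.
        if entry.get("doi"):
            steps.append(("s2_doi", sources["s2_doi"]))
        if entry.get("arxiv"):
            steps.append(("s2_arxiv", sources["s2_arxiv"]))
        title_order = ["s2_title", "dblp_title", "scholar_title"]
    steps.extend((name, sources[name]) for name in title_order)
    return steps


def first_match(
    entry: dict,
    sources: dict[str, Lookup],
    accept: Callable[[dict, Record], bool],
) -> tuple[Optional[Record], list[str]]:
    """Try each step in order; stop at the first record that `accept` approves."""
    tried: list[str] = []
    for label, lookup in build_steps(entry, sources):
        tried.append(label)
        record = lookup(entry)
        if record is not None and accept(entry, record):
            return record, tried
    return None, tried
```

Returning the list of attempted steps alongside the record makes it easy to produce messages like the "Trying ..." lines in the example output.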

A citation is verified when:

  • Found in at least one database
  • Author match score ≥ threshold (default: 0.6, i.e. 60%)
  • Year matches within tolerance (default: ±1 year)
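
The three conditions above combine into a single predicate. The sketch below is a hedged illustration: `Candidate` and `is_verified` are hypothetical names, the defaults mirror the documented ones, and treating a missing year as "not a failure" is an assumption rather than documented behavior.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Candidate:
    """Hypothetical summary of a database hit."""
    author_score: float   # fuzzy author match score in [0, 1]
    year: Optional[int]   # publication year reported by the database


def is_verified(
    candidate: Optional[Candidate],
    bib_year: Optional[int],
    author_threshold: float = 0.6,
    year_tolerance: int = 1,
) -> bool:
    # Condition 1: found in at least one database.
    if candidate is None:
        return False
    # Condition 2: author match score meets the threshold.
    if candidate.author_score < author_threshold:
        return False
    # Condition 3: year within tolerance (skipped if either year is missing; assumption).
    if bib_year is not None and candidate.year is not None:
        return abs(candidate.year - bib_year) <= year_tolerance
    return True
```

With the defaults, a hit with author score 0.75 and the same year passes, while a perfect author match that is two years off does not.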

Rate Limits

  • Semantic Scholar: ~3 req/sec (faster with API key)
  • DBLP: ~1 req/sec
  • Google Scholar: ~0.5 req/sec (may block excessive requests)
  • Open Library: ~1 req/sec

Get a free Semantic Scholar API key at semanticscholar.org/product/api
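
Staying under limits like these usually means spacing requests out per source. The following is a minimal throttle sketch, not the package's own mechanism: the `Throttle` class and the `RATES` keys are hypothetical, and the rates are copied from the approximate figures above.

```python
import time


class Throttle:
    """Enforce a minimum interval between successive calls to wait()."""

    def __init__(self, requests_per_second: float, clock=time.monotonic, sleep=time.sleep):
        self.min_interval = 1.0 / requests_per_second
        self._clock = clock   # injectable for testing
        self._sleep = sleep   # injectable for testing
        self._last = None     # time the previous request was released

    def wait(self) -> float:
        """Block until the next request is allowed; return the seconds slept."""
        now = self._clock()
        delay = 0.0
        if self._last is not None:
            delay = max(0.0, self.min_interval - (now - self._last))
        if delay > 0:
            self._sleep(delay)
        self._last = now + delay
        return delay


# Approximate rates from the list above, in requests per second.
RATES = {
    "semantic_scholar": 3.0,
    "dblp": 1.0,
    "google_scholar": 0.5,
    "open_library": 1.0,
}
throttles = {name: Throttle(rate) for name, rate in RATES.items()}
```

Calling `throttles["dblp"].wait()` before each DBLP request would then keep that source at roughly one request per second.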

Development

git clone https://github.com/gurusha01/HaRC.git
cd HaRC
uv sync --all-extras
uv run pytest tests/ -v

License

MIT
