Replace preprint BibTeX entries with published versions and validate bibliography references

BibTeX Updater

Tools for managing BibTeX bibliographies: automatically update preprints to published versions, validate references against external databases, and filter to only cited references.

[Figure: 9-stage resolution pipeline]

Installation

From PyPI (Recommended)

pip install bibtex-updater

# With Google Scholar support
pip install bibtex-updater[scholarly]

# With Zotero support
pip install bibtex-updater[zotero]

# All optional dependencies
pip install bibtex-updater[all]

From Source

git clone https://github.com/rpatrik96/bibtexupdater.git
cd bibtexupdater
uv sync --extra dev --extra all

Using uv (No Installation)

Run directly without cloning using uv:

# Run any command directly
uv run --with "bibtex-updater[all]" bibtex-update references.bib -o updated.bib

# Or use the wrapper script provided in the repository (requires a local clone)
./scripts/bibtex-x update references.bib -o updated.bib
./scripts/bibtex-x check references.bib
./scripts/bibtex-x filter paper.tex -b references.bib -o filtered.bib

CLI Commands

Command                    Description
bibtex-update              Replace preprints with published versions
bibtex-check               Validate references exist with correct metadata
bibtex-filter              Filter to only cited entries
bibtex-zotero              Update preprints in Zotero library
bibtex-zotero-organize     Organize Zotero items into collections by research taxonomy
bibtex-obsidian-keywords   AI-powered keyword generation for Obsidian paper notes

Quick Start

Update Preprints

# Update preprints to published versions
bibtex-update references.bib -o updated.bib

# Preview changes (dry run)
bibtex-update references.bib --dry-run --verbose

Validate References (Fact-Check)

# Check if references exist and have correct metadata
bibtex-check references.bib --report report.json

# Strict mode: exit with an error if any entries are hallucinated or not found
bibtex-check references.bib --strict

Filter Bibliography

# Filter to only cited entries
bibtex-filter paper.tex -b references.bib -o filtered.bib

# Multiple tex files
bibtex-filter *.tex -b references.bib -o filtered.bib

Update Zotero Library

# Set credentials (get from zotero.org/settings/keys)
export ZOTERO_LIBRARY_ID="your_user_id"
export ZOTERO_API_KEY="your_api_key"

# Preview changes
bibtex-zotero --dry-run

# Apply updates
bibtex-zotero

Sync BibTeX Updates to Zotero

When updating a .bib file, you can simultaneously update matching entries in your Zotero library:

# Set Zotero credentials
export ZOTERO_LIBRARY_ID="your_user_id"
export ZOTERO_API_KEY="your_api_key"

# Update bib file AND sync to Zotero
bibtex-update references.bib -o updated.bib --zotero

# Preview Zotero changes only (bib changes still apply)
bibtex-update references.bib -o updated.bib --zotero --zotero-dry-run

# Limit to a specific Zotero collection
bibtex-update references.bib -o updated.bib --zotero --zotero-collection ABCD1234

The sync matches bib entries to Zotero items by:

  1. arXiv ID - Most reliable for preprints
  2. DOI - For preprints with DOIs (e.g., bioRxiv)
  3. Title + Author - Fuzzy matching as fallback
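
The matching order above can be sketched as a simple cascade. This is a minimal illustration, not the package's implementation: the function name `match_zotero_item`, the flat dict fields (`arxiv_id`, `doi`, `title`, `first_author`), and the 0.9 similarity threshold are all assumptions made for the example.

```python
from difflib import SequenceMatcher


def normalize(text):
    """Lowercase and reduce to alphanumeric words separated by single spaces."""
    cleaned = "".join(ch if ch.isalnum() else " " for ch in text.lower())
    return " ".join(cleaned.split())


def match_zotero_item(bib_entry, zotero_items, threshold=0.9):
    """Return (item, reason) for the first rule that matches, else (None, None).

    Hypothetical helper: field names and threshold are illustrative only.
    """
    # 1. arXiv ID: most reliable for preprints
    arxiv_id = bib_entry.get("arxiv_id")
    if arxiv_id:
        for item in zotero_items:
            if item.get("arxiv_id") == arxiv_id:
                return item, "arxiv_id"

    # 2. DOI, compared case-insensitively
    doi = (bib_entry.get("doi") or "").lower()
    if doi:
        for item in zotero_items:
            if (item.get("doi") or "").lower() == doi:
                return item, "doi"

    # 3. Fallback: same first author and a fuzzy title match
    title = normalize(bib_entry.get("title", ""))
    author = normalize(bib_entry.get("first_author", ""))
    best, best_score = None, 0.0
    for item in zotero_items:
        if normalize(item.get("first_author", "")) != author:
            continue
        score = SequenceMatcher(None, title, normalize(item.get("title", ""))).ratio()
        if score > best_score:
            best, best_score = item, score
    if best is not None and best_score >= threshold:
        return best, "title_author"
    return None, None
```

Returning the reason alongside the item makes it easy to log which rule fired, which helps when reviewing a dry run.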

Standalone Scripts

For environments without pip (e.g., Overleaf), filter_bibliography.py can be used directly as it has no dependencies:

# Copy the script and run directly
python filter_bibliography.py paper.tex -b references.bib -o filtered.bib

Documentation

Document                         Description
docs/BIBTEX_UPDATER.md           Full BibTeX updater documentation
docs/REFERENCE_FACT_CHECKER.md   Full reference fact-checker documentation
docs/ZOTERO_UPDATER.md           Full Zotero updater documentation
docs/FILTER_BIBLIOGRAPHY.md      Full filter documentation
docs/LANDSCAPE.md                Databases, competing tools, and ecosystem landscape
examples/                        Example workflows and configuration files

Overleaf Integration

The tools integrate with Overleaf via GitHub Actions or latexmkrc.

GitHub Actions (Recommended)

  1. Enable GitHub sync in Overleaf (Menu -> Sync -> GitHub)
  2. Copy a workflow from examples/workflows/ to .github/workflows/
  3. Changes synced from Overleaf automatically trigger updates

latexmkrc (Direct Overleaf)

For filter_bibliography.py only (no dependencies required):

  1. Upload filter_bibliography.py to your Overleaf project
  2. Create .latexmkrc based on examples/latexmkrc
  3. Recompile - filtered bibliography appears in your file list

Features

BibTeX Updater (bibtex-update)

[Figure: Preprint to published]

  • Multi-source resolution: arXiv, OpenAlex, Europe PMC, Crossref, DBLP, ACL Anthology, Semantic Scholar, Google Scholar
  • High accuracy: Title and author fuzzy matching with confidence thresholds
  • ACL Anthology support: Zero-overhead resolution for NLP papers (ACL, EMNLP, NAACL, etc.)
  • Batch processing: Multiple files with concurrent workers (default: 8)
  • Deduplication: Merge duplicates by DOI or normalized title+authors
  • Smart caching: On-disk cache + semantic resolution cache with TTL
  • Per-service rate limiting: Optimized rate limits per API (Crossref, S2, DBLP, ACL Anthology, arXiv, OpenAlex, Europe PMC)
  • Batch API support: Faster bulk lookups via arXiv/S2/Crossref batch endpoints
  • Resolution tracking: --mark-resolved tags updated entries to skip on re-runs
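
To illustrate the deduplication bullet, here is a minimal sketch of merging by DOI or by normalized title plus authors. It is not the package's code: `dedup_key` and `deduplicate` are hypothetical names, entries are assumed to be plain dicts, and authors are assumed to be in "Family, Given and Family, Given" form.

```python
import re


def dedup_key(entry):
    """Key an entry by DOI when present, else by normalized title + surnames."""
    doi = (entry.get("doi") or "").strip().lower()
    if doi:
        return ("doi", doi)
    title = re.sub(r"[^a-z0-9]+", " ", entry.get("title", "").lower()).strip()
    surnames = tuple(
        sorted(
            author.split(",")[0].strip().lower()
            for author in entry.get("author", "").split(" and ")
            if author.strip()
        )
    )
    return ("title_authors", title, surnames)


def deduplicate(entries):
    """Keep the first entry per key and fill in fields it is missing."""
    merged = {}
    for entry in entries:
        key = dedup_key(entry)
        if key in merged:
            for field, value in entry.items():
                merged[key].setdefault(field, value)
        else:
            merged[key] = dict(entry)
    return list(merged.values())
```

One limitation of this simplified keying: a copy with a DOI and a copy of the same paper without one land under different keys, so a real implementation would need a second pass to reconcile them.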

Zotero Updater (bibtex-zotero)

[Figure: Zotero integration]

  • Direct Zotero integration: Fetches and updates items via Zotero API
  • Same resolution pipeline: Uses the same multi-source resolution
  • Preserves metadata: Keeps notes, tags, and attachments intact
  • Idempotent: Already-published papers are automatically skipped
  • Dry-run mode: Preview changes before applying
  • Tag-based chunking: Track processing state with preprint-upgraded/preprint-checked/preprint-error tags

Zotero Organizer (bibtex-zotero-organize)

  • AI-powered taxonomy: Organize items into hierarchical collections automatically
  • Multiple backends: Claude, OpenAI, or local embeddings for classification
  • Caching: Classification results cached to reduce API calls
  • Batch processing: Configurable limits and dry-run mode

Obsidian Keywords (bibtex-obsidian-keywords)

[Figure: AI auto-keywording]

  • AI-powered keywords: Generate [[wikilinks]] for Obsidian paper notes
  • Multiple backends: Claude, OpenAI, or local embeddings
  • Smart skipping: --min-keywords to skip notes that already have enough keywords
  • Topics file: Provide existing topics for consistent tagging across notes
  • Dry-run mode: Preview changes before modifying files
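
One way to picture the --min-keywords skip is to count the distinct [[wikilinks]] already present in a note. The sketch below is an assumption about how such a check could work, not a description of what the tool does internally; `count_keywords` and `should_skip` are hypothetical names.

```python
import re

# Matches [[Target]], [[Target|alias]] and [[Target#heading]]; captures Target.
WIKILINK = re.compile(r"\[\[([^\]|#]+)(?:[#|][^\]]*)?\]\]")


def count_keywords(note_text):
    """Count distinct wikilink targets, ignoring case and surrounding spaces."""
    return len({target.strip().lower() for target in WIKILINK.findall(note_text)})


def should_skip(note_text, min_keywords):
    """Skip notes that already carry at least `min_keywords` distinct links."""
    return count_keywords(note_text) >= min_keywords


note = "Topics: [[Causal Inference]], [[ICA|independent component analysis]], [[causal inference]]"
print(count_keywords(note))  # 2: the two spellings of "causal inference" collapse
```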

Reference Fact-Checker (bibtex-check)

[Figure: Reference fact-checker]

  • Multi-source validation: Crossref, DBLP, Semantic Scholar, OpenAlex
  • Detailed mismatch detection: Title, author, year, venue comparisons
  • Hallucination detection: Identifies likely fabricated references
  • Structured reports: JSON and JSONL output formats
  • CI/CD integration: Strict mode with exit codes for automation

Cascading verification

Inspired by Abbonato 2026 (CheckIfExist), verification orders sources CrossRef → OpenAlex → DBLP → Semantic Scholar and short-circuits as soon as one source produces a high-confidence match. Combined with top-K candidate retrieval and cross-source author intersection, it catches swapped_authors / chimeric citations that single-source verification misses.

The order is throughput-aware: CrossRef and OpenAlex (polite pool, ~100 req/min) come first, so the slow keyless Semantic Scholar fallback (~10 req/min) is only reached on hard entries. Set a Semantic Scholar API key (--s2-api-key or S2_API_KEY) to lift S2 from ~10 to ~60 req/min.

# Verification with top-3 candidates per source
bibtex-check references.bib --top-k 3 --jsonl out.jsonl

# Polite OpenAlex pool (recommended)
bibtex-check references.bib --openalex-mailto you@example.com
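
The short-circuit behaviour described above can be sketched as a loop over ordered sources. This is an illustration under stated assumptions, not the package's code: `cascade_verify` is a hypothetical name, each source is modelled as a callable returning a similarity in [0, 1] or None, and the 0.9 threshold is made up for the example.

```python
def cascade_verify(entry, sources, threshold=0.9):
    """Query sources in order and stop at the first high-confidence match.

    `sources` is a list of (name, lookup) pairs. Returns the best
    (name, score) seen and the list of sources that were actually queried.
    """
    tried = []
    best = None
    for name, lookup in sources:
        tried.append(name)
        score = lookup(entry)
        if score is None:  # source had no candidate
            continue
        if best is None or score > best[1]:
            best = (name, score)
        if score >= threshold:  # short-circuit: slower sources are never hit
            break
    return best, tried


# Stub lookups standing in for real API clients, fast sources first.
sources = [
    ("crossref", lambda entry: None),
    ("openalex", lambda entry: 0.95),
    ("dblp", lambda entry: 0.50),
    ("semantic_scholar", lambda entry: 0.99),
]
best, tried = cascade_verify({"title": "Some Paper"}, sources)
print(best, tried)  # ('openalex', 0.95) ['crossref', 'openalex']
```

Because the loop breaks on the first confident match, the rate-limited fallback at the end of the list is only queried for entries the earlier sources could not settle.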

A numeric confidence_score from 0 to 100 (added as an extra field in the JSONL output) summarizes per-field similarity with explicit penalty/bonus contributions:

  • Multi-source bonus: +10 when ≥2 sources confirm the same authors
  • Penalties: title-mismatch -20, author-mismatch -20, journal-mismatch -15, fabricated-author -10 each (capped at -20)
  • Asymmetric formula for the high-title-low-author chimeric case: confidence = S_title − 0.5 × (100 − S_author)
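
As a worked example of the numbers above, the sketch below applies the listed bonus and penalties to a starting score and evaluates the asymmetric formula. How the tool derives the starting score and when it switches to the asymmetric formula are not stated here, so `base`, `adjust`, and the clamping to the 0–100 range are assumptions for illustration.

```python
def chimeric_confidence(s_title, s_author):
    """Asymmetric formula: confidence = S_title - 0.5 * (100 - S_author)."""
    return s_title - 0.5 * (100 - s_author)


def adjust(base, *, sources_confirming_authors=0, title_mismatch=False,
           author_mismatch=False, journal_mismatch=False, fabricated_authors=0):
    """Apply the listed bonus/penalties to a hypothetical base score."""
    score = base
    if sources_confirming_authors >= 2:
        score += 10                               # multi-source bonus
    if title_mismatch:
        score -= 20
    if author_mismatch:
        score -= 20
    if journal_mismatch:
        score -= 15
    score -= min(10 * fabricated_authors, 20)     # -10 each, capped at -20
    return max(0, min(100, score))                # assumed clamp to 0-100


# Near-perfect title (95) but poor author agreement (30):
print(chimeric_confidence(95, 30))  # 95 - 0.5 * 70 = 60.0

# Base 80, two sources agree on authors, wrong journal, three fabricated authors:
print(adjust(80, sources_confirming_authors=2, journal_mismatch=True,
             fabricated_authors=3))  # 80 + 10 - 15 - 20 = 55
```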

Verdicts: verified vs. could-not-verify vs. problematic

VERIFIED requires every claimed field to be positively confirmed against the matched record — not merely "not contradicted". When a record is found but a claimed field can't be confirmed (e.g. a published venue backed only by a preprint, or an incomplete author list), the entry is reported as could-not-verify (UNCONFIRMED/NOT_FOUND), distinct from a problematic flag (*_mismatch, doi_mismatch, chimeric, …) which is positive evidence of a defect. A "could-not-verify" is not a clean pass: it means the tool couldn't decide, and such entries warrant review.
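
The three-way distinction can be sketched as a small classifier over per-field results. This is a simplified model, assuming each claimed field has already been labelled "confirmed", "unconfirmed", or "mismatch"; the `Verdict` enum and `classify` function are hypothetical and do not mirror the tool's actual status names.

```python
from enum import Enum


class Verdict(Enum):
    VERIFIED = "verified"
    COULD_NOT_VERIFY = "could-not-verify"
    PROBLEMATIC = "problematic"


def classify(field_results):
    """Map per-field outcomes to a verdict.

    `field_results` maps a claimed field name to "confirmed", "unconfirmed",
    or "mismatch". An empty mapping stands for "no record found".
    """
    statuses = set(field_results.values())
    if "mismatch" in statuses:
        return Verdict.PROBLEMATIC        # positive evidence of a defect
    if statuses == {"confirmed"}:
        return Verdict.VERIFIED           # every claimed field confirmed
    return Verdict.COULD_NOT_VERIFY       # found nothing, or something unconfirmed


print(classify({"title": "confirmed", "author": "confirmed", "venue": "unconfirmed"}))
# Verdict.COULD_NOT_VERIFY
```

The key design point is that the absence of a contradiction is not enough for VERIFIED: a single unconfirmed field keeps the entry in the review pile.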

Author handling

All four sources return authors in as-published order, so an author-order difference is treated as a real citation error (e.g. a transposed or wrong lead author) and flagged — it is not an API artifact. Surname comparison uses each source's structured family field where available, so family-first/CJK names like "Chen Xing" ↔ "Xing Chen" are not falsely flagged; when the matched source lacks structured names (Semantic Scholar flat names, DBLP), a Crossref structured-name lookup is used to vet a potential author mismatch before reporting it.
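
A rough sketch of this logic follows, assuming authors arrive either as structured dicts with a `family` key or as flat name strings. The function names and return labels are hypothetical; the point is only to show why structured names allow a firm decision while flat names call for a follow-up lookup before anything is flagged.

```python
def surname(author):
    """Return (surname, reliable) for a structured dict or a flat name."""
    if isinstance(author, dict) and author.get("family"):
        return author["family"].casefold(), True
    name = author["name"] if isinstance(author, dict) else author
    if "," in name:
        # "Family, Given" is unambiguous
        return name.split(",")[0].strip().casefold(), True
    # Flat "Given Family" or "Family Given": the last token is only a guess
    return name.split()[-1].casefold(), False


def compare_authors(claimed, found):
    """Compare two author lists by surname, in as-published order."""
    c = [surname(a) for a in claimed]
    f = [surname(a) for a in found]
    c_names = [s for s, _ in c]
    f_names = [s for s, _ in f]
    if c_names == f_names:
        return "match"
    if not all(reliable for _, reliable in c + f):
        return "needs_structured_lookup"   # vet before reporting a mismatch
    if sorted(c_names) == sorted(f_names):
        return "order_mismatch"            # same people, wrong order
    return "author_mismatch"


claimed = ["Chen, Xing", "Li, Wei"]
print(compare_authors(claimed, [{"family": "Chen", "given": "Xing"},
                                {"family": "Li", "given": "Wei"}]))  # match
print(compare_authors(claimed, [{"name": "Chen Xing"},
                                {"name": "Li Wei"}]))  # needs_structured_lookup
```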

Non-generative-AI mode (--non-generative)

For venue-policy compliance (ACL ARR, ICML 2026) the --non-generative flag (or BIBTEX_CHECK_NON_GENERATIVE=1 env var) refuses to load any LLM backend at runtime. Today the package has no LLM backends, so this is a forward-compat guard plus a startup banner:

bibtex-check references.bib --non-generative --strict
# bibtex-check running in non-generative mode (no LLM calls).
# Compliant with ICML 2026 / ACL ARR LLM-in-review policies.

Filter Bibliography (bibtex-filter)

  • Zero dependencies: Uses only Python standard library
  • Works on Overleaf: No pip install needed
  • Multiple bib files: Merge and filter from multiple sources
  • Citation detection: Supports natbib, biblatex, and standard LaTeX citations
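
For a sense of how standard-library-only citation detection can work, the sketch below extracts citation keys with a single regular expression. It is an illustrative approximation, not the contents of filter_bibliography.py: the pattern and the `cited_keys` name are assumptions, and it treats any command whose name contains "cite" (such as \citep, \textcite, \parencite*) as a citation.

```python
import re

# \<prefix>cite<suffix>, optional *, any number of [optional] args, then {keys}
CITE = re.compile(r"\\[A-Za-z]*cite[A-Za-z]*\*?(?:\s*\[[^\]]*\])*\s*\{([^}]*)\}")


def cited_keys(tex):
    """Return the set of citation keys used in a LaTeX source string."""
    # Drop comments: an unescaped % up to the end of the line
    tex = re.sub(r"(?<!\\)%.*", "", tex)
    keys = set()
    for group in CITE.findall(tex):
        keys.update(key.strip() for key in group.split(",") if key.strip())
    return keys


tex = (
    r"See \cite{a, b} and \citep[p.~3]{c}. \textcite[see][p. 5]{d} % \cite{hidden}"
    + "\n"
    + r"\parencite*{e}"
)
print(sorted(cited_keys(tex)))  # ['a', 'b', 'c', 'd', 'e']
```

Filtering then amounts to keeping only the .bib entries whose keys appear in this set. Note that \nocite{*} would yield the literal key "*", which a real filter would need to treat as "keep everything".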

Python API

from bibtex_updater import Detector, Resolver, Updater, HttpClient, RateLimiter, DiskCache

# Create HTTP client with rate limiting and caching
rate_limiter = RateLimiter(req_per_min=30)
cache = DiskCache(".cache.json")
http_client = HttpClient(
    timeout=30.0,
    user_agent="bibtex-updater/0.5.0",
    rate_limiter=rate_limiter,
    cache=cache
)

# Detect preprints
# (`entry` is a BibTeX entry parsed beforehand; loading it is not shown here)
detector = Detector()
detection = detector.detect(entry)

if detection.is_preprint:
    # Resolve to published version
    resolver = Resolver(http_client)
    candidate = resolver.resolve(detection)

    if candidate and candidate.confidence >= 0.9:
        # Update the entry
        updater = Updater()
        updated_entry = updater.update_entry(entry, candidate.record, detection)

Development

# Clone and install in development mode
git clone https://github.com/rpatrik96/bibtexupdater.git
cd bibtexupdater
uv sync --extra dev --extra all

# Run tests
uv run pytest tests/ -v

# Run tests with coverage
uv run pytest tests/ -v --cov=bibtex_updater --cov-report=term-missing

# Code quality
pre-commit run --all-files

# Build package
uv build

# Check package
uv run twine check dist/*

License

MIT License - see LICENSE for details.
