Replace preprint BibTeX entries with published versions and validate bibliography references

BibTeX Updater

Tools for managing BibTeX bibliographies: automatically update preprints to published versions, validate references against external databases, and filter to only cited references.

[Figure: 10-stage resolution pipeline]

Installation

From PyPI (Recommended)

pip install bibtex-updater

# With Google Scholar support
pip install bibtex-updater[scholarly]

# With Zotero support
pip install bibtex-updater[zotero]

# All optional dependencies
pip install bibtex-updater[all]

From Source (Development)

git clone https://github.com/rpatrik96/bibtexupdater.git
cd bibtexupdater
uv sync --extra dev --extra all

Using uv (No Installation)

Run directly without cloning using uv:

# Run any command directly
uv run --with "bibtex-updater[all]" bibtex-update references.bib -o updated.bib

# Or use the provided wrapper script
./scripts/bibtex-x update references.bib -o updated.bib
./scripts/bibtex-x check references.bib
./scripts/bibtex-x filter paper.tex -b references.bib -o filtered.bib

CLI Commands

Command	Description
`bibtex-update`	Replace preprints with published versions
`bibtex-check`	Validate references exist with correct metadata
`bibtex-filter`	Filter to only cited entries
`bibtex-zotero`	Update preprints in Zotero library
`bibtex-zotero-organize`	Organize Zotero items into collections by research taxonomy
`bibtex-obsidian-keywords`	AI-powered keyword generation for Obsidian paper notes

Quick Start

Update Preprints

# Update preprints to published versions
bibtex-update references.bib -o updated.bib

# Preview changes (dry run)
bibtex-update references.bib --dry-run --verbose

Validate References (Fact-Check)

# Check if references exist and have correct metadata
bibtex-check references.bib --report report.json

# Strict mode: exit with error if hallucinated/not-found entries
bibtex-check references.bib --strict

Filter Bibliography

# Filter to only cited entries
bibtex-filter paper.tex -b references.bib -o filtered.bib

# Multiple tex files
bibtex-filter *.tex -b references.bib -o filtered.bib

Update Zotero Library

# Set credentials (get from zotero.org/settings/keys)
export ZOTERO_LIBRARY_ID="your_user_id"
export ZOTERO_API_KEY="your_api_key"

# Preview changes
bibtex-zotero --dry-run

# Apply updates
bibtex-zotero

Sync BibTeX Updates to Zotero

When updating a .bib file, you can simultaneously update matching entries in your Zotero library:

# Set Zotero credentials
export ZOTERO_LIBRARY_ID="your_user_id"
export ZOTERO_API_KEY="your_api_key"

# Update bib file AND sync to Zotero
bibtex-update references.bib -o updated.bib --zotero

# Preview Zotero changes only (bib changes still apply)
bibtex-update references.bib -o updated.bib --zotero --zotero-dry-run

# Limit to a specific Zotero collection
bibtex-update references.bib -o updated.bib --zotero --zotero-collection ABCD1234

The sync matches bib entries to Zotero items by:

arXiv ID - Most reliable for preprints
DOI - For preprints with DOIs (e.g., bioRxiv)
Title + Author - Fuzzy matching as fallback
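
The matching order above can be sketched as a simple fall-through. This is a minimal illustration, not the package's implementation: the dictionary keys (`arxiv_id`, `doi`, `title`, `first_author`), the `match_item` helper, and the 0.9 threshold are all assumptions made for the example.

```python
from difflib import SequenceMatcher


def normalize(text):
    """Lowercase and keep only alphanumerics so strings compare loosely."""
    cleaned = "".join(c if c.isalnum() else " " for c in text.lower())
    return " ".join(cleaned.split())


def match_item(entry, items, threshold=0.9):
    """Return (item, reason) for the first strategy that hits, else (None, None).

    `entry` and `items` are plain dicts here (hypothetical shape).
    """
    # 1. arXiv ID: exact match, tried first.
    if entry.get("arxiv_id"):
        for item in items:
            if item.get("arxiv_id") == entry["arxiv_id"]:
                return item, "arxiv_id"
    # 2. DOI: compared case-insensitively.
    if entry.get("doi"):
        for item in items:
            if (item.get("doi") or "").lower() == entry["doi"].lower():
                return item, "doi"
    # 3. Fallback: same first-author surname plus a fuzzy title match.
    best, best_score = None, 0.0
    for item in items:
        if normalize(item.get("first_author", "")) != normalize(entry.get("first_author", "")):
            continue
        score = SequenceMatcher(
            None, normalize(entry.get("title", "")), normalize(item.get("title", ""))
        ).ratio()
        if score > best_score:
            best, best_score = item, score
    if best is not None and best_score >= threshold:
        return best, "title_author"
    return None, None
```

Returning the reason alongside the item makes it easy to log which strategy produced a match, which helps when reviewing a dry run.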

Standalone Scripts

For environments without pip (e.g., Overleaf), filter_bibliography.py can be used directly as it has no dependencies:

# Copy the script and run directly
python filter_bibliography.py paper.tex -b references.bib -o filtered.bib

Documentation

Document	Description
docs/BIBTEX_UPDATER.md	Full BibTeX updater documentation
docs/REFERENCE_FACT_CHECKER.md	Full reference fact-checker documentation
docs/ZOTERO_UPDATER.md	Full Zotero updater documentation
docs/FILTER_BIBLIOGRAPHY.md	Full filter documentation
docs/LANDSCAPE.md	Databases, competing tools, and ecosystem landscape
benchmarks/HALLMARK.md	`bibtex-check` v1.2.0 detection metrics on HALLMARK v1.1.1 (all splits) + reproduction
examples/	Example workflows and configuration files

Overleaf Integration

Both tools integrate with Overleaf via GitHub Actions or latexmkrc.

GitHub Actions (Recommended)

Enable GitHub sync in Overleaf (Menu -> Sync -> GitHub)
Copy a workflow from examples/workflows/ to .github/workflows/
Changes synced from Overleaf automatically trigger updates

latexmkrc (Direct Overleaf)

For filter_bibliography.py only (no dependencies required):

Upload filter_bibliography.py to your Overleaf project
Create .latexmkrc based on examples/latexmkrc
Recompile - filtered bibliography appears in your file list

Features

BibTeX Updater (`bibtex-update`)

[Figure: Preprint to published]

Multi-source resolution: arXiv, OpenAlex, Europe PMC, Crossref, DBLP, ACL Anthology, OpenReview, Semantic Scholar, Google Scholar
High accuracy: Title and author fuzzy matching with confidence thresholds
ACL Anthology support: Zero-overhead resolution for NLP papers (ACL, EMNLP, NAACL, etc.)
Batch processing: Multiple files with concurrent workers (default: 8)
Deduplication: Merge duplicates by DOI or normalized title+authors
Smart caching: On-disk cache + semantic resolution cache with TTL
Per-service rate limiting: Optimized rate limits per API (Crossref, S2, DBLP, ACL Anthology, arXiv, OpenAlex, Europe PMC, OpenReview)
Batch API support: Faster bulk lookups via arXiv/S2/Crossref batch endpoints
Resolution tracking: --mark-resolved tags updated entries to skip on re-runs
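
To illustrate the deduplication rule in the list above (merge by DOI, otherwise by normalized title and authors), here is a minimal sketch. The helper names (`dedup_key`, `deduplicate`) and the dict-based entry shape are hypothetical, and the merge policy (first occurrence wins, later duplicates only fill in missing fields) is an assumption, not the tool's documented behavior.

```python
import re
import unicodedata


def _norm(text):
    """Strip diacritics, lowercase, and collapse punctuation to single spaces."""
    text = unicodedata.normalize("NFKD", text)
    text = "".join(c for c in text if not unicodedata.combining(c))
    return re.sub(r"[^a-z0-9]+", " ", text.lower()).strip()


def dedup_key(entry):
    """Prefer the DOI as identity; fall back to normalized title + authors."""
    doi = (entry.get("doi") or "").strip().lower()
    if doi:
        return ("doi", doi)
    return ("title_authors", _norm(entry.get("title", "")), _norm(entry.get("author", "")))


def deduplicate(entries):
    """Merge duplicates, keeping the first occurrence and filling missing fields."""
    merged = {}
    for entry in entries:
        key = dedup_key(entry)
        if key in merged:
            for field, value in entry.items():
                merged[key].setdefault(field, value)
        else:
            merged[key] = dict(entry)
    return list(merged.values())
```

One limitation of this simplified key: an entry with a DOI and a duplicate without one land in different buckets, so a fuller version would need a second pass over titles.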

Zotero Updater (`bibtex-zotero`)

[Figure: Zotero integration]

Direct Zotero integration: Fetches and updates items via Zotero API
Same resolution pipeline: Uses the same multi-source resolution
Preserves metadata: Keeps notes, tags, and attachments intact
Idempotent: Already-published papers are automatically skipped
Dry-run mode: Preview changes before applying
Tag-based chunking: Track processing state with preprint-upgraded/preprint-checked/preprint-error tags

Zotero Organizer (`bibtex-zotero-organize`)

AI-powered taxonomy: Organize items into hierarchical collections automatically
Multiple backends: Claude, OpenAI, or local embeddings for classification
Caching: Classification results cached to reduce API calls
Batch processing: Configurable limits and dry-run mode

Obsidian Keywords (`bibtex-obsidian-keywords`)

[Figure: AI auto-keywording]

AI-powered keywords: Generate [[wikilinks]] for Obsidian paper notes
Multiple backends: Claude, OpenAI, or local embeddings
Smart skipping: --min-keywords to skip notes that already have enough keywords
Topics file: Provide existing topics for consistent tagging across notes
Dry-run mode: Preview changes before modifying files

Reference Fact-Checker (`bibtex-check`)

[Figure: Reference fact-checker]

v1.2.0 carries v1.1.0's held-out FPR work into the catch-rate dimension: ~110 previously-abstained hallucinations are now flagged as problematic, the SCoRe wrong-venue leak class is caught, and a --strict evaluation mode tuned for arXiv's 2026 hallucinated-reference policy (1-year ban followed by peer-review-first requirement) is available for high-stakes audits. Against the corrected HALLMARK v1.0 gold:

Held-out test FPR steady at 2.32% (8.94% in v1.0.0 → 2.32% in v1.1.0+v1.2.0; −74% vs v1.0.0). Dev FPR 1.59% → 1.99% (+0.4pp; 3 small documented regression FPs)
Caught-on-hallucinated: dev 60% → 75%, test 58% → 74% (+15pp on both splits) — driven by the new cross-source venue verification (catches SCoRe-shape leaks), the ID-anchored venue/year mismatch helper, and the relaxed-author retrieval fallback
Leak rate: 0.65% dev (4 entries), 0.57% test (3 entries; SCoRe caught — was 4 in v1.1.0); policy-adjusted 0.32% / 0.38% — remaining "leaks" are mostly 1-character title perturbations; hyphen-only differences are explicitly not counted as leaks in default mode (hyphenation is bibliographic noise that varies across DBLP/Crossref/publisher records — flagging it would generate false positives on most legit refs). --strict catches every 1-char title diff (Levenshtein-1, hyphen included) for arXiv-style high-stakes audits, plus tolerance-0 year, single-source author-fab detection, and truncated-author flagging. See docs/KNOWN_LEAKS.md for the per-leak enumeration and policy
Could-not-verify on real refs dropped ~70% via venue + retrieval refinements (OpenReview/PMLR track-suffix normalization, TMLR/JMLR ISO-4 alias expansion, diacritic-preserving paperhash)

The "leak" headline is mostly benchmark noise: HALLMARK PR #9 corrects 30 entries — including FlashAttention, DDPM, Imagen, SimCLR, Performers, ViT-vs-CNN, Chain-of-Thought (Wei), Zero-Shot Reasoner (Kojima), MERLOT — that the v1.0 auto-labeller flagged as fabricated but are in fact real, correctly-cited papers (arXiv DOIs register with DataCite, not CrossRef, so the auto-labeller's "no resolve" check returned false). The corrected leak rate isolates genuine catch opportunities.

[Figure: bibtex-check v1.2.0 accuracy]

The full per-split detection grid (DR / FPR / Precision / F1 / MCC / Coverage on dev_public, test_public, stress_test, test_crossdomain) against the corrected HALLMARK v1.1.1 gold lives in benchmarks/HALLMARK.md, with a reproducible eval script (scripts/eval_hallmark.py):

# Score bibtex-check on a HALLMARK split and emit detection metrics
export S2_API_KEY=...   # optional: lifts Semantic Scholar rate limits
python scripts/eval_hallmark.py --split /path/to/hallmark/data/v1.0/test_public.jsonl --out test_public.json

Multi-source validation: Crossref, OpenAlex, DBLP, OpenReview, Semantic Scholar
Detailed mismatch detection: Title, author, year, venue comparisons
Integrity checks: DOI- and arXiv-ID-target consistency, ID-anchored author fabrication, chimeric-title detection
Hallucination detection: Reserves hallucinated for positive evidence (fabricated DOI, future/invalid year, ID misattribution); abstains (not_found) on weak matches
Structured reports: JSON and JSONL output formats
CI/CD integration: Strict mode with exit codes for automation

Cascading verification

Inspired by Abbonato 2026 (CheckIfExist), verification runs a single cascade — CrossRef → OpenAlex → DBLP → OpenReview → Semantic Scholar — that short-circuits as soon as one source produces a high-confidence match (≥0.95). Each step retrieves top-K candidates and re-ranks them by title similarity; combined with cross-source author intersection, this catches swapped-author / chimeric citations that single-source verification misses.

The order is throughput-aware: CrossRef and OpenAlex (polite pool, ~100 req/min) come first, then DBLP and OpenReview (~30 req/min) as the CS-conference and ICLR/NeurIPS/TMLR authorities, so the slow keyless Semantic Scholar fallback (~10 req/min) is only reached on hard entries. Set a Semantic Scholar API key (--s2-api-key or S2_API_KEY) to lift S2 from ~10 to ~60 req/min.

OpenReview owns the submission record for most ML conferences, so it positively confirms ICLR/NeurIPS/TMLR papers that the DOI- and CS-index sources above can only leave in the "could-not-verify" bucket. Retrieval uses fielded title search (CrossRef query.title, OpenAlex title.search) rather than a free-text title+author blob, which keeps DOI-less ML-conference titles ranked correctly.
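
The short-circuiting cascade described above can be sketched roughly as follows. This is an illustration under stated assumptions: each source is modelled as a hypothetical `(name, search_fn)` pair, title similarity uses `difflib` rather than whatever metric the tool uses, and the cross-source author intersection is left out.

```python
from difflib import SequenceMatcher


def title_similarity(a, b):
    """Case-insensitive similarity in [0, 1]."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def cascade_verify(title, sources, top_k=3, threshold=0.95):
    """Query sources in order and stop at the first high-confidence match.

    `sources` is an ordered list of (name, search_fn) pairs, where
    search_fn(title, top_k) returns candidate dicts with a "title" key.
    Returns the best (score, source_name, candidate) seen, or None.
    """
    best = None
    for name, search in sources:
        # Retrieve top-K candidates and re-rank them by title similarity.
        for cand in search(title, top_k)[:top_k]:
            score = title_similarity(title, cand["title"])
            if best is None or score > best[0]:
                best = (score, name, cand)
        if best is not None and best[0] >= threshold:
            break  # short-circuit: slower sources further down are never queried
    return best
```

Ordering the list from fastest to slowest source means the rate-limited fallbacks are only reached for entries that the earlier sources could not match.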

# Verification with top-3 candidates per source
bibtex-check references.bib --top-k 3 --jsonl out.jsonl

# Polite OpenAlex pool (recommended)
bibtex-check references.bib --openalex-mailto you@example.com

A 0–100 numeric confidence_score (additive in the JSONL output) summarizes per-field similarity with explicit penalty/bonus contributions:

Multi-source bonus: +10 when ≥2 sources confirm the same authors
Penalties: title-mismatch -20, author-mismatch -20, journal-mismatch -15, fabricated-author -10 each (capped at -20)
Asymmetric formula for the high-title-low-author chimeric case: confidence = S_title − 0.5 × (100 − S_author)
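
As a worked example of the two rules above that are stated precisely, the sketch below evaluates the asymmetric formula and the capped fabricated-author penalty. Function names are hypothetical, and clamping the result to the 0 to 100 range is an assumption based on the score being described as 0–100; how the remaining bonuses and penalties combine with the base score is not specified here, so it is not modelled.

```python
def chimeric_confidence(s_title, s_author):
    """confidence = S_title - 0.5 * (100 - S_author), clamped to [0, 100] (assumed)."""
    score = s_title - 0.5 * (100 - s_author)
    return max(0.0, min(100.0, score))


def fabricated_author_penalty(n_fabricated):
    """-10 per fabricated author, capped at -20."""
    return max(-20, -10 * n_fabricated)


# A near-perfect title match with a weak author match is pulled down sharply:
# 98 - 0.5 * (100 - 40) = 98 - 30 = 68
print(chimeric_confidence(98, 40))
```

The asymmetry is the point: a high title score alone cannot carry an entry whose authors do not match, which is the signature of a chimeric citation.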

Verdicts: verified vs. could-not-verify vs. problematic

VERIFIED requires every claimed field to be positively confirmed against the matched record — not merely "not contradicted". When a record is found but a claimed field can't be confirmed (e.g. a published venue backed only by a preprint, or an incomplete author list), the entry is reported as could-not-verify (UNCONFIRMED/NOT_FOUND), distinct from a problematic flag (*_mismatch, doi_mismatch, chimeric, …) which is positive evidence of a defect. A "could-not-verify" is not a clean pass: it means the tool couldn't decide, and such entries warrant review.
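
The three-way split can be summarized in a few lines. This is a sketch of the decision rule as described in the paragraph above, not the tool's code: the `Verdict` enum, the `verdict` function, and the per-field outcome strings are hypothetical names chosen for illustration.

```python
from enum import Enum


class Verdict(Enum):
    VERIFIED = "verified"
    COULD_NOT_VERIFY = "could-not-verify"
    PROBLEMATIC = "problematic"


def verdict(field_outcomes):
    """Map per-field outcomes to a verdict.

    `field_outcomes` maps each claimed field to "confirmed", "unconfirmed",
    or "mismatch". An empty mapping stands for "no record found".
    """
    outcomes = list(field_outcomes.values())
    # Any positive evidence of a defect wins over everything else.
    if any(o == "mismatch" for o in outcomes):
        return Verdict.PROBLEMATIC
    # VERIFIED needs every claimed field confirmed, not merely uncontradicted.
    if outcomes and all(o == "confirmed" for o in outcomes):
        return Verdict.VERIFIED
    return Verdict.COULD_NOT_VERIFY
```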

For full transparency, every residual VERIFIED-on-a-real-leak case against the corrected HALLMARK v1.0 gold is enumerated in docs/KNOWN_LEAKS.md, with the perturbation, the default verdict, and the --strict rule that catches it.

Author handling

All sources return authors in as-published order, so a multiset-equal reordering is treated as a real swapped-authors defect — except when the API record is alphabetized (Crossref NeurIPS/ICML proceedings deposits, prefix 10.52202, sort contributors A–Z; that's a record-sort artifact, not a swap). Surname comparison uses each source's structured family field where available (Crossref, OpenAlex, OpenReview ~Given_Family handles), so family-first/CJK names like "Chen Xing" ↔ "Xing Chen" match cleanly; when the matched source lacks structured names, a Crossref structured-name lookup vets a potential author mismatch before reporting it.
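
A rough sketch of the reordering check described above, assuming surnames have already been extracted into lists. The function name and return labels are hypothetical, and detecting an alphabetized record by testing whether its surnames are sorted is a simplification: the text above ties the escape to specific Crossref proceedings deposits rather than to sortedness alone.

```python
from collections import Counter


def author_order_outcome(claimed, record, allow_alphabetized_escape=True):
    """Compare two surname lists given in their stated order."""
    c = [s.casefold() for s in claimed]
    r = [s.casefold() for s in record]
    if c == r:
        return "match"
    if Counter(c) != Counter(r):
        return "different_authors"
    # Same multiset of surnames, different order.
    if allow_alphabetized_escape and r == sorted(r):
        # The record itself is sorted A-Z: treat as a record-sort artifact.
        return "record_alphabetized"
    return "swapped_authors"
```

Passing `allow_alphabetized_escape=False` mimics a stricter policy in which every same-multiset reordering is flagged.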

The lead author's given name is graded via classify_given_pair: diacritic / initial / abbreviation / nickname / transliteration variants pass; a true substitution (e.g. "Shunyu Zhou" vs canonical "Denny Zhou") flags as GIVEN_NAME_SUBSTITUTION. The cross-source author-fabrication check downgrades the author outcome to AUTHOR_MISMATCH when the entry contributes ≥2 surnames absent from every order-reliable candidate's full author set (no "and others" sentinel, ≥2 sources contributing), catching fabricated trailing authors that slip past the prefix-N slice. DBLP-scraped XML entities (&apos;, &amp;) are decoded before any matching, so names such as d'Amore, D'Hondt, and Ch'ng no longer trigger spurious mismatches.
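
The entity-decoding step can be illustrated with the standard library alone. This sketch assumes the entities in question are the usual XML escapes such as `&apos;` and `&amp;`; the helper name `decode_entities` is hypothetical and the package may decode them differently.

```python
import html


def decode_entities(name):
    """Turn XML/HTML character references back into plain characters."""
    return html.unescape(name)


# Without decoding, the scraped and canonical spellings compare unequal.
scraped = "D&apos;Hondt"
canonical = "D'Hondt"
print(scraped == canonical)                   # False
print(decode_entities(scraped) == canonical)  # True
```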

Strict mode (`--strict`)

For high-stakes submissions where the asymmetric cost is leak ≫ FP — arXiv as of May 2026 imposes a 1-year ban for incontrovertible hallucinated references, thereafter requiring submissions to be accepted by a reputable peer-reviewed venue first — --strict (or BIBTEX_CHECK_STRICT=1) tightens the verdict gate:

Title: Levenshtein-1 catches 1-character typos and added/removed hyphens ("Privacy"/"Privacys", "Schema Variable"/"Schema-Variable").
Year: tolerance 0; a preprint-twin record returns STRICT_WARN_PREPRINT_YEAR instead of silently confirming.
Author-set: even a single entry-side surname absent from a single complete canonical record flags AUTHOR_MISMATCH (the default requires ≥2 absent across ≥2 sources, to avoid false positives on stub records).
Author order: the alphabetized-record escape is disabled — every same-multiset reordering on an order-reliable source flags.
Truncated author list without an and others/et al sentinel flags AUTHOR_TRUNCATED (silent truncation is a misrepresentation; an explicit sentinel discloses it).
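
The title rule in the list above relies on an edit distance of one. A minimal sketch of that check is shown below using the classic dynamic-programming Levenshtein distance; the function names are hypothetical and the tool may normalize titles before comparing, which is not modelled here.

```python
def levenshtein(a, b):
    """Edit distance with unit-cost insertion, deletion, and substitution."""
    if len(a) < len(b):
        a, b = b, a
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,               # delete ca
                current[j - 1] + 1,            # insert cb
                previous[j - 1] + (ca != cb),  # substitute (free if equal)
            ))
        previous = current
    return previous[-1]


def is_one_char_diff(claimed, canonical):
    """True when the titles differ by exactly one edit."""
    return levenshtein(claimed, canonical) == 1
```

Both examples from the list are a single edit apart: "Privacy" to "Privacys" is one insertion, and "Schema Variable" to "Schema-Variable" is one substitution.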

The companion --strict-warn-cnv subflag (requires --strict) promotes unconfirmed/not_found to a fourth visible category STRICT_WARN_CNV, so CI integrations can fail on entries the tool couldn't anchor. Default mode keeps the principled three-way verdict unchanged.

# Strict pass for an arXiv submission
bibtex-check references.bib --strict --strict-warn-cnv --jsonl strict.jsonl

Non-generative-AI mode (`--non-generative`)

For venue-policy compliance (ACL ARR, ICML 2026) the --non-generative flag (or BIBTEX_CHECK_NON_GENERATIVE=1 env var) refuses to load any LLM backend at runtime. Today the package has no LLM backends, so this is a forward-compat guard plus a startup banner:

bibtex-check references.bib --non-generative --strict
# bibtex-check running in non-generative mode (no LLM calls).
# Compliant with ICML 2026 / ACL ARR LLM-in-review policies.

Filter Bibliography (`bibtex-filter`)

Zero dependencies: Uses only Python standard library
Works on Overleaf: No pip install needed
Multiple bib files: Merge and filter from multiple sources
Citation detection: Supports natbib, biblatex, and standard LaTeX citations
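
For a sense of how citation detection can work with only the standard library, here is a minimal sketch. It is not the code in filter_bibliography.py: the regular expression and the `cited_keys` helper are assumptions written for this example, and they cover common `\cite`-style commands (including natbib and biblatex variants with optional arguments) rather than every citation macro.

```python
import re

# Matches \cite, \citep, \citet, \textcite, \parencite*, \autocite, ...
# with any number of optional [..] arguments before the {key,list}.
CITE_RE = re.compile(r"\\[A-Za-z]*cite[A-Za-z]*\*?(?:\s*\[[^\]]*\])*\s*\{([^}]*)\}")


def cited_keys(tex):
    """Return the set of citation keys used in a LaTeX source string."""
    # Drop comments: a % that is not escaped as \% runs to the end of the line.
    tex = re.sub(r"(?<!\\)%.*", "", tex)
    keys = set()
    for group in CITE_RE.findall(tex):
        keys.update(k.strip() for k in group.split(",") if k.strip())
    return keys
```

Filtering is then a matter of keeping only the bibliography entries whose key is in the returned set.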

Python API

from bibtex_updater import Detector, Resolver, Updater, HttpClient, RateLimiter, DiskCache

# Create HTTP client with rate limiting and caching
rate_limiter = RateLimiter(req_per_min=30)
cache = DiskCache(".cache.json")
http_client = HttpClient(
    timeout=30.0,
    user_agent="bibtex-updater/0.5.0",
    rate_limiter=rate_limiter,
    cache=cache
)

# Detect preprints (`entry` is a single parsed BibTeX entry loaded beforehand)
detector = Detector()
detection = detector.detect(entry)

if detection.is_preprint:
    # Resolve to published version
    resolver = Resolver(http_client)
    candidate = resolver.resolve(detection)

    if candidate and candidate.confidence >= 0.9:
        # Update the entry
        updater = Updater()
        updated_entry = updater.update_entry(entry, candidate.record, detection)

Development

# Clone and install in development mode
git clone https://github.com/rpatrik96/bibtexupdater.git
cd bibtexupdater
uv sync --extra dev --extra all

# Run tests
uv run pytest tests/ -v

# Run tests with coverage
uv run pytest tests/ -v --cov=bibtex_updater --cov-report=term-missing

# Code quality
pre-commit run --all-files

# Build package
uv build

# Check package
uv run twine check dist/*

License

MIT License - see LICENSE for details.

Project details

Maintainers

rpatrik96

Release history

Version	Release date
1.4.0	Jun 15, 2026
1.3.0 (this version)	Jun 11, 2026
1.2.0	May 30, 2026
1.1.0	May 30, 2026
1.0.0	May 29, 2026
0.10.0	May 28, 2026
0.9.2	May 27, 2026
0.9.1	May 26, 2026
0.9.0	May 8, 2026
0.8.0	Feb 14, 2026
0.7.0	Feb 12, 2026
0.6.1	Feb 10, 2026
0.6.0	Feb 10, 2026
0.5.1	Feb 8, 2026
0.5.0	Feb 6, 2026
0.4.1	Feb 1, 2026
0.4.0	Feb 1, 2026
0.3.0	Feb 1, 2026
0.2.0	Feb 1, 2026
0.1.4	Jan 30, 2026
0.1.3	Jan 30, 2026
0.1.2	Jan 30, 2026
0.1.1	Jan 29, 2026
0.1.0	Jan 26, 2026

Download files

Source Distribution

bibtex_updater-1.3.0.tar.gz (380.3 kB view details)

Uploaded Jun 11, 2026 Source

Built Distribution

bibtex_updater-1.3.0-py3-none-any.whl (217.5 kB view details)

Uploaded Jun 11, 2026 Python 3

File details

Details for the file bibtex_updater-1.3.0.tar.gz.

File metadata

Download URL: bibtex_updater-1.3.0.tar.gz
Upload date: Jun 11, 2026
Size: 380.3 kB
Tags: Source
Uploaded using Trusted Publishing? Yes
Uploaded via: twine/6.1.0 CPython/3.13.12

File hashes

Hashes for bibtex_updater-1.3.0.tar.gz
Algorithm	Hash digest
SHA256	`d2e34ac01fc16639efe12128bce67f04db305fa656081dd1c9d5b5a6bd552b2d`
MD5	`500db14a39bcaa8548e592053ec00c85`
BLAKE2b-256	`5f46c42cf3270970bdcbfda50b423eb8ce3377fe2e4cf273fd3908369feac144`

Provenance

The following attestation bundles were made for bibtex_updater-1.3.0.tar.gz:

Publisher: publish.yml on rpatrik96/bibtexupdater

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

Statement:
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: bibtex_updater-1.3.0.tar.gz
- Subject digest: d2e34ac01fc16639efe12128bce67f04db305fa656081dd1c9d5b5a6bd552b2d
- Sigstore transparency entry: 1790704741
- Sigstore integration time: Jun 11, 2026
Source repository:
- Permalink: rpatrik96/bibtexupdater@bebfd8a8a22b41c34b7ddf91fdd89ad2e8641fca
- Branch / Tag: refs/tags/v1.3.0
- Owner: https://github.com/rpatrik96
- Access: public
Publication detail:
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: publish.yml@bebfd8a8a22b41c34b7ddf91fdd89ad2e8641fca
- Trigger Event: release

File details

Details for the file bibtex_updater-1.3.0-py3-none-any.whl.

File metadata

Download URL: bibtex_updater-1.3.0-py3-none-any.whl
Upload date: Jun 11, 2026
Size: 217.5 kB
Tags: Python 3
Uploaded using Trusted Publishing? Yes
Uploaded via: twine/6.1.0 CPython/3.13.12

File hashes

Hashes for bibtex_updater-1.3.0-py3-none-any.whl
Algorithm	Hash digest
SHA256	`33d259676bf2526a442b8f784b9856c3a00b97f8e267e78430bfd5c7bc966929`
MD5	`46a7c960b4ed4c7de61d28e0d84d067e`
BLAKE2b-256	`ac73e0415873d0ff37a92321d8b74d2f5a8454d2581acaee5e4177b1d75daad8`

Provenance

The following attestation bundles were made for bibtex_updater-1.3.0-py3-none-any.whl:

Publisher: publish.yml on rpatrik96/bibtexupdater

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

Statement:
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: bibtex_updater-1.3.0-py3-none-any.whl
- Subject digest: 33d259676bf2526a442b8f784b9856c3a00b97f8e267e78430bfd5c7bc966929
- Sigstore transparency entry: 1790704850
- Sigstore integration time: Jun 11, 2026
Source repository:
- Permalink: rpatrik96/bibtexupdater@bebfd8a8a22b41c34b7ddf91fdd89ad2e8641fca
- Branch / Tag: refs/tags/v1.3.0
- Owner: https://github.com/rpatrik96
- Access: public
Publication detail:
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: publish.yml@bebfd8a8a22b41c34b7ddf91fdd89ad2e8641fca
- Trigger Event: release
