citefinder

OpenAlex (default) + Crossref reference lookups with local JSONL caching.

A small Python library + CLI for verifying academic references against the OpenAlex and Crossref APIs. Every lookup is appended to an append-only JSONL log so repeated queries (across verification passes or sessions) are served from the cache. Negative results (404s) are cached too, so known-missing DOIs aren't re-hit.

OpenAlex is the default source: it merges Crossref + Unpaywall + ORCID + ROR + repository sources, so it covers what Crossref alone is missing (arXiv DOIs such as 10.48550/arXiv.*, other preprints, repository deposits) and frequently has richer metadata (abstracts, full author lists, affiliations) for records that exist in both. Crossref is still available via the crossref subcommand for its own workflows (book-chapter lookup, the canonical published-deposit metadata).

Configuration: API key and mailto

OpenAlex works without authentication, but a free API key gives you higher limits and tier-specific endpoints. Both Crossref and OpenAlex honor a mailto for their polite pools (faster responses, higher quotas).

Lookup order (CLI), highest priority first:

  1. CLI flag: --api-key, --mailto.
  2. Shell environment: OPENALEX_API_KEY, OPENALEX_MAILTO, CROSSREF_MAILTO.
  3. Project-local .env in the current working directory or any parent.
  4. ~/.config/citefinder/config.toml (honors $XDG_CONFIG_HOME): store it once on this machine.

# ~/.config/citefinder/config.toml
[openalex]
api_key = "your-openalex-key"
mailto = "you@example.com"

[crossref]
mailto = "you@example.com"

The file is plain-text — if your environment is shared, chmod 600 ~/.config/citefinder/config.toml so it's only readable by you. Each section is optional; omit anything you don't need.
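The lookup order above can be sketched as a first-non-empty-wins chain. This is a minimal illustration, not citefinder's actual implementation: the function name resolve_setting and its parameters are hypothetical, the four sources are passed in as plain dicts rather than read from disk, and treating empty strings as "unset" is an assumption.

```python
from typing import Mapping, Optional


def resolve_setting(
    cli_value: Optional[str],
    shell_env: Mapping[str, str],
    dotenv: Mapping[str, str],
    config_section: Mapping[str, str],
    env_name: str,
    config_key: str,
) -> Optional[str]:
    """Return the first non-empty value, highest priority first (hypothetical helper)."""
    candidates = (
        cli_value,                       # 1. CLI flag
        shell_env.get(env_name),         # 2. shell environment
        dotenv.get(env_name),            # 3. project-local .env
        config_section.get(config_key),  # 4. section of config.toml
    )
    for value in candidates:
        if value:  # assumption: None and "" both count as "not set"
            return value
    return None


# No CLI flag and no shell variable, so the .env value beats config.toml.
mailto = resolve_setting(
    cli_value=None,
    shell_env={},
    dotenv={"OPENALEX_MAILTO": "dotenv@example.com"},
    config_section={"mailto": "config@example.com"},
    env_name="OPENALEX_MAILTO",
    config_key="mailto",
)
print(mailto)  # dotenv@example.com
```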

Library users: pass api_key=... and mailto=... to the client constructors explicitly. The config-file fallback is CLI-only (it shouldn't be a surprise side effect of importing the library).

The API key is sent as Authorization: Bearer ..., never as a URL parameter, so it doesn't land in cache keys, logs, or referer headers.
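To make the header-versus-URL distinction concrete, here is a small standard-library sketch that builds (but does not send) a request. The helper build_request is hypothetical and is not how citefinder constructs its requests; it only shows that the key travels in a header while mailto travels in the query string.

```python
from typing import Optional
from urllib.parse import urlencode
from urllib.request import Request


def build_request(base_url: str, api_key: Optional[str], mailto: Optional[str]) -> Request:
    """Build an unsent request: mailto goes in the URL, the key goes in a header."""
    query = {"mailto": mailto} if mailto else {}
    url = f"{base_url}?{urlencode(query)}" if query else base_url
    headers = {"Authorization": f"Bearer {api_key}"} if api_key else {}
    return Request(url, headers=headers)


req = build_request("https://api.openalex.org/works", "secret-key", "you@example.com")
print(req.full_url)                     # the URL carries mailto only
print(req.get_header("Authorization"))  # the key is visible only here
```

Anything derived from the URL alone (a cache key, an access log line) therefore never sees the key.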

Install

uv add citefinder

Or for development:

git clone https://github.com/gitronald/citefinder
cd citefinder
uv sync

Library usage

OpenAlex (default)

from citefinder import OpenAlexClient, is_arxiv_doi, reconstruct_abstract

openalex = OpenAlexClient(
    cache_path="~/.cache/citefinder/openalex.jsonl",
    mailto="you@example.com",  # opts into OpenAlex's polite pool — faster, higher quota
)

# Single DOI (works for arXiv DOIs that Crossref doesn't index)
work = openalex.lookup_doi("10.48550/arXiv.2410.21554")

# Title-only search — tuned for citation verification. Handles OpenAlex's
# curly-apostrophe quirk and strips filter-reserved punctuation that would
# 400 the request, so straight ASCII inputs match curly-quoted indexed titles.
hits = openalex.search_title("Backstabber's Knife Collection", rows=3)

# Free-text search across titles + abstracts (noisier; prefer search_title
# for citation lookup)
hits = openalex.search("fact-checking large language models", rows=3)

# OpenAlex stores abstracts as an inverted index — reconstruct to plain text
abstract = reconstruct_abstract(work) if work else None

# Helper for routing logic
assert is_arxiv_doi("10.48550/arXiv.2410.21554")

The mailto argument is optional but recommended: it puts requests into OpenAlex's polite pool for faster responses. The cache key strips mailto so changing it doesn't invalidate prior entries.
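For readers unfamiliar with the inverted-index abstract format mentioned in the example above: OpenAlex maps each token to the list of positions where it occurs. The sketch below shows one way to turn that back into text. It is an illustration only; rebuild_abstract is a hypothetical name and may differ from what reconstruct_abstract does internally, and the sample index is made up.

```python
from typing import Mapping, Optional, Sequence


def rebuild_abstract(inverted_index: Optional[Mapping[str, Sequence[int]]]) -> Optional[str]:
    """Rebuild plain text from a {token: [positions]} mapping."""
    if not inverted_index:
        return None
    # Flatten to (position, token) pairs, then order by position.
    positioned = [
        (position, token)
        for token, positions in inverted_index.items()
        for position in positions
    ]
    return " ".join(token for _, token in sorted(positioned))


sample_index = {"the": [0, 4], "cat": [1], "sat": [2], "on": [3], "mat": [5]}
print(rebuild_abstract(sample_index))  # the cat sat on the mat
```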

Crossref

from citefinder import CrossrefClient

client = CrossrefClient(
    cache_path="~/.cache/citefinder/crossref.jsonl",
    mailto="you@example.com",  # opts into Crossref's polite pool — faster, higher quota
)

# Single DOI
work = client.lookup_doi("10.1126/science.aap9559")
print(work["title"][0])

# Bibliographic search (author + title + year)
hits = client.search_bibliographic("Wolfowicz hate speech meta-analysis", rows=3)

# Book chapter via {book_doi}.{NNN} pattern
chapter = client.lookup_book_chapter("10.1017/9781108890960", 5)

Crossref and OpenAlex both honor mailto for their polite pools; the cache key strips it on either side, so rotating the email doesn't invalidate prior entries.
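One way to get a cache key that ignores mailto is to drop that parameter before normalizing the URL. The following is a sketch under assumptions, not citefinder's actual key function: cache_key and VOLATILE_PARAMS are hypothetical names, and sorting the remaining parameters is an extra normalization step chosen here for illustration.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

# Parameters that identify the caller rather than the query (assumed set).
VOLATILE_PARAMS = {"mailto"}


def cache_key(url: str) -> str:
    """Return the URL with volatile parameters removed and the rest sorted."""
    parts = urlsplit(url)
    kept = sorted(
        (name, value)
        for name, value in parse_qsl(parts.query, keep_blank_values=True)
        if name not in VOLATILE_PARAMS
    )
    return urlunsplit((parts.scheme, parts.netloc, parts.path, urlencode(kept), ""))


old = cache_key("https://api.crossref.org/works?query=x&rows=3&mailto=a%40example.com")
new = cache_key("https://api.crossref.org/works?mailto=b%40example.com&rows=3&query=x")
print(old == new)  # True: rotating the email maps to the same cache entry
```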

OpenAlex's schema differs from Crossref. Quick map:

| Field | Crossref | OpenAlex |
| --- | --- | --- |
| Title | work["title"][0] (+ optional subtitle[0]) | work["display_name"] |
| First author | work["author"][0]["family"] (surname only) | work["authorships"][0]["author"]["display_name"] (full name; parse for surname) |
| Container | work["container-title"][0] (+ short-container-title) | work["primary_location"]["source"]["display_name"] (+ host_venue on older records) |
| Year | published-print / published-online / issued / created["date-parts"][0][0] | work["publication_year"] (int) |
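The map above can be read as two small adapter functions that produce the same four fields. This sketch is not the library's crossref_to_work / openalex_to_work: the function names, the returned dict shape, the last-word surname heuristic, and the sample records are all assumptions made for illustration.

```python
from typing import Any, Dict, Mapping


def crossref_summary(work: Mapping[str, Any]) -> Dict[str, Any]:
    """Pull title, first-author surname, container and year from Crossref JSON."""
    authors = work.get("author") or []
    year = None
    # Try the date fields in the order listed in the map above.
    for field in ("published-print", "published-online", "issued", "created"):
        parts = (work.get(field) or {}).get("date-parts") or []
        if parts and parts[0] and parts[0][0]:
            year = int(parts[0][0])
            break
    return {
        "title": (work.get("title") or [None])[0],
        "first_author": authors[0].get("family") if authors else None,
        "container": (work.get("container-title") or [None])[0],
        "year": year,
    }


def openalex_summary(work: Mapping[str, Any]) -> Dict[str, Any]:
    """Pull the same four fields from OpenAlex JSON."""
    authorships = work.get("authorships") or []
    full_name = (authorships[0].get("author") or {}).get("display_name") if authorships else None
    source = (work.get("primary_location") or {}).get("source") or {}
    return {
        "title": work.get("display_name"),
        # Naive heuristic: treat the last whitespace-separated word as the surname.
        "first_author": full_name.split()[-1] if full_name else None,
        "container": source.get("display_name"),
        "year": work.get("publication_year"),
    }


# Made-up records shaped like each API's response.
crossref_record = {
    "title": ["A Study"],
    "author": [{"family": "Doe", "given": "Jane"}],
    "container-title": ["Journal X"],
    "issued": {"date-parts": [[2018, 3, 9]]},
}
openalex_record = {
    "display_name": "A Study",
    "authorships": [{"author": {"display_name": "Jane Doe"}}],
    "primary_location": {"source": {"display_name": "Journal X"}},
    "publication_year": 2018,
}
print(crossref_summary(crossref_record) == openalex_summary(openalex_record))  # True
```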

Bib verification

A .bib file can be parsed and verified against either source end-to-end:

from citefinder import (
    OpenAlexClient,
    Source,
    parse_entries,
    verify_entry,
)

source = Source(name="openalex", client=OpenAlexClient(cache_path="cache.jsonl"))

for entry in parse_entries(open("refs.bib").read()):
    result = verify_entry(entry, source)
    print(result.key, result.status, result.matched_doi)

Each Result reports a Status (matched / probable / mismatch / unmatched / doi-not-found / skip-source / error) plus the four signals — title, year, first-author surname, container — that drove the verdict. BibCitation and Work are the canonical shapes; crossref_to_work and openalex_to_work adapt source-specific JSON into Work. See citefinder/signals.py for the signal-check thresholds.
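To give a feel for what a four-signal comparison might look like, here is a rough standalone sketch. It does not reproduce citefinder/signals.py: the function names, the 0.9 title threshold, the normalization rule, and the exact-match checks for the other three signals are all assumptions, and the records are invented.

```python
import re
from difflib import SequenceMatcher
from typing import Any, Dict, Mapping


def normalize(text: str) -> str:
    """Lowercase, replace runs of non-alphanumerics with a space, trim."""
    return " ".join(re.sub(r"[^a-z0-9]+", " ", text.lower()).split())


def compare_signals(
    cited: Mapping[str, Any],
    found: Mapping[str, Any],
    title_threshold: float = 0.9,  # hypothetical threshold
) -> Dict[str, bool]:
    """Return one boolean per signal: title, year, first author, container."""
    title_ratio = SequenceMatcher(
        None, normalize(cited["title"]), normalize(found["title"])
    ).ratio()
    return {
        "title": title_ratio >= title_threshold,
        "year": cited["year"] == found["year"],
        "first_author": normalize(cited["first_author"]) == normalize(found["first_author"]),
        "container": normalize(cited["container"]) == normalize(found["container"]),
    }


cited = {"title": "An Example Title: Part One", "year": 2020,
         "first_author": "Doe", "container": "Journal of Examples"}
found = {"title": "an example title - part one", "year": 2021,
         "first_author": "doe", "container": "Journal of Examples"}
signals = compare_signals(cited, found)
print(signals)
```

In this invented case three signals agree and the year does not; how such a pattern maps onto a Status is decided by the library, not by this sketch.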

Bib ↔ table

A .bib file can be loaded into a wide polars DataFrame (one row per entry, one column per field) for inspection or bulk editing, then serialized back:

from citefinder import bib_to_table, table_to_bib

df = bib_to_table(open("refs.bib").read())   # key, entry_type, then fields alphabetical
new_bib = table_to_bib(df)                    # back to .bib, null cells skipped

bib_to_table lowercases field keys (DOI becomes doi) and stores the entry kind in entry_type to avoid collision with the literal type field that some entries carry (e.g., SSRN papers set type = {SSRN Scholarly Paper}). table_to_bib requires key and entry_type columns and serializes the rest in column order. The round-trip is lossless on field values and entry types; the original within-entry field order and any source-file @string/@comment blocks are not preserved.

CLI usage

# OpenAlex (default)
citefinder doi 10.48550/arXiv.2410.21554 --mailto you@example.com
citefinder search "Backstabber's Knife Collection" --rows 3

# Crossref
citefinder crossref doi 10.1126/science.aap9559 --mailto you@example.com
citefinder crossref search "Wolfowicz hate speech meta-analysis" --rows 3
citefinder crossref chapter 10.1017/9781108890960 5

# .bib parsing & verification
citefinder parse refs.bib                                # CSV to stdout (no network)
citefinder parse refs.bib --out parsed.csv               # ...or to a file
citefinder verify refs.bib                               # full pipeline (defaults to OpenAlex)
citefinder verify refs.bib --source crossref             # ...or against Crossref
citefinder verify refs.bib --out path/to/output/dir/     # custom output directory

# .bib ↔ table
citefinder bib-to-table refs.bib                            # wide polars table to terminal
citefinder bib-to-table refs.bib --csv > refs.csv           # ...or CSV to stdout
citefinder bib-to-table refs.bib --fields title,year,doi    # subset of columns
citefinder table-to-bib refs.csv                            # CSV back to .bib on stdout
citefinder table-to-bib refs.csv --out refs.regen.bib       # ...or to a file

parse emits a CSV with columns key, etype, title, author, year, doi, container where author is the first-author surname (the form used downstream for matching) and container is the entry's journal or booktitle.

verify walks each entry: if a doi field is present it resolves the DOI; otherwise it searches by author + title + year. Each result is checked against four signals (title, year, first-author surname, container) and bucketed by status. Output goes to data/citefinder/<bib-stem>/<source>/: a <source>.jsonl cache and a structured results.json. Re-running is cheap — every cache hit is served from disk.

bib-to-table and table-to-bib are inverses: the first turns a .bib into a wide table (terminal view by default, --csv for piping), the second reads such a CSV back into a .bib. Useful for spreadsheet-style review or bulk edits before regenerating the file. The round-trip is lossless on data; within-entry field order and source-file formatting are not preserved.

CLI arguments

  • --cache PATH — JSONL cache path. Defaults to ~/.cache/citefinder/openalex.jsonl for top-level commands and ~/.cache/citefinder/crossref.jsonl for crossref subcommands. Separate files so sources don't mix; override per command if you want per-project caches (e.g., --cache ./data/refs.jsonl).
  • --rows N (search only) — Number of results to return. Default 3.
  • --mailto EMAIL — Opts the request into the source's polite pool (both OpenAlex and Crossref honor it): faster responses and a higher quota. Sent as a ?mailto=… query param; stripped from the cache key, so rotating the email doesn't invalidate prior entries.
  • --api-key KEY (OpenAlex only) — OpenAlex API key for higher rate limits and tier-specific endpoints. Also read from OPENALEX_API_KEY in the env or a .env file (loaded from cwd or any parent). Sent as Authorization: Bearer <key> so it never lands in cache keys, URL logs, or referer headers.

Why JSONL?

The cache is an append-only log: every lookup is one JSON object per line. Benefits:

  • Auditable: cat/grep to see every query that ever ran.
  • Diffable: plays nicely with git if you want to commit a project's cache.
  • Crash-safe: an interrupted write loses at most the last line.
  • Recoverable: rebuild the in-memory dict by replaying the log.

On replay, the latest value for a key wins, so overwriting an entry is just another append and needs no special handling.
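The append-and-replay idea can be sketched in a few lines of standard-library Python. This is an illustration, not the JsonlCache class from citefinder/cache.py: the class name TinyJsonlCache, the {"key": ..., "value": ...} record layout, and the example keys are assumptions.

```python
import json
import tempfile
from pathlib import Path
from typing import Any, Dict


class TinyJsonlCache:
    """Append-only JSONL cache; the in-memory dict is rebuilt by replaying the log."""

    def __init__(self, path: Path) -> None:
        self.path = Path(path)
        self._data: Dict[str, Any] = {}
        if self.path.exists():
            with self.path.open(encoding="utf-8") as fh:
                for line in fh:
                    try:
                        record = json.loads(line)
                    except json.JSONDecodeError:
                        continue  # skip blank or truncated lines, e.g. after a crash
                    self._data[record["key"]] = record["value"]  # latest value wins

    def __contains__(self, key: str) -> bool:
        return key in self._data

    def get(self, key: str) -> Any:
        return self._data.get(key)

    def put(self, key: str, value: Any) -> None:
        self.path.parent.mkdir(parents=True, exist_ok=True)
        with self.path.open("a", encoding="utf-8") as fh:
            fh.write(json.dumps({"key": key, "value": value}) + "\n")
        self._data[key] = value


with tempfile.TemporaryDirectory() as tmp:
    log = Path(tmp) / "cache.jsonl"
    cache = TinyJsonlCache(log)
    cache.put("doi:10.1000/example", {"title": "First"})
    cache.put("doi:10.1000/example", {"title": "Second"})  # overwrite is another append
    cache.put("doi:10.1000/missing", None)                 # negative result, cached too
    reloaded = TinyJsonlCache(log)                         # replay from disk
    line_count = len(log.read_text(encoding="utf-8").splitlines())

print(reloaded.get("doi:10.1000/example"))  # {'title': 'Second'}
print("doi:10.1000/missing" in reloaded)    # True
print(line_count)                           # 3
```

Using `in` rather than `get` is what distinguishes a cached negative result (key present, value None) from a key that was never looked up.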

SQLite alternative. A SQLite-backed cache is another reasonable implementation: it would trade the audit log and grep-ability for faster random access on very large caches (millions of entries) and support for concurrent writers. The current scale of citefinder use (per-project bibs, tens of thousands of entries at most) doesn't need it, and replaying a JSONL on startup is fast enough that the simplicity wins. If a future workload pushes past those limits, swapping the storage layer means replacing a single class, JsonlCache in citefinder/cache.py, behind the same get / put / __contains__ interface.
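As a rough idea of what such a swap could look like, here is a SQLite-backed class exposing the same three methods. It is a hypothetical sketch, not part of citefinder: the class name SqliteCache, the table layout, and storing values as JSON text are all assumptions.

```python
import json
import sqlite3
from typing import Any


class SqliteCache:
    """Key/value cache with get / put / __contains__, backed by SQLite."""

    def __init__(self, path: str = ":memory:") -> None:
        self._conn = sqlite3.connect(path)
        self._conn.execute(
            "CREATE TABLE IF NOT EXISTS cache (key TEXT PRIMARY KEY, value TEXT NOT NULL)"
        )
        self._conn.commit()

    def __contains__(self, key: str) -> bool:
        row = self._conn.execute("SELECT 1 FROM cache WHERE key = ?", (key,)).fetchone()
        return row is not None

    def get(self, key: str) -> Any:
        row = self._conn.execute("SELECT value FROM cache WHERE key = ?", (key,)).fetchone()
        return json.loads(row[0]) if row else None

    def put(self, key: str, value: Any) -> None:
        # INSERT OR REPLACE overwrites in place, so no history is kept.
        self._conn.execute(
            "INSERT OR REPLACE INTO cache (key, value) VALUES (?, ?)",
            (key, json.dumps(value)),
        )
        self._conn.commit()


db = SqliteCache()  # in-memory database for the demo
db.put("doi:10.1000/example", {"title": "First"})
db.put("doi:10.1000/example", {"title": "Second"})
db.put("doi:10.1000/missing", None)  # stored as the JSON text "null"
print(db.get("doi:10.1000/example"))  # {'title': 'Second'}
print("doi:10.1000/missing" in db)    # True
```

The overwrite in put is where the audit trail is lost: the earlier value is replaced rather than kept as a prior log line.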

Tests

uv run pytest
