Skip to main content

No project description provided

Project description

citefinder

OpenAlex (default) + Crossref reference lookups with local JSONL caching.

A small Python library + CLI for verifying academic references against the OpenAlex and Crossref APIs. Every lookup is appended to an append-only JSONL log so repeated queries (across verification passes or sessions) are served from the cache. Negative results (404s) are cached too, so known-missing DOIs aren't re-hit.

OpenAlex is the default source: it merges Crossref + Unpaywall + ORCID + ROR

  • repository sources, so it covers what Crossref alone is missing — arXiv DOIs (10.48550/arXiv.*), other preprints, repository deposits — and frequently has richer metadata (abstracts, full author lists, affiliations) for records that exist in both. Crossref is still available via the crossref subcommand for its own workflows (book-chapter lookup, the canonical published-deposit metadata).

OpenAlex API key (optional)

OpenAlex works without authentication, but a free API key gives you higher limits and tier-specific endpoints.

The key is read in this order:

  1. api_key=... argument to OpenAlexClient(...) (or --api-key on the CLI).
  2. OPENALEX_API_KEY environment variable.
  3. A .env file in the current working directory or any parent (loaded by the CLI; library users can opt in via from dotenv import load_dotenv).
# .env
OPENALEX_API_KEY=oa_pk_...

The key is sent as Authorization: Bearer ..., never as a URL parameter, so it doesn't land in cache keys, logs, or referer headers.

Install

uv add citefinder

Or for development:

git clone https://github.com/gitronald/citefinder
cd citefinder
uv sync

Library usage

OpenAlex (default)

from citefinder import OpenAlexClient, is_arxiv_doi, reconstruct_abstract

openalex = OpenAlexClient(
    cache_path="~/.cache/citefinder/openalex.jsonl",
    mailto="you@example.com",  # opts into OpenAlex's polite pool — faster, higher quota
)

# Single DOI (works for arXiv DOIs that Crossref doesn't index)
work = openalex.lookup_doi("10.48550/arXiv.2410.21554")

# Title-only search — tuned for citation verification. Handles OpenAlex's
# curly-apostrophe quirk and strips filter-reserved punctuation that would
# 400 the request, so straight ASCII inputs match curly-quoted indexed titles.
hits = openalex.search_title("Backstabber's Knife Collection", rows=3)

# Free-text search across titles + abstracts (noisier; prefer search_title
# for citation lookup)
hits = openalex.search("fact-checking large language models", rows=3)

# OpenAlex stores abstracts as an inverted index — reconstruct to plain text
abstract = reconstruct_abstract(work) if work else None

# Helper for routing logic
assert is_arxiv_doi("10.48550/arXiv.2410.21554")

The mailto argument is optional but recommended: it puts requests into OpenAlex's polite pool for faster responses. The cache key strips mailto so changing it doesn't invalidate prior entries.

Crossref

from citefinder import CrossrefClient

client = CrossrefClient(
    cache_path="~/.cache/citefinder/crossref.jsonl",
    mailto="you@example.com",  # opts into Crossref's polite pool — faster, higher quota
)

# Single DOI
work = client.lookup_doi("10.1126/science.aap9559")
print(work["title"][0])

# Bibliographic search (author + title + year)
hits = client.search_bibliographic("Wolfowicz hate speech meta-analysis", rows=3)

# Book chapter via {book_doi}.{NNN} pattern
chapter = client.lookup_book_chapter("10.1017/9781108890960", 5)

Crossref and OpenAlex both honor mailto for their polite pools; the cache key strips it on either side, so rotating the email doesn't invalidate prior entries.

OpenAlex's schema differs from Crossref. Quick map:

Field Crossref OpenAlex
Title work["title"][0] (+ optional subtitle[0]) work["display_name"]
First author work["author"][0]["family"] (surname only) work["authorships"][0]["author"]["display_name"] (full name — parse for surname)
Container work["container-title"][0] (+ short-container-title) work["primary_location"]["source"]["display_name"] (+ host_venue on older records)
Year published-print / published-online / issued / created["date-parts"][0][0] work["publication_year"] (int)

CLI usage

# OpenAlex (default)
citefinder doi 10.48550/arXiv.2410.21554 --mailto you@example.com
citefinder search "Backstabber's Knife Collection" --rows 3

# Crossref
citefinder crossref doi 10.1126/science.aap9559 --mailto you@example.com
citefinder crossref search "Wolfowicz hate speech meta-analysis" --rows 3
citefinder crossref chapter 10.1017/9781108890960 5

CLI arguments

  • --cache PATH — JSONL cache path. Defaults to ~/.cache/citefinder/openalex.jsonl for top-level commands and ~/.cache/citefinder/crossref.jsonl for crossref subcommands. Separate files so sources don't mix; override per command if you want per-project caches (e.g., --cache ./data/refs.jsonl).
  • --rows N (search only) — Number of results to return. Default 3.
  • --mailto EMAIL — Opts the request into the source's polite pool (both OpenAlex and Crossref honor it): faster responses and a higher quota. Sent as a ?mailto=… query param; stripped from the cache key, so rotating the email doesn't invalidate prior entries.
  • --api-key KEY (OpenAlex only) — OpenAlex API key for higher rate limits and tier-specific endpoints. Also read from OPENALEX_API_KEY in the env or a .env file (loaded from cwd or any parent). Sent as Authorization: Bearer <key> so it never lands in cache keys, URL logs, or referer headers.

Why JSONL?

The cache is an append-only log: every lookup is one JSON object per line. Benefits:

  • Auditable: cat/grep to see every query that ever ran.
  • Diffable: plays nicely with git if you want to commit a project's cache.
  • Crash-safe: an interrupted write loses at most the last line.
  • Recoverable: rebuild the in-memory dict by replaying the log.

Latest value wins on replay, so over-writes are a no-op semantic.

SQLite alternative. A SQLite-backed cache is another reasonable implementation — it would trade the audit log and grep-ability for faster random access on very large caches (millions of entries) and concurrent writers. The current scale of citefinder use (per-project bibs, tens of thousands of entries at most) doesn't need it, and replaying a JSONL on startup is fast enough that the simplicity wins. If a future workload pushes past those limits, swapping the storage layer is a single class — JsonlCache in citefinder/cache.py — behind the same get / put / __contains__ interface.

Tests

uv run pytest

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

citefinder-0.3.0.tar.gz (59.4 kB view details)

Uploaded Source

Built Distribution

If you're not sure about the file name format, learn more about wheel file names.

citefinder-0.3.0-py3-none-any.whl (13.2 kB view details)

Uploaded Python 3

File details

Details for the file citefinder-0.3.0.tar.gz.

File metadata

  • Download URL: citefinder-0.3.0.tar.gz
  • Upload date:
  • Size: 59.4 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? Yes
  • Uploaded via: twine/6.1.0 CPython/3.13.12

File hashes

Hashes for citefinder-0.3.0.tar.gz
Algorithm Hash digest
SHA256 28d8b2629a878dc1d1f94399ac24474f739debd2352249c129cb6711d2f11497
MD5 73fa295358f8251365c21921987796af
BLAKE2b-256 de46bf196469f1cbe45cc4675487f09731c7b215866aeb122f87f5ebeb9ca11d

See more details on using hashes here.

Provenance

The following attestation bundles were made for citefinder-0.3.0.tar.gz:

Publisher: publish.yml on gitronald/citefinder

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

File details

Details for the file citefinder-0.3.0-py3-none-any.whl.

File metadata

  • Download URL: citefinder-0.3.0-py3-none-any.whl
  • Upload date:
  • Size: 13.2 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? Yes
  • Uploaded via: twine/6.1.0 CPython/3.13.12

File hashes

Hashes for citefinder-0.3.0-py3-none-any.whl
Algorithm Hash digest
SHA256 37acedf737e3c8a908999ad834dd8a33b46eb04cb2078719dae5e8712ed51e79
MD5 e5f514c24481098bcc60a9ede6425195
BLAKE2b-256 efbe5d9dd367a5121076ad65351637441fcf0564f931c66092e16e460b7a19b8

See more details on using hashes here.

Provenance

The following attestation bundles were made for citefinder-0.3.0-py3-none-any.whl:

Publisher: publish.yml on gitronald/citefinder

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

Supported by

AWS Cloud computing and Security Sponsor Datadog Monitoring Depot Continuous Integration Fastly CDN Google Download Analytics Pingdom Monitoring Sentry Error logging StatusPage Status page