citefinder
OpenAlex (default) + Crossref reference lookups with local JSONL caching.
A small Python library + CLI for verifying academic references against the OpenAlex and Crossref APIs. Every lookup is appended to an append-only JSONL log so repeated queries (across verification passes or sessions) are served from the cache. Negative results (404s) are cached too, so known-missing DOIs aren't re-hit.
OpenAlex is the default source: it merges Crossref + Unpaywall + ORCID + ROR + repository sources, so it covers what Crossref alone is missing: arXiv DOIs (10.48550/arXiv.*), other preprints, and repository deposits. It also frequently has richer metadata (abstracts, full author lists, affiliations) for records that exist in both. Crossref is still available via the crossref subcommand for its own workflows (book-chapter lookup, the canonical published-deposit metadata).
OpenAlex API key (optional)
OpenAlex works without authentication, but a free API key gives you higher limits and tier-specific endpoints.
- Docs: https://developers.openalex.org/
- Sign up / generate a key: https://openalex.org/login?redirect=/settings/api-key
The key is read in this order:
1. api_key=... argument to OpenAlexClient(...) (or --api-key on the CLI).
2. OPENALEX_API_KEY environment variable.
3. A .env file in the current working directory or any parent (loaded by the CLI; library users can opt in via from dotenv import load_dotenv).
# .env
OPENALEX_API_KEY=oa_pk_...
The key is sent as Authorization: Bearer ..., never as a URL parameter, so
it doesn't land in cache keys, logs, or referer headers.
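The lookup order and the header-only transport can be sketched as follows. This is a minimal illustration, not citefinder's actual code: the function names resolve_api_key and auth_headers are hypothetical. The .env step is assumed to happen earlier by calling python-dotenv's load_dotenv(), which by default does not override variables already set in the environment, so the order above is preserved.

```python
import os


def resolve_api_key(explicit=None, env=None):
    """Return the first key found: explicit argument, then environment.

    Hypothetical helper; `env` defaults to os.environ and is injectable
    so the precedence can be checked without touching the real environment.
    """
    env = os.environ if env is None else env
    if explicit:
        return explicit
    return env.get("OPENALEX_API_KEY") or None


def auth_headers(api_key):
    """Build request headers; the key goes in a header, never in the URL."""
    return {"Authorization": f"Bearer {api_key}"} if api_key else {}


# An explicit argument wins over the environment variable.
key = resolve_api_key("from-arg", {"OPENALEX_API_KEY": "from-env"})
print(auth_headers(key))
```

Because the key lives only in the headers, a cache key derived from the request URL cannot contain it.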
Install
uv add citefinder
Or for development:
git clone https://github.com/gitronald/citefinder
cd citefinder
uv sync
Library usage
OpenAlex (default)
from citefinder import OpenAlexClient, is_arxiv_doi, reconstruct_abstract
openalex = OpenAlexClient(
    cache_path="~/.cache/citefinder/openalex.jsonl",
    mailto="you@example.com",  # opts into OpenAlex's polite pool: faster, higher quota
)
# Single DOI (works for arXiv DOIs that Crossref doesn't index)
work = openalex.lookup_doi("10.48550/arXiv.2410.21554")
# Title-only search — tuned for citation verification. Handles OpenAlex's
# curly-apostrophe quirk and strips filter-reserved punctuation that would
# 400 the request, so straight ASCII inputs match curly-quoted indexed titles.
hits = openalex.search_title("Backstabber's Knife Collection", rows=3)
# Free-text search across titles + abstracts (noisier; prefer search_title
# for citation lookup)
hits = openalex.search("fact-checking large language models", rows=3)
# OpenAlex stores abstracts as an inverted index — reconstruct to plain text
abstract = reconstruct_abstract(work) if work else None
# Helper for routing logic
assert is_arxiv_doi("10.48550/arXiv.2410.21554")
The mailto argument is optional but recommended: it puts requests into OpenAlex's polite pool for faster responses. The cache key strips mailto so changing it doesn't invalidate prior entries.
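For readers curious what reconstructing an abstract involves: OpenAlex returns abstracts in an abstract_inverted_index field that maps each word to the list of positions where it occurs. A rough sketch of the inversion is below. The function name rebuild_abstract and the toy input are illustrative only and are not taken from citefinder's source.

```python
def rebuild_abstract(inverted_index):
    """Turn {"word": [positions, ...]} back into a plain-text string."""
    if not inverted_index:
        return None
    # Flatten to (position, word) pairs, then sort by position.
    positioned = [
        (pos, word)
        for word, positions in inverted_index.items()
        for pos in positions
    ]
    return " ".join(word for _, word in sorted(positioned))


# Toy inverted index, not real OpenAlex data.
toy = {"the": [0, 3], "cat": [1], "chased": [2], "dog": [4]}
print(rebuild_abstract(toy))  # the cat chased the dog
```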
Crossref
from citefinder import CrossrefClient
client = CrossrefClient(
    cache_path="~/.cache/citefinder/crossref.jsonl",
    mailto="you@example.com",  # opts into Crossref's polite pool: faster, higher quota
)
# Single DOI
work = client.lookup_doi("10.1126/science.aap9559")
print(work["title"][0])
# Bibliographic search (author + title + year)
hits = client.search_bibliographic("Wolfowicz hate speech meta-analysis", rows=3)
# Book chapter via {book_doi}.{NNN} pattern
chapter = client.lookup_book_chapter("10.1017/9781108890960", 5)
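The {book_doi}.{NNN} pattern mentioned in the comment above can be read as "book DOI, a dot, then the chapter number". The sketch below assumes NNN means a zero-padded three-digit number. That padding is an assumption, the helper name chapter_doi is hypothetical, and publishers are not guaranteed to follow the pattern.

```python
def chapter_doi(book_doi, chapter, width=3):
    """Compose a candidate chapter DOI from a book DOI and chapter number.

    Assumes the suffix is the chapter number zero-padded to `width` digits.
    """
    return f"{book_doi}.{chapter:0{width}d}"


print(chapter_doi("10.1017/9781108890960", 5))  # 10.1017/9781108890960.005
```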
Crossref and OpenAlex both honor mailto for their polite pools; the cache
key strips it on either side, so rotating the email doesn't invalidate prior
entries.
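One way to get a cache key that ignores mailto is to normalize the request URL before using it as the key. The following is a sketch under that assumption; citefinder's real key derivation may differ, and cache_key and IGNORED_PARAMS are illustrative names.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

# Parameters that identify the caller rather than the query.
IGNORED_PARAMS = {"mailto"}


def cache_key(url):
    """Drop caller-identifying params and sort the rest for a stable key."""
    parts = urlsplit(url)
    kept = sorted(
        (name, value)
        for name, value in parse_qsl(parts.query, keep_blank_values=True)
        if name not in IGNORED_PARAMS
    )
    return urlunsplit((parts.scheme, parts.netloc, parts.path, urlencode(kept), ""))


a = cache_key("https://api.crossref.org/works?rows=3&mailto=a%40example.com&query=x")
b = cache_key("https://api.crossref.org/works?query=x&mailto=b%40example.com&rows=3")
print(a == b)  # True
```

Sorting the remaining parameters also makes the key independent of argument order.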
OpenAlex's schema differs from Crossref. Quick map:
| Field | Crossref | OpenAlex |
|---|---|---|
| Title | work["title"][0] (+ optional subtitle[0]) | work["display_name"] |
| First author | work["author"][0]["family"] (surname only) | work["authorships"][0]["author"]["display_name"] (full name; parse for surname) |
| Container | work["container-title"][0] (+ short-container-title) | work["primary_location"]["source"]["display_name"] (+ host_venue on older records) |
| Year | published-print / published-online / issued / created → ["date-parts"][0][0] | work["publication_year"] (int) |
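If you compare records from both sources, it can help to flatten them into one shape first. The sketch below follows the field map in the table. The helper names and the output keys are hypothetical and not part of citefinder, and the OpenAlex first author is left as a full display name rather than parsed down to a surname.

```python
def first(items):
    """Return the first element of a list, or None if it is missing or empty."""
    return items[0] if items else None


def crossref_year(work):
    """Try the Crossref date fields in the order listed in the table."""
    for field in ("published-print", "published-online", "issued", "created"):
        parts = (work.get(field) or {}).get("date-parts") or []
        if parts and parts[0] and parts[0][0] is not None:
            return parts[0][0]
    return None


def normalize_crossref(work):
    authors = work.get("author") or []
    return {
        "title": first(work.get("title")),
        "first_author": authors[0].get("family") if authors else None,
        "container": first(work.get("container-title")),
        "year": crossref_year(work),
    }


def normalize_openalex(work):
    authorships = work.get("authorships") or []
    source = (work.get("primary_location") or {}).get("source") or {}
    return {
        "title": work.get("display_name"),
        # Full display name; surname parsing is left to the caller.
        "first_author": authorships[0]["author"]["display_name"] if authorships else None,
        "container": source.get("display_name"),
        "year": work.get("publication_year"),
    }
```

Missing or null fields fall through to None rather than raising, which is convenient when a source returns a sparse record.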
CLI usage
# OpenAlex (default)
citefinder doi 10.48550/arXiv.2410.21554 --mailto you@example.com
citefinder search "Backstabber's Knife Collection" --rows 3
# Crossref
citefinder crossref doi 10.1126/science.aap9559 --mailto you@example.com
citefinder crossref search "Wolfowicz hate speech meta-analysis" --rows 3
citefinder crossref chapter 10.1017/9781108890960 5
CLI arguments
- --cache PATH: JSONL cache path. Defaults to ~/.cache/citefinder/openalex.jsonl for top-level commands and ~/.cache/citefinder/crossref.jsonl for crossref subcommands. The files are separate so sources don't mix; override per command if you want per-project caches (e.g., --cache ./data/refs.jsonl).
- --rows N (search only): Number of results to return. Default 3.
- --mailto EMAIL: Opts the request into the source's polite pool (both OpenAlex and Crossref honor it) for faster responses and a higher quota. Sent as a ?mailto=… query param; stripped from the cache key, so rotating the email doesn't invalidate prior entries.
- --api-key KEY (OpenAlex only): OpenAlex API key for higher rate limits and tier-specific endpoints. Also read from OPENALEX_API_KEY in the env or a .env file (loaded from cwd or any parent). Sent as Authorization: Bearer <key> so it never lands in cache keys, URL logs, or referer headers.
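To make the per-source cache default concrete, here is one way such a command layout could be wired with argparse. This is a sketch that mirrors the commands and flags documented above. It is not citefinder's actual CLI code, and the positional names (doi, query, book_doi, chapter) and the parse helper are assumptions.

```python
import argparse

DEFAULT_CACHE = {
    "openalex": "~/.cache/citefinder/openalex.jsonl",
    "crossref": "~/.cache/citefinder/crossref.jsonl",
}


def build_parser():
    # Flags shared by every leaf command.
    shared = argparse.ArgumentParser(add_help=False)
    shared.add_argument("--cache")
    shared.add_argument("--mailto")

    parser = argparse.ArgumentParser(prog="citefinder")
    top = parser.add_subparsers(dest="command", required=True)

    # Top-level commands use OpenAlex.
    doi = top.add_parser("doi", parents=[shared])
    doi.add_argument("doi")
    doi.add_argument("--api-key")
    search = top.add_parser("search", parents=[shared])
    search.add_argument("query")
    search.add_argument("--rows", type=int, default=3)
    search.add_argument("--api-key")

    # Crossref commands live under their own subcommand.
    crossref = top.add_parser("crossref")
    cr = crossref.add_subparsers(dest="crossref_command", required=True)
    cr.add_parser("doi", parents=[shared]).add_argument("doi")
    cr_search = cr.add_parser("search", parents=[shared])
    cr_search.add_argument("query")
    cr_search.add_argument("--rows", type=int, default=3)
    chapter = cr.add_parser("chapter", parents=[shared])
    chapter.add_argument("book_doi")
    chapter.add_argument("chapter", type=int)
    return parser


def parse(argv):
    """Parse argv and fill in the source-specific cache default."""
    args = build_parser().parse_args(argv)
    source = "crossref" if args.command == "crossref" else "openalex"
    if args.cache is None:
        args.cache = DEFAULT_CACHE[source]
    return args


print(parse(["crossref", "search", "hate speech meta-analysis"]).cache)
```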
Why JSONL?
The cache is an append-only log: every lookup is one JSON object per line. Benefits:
- Auditable: cat/grep to see every query that ever ran.
- Diffable: plays nicely with git if you want to commit a project's cache.
- Crash-safe: an interrupted write loses at most the last line.
- Recoverable: rebuild the in-memory dict by replaying the log.
On replay the latest value for a key wins, so overwriting an entry needs no special handling: it is just another appended line.
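The append and replay behavior can be illustrated in a few lines. This is a minimal sketch, not the library's JsonlCache: the record layout ({"key": ..., "value": ...}) and the function names are assumptions made for the example. A value of None stands in for a cached negative result.

```python
import json
from pathlib import Path


def append_entry(path, key, value):
    """Append one lookup as a single JSON line; earlier lines are never rewritten."""
    path = Path(path)
    path.parent.mkdir(parents=True, exist_ok=True)
    with path.open("a", encoding="utf-8") as fh:
        fh.write(json.dumps({"key": key, "value": value}) + "\n")


def replay(path):
    """Rebuild the in-memory dict from the log; later lines overwrite earlier ones."""
    entries = {}
    path = Path(path)
    if not path.exists():
        return entries
    with path.open(encoding="utf-8") as fh:
        for line in fh:
            try:
                record = json.loads(line)
            except json.JSONDecodeError:
                continue  # torn line from an interrupted write, or a blank line
            if isinstance(record, dict) and "key" in record:
                entries[record["key"]] = record.get("value")
    return entries
```

Because a negative result is stored as a key with a None value, a membership check (key in entries) is what distinguishes "known missing" from "never looked up".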
SQLite alternative. A SQLite-backed cache is another reasonable
implementation — it would trade the audit log and grep-ability for faster
random access on very large caches (millions of entries) and concurrent
writers. The current scale of citefinder use (per-project bibs, tens of
thousands of entries at most) doesn't need it, and replaying a JSONL on
startup is fast enough that the simplicity wins. If a future workload pushes
past those limits, swapping the storage layer is a single class — JsonlCache
in citefinder/cache.py — behind the same get / put / __contains__
interface.
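As a rough idea of what such a swap could look like, here is a SQLite-backed class exposing get / put / __contains__. The exact method signatures of JsonlCache are not documented here, so the signatures below, the class name SqliteCache, and the table layout are all assumptions.

```python
import json
import sqlite3


class SqliteCache:
    """Hypothetical drop-in cache storing JSON-encoded values in SQLite."""

    def __init__(self, path=":memory:"):
        self._db = sqlite3.connect(path)
        self._db.execute(
            "CREATE TABLE IF NOT EXISTS cache (key TEXT PRIMARY KEY, value TEXT NOT NULL)"
        )
        self._db.commit()

    def put(self, key, value):
        # INSERT OR REPLACE keeps "latest value wins" semantics.
        self._db.execute(
            "INSERT OR REPLACE INTO cache (key, value) VALUES (?, ?)",
            (key, json.dumps(value)),
        )
        self._db.commit()

    def get(self, key, default=None):
        row = self._db.execute(
            "SELECT value FROM cache WHERE key = ?", (key,)
        ).fetchone()
        return default if row is None else json.loads(row[0])

    def __contains__(self, key):
        row = self._db.execute(
            "SELECT 1 FROM cache WHERE key = ?", (key,)
        ).fetchone()
        return row is not None


cache = SqliteCache()
cache.put("doi:10.1126/science.aap9559", {"title": ["example"]})
cache.put("doi:10.0000/missing", None)  # negative result
print("doi:10.0000/missing" in cache)  # True
```

Unlike the JSONL log, this keeps only the latest value per key, which is the audit-trail trade-off described above.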
Tests
uv run pytest