citefinder
OpenAlex (default) + Crossref reference lookups with local JSONL caching.
A small Python library + CLI for verifying academic references against the OpenAlex and Crossref APIs. Every lookup is appended to an append-only JSONL log so repeated queries (across verification passes or sessions) are served from the cache. Negative results (404s) are cached too, so known-missing DOIs aren't re-hit.
OpenAlex is the default source: it merges Crossref + Unpaywall + ORCID + ROR + repository sources, so it covers what Crossref alone is missing (arXiv DOIs under 10.48550/arXiv.*, other preprints, repository deposits) and frequently has richer metadata (abstracts, full author lists, affiliations) for records that exist in both. Crossref is still available via the crossref subcommand for its own workflows (book-chapter lookup, the canonical published-deposit metadata).
Configuration: API key and mailto
OpenAlex works without authentication, but a free API key gives you higher
limits and tier-specific endpoints. Both Crossref and OpenAlex honor a
mailto for their polite pools (faster responses, higher quotas).
- OpenAlex docs: https://developers.openalex.org/
- Sign up / generate an OpenAlex key: https://openalex.org/login?redirect=/settings/api-key
Lookup order (CLI), highest priority first:
- CLI flag: --api-key, --mailto.
- Shell environment: OPENALEX_API_KEY, OPENALEX_MAILTO, CROSSREF_MAILTO.
- Project-local .env in the current working directory or any parent.
- ~/.config/citefinder/config.toml (honors $XDG_CONFIG_HOME): store it once on this machine.
# ~/.config/citefinder/config.toml
[openalex]
api_key = "your-openalex-key"
mailto = "you@example.com"
[crossref]
mailto = "you@example.com"
The file is plain text. If your environment is shared, run chmod 600 ~/.config/citefinder/config.toml so that only you can read it. Each section is optional; omit anything you don't need.
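The lookup order above can be pictured as a "first non-empty value wins" chain. The sketch below is only an illustration of that precedence, not citefinder's actual code: the helper names (config_file, first_set, resolve_openalex) are hypothetical, and the .env and config.toml contents are passed in as already-parsed dicts to keep the example self-contained.

```python
from __future__ import annotations

from pathlib import Path


def config_file(env: dict[str, str]) -> Path:
    """Locate config.toml, honoring $XDG_CONFIG_HOME when it is set."""
    base = env.get("XDG_CONFIG_HOME") or str(Path.home() / ".config")
    return Path(base) / "citefinder" / "config.toml"


def first_set(*candidates: str | None) -> str | None:
    """Return the first non-empty candidate; callers pass highest priority first."""
    for value in candidates:
        if value:
            return value
    return None


def resolve_openalex(cli: dict, env: dict, dotenv: dict, file_section: dict) -> dict:
    """Apply the documented order: CLI flag, shell environment, .env, config.toml."""
    return {
        "api_key": first_set(
            cli.get("api_key"),
            env.get("OPENALEX_API_KEY"),
            dotenv.get("OPENALEX_API_KEY"),
            file_section.get("api_key"),
        ),
        "mailto": first_set(
            cli.get("mailto"),
            env.get("OPENALEX_MAILTO"),
            dotenv.get("OPENALEX_MAILTO"),
            file_section.get("mailto"),
        ),
    }


settings = resolve_openalex(
    cli={"mailto": "flag@example.com"},           # --mailto given, --api-key not
    env={"OPENALEX_API_KEY": "env-key"},          # shell environment
    dotenv={"OPENALEX_API_KEY": "dotenv-key"},    # parsed .env (hypothetical input)
    file_section={"api_key": "file-key", "mailto": "file@example.com"},
)
print(settings)
```

Here the mailto comes from the CLI flag and the key from the shell environment, because each outranks the .env and config-file values.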
Library users: pass api_key=... and mailto=... to the client constructors
explicitly. The config-file fallback is CLI-only (it shouldn't be a surprise
side effect of importing the library).
The API key is sent as Authorization: Bearer ..., never as a URL parameter,
so it doesn't land in cache keys, logs, or referer headers.
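To make the header-versus-URL distinction concrete, here is a minimal sketch using only the standard library. It is not citefinder's request code: build_request is a hypothetical helper, and the api.openalex.org/works/doi:... URL form is assumed here purely for illustration.

```python
from __future__ import annotations

from urllib.parse import urlencode
from urllib.request import Request

BASE_URL = "https://api.openalex.org/works"  # assumed endpoint, for illustration only


def build_request(doi: str, api_key: str | None = None, mailto: str | None = None) -> Request:
    """Build (but do not send) a lookup request.

    The mailto travels as a query parameter; the API key travels only in the
    Authorization header, so it never becomes part of the URL.
    """
    url = f"{BASE_URL}/doi:{doi}"
    if mailto:
        url += "?" + urlencode({"mailto": mailto})
    headers = {"Authorization": f"Bearer {api_key}"} if api_key else {}
    return Request(url, headers=headers)


req = build_request("10.48550/arXiv.2410.21554", api_key="secret", mailto="you@example.com")
print(req.full_url)
print(req.get_header("Authorization"))
```

Anything that records the URL (a cache key, an access log, a referer) sees the mailto but not the key.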
Install
uv add citefinder
Or for development:
git clone https://github.com/gitronald/citefinder
cd citefinder
uv sync
Library usage
OpenAlex (default)
from citefinder import OpenAlexClient, is_arxiv_doi, reconstruct_abstract
openalex = OpenAlexClient(
    cache_path="~/.cache/citefinder/openalex.jsonl",
    mailto="you@example.com",  # opts into OpenAlex's polite pool: faster, higher quota
)
# Single DOI (works for arXiv DOIs that Crossref doesn't index)
work = openalex.lookup_doi("10.48550/arXiv.2410.21554")
# Title-only search — tuned for citation verification. Handles OpenAlex's
# curly-apostrophe quirk and strips filter-reserved punctuation that would
# 400 the request, so straight ASCII inputs match curly-quoted indexed titles.
hits = openalex.search_title("Backstabber's Knife Collection", rows=3)
# Free-text search across titles + abstracts (noisier; prefer search_title
# for citation lookup)
hits = openalex.search("fact-checking large language models", rows=3)
# OpenAlex stores abstracts as an inverted index — reconstruct to plain text
abstract = reconstruct_abstract(work) if work else None
# Helper for routing logic
assert is_arxiv_doi("10.48550/arXiv.2410.21554")
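For readers unfamiliar with the inverted-index format, the sketch below shows the general idea behind rebuilding an abstract and recognizing an arXiv DOI. These are stand-alone illustrations with hypothetical names (rebuild_abstract, looks_like_arxiv_doi), not the library's reconstruct_abstract or is_arxiv_doi. The index is assumed to be a plain word-to-positions mapping.

```python
from __future__ import annotations


def rebuild_abstract(inverted_index: dict[str, list[int]]) -> str:
    """Turn a {word: [positions]} mapping back into running text."""
    by_position: dict[int, str] = {}
    for word, positions in inverted_index.items():
        for position in positions:
            by_position[position] = word
    return " ".join(by_position[i] for i in sorted(by_position))


def looks_like_arxiv_doi(doi: str) -> bool:
    """Check for the arXiv DOI prefix; DOIs are compared case-insensitively."""
    return doi.strip().lower().startswith("10.48550/arxiv.")


index = {"models": [2, 5], "Language": [1], "Large": [0], "are": [3], "just": [4]}
print(rebuild_abstract(index))
print(looks_like_arxiv_doi("10.48550/arXiv.2410.21554"))
```

A word that occurs several times simply lists several positions, which is why the rebuild walks positions rather than words.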
The mailto argument is optional but recommended: it puts requests into
OpenAlex's polite pool
for faster responses. The cache key strips mailto so changing it doesn't
invalidate prior entries.
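One way to get a mailto-independent cache key is to normalize the request URL before using it as the key. The following is a sketch of that idea only: cache_key is a hypothetical helper, and sorting the remaining parameters is an extra assumption, not something the README states.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def cache_key(url: str) -> str:
    """Drop the mailto parameter and sort the rest so equivalent URLs collide."""
    parts = urlsplit(url)
    query = [
        (name, value)
        for name, value in parse_qsl(parts.query, keep_blank_values=True)
        if name != "mailto"
    ]
    query.sort()
    return urlunsplit((parts.scheme, parts.netloc, parts.path, urlencode(query), ""))


first = cache_key("https://api.openalex.org/works?search=x&mailto=a%40example.org")
second = cache_key("https://api.openalex.org/works?mailto=b%40example.org&search=x")
print(first)
print(first == second)
```

Both URLs reduce to the same key, so switching email addresses keeps hitting the entries already on disk.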
Crossref
from citefinder import CrossrefClient
client = CrossrefClient(
    cache_path="~/.cache/citefinder/crossref.jsonl",
    mailto="you@example.com",  # opts into Crossref's polite pool: faster, higher quota
)
# Single DOI
work = client.lookup_doi("10.1126/science.aap9559")
print(work["title"][0])
# Bibliographic search (author + title + year)
hits = client.search_bibliographic("Wolfowicz hate speech meta-analysis", rows=3)
# Book chapter via {book_doi}.{NNN} pattern
chapter = client.lookup_book_chapter("10.1017/9781108890960", 5)
Crossref and OpenAlex both honor mailto for their polite pools; the cache
key strips it on either side, so rotating the email doesn't invalidate prior
entries.
OpenAlex's schema differs from Crossref. Quick map:
| Field | Crossref | OpenAlex |
|---|---|---|
| Title | work["title"][0] (+ optional subtitle[0]) | work["display_name"] |
| First author | work["author"][0]["family"] (surname only) | work["authorships"][0]["author"]["display_name"] (full name; parse for surname) |
| Container | work["container-title"][0] (+ short-container-title) | work["primary_location"]["source"]["display_name"] (+ host_venue on older records) |
| Year | published-print / published-online / issued / created → ["date-parts"][0][0] | work["publication_year"] (int) |
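The mapping in the table can be exercised with two small extractors that reduce either payload to the same four fields. This is a rough sketch, not the library's crossref_to_work or openalex_to_work: the function names and the returned dict shape are hypothetical, and taking the last whitespace-separated token as the surname is a deliberately naive assumption.

```python
from __future__ import annotations


def from_crossref(work: dict) -> dict:
    """Pull title, first-author surname, container and year from Crossref JSON."""
    year = None
    for field in ("published-print", "published-online", "issued", "created"):
        date_parts = (work.get(field) or {}).get("date-parts") or []
        if date_parts and date_parts[0] and date_parts[0][0] is not None:
            year = date_parts[0][0]
            break
    titles = work.get("title") or []
    authors = work.get("author") or []
    containers = work.get("container-title") or []
    return {
        "title": titles[0] if titles else None,
        "surname": authors[0].get("family") if authors else None,
        "container": containers[0] if containers else None,
        "year": year,
    }


def from_openalex(work: dict) -> dict:
    """Pull the same four fields from OpenAlex JSON."""
    authorships = work.get("authorships") or []
    name = ""
    if authorships:
        name = (authorships[0].get("author") or {}).get("display_name") or ""
    tokens = name.split()
    source = (work.get("primary_location") or {}).get("source") or {}
    return {
        "title": work.get("display_name"),
        "surname": tokens[-1] if tokens else None,  # naive: last token of full name
        "container": source.get("display_name"),
        "year": work.get("publication_year"),
    }


crossref_work = {
    "title": ["An Example Title"],
    "author": [{"family": "Doe", "given": "Jane"}],
    "container-title": ["Journal of Examples"],
    "issued": {"date-parts": [[2020, 5, 1]]},
}
openalex_work = {
    "display_name": "An Example Title",
    "authorships": [{"author": {"display_name": "Jane Doe"}}],
    "primary_location": {"source": {"display_name": "Journal of Examples"}},
    "publication_year": 2020,
}
print(from_crossref(crossref_work) == from_openalex(openalex_work))
```

The sample records are made up for the example; real payloads carry many more fields and often have missing or null values, which is why every access falls back to a default.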
Bib verification
A .bib file can be parsed and verified against either source end-to-end:
from citefinder import (
    OpenAlexClient,
    Source,
    parse_entries,
    verify_entry,
)
source = Source(name="openalex", client=OpenAlexClient(cache_path="cache.jsonl"))
for entry in parse_entries(open("refs.bib").read()):
    result = verify_entry(entry, source)
    print(result.key, result.status, result.matched_doi)
Each Result reports a Status (matched / probable / mismatch / unmatched / doi-not-found / skip-source / error) plus the four signals — title, year, first-author surname, container — that drove the verdict. BibCitation and Work are the canonical shapes; crossref_to_work and openalex_to_work adapt source-specific JSON into Work. See citefinder/signals.py for the signal-check thresholds.
CLI usage
# OpenAlex (default)
citefinder doi 10.48550/arXiv.2410.21554 --mailto you@example.com
citefinder search "Backstabber's Knife Collection" --rows 3
# Crossref
citefinder crossref doi 10.1126/science.aap9559 --mailto you@example.com
citefinder crossref search "Wolfowicz hate speech meta-analysis" --rows 3
citefinder crossref chapter 10.1017/9781108890960 5
# .bib parsing & verification
citefinder parse refs.bib # CSV to stdout (no network)
citefinder parse refs.bib --out parsed.csv # ...or to a file
citefinder verify refs.bib # full pipeline (defaults to OpenAlex)
citefinder verify refs.bib --source crossref # ...or against Crossref
citefinder verify refs.bib --out path/to/output/dir/ # custom output directory
parse emits a CSV with columns key, etype, title, author, year, doi, container where author is the first-author surname (the form used downstream for matching) and container is the entry's journal or booktitle.
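As an illustration of that column layout, the sketch below writes already-parsed entries to CSV with the standard csv module. It is not the parse command's implementation: the entry dicts, the to_csv helper, and the surname rule (text before the comma in "Last, First", otherwise the last word) are assumptions made for the example.

```python
import csv
import io

COLUMNS = ["key", "etype", "title", "author", "year", "doi", "container"]


def first_author_surname(author_field: str) -> str:
    """Reduce a BibTeX author field to the first author's surname."""
    first = author_field.split(" and ")[0].strip()
    if "," in first:
        return first.split(",")[0].strip()
    return first.split()[-1] if first else ""


def to_csv(entries: list) -> str:
    """Render entries as CSV text with the documented column order."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=COLUMNS, lineterminator="\n")
    writer.writeheader()
    for entry in entries:
        writer.writerow({
            "key": entry.get("key", ""),
            "etype": entry.get("etype", ""),
            "title": entry.get("title", ""),
            "author": first_author_surname(entry.get("author", "")),
            "year": entry.get("year", ""),
            "doi": entry.get("doi", ""),
            # container is the journal, falling back to booktitle
            "container": entry.get("journal") or entry.get("booktitle") or "",
        })
    return buffer.getvalue()


sample = [{
    "key": "doe2020",
    "etype": "article",
    "title": "An Example Title",
    "author": "Doe, Jane and Roe, Richard",
    "year": "2020",
    "journal": "Journal of Examples",
}]
csv_text = to_csv(sample)
print(csv_text)
```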
verify walks each entry: if a doi field is present it resolves the DOI; otherwise it searches by author + title + year. Each result is checked against four signals (title, year, first-author surname, container) and bucketed by status. Output goes to data/citefinder/<bib-stem>/<source>/: a <source>.jsonl cache and a structured results.json. Re-running is cheap — every cache hit is served from disk.
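The routing rule and the output layout described above can be sketched in a few lines. This is an illustrative reading of the paragraph, not the verify command's code: plan_lookup and output_dir are hypothetical names, and joining author, title, and year with spaces is an assumed query format.

```python
from __future__ import annotations

from pathlib import Path


def plan_lookup(entry: dict) -> tuple[str, str]:
    """Resolve by DOI when one is present; otherwise search by author + title + year."""
    doi = (entry.get("doi") or "").strip()
    if doi:
        return ("doi", doi)
    parts = (entry.get("author"), entry.get("title"), entry.get("year"))
    return ("search", " ".join(part for part in parts if part))


def output_dir(bib_path: str, source: str, root: str = "data/citefinder") -> Path:
    """Build data/citefinder/<bib-stem>/<source>/ for a given .bib file."""
    return Path(root) / Path(bib_path).stem / source


print(plan_lookup({"doi": "10.48550/arXiv.2410.21554", "title": "ignored when a DOI exists"}))
print(plan_lookup({"author": "Doe", "title": "An Example Title", "year": "2020"}))
print(output_dir("papers/refs.bib", "openalex"))
```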
CLI arguments
- --cache PATH: JSONL cache path. Defaults to ~/.cache/citefinder/openalex.jsonl for top-level commands and ~/.cache/citefinder/crossref.jsonl for crossref subcommands. Separate files so sources don't mix; override per command if you want per-project caches (e.g., --cache ./data/refs.jsonl).
- --rows N (search only): Number of results to return. Default 3.
- --mailto EMAIL: Opts the request into the source's polite pool (both OpenAlex and Crossref honor it): faster responses and a higher quota. Sent as a ?mailto=… query param; stripped from the cache key, so rotating the email doesn't invalidate prior entries.
- --api-key KEY (OpenAlex only): OpenAlex API key for higher rate limits and tier-specific endpoints. Also read from OPENALEX_API_KEY in the env or a .env file (loaded from cwd or any parent). Sent as Authorization: Bearer <key> so it never lands in cache keys, URL logs, or referer headers.
Why JSONL?
The cache is an append-only log: every lookup is one JSON object per line. Benefits:
- Auditable: cat / grep to see every query that ever ran.
- Diffable: plays nicely with git if you want to commit a project's cache.
- Crash-safe: an interrupted write loses at most the last line.
- Recoverable: rebuild the in-memory dict by replaying the log.
The latest value wins on replay, so overwriting a key needs no special handling: it is just another appended line.
SQLite alternative. A SQLite-backed cache is another reasonable
implementation — it would trade the audit log and grep-ability for faster
random access on very large caches (millions of entries) and concurrent
writers. The current scale of citefinder use (per-project bibs, tens of
thousands of entries at most) doesn't need it, and replaying a JSONL on
startup is fast enough that the simplicity wins. If a future workload pushes
past those limits, swapping the storage layer is a single class — JsonlCache
in citefinder/cache.py — behind the same get / put / __contains__
interface.
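The append-only design described in this section is small enough to sketch in full. The class below is a toy written for illustration, not the JsonlCache in citefinder/cache.py: the name TinyJsonlCache and the {"key": ..., "value": ...} record layout are assumptions. It shows replay on startup, latest-value-wins, cached negative results (stored as None), and tolerance for a truncated last line.

```python
import json
import tempfile
from pathlib import Path


class TinyJsonlCache:
    """Toy append-only cache: one JSON object per line, replayed on startup."""

    def __init__(self, path):
        self.path = Path(path).expanduser()
        self._data = {}
        if self.path.exists():
            with self.path.open(encoding="utf-8") as handle:
                for line in handle:
                    line = line.strip()
                    if not line:
                        continue
                    try:
                        record = json.loads(line)
                    except json.JSONDecodeError:
                        continue  # skip a line cut short by an interrupted write
                    self._data[record["key"]] = record["value"]  # later lines win

    def __contains__(self, key):
        return key in self._data

    def get(self, key, default=None):
        return self._data.get(key, default)

    def put(self, key, value):
        """Append one line to the log, then update the in-memory dict."""
        self.path.parent.mkdir(parents=True, exist_ok=True)
        with self.path.open("a", encoding="utf-8") as handle:
            handle.write(json.dumps({"key": key, "value": value}) + "\n")
        self._data[key] = value


with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "cache.jsonl"
    cache = TinyJsonlCache(path)
    cache.put("doi:10.1/x", {"title": "first"})
    cache.put("doi:10.1/x", {"title": "second"})  # overwrite is just another line
    cache.put("doi:10.1/missing", None)           # negative result is cached too
    replayed = TinyJsonlCache(path)               # rebuild the dict from the log
    line_count = len(path.read_text(encoding="utf-8").splitlines())

print(replayed.get("doi:10.1/x"), "doi:10.1/missing" in replayed, line_count)
```

Because membership is checked with `in` rather than by testing the returned value, a cached None (a known-missing DOI) still counts as a hit and avoids a second request.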
Tests
uv run pytest