citefinder

OpenAlex (default) + Crossref reference lookups with local JSONL caching.

A small Python library + CLI for verifying academic references against the OpenAlex and Crossref APIs. Every lookup is appended to an append-only JSONL log so repeated queries (across verification passes or sessions) are served from the cache. Negative results (404s) are cached too, so known-missing DOIs aren't re-hit.

OpenAlex is the default source: it merges Crossref + Unpaywall + ORCID + ROR + repository sources, so it covers what Crossref alone is missing (arXiv DOIs such as 10.48550/arXiv.*, other preprints, repository deposits) and frequently has richer metadata (abstracts, full author lists, affiliations) for records that exist in both. Crossref is still available via the crossref subcommand for its own workflows (book-chapter lookup, the canonical published-deposit metadata).

Configuration: API key and mailto

OpenAlex works without authentication, but a free API key gives you higher limits and tier-specific endpoints. Both Crossref and OpenAlex honor a mailto for their polite pools (faster responses, higher quotas).

Lookup order (CLI), highest priority first:

  1. CLI flag: --api-key, --mailto.
  2. Shell environment: OPENALEX_API_KEY, OPENALEX_MAILTO, CROSSREF_MAILTO.
  3. Project-local .env in the current working directory or any parent.
  4. ~/.config/citefinder/config.toml (honors $XDG_CONFIG_HOME) — store it once on this machine.
# ~/.config/citefinder/config.toml
[openalex]
api_key = "your-openalex-key"
mailto = "you@example.com"

[crossref]
mailto = "you@example.com"

The file is plain-text — if your environment is shared, chmod 600 ~/.config/citefinder/config.toml so it's only readable by you. Each section is optional; omit anything you don't need.
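The four-level lookup order can be sketched as a simple first-match chain. This is an illustration only, not citefinder's actual code: resolve_setting and its arguments are hypothetical names, and the .env and config.toml contents are assumed to be already parsed into plain dicts.

```python
def resolve_setting(cli_value, env_name, environ, dotenv, toml_config, section, key):
    """Return the first value found, walking the four levels in priority order."""
    if cli_value:                          # 1. CLI flag
        return cli_value
    if environ.get(env_name):              # 2. shell environment
        return environ[env_name]
    if dotenv.get(env_name):               # 3. project-local .env (already parsed)
        return dotenv[env_name]
    # 4. config.toml, parsed into nested dicts; None when nothing is configured
    return toml_config.get(section, {}).get(key)


# Hypothetical inputs standing in for the real environment and files.
environ = {}
dotenv = {"OPENALEX_MAILTO": "dotenv@example.com"}
toml_config = {"openalex": {"api_key": "from-toml", "mailto": "toml@example.com"}}

print(resolve_setting(None, "OPENALEX_MAILTO", environ, dotenv, toml_config, "openalex", "mailto"))
# dotenv@example.com
print(resolve_setting(None, "OPENALEX_API_KEY", environ, dotenv, toml_config, "openalex", "api_key"))
# from-toml
```

Passing the environment in as an argument (rather than reading os.environ inside the function) keeps the sketch easy to test.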

Library users: pass api_key=... and mailto=... to the client constructors explicitly. The config-file fallback is CLI-only (it shouldn't be a surprise side effect of importing the library).

The API key is sent as Authorization: Bearer ..., never as a URL parameter, so it doesn't land in cache keys, logs, or referer headers.

Install

uv add citefinder

Or for development:

git clone https://github.com/gitronald/citefinder
cd citefinder
uv sync

Library usage

OpenAlex (default)

from citefinder import OpenAlexClient, is_arxiv_doi, reconstruct_abstract

openalex = OpenAlexClient(
    cache_path="~/.cache/citefinder/openalex.jsonl",
    mailto="you@example.com",  # opts into OpenAlex's polite pool — faster, higher quota
)

# Single DOI (works for arXiv DOIs that Crossref doesn't index)
work = openalex.lookup_doi("10.48550/arXiv.2410.21554")

# Title-only search — tuned for citation verification. Handles OpenAlex's
# curly-apostrophe quirk and strips filter-reserved punctuation that would
# 400 the request, so straight ASCII inputs match curly-quoted indexed titles.
hits = openalex.search_title("Backstabber's Knife Collection", rows=3)

# Free-text search across titles + abstracts (noisier; prefer search_title
# for citation lookup)
hits = openalex.search("fact-checking large language models", rows=3)

# OpenAlex stores abstracts as an inverted index — reconstruct to plain text
abstract = reconstruct_abstract(work) if work else None

# Helper for routing logic
assert is_arxiv_doi("10.48550/arXiv.2410.21554")

The mailto argument is optional but recommended: it puts requests into OpenAlex's polite pool for faster responses. The cache key strips mailto so changing it doesn't invalidate prior entries.
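For readers curious what reconstructing an abstract involves: OpenAlex returns an abstract_inverted_index that maps each word to the positions where it occurs. The sketch below shows the general technique; rebuild_abstract is a hypothetical stand-in, not the library's reconstruct_abstract, and the sample index is invented.

```python
def rebuild_abstract(inverted_index):
    """Turn {"word": [positions]} back into a plain-text string."""
    if not inverted_index:
        return None
    # Flatten to (position, word) pairs, then order by position.
    positioned = [
        (pos, word)
        for word, positions in inverted_index.items()
        for pos in positions
    ]
    return " ".join(word for _, word in sorted(positioned))


sample = {"Citations": [0], "are": [1], "checked": [2, 5], "and": [3], "then": [4]}
print(rebuild_abstract(sample))
# Citations are checked and then checked
```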

Crossref

from citefinder import CrossrefClient

client = CrossrefClient(
    cache_path="~/.cache/citefinder/crossref.jsonl",
    mailto="you@example.com",  # opts into Crossref's polite pool — faster, higher quota
)

# Single DOI
work = client.lookup_doi("10.1126/science.aap9559")
print(work["title"][0])

# Bibliographic search (author + title + year)
hits = client.search_bibliographic("Wolfowicz hate speech meta-analysis", rows=3)

# Book chapter via {book_doi}.{NNN} pattern
chapter = client.lookup_book_chapter("10.1017/9781108890960", 5)

Crossref and OpenAlex both honor mailto for their polite pools; the cache key strips it on either side, so rotating the email doesn't invalidate prior entries.
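The {book_doi}.{NNN} chapter pattern mentioned above can be illustrated with a one-line helper. This assumes NNN means a zero-padded, three-digit chapter number; chapter_doi is a hypothetical name and may not match how lookup_book_chapter builds the DOI internally.

```python
def chapter_doi(book_doi: str, chapter: int) -> str:
    """Append a zero-padded, three-digit chapter suffix to a book DOI (assumed format)."""
    return f"{book_doi}.{chapter:03d}"


print(chapter_doi("10.1017/9781108890960", 5))
# 10.1017/9781108890960.005
```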

OpenAlex's schema differs from Crossref. Quick map:

| Field | Crossref | OpenAlex |
| --- | --- | --- |
| Title | work["title"][0] (+ optional subtitle[0]) | work["display_name"] |
| First author | work["author"][0]["family"] (surname only) | work["authorships"][0]["author"]["display_name"] (full name; parse for surname) |
| Container | work["container-title"][0] (+ short-container-title) | work["primary_location"]["source"]["display_name"] (+ host_venue on older records) |
| Year | published-print / published-online / issued / created["date-parts"][0][0] | work["publication_year"] (int) |
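To make the field map concrete, here is a minimal sketch that pulls the same four values out of each schema. The function names are hypothetical (the library's own adapters are crossref_to_work and openalex_to_work, whose internals are not shown here), the sample records are invented, and the surname parse is deliberately naive (last whitespace-separated token).

```python
def summarize_crossref(work):
    """Extract title, first-author surname, container, and year from Crossref JSON."""
    date_parts = (work.get("issued") or {}).get("date-parts") or [[None]]
    return {
        "title": (work.get("title") or [None])[0],
        "first_author": (work.get("author") or [{}])[0].get("family"),
        "container": (work.get("container-title") or [None])[0],
        "year": date_parts[0][0] if date_parts[0] else None,
    }


def summarize_openalex(work):
    """Extract the same four values from OpenAlex JSON."""
    authorships = work.get("authorships") or [{}]
    full_name = (authorships[0].get("author") or {}).get("display_name")
    source = (work.get("primary_location") or {}).get("source") or {}
    return {
        "title": work.get("display_name"),
        # Naive surname parse: the last token of the full display name.
        "first_author": full_name.split()[-1] if full_name else None,
        "container": source.get("display_name"),
        "year": work.get("publication_year"),
    }


# Invented sample records, shaped like each API's response.
crossref_sample = {
    "title": ["An Example Title"],
    "author": [{"given": "Ada", "family": "Example"}],
    "container-title": ["Journal of Examples"],
    "issued": {"date-parts": [[2020, 5, 1]]},
}
openalex_sample = {
    "display_name": "An Example Title",
    "authorships": [{"author": {"display_name": "Ada Example"}}],
    "primary_location": {"source": {"display_name": "Journal of Examples"}},
    "publication_year": 2020,
}

print(summarize_crossref(crossref_sample) == summarize_openalex(openalex_sample))
# True
```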

Bib verification

A .bib file can be parsed and verified against either source end-to-end:

from citefinder import (
    OpenAlexClient,
    Source,
    parse_entries,
    verify_entry,
)

source = Source(name="openalex", client=OpenAlexClient(cache_path="cache.jsonl"))

for entry in parse_entries(open("refs.bib").read()):
    result = verify_entry(entry, source)
    print(result.key, result.status, result.matched_doi)

Each Result reports a Status (matched / probable / mismatch / unmatched / doi-not-found / skip-source / error) plus the four signals — title, year, first-author surname, container — that drove the verdict. BibCitation and Work are the canonical shapes; crossref_to_work and openalex_to_work adapt source-specific JSON into Work. See citefinder/signals.py for the signal-check thresholds.
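As a rough illustration of what a four-signal comparison might look like, the sketch below compares a bib entry with a returned record. It is not the logic in citefinder/signals.py: the function names, the 0.9 title threshold, and the exact-match rules for the other signals are all assumptions made for the example.

```python
import re
from difflib import SequenceMatcher


def normalize(text):
    """Lowercase and collapse anything that is not a-z or 0-9 into single spaces."""
    return re.sub(r"[^a-z0-9]+", " ", text.lower()).strip()


def title_matches(bib_title, found_title, threshold=0.9):
    """Fuzzy title comparison; the 0.9 threshold is an arbitrary illustrative value."""
    ratio = SequenceMatcher(None, normalize(bib_title), normalize(found_title)).ratio()
    return ratio >= threshold


def compare(bib, found):
    """Return one boolean per signal: title, year, first-author surname, container."""
    return {
        "title": title_matches(bib["title"], found["title"]),
        "year": bib["year"] == found["year"],
        "surname": normalize(bib["surname"]) == normalize(found["surname"]),
        "container": normalize(bib["container"]) == normalize(found["container"]),
    }


# Straight and curly apostrophes normalize to the same string.
print(title_matches("Backstabber's Knife Collection", "Backstabber\u2019s Knife Collection"))
# True
```

A status such as matched or mismatch would then be derived from how many of these signals agree; that bucketing is left out here because the source does not spell it out.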

Bib ↔ table

A .bib file can be loaded into a wide polars DataFrame (one row per entry, one column per field) for inspection or bulk editing, then serialized back:

from citefinder import bib_to_table, table_to_bib

df = bib_to_table(open("refs.bib").read())   # key, entry_type, then fields alphabetical
new_bib = table_to_bib(df)                    # back to .bib, null cells skipped

bib_to_table lowercases field keys (DOI → doi) and stores the entry kind in entry_type to avoid collision with the literal type field that some entries carry (e.g., SSRN papers set type = {SSRN Scholarly Paper}). table_to_bib requires key and entry_type columns and serializes the rest in column order. The round-trip is lossless on field values and entry types; the original within-entry field order and any source-file @string/@comment blocks are not preserved.
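The column layout described above can be sketched in plain Python, without polars, to show why entry_type is kept separate from a literal type field. The entries_to_rows helper and the input shape (key / kind / fields) are hypothetical, and the sample entries, including the placeholder DOI, are invented.

```python
def entries_to_rows(entries):
    """Build wide rows: key, entry_type, then lowercased fields in alphabetical order."""
    rows = []
    for entry in entries:
        row = {"key": entry["key"], "entry_type": entry["kind"]}
        for name, value in entry["fields"].items():
            row[name.lower()] = value          # DOI and doi land in the same column
        rows.append(row)
    field_names = sorted({name for row in rows for name in row} - {"key", "entry_type"})
    columns = ["key", "entry_type", *field_names]
    # Missing fields become None, mirroring null cells in the table.
    return [{col: row.get(col) for col in columns} for row in rows]


entries = [
    {"key": "a2020", "kind": "article",
     "fields": {"DOI": "10.0000/example", "Title": "An Example Title"}},
    {"key": "b2021", "kind": "misc",
     "fields": {"type": "SSRN Scholarly Paper", "title": "Another Example"}},
]
rows = entries_to_rows(entries)
print(list(rows[0]))
# ['key', 'entry_type', 'doi', 'title', 'type']
```

The second row keeps entry_type == "misc" alongside type == "SSRN Scholarly Paper", which is the collision the separate column avoids.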

CLI usage

# OpenAlex (default)
citefinder doi 10.48550/arXiv.2410.21554 --mailto you@example.com
citefinder search "Backstabber's Knife Collection" --rows 3

# Crossref
citefinder crossref doi 10.1126/science.aap9559 --mailto you@example.com
citefinder crossref search "Wolfowicz hate speech meta-analysis" --rows 3
citefinder crossref chapter 10.1017/9781108890960 5

# .bib verification
citefinder verify refs.bib                               # full pipeline (defaults to OpenAlex)
citefinder verify refs.bib --source crossref             # ...or against Crossref
citefinder verify refs.bib --out path/to/output/dir/     # custom output directory

# .bib ↔ table
citefinder bib-to-table refs.bib                            # wide polars table to terminal
citefinder bib-to-table refs.bib --csv > refs.csv           # ...or CSV to stdout
citefinder bib-to-table refs.bib --fields title,year,doi    # subset of columns
citefinder table-to-bib refs.csv                            # CSV back to .bib on stdout
citefinder table-to-bib refs.csv --out refs.regen.bib       # ...or to a file

verify walks each entry: if a doi field is present it resolves the DOI; otherwise it searches by author + title + year. Each result is checked against four signals (title, year, first-author surname, container) and bucketed by status. Output goes to data/citefinder/<bib-stem>/<source>/: a <source>.jsonl cache and a structured results.json. Re-running is cheap — every cache hit is served from disk.
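The default output layout, data/citefinder/<bib-stem>/<source>/, can be expressed with pathlib. This is a sketch of the naming scheme only; output_paths is a hypothetical helper, not a function exported by citefinder.

```python
from pathlib import Path


def output_paths(bib_path, source, root="data/citefinder"):
    """Return (directory, cache file, results file) for one verify run."""
    out_dir = Path(root) / Path(bib_path).stem / source
    return out_dir, out_dir / f"{source}.jsonl", out_dir / "results.json"


out_dir, cache_file, results_file = output_paths("papers/refs.bib", "openalex")
print(out_dir.as_posix())
# data/citefinder/refs/openalex
```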

bib-to-table and table-to-bib are inverses: the first turns a .bib into a wide table (terminal view by default, --csv for piping), the second reads such a CSV back into a .bib. Useful for spreadsheet-style review or bulk edits before regenerating the file. The round-trip is lossless on data; within-entry field order and source-file formatting are not preserved.

CLI arguments

  • --cache PATH — JSONL cache path. Defaults to ~/.cache/citefinder/openalex.jsonl for top-level commands and ~/.cache/citefinder/crossref.jsonl for crossref subcommands. Separate files so sources don't mix; override per command if you want per-project caches (e.g., --cache ./data/refs.jsonl).
  • --rows N (search only) — Number of results to return. Default 3.
  • --mailto EMAIL — Opts the request into the source's polite pool (both OpenAlex and Crossref honor it): faster responses and a higher quota. Sent as a ?mailto=… query param; stripped from the cache key, so rotating the email doesn't invalidate prior entries.
  • --api-key KEY (OpenAlex only) — OpenAlex API key for higher rate limits and tier-specific endpoints. Also read from OPENALEX_API_KEY in the env or a .env file (loaded from cwd or any parent). Sent as Authorization: Bearer <key> so it never lands in cache keys, URL logs, or referer headers.
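The two properties described for --mailto and --api-key (mailto stripped from the cache key, API key carried in a header rather than the URL) can be illustrated as follows. Both helpers are hypothetical sketches using only the standard library; sorting the remaining query parameters is an extra assumption added so that parameter order does not affect the key.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def cache_key(url):
    """Drop the mailto parameter and sort the rest so equivalent requests share a key."""
    parts = urlsplit(url)
    query = [
        (name, value)
        for name, value in parse_qsl(parts.query, keep_blank_values=True)
        if name != "mailto"
    ]
    return urlunsplit((parts.scheme, parts.netloc, parts.path, urlencode(sorted(query)), ""))


def auth_headers(api_key=None):
    """Put the API key in a header so it never appears in the URL or the cache key."""
    return {"Authorization": f"Bearer {api_key}"} if api_key else {}


a = cache_key("https://api.openalex.org/works?search=example&mailto=a%40example.com")
b = cache_key("https://api.openalex.org/works?mailto=b%40example.com&search=example")
print(a == b, a)
# True https://api.openalex.org/works?search=example
```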

Why JSONL?

The cache is an append-only log: every lookup is one JSON object per line. Benefits:

  • Auditable: cat/grep to see every query that ever ran.
  • Diffable: plays nicely with git if you want to commit a project's cache.
  • Crash-safe: an interrupted write loses at most the last line.
  • Recoverable: rebuild the in-memory dict by replaying the log.

Latest value wins on replay, so writing the same key again needs no special handling: the newer line simply supersedes the older one.

SQLite alternative. A SQLite-backed cache is another reasonable implementation — it would trade the audit log and grep-ability for faster random access on very large caches (millions of entries) and concurrent writers. The current scale of citefinder use (per-project bibs, tens of thousands of entries at most) doesn't need it, and replaying a JSONL on startup is fast enough that the simplicity wins. If a future workload pushes past those limits, swapping the storage layer is a single class — JsonlCache in citefinder/cache.py — behind the same get / put / __contains__ interface.
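To show how little machinery the append-only design needs, here is a minimal sketch of a JSONL cache with the same get / put / __contains__ surface. It is not the JsonlCache class from citefinder/cache.py; the TinyJsonlCache name and the {"key": ..., "value": ...} record layout are assumptions made for illustration.

```python
import json
import os
import tempfile


class TinyJsonlCache:
    """Append-only JSONL cache: one JSON object per line, replayed on startup."""

    def __init__(self, path):
        self.path = path
        self._data = {}
        if os.path.exists(path):
            with open(path, encoding="utf-8") as fh:
                for line in fh:
                    try:
                        record = json.loads(line)
                    except json.JSONDecodeError:
                        continue  # skip blank or torn lines, e.g. after an interrupted write
                    self._data[record["key"]] = record["value"]  # latest value wins

    def __contains__(self, key):
        return key in self._data

    def get(self, key):
        return self._data.get(key)

    def put(self, key, value):
        with open(self.path, "a", encoding="utf-8") as fh:
            fh.write(json.dumps({"key": key, "value": value}) + "\n")
        self._data[key] = value


with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "cache.jsonl")
    cache = TinyJsonlCache(path)
    cache.put("doi:10.0000/missing", None)            # negative result (e.g. a 404)
    cache.put("doi:10.0000/example", {"title": "old"})
    cache.put("doi:10.0000/example", {"title": "new"})
    reloaded = TinyJsonlCache(path)                    # rebuild the dict by replaying the log
    print("doi:10.0000/missing" in reloaded, reloaded.get("doi:10.0000/example"))
    # True {'title': 'new'}
```

Storing None for a missing DOI while still answering True to the in check is what lets a known 404 be served from the cache instead of re-hitting the API.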

Tests

uv run pytest
