Identifier -> the retrievable artifacts of a scholarly article: a pluggable source ladder + identifier resolvers.
litfetch
Resolve a scholarly article identifier to its retrievable artifacts — the full-text body and any supplementary material — and fetch their bytes.
litfetch is two cooperating seams:
- a fetch ladder: pluggable Fetcher backends (PMC Open Access S3, Europe PMC, Elsevier OA) tried in priority order; the first to serve the body wins, returning a Blob (a File plus its bytes);
- an optional resolver layer: pluggable Resolvers that enrich what you know about a paper (pmid→pmcid/doi, etc.) so the ladder can act.
You hand it an ArticleIds bundle (any of pmid / pmcid / doi). Resolution
is demand-driven: a resolver only runs when the next fetcher needs an
identifier you don't yet have, and runs at most once.
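The ladder and the demand-driven resolution described above can be sketched in plain Python. This is an illustration of the idea only, not litfetch's implementation: Ids, Rung, run_ladder and the toy coroutines are hypothetical stand-ins, and the identifier values are made up.

```python
import asyncio
from dataclasses import dataclass, replace
from typing import Awaitable, Callable, Optional


@dataclass(frozen=True)
class Ids:
    """Hypothetical stand-in for an identifier bundle (not litfetch's ArticleIds)."""
    pmid: Optional[str] = None
    pmcid: Optional[str] = None
    doi: Optional[str] = None


@dataclass(frozen=True)
class Rung:
    """One ladder rung: a name, the fields it needs, and a fetch coroutine."""
    name: str
    requires: frozenset
    fetch: Callable[[Ids], Awaitable[Optional[bytes]]]


async def run_ladder(ids, rungs, resolver=None):
    """Try rungs in priority order; call the resolver lazily and at most once."""
    resolved = False
    for rung in rungs:
        missing = [f for f in rung.requires if getattr(ids, f) is None]
        if missing and resolver is not None and not resolved:
            ids = await resolver(ids)  # runs only because this rung needs more
            resolved = True
            missing = [f for f in rung.requires if getattr(ids, f) is None]
        if missing:
            continue  # this rung cannot act; move on to the next one
        body = await rung.fetch(ids)
        if body is not None:
            return rung.name, body  # the first rung to serve the body wins
    return None


calls = []


async def toy_resolver(ids):
    calls.append("resolve")
    return replace(ids, pmcid="PMC0000000")  # made-up value for the demo


async def fetch_by_doi(ids):
    return b"<doi body>"


async def fetch_by_pmcid(ids):
    return b"<pmc body>"


rungs = [
    Rung("by-doi", frozenset({"doi"}), fetch_by_doi),
    Rung("by-pmcid", frozenset({"pmcid"}), fetch_by_pmcid),
]

result = asyncio.run(run_ladder(Ids(pmid="1"), rungs, toy_resolver))
print(result, calls)
```

Starting from a pmid alone, the first rung triggers the single resolver call, still lacks a DOI, and is skipped; the second rung then serves the body. A bundle that already satisfies the first rung never triggers the resolver.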
An article is modelled as a file-set: a collection of File references (the
body in its various media types, plus supplementary material, distinguished by
FileKind), each hosted upstream. litfetch fetches the raw artifacts and reports
their access terms; it does not render them. To turn a fetched JATS/Elsevier
body into markdown, run litdown
on the bytes yourself (see Render to markdown).
The examples below are a tour; docs/api.md is the full
reference for the public surface.
Install
pip install litfetch
bioRxiv / medRxiv preprint full text needs a browser-fingerprint HTTP client,
enabled by the biorxiv extra:
pip install 'litfetch[biorxiv]'
Usage
Fetch the body
Hand fetch_body an ArticleIds; the default ladder serves the first available
body as a Blob:
from litfetch import ArticleIds, fetch_body

blob = await fetch_body(ArticleIds(pmcid='PMC5334499'))
if blob:
    print(blob.file.source, blob.file.media_type, len(blob.content))
Render to markdown
litfetch returns raw bytes, not markdown. Convert a JATS/Elsevier body with litdown — you pick and pin the converter:
import io

import litdown
from litfetch import ArticleIds, fetch_body

blob = await fetch_body(ArticleIds(pmcid='PMC5334499'))
if blob:
    markdown = litdown.convert(io.BytesIO(blob.content))
Inject your own resolver
A resolver is an async (ArticleIds, Http) -> ArticleIds — the session running
it supplies the Http. Enrich from whatever you have — a corpus client, a local
cache, an API — and merge it in (this one ignores Http, hence _http):
from litfetch import ArticleIds, Http, fetch_body

async def my_resolver(ids: ArticleIds, _http: Http) -> ArticleIds:
    # Nothing to look up without a pmid: return the bundle unchanged.
    if not ids.pmid:
        return ids
    # my_corpus is a placeholder for your own lookup client.
    pmcid, doi = await my_corpus.lookup(ids.pmid)
    return ids.merge(ArticleIds(pmcid=pmcid, doi=doi))

blob = await fetch_body(ArticleIds(pmid='29622564'), resolver=my_resolver)
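The Extending notes say a resolver fills gaps via ArticleIds.merge and never overwrites a known id. A minimal sketch of that gap-filling rule, assuming a frozen dataclass: Ids and merge_gaps are hypothetical stand-ins rather than litfetch's own types, and the pmcid and doi values are made up.

```python
from dataclasses import dataclass, fields
from typing import Optional


@dataclass(frozen=True)
class Ids:
    """Hypothetical stand-in for an identifier bundle."""
    pmid: Optional[str] = None
    pmcid: Optional[str] = None
    doi: Optional[str] = None

    def merge_gaps(self, other: "Ids") -> "Ids":
        """Return a copy where only the missing fields are taken from `other`."""
        merged = {}
        for f in fields(self):
            mine = getattr(self, f.name)
            # A value we already hold always wins over the incoming one.
            merged[f.name] = mine if mine is not None else getattr(other, f.name)
        return Ids(**merged)


known = Ids(pmid="29622564", doi="10.1000/known")
found = Ids(pmcid="PMC0000000", doi="10.1000/other")
merged = known.merge_gaps(found)
print(merged)
```

The missing pmcid is filled in, while the DOI already held is kept even though the lookup returned a different one.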
Use a bundled resolver
Bundled resolvers are constructed with their config, then passed in the same
slot. chain(...) composes several (yours first, fallbacks after); it stops
once every identifier is known:
from litfetch import ArticleIds, fetch_body
from litfetch.resolvers import SemanticScholarResolver, NcbiIdConverterResolver, chain

resolver = chain(
    my_resolver,                               # your own
    SemanticScholarResolver(api_key=S2_KEY),   # bundled
    NcbiIdConverterResolver(tool='myapp'),     # bundled
)
blob = await fetch_body(ArticleIds(pmid='29622564'), resolver=resolver)
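To make the "stops once every identifier is known" behaviour concrete, here is a rough sketch of how such a composition could work. It is not the bundled chain: Ids, chain_sketch and the three toy resolvers are hypothetical, the identifier values are made up, and http is passed as None because nothing here performs I/O.

```python
import asyncio
from dataclasses import dataclass, fields, replace
from typing import Optional


@dataclass(frozen=True)
class Ids:
    """Hypothetical stand-in for an identifier bundle."""
    pmid: Optional[str] = None
    pmcid: Optional[str] = None
    doi: Optional[str] = None

    def complete(self) -> bool:
        return all(getattr(self, f.name) is not None for f in fields(self))


def chain_sketch(*resolvers):
    """Compose resolvers in order, stopping as soon as nothing is missing."""
    async def chained(ids, http):
        for resolver in resolvers:
            if ids.complete():
                break  # every identifier is known; skip the remaining fallbacks
            ids = await resolver(ids, http)
        return ids
    return chained


seen = []


async def first(ids, http):
    seen.append("first")
    return replace(ids, pmcid="PMC0000000")


async def second(ids, http):
    seen.append("second")
    return replace(ids, doi="10.1000/demo")


async def third(ids, http):
    seen.append("third")  # never reached in this demo
    return ids


resolved = asyncio.run(chain_sketch(first, second, third)(Ids(pmid="1"), None))
print(resolved, seen)
```

Ordering matters under this design: cheaper or more trusted resolvers go first, and later fallbacks cost nothing once the bundle is complete.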
Polite-pool identification (NCBI/Crossref email, Unpaywall's required email)
comes from a session contact, not a hardcoded default — set it on the session:
async with litfetch.Session(contact='you@example.org') as s: await s.fetch_body(...).
default_resolver() is a batteries-included, keyless chain
(Europe PMC search + NCBI ID Converter).
No resolver — you already hold the IDs
A non-PubMed paper you only have a DOI for, plus your own Elsevier key:
blob = await fetch_body(
    ArticleIds(doi='10.1016/j.cell.2020.01.001'),
    credentials={'elsevier_api_key': key},
)
Supplementary material
list_files enumerates the file-set (references, no bytes); fetch_file
materialises one:
from litfetch import ArticleIds, FileKind, list_files, fetch_file

files = await list_files(ArticleIds(pmcid='PMC5334499'), kind=FileKind.SUPPLEMENTARY)
for file in files:
    blob = await fetch_file(file)
Access terms
Read the licence from the fetched bytes, falling back to an access authority (Unpaywall) when the bytes carry none:
from litfetch import extract_source_metadata, resolve_access

meta = extract_source_metadata(blob)  # from the JATS/Elsevier bytes
if meta.licence is None:
    meta = await resolve_access(ArticleIds(doi='10.1016/j.cell.2020.01.001'))
Resolvers stand alone
Each resolver is usable on its own as a cross-reference tool, independent of
fetching. A resolver is given the Http to use, so run it inside a session:
from litfetch import ArticleIds, Session
from litfetch.resolvers import SemanticScholarResolver

async with Session() as s:
    ids = await SemanticScholarResolver()(ArticleIds(doi='10.1016/j.cell.2020.01.001'), s)
    print(ids.pmid, ids.pmcid)
Batch: one session, a scope per paper
The one-shot functions above each open a throwaway session. For many papers,
hold one Session (pooled connection, shared pacing) and open a scope per
paper — the scope caches within itself, so a duplicate upstream call (e.g.
Unpaywall for both licence and PDF) is fetched once:
from litfetch import ArticleIds, Session

async with Session() as session:
    for pmid in pmids:
        async with session.scope() as s:
            blob = await s.fetch_body(ArticleIds(pmid=pmid))
            access = await s.resolve_access(ArticleIds(pmid=pmid))
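The per-scope caching described above amounts to memoising upstream calls for the lifetime of one paper's scope. A minimal sketch of that idea follows. It is not litfetch's scope: ScopeCache, fake_unpaywall, the cache key and the returned record are all hypothetical.

```python
import asyncio


class ScopeCache:
    """Memoise awaited results by key for the lifetime of one scope."""

    def __init__(self):
        self._results = {}

    async def get(self, key, fetch):
        # Only the first request for a key reaches upstream.
        if key not in self._results:
            self._results[key] = await fetch()
        return self._results[key]


upstream_calls = []


async def fake_unpaywall():
    upstream_calls.append("unpaywall")
    return {"licence": "cc-by", "pdf_url": "https://example.org/paper.pdf"}


async def one_paper():
    cache = ScopeCache()  # one cache per paper, discarded with the scope
    for_licence = await cache.get("unpaywall:10.1000/x", fake_unpaywall)
    for_pdf = await cache.get("unpaywall:10.1000/x", fake_unpaywall)
    return for_licence, for_pdf


first, second = asyncio.run(one_paper())
print(first is second, len(upstream_calls))
```

Keeping the cache on the scope rather than on the session means results never leak between papers, while the session can still share connections and pacing across all of them.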
Extending
- A new body fetcher: implement the Fetcher protocol, which has a name, a requires: frozenset[str] of the ArticleIds fields it needs, and an async fetch(ids, *, credentials, http) returning a body Blob or None. Add it to a fetchers= list (or your own default_fetchers).
- A new file source: implement the FileSource protocol (a name, plus async list_files(ids, ...) and fetch_file(file, ...)) to enumerate and materialise an article's file-set (body renditions and supplementary alike).
- A new resolver: write an async (ArticleIds, Http) -> ArticleIds that fills gaps via ArticleIds.merge and never overwrites a known id.
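The shape of a body fetcher listed above can be written down as a structural type. The sketch below mirrors only the members named in the list (name, requires, fetch with keyword-only credentials and http). FetcherShape and EchoFetcher are hypothetical, and a plain dict stands in for both ArticleIds and Blob, so this shows the protocol's outline rather than litfetch's actual definitions.

```python
import asyncio
from typing import Any, Mapping, Optional, Protocol, runtime_checkable


@runtime_checkable
class FetcherShape(Protocol):
    """Hypothetical mirror of the fetcher members described in the list above."""
    name: str
    requires: frozenset

    async def fetch(
        self, ids: Any, *, credentials: Mapping[str, str], http: Any
    ) -> Optional[Any]: ...


class EchoFetcher:
    """Toy fetcher: 'serves' the DOI itself as the body, with no network I/O."""
    name = "echo"
    requires = frozenset({"doi"})

    async def fetch(self, ids, *, credentials, http):
        doi = ids.get("doi")
        if doi is None:
            return None  # cannot serve this article; the next fetcher gets a turn
        return {"source": self.name, "content": doi.encode()}


fetcher = EchoFetcher()
served = asyncio.run(fetcher.fetch({"doi": "10.1000/x"}, credentials={}, http=None))
print(isinstance(fetcher, FetcherShape), served)
```

Returning None rather than raising when the body is unavailable is what lets a ladder fall through to the next backend.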
Development
uv sync
uv run ruff check . && uv run ruff format --check .
uv run pyright
uv run pytest