
Identifier -> the retrievable artifacts of a scholarly article: a pluggable source ladder + identifier resolvers.


litfetch

Resolve a scholarly article identifier to its retrievable artifacts — the full-text body and any supplementary material — and fetch their bytes.

litfetch is two cooperating seams:

  • a fetch ladder: pluggable Fetcher backends (PMC Open Access S3, Europe PMC, Elsevier OA) tried in priority order; the first to serve the body wins, returning a Blob (a File plus its bytes);
  • an optional resolver layer: pluggable Resolvers that enrich what you know about a paper (pmid -> pmcid/doi, etc.) so the ladder can act.

You hand it an ArticleIds bundle (any of pmid / pmcid / doi). Resolution is demand-driven: a resolver only runs when the next fetcher needs an identifier you don't yet have, and runs at most once.
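
The demand-driven behaviour described above can be sketched in a few lines. This is a toy model under stated assumptions, not litfetch's implementation: Ids, ToyFetcher, run_ladder and toy_resolver are hypothetical stand-ins, the resolver takes no Http argument, and the identifier values are made up.

```python
import asyncio
from dataclasses import dataclass, replace
from typing import Optional


@dataclass(frozen=True)
class Ids:
    """Stand-in identifier bundle; not litfetch's ArticleIds."""
    pmid: Optional[str] = None
    pmcid: Optional[str] = None
    doi: Optional[str] = None


@dataclass(frozen=True)
class ToyFetcher:
    """Stand-in backend: declares the fields it needs and may return bytes."""
    name: str
    requires: frozenset
    body: Optional[bytes]  # None models "this source does not hold the paper"

    async def fetch(self, ids: Ids) -> Optional[bytes]:
        return self.body


def missing_fields(fetcher: ToyFetcher, ids: Ids) -> list:
    """Names of required identifiers that are still unknown."""
    return [f for f in fetcher.requires if getattr(ids, f) is None]


async def run_ladder(ids, fetchers, resolver=None):
    """Try fetchers in priority order; resolve lazily and at most once."""
    resolved = False
    for fetcher in fetchers:
        if missing_fields(fetcher, ids) and resolver is not None and not resolved:
            ids = await resolver(ids)  # runs only because this rung needs more
            resolved = True
        if missing_fields(fetcher, ids):
            continue  # still cannot satisfy this backend; try the next rung
        body = await fetcher.fetch(ids)
        if body is not None:
            return fetcher.name, body  # the first backend to serve the body wins
    return None


calls = []  # records every resolver invocation


async def toy_resolver(ids: Ids) -> Ids:
    calls.append(ids)
    return replace(ids, pmcid=ids.pmcid or "PMC0000000")  # made-up PMCID


ladder = [
    ToyFetcher("needs-pmcid-miss", frozenset({"pmcid"}), None),
    ToyFetcher("needs-pmcid-hit", frozenset({"pmcid"}), b"<article/>"),
]
result = asyncio.run(run_ladder(Ids(pmid="1"), ladder, toy_resolver))
print(result, len(calls))
```

Both toy fetchers need a pmcid, yet the resolver runs only once: the first rung triggers it, and the second reuses the enriched bundle.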

An article is modelled as a file-set: a collection of File references (the body in its various media types, plus supplementary material, distinguished by FileKind), each hosted upstream. litfetch fetches the raw artifacts and reports their access terms; it does not render them. To turn a fetched JATS/Elsevier body into markdown, run litdown on the bytes yourself (see Render to markdown).

The examples below are a tour; docs/api.md is the full reference for the public surface.

Install

pip install litfetch

bioRxiv / medRxiv preprint full text needs a browser-fingerprint HTTP client, enabled by the biorxiv extra:

pip install 'litfetch[biorxiv]'

Usage

Fetch the body

Hand fetch_body an ArticleIds; the default ladder serves the first available body as a Blob:

from litfetch import ArticleIds, fetch_body

blob = await fetch_body(ArticleIds(pmcid='PMC5334499'))
if blob:
    print(blob.file.source, blob.file.media_type, len(blob.content))

Render to markdown

litfetch returns raw bytes, not markdown. Convert a JATS/Elsevier body with litdown — you pick and pin the converter:

import io
import litdown
from litfetch import ArticleIds, fetch_body

blob = await fetch_body(ArticleIds(pmcid='PMC5334499'))
if blob:
    markdown = litdown.convert(io.BytesIO(blob.content))

Inject your own resolver

A resolver is an async (ArticleIds, Http) -> ArticleIds — the session running it supplies the Http. Enrich from whatever you have — a corpus client, a local cache, an API — and merge it in (this one ignores Http, hence _http):

from litfetch import ArticleIds, Http, fetch_body

async def my_resolver(ids: ArticleIds, _http: Http) -> ArticleIds:
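    if not ids.pmid:
        return ids  # nothing to look up by
    # my_corpus stands for your own lookup client; it is not part of litfetch
    pmcid, doi = await my_corpus.lookup(ids.pmid)
    return ids.merge(ArticleIds(pmcid=pmcid, doi=doi))  # fill gaps, keep known ids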

blob = await fetch_body(ArticleIds(pmid='29622564'), resolver=my_resolver)
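
The resolver above relies on merge to combine bundles. A minimal sketch of a gap-filling merge follows, assuming merge keeps every identifier that is already known and copies only the missing ones. The Ids class is a hypothetical stand-in rather than litfetch's ArticleIds, and the PMCID and DOI values are made up.

```python
from dataclasses import dataclass, fields
from typing import Optional


@dataclass(frozen=True)
class Ids:
    """Stand-in identifier bundle; not litfetch's ArticleIds."""
    pmid: Optional[str] = None
    pmcid: Optional[str] = None
    doi: Optional[str] = None

    def merge(self, other: "Ids") -> "Ids":
        """Return a new bundle: keep known values, fill gaps from other."""
        merged = {}
        for field in fields(self):
            value = getattr(self, field.name)
            if value is None:  # only a gap may be filled
                value = getattr(other, field.name)
            merged[field.name] = value
        return Ids(**merged)


known = Ids(pmid="29622564", doi="10.1000/known")     # illustrative DOI
found = Ids(pmcid="PMC0000000", doi="10.1000/other")  # illustrative values
merged = known.merge(found)
print(merged)
```

The conflicting DOI from found is dropped, because a value that is already known is never overwritten; known itself is left unchanged since the bundle is immutable.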

Use a bundled resolver

Bundled resolvers are constructed with their config, then passed in the same slot. chain(...) composes several (yours first, fallbacks after); it stops once every identifier is known:

from litfetch import ArticleIds, fetch_body
from litfetch.resolvers import SemanticScholarResolver, NcbiIdConverterResolver, chain

resolver = chain(
    my_resolver,                              # your own
    SemanticScholarResolver(api_key=S2_KEY),  # bundled
    NcbiIdConverterResolver(tool='myapp'),    # bundled
)
blob = await fetch_body(ArticleIds(pmid='29622564'), resolver=resolver)
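
As a rough illustration of the short-circuit described above, the sketch below composes resolvers in order and stops once no identifier is missing. It is not litfetch's chain: toy_chain, filler and Ids are hypothetical, the Http argument is left out for brevity, and the filled-in values are made up.

```python
import asyncio
from dataclasses import dataclass, fields, replace
from typing import Optional


@dataclass(frozen=True)
class Ids:
    """Stand-in identifier bundle; not litfetch's ArticleIds."""
    pmid: Optional[str] = None
    pmcid: Optional[str] = None
    doi: Optional[str] = None


def is_complete(ids: Ids) -> bool:
    """True once every identifier field is known."""
    return all(getattr(ids, f.name) is not None for f in fields(ids))


def toy_chain(*resolvers):
    """Compose resolvers: run them in order, stopping when nothing is missing."""
    async def chained(ids: Ids) -> Ids:
        for resolver in resolvers:
            if is_complete(ids):
                break  # later resolvers are fallbacks and are skipped
            ids = await resolver(ids)
        return ids
    return chained


ran = []  # names of the resolvers that actually executed


def filler(name: str, **values):
    """Build a toy resolver that fills the given fields when they are unknown."""
    async def resolver(ids: Ids) -> Ids:
        ran.append(name)
        gaps = {k: v for k, v in values.items() if getattr(ids, k) is None}
        return replace(ids, **gaps)
    return resolver


resolver = toy_chain(
    filler("first", pmcid="PMC0000000"),      # made-up value
    filler("second", doi="10.1000/example"),  # made-up value
    filler("third", pmid="0"),                # never reached in this run
)
ids = asyncio.run(resolver(Ids(pmid="29622564")))
print(ids, ran)
```

Ordering matters in this model: cheap or authoritative resolvers go first, and a later entry costs nothing when the earlier ones already filled every gap.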

Polite-pool identification (NCBI/Crossref email, Unpaywall's required email) comes from a session contact, not a hardcoded default — set it on the session: async with litfetch.Session(contact='you@example.org') as s: await s.fetch_body(...).

default_resolver() is a batteries-included, keyless chain (Europe PMC search + NCBI ID Converter).

No resolver — you already hold the IDs

A non-PubMed paper you only have a DOI for, plus your own Elsevier key:

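from litfetch import ArticleIds, fetch_body

# key holds your own Elsevier API key, e.g. loaded from the environment
blob = await fetch_body(
    ArticleIds(doi='10.1016/j.cell.2020.01.001'),
    credentials={'elsevier_api_key': key},
)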

Supplementary material

list_files enumerates the file-set (references, no bytes); fetch_file materialises one:

from litfetch import ArticleIds, FileKind, list_files, fetch_file

files = await list_files(ArticleIds(pmcid='PMC5334499'), kind=FileKind.SUPPLEMENTARY)
for file in files:
    blob = await fetch_file(file)

Access terms

Read the licence from the fetched bytes, falling back to an access authority (Unpaywall) when the bytes carry none:

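from litfetch import ArticleIds, extract_source_metadata, resolve_access

# blob is a body Blob returned earlier by fetch_body
meta = extract_source_metadata(blob)          # from the JATS/Elsevier bytes
if meta.licence is None:
    # the bytes carry no licence, so ask the access authority instead
    meta = await resolve_access(ArticleIds(doi='10.1016/j.cell.2020.01.001'))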

Resolvers stand alone

Each resolver is usable on its own as a cross-reference tool, independent of fetching. A resolver is given the Http to use, so run it inside a session:

from litfetch import ArticleIds, Session
from litfetch.resolvers import SemanticScholarResolver

async with Session() as s:
    ids = await SemanticScholarResolver()(ArticleIds(doi='10.1016/j.cell.2020.01.001'), s)
print(ids.pmid, ids.pmcid)

Batch: one session, a scope per paper

The one-shot functions above each open a throwaway session. For many papers, hold one Session (pooled connection, shared pacing) and open a scope per paper — the scope caches within itself, so a duplicate upstream call (e.g. Unpaywall for both licence and PDF) is fetched once:

from litfetch import ArticleIds, Session

async with Session() as session:
    for pmid in pmids:
        async with session.scope() as s:
            blob = await s.fetch_body(ArticleIds(pmid=pmid))
            access = await s.resolve_access(ArticleIds(pmid=pmid))
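
One way to picture the per-scope caching is a small memo table keyed by upstream URL. The sketch below is an assumption about the general technique, not litfetch's Session or scope: ToyScope, fake_upstream and the URL are hypothetical.

```python
import asyncio
from typing import Awaitable, Callable, Dict


class ToyScope:
    """Memoises upstream calls for one paper; dropped when the scope ends."""

    def __init__(self) -> None:
        self._cache: Dict[str, "asyncio.Task[bytes]"] = {}

    async def get(self, url: str, fetch: Callable[[str], Awaitable[bytes]]) -> bytes:
        # Cache the task rather than the result, so that concurrent callers
        # share a single in-flight request instead of racing each other.
        if url not in self._cache:
            self._cache[url] = asyncio.create_task(fetch(url))
        return await self._cache[url]


upstream_calls = []  # records every request that really went upstream


async def fake_upstream(url: str) -> bytes:
    upstream_calls.append(url)
    await asyncio.sleep(0)  # yield control, as a real network call would
    return b"payload for " + url.encode()


async def demo():
    scope = ToyScope()
    url = "https://upstream.invalid/record"  # placeholder URL
    # Two consumers (say, licence lookup and PDF lookup) want the same record.
    return await asyncio.gather(
        scope.get(url, fake_upstream),
        scope.get(url, fake_upstream),
    )


licence_bytes, pdf_bytes = asyncio.run(demo())
print(len(upstream_calls), licence_bytes == pdf_bytes)
```

A fresh ToyScope starts with an empty table, which mirrors the idea that caching is per paper while connections and pacing would live on the longer-lived session.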

Extending

  • A new body fetcher: implement the Fetcher protocol — a name, a requires: frozenset[str] of the ArticleIds fields it needs, and an async fetch(ids, *, credentials, http) returning a body Blob or None. Add it to a fetchers= list (or your own default_fetchers).
  • A new file source: implement the FileSource protocol — a name, and async list_files(ids, ...) / fetch_file(file, ...) — to enumerate and materialise an article's file-set (body renditions and supplementary alike).
  • A new resolver: write an async (ArticleIds, Http) -> ArticleIds that fills gaps via ArticleIds.merge and never overwrites a known id.
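
To make the fetcher bullet concrete, here is a sketch of a backend with the shape described there: a name, a requires set, and an async fetch(ids, *, credentials, http) that returns a body or None. Everything in it is a stand-in under assumption: FetcherLike, ToyBlob, Ids and LocalMirrorFetcher are hypothetical, ToyBlob flattens the File-plus-bytes pairing into one record, and the media type is illustrative. Check docs/api.md for the real protocol before implementing against it.

```python
import asyncio
from dataclasses import dataclass
from typing import Any, Mapping, Optional, Protocol, runtime_checkable


@dataclass(frozen=True)
class Ids:
    """Stand-in identifier bundle; not litfetch's ArticleIds."""
    pmid: Optional[str] = None
    pmcid: Optional[str] = None
    doi: Optional[str] = None


@dataclass(frozen=True)
class ToyBlob:
    """Stand-in for a fetched body: where it came from, its type, its bytes."""
    source: str
    media_type: str
    content: bytes


@runtime_checkable
class FetcherLike(Protocol):
    """Shape taken from the prose above; the real protocol may differ."""
    name: str
    requires: frozenset

    async def fetch(
        self, ids: Ids, *, credentials: Mapping[str, str], http: Any
    ) -> Optional[ToyBlob]: ...


class LocalMirrorFetcher:
    """Serves bodies from an in-memory mapping keyed by PMCID."""
    name = "local-mirror"
    requires = frozenset({"pmcid"})

    def __init__(self, store: Mapping[str, bytes]) -> None:
        self._store = store

    async def fetch(self, ids, *, credentials, http):
        content = self._store.get(ids.pmcid)
        if content is None:
            return None  # let the next fetcher in the ladder try
        return ToyBlob(self.name, "application/xml", content)


fetcher = LocalMirrorFetcher({"PMC5334499": b"<article/>"})
hit = asyncio.run(fetcher.fetch(Ids(pmcid="PMC5334499"), credentials={}, http=None))
miss = asyncio.run(fetcher.fetch(Ids(pmcid="PMC0000000"), credentials={}, http=None))
print(hit, miss)
```

Returning None rather than raising on a miss keeps a backend composable, since an absent paper is an expected outcome in a ladder of sources.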

Development

uv sync
uv run ruff check . && uv run ruff format --check .
uv run pyright
uv run pytest

Project details

Release files for litfetch 0.1.0, uploaded using Trusted Publishing via twine/6.1.0 CPython/3.13.12. Publisher: release.yml on populationgenomics/litfetch. Attestation values reflect the state when the release was signed and may no longer be current.

Source distribution: litfetch-0.1.0.tar.gz (48.6 kB, tags: Source)

  • SHA256 cca2a4a220fbb617c38e4a752731407459eb878ce3ddc67a782935ca52c631e0
  • MD5 f1e7fbfe94700795ffab32b4513cf7b9
  • BLAKE2b-256 76797470d55c7c951cce63360b88b0000b9ed27bc6089d7d82b0b2d4c522f6df

Built distribution: litfetch-0.1.0-py3-none-any.whl (39.2 kB, tags: Python 3)

  • SHA256 bf8dd8b4a3bde7494f202344a50fd4f819d3ea51415d2b15f7dca5d7f32a896f
  • MD5 bef809d77f39841474bdd45e43c9a58a
  • BLAKE2b-256 ceaf24dbf8efeeb2174a2d758d7c03c396fc2a76c14f0deedffab0a226cbd43a