asta-papers

Legal-only paper full-text retrieval and conversion. Identifier (DOI / PMID / PMCID / arXiv / Semantic Scholar Corpus ID) — or BYO PDF / JATS — to markdown with explicit license classification.

Lifts paper recovery on biomedical literature from ~22% (Mistral OCR alone) to ~85% using only publisher-blessed legal channels (NCBI E-utilities, Unpaywall, EuropePMC, bioRxiv, institutional repositories).

Install

pip install asta-papers                 # core (JATS conversion only)
pip install 'asta-papers[mistral]'      # + Mistral OCR for PDFs
pip install 'asta-papers[olmocr]'       # + local olmOCR for PDFs (offline)
pip install 'asta-papers[s3]'           # + s3:// BYO support
pip install 'asta-papers[all]'          # everything

Quickstart

import os
from asta_papers import Client
from asta_papers.converters.mistral import MistralConverter

c = Client(
    email="me@allenai.org",
    ncbi_api_key=os.environ.get("NCBI_API_KEY"),       # optional, lifts NCBI 3→10 rps
    converters=[MistralConverter()],                    # for PDF→markdown
)

# By identifier
r = c.fetch(doi="10.1186/s12943-024-02093-w")
print(r.success, r.license_class, r.markdown[:200])

# Storage-tier policy (save_artifact and extract_inline are placeholders for your own functions)
if r.may_redistribute:                                   # CC BY / CC0 / CC BY-SA
    save_artifact(r.bytes)
elif r.may_use_for_tdm:                                  # TDM-permissive licenses; not BRONZE/UNKNOWN/CLOSED
    extract_inline(r.markdown)

# BYO PDF — bytes, local path, or URI
r = c.fetch(pdf=b"%PDF-...")
r = c.fetch(pdf="paper.pdf")
r = c.fetch(pdf="s3://my-bucket/paper.pdf")              # requires [s3]

# Batch with bounded concurrency + per-host rate limits
results = c.fetch_many([
    {"doi": "10.1038/foo"},
    {"pmcid": "PMC123"},
    {"pdf": "paper.pdf", "doi": "10.99/local"},
])
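
The "bounded concurrency" idea behind fetch_many can be illustrated with a thread pool. This is a minimal sketch, not the library's implementation: fetch_many_sketch and fake_fetch are hypothetical names, and both the input-order guarantee and the per-item exception capture are assumptions of the sketch rather than documented behavior of Client.fetch_many.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Any, Callable


def fetch_many_sketch(
    requests: list[dict[str, Any]],
    fetch_one: Callable[..., Any],
    max_workers: int = 4,
) -> list[Any]:
    """Call fetch_one(**request) for each request, at most max_workers at a time.

    Results are returned in input order. An exception raised for one request
    is returned in that request's slot so the rest of the batch still completes.
    """

    def safe(request: dict[str, Any]) -> Any:
        try:
            return fetch_one(**request)
        except Exception as exc:
            return exc

    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map yields results in the order of the inputs.
        return list(pool.map(safe, requests))


# Stand-in for a real single-paper fetch (hypothetical).
def fake_fetch(doi=None, pmcid=None, pdf=None):
    if pmcid == "PMC0":
        raise LookupError("not found")
    return f"markdown for {doi or pmcid or pdf}"


results = fetch_many_sketch(
    [
        {"doi": "10.1038/foo"},
        {"pmcid": "PMC0"},
        {"pdf": "paper.pdf", "doi": "10.99/local"},
    ],
    fake_fetch,
    max_workers=2,
)
for item in results:
    print(type(item).__name__, item)
```

The pool size caps how many requests are in flight at once; per-host rate limits are a separate concern layered on top of it.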

How it works

A strategy ladder runs against legal aggregator APIs in order, returning the first successful result:

1. PMC E-utilities efetch: JATS XML for PMC OA Subset articles
2. NCBI elink: PMID → PMC self-link when Semantic Scholar (S2) didn't surface it
3. Published-version handoff: when the input is a preprint DOI, route to the published version (bioRxiv API or Crossref relation) so callers get the most recent public version of the paper
4. arXiv: direct PDF for arXiv papers
5. bioRxiv / medRxiv API: JATS XML or PDF for preprints
6. Unpaywall: best legal OA URL; routes PMC URLs through efetch
7. Institutional repo scrape: hdl.handle.net, pure.eur.nl, etc. (now respects robots.txt)
8. EuropePMC PDF render: text-mining-licensed PDFs for free-to-read papers that NCBI's OA Subset doesn't include
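
The ladder above amounts to a first-success loop. The following is a minimal sketch of that pattern, not the library's actual code: Attempt, Strategy, and run_ladder are hypothetical names, and the two stub strategies stand in for real network calls.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class Attempt:
    """Outcome of walking the ladder (hypothetical type, for illustration only)."""

    strategy: str
    content: Optional[str] = None

    @property
    def success(self) -> bool:
        return self.content is not None


# A strategy takes an identifier and returns content, or None if it has nothing.
Strategy = Callable[[str], Optional[str]]


def run_ladder(identifier: str, ladder: list[tuple[str, Strategy]]) -> Attempt:
    """Try each strategy in order and return the first successful attempt."""
    for name, strategy in ladder:
        try:
            content = strategy(identifier)
        except Exception:
            # A failing rung should not abort the ladder; move on to the next one.
            continue
        if content is not None:
            return Attempt(strategy=name, content=content)
    return Attempt(strategy="none")


# Stub rungs standing in for real API calls.
def pmc_efetch(identifier: str) -> Optional[str]:
    return None  # pretend the paper is not in the PMC OA Subset


def unpaywall(identifier: str) -> Optional[str]:
    return "<article>...</article>"


result = run_ladder(
    "10.1234/example",
    [("pmc_efetch", pmc_efetch), ("unpaywall", unpaywall)],
)
print(result.strategy, result.success)
```

Recording which rung succeeded is what lets a result report its retrieval strategy afterwards.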

Per-host token-bucket rate limiting keeps each upstream within its published quota: arXiv at 0.33 rps (one request every 3 seconds, their explicit rule) and NCBI at 3 rps (10 rps with an API key), with the NCBI budget shared across the eutils.*, pmc.*, and www.* hostnames. Independent services run fully in parallel.
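
For readers unfamiliar with token buckets, here is a minimal sketch of the technique with a shared bucket for hostnames that draw on one quota. It is an illustration under stated assumptions, not the library's limiter: TokenBucket, bucket_key, and the concrete hostnames are hypothetical, and the clock is injectable only so the behavior can be checked without sleeping.

```python
import threading
import time
from typing import Callable


class TokenBucket:
    """Allows `rate` requests per second, with bursts of up to `capacity`."""

    def __init__(
        self,
        rate: float,
        capacity: float = 1.0,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self.rate = rate
        self.capacity = capacity
        self._clock = clock
        self._tokens = capacity
        self._last = clock()
        self._lock = threading.Lock()

    def try_acquire(self) -> float:
        """Take a token and return 0.0, or return the seconds to wait for one."""
        with self._lock:
            now = self._clock()
            # Refill in proportion to elapsed time, capped at capacity.
            refill = (now - self._last) * self.rate
            self._tokens = min(self.capacity, self._tokens + refill)
            self._last = now
            if self._tokens >= 1.0:
                self._tokens -= 1.0
                return 0.0
            return (1.0 - self._tokens) / self.rate

    def acquire(self) -> None:
        """Block until a token is available."""
        while (wait := self.try_acquire()) > 0:
            time.sleep(wait)


# Hostnames that share one quota map to the same bucket key (illustrative names).
NCBI_HOSTS = {
    "eutils.ncbi.nlm.nih.gov",
    "pmc.ncbi.nlm.nih.gov",
    "www.ncbi.nlm.nih.gov",
}


def bucket_key(host: str) -> str:
    return "ncbi" if host in NCBI_HOSTS else host


buckets = {
    "ncbi": TokenBucket(rate=3.0),  # 10.0 with an API key
    "export.arxiv.org": TokenBucket(rate=1 / 3),  # one request every 3 seconds
}
```

Because each upstream has its own bucket, a slow quota such as arXiv's does not hold back requests to unrelated services.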

License classification

Every successful fetch carries a LicenseClass:

cc-by  cc-by-sa  cc-by-nd  cc-by-nc  cc-by-nc-sa  cc-by-nc-nd  cc0
text-mining-only   bronze   arxiv-default   closed   unknown

FetchResult also exposes helper booleans (may_redistribute, may_redistribute_nc, may_make_derivatives, may_train_models, may_use_for_tdm) and a source_type field (publisher / repository / other) for callers who want publisher-vs-repository policy without parsing strings. This makes storage-tier policy a one-line check.
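
To show how such a policy check might look on the caller's side, here is a minimal sketch that maps the license class strings listed above to a storage tier. The enum member names, the tier names, and the storage_tier function are hypothetical. The two sets follow the Quickstart comments (CC BY / CC0 / CC BY-SA may be redistributed; BRONZE / UNKNOWN / CLOSED are not usable for TDM); treating every other class as TDM-permissive is this sketch's reading of that comment, and the library's real rules may differ.

```python
from enum import Enum


class LicenseClass(str, Enum):
    CC_BY = "cc-by"
    CC_BY_SA = "cc-by-sa"
    CC_BY_ND = "cc-by-nd"
    CC_BY_NC = "cc-by-nc"
    CC_BY_NC_SA = "cc-by-nc-sa"
    CC_BY_NC_ND = "cc-by-nc-nd"
    CC0 = "cc0"
    TEXT_MINING_ONLY = "text-mining-only"
    BRONZE = "bronze"
    ARXIV_DEFAULT = "arxiv-default"
    CLOSED = "closed"
    UNKNOWN = "unknown"


# Assumption: taken from the Quickstart comments, not from the library source.
REDISTRIBUTABLE = {LicenseClass.CC_BY, LicenseClass.CC_BY_SA, LicenseClass.CC0}
NO_TDM = {LicenseClass.BRONZE, LicenseClass.UNKNOWN, LicenseClass.CLOSED}


def storage_tier(license_class: LicenseClass) -> str:
    """Return 'store', 'extract-only', or 'skip' (hypothetical tier names)."""
    if license_class in REDISTRIBUTABLE:
        return "store"  # keep the original bytes
    if license_class not in NO_TDM:
        return "extract-only"  # use the text, do not keep the artifact
    return "skip"


for value in ("cc-by-sa", "text-mining-only", "closed"):
    print(value, "->", storage_tier(LicenseClass(value)))
```

In real code the helper booleans on FetchResult replace these hand-written sets; the sketch only makes the decision structure explicit.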

Successful results also carry an attribution blockquote at the top of the markdown by default (source URL + DOI/PMCID + license + retrieval strategy), so the markdown is self-attributing when it travels to end users. Disable with Client(include_attribution=False).
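
As an illustration of what a self-attributing document could look like, the sketch below builds a blockquote header and strips it again. The exact layout the library emits is not specified here, so the field labels, attribution_block, and strip_leading_blockquote are all hypothetical; when you do not want the header, Client(include_attribution=False) is the supported route.

```python
def attribution_block(
    source_url: str, identifier: str, license_class: str, strategy: str
) -> str:
    """Build a markdown blockquote header (hypothetical layout)."""
    lines = [
        f"> Source: {source_url}",
        f"> Identifier: {identifier}",
        f"> License: {license_class}",
        f"> Retrieved via: {strategy}",
    ]
    return "\n".join(lines) + "\n\n"


def strip_leading_blockquote(markdown: str) -> str:
    """Drop a blockquote at the very top of a document, plus blank lines after it."""
    lines = markdown.splitlines(keepends=True)
    i = 0
    while i < len(lines) and lines[i].startswith(">"):
        i += 1
    while i < len(lines) and not lines[i].strip():
        i += 1
    return "".join(lines[i:])


body = "# Title\n\nFirst paragraph.\n"
doc = attribution_block(
    "https://example.org/paper", "10.1234/example", "cc-by", "unpaywall"
) + body
print(doc)
```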

What's NOT here

No Sci-Hub, no archive scraping, no UA spoofing past WAFs.
No title-only paper search (use PaperFinder / S2 first to get an identifier).
No per-call multi-tenant Credentials object (use one Client per credential set).
No async API (sync only in v0.1).

Configuration

See Client.__init__ and docs/concepts.md for the full list. Required: email (Crossref polite-pool identifier; pass it as a kwarg or set the ASTA_PAPERS_EMAIL environment variable). Recommended: the NCBI_API_KEY environment variable (free, 5-minute registration; raises the NCBI limit from 3 to 10 rps, about 3.3× throughput).
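
The kwarg-or-environment rule and the throughput figure can be sketched as follows. This is a minimal illustration, not the library's code: resolve_email and ncbi_rate are hypothetical helpers, and the precedence of the kwarg over the environment variable is an assumption.

```python
import os
from typing import Optional


def resolve_email(email: Optional[str] = None) -> str:
    """Return the contact email from the kwarg, else from ASTA_PAPERS_EMAIL."""
    value = email or os.environ.get("ASTA_PAPERS_EMAIL")
    if not value:
        raise ValueError("email is required: pass email=... or set ASTA_PAPERS_EMAIL")
    return value


def ncbi_rate(api_key: Optional[str]) -> float:
    """NCBI requests per second: 3 without a key, 10 with one."""
    return 10.0 if api_key else 3.0


speedup = ncbi_rate("some-key") / ncbi_rate(None)
print(f"{speedup:.1f}x")  # 3.3x
```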

Tests

pytest tests/integration -v                  # 29 real-API tests
python tools/check_test_legitimacy.py --strict  # asserts mock-ratio < 30%

The full integration suite hits live upstream APIs — no mocks. Tests run in ~90 seconds. A per-paper snapshot recovery benchmark (53 biomedical DOIs that fail Mistral-OCR-only retrieval; 46/53 = 87% recovered) gates regressions.
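
A per-paper snapshot gate can be sketched as a comparison between a stored pass/fail map and the current run. The sketch below is an assumption about how such a gate might work, not the project's benchmark code: the function names are hypothetical and the DOIs are synthetic placeholders sized to match the 46/53 figure.

```python
def recovery_rate(results: dict[str, bool]) -> float:
    """Fraction of papers recovered."""
    return sum(results.values()) / len(results)


def regressions(snapshot: dict[str, bool], current: dict[str, bool]) -> list[str]:
    """DOIs recovered in the snapshot that are no longer recovered."""
    return sorted(
        doi for doi, ok in snapshot.items() if ok and not current.get(doi, False)
    )


# Synthetic stand-in for the benchmark: 53 papers, the first 46 recovered.
snapshot = {f"10.0000/paper-{i:02d}": i < 46 for i in range(53)}

rate = recovery_rate(snapshot)
print(f"{sum(snapshot.values())}/{len(snapshot)} = {rate:.0%}")  # 46/53 = 87%

# A run that loses one previously recovered paper fails the gate.
current = dict(snapshot)
current["10.0000/paper-00"] = False
print(regressions(snapshot, current))  # ['10.0000/paper-00']
```

Comparing per paper rather than by aggregate rate means a newly recovered paper cannot mask one that stopped working.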

Design

Full design at docs/DESIGN.md.

This version

0.0.1

May 6, 2026
