asta-papers

Legal-only paper full-text retrieval and conversion. Identifier (DOI / PMID / PMCID / arXiv / Semantic Scholar Corpus ID) — or BYO PDF / JATS — to markdown with explicit license classification.

Lifts paper recovery on biomedical literature from ~22% (Mistral OCR alone) to ~85% using only publisher-blessed legal channels (NCBI E-utilities, Unpaywall, EuropePMC, bioRxiv, institutional repositories).

Install

pip install asta-papers                 # core (JATS conversion only)
pip install 'asta-papers[mistral]'      # + Mistral OCR for PDFs
pip install 'asta-papers[olmocr]'       # + local olmOCR for PDFs (offline)
pip install 'asta-papers[s3]'           # + s3:// BYO support
pip install 'asta-papers[all]'          # everything

Quickstart

import os
from asta_papers import Client
from asta_papers.converters.mistral import MistralConverter

c = Client(
    email="me@allenai.org",
    ncbi_api_key=os.environ.get("NCBI_API_KEY"),       # optional, lifts NCBI 3→10 rps
    converters=[MistralConverter()],                    # for PDF→markdown
)

# By identifier
r = c.fetch(doi="10.1186/s12943-024-02093-w")
print(r.success, r.license_class, r.markdown[:200])

# Storage-tier policy (save_artifact and extract_inline are your own functions, not part of asta-papers)
if r.may_redistribute:                                   # CC BY / CC0 / CC BY-SA
    save_artifact(r.bytes)
elif r.may_use_for_tdm:                                  # TDM-permissive licenses; not BRONZE/UNKNOWN/CLOSED
    extract_inline(r.markdown)

# BYO PDF — bytes, local path, or URI
r = c.fetch(pdf=b"%PDF-...")
r = c.fetch(pdf="paper.pdf")
r = c.fetch(pdf="s3://my-bucket/paper.pdf")              # requires [s3]

# Batch with bounded concurrency + per-host rate limits
results = c.fetch_many([
    {"doi": "10.1038/foo"},
    {"pmcid": "PMC123"},
    {"pdf": "paper.pdf", "doi": "10.99/local"},
])
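
The README does not show how fetch_many bounds its concurrency. As a minimal sketch of the general idea, assuming a thread pool with a fixed worker count, the helper below runs a per-request function over a batch and keeps results in input order. The names fetch_many_bounded, fake_fetch, and max_workers are hypothetical and are not part of the asta-papers API.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Any, Callable, Dict, List, Sequence

Request = Dict[str, Any]


def fetch_many_bounded(
    fetch_one: Callable[[Request], Any],
    requests: Sequence[Request],
    max_workers: int = 4,
) -> List[Any]:
    """Run fetch_one over requests with at most max_workers in flight.

    Results come back in input order. A request that raises is reported
    as the exception object instead of aborting the whole batch.
    """
    def safe(request: Request) -> Any:
        try:
            return fetch_one(request)
        except Exception as exc:  # keep the batch going on per-item failure
            return exc

    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(safe, requests))


# Stand-in for a real fetch; it only inspects the request keys.
def fake_fetch(request: Request) -> List[str]:
    if "doi" not in request and "pmcid" not in request:
        raise ValueError("no identifier")
    return sorted(request)


results = fetch_many_bounded(
    fake_fetch,
    [{"doi": "10.1038/foo"}, {"pmcid": "PMC123"}, {"pdf": "paper.pdf"}],
    max_workers=2,
)
```

Returning the exception in place of a result is one design choice among several; a real client might instead return a result object with success set to False.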

How it works

A strategy ladder runs against legal aggregator APIs in order, returning the first successful result:

  1. PMC E-utilities efetch — JATS XML for PMC OA Subset articles
  2. NCBI elink — PMID → PMC self-link when S2 didn't surface it
  3. Published-version handoff — when input is a preprint DOI, route to the published version (bioRxiv API or Crossref relation) so callers get the most-recent public version of the paper
  4. arXiv — direct PDF for arXiv papers
  5. bioRxiv / medRxiv API — JATS XML or PDF for preprints
  6. Unpaywall — best legal OA URL; routes PMC URLs through efetch
  7. Institutional repo scrape: hdl.handle.net, pure.eur.nl, etc. (now respects robots.txt)
  8. EuropePMC PDF render — text-mining-licensed PDFs for free-to-read papers NCBI's OA Subset doesn't include
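
The ladder pattern described above can be sketched as an ordered list of callables that is walked until one returns content. This is an illustration of the pattern only, not the library's implementation: Attempt, run_ladder, and the two stand-in strategies are hypothetical names, and the real strategies would call the upstream APIs.

```python
from dataclasses import dataclass
from typing import Callable, Optional, Sequence, Tuple


@dataclass
class Attempt:
    """Outcome of the ladder (hypothetical shape, not the library's FetchResult)."""
    strategy: str
    content: Optional[str] = None

    @property
    def success(self) -> bool:
        return self.content is not None


# A strategy takes an identifier and returns content, or None when it has nothing.
Strategy = Callable[[str], Optional[str]]


def run_ladder(identifier: str, ladder: Sequence[Tuple[str, Strategy]]) -> Attempt:
    """Try each strategy in order and return the first one that succeeds."""
    for name, strategy in ladder:
        try:
            content = strategy(identifier)
        except Exception:
            # A failing rung should not abort the ladder; move to the next one.
            continue
        if content is not None:
            return Attempt(strategy=name, content=content)
    return Attempt(strategy="none")


# Stand-in strategies for illustration only.
def pmc_efetch(identifier: str) -> Optional[str]:
    return None  # pretend the paper is not in the PMC OA Subset


def unpaywall(identifier: str) -> Optional[str]:
    return f"# markdown for {identifier}"


result = run_ladder(
    "10.1186/s12943-024-02093-w",
    [("pmc_efetch", pmc_efetch), ("unpaywall", unpaywall)],
)
print(result.strategy, result.success)
```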

Per-host token-bucket rate limiting keeps requests within each host's published quota: arXiv at 0.33 rps (their explicit rule) and NCBI at 3 rps (10 with an API key), with the NCBI budget shared across the eutils.*, pmc.*, and www.* hostnames. Independent services run fully in parallel.
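
A token bucket with a shared key for related hostnames might look roughly like the sketch below. It is a simplified, single-threaded illustration under stated assumptions: the class and function names are hypothetical, the default rate for unlisted hosts is invented for the example, and a production version would need a lock around the bucket state.

```python
import time
from typing import Callable, Dict


class TokenBucket:
    """Refills at `rate` tokens per second up to `capacity`."""

    def __init__(self, rate: float, capacity: float = 1.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.rate = rate
        self.capacity = capacity
        self.tokens = capacity
        self.clock = clock
        self.updated = clock()

    def wait_time(self) -> float:
        """Reserve one token and return how long to sleep before sending."""
        now = self.clock()
        self.tokens = min(self.capacity, self.tokens + (now - self.updated) * self.rate)
        self.updated = now
        self.tokens -= 1.0
        if self.tokens >= 0:
            return 0.0
        # A negative balance is a debt that refilling pays off over time.
        return -self.tokens / self.rate


def bucket_key(host: str) -> str:
    """Map a hostname to the bucket it shares (assumed grouping)."""
    if host == "ncbi.nlm.nih.gov" or host.endswith(".ncbi.nlm.nih.gov"):
        return "ncbi"
    if host == "arxiv.org" or host.endswith(".arxiv.org"):
        return "arxiv"
    return host


RATES: Dict[str, float] = {"ncbi": 3.0, "arxiv": 0.33}  # requests per second
DEFAULT_RATE = 1.0  # assumption for hosts without a listed quota
_buckets: Dict[str, TokenBucket] = {}


def throttle(host: str) -> None:
    """Block until a request to `host` is allowed under its bucket."""
    key = bucket_key(host)
    if key not in _buckets:
        _buckets[key] = TokenBucket(RATES.get(key, DEFAULT_RATE))
    time.sleep(_buckets[key].wait_time())
```

Calling throttle("eutils.ncbi.nlm.nih.gov") and throttle("www.ncbi.nlm.nih.gov") draws from the same bucket in this sketch, which is the behavior the paragraph above describes for NCBI.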

License classification

Every successful fetch carries a LicenseClass:

cc-by  cc-by-sa  cc-by-nd  cc-by-nc  cc-by-nc-sa  cc-by-nc-nd  cc0
text-mining-only   bronze   arxiv-default   closed   unknown

FetchResult also exposes helper booleans (may_redistribute, may_redistribute_nc, may_make_derivatives, may_train_models, may_use_for_tdm) and a source_type field (publisher / repository / other) for callers who want publisher-vs-repository policy without parsing strings. This makes a storage-tier policy a one-line check.
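
To make the relationship between license classes and two of the helper booleans concrete, here is a small sketch. The mapping is an assumption inferred only from the Quickstart comments (redistribution for CC BY / CC0 / CC BY-SA; text and data mining for everything except bronze, unknown, and closed). The enum is a stand-in, and the library's actual rules may differ.

```python
from enum import Enum


class LicenseClass(str, Enum):
    """Stand-in enum using the class names listed above."""
    CC_BY = "cc-by"
    CC_BY_SA = "cc-by-sa"
    CC_BY_ND = "cc-by-nd"
    CC_BY_NC = "cc-by-nc"
    CC_BY_NC_SA = "cc-by-nc-sa"
    CC_BY_NC_ND = "cc-by-nc-nd"
    CC0 = "cc0"
    TEXT_MINING_ONLY = "text-mining-only"
    BRONZE = "bronze"
    ARXIV_DEFAULT = "arxiv-default"
    CLOSED = "closed"
    UNKNOWN = "unknown"


# Assumed sets, read off the Quickstart comments rather than the library source.
REDISTRIBUTABLE = frozenset(
    {LicenseClass.CC_BY, LicenseClass.CC_BY_SA, LicenseClass.CC0}
)
NO_TDM = frozenset(
    {LicenseClass.BRONZE, LicenseClass.CLOSED, LicenseClass.UNKNOWN}
)


def may_redistribute(license_class: LicenseClass) -> bool:
    return license_class in REDISTRIBUTABLE


def may_use_for_tdm(license_class: LicenseClass) -> bool:
    return license_class not in NO_TDM


def storage_tier(license_class: LicenseClass) -> str:
    """Pick a storage tier; the tier names are hypothetical."""
    if may_redistribute(license_class):
        return "store-artifact"
    if may_use_for_tdm(license_class):
        return "extract-only"
    return "skip"
```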

Successful results also carry an attribution blockquote at the top of the markdown by default (source URL + DOI/PMCID + license + retrieval strategy), so the markdown is self-attributing when it travels to end users. Disable with Client(include_attribution=False).
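
The exact layout of the attribution block is not shown here, so the following is only a sketch of what prepending a markdown blockquote with those four fields could look like. The function names, field labels, and the example values are hypothetical.

```python
def attribution_block(source_url: str, identifier: str,
                      license_class: str, strategy: str) -> str:
    """Build a markdown blockquote carrying the four attribution fields."""
    lines = [
        f"> Source: {source_url}",
        f"> Identifier: {identifier}",
        f"> License: {license_class}",
        f"> Retrieved via: {strategy}",
    ]
    return "\n".join(lines) + "\n\n"


def with_attribution(markdown: str, source_url: str, identifier: str,
                     license_class: str, strategy: str,
                     include_attribution: bool = True) -> str:
    """Prepend the blockquote unless the caller opted out."""
    if not include_attribution:
        return markdown
    return attribution_block(source_url, identifier, license_class, strategy) + markdown


doc = with_attribution(
    "# Title\n\nBody text.",
    source_url="https://example.org/paper",  # placeholder URL
    identifier="10.1186/s12943-024-02093-w",
    license_class="cc-by",
    strategy="unpaywall",
)
print(doc)
```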

What's NOT here

  • No Sci-Hub, no archive scraping, no UA spoofing past WAFs.
  • No title-only paper search (use PaperFinder / S2 first to get an identifier).
  • No per-call Credentials object for multi-tenant use (create one Client per credential set).
  • No async API (sync only in v0.1).

Configuration

See Client.__init__ and docs/concepts.md for the full list. Required: email (Crossref polite-pool identifier; kwarg or ASTA_PAPERS_EMAIL env). Recommended: NCBI_API_KEY env (free, 5-minute registration, 3.3× throughput).
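
As a rough illustration of the "kwarg or env" rule for email, the helper below prefers an explicit argument and falls back to the environment variable. The precedence order and the function name resolve_email are assumptions and are not taken from Client.__init__. The last line checks the throughput figure: 10 rps over 3 rps is about 3.3×.

```python
import os
from typing import Mapping, Optional


def resolve_email(email: Optional[str] = None,
                  env: Mapping[str, str] = os.environ) -> str:
    """Return the contact email from the kwarg, else ASTA_PAPERS_EMAIL."""
    value = email or env.get("ASTA_PAPERS_EMAIL")
    if not value:
        raise ValueError("email is required: pass email=... or set ASTA_PAPERS_EMAIL")
    return value


# NCBI throughput gain from an API key: 10 rps with a key versus 3 rps without.
speedup = round(10 / 3, 1)
```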

Tests

pytest tests/integration -v                  # 29 real-API tests
python tools/check_test_legitimacy.py --strict  # asserts mock-ratio < 30%

The full integration suite hits live upstream APIs — no mocks. Tests run in ~90 seconds. A per-paper snapshot recovery benchmark (53 biomedical DOIs that fail Mistral-OCR-only retrieval; 46/53 = 87% recovered) gates regressions.
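
One way a per-paper snapshot can gate regressions is to compare each DOI's recorded outcome with the current run and fail if any previously recovered paper is now missing. The sketch below assumes the snapshot is a mapping from DOI to a recovered flag. That format and the function names are hypothetical, and the DOIs are placeholders.

```python
from typing import Dict, List


def recovery_rate(results: Dict[str, bool]) -> float:
    """Fraction of papers recovered in one run."""
    return sum(results.values()) / len(results)


def regressions(snapshot: Dict[str, bool], current: Dict[str, bool]) -> List[str]:
    """DOIs recovered in the snapshot but not in the current run."""
    return sorted(
        doi for doi, recovered in snapshot.items()
        if recovered and not current.get(doi, False)
    )


# Placeholder data shaped like the benchmark: 46 of 53 recovered.
snapshot = {f"10.0000/paper{i}": i < 46 for i in range(53)}
current = dict(snapshot)
current["10.0000/paper0"] = False  # simulate one paper that stopped working

lost = regressions(snapshot, current)
rate = recovery_rate(snapshot)
```

Gating on the per-paper list instead of the aggregate rate catches the case where one paper regresses while another newly succeeds, which would leave the percentage unchanged.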

Design

Full design at docs/DESIGN.md.
