Legal-only paper full-text retrieval and conversion. DOI/PMID/PMCID/arXiv/Corpus ID + BYO PDF/JATS → markdown with license classification.
asta-papers
Legal-only paper full-text retrieval and conversion. Identifier (DOI / PMID / PMCID / arXiv / Semantic Scholar Corpus ID) — or BYO PDF / JATS — to markdown with explicit license classification.
Lifts paper recovery on biomedical literature from ~22% (Mistral OCR alone) to ~85% using only publisher-blessed legal channels (NCBI E-utilities, Unpaywall, EuropePMC, bioRxiv, institutional repositories).
Install
pip install asta-papers # core (JATS conversion only)
pip install 'asta-papers[mistral]' # + Mistral OCR for PDFs
pip install 'asta-papers[olmocr]' # + local olmOCR for PDFs (offline)
pip install 'asta-papers[s3]' # + s3:// BYO support
pip install 'asta-papers[all]' # everything
Quickstart
import os
from asta_papers import Client
from asta_papers.converters.mistral import MistralConverter
c = Client(
    email="me@allenai.org",
    ncbi_api_key=os.environ.get("NCBI_API_KEY"),  # optional, lifts NCBI 3→10 rps
    converters=[MistralConverter()],  # for PDF→markdown
)
# By identifier
r = c.fetch(doi="10.1186/s12943-024-02093-w")
print(r.success, r.license_class, r.markdown[:200])
# Storage-tier policy (save_artifact and extract_inline are your own functions)
if r.may_redistribute:  # CC BY / CC0 / CC BY-SA
    save_artifact(r.bytes)
elif r.may_use_for_tdm:  # TDM-permissive licenses; not BRONZE/UNKNOWN/CLOSED
    extract_inline(r.markdown)
# BYO PDF — bytes, local path, or URI
r = c.fetch(pdf=b"%PDF-...")
r = c.fetch(pdf="paper.pdf")
r = c.fetch(pdf="s3://my-bucket/paper.pdf") # requires [s3]
# Batch with bounded concurrency + per-host rate limits
results = c.fetch_many([
    {"doi": "10.1038/foo"},
    {"pmcid": "PMC123"},
    {"pdf": "paper.pdf", "doi": "10.99/local"},
])
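One way to picture "bounded concurrency" for a batch is a fixed-size thread pool that returns results in input order. The sketch below is not the library's implementation; `fetch_many_sketch`, `max_workers`, and the stand-in `fake_fetch` are hypothetical names used only for illustration.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Any, Callable, Dict, List


def fetch_many_sketch(
    fetch_one: Callable[..., Any],
    requests: List[Dict[str, Any]],
    max_workers: int = 4,  # hypothetical knob: upper bound on in-flight fetches
) -> List[Any]:
    """Run fetch_one(**request) for every request, at most max_workers at a time."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # Executor.map yields results in the same order as the inputs.
        return list(pool.map(lambda req: fetch_one(**req), requests))


def fake_fetch(**identifiers: Any) -> str:
    """Stand-in for a real fetch; reports which identifier keys it was given."""
    return "+".join(sorted(identifiers))


batch_results = fetch_many_sketch(
    fake_fetch,
    [
        {"doi": "10.1038/foo"},
        {"pmcid": "PMC123"},
        {"pdf": "paper.pdf", "doi": "10.99/local"},
    ],
)
print(batch_results)
```

Keeping results aligned with the input order lets a caller zip them back against the original request list.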
How it works
A strategy ladder runs against legal aggregator APIs in order, returning the first successful result:
- PMC E-utilities efetch — JATS XML for PMC OA Subset articles
- NCBI elink — PMID → PMC self-link when S2 didn't surface it
- Published-version handoff — when input is a preprint DOI, route to the published version (bioRxiv API or Crossref relation) so callers get the most recent public version of the paper
- arXiv — direct PDF for arXiv papers
- bioRxiv / medRxiv API — JATS XML or PDF for preprints
- Unpaywall — best legal OA URL; routes PMC URLs through efetch
- Institutional repo scrape — hdl.handle.net, pure.eur.nl, etc. (now respects robots.txt)
- EuropePMC PDF render — text-mining-licensed PDFs for free-to-read papers NCBI's OA Subset doesn't include
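The "first successful result wins" behavior of the ladder can be sketched as a loop over ordered strategies. This is a minimal illustration under stated assumptions, not the package's code: `run_ladder`, `LadderResult`, and the three stub strategies are hypothetical, and treating an exception as "try the next rung" is a design assumption.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple

# A strategy takes an identifier and returns content, or None when it has nothing.
Strategy = Callable[[str], Optional[str]]


@dataclass
class LadderResult:
    strategy: str
    content: str


def run_ladder(identifier: str, ladder: List[Tuple[str, Strategy]]) -> Optional[LadderResult]:
    """Try each (name, strategy) pair in order and return the first success."""
    for name, strategy in ladder:
        try:
            content = strategy(identifier)
        except Exception:
            # Assumption: a failing rung does not abort the ladder.
            continue
        if content is not None:
            return LadderResult(strategy=name, content=content)
    return None


def pmc_efetch_stub(identifier: str) -> Optional[str]:
    return None  # pretend the article is not in the PMC OA Subset


def elink_stub(identifier: str) -> Optional[str]:
    raise RuntimeError("upstream timeout")  # pretend the service is down


def unpaywall_stub(identifier: str) -> Optional[str]:
    return f"OA copy of {identifier}"


result = run_ladder(
    "10.1186/s12943-024-02093-w",
    [("pmc-efetch", pmc_efetch_stub), ("elink", elink_stub), ("unpaywall", unpaywall_stub)],
)
print(result.strategy if result else "no legal copy found")
```

Recording the winning strategy name alongside the content is what would let a result report how it was retrieved.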
Per-host token-bucket rate limiting honors each upstream's published quota exactly:
arXiv at 0.33 rps (their explicit rule) and NCBI at 3 rps (10 with an API key),
with the NCBI budget shared across the eutils.*, pmc.*, and www.* hostnames.
Independent services run fully in parallel.
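A token bucket with a shared budget can be sketched in a few lines. The class below is an illustrative assumption, not the package's limiter: `TokenBucket`, `try_acquire`, the burst `capacity`, and the full hostnames in the mapping are hypothetical choices. The point it shows is that several hostnames can map to the same bucket object so they draw from one quota, while unrelated hosts get their own.

```python
import threading
import time
from typing import Callable, Dict


class TokenBucket:
    """Allow roughly `rate` requests per second, with bursts up to `capacity`."""

    def __init__(
        self,
        rate: float,
        capacity: float = 1.0,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self.rate = rate
        self.capacity = capacity
        self._tokens = capacity
        self._clock = clock
        self._last = clock()
        self._lock = threading.Lock()

    def try_acquire(self) -> bool:
        """Take one token if available; return False when the caller should wait."""
        with self._lock:
            now = self._clock()
            # Refill in proportion to elapsed time, capped at capacity.
            self._tokens = min(self.capacity, self._tokens + (now - self._last) * self.rate)
            self._last = now
            if self._tokens >= 1.0:
                self._tokens -= 1.0
                return True
            return False


# One bucket shared by several hostnames models a quota that spans them.
ncbi_bucket = TokenBucket(rate=3.0, capacity=3.0)
buckets: Dict[str, TokenBucket] = {
    "eutils.ncbi.nlm.nih.gov": ncbi_bucket,
    "pmc.ncbi.nlm.nih.gov": ncbi_bucket,
    "www.ncbi.nlm.nih.gov": ncbi_bucket,
    "export.arxiv.org": TokenBucket(rate=0.33, capacity=1.0),
}
```

Because each host group has its own lock and token count, a slow arXiv budget in this sketch never blocks NCBI requests.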
License classification
Every successful fetch carries a LicenseClass:
cc-by cc-by-sa cc-by-nd cc-by-nc cc-by-nc-sa cc-by-nc-nd cc0
text-mining-only bronze arxiv-default closed unknown
FetchResult also exposes helper booleans (may_redistribute,
may_redistribute_nc, may_make_derivatives, may_train_models,
may_use_for_tdm) and a source_type field (publisher /
repository / other) for callers who want publisher-vs-repository
policy without parsing strings. Storage-tier policy is a one-line check.
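To make the license classes concrete, here is a hedged sketch of how a storage-tier decision could be derived from them. The enum values mirror the list above; everything else is an assumption. `LicenseClassSketch`, `storage_tier`, and the tier names are hypothetical, the redistributable set follows the quickstart comment (CC BY / CC0 / CC BY-SA), and treating every class other than bronze, unknown, and closed as TDM-permissive is this sketch's reading of that same comment, not a documented rule.

```python
from enum import Enum


class LicenseClassSketch(str, Enum):
    CC_BY = "cc-by"
    CC_BY_SA = "cc-by-sa"
    CC_BY_ND = "cc-by-nd"
    CC_BY_NC = "cc-by-nc"
    CC_BY_NC_SA = "cc-by-nc-sa"
    CC_BY_NC_ND = "cc-by-nc-nd"
    CC0 = "cc0"
    TEXT_MINING_ONLY = "text-mining-only"
    BRONZE = "bronze"
    ARXIV_DEFAULT = "arxiv-default"
    CLOSED = "closed"
    UNKNOWN = "unknown"


# From the quickstart comment: these classes may be redistributed.
REDISTRIBUTABLE = {
    LicenseClassSketch.CC_BY,
    LicenseClassSketch.CC0,
    LicenseClassSketch.CC_BY_SA,
}
# From the quickstart comment: these classes are not TDM-permissive.
NOT_TDM = {
    LicenseClassSketch.BRONZE,
    LicenseClassSketch.UNKNOWN,
    LicenseClassSketch.CLOSED,
}


def storage_tier(license_class: LicenseClassSketch) -> str:
    """Map a license class to a hypothetical storage tier name."""
    if license_class in REDISTRIBUTABLE:
        return "store-bytes"
    if license_class not in NOT_TDM:
        return "extract-inline"
    return "skip"


print(storage_tier(LicenseClassSketch("cc-by")))
print(storage_tier(LicenseClassSketch("bronze")))
```

In real use the booleans on the fetch result would replace these sets; the sketch only shows the shape of the decision.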
Successful results also carry an attribution blockquote at the top of
the markdown by default (source URL + DOI/PMCID + license + retrieval
strategy), so the markdown is self-attributing when it travels to end
users. Disable with Client(include_attribution=False).
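As an illustration of what a self-attributing document might look like, the sketch below prepends a markdown blockquote carrying the four pieces of information listed above. The exact wording and field order of the real block are not documented here, so `with_attribution`, its parameter names, the labels, and the example URL are all assumptions.

```python
def with_attribution(
    markdown: str,
    source_url: str,
    identifier: str,
    license_class: str,
    strategy: str,
) -> str:
    """Prepend a blockquote with source, identifier, license, and strategy."""
    header = "\n".join(
        [
            f"> Source: {source_url}",
            f"> Identifier: {identifier}",
            f"> License: {license_class}",
            f"> Retrieved via: {strategy}",
        ]
    )
    # A blank line ends the blockquote before the document body starts.
    return f"{header}\n\n{markdown}"


attributed = with_attribution(
    "# Title\n\nBody text.",
    source_url="https://example.org/paper",  # placeholder URL
    identifier="10.1186/s12943-024-02093-w",
    license_class="cc-by",
    strategy="unpaywall",
)
print(attributed)
```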
What's NOT here
- No Sci-Hub, no archive scraping, no UA spoofing past WAFs.
- No title-only paper search (use PaperFinder / S2 first to get an identifier).
- No multi-tenant Credentials per-call object (one Client per credential set).
- No async API (sync only in v0.1).
Configuration
See Client.__init__ and docs/concepts.md for the full list. Required:
email (Crossref polite-pool identifier; kwarg or ASTA_PAPERS_EMAIL env).
Recommended: NCBI_API_KEY env (free, 5-minute registration, 3.3× throughput).
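The "kwarg or env" rule for the required email could be resolved along these lines. This is a sketch only: `resolve_email` is a hypothetical helper, and letting the keyword argument win over ASTA_PAPERS_EMAIL is an assumption about precedence that the text above does not state.

```python
import os
from typing import Mapping, Optional


def resolve_email(
    email: Optional[str] = None,
    env: Mapping[str, str] = os.environ,
) -> str:
    """Return the contact email from the kwarg, else from ASTA_PAPERS_EMAIL."""
    # Assumption: an explicit kwarg takes precedence over the environment.
    value = email or env.get("ASTA_PAPERS_EMAIL")
    if not value:
        raise ValueError("email is required: pass email=... or set ASTA_PAPERS_EMAIL")
    return value


print(resolve_email("me@example.org", env={}))
print(resolve_email(env={"ASTA_PAPERS_EMAIL": "ci@example.org"}))
```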
Tests
pytest tests/integration -v # 29 real-API tests
python tools/check_test_legitimacy.py --strict # asserts mock-ratio < 30%
The full integration suite hits live upstream APIs — no mocks. Tests run in ~90 seconds. A per-paper snapshot recovery benchmark (53 biomedical DOIs that fail Mistral-OCR-only retrieval; 46/53 = 87% recovered) gates regressions.
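The benchmark arithmetic and the idea of gating on it can be sketched as follows. The 46-of-53 figures come from the text above; `recovery_rate`, `gate`, and the rule "fail if fewer papers are recovered than in the stored snapshot" are assumptions about how such a gate might work, not a description of the actual test.

```python
def recovery_rate(recovered: int, total: int) -> float:
    """Fraction of benchmark papers recovered."""
    if total <= 0:
        raise ValueError("total must be positive")
    return recovered / total


def gate(recovered: int, baseline_recovered: int) -> bool:
    """Hypothetical regression gate: pass only if we recover at least the baseline."""
    return recovered >= baseline_recovered


rate = recovery_rate(46, 53)
print(f"{rate:.0%}")
print(gate(recovered=46, baseline_recovered=46))
print(gate(recovered=44, baseline_recovered=46))
```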
Design
Full design at docs/DESIGN.md.