Academic paper search CLI with multi-provider discovery, reading, downloading, and optional LLM planning

paperhub-cli

paperhub-cli is a Python CLI for searching, reading, and downloading academic papers across multiple providers.

It supports:

direct multi-provider search with aiohttp
optional LLM-guided query planning and decomposition
normalized paper records with stable ids like arxiv:..., acl:..., doi:..., and openalex:W...
provider capability metadata for search / download / read

Features

Search many providers from one CLI entrypoint.
Read metadata and, where available, extract PDF text.
Download PDFs when an open or direct PDF link exists.
Fan out across providers and merge/dedupe results.
Keep planner tool hints aligned with the same provider registry used by direct search.
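
The fan-out and merge/dedupe step can be pictured with a minimal asyncio sketch. This is not the project's orchestrator: the two provider coroutines, the dictionary record shape, and the dedupe rule (DOI first, then stable id) are all assumptions made for illustration.

```python
import asyncio

# Hypothetical stand-ins for provider searches; each returns normalized records.
async def search_arxiv(query):
    return [{"id": "arxiv:2005.11401", "doi": None, "title": "Example preprint"}]

async def search_openalex(query):
    return [
        {"id": "openalex:W1", "doi": "10.1000/182", "title": "Example article"},
        {"id": "arxiv:2005.11401", "doi": None, "title": "Example preprint"},
    ]

def dedupe_key(paper):
    # Assumed rule: prefer the DOI when present, otherwise use the stable id.
    if paper.get("doi"):
        return ("doi", paper["doi"].lower())
    return ("id", paper["id"])

async def fan_out(query, providers):
    # Run every provider concurrently; one failing provider does not abort the rest.
    batches = await asyncio.gather(
        *(provider(query) for provider in providers), return_exceptions=True
    )
    merged = {}
    for batch in batches:
        if isinstance(batch, Exception):
            continue
        for paper in batch:
            merged.setdefault(dedupe_key(paper), paper)  # first occurrence wins
    return list(merged.values())

papers = asyncio.run(fan_out("rag", [search_arxiv, search_openalex]))
```

Here the arXiv record returned by both providers appears once in `papers`, and results keep the order in which providers were listed.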

Supported Providers

Current provider ids include:

arxiv
acl
crossref
openalex
dblp
openaire
pubmed
europepmc
pmc
biorxiv
medrxiv
zenodo
hal
semantic_scholar
core
doaj
unpaywall
iacr
citeseerx
base
ssrn
google_scholar
scihub
ieee
acm

Run this to see the exact capability levels in your install:

paperhub-cli providers

Capability levels are defined in code as values such as full, info_only, oa_only, best_effort, unsupported, and skeleton.
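
As a rough illustration of how such levels could be modeled, the sketch below puts the six values named above into an enum. The class name, the helper, and the choice of which levels may yield a PDF are assumptions, not the package's actual definitions.

```python
from enum import Enum

class Capability(str, Enum):
    # Values taken from the list above; the class itself is hypothetical.
    FULL = "full"
    INFO_ONLY = "info_only"
    OA_ONLY = "oa_only"
    BEST_EFFORT = "best_effort"
    UNSUPPORTED = "unsupported"
    SKELETON = "skeleton"

# Assumed interpretation: levels under which a PDF download might succeed.
MAY_DOWNLOAD = {Capability.FULL, Capability.OA_ONLY, Capability.BEST_EFFORT}

def may_download(level: str) -> bool:
    """Return True if the given capability level could produce a PDF."""
    return Capability(level) in MAY_DOWNLOAD
```

An unknown string such as "partial" raises ValueError, which surfaces typos in capability metadata early.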

Platform	Search	Download	Read	Notes
arXiv	✅	✅	✅	Open API; reliable
PubMed	✅	❌	⚠️ info-only	Open API; reliable
bioRxiv	✅	✅	✅	Open API; reliable
medRxiv	✅	✅	✅	Open API; reliable
Google Scholar	⚠️	❌	❌	Bot-detection active; optional `PAPERHUB_GOOGLE_SCHOLAR_PROXY_URL`
IACR	✅	✅	✅	Open API; reliable
Semantic Scholar	✅	✅ (OA)	✅ (OA)	Works without key (rate-limited); key improves limits; key rejection (403) retried automatically without key
Crossref	✅	❌	⚠️ info-only	Open API; reliable
OpenAlex	✅	❌	⚠️ info-only	Open API; reliable
PMC	✅	✅ (OA only)	✅ (OA only)	OA PDFs only; direct download may be blocked by some proxy environments
CORE	✅	✅ (record-dependent)	✅ (record-dependent)	Free key recommended; connector retries with backoff and falls back to key-less on 401/403
Europe PMC	✅	✅ (OA)	✅ (OA)	OA PDFs only; direct download may be blocked by some proxy environments
dblp	✅	❌	⚠️ info-only	Open API; reliable
OpenAIRE	✅	❌	❌	Open API; retries 3× with escalating request profiles on transient 403
CiteSeerX	⚠️	✅ (record-dependent)	⚠️	API endpoint intermittently unavailable / redirects to web archive
DOAJ	✅	⚠️ (URL-dependent)	⚠️ (URL-dependent)	PDF availability varies by article; free key raises rate limits
BASE	⚠️	✅ (record-dependent)	✅ (record-dependent)	OAI-PMH endpoint requires institutional IP registration; returns empty gracefully otherwise
Zenodo	✅	✅ (record-dependent)	✅ (record-dependent)	Open API; reliable
HAL	✅	✅ (record-dependent)	✅ (record-dependent)	Open API; reliable
SSRN	⚠️	⚠️ best-effort	⚠️ best-effort	403 bot-detection active; public PDF only
Unpaywall	✅ (DOI lookup)	❌	❌	Requires `PAPERHUB_UNPAYWALL_EMAIL`
Sci-Hub (optional)	⚠️ fallback-only	✅	❌	Optional; unstable mirrors; user responsibility
IEEE Xplore 🔑	🚧 skeleton	🚧 skeleton	🚧 skeleton	Requires `PAPERHUB_IEEE_API_KEY` to activate
ACM DL 🔑	🚧 skeleton	🚧 skeleton	🚧 skeleton	Requires `PAPERHUB_ACM_API_KEY` to activate

✅ = reliable in live tests. ⚠️ = works but subject to upstream instability or access restrictions. ❌ = not supported. 🔑 = key required. 🚧 = skeleton only.
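
The key-less fallback noted for Semantic Scholar and CORE can be sketched as follows. The function name, the `(status, payload)` return shape, and the fake upstream are illustrative assumptions; the real connectors may also add backoff and other retries.

```python
import asyncio

async def fetch_with_key_fallback(fetch, api_key, rejected=(401, 403)):
    """Call fetch(api_key); if the key is rejected, retry once without it."""
    status, payload = await fetch(api_key)
    if api_key is not None and status in rejected:
        status, payload = await fetch(None)
    return status, payload

# Fake upstream for demonstration: rejects one key, accepts key-less calls.
async def fake_fetch(api_key):
    if api_key == "bad-key":
        return 403, None
    return 200, {"results": [], "used_key": api_key is not None}

status, payload = asyncio.run(fetch_with_key_fallback(fake_fetch, "bad-key"))
```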

Installation

Install from PyPI

Install the published package with:

pip install paperhub-cli

Local editable install

python -m venv .venv
source .venv/bin/activate
pip install -e .

Install with dev dependencies

pip install -e ".[dev]"

Build and publish to PyPI

From the repository root:

python -m pip install --upgrade build twine
python -m build
python -m twine check dist/*
python -m twine upload dist/*

After upload succeeds, users can install it anywhere with:

pip install paperhub-cli

CLI Usage

Search

LLM-assisted planning:

paperhub-cli search "retrieval augmented generation for scientific QA"

Direct search without planner:

paperhub-cli search "vision transformers" --no-plan

Search specific providers:

paperhub-cli search "long context language models" \
  --no-plan \
  --sources arxiv,openalex,semantic_scholar

Restrict by year:

paperhub-cli search "biomedical relation extraction" \
  --no-plan \
  --sources pubmed,europepmc \
  --year-from 2022

Use recent-year mode:

paperhub-cli search "multimodal agents" --recent-years 3 --no-plan

Resume a previous planned run:

paperhub-cli search --resume <research_id>

Read

Read normalized metadata by stable id:

paperhub-cli read --id arxiv:2005.11401
paperhub-cli read --id acl:2023.acl-long.1
paperhub-cli read --id doi:10.1145/nnnnnnn.nnnnnnn
paperhub-cli read --id openalex:W2741809807
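
A stable id is a provider prefix followed by a provider-specific value. The sketch below shows one way to split and validate such ids; `parse_stable_id` and `KNOWN_PREFIXES` are hypothetical names, and the project's own id parsing under paperhub_cli/providers/ may differ.

```python
from typing import NamedTuple

# Only the prefixes shown in this README; the real set may be larger.
KNOWN_PREFIXES = {"arxiv", "acl", "doi", "openalex"}

class StableId(NamedTuple):
    provider: str
    value: str

def parse_stable_id(raw: str) -> StableId:
    """Split 'prefix:value' on the first colon and validate the prefix."""
    prefix, sep, value = raw.strip().partition(":")
    prefix = prefix.lower()
    if not sep or not value or prefix not in KNOWN_PREFIXES:
        raise ValueError(f"not a recognized stable id: {raw!r}")
    return StableId(prefix, value)

parsed = parse_stable_id("doi:10.1000/182")
```

Splitting only on the first colon keeps values that themselves contain colons intact.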

Try full-text PDF extraction when possible:

paperhub-cli read --id arxiv:2005.11401 --full

Download

Download a paper PDF to the current directory:

paperhub-cli download --id arxiv:2005.11401

Export as plain text or Markdown instead of PDF:

paperhub-cli download --id arxiv:2005.11401 --format txt
paperhub-cli download --id arxiv:2005.11401 --format md

Choose a destination:

paperhub-cli download --id doi:10.1000/182 --dest papers/

Important Flags

--sources: comma-separated provider ids for direct backend selection
--source: legacy selector (arxiv, acl, or both), used only when --sources is not set
--no-plan: bypass LLM planning and run one direct search
--depth: planner depth for decomposed research runs
--top-k: maximum number of papers per query or subtopic
--recent-years: convenience filter for recent work
--verbose: print LLM diagnostics and INFO logs

Environment Variables

The project uses PAPERHUB_* environment variables for provider-specific settings.

General

PAPERHUB_HTTP_USER_AGENT: override the default HTTP user agent
PAPERHUB_LL_DIAG: enable LLM diagnostics output

Provider-specific

PAPERHUB_CROSSREF_MAILTO: contact email sent in Crossref requests
PAPERHUB_OPENALEX_EMAIL: email used for polite OpenAlex identification
PAPERHUB_UNPAYWALL_EMAIL: required for Unpaywall DOI lookups
PAPERHUB_SEMANTIC_SCHOLAR_API_KEY: optional Semantic Scholar API key
PAPERHUB_CORE_API_KEY: optional CORE API key
PAPERHUB_DOAJ_API_KEY: optional DOAJ API key
PAPERHUB_ZENODO_ACCESS_TOKEN: optional Zenodo token
PAPERHUB_GOOGLE_SCHOLAR_PROXY_URL: optional proxy endpoint for fragile Scholar access
PAPERHUB_SCIHUB_ENABLED=1: explicit opt-in gate for the Sci-Hub stub
PAPERHUB_IEEE_API_KEY: future IEEE integration gate
PAPERHUB_ACM_API_KEY: future ACM integration gate

Example:

export PAPERHUB_UNPAYWALL_EMAIL="you@example.com"
export PAPERHUB_SEMANTIC_SCHOLAR_API_KEY="..."
paperhub-cli search "agentic retrieval" --no-plan --sources semantic_scholar,openalex
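
To show how these variables might gate providers, here is a small sketch that reads a few of them and drops sources whose required setting is missing. The function names and the returned dictionary keys are assumptions; only the variable names and the Unpaywall and Sci-Hub requirements come from this README.

```python
import os

def load_provider_settings(env=None):
    """Collect a few PAPERHUB_* settings from a mapping (defaults to os.environ)."""
    env = os.environ if env is None else env
    return {
        "unpaywall_email": env.get("PAPERHUB_UNPAYWALL_EMAIL"),
        "semantic_scholar_api_key": env.get("PAPERHUB_SEMANTIC_SCHOLAR_API_KEY"),
        "core_api_key": env.get("PAPERHUB_CORE_API_KEY"),
        "scihub_enabled": env.get("PAPERHUB_SCIHUB_ENABLED") == "1",
    }

def usable_sources(requested, settings):
    """Drop sources whose required setting is absent (hypothetical helper)."""
    skip = set()
    if not settings["unpaywall_email"]:
        skip.add("unpaywall")  # Unpaywall requires an email
    if not settings["scihub_enabled"]:
        skip.add("scihub")  # Sci-Hub is opt-in only
    return [source for source in requested if source not in skip]

settings = load_provider_settings({"PAPERHUB_UNPAYWALL_EMAIL": "you@example.com"})
sources = usable_sources(["unpaywall", "scihub", "arxiv"], settings)
```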

LLM Provider Configuration

Planning mode can run with:

direct OpenAI-compatible APIs
direct Gemini API
a LiteLLM proxy that routes to many providers (OpenAI, Anthropic Claude, Vertex Gemini, Bedrock, and others)

Option A: OpenAI-compatible direct

export LLM_API_KEY="sk-..."
export LLM_MODEL="gpt-4o-mini"
# optional (defaults to https://api.openai.com/v1)
export LLM_HOST="https://api.openai.com/v1"

You can also use OPENAI_API_KEY instead of LLM_API_KEY.

Option B: Gemini direct

export LLM_PROVIDER="gemini"
export GEMINI_API_KEY="..."
export LLM_MODEL="gemini-2.0-flash"
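
The two options above can be read as a small resolution rule over environment variables. The sketch below is one plausible reading, not the package's code: the function name, the "openai" default provider label, and the precedence of LLM_API_KEY over OPENAI_API_KEY are assumptions.

```python
import os

DEFAULT_OPENAI_HOST = "https://api.openai.com/v1"  # default stated in Option A

def resolve_llm_config(env=None):
    """Build an LLM config dict from a mapping (defaults to os.environ)."""
    env = os.environ if env is None else env
    provider = env.get("LLM_PROVIDER", "openai").lower()  # default label is assumed
    if provider == "gemini":
        return {
            "provider": "gemini",
            "api_key": env.get("GEMINI_API_KEY"),
            "model": env.get("LLM_MODEL"),
            "host": None,
        }
    return {
        "provider": "openai",
        # Assumed precedence: LLM_API_KEY first, then OPENAI_API_KEY.
        "api_key": env.get("LLM_API_KEY") or env.get("OPENAI_API_KEY"),
        "model": env.get("LLM_MODEL"),
        "host": env.get("LLM_HOST", DEFAULT_OPENAI_HOST),
    }

config = resolve_llm_config({"OPENAI_API_KEY": "sk-test", "LLM_MODEL": "gpt-4o-mini"})
```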

Architecture

The codebase centers around a normalized Paper record and an async provider layer.

paperhub_cli/models.py: Paper, SearchFilters, and known Source values
paperhub_cli/providers/: provider implementations, capability metadata, id parsing, merging, and registry
paperhub_cli/search/orchestrator.py: direct multi-provider search orchestration
paperhub_cli/tools/__init__.py: planner-facing tool registry built from the same provider layer
paperhub_cli/planner/agents/: LLM rephrase + decomposition workflow
paperhub_cli/reader/fetcher.py: provider-aware metadata fetch and PDF text extraction

Planner Tool Hints

Planner execution uses grouped tool hints backed by the same provider registry. Examples include:

multi_default
open_metadata
biomedical
preprints_wide
broad_scholarly

This keeps direct CLI search and planner-guided search aligned.

Testing

Run the test suite with:

PYTHONPATH=. pytest -q

The test suite includes:

unit tests for provider resolution and reader id normalization
fixture-based parsing tests for provider payload normalization
planner/tool registry coverage

MCP Server Notes

This repository currently provides the provider layer and tool abstractions needed for an MCP server, but it does not yet ship a dedicated MCP server module.

If you want to expose it over MCP, the intended pattern is:

create a thin MCP adapter around the existing provider registry and reader entrypoints
expose stable tools such as search_papers, read_paper, download_paper, and list_providers
return JSON-serializable Paper.to_dict() payloads
keep MCP tool definitions generic and pass provider ids as parameters instead of creating one MCP tool per provider

Example MCP tools to expose:

search_papers: search across one or more providers
read_paper: fetch normalized metadata by stable id
download_paper: download or export a paper as pdf, txt, or md
list_providers: return provider capability metadata
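
The "generic tools, provider ids as parameters" pattern can be sketched without any MCP SDK as a plain dispatch table. Everything here is hypothetical: the backend functions are stubs standing in for the provider registry, the capability values are placeholders, and a real adapter would register these callables with an MCP server library instead of calling them directly.

```python
import json

# Stub backends standing in for the provider registry and reader entrypoints.
def search_papers(query, sources=("arxiv",), top_k=5):
    # Provider ids arrive as a parameter rather than as one tool per provider.
    hits = [{"id": f"{s}:example", "title": query, "source": s} for s in sources]
    return hits[:top_k]

def list_providers():
    # Placeholder capability values for illustration only.
    return [{"id": "arxiv", "search": "full"}, {"id": "crossref", "search": "full"}]

TOOLS = {"search_papers": search_papers, "list_providers": list_providers}

def call_tool(name, arguments=None):
    """Dispatch a tool call by name and return a JSON string."""
    if name not in TOOLS:
        raise KeyError(f"unknown tool: {name}")
    result = TOOLS[name](**(arguments or {}))
    return json.dumps(result)

response = call_tool("search_papers", {"query": "rag", "sources": ["arxiv", "openalex"]})
```

Adding a provider then changes only the registry behind `search_papers`, not the set of exposed tools.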

Example Claude Desktop configuration, assuming an adapter module at paperhub_cli.mcp.server (not yet shipped):

{
  "mcpServers": {
    "paperhub": {
      "command": "python",
      "args": ["-m", "paperhub_cli.mcp.server"],
      "env": {
        "PAPERHUB_UNPAYWALL_EMAIL": "you@example.com",
        "PAPERHUB_SEMANTIC_SCHOLAR_API_KEY": "your-api-key"
      }
    }
  }
}

If you prefer a CLI-style entrypoint, the intended UX would be similar to:

{
  "mcpServers": {
    "paperhub": {
      "command": "paperhub-cli",
      "args": ["mcp"]
    }
  }
}

Caveats

Some providers are best-effort or metadata-only.
Read/download support depends on OA links or provider capabilities.
Fragile sources such as Scholar-style scraping and Sci-Hub-style flows should remain opt-in.
Upstream APIs and HTML layouts may change and require parser maintenance.

License / Attribution

This project references provider behavior and capability ideas similar to paper-search-mcp, but implements native async provider support directly in this repository rather than depending on that package.

Project details

Release history

This version

0.1.0

Apr 6, 2026

Download files

Source Distribution

paperhub_cli-0.1.0.tar.gz (59.0 kB)

Uploaded Apr 6, 2026 Source

Built Distribution

paperhub_cli-0.1.0-py3-none-any.whl (77.6 kB)

Uploaded Apr 6, 2026 Python 3

File details

Details for the file paperhub_cli-0.1.0.tar.gz.

File metadata

Download URL: paperhub_cli-0.1.0.tar.gz
Upload date: Apr 6, 2026
Size: 59.0 kB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.2.0 CPython/3.11.14

File hashes

Hashes for paperhub_cli-0.1.0.tar.gz
Algorithm	Hash digest
SHA256	`cec6f82fae24fbe0d22d96f830989f549b03208ec968b3e5088500227eec35ba`
MD5	`814a2ced550a48b08094be9564c8d5a0`
BLAKE2b-256	`0eddacfeb1c321ac53be5e3e8125136a1de8c0487ee76000d58f87f1e388a5a3`

File details

Details for the file paperhub_cli-0.1.0-py3-none-any.whl.

File metadata

Download URL: paperhub_cli-0.1.0-py3-none-any.whl
Upload date: Apr 6, 2026
Size: 77.6 kB
Tags: Python 3
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.2.0 CPython/3.11.14

File hashes

Hashes for paperhub_cli-0.1.0-py3-none-any.whl
Algorithm	Hash digest
SHA256	`e303f27d6e2b330fed5756fa44dfb18882d01276f2a495a5938191f2276d2d53`
MD5	`18e277b8917bfe21269d4b3672b0f9e0`
BLAKE2b-256	`098ff1baf6dcd6d064f209d8ae8cdf9bcc1f4589ea4db62c4670b70cdf330f08`
