
Academic paper search CLI with multi-provider discovery, reading, downloading, and optional LLM planning


paperhub-cli

paperhub-cli is a Python CLI for searching, reading, and downloading academic papers across multiple providers.

It supports:

  • direct multi-provider search with aiohttp
  • optional LLM-guided query planning and decomposition
  • normalized paper records with stable ids like arxiv:..., acl:..., doi:..., and openalex:W...
  • provider capability metadata for search / download / read
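The stable id format can be illustrated with a small parser. This is a minimal sketch, not the project's own id parsing: `StableId` and `parse_stable_id` are hypothetical names, and the only rule assumed is that the provider prefix ends at the first colon.

```python
from typing import NamedTuple


class StableId(NamedTuple):
    provider: str  # e.g. "arxiv", "acl", "doi", "openalex"
    local_id: str  # provider-specific identifier, kept verbatim


def parse_stable_id(raw: str) -> StableId:
    """Split '<provider>:<id>' on the first colon only (hypothetical helper)."""
    provider, sep, local_id = raw.strip().partition(":")
    if not sep or not provider or not local_id:
        raise ValueError(f"expected '<provider>:<id>', got {raw!r}")
    # Lower-case the prefix; leave the local part alone because DOIs and
    # OpenAlex ids may carry meaningful casing.
    return StableId(provider.lower(), local_id)


if __name__ == "__main__":
    for raw in ("arxiv:2005.11401", "doi:10.1000/182", "openalex:W2741809807"):
        print(parse_stable_id(raw))
```

Splitting on the first colon only keeps any later colons in the identifier intact.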

Features

  • Search many providers from one CLI entrypoint.
  • Read metadata and, where available, extract PDF text.
  • Download PDFs when an open or direct PDF link exists.
  • Fan out across providers and merge/dedupe results.
  • Keep planner tool hints aligned with the same provider registry used by direct search.

Supported Providers

Current provider ids include:

  • arxiv
  • acl
  • crossref
  • openalex
  • dblp
  • openaire
  • pubmed
  • europepmc
  • pmc
  • biorxiv
  • medrxiv
  • zenodo
  • hal
  • semantic_scholar
  • core
  • doaj
  • unpaywall
  • iacr
  • citeseerx
  • base
  • ssrn
  • google_scholar
  • scihub
  • ieee
  • acm

Run this to see the exact capability levels in your install:

paperhub-cli providers

Capability levels are defined in code as values such as full, info_only, oa_only, best_effort, unsupported, and skeleton.
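One plausible way to model these levels is a string-valued enum. The sketch below is an assumption about shape only: the class name `Capability` and the helper `can_attempt_download` are hypothetical, and the decision about which levels permit a download attempt is illustrative, not taken from the code.

```python
from enum import Enum


class Capability(str, Enum):
    # Values copied from the level names listed above.
    FULL = "full"
    INFO_ONLY = "info_only"
    OA_ONLY = "oa_only"
    BEST_EFFORT = "best_effort"
    UNSUPPORTED = "unsupported"
    SKELETON = "skeleton"


def can_attempt_download(level: Capability) -> bool:
    """Assumed policy: metadata-only, unsupported, and skeleton never download."""
    blocked = (Capability.INFO_ONLY, Capability.UNSUPPORTED, Capability.SKELETON)
    return level not in blocked


if __name__ == "__main__":
    for level in Capability:
        print(f"{level.value:12} download attempt: {can_attempt_download(level)}")
```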

Platform Search Download Read Notes
arXiv Open API; reliable
PubMed ⚠️ info-only Open API; reliable
bioRxiv Open API; reliable
medRxiv Open API; reliable
Google Scholar ⚠️ Bot-detection active; optional PAPERHUB_GOOGLE_SCHOLAR_PROXY_URL
IACR Open API; reliable
Semantic Scholar ✅ (OA) ✅ (OA) Works without key (rate-limited); key improves limits; key rejection (403) retried automatically without key
Crossref ⚠️ info-only Open API; reliable
OpenAlex ⚠️ info-only Open API; reliable
PMC ✅ (OA only) ✅ (OA only) OA PDFs only; direct download may be blocked by some proxy environments
CORE ✅ (record-dependent) ✅ (record-dependent) Free key recommended; connector retries with backoff and falls back to key-less on 401/403
Europe PMC ✅ (OA) ✅ (OA) OA PDFs only; direct download may be blocked by some proxy environments
dblp ⚠️ info-only Open API; reliable
OpenAIRE Open API; retries 3× with escalating request profiles on transient 403
CiteSeerX ⚠️ ✅ (record-dependent) ⚠️ API endpoint intermittently unavailable / redirects to web archive
DOAJ ⚠️ (URL-dependent) ⚠️ (URL-dependent) PDF availability varies by article; free key raises rate limits
BASE ⚠️ ✅ (record-dependent) ✅ (record-dependent) OAI-PMH endpoint requires institutional IP registration; returns empty gracefully otherwise
Zenodo ✅ (record-dependent) ✅ (record-dependent) Open API; reliable
HAL ✅ (record-dependent) ✅ (record-dependent) Open API; reliable
SSRN ⚠️ ⚠️ best-effort ⚠️ best-effort 403 bot-detection active; public PDF only
Unpaywall ✅ (DOI lookup) Requires PAPERHUB_UNPAYWALL_EMAIL
Sci-Hub (optional) ⚠️ fallback-only Optional; unstable mirrors; user responsibility
IEEE Xplore 🔑 🚧 skeleton 🚧 skeleton 🚧 skeleton Requires PAPERHUB_IEEE_API_KEY to activate
ACM DL 🔑 🚧 skeleton 🚧 skeleton 🚧 skeleton Requires PAPERHUB_ACM_API_KEY to activate

✅ = reliable in live tests. ⚠️ = works but subject to upstream instability or access restrictions. ❌ = not supported. 🔑 = key required. 🚧 = skeleton only.
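Several notes above mention retrying with backoff and falling back to key-less access when a key is rejected with 401/403. A minimal sketch of that pattern follows. The `fetch` callable, the set of retryable status codes, and the delay values are assumptions for illustration and do not describe the actual connectors.

```python
import time
from typing import Callable, Optional

# Hypothetical transport: takes an optional API key, returns (status, body).
Fetch = Callable[[Optional[str]], tuple[int, str]]

RETRYABLE = (429, 500, 502, 503, 504)  # assumed transient statuses


def fetch_with_fallback(
    fetch: Fetch,
    api_key: Optional[str],
    retries: int = 3,
    base_delay: float = 0.5,
    sleep: Callable[[float], None] = time.sleep,
) -> tuple[int, str]:
    key = api_key
    status, body = 0, ""
    for attempt in range(retries):
        status, body = fetch(key)
        if status == 200:
            return status, body
        if status in (401, 403) and key is not None:
            key = None  # key rejected: next attempt goes out without it
            continue
        if status in RETRYABLE and attempt < retries - 1:
            sleep(base_delay * 2 ** attempt)  # exponential backoff
            continue
        break  # non-retryable status, or attempts exhausted
    return status, body


if __name__ == "__main__":
    # Stub upstream that rejects any key but serves anonymous requests.
    def stub(key: Optional[str]) -> tuple[int, str]:
        return (403, "") if key else (200, "ok")

    print(fetch_with_fallback(stub, "bad-key"))
```

Injecting `sleep` keeps the backoff testable without real delays.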


Installation

Install from PyPI

Install the published package with:

pip install paperhub-cli

Local editable install

python -m venv .venv
source .venv/bin/activate
pip install -e .

Install with dev dependencies

# quotes keep shells such as zsh from expanding the brackets
pip install -e ".[dev]"

Build and publish to PyPI

From the repository root:

python -m pip install --upgrade build twine
python -m build
python -m twine check dist/*
python -m twine upload dist/*

After upload succeeds, users can install it anywhere with:

pip install paperhub-cli

CLI Usage

Search

LLM-assisted planning:

paperhub-cli search "retrieval augmented generation for scientific QA"

Direct search without planner:

paperhub-cli search "vision transformers" --no-plan

Search specific providers:

paperhub-cli search "long context language models" \
  --no-plan \
  --sources arxiv,openalex,semantic_scholar

Restrict by year:

paperhub-cli search "biomedical relation extraction" \
  --no-plan \
  --sources pubmed,europepmc \
  --year-from 2022

Use recent-year mode:

paperhub-cli search "multimodal agents" --recent-years 3 --no-plan
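A recent-years filter can be thought of as shorthand for a computed start year. The sketch below assumes an inclusive window that counts the current year, so 3 years in 2025 means 2023 onward. That interpretation and the function name are assumptions, since the exact semantics of `--recent-years` are not spelled out here.

```python
from datetime import date
from typing import Optional


def recent_years_to_year_from(recent_years: int, today: Optional[date] = None) -> int:
    """Translate an N-year window into a start year (assumed inclusive)."""
    if recent_years < 1:
        raise ValueError("recent_years must be at least 1")
    current_year = (today or date.today()).year
    # Count the current year as the first of the N years.
    return current_year - recent_years + 1


if __name__ == "__main__":
    print(recent_years_to_year_from(3))
```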

Resume a previous planned run:

paperhub-cli search --resume <research_id>

Read

Read normalized metadata by stable id:

paperhub-cli read --id arxiv:2005.11401
paperhub-cli read --id acl:2023.acl-long.1
paperhub-cli read --id doi:10.1145/nnnnnnn.nnnnnnn
paperhub-cli read --id openalex:W2741809807

Try full-text PDF extraction when possible:

paperhub-cli read --id arxiv:2005.11401 --full

Download

Download a paper PDF to the current directory:

paperhub-cli download --id arxiv:2005.11401

Export as plain text or Markdown instead of PDF:

paperhub-cli download --id arxiv:2005.11401 --format txt
paperhub-cli download --id arxiv:2005.11401 --format md

Choose a destination:

paperhub-cli download --id doi:10.1000/182 --dest papers/

Important Flags

  • --sources: comma-separated provider ids for direct backend selection
  • --source: legacy arxiv / acl / both selector used when --sources is not set
  • --no-plan: bypass LLM planning and run one direct search
  • --depth: planner depth for decomposed research runs
  • --top-k: maximum number of papers per query or subtopic
  • --recent-years: convenience filter for recent work
  • --verbose: print LLM diagnostics and INFO logs
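The precedence between `--sources` and the legacy `--source` selector can be sketched as a small resolver. This is an illustration under stated assumptions: `resolve_sources` is a hypothetical function, and the default of `both` for the legacy selector is a guess, not a documented default.

```python
from typing import Optional

# Legacy selector values named in the flag description above.
LEGACY = {"arxiv": ["arxiv"], "acl": ["acl"], "both": ["arxiv", "acl"]}


def resolve_sources(sources: Optional[str], source: str = "both") -> list[str]:
    """Return provider ids; --sources wins, --source is the fallback."""
    if sources:
        ordered: list[str] = []
        for item in sources.split(","):
            provider_id = item.strip().lower()
            if provider_id and provider_id not in ordered:
                ordered.append(provider_id)  # keep order, drop repeats
        if ordered:
            return ordered
    try:
        return list(LEGACY[source])
    except KeyError:
        raise ValueError(f"unknown legacy source: {source!r}") from None


if __name__ == "__main__":
    print(resolve_sources("arxiv, openalex,semantic_scholar"))
    print(resolve_sources(None, "acl"))
```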

Environment Variables

The project uses PAPERHUB_* environment variables for provider-specific settings.

General

  • PAPERHUB_HTTP_USER_AGENT: override the default HTTP user agent
  • PAPERHUB_LL_DIAG: enable LLM diagnostics output

Provider-specific

  • PAPERHUB_CROSSREF_MAILTO: contact email sent in Crossref requests
  • PAPERHUB_OPENALEX_EMAIL: email used for polite OpenAlex identification
  • PAPERHUB_UNPAYWALL_EMAIL: required for Unpaywall DOI lookups
  • PAPERHUB_SEMANTIC_SCHOLAR_API_KEY: optional Semantic Scholar API key
  • PAPERHUB_CORE_API_KEY: optional CORE API key
  • PAPERHUB_DOAJ_API_KEY: optional DOAJ API key
  • PAPERHUB_ZENODO_ACCESS_TOKEN: optional Zenodo token
  • PAPERHUB_GOOGLE_SCHOLAR_PROXY_URL: optional proxy endpoint for fragile Scholar access
  • PAPERHUB_SCIHUB_ENABLED=1: explicit opt-in gate for the Sci-Hub stub
  • PAPERHUB_IEEE_API_KEY: future IEEE integration gate
  • PAPERHUB_ACM_API_KEY: future ACM integration gate

Example:

export PAPERHUB_UNPAYWALL_EMAIL="you@example.com"
export PAPERHUB_SEMANTIC_SCHOLAR_API_KEY="..."
paperhub-cli search "agentic retrieval" --no-plan --sources semantic_scholar,openalex
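For readers wiring these variables into their own scripts, the following sketch shows one way to collect a few of them into a typed settings object. `ProviderSettings` and `load_settings` are hypothetical names, and treating empty strings as unset is an assumption. Only the variable names come from the list above.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional


@dataclass(frozen=True)
class ProviderSettings:
    unpaywall_email: Optional[str]
    semantic_scholar_api_key: Optional[str]
    scihub_enabled: bool


def load_settings(env: Mapping[str, str] = os.environ) -> ProviderSettings:
    def get(name: str) -> Optional[str]:
        # Assumption: blank values count as "not set".
        value = env.get(name, "").strip()
        return value or None

    return ProviderSettings(
        unpaywall_email=get("PAPERHUB_UNPAYWALL_EMAIL"),
        semantic_scholar_api_key=get("PAPERHUB_SEMANTIC_SCHOLAR_API_KEY"),
        # Sci-Hub stays off unless the variable is exactly "1".
        scihub_enabled=get("PAPERHUB_SCIHUB_ENABLED") == "1",
    )


if __name__ == "__main__":
    print(load_settings())
```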

LLM Provider Configuration

Planning mode can run with:

  • direct OpenAI-compatible APIs
  • direct Gemini API
  • a LiteLLM proxy that routes to many providers (OpenAI, Anthropic Claude, Vertex Gemini, Bedrock, and others)

Option A: OpenAI-compatible direct

export LLM_API_KEY="sk-..."
export LLM_MODEL="gpt-4o-mini"
# optional (defaults to https://api.openai.com/v1)
export LLM_HOST="https://api.openai.com/v1"

You can also use OPENAI_API_KEY instead of LLM_API_KEY.

Option B: Gemini direct

export LLM_PROVIDER="gemini"
export GEMINI_API_KEY="..."
export LLM_MODEL="gemini-2.0-flash"
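The two options above imply a small resolution step at startup. The sketch below shows how such a step might look. `LLMConfig` and `resolve_llm_config` are hypothetical, the provider label `openai` for the default case is invented for illustration, and the precedence of `LLM_API_KEY` over `OPENAI_API_KEY` is an assumption.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional

DEFAULT_OPENAI_HOST = "https://api.openai.com/v1"  # default named in Option A


@dataclass(frozen=True)
class LLMConfig:
    provider: str
    api_key: Optional[str]
    model: Optional[str]
    host: Optional[str]


def resolve_llm_config(env: Mapping[str, str] = os.environ) -> LLMConfig:
    provider = env.get("LLM_PROVIDER", "openai").strip().lower()
    model = env.get("LLM_MODEL") or None
    if provider == "gemini":
        # Option B: Gemini direct; no host override is described for it.
        return LLMConfig("gemini", env.get("GEMINI_API_KEY") or None, model, None)
    # Option A: OpenAI-compatible; assumed precedence LLM_API_KEY first.
    api_key = env.get("LLM_API_KEY") or env.get("OPENAI_API_KEY") or None
    host = env.get("LLM_HOST") or DEFAULT_OPENAI_HOST
    return LLMConfig("openai", api_key, model, host)


if __name__ == "__main__":
    print(resolve_llm_config({"OPENAI_API_KEY": "sk-...", "LLM_MODEL": "gpt-4o-mini"}))
```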

Architecture

The codebase centers around a normalized Paper record and an async provider layer.

  • paperhub_cli/models.py: Paper, SearchFilters, and known Source values
  • paperhub_cli/providers/: provider implementations, capability metadata, id parsing, merging, and registry
  • paperhub_cli/search/orchestrator.py: direct multi-provider search orchestration
  • paperhub_cli/tools/__init__.py: planner-facing tool registry built from the same provider layer
  • paperhub_cli/planner/agents/: LLM rephrase + decomposition workflow
  • paperhub_cli/reader/fetcher.py: provider-aware metadata fetch and PDF text extraction

Planner Tool Hints

Planner execution uses grouped tool hints backed by the same provider registry. Examples include:

  • multi_default
  • open_metadata
  • biomedical
  • preprints_wide
  • broad_scholarly

This keeps direct CLI search and planner-guided search aligned.

Testing

Run the test suite with:

PYTHONPATH=. pytest -q

The test suite includes:

  • unit tests for provider resolution and reader id normalization
  • fixture-based parsing tests for provider payload normalization
  • planner/tool registry coverage

MCP Server Notes

This repository currently provides the provider layer and tool abstractions needed for an MCP server, but it does not yet ship a dedicated MCP server module.

If you want to expose it over MCP, the intended pattern is:

  1. create a thin MCP adapter around the existing provider registry and reader entrypoints
  2. expose stable tools such as search_papers, read_paper, download_paper, and list_providers
  3. return JSON-serializable Paper.to_dict() payloads
  4. keep MCP tool definitions generic and pass provider ids as parameters instead of creating one MCP tool per provider

Example MCP tools to expose:

  • search_papers: search across one or more providers
  • read_paper: fetch normalized metadata by stable id
  • download_paper: download or export a paper as pdf, txt, or md
  • list_providers: return provider capability metadata
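The "generic tools, provider ids as parameters" pattern can be sketched without any MCP SDK as a plain dispatch table. Nothing below is part of the package: the handler bodies are stubs, `call_tool` is a hypothetical entrypoint, and a real adapter would delegate to the provider registry and return `Paper.to_dict()` payloads instead of the placeholder dicts used here.

```python
import json
from typing import Any, Callable


def search_papers(query: str, providers: list[str]) -> list[dict[str, Any]]:
    # Stub: one placeholder record per requested provider id.
    return [{"id": f"{p}:example", "title": query, "source": p} for p in providers]


def list_providers() -> list[dict[str, Any]]:
    # Stub: placeholder capability metadata, not real registry output.
    return [{"id": "example", "search": "full", "download": "unsupported"}]


# One generic tool per operation; providers are arguments, not tool names.
TOOLS: dict[str, Callable[..., Any]] = {
    "search_papers": search_papers,
    "list_providers": list_providers,
}


def call_tool(name: str, arguments: dict[str, Any]) -> str:
    """Dispatch a tool call and return a JSON-serializable payload as text."""
    try:
        handler = TOOLS[name]
    except KeyError:
        raise ValueError(f"unknown tool: {name}") from None
    return json.dumps(handler(**arguments))


if __name__ == "__main__":
    print(call_tool("search_papers", {"query": "rag", "providers": ["arxiv", "pubmed"]}))
```

Keeping the tool surface fixed means adding a provider changes only the accepted parameter values, not the tool list a client sees.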

Example Claude Desktop configuration, assuming a server module exists at paperhub_cli.mcp.server (not shipped yet):

{
  "mcpServers": {
    "paperhub": {
      "command": "python",
      "args": ["-m", "paperhub_cli.mcp.server"],
      "env": {
        "PAPERHUB_UNPAYWALL_EMAIL": "you@example.com",
        "PAPERHUB_SEMANTIC_SCHOLAR_API_KEY": "your-api-key"
      }
    }
  }
}

If you prefer a CLI-style entrypoint, the intended UX would be similar to:

{
  "mcpServers": {
    "paperhub": {
      "command": "paperhub-cli",
      "args": ["mcp"]
    }
  }
}

Caveats

  • Some providers are best-effort or metadata-only.
  • Read/download support depends on OA links or provider capabilities.
  • Fragile sources such as Scholar-style scraping and Sci-Hub-style flows should remain opt-in.
  • Upstream APIs and HTML layouts may change and require parser maintenance.

License / Attribution

This project references provider behavior and capability ideas similar to paper-search-mcp, but implements native async provider support directly in this repository rather than depending on that package.
