Academic paper search CLI with multi-provider discovery, reading, downloading, and optional LLM planning
paperhub-cli
paperhub-cli is a Python CLI for searching, reading, and downloading academic papers across multiple providers.
It supports:
- direct multi-provider search with aiohttp
- optional LLM-guided query planning and decomposition
- normalized paper records with stable ids like arxiv:..., acl:..., doi:..., and openalex:W...
- provider capability metadata for search / download / read
Features
- Search many providers from one CLI entrypoint.
- Read metadata and, where available, extract PDF text.
- Download PDFs when an open or direct PDF link exists.
- Fan out across providers and merge/dedupe results.
- Keep planner tool hints aligned with the same provider registry used by direct search.
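The fan-out and merge/dedupe behaviour listed above can be sketched roughly as follows. This is a minimal illustration using only asyncio, not the project's orchestrator: the `fan_out` function, the dict-based record shape, and the fake providers are all hypothetical, and the "first occurrence wins" merge rule is an assumption.

```python
import asyncio
from typing import Awaitable, Callable

# Hypothetical record shape: a dict with a stable "id" and a "title".
Paper = dict
SearchFn = Callable[[str], Awaitable[list]]


async def fan_out(query: str, providers: dict) -> list:
    """Query every provider concurrently and merge results by stable id."""
    results = await asyncio.gather(
        *(fn(query) for fn in providers.values()),
        return_exceptions=True,  # one failing provider should not sink the rest
    )
    merged: dict = {}
    for batch in results:
        if isinstance(batch, BaseException):
            continue  # this sketch silently skips failed providers
        for paper in batch:
            merged.setdefault(paper["id"], paper)  # assumed rule: first occurrence wins
    return list(merged.values())


# Fake providers standing in for real network calls (placeholder titles).
async def fake_arxiv(query: str) -> list:
    return [{"id": "arxiv:2005.11401", "title": "first copy"}]


async def fake_openalex(query: str) -> list:
    return [
        {"id": "arxiv:2005.11401", "title": "duplicate copy"},
        {"id": "openalex:W2741809807", "title": "other record"},
    ]


async def broken(query: str) -> list:
    raise RuntimeError("upstream unavailable")


def demo() -> list:
    providers = {"arxiv": fake_arxiv, "openalex": fake_openalex, "broken": broken}
    return asyncio.run(fan_out("rag", providers))
```

Keying the merge on the stable id keeps deduplication independent of which provider returned a record first; a real implementation would likely also reconcile fields across duplicates.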
Supported Providers
Current provider ids include:
- arxiv
- acl
- crossref
- openalex
- dblp
- openaire
- pubmed
- europepmc
- pmc
- biorxiv
- medrxiv
- zenodo
- hal
- semantic_scholar
- core
- doaj
- unpaywall
- iacr
- citeseerx
- base
- ssrn
- google_scholar
- scihub
- ieee
- acm
Run this to see the exact capability levels in your install:
paperhub-cli providers
Capability levels are defined in code as values such as full, info_only, oa_only, best_effort, unsupported, and skeleton.
| Platform | Search | Download | Read | Notes |
|---|---|---|---|---|
| arXiv | ✅ | ✅ | ✅ | Open API; reliable |
| PubMed | ✅ | ❌ | ⚠️ info-only | Open API; reliable |
| bioRxiv | ✅ | ✅ | ✅ | Open API; reliable |
| medRxiv | ✅ | ✅ | ✅ | Open API; reliable |
| Google Scholar | ⚠️ | ❌ | ❌ | Bot-detection active; optional PAPERHUB_GOOGLE_SCHOLAR_PROXY_URL |
| IACR | ✅ | ✅ | ✅ | Open API; reliable |
| Semantic Scholar | ✅ | ✅ (OA) | ✅ (OA) | Works without key (rate-limited); key improves limits; key rejection (403) retried automatically without key |
| Crossref | ✅ | ❌ | ⚠️ info-only | Open API; reliable |
| OpenAlex | ✅ | ❌ | ⚠️ info-only | Open API; reliable |
| PMC | ✅ | ✅ (OA only) | ✅ (OA only) | OA PDFs only; direct download may be blocked by some proxy environments |
| CORE | ✅ | ✅ (record-dependent) | ✅ (record-dependent) | Free key recommended; connector retries with backoff and falls back to key-less on 401/403 |
| Europe PMC | ✅ | ✅ (OA) | ✅ (OA) | OA PDFs only; direct download may be blocked by some proxy environments |
| dblp | ✅ | ❌ | ⚠️ info-only | Open API; reliable |
| OpenAIRE | ✅ | ❌ | ❌ | Open API; retries 3× with escalating request profiles on transient 403 |
| CiteSeerX | ⚠️ | ✅ (record-dependent) | ⚠️ | API endpoint intermittently unavailable / redirects to web archive |
| DOAJ | ✅ | ⚠️ (URL-dependent) | ⚠️ (URL-dependent) | PDF availability varies by article; free key raises rate limits |
| BASE | ⚠️ | ✅ (record-dependent) | ✅ (record-dependent) | OAI-PMH endpoint requires institutional IP registration; returns empty gracefully otherwise |
| Zenodo | ✅ | ✅ (record-dependent) | ✅ (record-dependent) | Open API; reliable |
| HAL | ✅ | ✅ (record-dependent) | ✅ (record-dependent) | Open API; reliable |
| SSRN | ⚠️ | ⚠️ best-effort | ⚠️ best-effort | 403 bot-detection active; public PDF only |
| Unpaywall | ✅ (DOI lookup) | ❌ | ❌ | Requires PAPERHUB_UNPAYWALL_EMAIL |
| Sci-Hub (optional) | ⚠️ fallback-only | ✅ | ❌ | Optional; unstable mirrors; user responsibility |
| IEEE Xplore 🔑 | 🚧 skeleton | 🚧 skeleton | 🚧 skeleton | Requires PAPERHUB_IEEE_API_KEY to activate |
| ACM DL 🔑 | 🚧 skeleton | 🚧 skeleton | 🚧 skeleton | Requires PAPERHUB_ACM_API_KEY to activate |
✅ = reliable in live tests. ⚠️ = works but subject to upstream instability or access restrictions. ❌ = not supported. 🔑 = key required. 🚧 = skeleton only.
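One way to picture the capability levels is as an enum that callers check before attempting an operation. The sketch below reuses the level names from this README, but the `Capability` class, the `ATTEMPTABLE` set, and the policy of which levels justify a download attempt are assumptions for illustration, not the project's actual definitions.

```python
from enum import Enum


class Capability(str, Enum):
    """Capability level names as listed in this README (hypothetical class)."""

    FULL = "full"
    INFO_ONLY = "info_only"
    OA_ONLY = "oa_only"
    BEST_EFFORT = "best_effort"
    UNSUPPORTED = "unsupported"
    SKELETON = "skeleton"


# Assumed policy: only these levels make a download attempt worthwhile.
ATTEMPTABLE = {Capability.FULL, Capability.OA_ONLY, Capability.BEST_EFFORT}


def can_attempt_download(level: str) -> bool:
    """Return True if a provider at this level might yield a PDF."""
    return Capability(level) in ATTEMPTABLE  # raises ValueError on unknown levels
```

For the authoritative levels per provider, the output of paperhub-cli providers remains the source of truth.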
Installation
Install from PyPI
Once published, end users can install it with:
pip install paperhub-cli
Local editable install
python -m venv .venv
source .venv/bin/activate
pip install -e .
Install with dev dependencies
pip install -e ".[dev]"
Build and publish to PyPI
From the repository root:
python -m pip install --upgrade build twine
python -m build
python -m twine check dist/*
python -m twine upload dist/*
After upload succeeds, users can install it anywhere with:
pip install paperhub-cli
CLI Usage
Search
LLM-assisted planning:
paperhub-cli search "retrieval augmented generation for scientific QA"
Direct search without planner:
paperhub-cli search "vision transformers" --no-plan
Search specific providers:
paperhub-cli search "long context language models" \
  --no-plan \
  --sources arxiv,openalex,semantic_scholar
Restrict by year:
paperhub-cli search "biomedical relation extraction" \
  --no-plan \
  --sources pubmed,europepmc \
  --year-from 2022
Use recent-year mode:
paperhub-cli search "multimodal agents" --recent-years 3 --no-plan
Resume a previous planned run:
paperhub-cli search --resume <research_id>
Read
Read normalized metadata by stable id:
paperhub-cli read --id arxiv:2005.11401
paperhub-cli read --id acl:2023.acl-long.1
paperhub-cli read --id doi:10.1145/nnnnnnn.nnnnnnn
paperhub-cli read --id openalex:W2741809807
Try full-text PDF extraction when possible:
paperhub-cli read --id arxiv:2005.11401 --full
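The stable ids used in the commands above follow a prefix:value shape. A minimal sketch of splitting such an id is shown below; `parse_stable_id` and `KNOWN_PREFIXES` are hypothetical names, and the project's own id parsing in paperhub_cli/providers/ may normalize ids differently.

```python
# Prefixes that appear in this README's examples; the real registry may know more.
KNOWN_PREFIXES = {"arxiv", "acl", "doi", "openalex"}


def parse_stable_id(stable_id: str) -> tuple:
    """Split 'prefix:value' into (prefix, value), validating the prefix."""
    # partition() splits on the first colon only, so values that themselves
    # contain colons stay intact.
    prefix, sep, local = stable_id.partition(":")
    prefix = prefix.strip().lower()
    local = local.strip()
    if not sep or not local:
        raise ValueError(f"expected 'prefix:value', got {stable_id!r}")
    if prefix not in KNOWN_PREFIXES:
        raise ValueError(f"unknown id prefix {prefix!r}")
    return prefix, local
```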
Download
Download a paper PDF to the current directory:
paperhub-cli download --id arxiv:2005.11401
Export as plain text or Markdown instead of PDF:
paperhub-cli download --id arxiv:2005.11401 --format txt
paperhub-cli download --id arxiv:2005.11401 --format md
Choose a destination:
paperhub-cli download --id doi:10.1000/182 --dest papers/
Important Flags
- --sources: comma-separated provider ids for direct backend selection
- --source: legacy arxiv / acl / both selector used when --sources is not set
- --no-plan: bypass LLM planning and run one direct search
- --depth: planner depth for decomposed research runs
- --top-k: maximum number of papers per query or subtopic
- --recent-years: convenience filter for recent work
- --verbose: print LLM diagnostics and INFO logs
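The precedence between --sources and the legacy --source selector can be illustrated with a small helper. This is a sketch under assumptions: `resolve_sources` is a hypothetical function, the expansion of both to arxiv plus acl is inferred from the selector's name, and what happens when neither flag is given is left to the caller because this README does not specify it.

```python
from typing import Optional


def resolve_sources(sources: Optional[str], legacy_source: Optional[str]) -> list:
    """Return provider ids, preferring --sources over the legacy --source."""
    if sources:
        seen: list = []
        for item in sources.split(","):
            item = item.strip().lower()
            if item and item not in seen:  # drop blanks and repeats, keep order
                seen.append(item)
        return seen
    if legacy_source == "both":
        return ["arxiv", "acl"]  # assumed meaning of "both"
    if legacy_source in ("arxiv", "acl"):
        return [legacy_source]
    return []  # neither flag set: the default is not documented here
```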
Environment Variables
The project uses PAPERHUB_* environment variables for provider-specific settings.
General
- PAPERHUB_HTTP_USER_AGENT: override the default HTTP user agent
- PAPERHUB_LL_DIAG: enable LLM diagnostics output
Provider-specific
- PAPERHUB_CROSSREF_MAILTO: contact email sent in Crossref requests
- PAPERHUB_OPENALEX_EMAIL: email used for polite OpenAlex identification
- PAPERHUB_UNPAYWALL_EMAIL: required for Unpaywall DOI lookups
- PAPERHUB_SEMANTIC_SCHOLAR_API_KEY: optional Semantic Scholar API key
- PAPERHUB_CORE_API_KEY: optional CORE API key
- PAPERHUB_DOAJ_API_KEY: optional DOAJ API key
- PAPERHUB_ZENODO_ACCESS_TOKEN: optional Zenodo token
- PAPERHUB_GOOGLE_SCHOLAR_PROXY_URL: optional proxy endpoint for fragile Scholar access
- PAPERHUB_SCIHUB_ENABLED=1: explicit opt-in gate for the Sci-Hub stub
- PAPERHUB_IEEE_API_KEY: future IEEE integration gate
- PAPERHUB_ACM_API_KEY: future ACM integration gate
Example:
export PAPERHUB_UNPAYWALL_EMAIL="you@example.com"
export PAPERHUB_SEMANTIC_SCHOLAR_API_KEY="..."
paperhub-cli search "agentic retrieval" --no-plan --sources semantic_scholar,openalex
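For readers scripting around the CLI, the distinction between required, optional, and opt-in variables might look like this in Python. The helper names are hypothetical and the treatment of empty values as unset is an assumption; only the variable names and their required/optional status come from the list above.

```python
import os
from typing import Mapping, Optional


def unpaywall_email(env: Mapping[str, str] = os.environ) -> str:
    """Unpaywall lookups need a contact email, so fail loudly when it is missing."""
    email = env.get("PAPERHUB_UNPAYWALL_EMAIL", "").strip()
    if not email:
        raise RuntimeError("PAPERHUB_UNPAYWALL_EMAIL is required for Unpaywall DOI lookups")
    return email


def optional_key(name: str, env: Mapping[str, str] = os.environ) -> Optional[str]:
    """Optional keys return None when unset or empty (assumed convention)."""
    return env.get(name, "").strip() or None


def scihub_enabled(env: Mapping[str, str] = os.environ) -> bool:
    """Opt-in gate: only the explicit value '1' enables the stub."""
    return env.get("PAPERHUB_SCIHUB_ENABLED") == "1"
```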
LLM Provider Configuration
Planning mode can run with:
- direct OpenAI-compatible APIs
- direct Gemini API
- a LiteLLM proxy that routes to many providers (OpenAI, Anthropic Claude, Vertex Gemini, Bedrock, and others)
Option A: OpenAI-compatible direct
export LLM_API_KEY="sk-..."
export LLM_MODEL="gpt-4o-mini"
# optional (defaults to https://api.openai.com/v1)
export LLM_HOST="https://api.openai.com/v1"
You can also use OPENAI_API_KEY instead of LLM_API_KEY.
Option B: Gemini direct
export LLM_PROVIDER="gemini"
export GEMINI_API_KEY="..."
export LLM_MODEL="gemini-2.0-flash"
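The two options above could be resolved from the environment along these lines. This is a sketch, not the project's loader: `LLMConfig` and `resolve_llm_config` are hypothetical names, treating an unset LLM_PROVIDER as "openai" is an assumption, and so is preferring LLM_API_KEY over OPENAI_API_KEY when both are set.

```python
from dataclasses import dataclass
from typing import Mapping, Optional

DEFAULT_OPENAI_HOST = "https://api.openai.com/v1"  # default stated in this README


@dataclass(frozen=True)
class LLMConfig:  # hypothetical container, not a paperhub-cli class
    provider: str
    api_key: str
    model: Optional[str]
    host: Optional[str]


def resolve_llm_config(env: Mapping[str, str]) -> LLMConfig:
    """Pick provider, key, model, and host from environment-style settings."""
    provider = env.get("LLM_PROVIDER", "openai").strip().lower()  # assumed default
    if provider == "gemini":
        api_key = env.get("GEMINI_API_KEY", "")
        host = None  # no host override is documented for Gemini
    else:
        # Assumed precedence: LLM_API_KEY first, then OPENAI_API_KEY.
        api_key = env.get("LLM_API_KEY") or env.get("OPENAI_API_KEY", "")
        host = env.get("LLM_HOST") or DEFAULT_OPENAI_HOST
    if not api_key:
        raise RuntimeError(f"no API key configured for provider {provider!r}")
    return LLMConfig(provider, api_key, env.get("LLM_MODEL"), host)
```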
Architecture
The codebase centers around a normalized Paper record and an async provider layer.
- paperhub_cli/models.py: Paper, SearchFilters, and known Source values
- paperhub_cli/providers/: provider implementations, capability metadata, id parsing, merging, and registry
- paperhub_cli/search/orchestrator.py: direct multi-provider search orchestration
- paperhub_cli/tools/__init__.py: planner-facing tool registry built from the same provider layer
- paperhub_cli/planner/agents/: LLM rephrase + decomposition workflow
- paperhub_cli/reader/fetcher.py: provider-aware metadata fetch and PDF text extraction
Planner Tool Hints
Planner execution uses grouped tool hints backed by the same provider registry. Examples include:
- multi_default
- open_metadata
- biomedical
- preprints_wide
- broad_scholarly
This keeps direct CLI search and planner-guided search aligned.
Testing
Run the test suite with:
PYTHONPATH=. pytest -q
The test suite includes:
- unit tests for provider resolution and reader id normalization
- fixture-based parsing tests for provider payload normalization
- planner/tool registry coverage
MCP Server Notes
This repository currently provides the provider layer and tool abstractions needed for an MCP server, but it does not yet ship a dedicated MCP server module.
If you want to expose it over MCP, the intended pattern is:
- create a thin MCP adapter around the existing provider registry and reader entrypoints
- expose stable tools such as search_papers, read_paper, download_paper, and list_providers
- return JSON-serializable Paper.to_dict() payloads
- keep MCP tool definitions generic and pass provider ids as parameters instead of creating one MCP tool per provider
Example MCP tools to expose:
- search_papers: search across one or more providers
- read_paper: fetch normalized metadata by stable id
- download_paper: download or export a paper as pdf, txt, or md
- list_providers: return provider capability metadata
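The "generic tools, provider ids as parameters" idea can be sketched without any MCP library as a plain dispatch table. Everything here is hypothetical: the handler bodies return placeholder data rather than calling the provider registry, and `call_tool` only imitates the shape of a tool call, it is not an MCP SDK API.

```python
import json
from typing import Any, Callable


def search_papers(query: str, sources: list) -> list:
    # Placeholder: a real adapter would call the provider registry here.
    return [{"id": f"{s}:example", "source": s, "query": query} for s in sources]


def list_providers() -> list:
    # Placeholder capability metadata for a single provider.
    return [{"id": "arxiv", "search": "full", "download": "full", "read": "full"}]


# One generic tool per operation; providers are passed as arguments,
# so adding a provider does not add a tool.
TOOLS: dict = {
    "search_papers": search_papers,
    "list_providers": list_providers,
}


def call_tool(name: str, arguments: dict) -> str:
    """Dispatch a tool call by name and return a JSON string payload."""
    handler: Callable[..., Any] = TOOLS[name]  # KeyError for unknown tools
    return json.dumps(handler(**arguments))
```

Returning JSON strings from a single dispatch point makes it easy to verify that every payload is serializable before wiring the handlers into an actual server.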
Example Claude Desktop configuration, assuming an adapter module has been added at paperhub_cli.mcp.server:
{
  "mcpServers": {
    "paperhub": {
      "command": "python",
      "args": ["-m", "paperhub_cli.mcp.server"],
      "env": {
        "PAPERHUB_UNPAYWALL_EMAIL": "you@example.com",
        "PAPERHUB_SEMANTIC_SCHOLAR_API_KEY": "your-api-key"
      }
    }
  }
}
If you prefer a CLI-style entrypoint, the intended UX would be similar to:
{
  "mcpServers": {
    "paperhub": {
      "command": "paperhub-cli",
      "args": ["mcp"]
    }
  }
}
Caveats
- Some providers are best-effort or metadata-only.
- Read/download support depends on OA links or provider capabilities.
- Fragile sources such as Scholar-style scraping and Sci-Hub-style flows should remain opt-in.
- Upstream APIs and HTML layouts may change and require parser maintenance.
License / Attribution
This project references provider behavior and capability ideas similar to paper-search-mcp, but implements native async provider support directly in this repository rather than depending on that package.