CLI to search, download, and convert academic papers (arXiv, OpenAlex, DBLP, Crossref, Hugging Face Papers, Semantic Scholar, CORE) into Markdown — built for AI/ML researchers.

paperhound

paperhound — sniff out academic papers from the command line.

A small, fast CLI for AI/ML researchers who want a single tool to search, inspect, and download papers from many academic sources at once, and convert them to Markdown. Conversion is powered by docling, so the resulting Markdown is good enough to feed straight into an LLM context.

Features

  • 🔎 Unified search — one query, many backends. arXiv, OpenAlex, DBLP, Crossref and Hugging Face Papers (and optionally Semantic Scholar / CORE) are queried in parallel with a 10-second budget. Results are merged round-robin (one from each provider, then the next, …) so a fast provider can't monopolize the top-N — and deduplicated by arXiv id / DOI / title. Slow providers are dropped silently — the CLI returns whatever came back in time.
  • 📄 Inspect before downloading: paperhound show <id> prints the abstract and metadata so you can decide if it's worth a download.
  • ⬇️ Download by identifier — arXiv id, DOI, Semantic Scholar paper id, or any paper URL. Open-access PDFs are resolved automatically.
  • 📝 PDF → Markdown via docling: paperhound convert paper.pdf, or paperhound get <id> for the full pipeline.
  • 📚 Local library: paperhound add <id> stores metadata in a SQLite FTS5 database at ~/.paperhound/library/. paperhound list shows all saved papers; paperhound grep <query> does offline full-text search over titles, abstracts, and stored Markdown bodies; paperhound rm <id> removes an entry.
  • 🔌 MCP server: paperhound mcp exposes all tools over stdio so Claude Code and other MCP-compatible agents can call paperhound directly without a skill shim. Install the optional extra: pip install 'paperhound[mcp]'.
  • 🤖 Agent-ready — ships with a SKILL.md and JSON output mode so any Claude / OpenAI / local agent can drive the CLI.
  • 🧪 Heavily tested: every module has unit tests that never touch the network; live integration tests against the real provider APIs run as a separate make target.
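
The round-robin merge and deduplication described under "Unified search" can be sketched in a few lines of Python. This is an illustration only, not paperhound's actual code: the function names, the dict-based paper records, and the title normalisation rule are all assumptions made for the example.

```python
import re
from itertools import zip_longest


def dedupe_keys(paper: dict) -> set[str]:
    """Collect every key a paper can be matched on: arXiv id, DOI, title."""
    ids = paper.get("identifiers") or {}
    keys = set()
    if ids.get("arxiv_id"):
        keys.add("arxiv:" + ids["arxiv_id"].lower())
    if ids.get("doi"):
        keys.add("doi:" + ids["doi"].lower())
    # Assumption: titles match after lowercasing and dropping punctuation.
    title = re.sub(r"[^a-z0-9]+", " ", (paper.get("title") or "").lower()).strip()
    if title:
        keys.add("title:" + title)
    return keys


def merge_round_robin(results_by_provider: dict[str, list[dict]], limit: int) -> list[dict]:
    """Take one result from each provider in turn, skipping duplicates."""
    merged: list[dict] = []
    seen: set[str] = set()
    for row in zip_longest(*results_by_provider.values()):
        for paper in row:
            if paper is None:  # this provider has run out of results
                continue
            keys = dedupe_keys(paper)
            if keys & seen:  # already returned by an earlier provider
                continue
            seen |= keys
            merged.append(paper)
            if len(merged) == limit:
                return merged
    return merged
```

Because every provider contributes one paper per round, a provider that answers quickly with 50 results still gets only one slot per round in the final top-N.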

Installation

pip install paperhound

or with uv:

uv tool install paperhound

Python 3.10+ is required. Docling depends on PyTorch and downloads its model weights on first use, so the very first conversion may take a while.

Quick start

# Search across all providers
paperhound search "diffusion transformers" --limit 5

# Show the abstract for a specific paper
paperhound show 2401.12345
paperhound show 10.1038/s41586-020-2649-2          # DOI works too
paperhound show https://arxiv.org/abs/1706.03762   # ...and URLs

# Download the PDF
paperhound download 1706.03762 -o ./papers/

# Convert a local PDF to Markdown
paperhound convert ./papers/1706.03762.pdf -o attention.md

# Or do it all at once: search-resolve, download, convert, clean up
paperhound get 1706.03762 -o attention.md

JSON output for scripts and agents

--json is the pipe-friendly mode: no headers, no Rich formatting, no progress bars.

# search --json: JSONL — one compact JSON object per line (Paper schema)
paperhound search "graph neural networks" --json | jq '.title'

# show --json: single compact JSON object on one line
paperhound show 1706.03762 --json | jq .abstract

The schema is paperhound.models.Paper serialised via model_dump(mode="json"). Fields: title, authors[], abstract, year, venue, url, pdf_url, citation_count, identifiers.{arxiv_id, doi, semantic_scholar_id, openalex_id, dblp_key, core_id}, sources[].

--json and --format are mutually exclusive on show — use one or the other.
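
For scripts that prefer Python over jq, the JSONL stream can be consumed one line at a time. The sketch below is a minimal example under stated assumptions: the helper name iter_papers is made up, and the sample records are illustrative stand-ins that only use field names from the schema above.

```python
import json
from typing import Iterable, Iterator


def iter_papers(lines: Iterable[str]) -> Iterator[dict]:
    """Yield one dict per non-empty JSONL line."""
    for line in lines:
        line = line.strip()
        if line:
            yield json.loads(line)


# Illustrative records only; real ones come from `paperhound search ... --json`.
sample = [
    '{"title": "Paper A", "year": 2023, "identifiers": {"arxiv_id": "2301.00001", "doi": null}}',
    "",
    '{"title": "Paper B", "year": null, "identifiers": {"arxiv_id": null, "doi": "10.1000/example"}}',
]
papers = list(iter_papers(sample))
titles = [p["title"] for p in papers]
```

In a real pipeline you would pass sys.stdin instead of the sample list and pipe the CLI output into the script.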

Commands

| Command | Description |
| --- | --- |
| paperhound search <query> | Run a unified search. Flags: --limit, --source (repeatable; one of arxiv, openalex, dblp, crossref, huggingface, semantic_scholar, core), --year RANGE, --min-citations N, --venue STRING, --author STRING, --timeout, --json (JSONL output), --rerank (embedding rerank; see below), --rerank-model. |
| paperhound show <id> | Fetch a paper's metadata + abstract. Flags: --format (markdown, bibtex, ris, or csljson; default markdown), --json (compact JSON; mutually exclusive with --format). |
| paperhound download <id> -o <path> | Download a paper PDF. |
| paperhound convert <pdf> -o <md> | Convert a PDF (or any docling-supported file/URL) to Markdown. |
| paperhound get <id> -o <md> | Download + convert in one step. --keep-pdf to keep the PDF. |
| paperhound refs <id> | List works the paper cites (its references). Flags: --depth (1 or 2), --limit N, --source (openalex or semantic_scholar), --json. |
| paperhound cited-by <id> | List works that cite the paper. Same flags as refs. |
| paperhound add <id> | Fetch metadata and add to local library. --convert also stores Markdown. |
| paperhound list | List all papers in the local library. |
| paperhound grep <query> | Full-text search the local library (title + abstract + Markdown body). |
| paperhound rm <id> | Remove a paper from the local library (and its Markdown file, if any). |
| paperhound mcp | Start an MCP server over stdio exposing all tools. Requires pip install 'paperhound[mcp]'. |
| paperhound version | Print the installed version. |

Run paperhound <command> --help for full options.

Filters

paperhound search accepts four filter flags. Filters are pushed down to providers that support them (OpenAlex, Crossref, Semantic Scholar) and always applied client-side after the merge as a safety net.

| Flag | Accepted values | Example |
| --- | --- | --- |
| --year RANGE | YYYY, YYYY-YYYY, YYYY-, -YYYY | --year 2022-2024 |
| --min-citations N | integer ≥ 0 | --min-citations 100 |
| --venue STRING | case-insensitive substring | --venue NeurIPS |
| --author STRING | case-insensitive substring | --author Hinton |

--year is the preferred way to filter by year. It accepts a single year (2023), an inclusive range (2023-2026), open-ended from (2023-), or open-ended up-to (-2026). The older --year-min / --year-max flags are still accepted.
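
The four accepted --year shapes map naturally onto a pair of optional bounds. The parser below is a sketch of that mapping, not paperhound's own implementation; the function name parse_year_range and its error messages are assumptions.

```python
from typing import Optional, Tuple


def _year(text: str) -> int:
    """Parse exactly four ASCII digits, or raise ValueError."""
    if len(text) != 4 or not (text.isascii() and text.isdigit()):
        raise ValueError(f"not a four-digit year: {text!r}")
    return int(text)


def parse_year_range(value: str) -> Tuple[Optional[int], Optional[int]]:
    """Return (year_min, year_max); None means that side is open-ended."""
    text = value.strip()
    if "-" not in text:
        year = _year(text)  # single year: both bounds are the same
        return (year, year)
    start_text, _, end_text = text.partition("-")
    if not start_text and not end_text:
        raise ValueError("a bare '-' is not a year range")
    start = _year(start_text) if start_text else None
    end = _year(end_text) if end_text else None
    if start is not None and end is not None and start > end:
        raise ValueError(f"range start {start} is after end {end}")
    return (start, end)
```

Both bounds are inclusive, matching the description above, so "2023-2026" keeps papers from 2023 and from 2026.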

# Papers from 2022 to 2024 with at least 100 citations
paperhound search "vision transformers" --year 2022-2024 --min-citations 100

# NeurIPS papers by Hinton
paperhound search "deep learning" --venue NeurIPS --author Hinton

# arXiv-only papers from 2023 onwards
paperhound search "diffusion models" -s arxiv --year 2023-

# Combine filters and JSON output
paperhound search "llm alignment" --year 2023 --min-citations 50 --json | jq .title

Behavior with missing fields: papers whose year or venue field is unknown (null) are kept — the filter cannot be verified. Papers whose citation_count is unknown are excluded when --min-citations is set (conservative: the user asked for a floor).
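
The missing-field rules can be expressed as a small predicate. The sketch below covers the year, venue, and citation filters only, and assumes papers are plain dicts with the field names from the JSON schema; the function name passes_filters is hypothetical.

```python
from typing import Optional


def passes_filters(
    paper: dict,
    year_min: Optional[int] = None,
    year_max: Optional[int] = None,
    min_citations: Optional[int] = None,
    venue: Optional[str] = None,
) -> bool:
    """Client-side filter mirroring the missing-field rules described above."""
    year = paper.get("year")
    if year is not None:  # unknown year: keep, the filter cannot be verified
        if year_min is not None and year < year_min:
            return False
        if year_max is not None and year > year_max:
            return False
    if min_citations is not None:
        count = paper.get("citation_count")
        # Unknown citation count: exclude, since the user asked for a floor.
        if count is None or count < min_citations:
            return False
    if venue is not None:
        paper_venue = paper.get("venue")
        # Unknown venue: keep; otherwise case-insensitive substring match.
        if paper_venue is not None and venue.lower() not in paper_venue.lower():
            return False
    return True
```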

Export formats

paperhound show can export a paper's metadata in four formats:

# Rich terminal view (default)
paperhound show 1706.03762

# BibTeX — paste into your .bib file
paperhound show 1706.03762 --format bibtex

# RIS — compatible with Zotero, Mendeley, EndNote
paperhound show 1706.03762 --format ris

# CSL-JSON — machine-readable, compatible with Pandoc and citation processors
paperhound show 1706.03762 --format csljson

BibTeX cite keys are derived deterministically as <firstAuthorLastName><year><firstSignificantTitleWord> (accents stripped, lowercased). LaTeX special characters (&, %, $, _, etc.) are escaped automatically.
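
As a rough illustration of the cite-key rule, the sketch below builds <firstAuthorLastName><year><firstSignificantTitleWord> with accents stripped and everything lowercased. It is not the library's code: the stopword list that decides what counts as a "significant" word, the fallback name "anon", and the set of escaped characters are assumptions.

```python
import re
import unicodedata

# Assumption: a small illustrative stopword list for "significant" title words.
_STOPWORDS = {"a", "an", "the", "on", "of", "in", "for", "and", "to", "with"}


def _strip_accents(text: str) -> str:
    """Decompose accented characters and drop the combining marks."""
    decomposed = unicodedata.normalize("NFKD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def _slug(text: str) -> str:
    """Lowercase and keep only ASCII letters and digits."""
    return re.sub(r"[^a-z0-9]", "", _strip_accents(text).lower())


def cite_key(first_author: str, year: int, title: str) -> str:
    parts = first_author.split()
    last_name = _slug(parts[-1]) if parts else "anon"
    words = re.findall(r"[A-Za-z0-9]+", _strip_accents(title))
    first_word = next((w.lower() for w in words if w.lower() not in _STOPWORDS), "")
    return f"{last_name}{year}{first_word}"


def escape_latex(text: str) -> str:
    """Backslash-escape the special characters named above (&, %, $, _)."""
    return re.sub(r"([&%$_])", r"\\\1", text)
```

Because the key depends only on the author, year, and title, exporting the same paper twice yields the same key, which is what makes it safe to paste into an existing .bib file.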

Local library

paperhound keeps a persistent per-user library at ~/.paperhound/library/ (override with PAPERHOUND_LIBRARY_DIR). The library is backed by a SQLite FTS5 database — no extra dependencies required.

# Add a paper (metadata only)
paperhound add 1706.03762

# Add and also save the Markdown version of the PDF
paperhound add 1706.03762 --convert

# List all saved papers
paperhound list

# Full-text search offline
paperhound grep "attention mechanism"

# Remove a paper (and its Markdown file, if any)
paperhound rm 1706.03762

Re-adding a paper is idempotent — it updates the metadata in place. The schema is versioned; on a version mismatch paperhound reports a clear error rather than silently operating on a stale schema.
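
To give a feel for what an FTS5-backed library looks like, here is a minimal sketch using Python's built-in sqlite3 module. The table layout, column names, and helper functions are assumptions for illustration and are not paperhound's actual schema; it also assumes your SQLite build includes FTS5, which is the case for most Python distributions.

```python
import sqlite3


def open_library(path: str = ":memory:") -> sqlite3.Connection:
    """Open (or create) a library with one FTS5 table for searchable text."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE VIRTUAL TABLE IF NOT EXISTS papers "
        "USING fts5(paper_id UNINDEXED, title, abstract, body)"
    )
    return conn


def add_paper(conn, paper_id: str, title: str, abstract: str, body: str = "") -> None:
    """Idempotent add: replace any existing row for the same paper id."""
    with conn:  # commit both statements as one transaction
        conn.execute("DELETE FROM papers WHERE paper_id = ?", (paper_id,))
        conn.execute(
            "INSERT INTO papers (paper_id, title, abstract, body) VALUES (?, ?, ?, ?)",
            (paper_id, title, abstract, body),
        )


def grep(conn, query: str) -> list[str]:
    """Full-text search over title, abstract, and body; best matches first."""
    rows = conn.execute(
        "SELECT paper_id FROM papers WHERE papers MATCH ? ORDER BY rank", (query,)
    )
    return [row[0] for row in rows]
```

The delete-then-insert pair is one simple way to get the "re-adding updates in place" behaviour, since FTS5 virtual tables do not support unique constraints.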

Citation graph

Traverse the citation graph around any paper using its arXiv id, DOI, or Semantic Scholar id.

# Papers that "Attention Is All You Need" cites
paperhound refs 1706.03762

# Papers that cite it
paperhound cited-by 1706.03762

# Go two hops deep (refs of refs / cites of cites), limit to 50 unique papers
paperhound refs 1706.03762 --depth 2 --limit 50

# Force a specific provider
paperhound cited-by 1706.03762 --source semantic_scholar

# JSON output for scripting
paperhound refs 1706.03762 --json | jq '.[].title'

Both commands return the same Paper format as search. The default provider order is OpenAlex first, Semantic Scholar as fallback (automatically triggered when OpenAlex returns nothing or errors). Results are deduplicated by arXiv id / DOI / title before being returned. At --depth 2, total fetched is capped at limit * 2 and a small pause (0.1 s) is inserted between hops to stay in the polite API pool.
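
One way to read the --depth 2 behaviour is as a breadth-first walk with a fetch budget. The sketch below is an interpretation under stated assumptions, not the tool's real traversal: fetch is a hypothetical callable standing in for a provider lookup, papers are represented by bare id strings, and the budget counts every fetched neighbour, duplicates included.

```python
import time
from typing import Callable, Iterable


def traverse(
    root: str,
    fetch: Callable[[str], Iterable[str]],
    depth: int = 1,
    limit: int = 25,
    pause: float = 0.1,
    sleep: Callable[[float], None] = time.sleep,
) -> list[str]:
    """Walk up to `depth` hops from `root`, returning at most `limit` unique ids."""
    cap = limit * 2 if depth >= 2 else limit  # fetch budget, as described above
    seen = {root}
    unique: list[str] = []
    frontier = [root]
    fetched = 0
    for hop in range(depth):
        if hop > 0:
            sleep(pause)  # short pause between hops to stay polite
        next_frontier = []
        for paper_id in frontier:
            for neighbor in fetch(paper_id):
                if fetched >= cap:
                    return unique[:limit]
                fetched += 1
                if neighbor not in seen:
                    seen.add(neighbor)
                    unique.append(neighbor)
                    next_frontier.append(neighbor)
        frontier = next_frontier
    return unique[:limit]
```

Passing sleep as a parameter keeps the sketch testable without real delays.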

Rerank

Add --rerank to any paperhound search call to re-sort results by embedding similarity between the query and each candidate's title + abstract. This can surface more relevant papers when the round-robin merge returns noisier results from some providers.

# Rerank using the default model (all-MiniLM-L6-v2, ~22 MB)
paperhound search "vision language models" --rerank

# Use a different SentenceTransformer model
paperhound search "graph neural networks" --rerank \
  --rerank-model sentence-transformers/all-mpnet-base-v2

Installation

pip install 'paperhound[rerank]'

--rerank exits with a clear error if sentence-transformers is not installed.

How it works

  1. The aggregator fetches up to limit * 3 candidates (capped at 50).
  2. Each candidate's text (title + abstract) is embedded alongside the query using the chosen SentenceTransformer model (cached per process).
  3. Candidates are sorted by cosine similarity (descending).
  4. Papers with neither a title nor an abstract keep their merge-order rank and are placed at the end.
  5. The top --limit results are returned.
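
The steps above can be sketched without any ML dependency by passing the embedding function in as a parameter. This is an illustration, not paperhound's code: rerank, candidate_pool_size, and the embed callable are hypothetical names, and in the real tool the embeddings come from a SentenceTransformer model rather than a toy function.

```python
import math
from typing import Callable, Sequence


def candidate_pool_size(limit: int) -> int:
    """Step 1: fetch up to limit * 3 candidates, capped at 50."""
    return min(limit * 3, 50)


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def rerank(
    query: str,
    candidates: list[dict],
    embed: Callable[[str], Sequence[float]],
    limit: int,
) -> list[dict]:
    """Sort candidates by similarity to the query; text-less papers go last."""
    query_vec = embed(query)
    scored, unscored = [], []
    for paper in candidates:
        parts = [paper.get("title"), paper.get("abstract")]
        text = " ".join(p for p in parts if p)
        if text:
            scored.append((cosine(query_vec, embed(text)), paper))
        else:
            unscored.append(paper)  # keeps its original merge order
    # list.sort is stable, so ties keep their merge order too.
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return ([paper for _, paper in scored] + unscored)[:limit]
```

Any function that maps text to a fixed-length vector can be plugged in as embed, which is why swapping models with --rerank-model does not change the rest of the pipeline.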

MCP server

paperhound mcp starts an MCP (Model Context Protocol) server over stdio, exposing paperhound as callable tools to Claude Code and any other MCP-compatible agent.

Installation

pip install 'paperhound[mcp]'

Tools exposed

| Tool | Description |
| --- | --- |
| search(query, limit, sources) | Search papers across providers; returns list of paper records. |
| show(identifier) | Fetch metadata + abstract for a single paper. |
| download(identifier, dest) | Download a paper PDF; returns the path. |
| convert(identifier, dest) | Convert a PDF/URL to Markdown; returns path or inline Markdown. |
| library_add(identifier, convert) | Add a paper to the local library (optionally with Markdown). |
| library_list() | List all papers in the local library. |
| library_grep(query, limit) | Full-text search the local library; returns records with snippets. |

Wiring into Claude Code

Add the following to your Claude Code settings.json (~/.claude/settings.json or the project-level .claude/settings.json):

{
  "mcpServers": {
    "paperhound": {
      "command": "paperhound",
      "args": ["mcp"]
    }
  }
}

Or, if paperhound is installed in a virtual environment:

{
  "mcpServers": {
    "paperhound": {
      "command": "/path/to/venv/bin/paperhound",
      "args": ["mcp"]
    }
  }
}

After saving, restart Claude Code. The paperhound tools will appear in the available tool list and Claude can call them directly — no skill shim needed.

Identifier formats

paperhound accepts whatever you have on hand:

  • arXiv ids: 2401.12345, 2401.12345v3, cs.AI/0301001, arXiv:2401.12345
  • DOIs: 10.1038/s41586-020-2649-2, doi:10.1038/...
  • Semantic Scholar paper ids: 40-char hex
  • URLs: arxiv.org/abs/..., arxiv.org/pdf/..., doi.org/..., semanticscholar.org/paper/...
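
To show how these shapes can be told apart, here is a rough classifier built from regular expressions. The patterns are approximations written for this example and may be stricter or looser than paperhound's own resolver; the function name classify and its return labels are assumptions.

```python
import re

_ARXIV_NEW = re.compile(r"^\d{4}\.\d{4,5}(v\d+)?$")               # 2401.12345v3
_ARXIV_OLD = re.compile(r"^[a-z\-]+(\.[A-Z]{2})?/\d{7}(v\d+)?$")  # cs.AI/0301001
_DOI = re.compile(r"^10\.\d{4,9}/\S+$")
_S2_ID = re.compile(r"^[0-9a-f]{40}$")
_KNOWN_HOST = re.compile(
    r"^(www\.)?(arxiv\.org|doi\.org|semanticscholar\.org)/", re.IGNORECASE
)


def classify(identifier: str) -> str:
    """Return 'url', 'arxiv', 'doi', 'semantic_scholar', or 'unknown'."""
    text = identifier.strip()
    lowered = text.lower()
    if lowered.startswith(("http://", "https://")) or _KNOWN_HOST.match(text):
        return "url"
    # Drop the optional "arXiv:" / "doi:" prefixes before matching.
    if lowered.startswith("arxiv:"):
        text = text[len("arxiv:"):]
    elif lowered.startswith("doi:"):
        text = text[len("doi:"):]
    if _ARXIV_NEW.match(text) or _ARXIV_OLD.match(text):
        return "arxiv"
    if _DOI.match(text):
        return "doi"
    if _S2_ID.match(text.lower()):
        return "semantic_scholar"
    return "unknown"
```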

Configuration

| Env var | Purpose |
| --- | --- |
| OPENALEX_MAILTO | Optional. Adds your email to OpenAlex requests so they land in the polite pool (better rate limits). |
| CROSSREF_MAILTO | Optional. Same idea for Crossref's polite pool. |
| CORE_API_KEY | Required to enable the CORE provider. Without a key the provider reports unavailable and the aggregator skips it silently. Get a free key at https://core.ac.uk/services/api. |
| SEMANTIC_SCHOLAR_API_KEY | Optional. Semantic Scholar's anonymous quota is shared globally and 429s are common; the provider retries with exponential backoff. Set this to your own key for steadier throughput. |
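
Retrying with exponential backoff, as mentioned for Semantic Scholar, follows a simple pattern. The sketch below is generic and makes several assumptions: RateLimited is a hypothetical stand-in for an HTTP 429 response, and the retry count and base delay are example values, not the provider's actual settings.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


class RateLimited(Exception):
    """Hypothetical stand-in for an HTTP 429 response."""


def with_backoff(
    call: Callable[[], T],
    retries: int = 4,
    base_delay: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Run `call`, doubling the wait after each rate-limited attempt."""
    for attempt in range(retries + 1):
        try:
            return call()
        except RateLimited:
            if attempt == retries:
                raise  # out of retries: surface the error to the caller
            sleep(base_delay * 2 ** attempt)  # 1 s, 2 s, 4 s, ...
    raise AssertionError("unreachable")
```

With your own API key the 429s become rarer, so the loop usually succeeds on the first attempt and the delays never kick in.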

Adding a new provider

paperhound.search is a registry of provider factories. To add a new source:

  1. Create src/paperhound/search/<name>.py with a class subclassing SearchProvider. Declare its capabilities (TEXT_SEARCH, ID_LOOKUP, OPEN_ACCESS_PDF) and override available() if it needs an API key.
  2. Add unit tests in tests/unit/test_<name>.py that mock HTTP with respx.
  3. Register it in src/paperhound/search/__init__.py with one register("name", Factory) call. Done — the CLI picks it up automatically.
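
The registry idea behind these steps can be shown in miniature. The sketch below is a self-contained toy, not the real paperhound.search module: the names SearchProvider, register, and available() come from the steps above, but their signatures here, the active_providers helper, and both example providers are assumptions.

```python
from typing import Callable, Dict, List, Optional


class SearchProvider:
    """Toy base class; the real one in paperhound may look different."""

    name = "base"

    def available(self) -> bool:
        return True

    def search(self, query: str, limit: int) -> List[dict]:
        raise NotImplementedError


_REGISTRY: Dict[str, Callable[[], SearchProvider]] = {}


def register(name: str, factory: Callable[[], SearchProvider]) -> None:
    """Map a provider name to a zero-argument factory."""
    _REGISTRY[name] = factory


def active_providers() -> List[SearchProvider]:
    """Instantiate every registered provider and keep the available ones."""
    providers = [factory() for factory in _REGISTRY.values()]
    return [p for p in providers if p.available()]


class OpenProvider(SearchProvider):
    name = "open"

    def search(self, query: str, limit: int) -> List[dict]:
        return [{"title": f"Result for {query}"}][:limit]


class KeyedProvider(SearchProvider):
    name = "keyed"

    def __init__(self, api_key: Optional[str]) -> None:
        self.api_key = api_key

    def available(self) -> bool:
        return bool(self.api_key)  # skipped when no key is configured


register("open", OpenProvider)
register("keyed", lambda: KeyedProvider(api_key=None))
```

Storing factories rather than instances means a provider that needs an API key is only constructed when the registry is consulted, and one without a key can simply report itself as unavailable.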

Use it from agents

paperhound is designed to be driven by AI agents. The repo ships a ready-to-install skill at skills/paperhound/SKILL.md that documents every command, recommends the JSON output flag, and gives an end-to-end example.

Install it into Claude Code (or any skills.sh-compatible agent) with one command:

npx skills add alexfdez1010/paperhound

This uses the skills CLI to discover the SKILL.md under skills/paperhound/ and place it in your agent's skill directory (~/.claude/skills/paperhound/ for Claude Code). Pass -a <agent> to target a specific agent (e.g. -a claude-code, -a opencode).

Development

make install            # uv sync --extra dev
make test               # unit tests (network-free, respx-mocked)
make test-integration   # live API tests — always live, no env-var gate
make test-all           # unit + integration
make check              # lint + format check + unit tests (run before pushing)

Unit tests use respx to mock HTTP, so they never touch the network. Integration tests under tests/integration/ always hit the real provider APIs (arXiv, OpenAlex, DBLP, Crossref, Hugging Face Papers, Semantic Scholar) — no env-var gate, no mocks. The SemanticScholarProvider retries 429s with exponential backoff; export SEMANTIC_SCHOLAR_API_KEY only if you want faster runs.

Releasing to PyPI

  1. Bump version in pyproject.toml and paperhound/__init__.py.
  2. Tag the release: git tag v0.1.1 && git push --tags.
  3. The Publish to PyPI GitHub Action builds and publishes via PyPI Trusted Publishing — no API token required, just configure the trusted publisher once on PyPI.

License

MIT — see LICENSE.
