Skip to main content

CLI to search, download, and convert academic papers (arXiv, OpenAlex, DBLP, Crossref, Hugging Face Papers, Semantic Scholar, CORE) into Markdown — built for AI/ML researchers.

Project description

🐾 paperhound

Sniff out academic papers from the command line — and from your agents.

PyPI Python License

paperhound is a fast, agent-ready CLI for AI/ML researchers and tooling authors. One binary to search, inspect, download, and convert to Markdown academic papers from many sources at once — arXiv, OpenAlex, DBLP, Crossref, Hugging Face Papers, Semantic Scholar and CORE. Conversion is powered by docling, so the resulting Markdown is good enough to feed straight into an LLM context window.

✨ Features

  • 🔎 Unified search — one query, many backends in parallel under a 10-second budget. Results merged round-robin (no provider can monopolize the top-N) and deduplicated by arXiv id / DOI / title.
  • 📄 Inspect before downloading — abstract + metadata in one call.
  • ⬇️ Download by identifier — arXiv id, DOI, Semantic Scholar id, or any paper URL. Open-access PDFs are resolved automatically.
  • 📝 PDF → Markdown via docling — figures, LaTeX equations and HTML tables are opt-in flags.
  • 📚 Local library — SQLite FTS5 at ~/.paperhound/library/. Add a paper, store its Markdown, then offline-grep over titles, abstracts and bodies.
  • 🧠 Embedding rerank — install paperhound[rerank] and the CLI reranks search results by query/abstract similarity automatically.
  • 🤖 Agent-ready — every command speaks JSON via --json, and a skills.sh skill ships in this repo for one-line install.
  • 🔗 Citation graphrefs and cited-by walk the OpenAlex / Semantic Scholar graph with depth control.
  • 📤 Export — BibTeX, RIS, CSL-JSON straight from show.

🤖 Install the agent skill (recommended first step)

The fastest way to use paperhound is from an agent (Claude, OpenAI, opencode, …). Install the skill — it teaches the agent every command, flag, and JSON schema:

npx skills add alexfdez1010/paperhound

This places SKILL.md in your agent's skill directory (~/.claude/skills/paperhound/ for Claude Code). Pass -a <agent> to target a specific agent (e.g. -a claude-code, -a opencode).

The skill auto-installs the paperhound CLI on first use, so you don't need to install anything else manually.

📦 Install the CLI

# pip
pip install paperhound

# uv (isolated CLI on $PATH)
uv tool install paperhound

# uv — upgrade later
uv tool upgrade paperhound

# uv (as a library inside another project)
uv add paperhound

Optional embedding rerank:

pip install 'paperhound[rerank]'

Python 3.10+ is required.

🚀 CLI usage

Once installed, paperhound is on your $PATH.

# 🔎 Search across all providers
paperhound search "diffusion transformers" --limit 5

# 📄 Show abstract + metadata
paperhound show 2401.12345
paperhound show 10.1038/s41586-020-2649-2          # DOI
paperhound show https://arxiv.org/abs/1706.03762   # URL
paperhound show 2001.08361 -s arxiv                # force a single provider

# ⬇️ Download the PDF
paperhound download 1706.03762 -o ./papers/

# 📝 Convert a local PDF to Markdown
paperhound convert ./papers/1706.03762.pdf -o attention.md

# 🪄 Or do it all at once: resolve, download, convert, clean up
paperhound get 1706.03762 -o attention.md

📋 Commands

Command Description
paperhound search <query> Unified search. --limit, --source (repeatable), --year RANGE, --min-citations N, --venue STRING, --author STRING, --timeout, --json, --rerank/--no-rerank, --rerank-model.
paperhound show <id> Metadata + abstract. --source (-s, repeatable — restrict the lookup to avoid poisoned aggregator metadata), --format markdown|bibtex|ris|csljson, --json.
paperhound download <id> -o <path> Download a paper PDF.
paperhound convert <pdf> -o <md> Convert a PDF (or URL) to Markdown. --with-figures, --equations latex, --tables html.
paperhound get <id> -o <md> Download + convert in one step. --keep-pdf to retain the PDF.
paperhound refs <id> Works the paper cites. --depth, --limit, --source, --json.
paperhound cited-by <id> Works that cite the paper. Same flags as refs.
paperhound add <id> Add to local library. --convert also stores Markdown.
paperhound list List papers in the local library.
paperhound grep <query> Full-text search the local library.
paperhound rm <id> Remove a paper from the local library.
paperhound providers List every search provider with its description, default-set membership, runtime availability, env-var status, and one-line setup hint. --json for machine-readable output.
paperhound version Print the installed version.

Run paperhound <command> --help for full options.

🤖 JSON output

--json is the pipe-friendly mode: no headers, no Rich formatting, no progress bars.

# JSONL — one compact JSON object per line
paperhound search "graph neural networks" --json | jq '.title'

# Single compact JSON object
paperhound show 1706.03762 --json | jq .abstract

Schema: paperhound.models.Paper via model_dump(mode="json"). Fields include title, authors[], abstract, year, venue, publication_type (journal/conference/preprint/book/other), url, pdf_url, citation_count, identifiers.{arxiv_id,doi,…}, sources[].

🎚️ Filters

paperhound search supports the filters below, pushed down to providers that support them (OpenAlex, Crossref, Semantic Scholar) and re-applied client-side as a safety net.

Flag Accepted values Example
--year RANGE YYYY, YYYY-YYYY, YYYY-, -YYYY --year 2022-2024
--min-citations N integer ≥ 0 --min-citations 100
--venue STRING case-insensitive substring --venue NeurIPS
--author STRING case-insensitive substring --author Hinton
--type T[,T…] journal, conference, preprint, book, other (repeatable) --type journal,conference
--peer-reviewed shortcut for --type journal,conference,book --peer-reviewed
--preprints-only shortcut for --type preprint --preprints-only
paperhound search "vision transformers" --year 2022-2024 --min-citations 100
paperhound search "deep learning" --venue NeurIPS --author Hinton
paperhound search "diffusion models" --peer-reviewed
paperhound search "agentic workflows" --preprints-only

Papers with unknown year/venue are kept (filter unverifiable); papers with unknown citation_count or unknown publication_type are excluded when the matching filter (--min-citations, --type, --peer-reviewed, or --preprints-only) is set.

📑 Conversion options

Flag Values Default Description
--with-figures off Extract figures to <stem>_assets/ and embed ![](...). Requires --output.
--equations inline, latex inline latex preserves math as $...$ / $$...$$ (uses docling's do_formula_enrichment).
--tables markdown, html markdown html embeds raw <table> blocks for merged/irregular cells.
paperhound convert paper.pdf -o paper.md --with-figures --equations latex --tables html

📤 Export formats

paperhound show 1706.03762 --format bibtex
paperhound show 1706.03762 --format ris
paperhound show 1706.03762 --format csljson

BibTeX cite keys are derived as <firstAuthorLastName><year><firstSignificantTitleWord> (accents stripped, lowercased). LaTeX special characters are escaped automatically.

📚 Local library

paperhound add 1706.03762
paperhound add 1706.03762 --convert
paperhound list
paperhound grep "attention mechanism"
paperhound rm 1706.03762

Default location: ~/.paperhound/library/ (override with PAPERHOUND_LIBRARY_DIR). Re-adds are idempotent.

🔗 Citation graph

paperhound refs 1706.03762
paperhound cited-by 1706.03762 --depth 2 --limit 50
paperhound refs 1706.03762 --source semantic_scholar --json | jq '.[].title'

Default provider order: OpenAlex first, Semantic Scholar fallback. Results are deduplicated by arXiv id / DOI / title. At --depth 2, total fetched is capped at limit * 2.

🧠 Rerank

With paperhound[rerank] installed, every CLI search reranks results by embedding similarity between the query and each candidate's title + abstract.

paperhound search "vision language models"          # rerank on by default
paperhound search "graph neural networks" --no-rerank
paperhound search "agents" --rerank-model sentence-transformers/all-mpnet-base-v2

Without the extra installed, the CLI silently falls back to merge-order ranking — no error, no hang.

🆔 Identifier formats

paperhound accepts whatever you have on hand:

  • arXiv ids: 2401.12345, 2401.12345v3, cs.AI/0301001, arXiv:2401.12345
  • DOIs: 10.1038/s41586-020-2649-2, doi:10.1038/...
  • Semantic Scholar paper ids: 40-char hex
  • URLs: arxiv.org/abs/..., arxiv.org/pdf/..., doi.org/..., semanticscholar.org/paper/...

⚙️ Configuration

Env var Purpose
OPENALEX_MAILTO Optional. Adds your email to OpenAlex requests for the polite pool (better rate limits).
CROSSREF_MAILTO Optional. Same idea for Crossref's polite pool.
CORE_API_KEY Required to enable the CORE provider. Get a free key at https://core.ac.uk/services/api.
SEMANTIC_SCHOLAR_API_KEY Optional. The anonymous quota is shared globally and 429s are common; set this for steadier throughput.
PAPERHOUND_LIBRARY_DIR Override the library directory (default ~/.paperhound/library/).

Run paperhound providers (or paperhound providers --json) to see, at a glance, which providers are configured on the current machine and what to export to enable or upgrade each one.

📚 More

📄 License

MIT — see LICENSE.

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

paperhound-0.5.4.tar.gz (55.4 kB view details)

Uploaded Source

Built Distribution

If you're not sure about the file name format, learn more about wheel file names.

paperhound-0.5.4-py3-none-any.whl (83.4 kB view details)

Uploaded Python 3

File details

Details for the file paperhound-0.5.4.tar.gz.

File metadata

  • Download URL: paperhound-0.5.4.tar.gz
  • Upload date:
  • Size: 55.4 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? Yes
  • Uploaded via: twine/6.1.0 CPython/3.13.12

File hashes

Hashes for paperhound-0.5.4.tar.gz
Algorithm Hash digest
SHA256 dde539bd5c1fd74a9a17a2730b2d9592dbdc2177b327f8e9c2aa6876f490f7ec
MD5 ca7671c38400e80f3134e49da1ea77bd
BLAKE2b-256 c59e06da71eeb9354215d23f25c583f7f42f090f5c8eeacf6f2c8c90c99e3cdc

See more details on using hashes here.

Provenance

The following attestation bundles were made for paperhound-0.5.4.tar.gz:

Publisher: publish.yml on alexfdez1010/paperhound

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

File details

Details for the file paperhound-0.5.4-py3-none-any.whl.

File metadata

  • Download URL: paperhound-0.5.4-py3-none-any.whl
  • Upload date:
  • Size: 83.4 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? Yes
  • Uploaded via: twine/6.1.0 CPython/3.13.12

File hashes

Hashes for paperhound-0.5.4-py3-none-any.whl
Algorithm Hash digest
SHA256 2d9d63fc42d2bd01eed7bfb348e96729c5df4cd44303ae52c842e768b025b245
MD5 1bb2e054491c054d64441e6420daeb12
BLAKE2b-256 ce74f2042083caa6b547ca27cb0bf82366142ccc76fc22154ae2f54de5636301

See more details on using hashes here.

Provenance

The following attestation bundles were made for paperhound-0.5.4-py3-none-any.whl:

Publisher: publish.yml on alexfdez1010/paperhound

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

Supported by

AWS Cloud computing and Security Sponsor Datadog Monitoring Depot Continuous Integration Fastly CDN Google Download Analytics Pingdom Monitoring Sentry Error logging StatusPage Status page