Skip to main content

Distill arxiv research papers into an Obsidian-compatible markdown wiki.

Project description

paper-distiller

Distill arxiv research papers into an Obsidian-compatible markdown wiki.

CI PyPI version Python versions License: MIT

What it does

paper-distiller has two modes, both writing into the same Obsidian-compatible vault:

paper-distiller — single-pass mode. Give it a topic or author and it will:

  1. Search arxiv + Semantic Scholar for relevant papers
  2. Use an LLM to rank the top N most relevant
  3. Download each PDF and extract the text
  4. Distill each paper into a structured Chinese markdown note (一句话 / 问题动因 / 方法 / 关键结果 / 我的 take)
  5. Cross-link each note to existing entries in your vault via [[wikilinks]]
  6. Compose a session survey tying the new notes together

paper-distiller-qa (new in v0.5) — question-driven multi-round research loop. Give it a research question and it will:

  1. Have the LLM judge what's known vs. missing each round, and propose the next search query
  2. Run the single-pass distillation pipeline on the top N results
  3. Repeat until the LLM is confident, OR a budget cap fires (rounds / articles / cost / Ctrl+C)
  4. Synthesize a final cited answer document with an audit trail of all rounds

The output drops directly into a vault directory that opens in Obsidian — graph view, Dataview, tag pane, full-text search all work out of the box.

Install

From PyPI:

pip install paper-distiller

From source (for development or the latest unreleased changes):

git clone https://github.com/jesson-hh/paper-distiller
cd paper-distiller
python -m venv .venv
.venv\Scripts\activate          # Windows
# source .venv/bin/activate     # Linux/macOS
pip install -e .

Quick start

  1. Copy the config template and fill in your LLM API key:

    cp examples/example.env .env
    # then edit .env, replace PD_API_KEY=sk-your-key-here with your real key
    
  2. Point it at your Obsidian vault. Pick a mode:

    Single-pass — distill N papers on a topic:

    paper-distiller --vault /path/to/your/vault --topic "diffusion models for finance" --n 5
    

    Question-driven — let the agent plan multiple rounds and answer a question:

    paper-distiller-qa --vault /path/to/your/vault \
                       --question "What are the latest advances in diffusion models for long-horizon time-series forecasting?" \
                       --max-rounds 3 --per-round 2 --max-cost-cny 5
    
  3. Open your vault in Obsidian. New articles appear under articles/, a session survey under surveys/. QA sessions also write <vault>/.paper_distiller/qa-sessions/<sid>/state.json for --resume after a pause or crash.

How it works

Single-pass (paper-distiller):

search arxiv + SS  →  LLM filter  →  fetch PDFs       →  distill each  →  save articles  →  compose survey
   (~30 hits)         (→ top N)      (with fallback)     (LLM call)       (md+frontmatter)   (LLM call)

Question-driven (paper-distiller-qa) wraps the above in a state-machine loop:

                ┌──────────────────────────────────────────────────────────┐
                │   LLM reflect  →  search  →  rank  →  distill (N papers) │  ← one round
                │       ↑                                          │       │
                │       └──────────────────────────────────────────┘       │
                │                                                          │
                │   Stops when: LLM done / budget hit / no new candidates  │
                └──────────────────────────────────────────────────────────┘
                                       ↓
                          synthesize cited answer  →  surveys/qa-….md

Per-paper cost on qwen-plus / qwen3.5-plus (Aliyun Bailian): roughly $0.02 per paper. A 5-paper single-pass run is around $0.10 USD (~¥0.70). A typical 3-round QA session with 2 papers/round costs ~¥1.5-3.

For module structure, data flow internals, the 7 stop reasons, and how state persistence works, see docs/ARCHITECTURE.md.

Vault layout

paper-distiller writes into a vault with these category subdirectories (created on first run):

Category What goes there
articles/ Paper notes — one entry per paper
surveys/ Cluster mini-surveys composed by paper-distiller, linking multiple articles
techniques/, directions/, open-problems/, authors/ Reserved for human-curated content. paper-distiller doesn't write here in v0.1.

Frontmatter and [[wikilinks]] follow Obsidian conventions — no custom format.

Configuration

Env var Purpose Default
PD_API_KEY LLM API key (Aliyun Bailian, DeepSeek, OpenRouter — any OpenAI-compatible) required
PD_BASE_URL Endpoint base URL required
PD_MODEL Model identifier required
PD_PROVIDER_NAME Logging tag only unspecified
PD_PDF_TIMEOUT PDF download timeout (seconds) 60
PD_MIN_SURVEY Min articles before composing a survey 2

CLI flags override env vars where applicable (--model, --provider).

CLI reference

paper-distiller --vault <path> {--topic <str> | --author <str>}
                [--n 5] [--pool 30] [--source {arxiv,ss,both}]
                [--force] [--dry-run] [--verbose] [--model <name>] [--provider <name>]

paper-distiller-qa --vault <path> --question <str>
                   [--max-rounds 5] [--max-articles 15] [--max-cost-cny 20.0]
                   [--confidence-threshold 8] [--per-round 2]
                   [--source {arxiv,ss,both}] [--interactive] [--resume <session-id>]
                   [--dry-run] [--verbose] [--model <name>] [--provider <name>]

--dry-run skips all LLM calls and vault writes — useful for verifying config before spending API budget.

paper-distiller-qa flags worth knowing:

Flag What it does
--max-rounds N Hard upper bound on loop iterations (default 5). The loop also exits early on llm_done, llm_brake, no_candidates, or budget caps.
--max-articles N Stop after distilling N total articles across rounds (default 15)
--max-cost-cny F Cost circuit breaker, CNY (default 20.0). Uses qwen-plus pricing.
--confidence-threshold N LLM is_done confidence required to stop early (0-10, default 8)
--interactive Pause after each round and prompt Y/n/q
--resume <sid> Pick up a paused or errored session from its state.json

Customizing prompts

All 5 LLM prompts live as plain markdown — edit them directly to change tone, structure, or output language. No Python changes needed.

  • src/paper_distiller/prompts/{filter,article,survey}.md — single-pass mode
  • src/paper_distiller/qa/prompts/{reflect,answer}.md — question-driven mode

Optional companion: semantic search via vault-mcp

paper-distiller does NOT ship its own semantic-search engine for your vault. To search the vault by meaning (not keywords) from Claude Code, pair it with vault-mcp — a standalone MCP server purpose-built for markdown vaults, with live sync and multi-provider embedding support.

See docs/vault-mcp-recommendation.md for setup and rationale.

LLM provider examples

Provider PD_BASE_URL PD_MODEL
Aliyun Bailian (recommended, cheapest) https://dashscope.aliyuncs.com/compatible-mode/v1 qwen-plus
Aliyun Bailian (coding plan) https://coding.dashscope.aliyuncs.com/v1 qwen3.5-plus
DeepSeek https://api.deepseek.com/v1 deepseek-chat
OpenRouter https://openrouter.ai/api/v1 qwen/qwen3.5-plus
Local Ollama http://localhost:11434/v1 qwen2.5

Why "math research" specifically?

The default category schema (articles / techniques / directions / open-problems / authors / surveys) was designed for mathematical/scientific paper research. The tool works fine for other domains today; configurable schema is on the v0.2 roadmap.

Status

v0.5.0 — alpha. Single-pass (paper-distiller) and question-driven multi-round (paper-distiller-qa) modes both work end-to-end. 78 tests passing on Python 3.10/3.11/3.12.

Shipped

  • v0.1.0 — L2 single-pass search-and-distill against arxiv; LLM filter + ranker; PyMuPDF-based extraction; Obsidian-compatible markdown output.
  • v0.2.0 — arxiv-id-based dedup (prevents sibling articles for the same paper under different slugs); restored 500-char full-pdf threshold.
  • v0.3.0 — Semantic Scholar as second paper source (--source {arxiv,ss,both}, default both); PDF fallback chain (when arxiv's PDF download fails, try SS's openAccessPdf); DOI-based dedup.
  • v0.5.0paper-distiller-qa question-driven multi-round research loop. State-machine with 7 stop reasons, --interactive and --resume modes, audit-trail-equipped final answer survey. (See docs/ARCHITECTURE.md for internals.)

Future roadmap

  • v0.4deferred. We explored shipping our own semantic-search MCP server; concluded the right answer is to recommend vault-mcp instead (see docs/vault-mcp-recommendation.md). No v0.4 tag exists.
  • v0.6 — citation-graph traversal: given a seed article, follow its references / cited-by edges and rank them for inclusion.
  • v0.7 — broaden sources beyond arxiv + Semantic Scholar. Likely candidate: integrate OpenCLI to pull from logged-in browser sessions (ACM Digital Library, IEEE Xplore, lab homepages, Chinese platforms like 知乎/B站). Useful for venue-only papers and discussion context around papers.
  • Later / on-demand — per-vault paper-distiller.toml for custom category schemas; LEANN-backed in-pipeline crosslink retrieval (useful only when vault grows past ~500 entries).

Known limitations

  • arxiv.org occasionally returns 503 / 429; paper-distiller retries 3× then exits with a friendly error (use --verbose for the traceback).
  • The "full-pdf vs abstract-only" threshold (500 chars) is conservative; PyMuPDF rarely returns less, but scanned-only PDFs do correctly fall back to abstract-only mode.

Contributing

Issues and PRs welcome. Run tests before submitting:

pip install -e ".[dev]"
pytest -v

License

MIT

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

paper_distiller-0.5.1.tar.gz (778.2 kB view details)

Uploaded Source

Built Distribution

If you're not sure about the file name format, learn more about wheel file names.

paper_distiller-0.5.1-py3-none-any.whl (42.7 kB view details)

Uploaded Python 3

File details

Details for the file paper_distiller-0.5.1.tar.gz.

File metadata

  • Download URL: paper_distiller-0.5.1.tar.gz
  • Upload date:
  • Size: 778.2 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? Yes
  • Uploaded via: twine/6.1.0 CPython/3.13.12

File hashes

Hashes for paper_distiller-0.5.1.tar.gz
Algorithm Hash digest
SHA256 b1b31cb788ce2b1c9dcdd9ad0fe6edd89f81b0396811eaf20cc68164f0f91723
MD5 a701f82cebf1edc79ab75fba3e80418b
BLAKE2b-256 d054371a3fe113f182090095e8bfa93bf230c425f19b5ce7cb741268aa5d5d13

See more details on using hashes here.

Provenance

The following attestation bundles were made for paper_distiller-0.5.1.tar.gz:

Publisher: release.yml on jesson-hh/paper-distiller

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

File details

Details for the file paper_distiller-0.5.1-py3-none-any.whl.

File metadata

  • Download URL: paper_distiller-0.5.1-py3-none-any.whl
  • Upload date:
  • Size: 42.7 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? Yes
  • Uploaded via: twine/6.1.0 CPython/3.13.12

File hashes

Hashes for paper_distiller-0.5.1-py3-none-any.whl
Algorithm Hash digest
SHA256 4ec74031f1a35e9b5c7b4866adc5efa89a119846180b9008d120831c69519c37
MD5 8d222101860b9f47fe987db7308a6195
BLAKE2b-256 c528b83b21b87ba2245a66c5d8cb151ef951f226d7374bd0336e8c5b1539b3bb

See more details on using hashes here.

Provenance

The following attestation bundles were made for paper_distiller-0.5.1-py3-none-any.whl:

Publisher: release.yml on jesson-hh/paper-distiller

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

Supported by

AWS Cloud computing and Security Sponsor Datadog Monitoring Depot Continuous Integration Fastly CDN Google Download Analytics Pingdom Monitoring Sentry Error logging StatusPage Status page