Distill arxiv research papers into an Obsidian-compatible markdown wiki.
Project description
paper-distiller
Distill arxiv research papers into an Obsidian-compatible markdown wiki.
What it does
paper-distiller has two modes, both writing into the same Obsidian-compatible vault:
paper-distiller — single-pass mode. Give it a topic or author and it will:
- Search arxiv + Semantic Scholar for relevant papers
- Use an LLM to rank the top N most relevant
- Download each PDF and extract the text
- Distill each paper into a structured Chinese markdown note (一句话 / 问题动因 / 方法 / 关键结果 / 我的 take)
- Cross-link each note to existing entries in your vault via
[[wikilinks]] - Compose a session survey tying the new notes together
paper-distiller-qa (new in v0.5) — question-driven multi-round research loop. Give it a research question and it will:
- Have the LLM judge what's known vs. missing each round, and propose the next search query
- Run the single-pass distillation pipeline on the top N results
- Repeat until the LLM is confident, OR a budget cap fires (rounds / articles / cost /
Ctrl+C) - Synthesize a final cited answer document with an audit trail of all rounds
The output drops directly into a vault directory that opens in Obsidian — graph view, Dataview, tag pane, full-text search all work out of the box.
Install
From PyPI:
pip install paper-distiller
From source (for development or the latest unreleased changes):
git clone https://github.com/jesson-hh/paper-distiller
cd paper-distiller
python -m venv .venv
.venv\Scripts\activate # Windows
# source .venv/bin/activate # Linux/macOS
pip install -e .
Quick start
-
Copy the config template and fill in your LLM API key:
cp examples/example.env .env # then edit .env, replace PD_API_KEY=sk-your-key-here with your real key -
Point it at your Obsidian vault. Pick a mode:
Single-pass — distill N papers on a topic:
paper-distiller --vault /path/to/your/vault --topic "diffusion models for finance" --n 5Question-driven — let the agent plan multiple rounds and answer a question:
paper-distiller-qa --vault /path/to/your/vault \ --question "What are the latest advances in diffusion models for long-horizon time-series forecasting?" \ --max-rounds 3 --per-round 2 --max-cost-cny 5 -
Open your vault in Obsidian. New articles appear under
articles/, a session survey undersurveys/. QA sessions also write<vault>/.paper_distiller/qa-sessions/<sid>/state.jsonfor--resumeafter a pause or crash.
How it works
Single-pass (paper-distiller):
search arxiv + SS → LLM filter → fetch PDFs → distill each → save articles → compose survey
(~30 hits) (→ top N) (with fallback) (LLM call) (md+frontmatter) (LLM call)
Question-driven (paper-distiller-qa) wraps the above in a state-machine loop:
┌──────────────────────────────────────────────────────────┐
│ LLM reflect → search → rank → distill (N papers) │ ← one round
│ ↑ │ │
│ └──────────────────────────────────────────┘ │
│ │
│ Stops when: LLM done / budget hit / no new candidates │
└──────────────────────────────────────────────────────────┘
↓
synthesize cited answer → surveys/qa-….md
Per-paper cost on qwen-plus / qwen3.5-plus (Aliyun Bailian): roughly $0.02 per paper. A 5-paper single-pass run is around $0.10 USD (~¥0.70). A typical 3-round QA session with 2 papers/round costs ~¥1.5-3.
For module structure, data flow internals, the 7 stop reasons, and how state persistence works, see docs/ARCHITECTURE.md.
Vault layout
paper-distiller writes into a vault with these category subdirectories (created on first run):
| Category | What goes there |
|---|---|
articles/ |
Paper notes — one entry per paper |
surveys/ |
Cluster mini-surveys composed by paper-distiller, linking multiple articles |
techniques/, directions/, open-problems/, authors/ |
Reserved for human-curated content. paper-distiller doesn't write here in v0.1. |
Frontmatter and [[wikilinks]] follow Obsidian conventions — no custom format.
Configuration
| Env var | Purpose | Default |
|---|---|---|
PD_API_KEY |
LLM API key (Aliyun Bailian, DeepSeek, OpenRouter — any OpenAI-compatible) | required |
PD_BASE_URL |
Endpoint base URL | required |
PD_MODEL |
Model identifier | required |
PD_PROVIDER_NAME |
Logging tag only | unspecified |
PD_PDF_TIMEOUT |
PDF download timeout (seconds) | 60 |
PD_MIN_SURVEY |
Min articles before composing a survey | 2 |
CLI flags override env vars where applicable (--model, --provider).
CLI reference
paper-distiller --vault <path> {--topic <str> | --author <str>}
[--n 5] [--pool 30] [--source {arxiv,ss,both}]
[--force] [--dry-run] [--verbose] [--model <name>] [--provider <name>]
paper-distiller-qa --vault <path> --question <str>
[--max-rounds 5] [--max-articles 15] [--max-cost-cny 20.0]
[--confidence-threshold 8] [--per-round 2]
[--source {arxiv,ss,both}] [--interactive] [--resume <session-id>]
[--dry-run] [--verbose] [--model <name>] [--provider <name>]
--dry-run skips all LLM calls and vault writes — useful for verifying config before spending API budget.
paper-distiller-qa flags worth knowing:
| Flag | What it does |
|---|---|
--max-rounds N |
Hard upper bound on loop iterations (default 5). The loop also exits early on llm_done, llm_brake, no_candidates, or budget caps. |
--max-articles N |
Stop after distilling N total articles across rounds (default 15) |
--max-cost-cny F |
Cost circuit breaker, CNY (default 20.0). Uses qwen-plus pricing. |
--confidence-threshold N |
LLM is_done confidence required to stop early (0-10, default 8) |
--interactive |
Pause after each round and prompt Y/n/q |
--resume <sid> |
Pick up a paused or errored session from its state.json |
Customizing prompts
All 5 LLM prompts live as plain markdown — edit them directly to change tone, structure, or output language. No Python changes needed.
src/paper_distiller/prompts/{filter,article,survey}.md— single-pass modesrc/paper_distiller/qa/prompts/{reflect,answer}.md— question-driven mode
Optional companion: semantic search via vault-mcp
paper-distiller does NOT ship its own semantic-search engine for your vault. To search the vault by meaning (not keywords) from Claude Code, pair it with vault-mcp — a standalone MCP server purpose-built for markdown vaults, with live sync and multi-provider embedding support.
See docs/vault-mcp-recommendation.md for setup and rationale.
LLM provider examples
| Provider | PD_BASE_URL |
PD_MODEL |
|---|---|---|
| Aliyun Bailian (recommended, cheapest) | https://dashscope.aliyuncs.com/compatible-mode/v1 |
qwen-plus |
| Aliyun Bailian (coding plan) | https://coding.dashscope.aliyuncs.com/v1 |
qwen3.5-plus |
| DeepSeek | https://api.deepseek.com/v1 |
deepseek-chat |
| OpenRouter | https://openrouter.ai/api/v1 |
qwen/qwen3.5-plus |
| Local Ollama | http://localhost:11434/v1 |
qwen2.5 |
Why "math research" specifically?
The default category schema (articles / techniques / directions / open-problems / authors / surveys) was designed for mathematical/scientific paper research. The tool works fine for other domains today; configurable schema is on the v0.2 roadmap.
Status
v0.5.0 — alpha. Single-pass (paper-distiller) and question-driven multi-round (paper-distiller-qa) modes both work end-to-end. 78 tests passing on Python 3.10/3.11/3.12.
Shipped
- v0.1.0 — L2 single-pass search-and-distill against arxiv; LLM filter + ranker; PyMuPDF-based extraction; Obsidian-compatible markdown output.
- v0.2.0 — arxiv-id-based dedup (prevents sibling articles for the same paper under different slugs); restored 500-char full-pdf threshold.
- v0.3.0 — Semantic Scholar as second paper source (
--source {arxiv,ss,both}, default both); PDF fallback chain (when arxiv's PDF download fails, try SS'sopenAccessPdf); DOI-based dedup. - v0.5.0 —
paper-distiller-qaquestion-driven multi-round research loop. State-machine with 7 stop reasons,--interactiveand--resumemodes, audit-trail-equipped final answer survey. (See docs/ARCHITECTURE.md for internals.)
Future roadmap
- v0.4 — deferred. We explored shipping our own semantic-search MCP server; concluded the right answer is to recommend vault-mcp instead (see docs/vault-mcp-recommendation.md). No v0.4 tag exists.
- v0.6 — citation-graph traversal: given a seed article, follow its references / cited-by edges and rank them for inclusion.
- v0.7 — broaden sources beyond arxiv + Semantic Scholar. Likely candidate: integrate OpenCLI to pull from logged-in browser sessions (ACM Digital Library, IEEE Xplore, lab homepages, Chinese platforms like 知乎/B站). Useful for venue-only papers and discussion context around papers.
- Later / on-demand — per-vault
paper-distiller.tomlfor custom category schemas; LEANN-backed in-pipeline crosslink retrieval (useful only when vault grows past ~500 entries).
Known limitations
- arxiv.org occasionally returns 503 / 429; paper-distiller retries 3× then exits with a friendly error (use
--verbosefor the traceback). - The "full-pdf vs abstract-only" threshold (500 chars) is conservative; PyMuPDF rarely returns less, but scanned-only PDFs do correctly fall back to abstract-only mode.
Contributing
Issues and PRs welcome. Run tests before submitting:
pip install -e ".[dev]"
pytest -v
License
MIT
Project details
Release history Release notifications | RSS feed
Download files
Download the file for your platform. If you're not sure which to choose, learn more about installing packages.
Source Distribution
Built Distribution
Filter files by name, interpreter, ABI, and platform.
If you're not sure about the file name format, learn more about wheel file names.
Copy a direct link to the current filters
File details
Details for the file paper_distiller-0.5.1.tar.gz.
File metadata
- Download URL: paper_distiller-0.5.1.tar.gz
- Upload date:
- Size: 778.2 kB
- Tags: Source
- Uploaded using Trusted Publishing? Yes
- Uploaded via: twine/6.1.0 CPython/3.13.12
File hashes
| Algorithm | Hash digest | |
|---|---|---|
| SHA256 |
b1b31cb788ce2b1c9dcdd9ad0fe6edd89f81b0396811eaf20cc68164f0f91723
|
|
| MD5 |
a701f82cebf1edc79ab75fba3e80418b
|
|
| BLAKE2b-256 |
d054371a3fe113f182090095e8bfa93bf230c425f19b5ce7cb741268aa5d5d13
|
Provenance
The following attestation bundles were made for paper_distiller-0.5.1.tar.gz:
Publisher:
release.yml on jesson-hh/paper-distiller
-
Statement:
-
Statement type:
https://in-toto.io/Statement/v1 -
Predicate type:
https://docs.pypi.org/attestations/publish/v1 -
Subject name:
paper_distiller-0.5.1.tar.gz -
Subject digest:
b1b31cb788ce2b1c9dcdd9ad0fe6edd89f81b0396811eaf20cc68164f0f91723 - Sigstore transparency entry: 1571760878
- Sigstore integration time:
-
Permalink:
jesson-hh/paper-distiller@4c6ea5db7c2a782762d357c104283146315eb230 -
Branch / Tag:
refs/tags/v0.5.1 - Owner: https://github.com/jesson-hh
-
Access:
public
-
Token Issuer:
https://token.actions.githubusercontent.com -
Runner Environment:
github-hosted -
Publication workflow:
release.yml@4c6ea5db7c2a782762d357c104283146315eb230 -
Trigger Event:
push
-
Statement type:
File details
Details for the file paper_distiller-0.5.1-py3-none-any.whl.
File metadata
- Download URL: paper_distiller-0.5.1-py3-none-any.whl
- Upload date:
- Size: 42.7 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? Yes
- Uploaded via: twine/6.1.0 CPython/3.13.12
File hashes
| Algorithm | Hash digest | |
|---|---|---|
| SHA256 |
4ec74031f1a35e9b5c7b4866adc5efa89a119846180b9008d120831c69519c37
|
|
| MD5 |
8d222101860b9f47fe987db7308a6195
|
|
| BLAKE2b-256 |
c528b83b21b87ba2245a66c5d8cb151ef951f226d7374bd0336e8c5b1539b3bb
|
Provenance
The following attestation bundles were made for paper_distiller-0.5.1-py3-none-any.whl:
Publisher:
release.yml on jesson-hh/paper-distiller
-
Statement:
-
Statement type:
https://in-toto.io/Statement/v1 -
Predicate type:
https://docs.pypi.org/attestations/publish/v1 -
Subject name:
paper_distiller-0.5.1-py3-none-any.whl -
Subject digest:
4ec74031f1a35e9b5c7b4866adc5efa89a119846180b9008d120831c69519c37 - Sigstore transparency entry: 1571760927
- Sigstore integration time:
-
Permalink:
jesson-hh/paper-distiller@4c6ea5db7c2a782762d357c104283146315eb230 -
Branch / Tag:
refs/tags/v0.5.1 - Owner: https://github.com/jesson-hh
-
Access:
public
-
Token Issuer:
https://token.actions.githubusercontent.com -
Runner Environment:
github-hosted -
Publication workflow:
release.yml@4c6ea5db7c2a782762d357c104283146315eb230 -
Trigger Event:
push
-
Statement type: