Chat-first paper distillation: turn arXiv papers into an Obsidian-ready knowledge base.
Project description
paper-distiller
Chat-first paper distillation. Turn arXiv papers into an Obsidian-ready knowledge base — via REPL, one-shot commands, or natural language.
paper-distiller is a command-line tool that searches academic paper sources (arXiv + Semantic Scholar), downloads PDFs, has an LLM distill each one into a structured markdown note, and writes everything to a folder that opens directly in Obsidian.
v1.0 ships a single paper-distiller-chat command with three modes:
| Mode | When to use |
|---|---|
paper-distiller-chat (no args) |
Interactive REPL — slash commands + natural-language input |
paper-distiller-chat distill |
One-shot: search a topic, distill N papers |
paper-distiller-chat ask |
One-shot: ask a research question, multi-round QA loop |
paper-distiller-chat resume |
One-shot: continue a paused/errored QA session |
Output is plain markdown with YAML frontmatter and [[wikilink]] cross-references — no proprietary format, no lock-in. Graph view, Dataview, tags, and full-text search all work out of the box.
Install
pip install paper-distiller
Requires Python 3.10+. From source:
git clone https://github.com/jesson-hh/paper-distiller
cd paper-distiller
pip install -e ".[dev]"
Configure
paper-distiller needs an OpenAI-compatible LLM endpoint. Cheapest reliable option: Aliyun Bailian's qwen-plus (~¥0.02 per paper).
cp examples/example.env .env
# Edit .env — set PD_API_KEY, PD_BASE_URL, PD_MODEL
| Env var | Required | Default | Purpose |
|---|---|---|---|
PD_API_KEY |
✓ | — | Any OpenAI-compatible API key |
PD_BASE_URL |
✓ | — | API endpoint base URL |
PD_MODEL |
✓ | — | Model identifier |
PD_PROVIDER_NAME |
unspecified |
Logging tag only | |
PD_PDF_TIMEOUT |
60 |
PDF download timeout (seconds) | |
PD_MIN_SURVEY |
2 |
Min articles before composing a session survey | |
PD_SS_API_KEY |
(none) | Optional — higher Semantic Scholar rate limit |
Provider quick reference
| Provider | PD_BASE_URL |
PD_MODEL |
|---|---|---|
| Aliyun Bailian (recommended) | https://dashscope.aliyuncs.com/compatible-mode/v1 |
qwen-plus |
| Aliyun Bailian (coding plan) | https://coding.dashscope.aliyuncs.com/v1 |
qwen3.5-plus |
| DeepSeek | https://api.deepseek.com/v1 |
deepseek-chat |
| OpenRouter | https://openrouter.ai/api/v1 |
qwen/qwen3.5-plus |
| Local Ollama | http://localhost:11434/v1 |
qwen2.5 |
Use it
Interactive REPL (recommended)
paper-distiller-chat --vault /path/to/your/vault
You see a welcome banner with provider + vault info, then a prompt. Type slash commands or natural language:
> /help
[command list]
> /vault
Vault: /path/to/your/vault
articles: 47
surveys: 6
...
> /distill diffusion models --n 3
[live status table during execution]
> 帮我研究下扩散模型在长周期金融时序生成上的最新进展
[intent-router] Intent: ask | confidence 9
question: 扩散模型在长周期金融时序生成上的最新进展
Missing: max_rounds, per_round, max_cost_cny
Apply defaults (max_rounds=3, per_round=2, max_cost_cny=5.0) and run? [Y/n]
> Y
[live status table for 3-round QA loop]
> /quit
(bye)
10 slash commands available: /distill, /ask, /resume, /sessions, /vault, /provider, /agents, /show, /help, /quit.
Natural-language input goes through an LLM intent-router that classifies into one of distill/ask/resume/show and proposes defaults for any missing parameters. You confirm before any expensive operation runs.
One-shot mode (good for scripts / cron)
Distill N papers on a topic:
paper-distiller-chat distill --vault /path/to/your/vault \
--topic "diffusion models for finance" --n 5
Answer a question across multiple rounds:
paper-distiller-chat ask --vault /path/to/your/vault \
--question "What are recent advances in long-horizon time-series diffusion?" \
--max-rounds 3 --per-round 2 --max-cost-cny 5
Resume a paused / errored session:
paper-distiller-chat resume --vault /path/to/your/vault \
--session-id 20260519-1635-a3f7
Use --dry-run on any subcommand to validate config without spending API budget.
Helpful flags
paper-distiller-chat [--vault PATH]
{distill | ask | resume}
[subcommand-specific flags]
paper-distiller-chat distill --help etc. show every flag for that subcommand.
What you get — a sample distilled article
---
title: "Convergence Rates of Conditional Flow Matching..."
category: articles
slug: cnf-convergence
tags: [generative-models, theory, distribution-estimation, arxiv-2024]
refs: [arxiv:2410.12345]
depth: full-pdf
---
# CFM 的样本复杂度上界
> **场合**: arxiv preprint, 2024 Oct
> **主题**: 给 CFM 训练给出第一个匹配 nonparametric minimax rate 的有限样本界
> **领域**: 统计 / 生成模型理论
## 一句话
作者证明 CFM 训练在 $\beta$-平滑目标密度下达到 $n^{-\beta/(2\beta+d)}$ 的 $W_2$ 收敛速度…
## 方法
核心是把 vector-field 估计误差 decompose 成 (1) approximation error 由 $\beta$-Hölder ball
覆盖控制 (2) statistical error 用 local Rademacher 处理 (3) discretization error 显式给…
## 与已有 wiki 的关联
对 [[cnf-convergence-distribution-learning]] 的分析路线是个自然的强化…
## 我的 take
最有意思的是 time-singularity 在 CFM 训练里其实从未出现…
Open the vault in Obsidian and this article cross-links automatically with everything else you've distilled.
Vault layout
paper-distiller writes into a vault with these subdirectories (auto-created on first run):
| Directory | Auto-written by tool | Description |
|---|---|---|
articles/ |
✓ | One file per paper |
surveys/ |
✓ | Multi-article surveys + qa-… final answer docs |
techniques/, directions/, open-problems/, authors/ |
— | Reserved for human-curated notes |
QA sessions persist resume state at <vault>/.paper_distiller/qa-sessions/<sid>/state.json.
How it works
paper-distiller v1.0 is built around an async DAG of sub-agents:
Single-pass (distill):
arxiv-searcher ss-searcher (parallel)
└────┬────┘
candidate-merger
│
candidate-ranker (LLM)
│
paper-processor × N (parallel: fetch PDF → extract → distill LLM)
│
vault-writer
│
survey-composer (LLM, optional)
Multi-round (ask):
┌──────────────────────────────────────────────────────┐
│ progress-reflector (LLM) │
│ ↓ │
│ [stop check: max_rounds / llm_done / llm_brake / ...] │
│ ↓ │
│ search → dedup → rank → distill × N → write │
└────────────────────────────────────────────────────────┘
↓
answer-synthesizer (LLM) → surveys/qa-<slug>-<date>.md
11 agents, 4 stop reasons in QA mode, all wired together by a topological-level scheduler. For module structure, full data flow, and internal contracts, see docs/ARCHITECTURE.md.
Cost
Aliyun Bailian qwen-plus pricing — roughly ¥2.1/M input tokens, ¥12.7/M output tokens.
| Operation | Typical cost |
|---|---|
| 1 paper distilled | |
| 5-paper single-pass + survey | |
| 3-round QA session @ 2 papers/round | ~¥1.5–3 |
| 5-round QA session @ 3 papers/round | ~¥4–8 |
paper-distiller-chat ask enforces --max-cost-cny (default ¥20). The cost number is for the circuit breaker — not billing-accurate.
Customize the output
All 6 LLM prompts are plain markdown — edit them to change tone, structure, or output language. No Python changes needed.
src/paper_distiller/prompts/{filter,article,survey}.md— distill modesrc/paper_distiller/agents/prompts/route.md— intent routersrc/paper_distiller/qa/prompts/{reflect,answer}.md— QA mode
Defaults produce Chinese-primary notes with this 5-section structure: 一句话 / 问题动因 / 方法 / 关键结果 / 我的 take.
Optional companion: semantic search via vault-mcp
paper-distiller does NOT ship its own semantic-search engine for your vault. To search by meaning (not keywords) from Claude Code, Cursor, or any MCP-aware agent, pair it with vault-mcp.
See docs/vault-mcp-recommendation.md for setup and rationale.
Status & roadmap
v1.0.0 — beta. Chat-first architecture stable; 168 tests passing on Python 3.10 / 3.11 / 3.12.
Migration from v0.5
| v0.5.x | v1.0 |
|---|---|
paper-distiller --topic X --n N |
paper-distiller-chat distill --topic X --n N |
paper-distiller-qa --question Y --max-rounds R |
paper-distiller-chat ask --question Y --max-rounds R |
| (no resume command) | paper-distiller-chat resume --session-id <sid> |
| (no interactive mode) | paper-distiller-chat (no subcommand) |
Flag names and defaults are otherwise preserved. See CHANGELOG for full details.
Coming
- v1.1 — citation-graph traversal: given a seed article, follow references / cited-by edges and rank them for inclusion.
- v1.2 — broaden sources beyond arxiv + SS: integrate browser-session scraping for ACM, IEEE, 知乎 etc.
- Later — per-vault
paper-distiller.tomlfor custom category schemas; LEANN in-pipeline crosslink retrieval for vaults > 500 entries.
Known limitations
- arxiv.org and Semantic Scholar occasionally rate-limit (HTTP 429); QA sessions exit with
error: search failed(resumable viapaper-distiller-chat resume <sid>). - Scanned-only PDFs fall through to abstract-only mode (PyMuPDF doesn't OCR — by design we flag rather than silently distill wrong text).
Contributing
Issues and PRs welcome.
git clone https://github.com/jesson-hh/paper-distiller
cd paper-distiller
pip install -e ".[dev]"
pytest -v
CI runs the same matrix on every PR. For a tour of the codebase, see docs/ARCHITECTURE.md.
License
MIT — see LICENSE.
Project details
Release history Release notifications | RSS feed
Download files
Download the file for your platform. If you're not sure which to choose, learn more about installing packages.
Source Distribution
Built Distribution
Filter files by name, interpreter, ABI, and platform.
If you're not sure about the file name format, learn more about wheel file names.
Copy a direct link to the current filters
File details
Details for the file paper_distiller-1.2.0.tar.gz.
File metadata
- Download URL: paper_distiller-1.2.0.tar.gz
- Upload date:
- Size: 884.5 kB
- Tags: Source
- Uploaded using Trusted Publishing? Yes
- Uploaded via: twine/6.1.0 CPython/3.13.12
File hashes
| Algorithm | Hash digest | |
|---|---|---|
| SHA256 |
380c48da11a66dd3a9543b649296dde9dc0a6142a26e0588f4782c2c4f57653d
|
|
| MD5 |
3aebb872e9e08c3469fa62eab0760338
|
|
| BLAKE2b-256 |
6f3792d27ae7aab69b6d087839d61a776cdc7e262e191ecb1b1db1c89be68663
|
Provenance
The following attestation bundles were made for paper_distiller-1.2.0.tar.gz:
Publisher:
release.yml on jesson-hh/paper-distiller
-
Statement:
-
Statement type:
https://in-toto.io/Statement/v1 -
Predicate type:
https://docs.pypi.org/attestations/publish/v1 -
Subject name:
paper_distiller-1.2.0.tar.gz -
Subject digest:
380c48da11a66dd3a9543b649296dde9dc0a6142a26e0588f4782c2c4f57653d - Sigstore transparency entry: 1573018543
- Sigstore integration time:
-
Permalink:
jesson-hh/paper-distiller@a192c00c6946335f72a6ba8c87fe45f5d4c8f34b -
Branch / Tag:
refs/tags/v1.2.0 - Owner: https://github.com/jesson-hh
-
Access:
public
-
Token Issuer:
https://token.actions.githubusercontent.com -
Runner Environment:
github-hosted -
Publication workflow:
release.yml@a192c00c6946335f72a6ba8c87fe45f5d4c8f34b -
Trigger Event:
push
-
Statement type:
File details
Details for the file paper_distiller-1.2.0-py3-none-any.whl.
File metadata
- Download URL: paper_distiller-1.2.0-py3-none-any.whl
- Upload date:
- Size: 79.8 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? Yes
- Uploaded via: twine/6.1.0 CPython/3.13.12
File hashes
| Algorithm | Hash digest | |
|---|---|---|
| SHA256 |
c8b98e1eb04c1f048a499a614fe60cb52f49d1ad0917fab64bf6d9b578bf2b60
|
|
| MD5 |
350d861cb8eb32a5a2b5313c3529c32e
|
|
| BLAKE2b-256 |
aea093100244eb2af0b7e638697e649ac17355397a4f178e12513bfa2317ae83
|
Provenance
The following attestation bundles were made for paper_distiller-1.2.0-py3-none-any.whl:
Publisher:
release.yml on jesson-hh/paper-distiller
-
Statement:
-
Statement type:
https://in-toto.io/Statement/v1 -
Predicate type:
https://docs.pypi.org/attestations/publish/v1 -
Subject name:
paper_distiller-1.2.0-py3-none-any.whl -
Subject digest:
c8b98e1eb04c1f048a499a614fe60cb52f49d1ad0917fab64bf6d9b578bf2b60 - Sigstore transparency entry: 1573018566
- Sigstore integration time:
-
Permalink:
jesson-hh/paper-distiller@a192c00c6946335f72a6ba8c87fe45f5d4c8f34b -
Branch / Tag:
refs/tags/v1.2.0 - Owner: https://github.com/jesson-hh
-
Access:
public
-
Token Issuer:
https://token.actions.githubusercontent.com -
Runner Environment:
github-hosted -
Publication workflow:
release.yml@a192c00c6946335f72a6ba8c87fe45f5d4c8f34b -
Trigger Event:
push
-
Statement type: