Skip to main content

Chat-first paper distillation: turn arXiv papers into an Obsidian-ready knowledge base.

Project description

paper-distiller

Chat-first paper distillation. Turn arXiv papers into an Obsidian-ready knowledge base — via REPL, one-shot commands, or natural language.

CI PyPI version Python versions License: MIT

paper-distiller is a command-line tool that searches academic paper sources (arXiv + Semantic Scholar), downloads PDFs, has an LLM distill each one into a structured markdown note, and writes everything to a folder that opens directly in Obsidian.

v1.0 ships a single paper-distiller-chat command with three modes:

Mode When to use
paper-distiller-chat (no args) Interactive REPL — slash commands + natural-language input
paper-distiller-chat distill One-shot: search a topic, distill N papers
paper-distiller-chat ask One-shot: ask a research question, multi-round QA loop
paper-distiller-chat resume One-shot: continue a paused/errored QA session

Output is plain markdown with YAML frontmatter and [[wikilink]] cross-references — no proprietary format, no lock-in. Graph view, Dataview, tags, and full-text search all work out of the box.


Install

pip install paper-distiller

Requires Python 3.10+. From source:

git clone https://github.com/jesson-hh/paper-distiller
cd paper-distiller
pip install -e ".[dev]"

Configure

paper-distiller needs an OpenAI-compatible LLM endpoint. Cheapest reliable option: Aliyun Bailian's qwen-plus (~¥0.02 per paper).

cp examples/example.env .env
# Edit .env — set PD_API_KEY, PD_BASE_URL, PD_MODEL
Env var Required Default Purpose
PD_API_KEY Any OpenAI-compatible API key
PD_BASE_URL API endpoint base URL
PD_MODEL Model identifier
PD_PROVIDER_NAME unspecified Logging tag only
PD_PDF_TIMEOUT 60 PDF download timeout (seconds)
PD_MIN_SURVEY 2 Min articles before composing a session survey
PD_SS_API_KEY (none) Optional — higher Semantic Scholar rate limit

Provider quick reference

Provider PD_BASE_URL PD_MODEL
Aliyun Bailian (recommended) https://dashscope.aliyuncs.com/compatible-mode/v1 qwen-plus
Aliyun Bailian (coding plan) https://coding.dashscope.aliyuncs.com/v1 qwen3.5-plus
DeepSeek https://api.deepseek.com/v1 deepseek-chat
OpenRouter https://openrouter.ai/api/v1 qwen/qwen3.5-plus
Local Ollama http://localhost:11434/v1 qwen2.5

Use it

Interactive REPL (recommended)

paper-distiller-chat --vault /path/to/your/vault

You see a welcome banner with provider + vault info, then a prompt. Type slash commands or natural language:

> /help
[command list]

> /vault
Vault: /path/to/your/vault
  articles: 47
  surveys: 6
  ...

> /distill diffusion models --n 3
[live status table during execution]

> 帮我研究下扩散模型在长周期金融时序生成上的最新进展
[intent-router] Intent: ask  | confidence 9
  question: 扩散模型在长周期金融时序生成上的最新进展
Missing: max_rounds, per_round, max_cost_cny
Apply defaults (max_rounds=3, per_round=2, max_cost_cny=5.0) and run? [Y/n]
> Y
[live status table for 3-round QA loop]

> /quit
  (bye)

10 slash commands available: /distill, /ask, /resume, /sessions, /vault, /provider, /agents, /show, /help, /quit.

Natural-language input goes through an LLM intent-router that classifies into one of distill/ask/resume/show and proposes defaults for any missing parameters. You confirm before any expensive operation runs.

One-shot mode (good for scripts / cron)

Distill N papers on a topic:

paper-distiller-chat distill --vault /path/to/your/vault \
    --topic "diffusion models for finance" --n 5

Answer a question across multiple rounds:

paper-distiller-chat ask --vault /path/to/your/vault \
    --question "What are recent advances in long-horizon time-series diffusion?" \
    --max-rounds 3 --per-round 2 --max-cost-cny 5

Resume a paused / errored session:

paper-distiller-chat resume --vault /path/to/your/vault \
    --session-id 20260519-1635-a3f7

Use --dry-run on any subcommand to validate config without spending API budget.

Helpful flags

paper-distiller-chat [--vault PATH]
                     {distill | ask | resume}
                     [subcommand-specific flags]

paper-distiller-chat distill --help etc. show every flag for that subcommand.


What you get — a sample distilled article

---
title: "Convergence Rates of Conditional Flow Matching..."
category: articles
slug: cnf-convergence
tags: [generative-models, theory, distribution-estimation, arxiv-2024]
refs: [arxiv:2410.12345]
depth: full-pdf
---

# CFM 的样本复杂度上界

> **场合**: arxiv preprint, 2024 Oct
> **主题**: 给 CFM 训练给出第一个匹配 nonparametric minimax rate 的有限样本界
> **领域**: 统计 / 生成模型理论

## 一句话
作者证明 CFM 训练在 $\beta$-平滑目标密度下达到 $n^{-\beta/(2\beta+d)}$ 的 $W_2$ 收敛速度…

## 方法
核心是把 vector-field 估计误差 decompose 成 (1) approximation error 由 $\beta$-Hölder ball
覆盖控制 (2) statistical error 用 local Rademacher 处理 (3) discretization error 显式给…

## 与已有 wiki 的关联
对 [[cnf-convergence-distribution-learning]] 的分析路线是个自然的强化…

## 我的 take
最有意思的是 time-singularity 在 CFM 训练里其实从未出现…

Open the vault in Obsidian and this article cross-links automatically with everything else you've distilled.


Vault layout

paper-distiller writes into a vault with these subdirectories (auto-created on first run):

Directory Auto-written by tool Description
articles/ One file per paper
surveys/ Multi-article surveys + qa-… final answer docs
techniques/, directions/, open-problems/, authors/ Reserved for human-curated notes

QA sessions persist resume state at <vault>/.paper_distiller/qa-sessions/<sid>/state.json.


How it works

paper-distiller v1.0 is built around an async DAG of sub-agents:

Single-pass (distill):
  arxiv-searcher  ss-searcher          (parallel)
        └────┬────┘
        candidate-merger
              │
        candidate-ranker (LLM)
              │
        paper-processor × N            (parallel: fetch PDF → extract → distill LLM)
              │
        vault-writer
              │
        survey-composer (LLM, optional)

Multi-round (ask):
  ┌──────────────────────────────────────────────────────┐
  │  progress-reflector (LLM)                             │
  │      ↓                                                │
  │  [stop check: max_rounds / llm_done / llm_brake / ...] │
  │      ↓                                                │
  │  search → dedup → rank → distill × N → write          │
  └────────────────────────────────────────────────────────┘
                          ↓
                  answer-synthesizer (LLM) → surveys/qa-<slug>-<date>.md

11 agents, 4 stop reasons in QA mode, all wired together by a topological-level scheduler. For module structure, full data flow, and internal contracts, see docs/ARCHITECTURE.md.


Cost

Aliyun Bailian qwen-plus pricing — roughly ¥2.1/M input tokens, ¥12.7/M output tokens.

Operation Typical cost
1 paper distilled ¥0.02 ($0.003)
5-paper single-pass + survey ¥0.7 ($0.10)
3-round QA session @ 2 papers/round ~¥1.5–3
5-round QA session @ 3 papers/round ~¥4–8

paper-distiller-chat ask enforces --max-cost-cny (default ¥20). The cost number is for the circuit breaker — not billing-accurate.


Customize the output

All 6 LLM prompts are plain markdown — edit them to change tone, structure, or output language. No Python changes needed.

  • src/paper_distiller/prompts/{filter,article,survey}.md — distill mode
  • src/paper_distiller/agents/prompts/route.md — intent router
  • src/paper_distiller/qa/prompts/{reflect,answer}.md — QA mode

Defaults produce Chinese-primary notes with this 5-section structure: 一句话 / 问题动因 / 方法 / 关键结果 / 我的 take.


Optional companion: semantic search via vault-mcp

paper-distiller does NOT ship its own semantic-search engine for your vault. To search by meaning (not keywords) from Claude Code, Cursor, or any MCP-aware agent, pair it with vault-mcp.

See docs/vault-mcp-recommendation.md for setup and rationale.


Status & roadmap

v1.0.0 — beta. Chat-first architecture stable; 168 tests passing on Python 3.10 / 3.11 / 3.12.

Migration from v0.5

v0.5.x v1.0
paper-distiller --topic X --n N paper-distiller-chat distill --topic X --n N
paper-distiller-qa --question Y --max-rounds R paper-distiller-chat ask --question Y --max-rounds R
(no resume command) paper-distiller-chat resume --session-id <sid>
(no interactive mode) paper-distiller-chat (no subcommand)

Flag names and defaults are otherwise preserved. See CHANGELOG for full details.

Coming

  • v1.1 — citation-graph traversal: given a seed article, follow references / cited-by edges and rank them for inclusion.
  • v1.2 — broaden sources beyond arxiv + SS: integrate browser-session scraping for ACM, IEEE, 知乎 etc.
  • Later — per-vault paper-distiller.toml for custom category schemas; LEANN in-pipeline crosslink retrieval for vaults > 500 entries.

Known limitations

  • arxiv.org and Semantic Scholar occasionally rate-limit (HTTP 429); QA sessions exit with error: search failed (resumable via paper-distiller-chat resume <sid>).
  • Scanned-only PDFs fall through to abstract-only mode (PyMuPDF doesn't OCR — by design we flag rather than silently distill wrong text).

Contributing

Issues and PRs welcome.

git clone https://github.com/jesson-hh/paper-distiller
cd paper-distiller
pip install -e ".[dev]"
pytest -v

CI runs the same matrix on every PR. For a tour of the codebase, see docs/ARCHITECTURE.md.


License

MIT — see LICENSE.

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

paper_distiller-1.3.0.tar.gz (886.8 kB view details)

Uploaded Source

Built Distribution

If you're not sure about the file name format, learn more about wheel file names.

paper_distiller-1.3.0-py3-none-any.whl (81.4 kB view details)

Uploaded Python 3

File details

Details for the file paper_distiller-1.3.0.tar.gz.

File metadata

  • Download URL: paper_distiller-1.3.0.tar.gz
  • Upload date:
  • Size: 886.8 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? Yes
  • Uploaded via: twine/6.1.0 CPython/3.13.12

File hashes

Hashes for paper_distiller-1.3.0.tar.gz
Algorithm Hash digest
SHA256 9f0db0ffb2d5787fd2210e613099d20641b7691a06430ac9db73e72b3bb18849
MD5 e168ff091417ff222dd3dfb4110ba458
BLAKE2b-256 717a977a899958a1cf2ba52957fed057f6d903a6e9b0ebbdd83ef5b1427bbea9

See more details on using hashes here.

Provenance

The following attestation bundles were made for paper_distiller-1.3.0.tar.gz:

Publisher: release.yml on jesson-hh/paper-distiller

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

File details

Details for the file paper_distiller-1.3.0-py3-none-any.whl.

File metadata

  • Download URL: paper_distiller-1.3.0-py3-none-any.whl
  • Upload date:
  • Size: 81.4 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? Yes
  • Uploaded via: twine/6.1.0 CPython/3.13.12

File hashes

Hashes for paper_distiller-1.3.0-py3-none-any.whl
Algorithm Hash digest
SHA256 b5ab2611dd5ced056d35dceb8ce42a5779536a02c0b18ee58a0e75893263ba19
MD5 63f484994c52c17315916d628e0dfd29
BLAKE2b-256 518853f81b956833c447fbad8e29354fc902a33a79c5b718a6f2781370df35ce

See more details on using hashes here.

Provenance

The following attestation bundles were made for paper_distiller-1.3.0-py3-none-any.whl:

Publisher: release.yml on jesson-hh/paper-distiller

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

Supported by

AWS Cloud computing and Security Sponsor Datadog Monitoring Depot Continuous Integration Fastly CDN Google Download Analytics Pingdom Monitoring Sentry Error logging StatusPage Status page