Chat-first paper distillation: turn arXiv papers into an Obsidian-ready knowledge base.

These details have not been verified by PyPI

Project description

paper-distiller

Conversational research agent for arXiv papers. Search → deep-distill → cross-reference proofs. Writes Obsidian-compatible markdown vaults.

paper-distiller is a conversational research agent that talks to you in natural language, decides which of 7 LLM-callable tools to use, and turns arXiv papers into a deeply-distilled, cross-referenced markdown knowledge base.

❯ Search for papers from the last three years on diffusion model theory and pick 5 to distill

⏺ search(topic="diffusion model theory", sort="date", source="arxiv")
● search → 30 candidates  [0.01s]                ← local mirror, zero API hit

⏺ distill_by_id(ids=[...], topic="diffusion theory")
  paper-processor[1/5] LLM distill: Latent Diffusion Convergence...
  paper-processor[2/5] PDF fetch: 2510.12345
  ...
● distill_by_id → 5 articles · 23 theorems extracted · ¥0.21  [12m]

● Distilled 5 papers, all saved to the vault. Theorem 4.3 (arxiv:2510.12345) uses Bernstein
  concentration + Dudley chaining, the same technique as Paper B's Lemma 5.1; the two
  articles are cross-linked in their "Relation to the existing wiki" sections.

qwen3.5-plus  ·  54,000 ↑  12,500 ↓  ·  ¥0.2147  ·  default

Output is plain markdown + YAML frontmatter + [[wikilinks]] — opens directly in Obsidian, works with Dataview, graph view, full-text search.

Features

Conversational REPL — natural language in, LLM decides tool calls, no flag-juggling
7 LLM-callable tools — search · distill_by_id · show · ask · research · ask_user · find_proof
Local arXiv mirror (~1.7M papers, ~5 GB) — bootstrap once via OAI-PMH, search forever zero-latency
Deep 12-section distillation — 3-6k Chinese chars per paper, capturing theorems / proofs / experiments / techniques in a researcher-grade lab-notebook format
Cross-paper proof retrieval (RAG) — every paper's proof sidecar (theorems + techniques) goes into a vault-local SQLite + FTS5 store; future distillations retrieve relevant prior theorems and feed them to the LLM as context, so notation + technique naming converges across the vault
Three-way candidate gathering — hardcoded keyword scan + FTS5 abstract match + LLM pre-extract → cap-and-merge retrieval
Multi-source fallback — arxiv (live + local mirror) / Semantic Scholar / OpenAlex with global per-source throttle + 429 cooldown
5 permission modes — default / auto / bypass / plan / safe, controlling plan-mode preview behavior
Persistent input history — ↑/↓ navigate past prompts across sessions, Ctrl-R reverse search (prompt_toolkit)
Streaming output + spinners — incremental text, per-agent activity reporting, abort with Ctrl-C
Cost tracking — per-turn + session-wide token + ¥ display, configurable budget gates
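
The per-source throttle with a 429 cooldown listed above could be sketched roughly as follows. This is an illustration only, not paper-distiller's actual code: the class name `SourceThrottle`, the helper `pick_source`, the interval and cooldown values, and the injected clock are all assumptions.

```python
import time
from typing import Callable, Dict, List, Optional


class SourceThrottle:
    """Hypothetical per-source rate limiter with a cooldown after HTTP 429."""

    def __init__(self, min_interval: float = 3.0, cooldown: float = 60.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.min_interval = min_interval   # seconds between calls to one source
        self.cooldown = cooldown           # seconds to back off after a 429
        self.clock = clock                 # injectable so the logic is testable
        self._next_ok: Dict[str, float] = {}  # source -> earliest allowed time

    def wait_time(self, source: str) -> float:
        """Seconds the caller should still wait before hitting `source`."""
        return max(0.0, self._next_ok.get(source, 0.0) - self.clock())

    def record_call(self, source: str, status: int) -> None:
        """Update the schedule after a request to `source` returned `status`."""
        delay = self.cooldown if status == 429 else self.min_interval
        self._next_ok[source] = self.clock() + delay


def pick_source(throttle: SourceThrottle, sources: List[str]) -> Optional[str]:
    """Return the first source that is ready now, or None if all must wait."""
    for name in sources:
        if throttle.wait_time(name) == 0.0:
            return name
    return None
```

With a preference order such as `["arxiv", "semantic_scholar", "openalex"]`, a 429 from the first source makes `pick_source` move on to the next one until the cooldown has elapsed.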

Install

pip install paper-distiller

Requires Python 3.10+. From source:

git clone https://github.com/jesson-hh/paper-distiller
cd paper-distiller
pip install -e ".[dev]"
pytest -v       # 436 tests should pass

Configure

paper-distiller needs an OpenAI-compatible LLM endpoint. Cheapest reliable option: Aliyun Bailian's qwen-plus (~¥0.04 per paper at v1.7+ depth).

cp examples/example.env .env
# Edit .env — set PD_API_KEY, PD_BASE_URL, PD_MODEL

Provider quick reference

Provider	`PD_BASE_URL`	`PD_MODEL`
Aliyun Bailian (default)	`https://dashscope.aliyuncs.com/compatible-mode/v1`	`qwen-plus`
Aliyun coding plan	`https://coding.dashscope.aliyuncs.com/v1`	`qwen3.5-plus`
DeepSeek	`https://api.deepseek.com/v1`	`deepseek-chat`
OpenRouter	`https://openrouter.ai/api/v1`	`qwen/qwen3.5-plus`
Local Ollama	`http://localhost:11434/v1`	`qwen2.5`

Configuration env vars

Variable	Default	Purpose
`PD_API_KEY` ✱	—	LLM API key
`PD_BASE_URL` ✱	—	LLM endpoint base
`PD_MODEL` ✱	—	Model identifier
`PD_PERMISSION_MODE`	`default`	Startup mode: `default`/`auto`/`bypass`/`plan`/`safe`
`PD_PLAN_THRESHOLD_CNY`	`10.0`	Estimated tool cost (¥) at or above which the plan-mode preview is shown
`PD_LLM_TIMEOUT`	`600`	LLM read timeout (s); deep distillations can take 3-5 min
`PD_FANOUT_CONCURRENCY`	`5`	Parallel LLM calls during multi-paper distill
`PD_ARXIV_LOCAL_ONLY`	`0`	If `1`, never fall back to live arXiv API
`PD_ARXIV_LOCAL_DIR`	`~/.paper-distiller/arxiv`	Local mirror DB location
`PD_HISTORY_FILE`	`~/.paper-distiller/history.jsonl`	Input history file
`PD_SS_API_KEY`	(none)	Semantic Scholar API key (raises rate limit ~100×)

✱ required. Full list in docs/configuration.md.
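
As a rough illustration of how these variables and defaults fit together, the sketch below reads a subset of them into a typed settings object. The `Settings` dataclass and `load_settings` function are hypothetical names, not part of paper-distiller's API; only the variable names and defaults come from the table above.

```python
import os
from dataclasses import dataclass
from typing import Mapping

REQUIRED = ("PD_API_KEY", "PD_BASE_URL", "PD_MODEL")
MODES = {"default", "auto", "bypass", "plan", "safe"}


@dataclass(frozen=True)
class Settings:
    api_key: str
    base_url: str
    model: str
    permission_mode: str
    plan_threshold_cny: float
    llm_timeout: int
    fanout_concurrency: int
    arxiv_local_only: bool


def load_settings(env: Mapping[str, str] = os.environ) -> Settings:
    """Read the PD_* variables, applying the defaults listed in the table."""
    missing = [name for name in REQUIRED if not env.get(name)]
    if missing:
        raise ValueError("missing required variables: " + ", ".join(missing))
    mode = env.get("PD_PERMISSION_MODE", "default")
    if mode not in MODES:
        raise ValueError(f"unknown permission mode: {mode}")
    return Settings(
        api_key=env["PD_API_KEY"],
        base_url=env["PD_BASE_URL"],
        model=env["PD_MODEL"],
        permission_mode=mode,
        plan_threshold_cny=float(env.get("PD_PLAN_THRESHOLD_CNY", "10.0")),
        llm_timeout=int(env.get("PD_LLM_TIMEOUT", "600")),
        fanout_concurrency=int(env.get("PD_FANOUT_CONCURRENCY", "5")),
        arxiv_local_only=env.get("PD_ARXIV_LOCAL_ONLY", "0") == "1",
    )
```

Passing a plain dict instead of `os.environ` makes it easy to check a `.env` file's contents before launching the REPL.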

Quick start

1. One-time arXiv mirror bootstrap (optional but recommended)

paper-distiller-arxiv bootstrap --since 2020-01-01
# ~2 hours, ~3 GB. Pulls ~600k papers via OAI-PMH. Auto-resumes on SSL errors.

Without the bootstrap, search falls back to the live arXiv API, which is rate-limited. After the bootstrap, search hits a local SQLite + FTS5 index and returns in under 10 ms.

2. Launch the conversational REPL

paper-distiller-chat --vault /path/to/your/vault

You'll see a welcome banner with version, vault, model, and current permission mode. Then talk to it:

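❯ Give me an overview of yuling jiao's representative papers from the last five years
❯ /mode plan                                  # require my OK before any tool
❯ Do deep research on diffusion model theory for me (research)
❯ Which theorems in the vault use the Bernstein inequality?      # → calls find_proof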
❯ /cost
❯ /exit

↑ / ↓ cycles through past prompts (across sessions). Ctrl-C cancels a running tool while the conversation continues; pressing Ctrl-C twice within 1.5 s exits the REPL.
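
The "cancel once, exit on a quick second press" behavior can be expressed with a small state machine. The sketch below is an assumption about how such logic might look, not the project's implementation; `InterruptTracker` and its method are hypothetical names, and the clock is injected so the timing can be tested.

```python
import time
from typing import Callable, Optional


class InterruptTracker:
    """Hypothetical helper: a second Ctrl-C within `window` seconds means exit."""

    def __init__(self, window: float = 1.5,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.window = window
        self.clock = clock
        self._last: Optional[float] = None  # time of the previous lone Ctrl-C

    def on_interrupt(self) -> str:
        """Return "exit" for a quick double press, otherwise "cancel"."""
        now = self.clock()
        is_double = self._last is not None and (now - self._last) <= self.window
        # After an exit decision, forget the history; otherwise remember this press.
        self._last = None if is_double else now
        return "exit" if is_double else "cancel"
```

A REPL loop would typically call `on_interrupt()` from its `except KeyboardInterrupt:` handler and break out of the loop when it returns `"exit"`.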

3. Single-shot mode (for scripts / cron)

# Distill 5 papers
paper-distiller-chat distill --vault X --topic "diffusion theory" --n 5

# Multi-round QA
paper-distiller-chat ask --vault X --question "What are the recent convergence rates for diffusion models?" --max-rounds 5

# Long deep research (5-phase loop)
paper-distiller-chat research --vault X --question "..." --duration 6h --max-papers 40
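
For scripted or cron use, the single-shot `distill` command can be wrapped from Python. This is a minimal sketch that uses only the flags shown above; the function names are hypothetical, and the meaning of the process exit code is an assumption, since the README does not document it.

```python
import subprocess
from typing import List


def distill_command(vault: str, topic: str, n: int) -> List[str]:
    """Build the argv for the single-shot `distill` invocation shown above."""
    if n < 1:
        raise ValueError("n must be at least 1")
    return ["paper-distiller-chat", "distill",
            "--vault", vault, "--topic", topic, "--n", str(n)]


def run_distill(vault: str, topic: str, n: int) -> int:
    """Run the command and return its exit code (needs paper-distiller installed)."""
    # Passing a list avoids shell quoting problems with multi-word topics.
    completed = subprocess.run(distill_command(vault, topic, n), check=False)
    return completed.returncode
```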

The 7 LLM-callable tools

Tool	Purpose
`search(topic, n, source, sort)`	Find papers — defaults to local arXiv mirror
`distill_by_id(ids, topic)`	Download PDFs + 12-section deep distill + sidecar
`show(slug, category)`	Read a vault entry back
`ask(question, ...)`	Multi-round QA loop: search → distill → reflect
`research(question, ...)`	Long-running 5-phase deep research (default 6h, 40 papers)
`ask_user(question, options)`	Pause and let the user pick between 2-4 options
`find_proof(query_type, query)`	Query the vault's accumulated theorem / technique knowledge base

The system prompt steers the LLM to call these tools autonomously. See docs/tools.md for full schemas.

What a distilled article looks like

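Every paper produces a 12-section markdown entry with this structure (translated here into English; the tool writes the article body in Chinese):

# Non-Asymptotic Error Bounds for Bidirectional GANs

> **Venue**: NeurIPS 2021
> **Topic**: First theoretical guarantee of non-asymptotic error bounds for BiGAN under joint distribution matching
> **Field**: Theoretical machine learning / statistical learning theory

## TL;DR (one sentence)
This paper gives the first non-asymptotic error bound for bidirectional GANs (BiGAN), based on the Dudley distance...

## 1. Motivation
Traditional theoretical analyses of GANs rest on three assumptions that depart markedly from practice: (1) matching dimensions; (2) compact support...

## 2. Setup and notation
- **Target distribution** $\mu$: the data distribution supported on $\mathbb{R}^d$
- **Joint distributions**: $\hat{\nu} = \tilde{g}\#\nu$, $\hat{\mu} = \tilde{e}\#\mu$
- **Key assumption**: $\mathcal{F}_1$ is uniformly bounded and 1-Lipschitz...

## 3. Core method
### 3.1 Main idea
### 3.2 Algorithm / construction
### 3.3 Theoretical analysis

## 4. Key theorems / propositions
**Theorem 4.3** (Cross-Dimensional Empirical Pushforward): ...
*Proof sketch*: ...

## 5. Experimental setup
- Datasets: CelebA-HQ (256×256), CIFAR-10
- Baselines: BiGAN-baseline, ALI, ALAE
- Metrics: FID, Inception Score
- Resources: 8× V100, 72 hours of training

## 6. Key results
- FID on CelebA-HQ drops from 18.4 to 12.7 (-31%)
- Proves a rate of $O(n^{-1/2})$ rather than $O(n^{-1/4})$

## 7. Ablations and sensitivity
## 8. Limitations and failure modes
## 9. Relation to the existing wiki       ← [[wikilinks]] to other distilled papers
## 10. Reproduction notes
## 11. My take
## 12. Citation network (optional)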

Each paper also gets a proof_sidecar JSON record, stored in .proof_store/proofs.db:

{
  "theorems": [
    {
      "name": "Theorem 4.3",
      "statement": "...",
      "proof_sketch": "...",
      "techniques_used": ["Bernstein", "Dudley chaining", "ReLU approximation"]
    }
  ],
  "key_techniques": ["Bernstein", "IPM duality", "ReLU approximation", ...]
}

When you later distill a related paper, the LLM automatically receives prior theorems whose techniques overlap — keeping notation and citation patterns coherent across the vault.
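
The idea of retrieving prior theorems by technique overlap can be illustrated with a small in-memory ranking. This is a sketch under stated assumptions: the real store is SQLite + FTS5, and the function `rank_prior_theorems`, the paper ids, and the overlap-count scoring are hypothetical stand-ins. Only the sidecar field names come from the JSON above.

```python
from typing import Dict, List, Tuple


def rank_prior_theorems(new_techniques: List[str],
                        sidecars: Dict[str, dict],
                        top_k: int = 3) -> List[Tuple[str, str, int]]:
    """Rank stored theorems by how many techniques they share with a new paper.

    `sidecars` maps a paper id to a proof_sidecar dict shaped like the JSON above.
    Returns (paper_id, theorem_name, overlap) tuples, best match first.
    """
    wanted = {t.lower() for t in new_techniques}
    scored = []
    for paper_id, sidecar in sidecars.items():
        for thm in sidecar.get("theorems", []):
            used = {t.lower() for t in thm.get("techniques_used", [])}
            overlap = len(wanted & used)
            if overlap:
                scored.append((paper_id, thm["name"], overlap))
    # Sort by overlap (descending), then by id and name for a stable order.
    scored.sort(key=lambda row: (-row[2], row[0], row[1]))
    return scored[:top_k]
```

The top-ranked statements and proof sketches would then be placed in the LLM prompt as context for the new distillation.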

Permission modes

❯ /mode
current permission_mode: default
available modes: default, auto, bypass, plan, safe

  default   show plan-mode preview for tools >= ¥10 (auto-proceed after 5s)
  auto      skip plan-mode previews entirely
  bypass    same as auto (reserved for future destructive-op gates)
  plan      ALWAYS show plan preview, wait for explicit Enter / q
  safe      like plan, but at ¥0 threshold (every tool prompts)

❯ /mode plan
permission_mode → plan

The status line color-codes the current mode:

default — dim
auto — yellow
bypass — bold red (signal: dangerous)
plan — cyan
safe — bold green
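
The mode descriptions above amount to a small decision table. The sketch below encodes one reading of it; `preview_action` and its return labels are hypothetical, and treating `plan` and `safe` identically (both always wait for confirmation) is an assumption drawn from the `/mode` help text.

```python
def preview_action(mode: str, estimated_cost_cny: float,
                   threshold_cny: float = 10.0) -> str:
    """Map a permission mode and an estimated tool cost to a preview behavior.

    Returns "skip" (run immediately), "auto_proceed" (show the preview, then
    continue after a short delay), or "wait" (show it and require confirmation).
    """
    if mode in ("auto", "bypass"):
        return "skip"
    if mode in ("plan", "safe"):
        # plan always previews; safe previews at a ¥0 threshold, so also always.
        return "wait"
    if mode == "default":
        return "auto_proceed" if estimated_cost_cny >= threshold_cny else "skip"
    raise ValueError(f"unknown permission mode: {mode}")
```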

Local arXiv mirror

paper-distiller-arxiv bootstrap [--since 2020-01-01] [--source auto|oai_pmh|internet_archive|kaggle]
paper-distiller-arxiv sync [--since DATE]      # daily increment
paper-distiller-arxiv search "diffusion" --n 10 --sort date --category cs.LG
paper-distiller-arxiv stats                    # papers count, db size, last sync
paper-distiller-arxiv doctor                   # diagnose integrity + connectivity

The mirror uses SQLite + FTS5 with BM25 ranking for keyword retrieval, all running locally. When an FTS5 query finds no match in titles or abstracts, a built-in author search is used as a fallback.
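
To make the retrieval idea concrete, here is a minimal sketch of BM25-ranked FTS5 search with an author fallback, using Python's built-in `sqlite3`. The table layout, column names, and sample rows are invented for illustration and are not the mirror's real schema; the sketch also assumes the local SQLite build has FTS5 enabled, which is common but not guaranteed.

```python
import sqlite3
from typing import Iterable, List, Tuple

Row = Tuple[str, str, str, str]  # (paper_id, title, abstract, authors)


def build_index(rows: Iterable[Row]) -> sqlite3.Connection:
    """Create an in-memory FTS5 table; authors are stored but not indexed."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE VIRTUAL TABLE papers USING fts5("
        "paper_id UNINDEXED, title, abstract, authors UNINDEXED)"
    )
    conn.executemany("INSERT INTO papers VALUES (?, ?, ?, ?)", rows)
    return conn


def search(conn: sqlite3.Connection, query: str, n: int = 10) -> List[Tuple[str, str]]:
    """Full-text search ranked by BM25, falling back to an author substring match."""
    # Quote the query as a phrase so punctuation cannot break FTS5 syntax.
    phrase = '"' + query.replace('"', '""') + '"'
    hits = conn.execute(
        "SELECT paper_id, title FROM papers WHERE papers MATCH ? "
        "ORDER BY bm25(papers) LIMIT ?",  # lower bm25() means a better match
        (phrase, n),
    ).fetchall()
    if hits:
        return hits
    return conn.execute(
        "SELECT paper_id, title FROM papers WHERE authors LIKE ? LIMIT ?",
        (f"%{query}%", n),
    ).fetchall()
```

Keeping `authors` unindexed means a name only matches through the fallback path, which mirrors the "author search when FTS5 misses" behavior described above.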

Cost

Each deep distillation uses ~20-30k input tokens (paper full text) and ~10k output tokens. At qwen-plus rates (¥0.8/M in, ¥2.0/M out):

Operation	Typical cost	Time
1 paper distilled (v1.7+ deep)	~¥0.04	~3 min
5-paper survey	~¥0.21	~5-10 min (5-way concurrent)
`ask` 5 rounds × 3 papers	~¥1-3	~15-25 min
`research` 6h budget, 40 papers	~¥2-5	~1 hour (with local mirror)

Budgets are configurable via the --max-cost-cny flags and, globally, the PD_PLAN_THRESHOLD_CNY environment variable. Plan-mode shows a budget preview before any tool over the threshold runs.
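
The per-paper figure in the table follows directly from the token counts and rates quoted above. The helper below reproduces that arithmetic; the function name is hypothetical, and 25k input tokens is simply the midpoint of the stated 20-30k range.

```python
def distill_cost_cny(input_tokens: int, output_tokens: int,
                     in_rate: float = 0.8, out_rate: float = 2.0) -> float:
    """Cost in ¥, given token counts and per-million-token rates (qwen-plus defaults)."""
    return (input_tokens * in_rate + output_tokens * out_rate) / 1_000_000


one_paper = distill_cost_cny(25_000, 10_000)   # ¥0.02 input + ¥0.02 output
print(f"{one_paper:.2f}")        # 0.04
print(f"{5 * one_paper:.2f}")    # 0.20, close to the ~¥0.21 quoted for a 5-paper survey
```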

Architecture

┌─────────────────────────────────────────────────────────────┐
│ AgentLoop (chat/agent_loop.py)                              │
│   prompt_toolkit input → 7 LLM tools → streaming output     │
└──────────────────┬──────────────────────────────────────────┘
                   ↓ tool call
┌─────────────────────────────────────────────────────────────┐
│ Async DAG orchestrator (agents/orchestrator.py)             │
│   topological scheduling + asyncio.Semaphore fanout cap     │
└─────┬────────────────────┬──────────────────┬───────────────┘
      ↓                    ↓                  ↓
┌──────────────┐  ┌──────────────────┐  ┌─────────────────┐
│ search/      │  │ paper-processor  │  │ vault-writer    │
│ arxiv-local  │  │ × N concurrent   │  │ proof-store     │
│ → 7 agents   │  │ → fetch+distill  │  │ → SQLite + md   │
└──────────────┘  └──────────────────┘  └─────────────────┘

Full module map and data flow: docs/ARCHITECTURE.md.
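
The "asyncio.Semaphore fanout cap" in the diagram refers to a standard pattern, sketched below. This shows only the concurrency cap, not the topological scheduling, and `fan_out` plus the worker signature are hypothetical names rather than the orchestrator's real interface.

```python
import asyncio
from typing import Awaitable, Callable, List


async def fan_out(items: List[str],
                  worker: Callable[[str], Awaitable[str]],
                  concurrency: int = 5) -> List[str]:
    """Run `worker` over `items`, never more than `concurrency` at once."""
    sem = asyncio.Semaphore(concurrency)

    async def guarded(item: str) -> str:
        async with sem:  # blocks while `concurrency` workers are already running
            return await worker(item)

    # gather returns results in the same order as the inputs.
    return list(await asyncio.gather(*(guarded(i) for i in items)))
```

The default of 5 matches `PD_FANOUT_CONCURRENCY`; a real worker would fetch and distill one paper per call.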

Vault layout

your-vault/
├── articles/         # one .md + .html per distilled paper
├── surveys/          # multi-paper syntheses, qa-* final answers
├── techniques/       # reserved for hand-curated notes
├── directions/
├── open-problems/
├── authors/
└── .proof_store/
    └── proofs.db     # SQLite + FTS5 of extracted theorems

The markdown files are Obsidian-compatible. Their HTML siblings use MathJax to render LaTeX.
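
Because articles are plain markdown with [[wikilinks]], the link graph can be inspected outside Obsidian too. The sketch below extracts link targets using Obsidian's `[[target]]`, `[[target|alias]]`, and `[[target#heading]]` forms; the function name and the sample slugs are made up for illustration.

```python
import re
from typing import List

# Captures the note name, then skips an optional "#heading" and "|alias" part.
WIKILINK = re.compile(r"\[\[([^\[\]|#]+)(?:#[^\[\]|]*)?(?:\|[^\[\]]*)?\]\]")


def wikilink_targets(markdown: str) -> List[str]:
    """Return the distinct note names linked from `markdown`, in first-seen order."""
    seen: List[str] = []
    for match in WIKILINK.finditer(markdown):
        target = match.group(1).strip()
        if target not in seen:
            seen.append(target)
    return seen
```

Running this over every file in articles/ would give an adjacency list comparable to what Obsidian's graph view displays.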

Contributing

PRs welcome. See CONTRIBUTING.md for dev setup, test workflow, and conventions.

Issues: GitHub Issues (templates for bug / feature / question).

Security disclosures: see SECURITY.md.

Code of conduct: Contributor Covenant 2.1.

Citation

If you use paper-distiller in academic work, please cite via CITATION.cff.

License

MIT — see LICENSE.

Project details

Release history

This version

1.12.0

May 21, 2026

1.11.0

May 21, 2026

1.10.0

May 21, 2026

1.9.0

May 21, 2026

1.8.0

May 21, 2026

1.7.0

May 20, 2026

1.6.1

May 20, 2026

1.6.0

May 20, 2026

1.5.0

May 19, 2026

1.4.0

May 19, 2026

1.3.0

May 19, 2026

1.2.0

May 19, 2026

1.1.0

May 19, 2026

1.0.0

May 19, 2026

0.5.1

May 19, 2026

Download files

Source Distribution

paper_distiller-1.12.0.tar.gz (1.0 MB)

Uploaded May 21, 2026 Source

Built Distribution

paper_distiller-1.12.0-py3-none-any.whl (150.5 kB)

Uploaded May 21, 2026 Python 3

File details

Details for the file paper_distiller-1.12.0.tar.gz.

File metadata

Download URL: paper_distiller-1.12.0.tar.gz
Upload date: May 21, 2026
Size: 1.0 MB
Tags: Source
Uploaded using Trusted Publishing? Yes
Uploaded via: twine/6.1.0 CPython/3.13.12

File hashes

Hashes for paper_distiller-1.12.0.tar.gz
Algorithm	Hash digest
SHA256	`2ea5e5300d81809397635102de72ba0be05cdb92ab96fdb8eed1563994b63188`
MD5	`4a24a9d2d51ef12a33498b452449fd64`
BLAKE2b-256	`859a2b343a08ed594eb2da73a563f74c19b14ddc4a7b14c9bc7b1d82ddabb076`

Provenance

The following attestation bundles were made for paper_distiller-1.12.0.tar.gz:

Publisher: release.yml on jesson-hh/paper-distiller

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

Statement:
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: paper_distiller-1.12.0.tar.gz
- Subject digest: 2ea5e5300d81809397635102de72ba0be05cdb92ab96fdb8eed1563994b63188
- Sigstore transparency entry: 1591076958
- Sigstore integration time: May 21, 2026
Source repository:
- Permalink: jesson-hh/paper-distiller@dbd3308f49204edfb7425e5271259850bdaad219
- Branch / Tag: refs/tags/v1.12.0
- Owner: https://github.com/jesson-hh
- Access: public
Publication detail:
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: release.yml@dbd3308f49204edfb7425e5271259850bdaad219
- Trigger Event: push

File details

Details for the file paper_distiller-1.12.0-py3-none-any.whl.

File metadata

Download URL: paper_distiller-1.12.0-py3-none-any.whl
Upload date: May 21, 2026
Size: 150.5 kB
Tags: Python 3
Uploaded using Trusted Publishing? Yes
Uploaded via: twine/6.1.0 CPython/3.13.12

File hashes

Hashes for paper_distiller-1.12.0-py3-none-any.whl
Algorithm	Hash digest
SHA256	`abe4b20e8c2b05a4b24680370c5a482acf13734a0a0b00692bbdeeb9b4aec28a`
MD5	`1f7b0baee0d5d5bd389f588190dde62c`
BLAKE2b-256	`bec3bd2194fbcaff42eb51a311013023e033d631c950045cda2827e6be049588`

Provenance

The following attestation bundles were made for paper_distiller-1.12.0-py3-none-any.whl:

Publisher: release.yml on jesson-hh/paper-distiller

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

Statement:
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: paper_distiller-1.12.0-py3-none-any.whl
- Subject digest: abe4b20e8c2b05a4b24680370c5a482acf13734a0a0b00692bbdeeb9b4aec28a
- Sigstore transparency entry: 1591076967
- Sigstore integration time: May 21, 2026
Source repository:
- Permalink: jesson-hh/paper-distiller@dbd3308f49204edfb7425e5271259850bdaad219
- Branch / Tag: refs/tags/v1.12.0
- Owner: https://github.com/jesson-hh
- Access: public
Publication detail:
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: release.yml@dbd3308f49204edfb7425e5271259850bdaad219
- Trigger Event: push
