Distill arxiv research papers into an Obsidian-compatible markdown wiki.

These details have been verified by PyPI

Project links

GitHub Statistics

Maintainers

xuyi1030jesson

These details have not been verified by PyPI

Project description

paper-distiller

Distill arxiv research papers into an Obsidian-compatible markdown wiki.

License: MIT

What it does

paper-distiller has two modes, both writing into the same Obsidian-compatible vault:

paper-distiller — single-pass mode. Give it a topic or author and it will:

Search arxiv + Semantic Scholar for relevant papers
Use an LLM to rank the top N most relevant
Download each PDF and extract the text
Distill each paper into a structured Chinese markdown note (一句话 / 问题动因 / 方法 / 关键结果 / 我的 take)
Cross-link each note to existing entries in your vault via [[wikilinks]]
Compose a session survey tying the new notes together

paper-distiller-qa (new in v0.5) — question-driven multi-round research loop. Give it a research question and it will:

Have the LLM judge what's known vs. missing each round, and propose the next search query
Run the single-pass distillation pipeline on the top N results
Repeat until the LLM is confident, OR a budget cap fires (rounds / articles / cost / Ctrl+C)
Synthesize a final cited answer document with an audit trail of all rounds

The output drops directly into a vault directory that opens in Obsidian — graph view, Dataview, tag pane, full-text search all work out of the box.

Install

From PyPI:

pip install paper-distiller

From source (for development or the latest unreleased changes):

git clone https://github.com/jesson-hh/paper-distiller
cd paper-distiller
python -m venv .venv
.venv\Scripts\activate          # Windows
# source .venv/bin/activate     # Linux/macOS
pip install -e .

Quick start

Copy the config template and fill in your LLM API key:

cp examples/example.env .env
# then edit .env, replace PD_API_KEY=sk-your-key-here with your real key

Point it at your Obsidian vault. Pick a mode:

Single-pass — distill N papers on a topic:

paper-distiller --vault /path/to/your/vault --topic "diffusion models for finance" --n 5

Question-driven — let the agent plan multiple rounds and answer a question:

paper-distiller-qa --vault /path/to/your/vault \
                   --question "What are the latest advances in diffusion models for long-horizon time-series forecasting?" \
                   --max-rounds 3 --per-round 2 --max-cost-cny 5

Open your vault in Obsidian. New articles appear under articles/, a session survey under surveys/. QA sessions also write <vault>/.paper_distiller/qa-sessions/<sid>/state.json for --resume after a pause or crash.

How it works

Single-pass (paper-distiller):

search arxiv + SS  →  LLM filter  →  fetch PDFs       →  distill each  →  save articles  →  compose survey
   (~30 hits)         (→ top N)      (with fallback)     (LLM call)       (md+frontmatter)   (LLM call)

Question-driven (paper-distiller-qa) wraps the above in a state-machine loop:

                ┌──────────────────────────────────────────────────────────┐
                │   LLM reflect  →  search  →  rank  →  distill (N papers) │  ← one round
                │       ↑                                          │       │
                │       └──────────────────────────────────────────┘       │
                │                                                          │
                │   Stops when: LLM done / budget hit / no new candidates  │
                └──────────────────────────────────────────────────────────┘
                                       ↓
                          synthesize cited answer  →  surveys/qa-….md

Per-paper cost on qwen-plus / qwen3.5-plus (Aliyun Bailian): roughly $0.02 per paper. A 5-paper single-pass run is around $0.10 USD (~¥0.70). A typical 3-round QA session with 2 papers/round costs ~¥1.5-3.

For module structure, data flow internals, the 7 stop reasons, and how state persistence works, see docs/ARCHITECTURE.md.

Vault layout

paper-distiller writes into a vault with these category subdirectories (created on first run):

Category	What goes there
`articles/`	Paper notes — one entry per paper
`surveys/`	Cluster mini-surveys composed by paper-distiller, linking multiple articles
`techniques/`, `directions/`, `open-problems/`, `authors/`	Reserved for human-curated content. paper-distiller doesn't write here in v0.1.

Frontmatter and [[wikilinks]] follow Obsidian conventions — no custom format.

Configuration

Env var	Purpose	Default
`PD_API_KEY`	LLM API key (Aliyun Bailian, DeepSeek, OpenRouter — any OpenAI-compatible)	required
`PD_BASE_URL`	Endpoint base URL	required
`PD_MODEL`	Model identifier	required
`PD_PROVIDER_NAME`	Logging tag only	`unspecified`
`PD_PDF_TIMEOUT`	PDF download timeout (seconds)	`60`
`PD_MIN_SURVEY`	Min articles before composing a survey	`2`

CLI flags override env vars where applicable (--model, --provider).

CLI reference

paper-distiller --vault <path> {--topic <str> | --author <str>}
                [--n 5] [--pool 30] [--source {arxiv,ss,both}]
                [--force] [--dry-run] [--verbose] [--model <name>] [--provider <name>]

paper-distiller-qa --vault <path> --question <str>
                   [--max-rounds 5] [--max-articles 15] [--max-cost-cny 20.0]
                   [--confidence-threshold 8] [--per-round 2]
                   [--source {arxiv,ss,both}] [--interactive] [--resume <session-id>]
                   [--dry-run] [--verbose] [--model <name>] [--provider <name>]

--dry-run skips all LLM calls and vault writes — useful for verifying config before spending API budget.

paper-distiller-qa flags worth knowing:

Flag	What it does
`--max-rounds N`	Hard upper bound on loop iterations (default 5). The loop also exits early on `llm_done`, `llm_brake`, `no_candidates`, or budget caps.
`--max-articles N`	Stop after distilling N total articles across rounds (default 15)
`--max-cost-cny F`	Cost circuit breaker, CNY (default 20.0). Uses qwen-plus pricing.
`--confidence-threshold N`	LLM `is_done` confidence required to stop early (0-10, default 8)
`--interactive`	Pause after each round and prompt Y/n/q
`--resume <sid>`	Pick up a paused or errored session from its state.json

Customizing prompts

All 5 LLM prompts live as plain markdown — edit them directly to change tone, structure, or output language. No Python changes needed.

src/paper_distiller/prompts/{filter,article,survey}.md — single-pass mode
src/paper_distiller/qa/prompts/{reflect,answer}.md — question-driven mode

Optional companion: semantic search via vault-mcp

paper-distiller does NOT ship its own semantic-search engine for your vault. To search the vault by meaning (not keywords) from Claude Code, pair it with vault-mcp — a standalone MCP server purpose-built for markdown vaults, with live sync and multi-provider embedding support.

See docs/vault-mcp-recommendation.md for setup and rationale.

LLM provider examples

Provider	`PD_BASE_URL`	`PD_MODEL`
Aliyun Bailian (recommended, cheapest)	`https://dashscope.aliyuncs.com/compatible-mode/v1`	`qwen-plus`
Aliyun Bailian (coding plan)	`https://coding.dashscope.aliyuncs.com/v1`	`qwen3.5-plus`
DeepSeek	`https://api.deepseek.com/v1`	`deepseek-chat`
OpenRouter	`https://openrouter.ai/api/v1`	`qwen/qwen3.5-plus`
Local Ollama	`http://localhost:11434/v1`	`qwen2.5`

Why "math research" specifically?

The default category schema (articles / techniques / directions / open-problems / authors / surveys) was designed for mathematical/scientific paper research. The tool works fine for other domains today; configurable schema is on the v0.2 roadmap.

Status

v0.5.0 — alpha. Single-pass (paper-distiller) and question-driven multi-round (paper-distiller-qa) modes both work end-to-end. 78 tests passing on Python 3.10/3.11/3.12.

Shipped

v0.1.0 — L2 single-pass search-and-distill against arxiv; LLM filter + ranker; PyMuPDF-based extraction; Obsidian-compatible markdown output.
v0.2.0 — arxiv-id-based dedup (prevents sibling articles for the same paper under different slugs); restored 500-char full-pdf threshold.
v0.3.0 — Semantic Scholar as second paper source (--source {arxiv,ss,both}, default both); PDF fallback chain (when arxiv's PDF download fails, try SS's openAccessPdf); DOI-based dedup.
v0.5.0 — paper-distiller-qa question-driven multi-round research loop. State-machine with 7 stop reasons, --interactive and --resume modes, audit-trail-equipped final answer survey. (See docs/ARCHITECTURE.md for internals.)

Future roadmap

v0.4 — deferred. We explored shipping our own semantic-search MCP server; concluded the right answer is to recommend vault-mcp instead (see docs/vault-mcp-recommendation.md). No v0.4 tag exists.
v0.6 — citation-graph traversal: given a seed article, follow its references / cited-by edges and rank them for inclusion.
v0.7 — broaden sources beyond arxiv + Semantic Scholar. Likely candidate: integrate OpenCLI to pull from logged-in browser sessions (ACM Digital Library, IEEE Xplore, lab homepages, Chinese platforms like 知乎/B站). Useful for venue-only papers and discussion context around papers.
Later / on-demand — per-vault paper-distiller.toml for custom category schemas; LEANN-backed in-pipeline crosslink retrieval (useful only when vault grows past ~500 entries).

Known limitations

arxiv.org occasionally returns 503 / 429; paper-distiller retries 3× then exits with a friendly error (use --verbose for the traceback).
The "full-pdf vs abstract-only" threshold (500 chars) is conservative; PyMuPDF rarely returns less, but scanned-only PDFs do correctly fall back to abstract-only mode.

Contributing

Issues and PRs welcome. Run tests before submitting:

pip install -e ".[dev]"
pytest -v

License

MIT

Project details

These details have been verified by PyPI

Project links

GitHub Statistics

Maintainers

xuyi1030jesson

These details have not been verified by PyPI

Release history Release notifications | RSS feed

1.12.0

May 21, 2026

1.11.0

May 21, 2026

1.10.0

May 21, 2026

1.9.0

May 21, 2026

1.8.0

May 21, 2026

1.7.0

May 20, 2026

1.6.1

May 20, 2026

1.6.0

May 20, 2026

1.5.0

May 19, 2026

1.4.0

May 19, 2026

1.3.0

May 19, 2026

1.2.0

May 19, 2026

1.1.0

May 19, 2026

1.0.0

May 19, 2026

This version

0.5.1

May 19, 2026

Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

paper_distiller-0.5.1.tar.gz (778.2 kB view details)

Uploaded May 19, 2026 Source

Built Distribution

If you're not sure about the file name format, learn more about wheel file names.

The dropdown lists show the available interpreters, ABIs, and platforms. Enable javascript to be able to filter the list of wheel files.

paper_distiller-0.5.1-py3-none-any.whl (42.7 kB view details)

Uploaded May 19, 2026 Python 3

File details

Details for the file paper_distiller-0.5.1.tar.gz.

File metadata

Download URL: paper_distiller-0.5.1.tar.gz
Upload date: May 19, 2026
Size: 778.2 kB
Tags: Source
Uploaded using Trusted Publishing? Yes
Uploaded via: twine/6.1.0 CPython/3.13.12

File hashes

Hashes for paper_distiller-0.5.1.tar.gz
Algorithm	Hash digest
SHA256	`b1b31cb788ce2b1c9dcdd9ad0fe6edd89f81b0396811eaf20cc68164f0f91723`
MD5	`a701f82cebf1edc79ab75fba3e80418b`
BLAKE2b-256	`d054371a3fe113f182090095e8bfa93bf230c425f19b5ce7cb741268aa5d5d13`

See more details on using hashes here.

Provenance

The following attestation bundles were made for paper_distiller-0.5.1.tar.gz:

Publisher: release.yml on jesson-hh/paper-distiller

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

Statement:
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: paper_distiller-0.5.1.tar.gz
- Subject digest: b1b31cb788ce2b1c9dcdd9ad0fe6edd89f81b0396811eaf20cc68164f0f91723
- Sigstore transparency entry: 1571760878
- Sigstore integration time: May 19, 2026
Source repository:
- Permalink: jesson-hh/paper-distiller@4c6ea5db7c2a782762d357c104283146315eb230
- Branch / Tag: refs/tags/v0.5.1
- Owner: https://github.com/jesson-hh
- Access: public
Publication detail:
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: release.yml@4c6ea5db7c2a782762d357c104283146315eb230
- Trigger Event: push

File details

Details for the file paper_distiller-0.5.1-py3-none-any.whl.

File metadata

Download URL: paper_distiller-0.5.1-py3-none-any.whl
Upload date: May 19, 2026
Size: 42.7 kB
Tags: Python 3
Uploaded using Trusted Publishing? Yes
Uploaded via: twine/6.1.0 CPython/3.13.12

File hashes

Hashes for paper_distiller-0.5.1-py3-none-any.whl
Algorithm	Hash digest
SHA256	`4ec74031f1a35e9b5c7b4866adc5efa89a119846180b9008d120831c69519c37`
MD5	`8d222101860b9f47fe987db7308a6195`
BLAKE2b-256	`c528b83b21b87ba2245a66c5d8cb151ef951f226d7374bd0336e8c5b1539b3bb`

See more details on using hashes here.

Provenance

The following attestation bundles were made for paper_distiller-0.5.1-py3-none-any.whl:

Publisher: release.yml on jesson-hh/paper-distiller

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

Statement:
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: paper_distiller-0.5.1-py3-none-any.whl
- Subject digest: 4ec74031f1a35e9b5c7b4866adc5efa89a119846180b9008d120831c69519c37
- Sigstore transparency entry: 1571760927
- Sigstore integration time: May 19, 2026
Source repository:
- Permalink: jesson-hh/paper-distiller@4c6ea5db7c2a782762d357c104283146315eb230
- Branch / Tag: refs/tags/v0.5.1
- Owner: https://github.com/jesson-hh
- Access: public
Publication detail:
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: release.yml@4c6ea5db7c2a782762d357c104283146315eb230
- Trigger Event: push

paper-distiller 0.5.1

Navigation

Verified details

Project links

GitHub Statistics

Maintainers

Unverified details

Meta

Classifiers

Project description

paper-distiller

What it does

Install

Quick start

How it works

Vault layout

Configuration

CLI reference

Customizing prompts

Optional companion: semantic search via vault-mcp

LLM provider examples

Why "math research" specifically?

Status

Shipped

Future roadmap

Known limitations

Contributing

License

Project details

Verified details

Project links

GitHub Statistics

Maintainers

Unverified details

Meta

Classifiers

Release history Release notifications | RSS feed

Download files

Source Distribution

Built Distribution

File details

File metadata

File hashes

Provenance

File details

File metadata

File hashes

Provenance