Keyword-driven paper search assistant that exports slides, summaries, and BibTeX.

These details have not been verified by PyPI

Project description

AutoPaperToPPT

Languages: English · 繁體中文 · 简体中文 · 日本語 · Español · Français · Deutsch · 한국어 · Português · Русский · Italiano · Tiếng Việt · हिन्दी · Bahasa Indonesia Documentation: autopapertoppt.readthedocs.io

A keyword-driven paper search assistant that fetches results from arXiv, Semantic Scholar, OpenAlex, PubMed, ACM (via Crossref), IEEE Xplore, DBLP, generic Crossref, OpenAIRE, Springer Nature, and Google Scholar, normalises them into one record shape, and exports the deduplicated set as a thesis-style PowerPoint deck, an Excel workbook, and a BibTeX file — all from one CLI call or one MCP tool call. Optionally enriches each paper by reading its PDF and producing a structured per-paper summary either in-context (LLM-as-agent flow) or via the Anthropic API (Python pipeline flow).

For AI agents driving this project

If you are an LLM (Claude Code, Cursor, Aider, Codex CLI, …) reading this README to figure out what to do — start here. Everything below is also pinned in AGENTS.md (cross-agent mirror) and CLAUDE.md (canonical reference); read those before making non-trivial changes.

What the user actually wants

The default deliverable is a thesis-style rich .pptx per paper, not the lightweight abstract-only deck the CLI produces by default. The lightweight emit is an intermediate artefact — your job is to upgrade it.

Decision tree

Is ANTHROPIC_API_KEY set in the environment? → just run the CLI; it auto-enriches via the Anthropic API. You are done.
No key, but you (an LLM) are driving the session → you produce the rich summary yourself. Read each PDF, hand-author a PaperSummary with rich-tier fields (pain_points, research_question, contributions_detailed, headline_metrics, technique_table, method_sections, evaluation_sections, system_flow, research_questions, rq_results, core_observation, limitations, future_work), drop a scripts/regen_<query>.py, run it. Do not tell the user to set the API key — you are the LLM that would have written the summary.
No LLM in the loop (CI / cron / unattended) → lightweight is acceptable.

6-step MCP workflow

1. (optional) list_sources()                              # see which plugins are enabled
2. search(keywords, sources, top_tier_only=true, ...)
3. (optional) download_pdfs(papers, out_dir="./exports/...")
4. fetch_pdf_text(pdf_url=paper.pdf_url)                  # per paper
5. (you read each PDF and produce a structured summary dict)
6. export(papers=[{...paper, "summary": {...}}], language="zh-tw", ...)

All eleven MCP tools (including list_sources, download_pdfs, pptx_inspect / pptx_update_slide / pptx_add_slide / etc.) are documented in docs/mcp.md.

Mandatory: URL / DOI verification before shipping

Publisher URL paths cannot be guessed — AAAI uses numeric IDs (v40i5.37389), IEEE uses an opaque arnumber, ACM uses opaque DOIs. When you hand-author a Paper, copy url / doi / arxiv_id verbatim from the search xlsx that produced this run — never from memory, never constructed from the title.

The xlsx is written to exports/<run>/<slug>-<timestamp>.xlsx with column 7 = DOI, column 8 = URL. Audit your regen script when you finish:

from openpyxl import load_workbook
from scripts.regen_<run> import ALL_PAPERS
real = {sh.cell(row=r, column=2).value: sh.cell(row=r, column=8).value
        for sh in [load_workbook("exports/<run>/<slug>-<ts>.xlsx")["Papers"]]
        for r in range(2, sh.max_row + 1)}
for p in ALL_PAPERS:
    actual = next((u for t, u in real.items() if p.title[:30] in (t or "")), None)
    if actual and not (p.url == actual
                       or p.url.split("v")[0] == actual.split("v")[0]):
        print(f"! {p.bibtex_key()} authored {p.url} vs real {actual}")

Two fabrications caught this way in production: wrong AAAI volume (v39i23.34521 vs real v39i22.34537) and invented author-slug path (view/fang2026 instead of v40i5.37389).

Mandatory: prune irrelevant downloads before shipping

Search keyword matching is keyword-based, so off-topic papers will slip in: a "Claude code" query returned a Viterbi-decoder paper because both contain "code"; "LLM code review" matched an object-detection literature review. Once you read the abstracts and classify a paper off-topic for the user's actual intent, prune the run dir:

from pathlib import Path
run = Path("exports/<run>")
irrelevant_keys = ("key-of-off-topic-paper-1", "key-of-off-topic-paper-2")
for key in irrelevant_keys:
    for path in (run / "pdfs" / f"{key}.pdf", run / f"{key}.pptx"):
        if path.exists():
            path.unlink()

Delete exports/<run>/pdfs/<key>.pdf + exports/<run>/<key>.pptx. Keep the aggregate <slug>-<timestamp>.xlsx / .bib — those are the honest record of what the search returned. Borderline cases get a rich summary; better to over-include than to silently drop a possible match.

Worked example

scripts/regen_llm_security_batch.py ships 8 hand-authored rich summaries built exactly this way. Use it as the template for any multi-paper search. The zh-tw companion is at scripts/regen_llm_security_batch_zh_tw.py.

Don'ts

Don't end a multi-paper search by telling the user "set ANTHROPIC_API_KEY for a rich deck" — you are the LLM that could have written the summaries.
Don't treat the per-paper lightweight .pptx as the deliverable.
Don't stop after download_pdfs reports N PDFs saved — that's the start of the rich-authoring phase, not the end.
Don't invent numbers, RQs, contributions, or limitations not in the paper.
Don't fabricate URLs / DOIs / arXiv IDs — see the rule above.
Don't leave irrelevant downloads in the run directory. Keyword search matches can include off-topic papers (a "Claude code" query pulled in a Viterbi-decoder paper; "LLM code review" pulled in an object-detection literature review). After classifying papers off-topic, delete their pdfs/<key>.pdf and lightweight <key>.pptx; keep the aggregate xlsx / bib as the honest record of what the search returned.
Don't mention "Claude", "Claude Code", "AI-generated", "GPT", "Copilot", or any AI tool/model name in commit messages, PR descriptions, code comments, or documentation.

Features

Eleven pluggable sources: arxiv, semantic_scholar, openalex, pubmed, acm (Crossref-scoped), dblp, crossref (unscoped), openaire, springer (needs API key), ieee (API key or opt-in scrape), scholar (opt-in scrape). Each lives in sources/<name>/ behind a Fetcher adapter. A top-tier-venue whitelist filters results to flagship CS conferences/journals plus Nature/Science/PNAS by default; pass --all-venues to disable.
Single-paper mode: paste an arXiv ID, arXiv URL, DOI, PMID, or IEEE document URL — AutoPaperToPPT resolves it via the right source and emits the same export bundle. Useful for paper reading notes and thesis defence prep.
Local PDF mode (--pdf <path>): pass one PDF or a directory. A heuristic extractor pulls title, authors, year, arXiv ID, DOI, and the real abstract straight from each PDF's front matter (anchored on the explicit Abstract / ABSTRACT / 摘要 header, not a blind prefix). --title / --authors / --year / --venue / --doi / --arxiv-id override on a single-PDF call; on a directory, per-file extraction wins so every paper gets its own deck named after its BibTeX key.
Five exporters:
- .pptx — 16:9 widescreen, page-numbered, three rendering tiers (lightweight abstract-only · enriched-flat · thesis-style with pain-point quadrants, KPI callouts, technique-comparison tables, per-RQ result tables, contribution summary, core observation, limitations & future work, Q&A, references). All template strings are i18n'd across 14 languages: English, 繁體中文, 简体中文, 日本語, Español, Français, Deutsch, 한국어, Português, Русский, Italiano, Tiếng Việt, हिन्दी, Bahasa Indonesia.
- .xlsx — Papers sheet + Query provenance sheet, hyperlinked URL / PDF, frozen header, auto column widths. Column 5 (Source) shows the real publication venue (e.g. "IEEE Access"); column 6 (Indexed via) shows which fetcher returned the metadata (e.g. "openalex"), so the two pieces of information never collide.
- .md — full source / title / abstract list.
- .bib — collision-free citation keys, LaTeX-escaped fields.
- .json — raw payload for downstream tooling.
PPT editing toolkit: autopapertoppt.exporters.pptx_edit (inspect / update_slide / delete_slide / reorder_slides / add_slide) works against any deck the exporter produces, plus the equivalent pptx_* MCP tools so an LLM agent can iterate on a generated deck.
MCP server: 11 tools — list_sources (discovery), search, fetch_paper, fetch_pdf_text, download_pdfs, export, and the five pptx_* editing tools. Lets any MCP-aware LLM (Claude Code, Claude Desktop, Cursor, …) drive the whole workflow.
Two enrichment paths for going beyond the abstract into a true thesis-style deck:
- LLM-as-agent (no API key) — the calling LLM reads the PDF body text via fetch_pdf_text, writes a structured summary in-context, and passes it to export.
- Python pipeline (--enrich) — the CLI calls Anthropic's API itself; default model claude-opus-4-7.
Safety by default: HTTPS-only HTTP transport, per-source rate limit (token bucket), defusedxml for any XML payload, path-traversal-safe export paths, no eval / exec / pickle on user input. Scholar and IEEE scraping are off by default (env-var opt-in).

Quick start

git clone <repo-url>
cd AutoPaperToPPT
python -m venv .venv
.venv\Scripts\Activate.ps1            # Windows PowerShell
# source .venv/bin/activate           # Linux / macOS

# Install with dev extras (also pulls in MCP SDK and intelligence deps)
pip install -e .[dev]

Search arXiv and export deck + workbook + BibTeX (default for --query):

py -m autopapertoppt --query "diffusion models" --source arxiv --max 10 `
                      --out .\exports\

Fetch a single paper by URL — defaults to .pptx + .bib (the .xlsx makes less sense for one row):

py -m autopapertoppt --paper "https://arxiv.org/abs/1706.03762" `
                      --filename-stem attention `
                      --out .\exports\

Render the deck in 繁體中文:

py -m autopapertoppt --paper "https://arxiv.org/abs/1706.03762" `
                      --lang zh-tw --out .\exports\

LLM-pipeline enrichment (Python calls Anthropic itself — needs API key):

$env:ANTHROPIC_API_KEY = "sk-ant-..."
py -m autopapertoppt --paper "https://arxiv.org/abs/1706.03762" `
                      --enrich --lang zh-tw --out .\exports\

CLI flags

Flag	Purpose
`--query` / `-q`	Keywords (required unless `--paper`).
`--paper` / `-p`	arXiv ID / URL, DOI, PMID, or IEEE document URL. Mutually exclusive with `--query`.
`--source` / `-s`	Comma-separated source list. Default `arxiv`.
`--max` / `-n`	Max results per source (1..200). Default 25.
`--year-from` / `--year-to`	Inclusive year filter.
`--export` / `-e`	Formats: any of `pptx,xlsx,md,bib,json`. Default depends on mode (see below).
`--out` / `-o`	Output directory. Default `./exports`.
`--filename-stem`	Override the generated filename stem.
`--no-abstract`	Omit abstract content from exports.
`--lang` / `-l`	Deck language: one of 14 — `en`, `zh-tw`, `zh-cn`, `ja`, `es`, `fr`, `de`, `ko`, `pt`, `ru`, `it`, `vi`, `hi`, `id`. Default `en`.
`--enrich`	Download PDF + Anthropic-summarise. Needs `ANTHROPIC_API_KEY` and `[intelligence]` extra.
`--lightweight`	Force the abstract-only deck even when `ANTHROPIC_API_KEY` is set.
`--llm-model`	Override default `claude-opus-4-7` for enrichment.
`--all-venues`	Disable the top-tier whitelist (default keeps flagship CS venues + Nature / Science / PNAS / CACM / LNCS).
`--paywall-threshold`	Fraction of paywalled results that triggers the confirmation prompt. Default 0.30.
`--yes`	Skip the paywall prompt and proceed.
`--max-slides`	Per-paper slide cap (default 25; pass 0 for unlimited).
`--quiet`	Suppress per-paper printout.

Environment variables

Variable	Used by	Purpose
`ANTHROPIC_API_KEY`	`--enrich`	LLM auth. Not needed for the LLM-as-agent path over MCP.
`AUTOPAPERTOPPT_LLM_MODEL`	`--enrich`	Override the default `claude-opus-4-7`.
`AUTOPAPERTOPPT_S2_API_KEY`	Semantic Scholar	Higher rate limit. Optional.
`AUTOPAPERTOPPT_NCBI_API_KEY`	PubMed	Raises NCBI's anonymous limit (3/s) to 10/s. Optional.
`AUTOPAPERTOPPT_CONTACT_EMAIL`	PubMed, ACM, Crossref, OpenAlex	Puts requests into Crossref's polite pool.
`AUTOPAPERTOPPT_IEEE_API_KEY`	IEEE (API path)	Official IEEE Xplore API; surfaces `pdf_url` for in-scope papers.
`AUTOPAPERTOPPT_ENABLE_IEEE_SCRAPING`	IEEE (scrape path)	`=1` opts into scraping. Not needed when the API key is set.
`AUTOPAPERTOPPT_CROSSREF_PLUS_TOKEN`	ACM, Crossref	Crossref Plus subscriber token (Bearer header). Optional.
`AUTOPAPERTOPPT_SPRINGER_API_KEY`	Springer	Required; free key from https://dev.springernature.com/. Plugin is silently skipped without it.
`AUTOPAPERTOPPT_ENABLE_SCHOLAR_SCRAPING`	Google Scholar	`=1` opts into scraping. Off by default — Scholar ToS forbids scraping.
`AUTOPAPERTOPPT_PDF_COOKIES_FILE`	PDF downloader	Netscape `cookies.txt`. Off by default. Use only with publishers you have institutional rights to.
`AUTOPAPERTOPPT_LOG_LEVEL`	logger	`INFO` default; `DEBUG` for verbose tracing.

Defaults: --query → pptx,xlsx,bib. --paper → pptx,bib. Always overridable with explicit --export.

MCP server

claude mcp add autopapertoppt -- ".venv\Scripts\python.exe" -m autopapertoppt.mcp

Or write to your settings file:

{
  "mcpServers": {
    "autopapertoppt": {
      "command": ".venv\\Scripts\\python.exe",
      "args": ["-m", "autopapertoppt.mcp"]
    }
  }
}

Tools:

Tool	Purpose
`list_sources`	Enumerate every plugin + report whether each is enabled in the current env. Call this once before `search`.
`search`	Keywords → list of papers. Accepts `top_tier_only`, `min_citations`; defaults to the full no-API-key source mix.
`fetch_paper`	arXiv / DOI / PMID / IEEE identifier → single paper.
`fetch_pdf_text`	Download one PDF, return extracted body text. The MCP path to "I read the paper".
`download_pdfs`	Batch-download a papers list's PDFs into `{out_dir}/pdfs/`. Returns per-paper results keyed by BibTeX key.
`export`	Papers list + formats → writes `.pptx/.xlsx/.md/.bib/.json`. Accepts a `summary` field per paper for the rich thesis-style schema and `max_slides_per_paper` (default 25).
`pptx_inspect`	Read slide / shape structure of an existing deck.
`pptx_update_slide`	Replace `title` / `body` / `meta` (by shape name) or arbitrary shapes by index.
`pptx_delete_slide`	Remove a slide and its part relationship.
`pptx_reorder_slides`	Permute slides via `sldIdLst`.
`pptx_add_slide`	Append or insert a new title / body / meta slide.

LLM-as-agent flow (no ANTHROPIC_API_KEY needed — the LLM is the agent):

1. (optional) list_sources()                       # discover enabled plugins
2. search(keywords=..., sources=[...], top_tier_only=true)
3. (optional) download_pdfs(papers, out_dir="./exports/...")  # persist PDFs
4. fetch_pdf_text(pdf_url=paper.pdf_url)           # per paper
5. (the LLM reads body text, produces a structured `summary` dict)
6. export(papers=[{...paper, "summary": {pain_points: [...], rq_results: [...]}}],
          language="zh-tw", formats=["pptx","bib"], ...)

Full reference in docs/mcp.md.

Project layout

AutoPaperToPPT/
├── autopapertoppt/                 # main package
│   ├── core/                        # Paper / PaperSummary / RqResult / dedup / ranking / pipeline
│   ├── fetchers/                    # HTTPS-only async client, token-bucket rate limit
│   ├── exporters/                   # pptx (thesis-style) · xlsx · bib · md · json · pptx_edit · i18n
│   ├── intelligence/                # PDF fetch + Anthropic summariser  ([intelligence] extra)
│   ├── mcp/                         # FastMCP server (11 tools)
│   ├── utils/                       # logging, path safety
│   ├── cli.py                       # argparse CLI
│   └── __main__.py
├── sources/                         # plugin folders: arxiv, semantic_scholar,
│                                    #   openalex, pubmed, acm, ieee, scholar,
│                                    #   dblp, crossref, openaire, springer
├── tests/                           # pytest suite + recorded fixtures (no live HTTP)
├── docs/                            # Sphinx (14 language trees)
├── scripts/                         # one-off regen scripts
└── pyproject.toml                   # ruff, bandit, build, optional extras

Definition of Done

.venv\Scripts\python.exe -m pytest tests/
.venv\Scripts\python.exe -m ruff check .
.venv\Scripts\python.exe -m bandit -c pyproject.toml -r autopapertoppt/ sources/

The -c flag on bandit is required — without it bandit ignores the project skip config. When touching the pptx exporter, also run an overflow check (see CLAUDE.md "Slide Deck Rules").

Desktop GUI (PySide6)

A native desktop interface ships behind the [gui] extra:

pip install autopapertoppt[gui]
autopapertoppt-gui                 # or: autopapertoppt gui

The window has four tabs — Search (functional), Settings (functional, persists API keys via QSettings), Enrich, and Deck (the latter two land in a follow-up). The Windows release .exe bundles PySide6, so autopapertoppt.exe gui works without a separate Python install. UI ships in English and Traditional Chinese; deck output language remains 14-language (CLI / GUI both consume the same exporters/i18n.py table).

Full reference: docs/gui.md.

Packaging as a standalone executable

Two packagers are documented for shipping a single-file binary that runs without Python installed:

docs/packaging-pyinstaller.md — fast build (under a minute), 200–300 MB output, 2–4 s startup. Best when you iterate on the build script.
docs/packaging-nuitka.md — slow build (5–15 minutes), 80–150 MB output, sub-second startup, some bytecode protection. Best when end users run the binary many times.

Both docs cover the project-specific gotcha — the dynamic source plugins under sources/<name>/ — and ship a verified command for the CLI and the MCP server entry points.

Continuous integration & releases

Two GitHub Actions workflows live under .github/workflows/:

ci.yml runs on every push and PR to main. Matrix is Ubuntu + Windows × Python 3.12 / 3.13 / 3.14 (6 jobs). Each job runs ruff check, bandit -c pyproject.toml, and pytest.
release.yml waits for ci.yml to complete on main (workflow_run trigger). It runs only if CI succeeded. Every CI-success push to main is a release — the workflow auto-bumps the patch version in pyproject.toml, commits the bump back to main as chore: bump version to X.Y.Z, and pipelines:
1. bump-version — read current X.Y.Z from pyproject.toml, increment to X.Y.(Z+1), commit + push back to main using the workflow GITHUB_TOKEN. That push does NOT re-trigger CI (per GitHub's rule that GITHUB_TOKEN-driven pushes can't start new workflow runs), so the cycle terminates naturally.
2. publish-pypi — build sdist + wheel, twine check, twine upload via PYPI_API_TOKEN.
3. create-draft-release — open a draft GitHub release at tag v<version> with auto-generated notes.
4. build-nuitka — compile a Nuitka onefile .exe on a Windows runner, smoke-test it, and attach the binary + a .sha256 checksum to the draft release. Windows-only by design: Linux / macOS users install from PyPI, so shipping native binaries there just inflates the release page. Build cache keyed on pyproject.toml cuts warm builds from ~15 min to ~3 min.
5. publish-release — unmark the draft once the Nuitka asset is uploaded, so users never see a half-finished release.
Skipping a release. Include [skip release] anywhere in the commit message and the bump + every downstream job is skipped — use this for docs-only / typo / refactor commits that shouldn't burn a version number.

To enable PyPI publishing + release executables:

Generate a project-scoped API token at https://pypi.org/manage/account/token/.
In the GitHub repo: Settings → Secrets and variables → Actions → New repository secret. Name it PYPI_API_TOKEN and paste the token value.
Allow GitHub Actions to push to main: Settings → Actions → General → Workflow permissions → Read and write permissions. The bump commit is pushed by the workflow's GITHUB_TOKEN.
Cut releases by merging PRs into main. The pipeline takes ~3–5 min to publish to PyPI and ~10–15 min more (or ~3–5 min with a warm Nuitka cache) for the Windows binary to attach.

The publish-pypi job intentionally does NOT attach a GitHub Environment, so each run surfaces as a Release entry (with its Nuitka .exe attached) rather than as a "Deployment" sidebar widget on the repo home — releases get their own dedicated page and a Deployment entry on top would just be redundant noise.

License

See LICENSE. The arXiv API is used under arXiv's API terms of use (https://info.arxiv.org/help/api/tou.html) — observe the 1 request per 3 seconds soft limit; the bundled fetcher already enforces this via its token bucket.

Project details

These details have not been verified by PyPI

Release history Release notifications | RSS feed

0.1.7

May 17, 2026

0.1.6

May 17, 2026

0.1.5

May 17, 2026

0.1.4

May 16, 2026

This version

0.1.3

May 16, 2026

0.1.2

May 16, 2026

0.1.1

May 16, 2026

Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

autopapertoppt-0.1.3.tar.gz (153.1 kB view details)

Uploaded May 16, 2026 Source

Built Distribution

If you're not sure about the file name format, learn more about wheel file names.

The dropdown lists show the available interpreters, ABIs, and platforms. Enable javascript to be able to filter the list of wheel files.

autopapertoppt-0.1.3-py3-none-any.whl (122.1 kB view details)

Uploaded May 16, 2026 Python 3

File details

Details for the file autopapertoppt-0.1.3.tar.gz.

File metadata

Download URL: autopapertoppt-0.1.3.tar.gz
Upload date: May 16, 2026
Size: 153.1 kB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.2.0 CPython/3.12.13

File hashes

Hashes for autopapertoppt-0.1.3.tar.gz
Algorithm	Hash digest
SHA256	`3672005099fa4ebe110a27300800a6128cda272de63956cefb6b0540a5b0eb62`
MD5	`493d7da64c2122a95ebf6a722aa3c292`
BLAKE2b-256	`5de43ab5e23d0e861f5682f3c4cdfaae62b7f039c9faeb6aed115efd1347a0f4`

See more details on using hashes here.

File details

Details for the file autopapertoppt-0.1.3-py3-none-any.whl.

File metadata

Download URL: autopapertoppt-0.1.3-py3-none-any.whl
Upload date: May 16, 2026
Size: 122.1 kB
Tags: Python 3
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.2.0 CPython/3.12.13

File hashes

Hashes for autopapertoppt-0.1.3-py3-none-any.whl
Algorithm	Hash digest
SHA256	`d81f7e037bc2b919afef2446941ca0a8e486fb7918afbc25fa95f4c0a6528a74`
MD5	`e29e9ad7b5207c3c2d2fa3eb833caec5`
BLAKE2b-256	`72ce3ae1e0b2982164a26704533e9f06c7598753e0446512b93c1de26540627c`

See more details on using hashes here.

autopapertoppt 0.1.3

Navigation

Verified details

Maintainers

Unverified details

Meta

Classifiers

Project description

AutoPaperToPPT

For AI agents driving this project

What the user actually wants

Decision tree

6-step MCP workflow

Mandatory: URL / DOI verification before shipping

Mandatory: prune irrelevant downloads before shipping

Worked example

Don'ts

Features

Quick start

CLI flags

Environment variables

MCP server

Project layout

Definition of Done

Desktop GUI (PySide6)

Packaging as a standalone executable

Continuous integration & releases

License

Project details

Verified details

Maintainers

Unverified details

Meta

Classifiers

Release history Release notifications | RSS feed

Download files

Source Distribution

Built Distribution

File details

File metadata

File hashes

File details

File metadata

File hashes