CLI tool to extract PDF table of contents using SiliconFlow Qwen/Qwen3-VL-32B-Instruct
ebook-toc
ebook-toc is a Python CLI that extracts a book’s Table of Contents (TOC) from PDFs using a Vision-Language Model (VLM), then optionally embeds the TOC back into the PDF as bookmarks. The current implementation integrates SiliconFlow’s Qwen3‑VL‑32B‑Instruct for TOC detection and printed‑page offset estimation. It supports scanned PDFs by falling back to page images when text is unavailable.
This project is currently in Alpha and intentionally prioritizes a quick‑and‑dirty end‑to‑end path so it can be exercised and validated early. The public API and on‑disk JSON format may change before v1.0.
Note: Additional VLMs will be supported in future releases; the current integration with SiliconFlow is for evaluation and prototyping.
Prerequisites
- Python 3.10+
- PDM
Installation
pdm install
Usage
pdm run ebook-toc scan input.pdf --api-key sk-xxx --output toc.json --pages 20
# or process a remote PDF directly
pdm run ebook-toc scan --remote-url https://example.com/sample.pdf --api-key sk-xxx --output toc.json
# scan with GoodNotes-clean workflow (strip non-dominant-size insertions before scanning)
pdm run ebook-toc scan input.pdf --goodnotes-clean --api-key sk-xxx --output toc.json
# show CLI help
pdm run python -m ebooktoc.cli help scan
# apply an existing TOC JSON to a PDF
pdm run ebook-toc apply input.pdf output/json/input_toc.json --output output/pdf/input_with_toc.pdf
# apply with GoodNotes-clean workflow (remove non-dominant-size inserts before resolving)
pdm run ebook-toc apply input.pdf output/json/input_toc.json --goodnotes-clean --output output/pdf/input_with_toc.pdf
- input.pdf: path to the source PDF.
- --api-key: VLM API token (OpenAI-format; default backend is SiliconFlow).
- --api-base: OpenAI-compatible API base URL (e.g. https://api.siliconflow.cn/v1 or https://api.openai.com/v1); defaults to SiliconFlow when omitted.
- --model: VLM model name in OpenAI format (default Qwen/Qwen3-VL-32B-Instruct).
- --output: path to the output JSON file (defaults to toc.json).
- --pages: number of leading pages to analyze (default 10, use 0 for the full document).
- --remote-url: optional PDF URL; when provided, the local input.pdf argument can be omitted.
- --timeout: VLM request timeout in seconds (default 600).
- --max-pages: upper bound for automatic page expansion when no TOC is detected (default 50).
- --step-pages: increase in pages per expansion step (default 10).
- --no-auto-expand: disable automatic expansion and use only the initial --pages value.
- --batch-size: number of pages sent to the VLM backend per request (default 10).
- --max-workers: maximum number of concurrent VLM requests (default 3).
- --save-json: skip the prompt and persist the TOC JSON to disk.
- --apply-toc: skip the prompt and write the TOC into the PDF as bookmarks.
- scan --goodnotes-clean: detect and strip non-dominant-size pages (e.g., GoodNotes inserts) before scanning, to improve printed-page offset inference and TOC stability.
- apply --goodnotes-clean: detect and strip non-dominant-size pages (e.g., GoodNotes insertions), resolve bookmarks against the clean PDF, then map them back to the original PDF for writing.
- --dry-run: preview detected TOC entries without creating files.
- --filter-contains: keep only entries whose content includes the given substring (case-insensitive).
- --filter-regex: keep only entries whose content matches the given regular expression (case-insensitive).
- --fuzzy-dedup: fuzzy deduplication threshold in [0.0, 1.0] (default 0.85, set to 0.0 to disable fuzzy matching).
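To make the --fuzzy-dedup, --filter-contains, and --filter-regex options more concrete, here is a minimal sketch of how such post-processing could work. It is not the code in ebooktoc/toc_parser.py: the function names are hypothetical, the similarity measure (difflib.SequenceMatcher) is an assumption, and so is the choice to still drop exact duplicates when the threshold is 0.0.

```python
import re
from difflib import SequenceMatcher


def fuzzy_dedup(entries, threshold=0.85):
    """Drop entries whose content is (nearly) identical to one already kept.

    Hypothetical helper: the real tool may use a different similarity measure.
    """
    kept = []
    for entry in entries:
        text = entry["content"].casefold()
        is_duplicate = any(
            text == other["content"].casefold()
            or (
                threshold > 0.0  # 0.0 disables fuzzy matching (assumption: exact matches still merge)
                and SequenceMatcher(None, text, other["content"].casefold()).ratio() >= threshold
            )
            for other in kept
        )
        if not is_duplicate:
            kept.append(entry)
    return kept


def filter_entries(entries, contains=None, regex=None):
    """Keep entries matching a case-insensitive substring and/or regular expression."""
    pattern = re.compile(regex, re.IGNORECASE) if regex else None
    result = []
    for entry in entries:
        text = entry["content"]
        if contains and contains.casefold() not in text.casefold():
            continue
        if pattern and not pattern.search(text):
            continue
        result.append(entry)
    return result
```

With a threshold of 0.85, "Chapter 1: Introduction" and "Chapter 1 : Introduction" would be merged in this sketch, while "Chapter 2: Methods" would be kept as a separate entry.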
Performance tuning / recommended settings
For large or scanned PDFs (e.g. 300–800 pages), you can tune a few flags for better throughput and robustness:
- SiliconFlow / generous rate limits:
  - --batch-size 10 (default) and --max-workers 3 (default) work well on a 4‑core CPU.
  - Keep --fuzzy-dedup 0.85 (default) to aggressively merge near-duplicate TOC lines from the VLM.
- Strict or per-request-billed OpenAI-style backends:
  - Consider --max-workers 1 or 2 to avoid hitting rate limits.
  - If each request is expensive, prefer slightly larger --batch-size (e.g. 8–12) over more workers.
- Very large PDFs (500+ pages):
  - Use --pages 0 to allow scanning the entire document in one logical window, or combine --pages, --max-pages, and --step-pages for incremental expansion.
  - When your PDF was edited heavily in GoodNotes/Notability, add --goodnotes-clean so that non‑dominant‑size pages are removed before scanning.
Example for a big, GoodNotes-heavy textbook on SiliconFlow:
pdm run ebook-toc scan "book.pdf" \
--api-key sk-xxx \
--pages 0 \
--batch-size 10 \
--max-workers 3 \
--fuzzy-dedup 0.85 \
--goodnotes-clean \
--output output/json/book_toc.json
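The interplay between --batch-size and --max-workers can be pictured with the following sketch. It is only an illustration under assumptions: send_batch stands in for whatever function actually calls the VLM backend, and the use of a thread pool is a guess at the concurrency model, not a description of ebooktoc/vlm_api.py.

```python
from concurrent.futures import ThreadPoolExecutor


def make_batches(page_numbers, batch_size=10):
    """Split page numbers into consecutive chunks of at most batch_size pages."""
    return [
        page_numbers[i:i + batch_size]
        for i in range(0, len(page_numbers), batch_size)
    ]


def scan_batches(page_numbers, send_batch, batch_size=10, max_workers=3):
    """Send each batch through send_batch (hypothetical VLM call) with bounded concurrency."""
    batches = make_batches(page_numbers, batch_size)
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map yields results in submission order, so entries stay in page order.
        results = list(pool.map(send_batch, batches))
    return [entry for batch_result in results for entry in batch_result]
```

In this model, a larger batch size means fewer requests, while more workers means more requests in flight at once, which is why a strict rate limit argues for lowering --max-workers first.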
Output
The CLI writes a JSON list containing the detected table-of-contents items:
[
{"page": 4, "target_page": 5, "content": "Chapter 1: Introduction"},
{"page": 4, "target_page": 15, "content": "Chapter 2: Methods"}
]
page records where the TOC text was found, whereas target_page (if present) captures the destination page referenced in the entry. The CLI prints a status message before scanning and reports the number of entries upon completion. The apply command reuses this output to create PDF bookmarks; future extensions could derive other metadata from it.
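As a small illustration of consuming this output, the sketch below parses the list form shown above and renders one line per entry. The function name format_toc is hypothetical, and the sketch assumes the plain list layout from the example; a saved file may carry additional fields.

```python
import json


def format_toc(raw_json):
    """Turn a JSON list of TOC entries into human-readable lines."""
    entries = json.loads(raw_json)
    lines = []
    for entry in entries:
        target = entry.get("target_page")  # optional field
        suffix = f" -> p. {target}" if target is not None else ""
        lines.append(f"{entry['content']}{suffix} (listed on page {entry['page']})")
    return lines
```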
If no entries are detected within the initial page window, the tool automatically expands the range by --step-pages (unless --no-auto-expand is set) until it reaches --max-pages. Each batch submitted to the VLM backend (default 10 pages; configurable with --batch-size) is deduplicated, and the results can be further narrowed via --filter-contains / --filter-regex.
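The expansion behaviour described above can be sketched roughly as follows. This is an assumption-laden outline rather than the actual implementation: detect is a hypothetical callable that scans the first N pages and returns the entries found, and the handling of --pages 0 (scan everything once) is inferred from the flag description.

```python
def scan_with_expansion(detect, total_pages, pages=10, max_pages=50,
                        step_pages=10, auto_expand=True):
    """Return (entries, pages_scanned), widening the window until entries appear.

    detect(n) is a hypothetical function that scans the first n pages.
    """
    window = total_pages if pages == 0 else min(pages, total_pages)
    limit = min(max_pages, total_pages)
    step = max(1, step_pages)  # guard against a non-positive step
    while True:
        entries = detect(window)
        if entries or not auto_expand or window >= limit:
            return entries, window
        window = min(window + step, limit)
```

With the default values, a document whose TOC starts on page 25 would be scanned with windows of 10, 20, and then 30 pages before entries are returned.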
During scanning the CLI also samples a few PDF pages with the VLM to infer the offset between the PDF index and the printed page number. The inferred offset is shown in the terminal (and stored in the JSON) so that bookmarks align with the book’s logical pagination, even when the document contains unnumbered front matter.
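One plausible way to turn sampled pages into an offset is a majority vote over the differences, sketched below. The function names, the sign convention (offset = PDF page minus printed page), and the fallback to 0 are all assumptions for illustration; the tool's own estimation logic may differ.

```python
from collections import Counter


def estimate_offset(samples):
    """Estimate the PDF-to-printed page offset from (pdf_page, printed_page) samples.

    Samples where the printed number could not be read are passed as None
    and ignored. Returns 0 when nothing usable was sampled (assumption).
    """
    diffs = Counter(
        pdf_page - printed_page
        for pdf_page, printed_page in samples
        if printed_page is not None
    )
    if not diffs:
        return 0
    return diffs.most_common(1)[0][0]


def to_pdf_page(printed_page, offset):
    """Convert a printed page number into a PDF page number."""
    return printed_page + offset
```

For example, if PDF pages 15, 20, and 31 show printed numbers 3, 8, and 19, the estimated offset is 12, so a TOC entry pointing at printed page 5 would resolve to PDF page 17.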
After the scan finishes, the CLI prompts whether to save the TOC JSON or embed bookmarks into the PDF (you can skip the prompts with --save-json / --apply-toc). By default JSON files go to output/json/, PDF copies with bookmarks go to output/pdf/, using names derived from the source document. If you opt out of saving JSON, the entries are printed directly to the terminal; if you run with --dry-run, the tool only prints a preview list and leaves the file system untouched. The saved JSON also contains lightweight page fingerprints and a canonical page_map (logical page → PDF page) computed from dominant page dimensions, so apply can align bookmarks even if apps like GoodNotes inserted extra pages later.
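The idea behind a page_map computed from dominant page dimensions can be sketched like this. It is a simplified guess at the approach, not the code in ebooktoc/fingerprints.py: build_page_map and its tolerance parameter are hypothetical, and page sizes are assumed to be given as (width, height) pairs in points.

```python
from collections import Counter


def build_page_map(page_sizes, tolerance=1.0):
    """Map logical pages (dominant-size pages only) to 1-based PDF page numbers.

    Pages whose size differs from the most common size are treated as
    inserted pages and skipped.
    """
    if not page_sizes:
        return {}
    # Bucket sizes so tiny rounding differences do not split the dominant group.
    buckets = [(round(w / tolerance), round(h / tolerance)) for w, h in page_sizes]
    dominant = Counter(buckets).most_common(1)[0][0]
    page_map = {}
    logical = 0
    for pdf_page, bucket in enumerate(buckets, start=1):
        if bucket == dominant:
            logical += 1
            page_map[logical] = pdf_page
    return page_map
```

Given three A4-sized pages with one differently sized page inserted at position 3, the resulting map is {1: 1, 2: 2, 3: 4}: logical page 3 now lives on PDF page 4.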
You can rerun bookmark creation later with ebook-toc apply, passing the previously saved JSON file.
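For readers who want to post-process the JSON themselves, the sketch below converts entries into the [level, title, page] rows that PyMuPDF's Document.set_toc() accepts. The helper name to_outline, the flat single-level hierarchy, and the way the offset is applied are assumptions for illustration and are not taken from ebooktoc/pdf_writer.py.

```python
def to_outline(entries, offset=0, page_count=None):
    """Build a flat outline of [level, title, page] rows from TOC entries.

    Entries without a target_page, or whose resolved page falls outside
    the document, are skipped.
    """
    outline = []
    for entry in entries:
        target = entry.get("target_page")
        if target is None:
            continue
        pdf_page = target + offset  # assumed convention: PDF page = printed page + offset
        if pdf_page < 1 or (page_count is not None and pdf_page > page_count):
            continue
        outline.append([1, entry["content"], pdf_page])  # level 1 = top-level bookmark
    return outline

# With PyMuPDF installed, a list of this shape can be passed to Document.set_toc().
```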
How It Works
- Input handling (ebooktoc/cli.py): validates local files or downloads remote PDFs; optionally creates a GoodNotes‑cleaned copy by keeping only dominant page sizes.
- Page extraction (ebooktoc/vlm_api.py): extracts per‑page text (or renders JPEG when text is empty), batches VLM requests, and parses JSON robustly.
- TOC parsing (ebooktoc/toc_parser.py): normalizes entries, deduplicates, filters, and infers missing trailing numeric targets.
- Offset and mapping (ebooktoc/fingerprints.py, ebooktoc/cli.py): computes dominant dimensions, builds a canonical index map (logical → PDF), and estimates printed‑page offsets by sampling pages with the VLM; stores toc, page_offset, fingerprints, and page_map in JSON.
- Apply phase (ebooktoc/pdf_writer.py, ebooktoc/cli.py): rebuilds the canonical map, refines the offset, resolves target pages, and writes bookmarks.
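The TOC parsing step mentions inferring missing trailing numeric targets. One possible reading of that heuristic is sketched below: when an entry has no target_page but its text ends with a page number after dot leaders or wide spacing, that number is taken as the target. The regular expression, the function name infer_target, and the rule requiring at least two separator characters are assumptions chosen to avoid misreading titles such as "Chapter 2".

```python
import re

# A title, then at least two separator characters (spaces, dots, leaders),
# then a 1-4 digit page number at the very end of the line.
_TRAILING_PAGE = re.compile(r"^(?P<title>.+?)[\s.·…]{2,}(?P<page>\d{1,4})$")


def infer_target(entry):
    """Fill in target_page from a trailing number when the field is missing."""
    if entry.get("target_page") is not None:
        return entry
    match = _TRAILING_PAGE.match(entry["content"].strip())
    if not match:
        return entry
    return {
        **entry,
        "content": match["title"].rstrip(),
        "target_page": int(match["page"]),
    }
```

Under these assumptions, "Chapter 3: Results ..... 42" becomes a "Chapter 3: Results" entry targeting page 42, while "Chapter 2" is left untouched because a single space is not treated as a leader.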
Primary modules:
- ebooktoc/cli.py: CLI commands (scan, apply), coordination, prompts, and IO
- ebooktoc/vlm_api.py: batching, VLM calls, JSON parsing, offset estimation
- ebooktoc/toc_parser.py: TOC normalization, deduplication, filtering, heuristics
- ebooktoc/fingerprints.py: dominant‑dimension detection and canonical index mapping
- ebooktoc/pdf_writer.py: bookmark embedding and result reporting
- ebooktoc/utils.py: filesystem and small helpers
Development Guide
- Environment setup
  - Install PDM, then run pdm install -G test.
  - Python 3.10+ is required.
- Commands
  - Run locally: pdm run ebook-toc ...
  - Tests: pdm run pytest (coverage is enabled by default via pyproject)
- CI
  - GitHub Actions runs tests across Python 3.10–3.14 with PDM.
  - Coverage is uploaded to Codecov; see badges above.
- Style and structure
  - Follow PEP 8; prefer Path, type hints, and shared rich.Console output.
  - Keep generated artifacts under output/json and output/pdf.
- Security and privacy
  - Never commit API keys or proprietary PDFs; pass keys via --api-key or env vars.
  - Extend .gitignore for new caches or artifacts before adding tools that persist them.
Project Status
- Alpha quality. The current solution is intentionally quick‑and‑dirty to validate the end‑to‑end flow with real PDFs.
- API and JSON schema may evolve; minor breaking changes are possible before v1.0.
- The default backend is SiliconFlow, but any OpenAI-format VLM can be used via --api-base and --model.
Roadmap / TODO
- Continue improving support for additional OpenAI-format VLM backends
- Expand and harden the test suite
- Improve and extend developer documentation
- Add an interactive TUI for local use
Acknowledgements
- Powered by PyMuPDF for PDF parsing and bookmark embedding.
- SiliconFlow Qwen3‑VL‑32B‑Instruct for TOC detection and printed‑page sampling.
License
- License to be determined before v1.0. Until then, please consider this code provided for evaluation and prototyping.