CLI tool to extract PDF table of contents using SiliconFlow Qwen/Qwen3-VL-32B-Instruct

ebook-toc

ebook-toc is a Python CLI that extracts a book’s Table of Contents (TOC) from PDFs using a Vision-Language Model (VLM), then optionally embeds the TOC back into the PDF as bookmarks. The current implementation integrates SiliconFlow’s Qwen3‑VL‑32B‑Instruct for TOC detection and printed‑page offset estimation. It supports scanned PDFs by falling back to page images when text is unavailable.

This project is currently in Alpha and intentionally prioritizes a quick‑and‑dirty end‑to‑end path so it can be exercised and validated early. The public API and on‑disk JSON format may change before v1.0.

Note: Additional VLMs will be supported in future releases; the current integration with SiliconFlow is for evaluation and prototyping.

Prerequisites

Python 3.10+
PDM

Installation

pdm install

Usage

pdm run ebook-toc scan input.pdf --api-key sk-xxx --output toc.json --pages 20

# or process a remote PDF directly
pdm run ebook-toc scan --remote-url https://example.com/sample.pdf --api-key sk-xxx --output toc.json

# scan with GoodNotes-clean workflow (strip non-dominant-size insertions before scanning)
pdm run ebook-toc scan input.pdf --goodnotes-clean --api-key sk-xxx --output toc.json

# show CLI help
pdm run python -m ebooktoc.cli help scan

# apply an existing TOC JSON to a PDF
pdm run ebook-toc apply input.pdf output/json/input_toc.json --output output/pdf/input_with_toc.pdf

# apply with GoodNotes-clean workflow (remove non-dominant-size inserts before resolving)
pdm run ebook-toc apply input.pdf output/json/input_toc.json --goodnotes-clean --output output/pdf/input_with_toc.pdf

input.pdf: path to the source PDF.
--api-key: VLM API token (OpenAI-format; default backend is SiliconFlow).
--api-base: OpenAI-compatible API base URL (e.g. https://api.siliconflow.cn/v1 or https://api.openai.com/v1); defaults to SiliconFlow when omitted.
--model: VLM model name in OpenAI format (default Qwen/Qwen3-VL-32B-Instruct).
--output: path to the output JSON file (defaults to toc.json).
--pages: number of leading pages to analyze (default 10, use 0 for the full document).
--remote-url: optional PDF URL; when provided the local input.pdf argument can be omitted.
--timeout: VLM request timeout in seconds (default 600).
--max-pages: upper bound for automatic page expansion when no TOC is detected (default 50).
--step-pages: increase in pages per expansion step (default 10).
--no-auto-expand: disable automatic expansion and use only the initial --pages value.
--batch-size: number of pages sent to the VLM backend per request (default 10).
--max-workers: maximum number of concurrent VLM requests (default 3).
--save-json: skip the prompt and persist the TOC JSON to disk.
--apply-toc: skip the prompt and write the TOC into the PDF as bookmarks.
scan --goodnotes-clean: detect and strip non-dominant-size pages (e.g., GoodNotes inserts) before scanning, to improve printed-page offset inference and TOC stability.
apply --goodnotes-clean: detect and strip non-dominant-size pages (e.g., GoodNotes insertions), resolve bookmarks against the clean PDF, then map them back to the original PDF for writing.
--dry-run: preview detected TOC entries without creating files.
--filter-contains: keep only entries whose content includes the given substring (case-insensitive).
--filter-regex: keep only entries whose content matches the given regular expression (case-insensitive).
--fuzzy-dedup: fuzzy deduplication threshold in [0.0, 1.0] (default 0.85, set to 0.0 to disable fuzzy matching).
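To make the --fuzzy-dedup threshold more concrete, here is a minimal sketch of threshold-based near-duplicate merging. It uses difflib.SequenceMatcher from the standard library as the similarity measure; this is an assumption for illustration, since the similarity metric the tool actually uses is not stated here, and fuzzy_dedup is a hypothetical helper name.

```python
from difflib import SequenceMatcher


def fuzzy_dedup(entries: list[str], threshold: float = 0.85) -> list[str]:
    """Keep the first occurrence of each group of near-duplicate strings."""
    kept: list[str] = []
    for text in entries:
        normalized = text.casefold().strip()
        is_duplicate = False
        for existing in kept:
            other = existing.casefold().strip()
            if normalized == other:
                is_duplicate = True  # exact duplicates are always merged
                break
            # A threshold of 0.0 disables fuzzy matching (assumed semantics).
            if threshold > 0.0:
                ratio = SequenceMatcher(None, normalized, other).ratio()
                if ratio >= threshold:
                    is_duplicate = True
                    break
        if not is_duplicate:
            kept.append(text)
    return kept


lines = ["Chapter 1: Introduction", "Chapter 1 : Introduction", "Chapter 2: Methods"]
print(fuzzy_dedup(lines))
```

With the default threshold the two spellings of Chapter 1 collapse into one entry, while a threshold of 0.0 keeps all three lines because only exact matches are merged.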

Performance tuning / recommended settings

For large or scanned PDFs (e.g. 300–800 pages), you can tune a few flags for better throughput and robustness:

SiliconFlow / generous rate limits:
- --batch-size 10 (default) and --max-workers 3 (default) work well on a 4‑core CPU.
- Keep --fuzzy-dedup 0.85 (default) to aggressively merge near-duplicate TOC lines from the VLM.
Strict or per-request–billed OpenAI-style backends:
- Consider --max-workers 1 or 2 to avoid hitting rate limits.
- If each request is expensive, prefer slightly larger --batch-size (e.g. 8–12) over more workers.
Very large PDFs (500+ pages):
- Use --pages 0 to allow scanning the entire document in one logical window, or combine --pages, --max-pages, and --step-pages for incremental expansion.
- When your PDF was edited heavily in GoodNotes/Notability, add --goodnotes-clean so that non‑dominant‑size pages are removed before scanning.
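The trade-off between --batch-size and --max-workers can be sketched as follows. This is an illustration only, not the tool's implementation: make_batches and run_batches are hypothetical names, and send_batch stands in for whatever function issues a single VLM request.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, TypeVar

T = TypeVar("T")


def make_batches(page_indices: list[int], batch_size: int = 10) -> list[list[int]]:
    """Split page indices into consecutive batches of at most batch_size pages."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    return [
        page_indices[start:start + batch_size]
        for start in range(0, len(page_indices), batch_size)
    ]


def run_batches(
    batches: list[list[int]],
    send_batch: Callable[[list[int]], T],
    max_workers: int = 3,
) -> list[T]:
    """Send batches concurrently; results come back in batch order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(send_batch, batches))


batches = make_batches(list(range(25)), batch_size=10)
# len is used as a stand-in for a real request function.
print(run_batches(batches, len, max_workers=3))
```

Under this model a scan issues ceil(pages / batch_size) requests, of which at most max_workers are in flight at once. Raising the batch size reduces the request count, while raising the worker count only increases concurrency.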

Example for a big, GoodNotes-heavy textbook on SiliconFlow:

pdm run ebook-toc scan "book.pdf" \
  --api-key sk-xxx \
  --pages 0 \
  --batch-size 10 \
  --max-workers 3 \
  --fuzzy-dedup 0.85 \
  --goodnotes-clean \
  --output output/json/book_toc.json

Output

The CLI writes a JSON list containing the detected table-of-contents items:

[
  {"page": 4, "target_page": 5, "content": "Chapter 1: Introduction"},
  {"page": 4, "target_page": 15, "content": "Chapter 2: Methods"}
]

page records where the TOC text was found, whereas target_page (if present) is the destination page referenced by the entry. The CLI prints a status message before scanning and reports the number of entries upon completion. The apply command reuses this output to create PDF bookmarks, and other tools can consume it as metadata.
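A saved file can be read back with a few lines of Python. The sketch below assumes the file is either a bare list of entries, as in the sample above, or an object that carries the entries under a "toc" key next to other fields; the second shape is a guess based on the field names mentioned elsewhere in this description, and load_toc_entries is a hypothetical helper.

```python
import json

SAMPLE = """
[
  {"page": 4, "target_page": 5, "content": "Chapter 1: Introduction"},
  {"page": 4, "content": "Appendix"}
]
"""


def load_toc_entries(raw: str) -> list[dict]:
    """Return the TOC entries from a JSON string, tolerating both assumed shapes."""
    data = json.loads(raw)
    items = data.get("toc", []) if isinstance(data, dict) else data
    return [item for item in items if isinstance(item, dict) and "content" in item]


for entry in load_toc_entries(SAMPLE):
    # target_page is optional, so fall back to a placeholder when it is absent.
    target = entry.get("target_page", "n/a")
    print(f'{entry["content"]} (found on page {entry.get("page")}, target {target})')
```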

If no entries are detected within the initial page window, the tool automatically expands the range by --step-pages (unless --no-auto-expand is set) until it reaches --max-pages. Pages are submitted to the VLM backend in batches (default 10 pages per request; configurable with --batch-size), the results of each batch are deduplicated, and the final list can be narrowed further with --filter-contains / --filter-regex.
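The expansion behaviour can be pictured as a sequence of growing page windows. The sketch below lists the windows a scan might try with the documented defaults; expansion_windows is a hypothetical helper, and clamping the last window to --max-pages is an assumption. A real scan would stop at the first window that yields entries.

```python
def expansion_windows(
    pages: int = 10,
    step_pages: int = 10,
    max_pages: int = 50,
    auto_expand: bool = True,
) -> list[int]:
    """Return the successive page-window sizes a scan would try."""
    if pages == 0:
        return [0]  # 0 means the full document, so there is nothing to expand
    windows = [pages]
    if not auto_expand:
        return windows  # corresponds to --no-auto-expand
    current = pages
    while current < max_pages:
        current = min(current + step_pages, max_pages)
        windows.append(current)
    return windows


print(expansion_windows())                       # documented defaults
print(expansion_windows(pages=20, step_pages=15))
print(expansion_windows(auto_expand=False))
```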

During scanning the CLI also samples a few PDF pages with the VLM to infer the offset between the PDF index and the printed page number. The inferred offset is shown in the terminal (and stored in the JSON) so that bookmarks align with the book’s logical pagination, even when the document contains unnumbered front matter.
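As a rough illustration of the offset idea: if a few sampled pages yield (PDF index, printed page number) pairs, the most common difference is a reasonable offset estimate. The sign convention (offset = PDF index minus printed page), the majority vote, and both function names are assumptions made for this sketch, not a description of the tool's internals.

```python
from collections import Counter


def estimate_offset(samples: list[tuple[int, int]]) -> int:
    """Estimate the offset from (pdf_index, printed_page) pairs by majority vote."""
    if not samples:
        return 0  # no evidence, assume the numbering already lines up
    differences = Counter(pdf_index - printed for pdf_index, printed in samples)
    return differences.most_common(1)[0][0]


def printed_to_pdf(printed_page: int, offset: int) -> int:
    """Convert a printed page number to a PDF page index."""
    return printed_page + offset


# Three samples agree on an offset of 12; one misread page disagrees.
samples = [(15, 3), (40, 28), (90, 78), (120, 107)]
offset = estimate_offset(samples)
print(offset, printed_to_pdf(5, offset))
```

Voting across several samples keeps a single misread page number from skewing the result.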

After the scan finishes, the CLI prompts whether to save the TOC JSON or embed bookmarks into the PDF (you can skip the prompts with --save-json / --apply-toc). By default JSON files go to output/json/, PDF copies with bookmarks go to output/pdf/, using names derived from the source document. If you opt out of saving JSON, the entries are printed directly to the terminal; if you run with --dry-run, the tool only prints a preview list and leaves the file system untouched. The saved JSON also contains lightweight page fingerprints and a canonical page_map (logical page → PDF page) computed from dominant page dimensions, so apply can align bookmarks even if apps like GoodNotes inserted extra pages later.

You can rerun bookmark creation later with ebook-toc apply, passing the previously saved JSON file.
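For readers who want to post-process the JSON themselves, the sketch below converts entries into the [level, title, page] rows that PyMuPDF's Document.set_toc expects (pages are 1-based). It is not the code behind ebook-toc apply: to_outline is a hypothetical helper, every entry is placed at level 1 because the sample JSON shows no hierarchy, and adding the offset to target_page is an assumed convention.

```python
def to_outline(entries: list[dict], offset: int = 0) -> list[list]:
    """Convert TOC entries into [level, title, page] rows with 1-based pages."""
    outline = []
    for entry in entries:
        target = entry.get("target_page")
        if target is None:
            continue  # no destination page, so there is nothing to link to
        outline.append([1, entry["content"], target + offset])
    return outline


entries = [
    {"page": 4, "target_page": 5, "content": "Chapter 1: Introduction"},
    {"page": 4, "target_page": 15, "content": "Chapter 2: Methods"},
    {"page": 4, "content": "Entry without a target"},
]
outline = to_outline(entries, offset=12)
print(outline)

# With PyMuPDF installed, the rows could then be written roughly like this:
#   import fitz
#   doc = fitz.open("input.pdf")
#   doc.set_toc(outline)
#   doc.save("output/pdf/input_with_toc.pdf")
```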

How It Works

Input handling (ebooktoc/cli.py): validates local files or downloads remote PDFs; optionally creates a GoodNotes‑cleaned copy by keeping only dominant page sizes.
Page extraction (ebooktoc/vlm_api.py): extracts per‑page text (or renders JPEG when text is empty), batches VLM requests, and parses JSON robustly.
TOC parsing (ebooktoc/toc_parser.py): normalizes entries, deduplicates, filters, and infers missing trailing numeric targets.
Offset and mapping (ebooktoc/fingerprints.py, ebooktoc/cli.py): computes dominant dimensions, builds a canonical index map (logical → PDF), and estimates printed‑page offsets by sampling pages with the VLM; stores toc, page_offset, fingerprints, and page_map in JSON.
Apply phase (ebooktoc/pdf_writer.py, ebooktoc/cli.py): rebuilds the canonical map, refines the offset, resolves target pages, and writes bookmarks.
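The dominant-dimension step can be sketched in a few lines. The code below takes the most common rounded page size as dominant and maps each logical page (counted over dominant-size pages only) to its position in the PDF. Rounding to whole points, the 1-based numbering, and both function names are assumptions for this illustration; the logic in ebooktoc/fingerprints.py may differ.

```python
from collections import Counter


def dominant_size(sizes: list[tuple[float, float]]) -> tuple[int, int]:
    """Return the most common (width, height), rounded to whole points."""
    if not sizes:
        raise ValueError("at least one page size is required")
    rounded = [(round(width), round(height)) for width, height in sizes]
    return Counter(rounded).most_common(1)[0][0]


def build_page_map(sizes: list[tuple[float, float]]) -> dict[int, int]:
    """Map logical page numbers to PDF page numbers (both 1-based)."""
    target = dominant_size(sizes)
    page_map: dict[int, int] = {}
    logical = 0
    for pdf_page, (width, height) in enumerate(sizes, start=1):
        if (round(width), round(height)) == target:
            logical += 1
            page_map[logical] = pdf_page
    return page_map


# Page 3 has a different size, as an inserted note page might.
sizes = [(595.2, 842.0), (595.0, 842.0), (768.0, 1024.0), (595.0, 841.9)]
print(build_page_map(sizes))
```

In this example the third PDF page is skipped, so logical page 3 maps to PDF page 4.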

Primary modules:

ebooktoc/cli.py: CLI commands (scan, apply), coordination, prompts, and IO
ebooktoc/vlm_api.py: batching, VLM calls, JSON parsing, offset estimation
ebooktoc/toc_parser.py: TOC normalization, deduplication, filtering, heuristics
ebooktoc/fingerprints.py: dominant‑dimension detection and canonical index mapping
ebooktoc/pdf_writer.py: bookmark embedding and result reporting
ebooktoc/utils.py: filesystem and small helpers

Development Guide

Environment setup
- Install PDM, then run pdm install -G test.
- Python 3.10+ is required.
Commands
- Run locally: pdm run ebook-toc ...
- Tests: pdm run pytest (coverage is enabled by default via pyproject)
CI
- GitHub Actions runs tests across Python 3.10–3.14 with PDM.
- Coverage is uploaded to Codecov; see badges above.
Style and structure
- Follow PEP 8; prefer Path, type hints, and shared rich.Console output.
- Keep generated artifacts under output/json and output/pdf.
Security and privacy
- Never commit API keys or proprietary PDFs; pass keys via --api-key or env vars.
- Extend .gitignore for new caches or artifacts before adding tools that persist them.

Project Status

Alpha quality. The current solution is intentionally quick‑and‑dirty to validate the end‑to‑end flow with real PDFs.
API and JSON schema may evolve; minor breaking changes are possible before v1.0.
The default backend is SiliconFlow, but any OpenAI-format VLM can be used via --api-base and --model.

Roadmap / TODO

Continue improving support for additional OpenAI-format VLM backends
Expand and harden the test suite
Improve and extend developer documentation
Add an interactive TUI for local use

Acknowledgements

Powered by PyMuPDF for PDF parsing and bookmark embedding.
Uses SiliconFlow Qwen3‑VL‑32B‑Instruct for TOC detection and printed‑page sampling.

License

License to be determined before v1.0. Until then, please consider this code provided for evaluation and prototyping.

Release history

This version

0.0.1b0 pre-release

Nov 26, 2025

0.0.1a0 pre-release

Nov 13, 2025

Download files

Source Distribution

ebook_toc-0.0.1b0.tar.gz (2.4 MB)

Uploaded Nov 26, 2025 Source

Built Distribution

ebook_toc-0.0.1b0-py3-none-any.whl (41.0 kB)

Uploaded Nov 26, 2025 Python 3

File details

Details for the file ebook_toc-0.0.1b0.tar.gz.

File metadata

Download URL: ebook_toc-0.0.1b0.tar.gz
Upload date: Nov 26, 2025
Size: 2.4 MB
Tags: Source
Uploaded using Trusted Publishing? Yes
Uploaded via: twine/6.1.0 CPython/3.13.7

File hashes

Hashes for ebook_toc-0.0.1b0.tar.gz
Algorithm	Hash digest
SHA256	`3676968e40a4c9ec263dd3b80b4a926085572109bef7f04f81d56b42fb75f6e7`
MD5	`7e59cdcb56f78354b350db9850381065`
BLAKE2b-256	`f35b9d5307b305bee3e9afa51d4b377adc497647d4af9a2b3b53ea9eb05cd044`

Provenance

The following attestation bundles were made for ebook_toc-0.0.1b0.tar.gz:

Publisher: publish-pypi.yml on pi-dal/ebook-toc

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

Statement:
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: ebook_toc-0.0.1b0.tar.gz
- Subject digest: 3676968e40a4c9ec263dd3b80b4a926085572109bef7f04f81d56b42fb75f6e7
- Sigstore transparency entry: 725698470
- Sigstore integration time: Nov 26, 2025
Source repository:
- Permalink: pi-dal/ebook-toc@3be633a3635052fcb2cb036b1178634d2a6524f0
- Branch / Tag: refs/tags/0.0.1-beta.0
- Owner: https://github.com/pi-dal
- Access: public
Publication detail:
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: publish-pypi.yml@3be633a3635052fcb2cb036b1178634d2a6524f0
- Trigger Event: push

File details

Details for the file ebook_toc-0.0.1b0-py3-none-any.whl.

File metadata

Download URL: ebook_toc-0.0.1b0-py3-none-any.whl
Upload date: Nov 26, 2025
Size: 41.0 kB
Tags: Python 3
Uploaded using Trusted Publishing? Yes
Uploaded via: twine/6.1.0 CPython/3.13.7

File hashes

Hashes for ebook_toc-0.0.1b0-py3-none-any.whl
Algorithm	Hash digest
SHA256	`1e3f6913bdd43498ff923bc4fb224ebd3d19824903fe9e68b94e0ac3e019d878`
MD5	`a326692c05d7d48aba94143fd3599382`
BLAKE2b-256	`e6607e6935d1bde50cc091b8988a02accd725e14899ec48b58e3c137e56bfd5a`

Provenance

The following attestation bundles were made for ebook_toc-0.0.1b0-py3-none-any.whl:

Publisher: publish-pypi.yml on pi-dal/ebook-toc

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

Statement:
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: ebook_toc-0.0.1b0-py3-none-any.whl
- Subject digest: 1e3f6913bdd43498ff923bc4fb224ebd3d19824903fe9e68b94e0ac3e019d878
- Sigstore transparency entry: 725698489
- Sigstore integration time: Nov 26, 2025
Source repository:
- Permalink: pi-dal/ebook-toc@3be633a3635052fcb2cb036b1178634d2a6524f0
- Branch / Tag: refs/tags/0.0.1-beta.0
- Owner: https://github.com/pi-dal
- Access: public
Publication detail:
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: publish-pypi.yml@3be633a3635052fcb2cb036b1178634d2a6524f0
- Trigger Event: push
