A bookmark generator for PDFs

Project description

tocPDF

This project was created because most digital PDFs of textbooks do not include an outline. This command-line tool aims to resolve this by automatically generating the missing outline from the table of contents.

https://github.com/user-attachments/assets/d73711d9-c236-4716-8dba-dfb4c0851722

Installation

The package can be installed using pip:

pip install tocPDF

Available Parsers

This package supports several parsers for extracting the table of contents from the PDF. The parsers may yield different results depending on the format of the table of contents. If you are unhappy with the results of tocPDF, try a different parser to see whether the results improve. A parser can be selected using the -p option. The supported parsers are pdfplumber (the default), pypdf, and tika.

Inconsistent Offset

An additional difficulty with automatically generating outlines for PDFs is that the PDF page numbers (displayed by your PDF viewer) do not match the page numbers of the book that you are trying to outline. In addition, certain PDFs are missing some pages (usually between root chapters) compared to the book. This means that the page difference between the book and the PDF is not consistent throughout the document and needs to be recomputed between chapters. tocPDF can automatically recompute this offset by comparing the expected book page number with the page number actually found on the corresponding PDF page.
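The idea of recomputing the offset can be illustrated with a small sketch. This is not tocPDF's actual implementation: the function name recompute_offset, the printed_page_on callback and the max_shift search window are all hypothetical. The sketch assumes the offset is the PDF page showing book page 1, and that the page number printed on a PDF page can somehow be read.

```python
from typing import Callable, Optional


def recompute_offset(
    expected_book_page: int,
    offset: int,
    printed_page_on: Callable[[int], Optional[int]],
    max_shift: int = 20,
) -> int:
    """Search near `offset` for a value that puts the expected book page
    on the PDF page that really shows it (hypothetical helper).

    `offset` is the PDF page that would show book page 1, so a book page
    n is expected on PDF page n + offset - 1.
    `printed_page_on(pdf_page)` returns the book page number printed on
    that PDF page, or None if it cannot be read.
    """
    for shift in range(max_shift + 1):
        # Try the closest candidates first, in both directions.
        for candidate in (offset - shift, offset + shift):
            pdf_page = expected_book_page + candidate - 1
            if printed_page_on(pdf_page) == expected_book_page:
                return candidate
    return offset  # nothing matched: keep the previous offset


# Toy document: book pages 1-10 are on PDF pages 9-18, book pages 11-12
# are missing, and book pages 13-22 are on PDF pages 19-28.
printed = {9 + i: 1 + i for i in range(10)}
printed.update({19 + i: 13 + i for i in range(10)})

first_chapter = recompute_offset(5, 9, printed.get)    # offset unchanged
second_chapter = recompute_offset(15, 9, printed.get)  # shifted by the gap
print(first_chapter, second_chapter)
```

In the toy document two pages are missing, so the offset for the later chapter drops by two.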

Usage

This program requires three input parameters: the first and last PDF pages of the table of contents, and the PDF-book page offset. The offset is defined as the PDF page corresponding to the first book page with Arabic numerals (usually the first chapter). If your book has missing pages between chapters, add the --missing_pages flag, which dynamically adapts the page offset. This option makes the outline creation much more robust, but the execution time is somewhat longer. If your PDF is not missing any pages, you can omit this flag.
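To make the offset definition concrete, here is a minimal sketch of the arithmetic. It assumes that the first page with Arabic numerals is book page 1 and that no pages are missing. The function name book_to_pdf_page is hypothetical and is not part of tocPDF.

```python
def book_to_pdf_page(book_page: int, offset: int) -> int:
    """Map a printed book page number to a 1-based PDF page number.

    Assumes `offset` is the PDF page that shows book page 1 and that
    no pages are missing between that page and `book_page`.
    """
    if book_page < 1 or offset < 1:
        raise ValueError("page numbers are 1-based")
    return book_page + offset - 1


# With an offset of 9, book page 1 is PDF page 9 and book page 42 is PDF page 50.
print(book_to_pdf_page(1, 9), book_to_pdf_page(42, 9))
```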

$ tocPDF -h
Usage: tocPDF [OPTIONS] FILENAME

  Generates outlined PDF based on the Table of Contents.

  Example: tocPDF -s 3 -e 5 -o 9 -p pypdf -m example.pdf

Options:
  -s, --start_toc INTEGER         PDF page number of FIRST page of Table of
                                  Contents.  [required]
  -e, --end_toc INTEGER           PDF page number of LAST page of Table of
                                  Contents.  [required]
  -o, --offset INTEGER            Global page offset, defined as PDF page
                                  number of first page with arabic numerals.
                                  [required]
  -p, --parser [pdfplumber|pypdf|tika]
                                  Parsers for extracting table of contents.
                                  [default: pdfplumber]
  -m, --missing_pages             Automatically recompute offsets by verifying
                                  book page number matches expected PDF page.
  -i, --inplace                   Overwrite original PDF with new outline.
  -d, --debug                     Outputs PDF file containing the pages
                                  provided for the table of contents.
  -h, --help                      Show this message and exit.

Example

The CLI can be simply invoked with the PDF as parameter:

tocPDF example.pdf

which will interactively prompt the user for the start and end pages of the table of contents as well as the PDF-book page offset.

These parameters can be directly provided as arguments to the CLI. For instance, the following command generates the correct outlined PDF for the example document found in example_pdf/example.pdf:

tocPDF --start_toc 7 --end_toc 8 --offset 9 --parser pypdf --missing_pages example.pdf

Or equivalently:

tocPDF -s 7 -e 8 -o 9 -p pypdf -m example.pdf

By default, the outlined PDF is written to {filename}_toc.pdf. Alternatively, the -i/--inplace flag overwrites the original document with the outlined version.
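As an illustration of the naming convention, the sketch below derives the output path for a given input. It assumes that {filename} means the input file name without its extension. The helper output_path is hypothetical and is not part of tocPDF.

```python
from pathlib import Path


def output_path(pdf: str, inplace: bool = False) -> Path:
    """Return where the outlined PDF would be written (hypothetical helper)."""
    src = Path(pdf)
    if inplace:
        # Mirrors -i/--inplace: the original file is overwritten.
        return src
    # Assumption: "{filename}" is the file name without its extension.
    return src.with_name(f"{src.stem}_toc{src.suffix}")


print(output_path("example.pdf").name)                # example_toc.pdf
print(output_path("example.pdf", inplace=True).name)  # example.pdf
```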

Limitations

tocPDF does not support:

  • scanned PDFs, since it does not perform OCR
  • multi-column tables of contents

Alternative Software

If the generated outline is slightly off, I recommend jpdfbookmarks (which can be downloaded directly from SourceForge), a free tool for manually editing PDF bookmarks.
