A bookmark generator for pdf

tocPDF

This project was created because most digital PDFs of textbooks do not include an outline. This command-line tool aims to resolve this by automatically generating the missing outline from the table of contents.

Installation

PIP

The package can be downloaded using pip:

pip install tocPDF

Manually

It can be installed manually by first cloning the repository:

git clone https://github.com/kszenes/tocPDF.git

Then navigate into the base directory (toc-pdf-package) of the project and install the package using pip:

pip install .

This will fetch all the necessary dependencies for running the program as well as install the command line tool.

Inconsistent Offset

The main difficulty with automatically generating outlines for PDFs stems from the fact that the PDF page numbers (displayed by your PDF viewer) do not match the page numbers of the book that you are trying to outline. In addition, certain PDFs will be missing some pages (usually between root chapters) compared to the book. This means that the page difference between the book and the PDF is not consistent throughout the document and needs to be recomputed between chapters. tocPDF can automatically recompute this offset by comparing the expected page number to the one found in the book.
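To make the arithmetic concrete, here is a minimal Python sketch of how such an offset could be tracked. It is not tocPDF's actual implementation: the function names are hypothetical, and it assumes the offset is expressed as the PDF page on which book page 1 appears.

```python
def pdf_page_for(book_page: int, offset: int) -> int:
    """Map a book page number to a PDF page number.

    Assumption: `offset` is the PDF page that shows book page 1.
    """
    return book_page + offset - 1


def recompute_offset(offset: int, expected_book_page: int, found_book_page: int) -> int:
    """Correct the offset after inspecting one PDF page.

    expected_book_page: book page the outline expects on the inspected PDF page.
    found_book_page: book page number actually printed on that PDF page.
    """
    # If the printed number is larger than expected, that many book pages
    # are absent from the PDF before this point.
    missing = found_book_page - expected_book_page
    return offset - missing


offset = 9
# With a constant offset, book page 40 would be on PDF page 48.
expected_pdf_page = pdf_page_for(40, offset)
# Suppose PDF page 48 is actually printed as book page 42: two pages are missing.
offset = recompute_offset(offset, expected_book_page=40, found_book_page=42)
print(expected_pdf_page, pdf_page_for(40, offset))  # 48 46
```

Repeating this check at each chapter boundary keeps later outline entries aligned even when the number of missing pages grows through the document.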

Usage

This program requires three input parameters: the first and last PDF pages of the table of contents, and the PDF-book page offset. The offset is defined as the PDF page corresponding to the first book page with Arabic numerals (usually the start of the first chapter). If your book has missing pages between chapters, add the flag --missing_pages followed by either tika or pdfplumber. This selects the parser used to verify that the PDF-book page offset is still correct. This option makes the outline creation much more robust, at the cost of a somewhat longer execution time. If your PDF is not missing any pages, you can omit this flag.

Usage: tocPDF [OPTIONS] FILENAME

  Generates outlined PDF based on the Table of Contents. Version: 0.3.1

  Example: tocPDF example.pdf

Options:
  -s, --start_toc INTEGER   PDF page number of FIRST page of Table of
                            Contents.  [required]
  -e, --end_toc INTEGER     PDF page number of LAST page of Table of Contents.
                            [required]
  -o, --offset INTEGER      Global page offset, defined as PDF page number of
                            first page with arabic numerals.  [required]
  -m, --missing_pages TEXT  Parser (tika or pdfplumber) used to automatically
                            detect offset by verifying book page number
                            matches expected PDF page.
  -d, --debug               Outputs PDF file (tmp_toc.pdf) containing the
                            pages provided for the table of contents.
  -h, --help                Show this message and exit.

Example

The CLI can be invoked with just the PDF as a parameter:

tocPDF example.pdf

The user is then prompted for the start and end pages of the table of contents as well as the offset.

These parameters can be directly provided as arguments to the CLI. For instance, the following command generates the correct outlined PDF for the example document found in example_pdf/example.pdf:

tocPDF --start_toc 3 --end_toc 5 --offset 9 --missing_pages tika example.pdf

Or equivalently:

tocPDF -s 3 -e 5 -o 9 -m tika example.pdf

This will generate a new outlined PDF with the name out.pdf.
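For scripted use, the same invocation can be assembled from Python. The sketch below is only an illustration: build_command is a hypothetical helper, not part of tocPDF, and it uses only the flags listed in the usage text above. It builds and prints the command and leaves the actual call commented out so that it runs without tocPDF installed.

```python
import shlex
from typing import List, Optional

# Parser names accepted by --missing_pages according to the usage text.
PARSERS = ("tika", "pdfplumber")


def build_command(filename: str, start_toc: int, end_toc: int, offset: int,
                  missing_pages: Optional[str] = None) -> List[str]:
    """Assemble the tocPDF argument list (hypothetical helper)."""
    cmd = [
        "tocPDF",
        "--start_toc", str(start_toc),
        "--end_toc", str(end_toc),
        "--offset", str(offset),
    ]
    if missing_pages is not None:
        if missing_pages not in PARSERS:
            raise ValueError(f"parser must be one of {PARSERS}")
        cmd += ["--missing_pages", missing_pages]
    cmd.append(filename)
    return cmd


cmd = build_command("example.pdf", 3, 5, 9, missing_pages="tika")
print(shlex.join(cmd))
# To run it (requires tocPDF to be installed):
# import subprocess; subprocess.run(cmd, check=True)
```

Because the result is written to out.pdf, a script that processes several books would need to rename or move that file between runs.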

Supported Formats

The format of the table of contents varies from document to document, and I cannot guarantee that tocPDF will work perfectly. I have tested it on a dozen documents and it produces decent results. Make sure to run it with both parsers (-m tika and -m pdfplumber) and compare the results. If you encounter any bugs or find an unsupported table of contents format, feel free to open an issue.
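To illustrate why formats matter, here is a rough Python sketch of parsing one common layout, a title followed by dot leaders or spaces and a trailing Arabic page number. The regular expression and the names TocEntry and parse_toc_line are assumptions made for this example and do not reflect tocPDF's own parser. Lines that do not fit the pattern, such as entries with Roman numerals, are simply skipped.

```python
import re
from typing import NamedTuple, Optional


class TocEntry(NamedTuple):
    title: str
    book_page: int


# Title, then a run of whitespace and/or dot leaders, then the page number.
TOC_LINE = re.compile(r"^\s*(?P<title>.+?)[\s.]+(?P<page>\d+)\s*$")


def parse_toc_line(line: str) -> Optional[TocEntry]:
    """Return the entry for one line of text, or None if it does not match."""
    match = TOC_LINE.match(line)
    if match is None:
        return None
    return TocEntry(match["title"], int(match["page"]))


lines = [
    "Contents",
    "Preface ix",
    "1.2 Vector Spaces ........ 15",
    "Chapter 1 Introduction 3",
]
for line in lines:
    print(repr(line), "->", parse_toc_line(line))
```

A table of contents that places page numbers before titles, or wraps long titles over two lines, would need a different pattern, which is one reason results can differ between documents.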

Alternative Software

If the generated outline is slightly off, I recommend jpdfbookmarks (which can be downloaded directly from SourceForge), a nice piece of free software for manually editing PDF bookmarks.
