Automatically generate a table of contents for PDF files

pdf.tocgen

                          in.pdf
                            |
                            |
     +----------------------+--------------------+
     |                      |                    |
     V                      V                    V
+----------+          +-----------+         +----------+
|          |  recipe  |           |   ToC   |          |
| pdfxmeta +--------->| pdftocgen +-------->| pdftocio +---> out.pdf
|          |          |           |         |          |
+----------+          +-----------+         +----------+

pdf.tocgen is a set of command-line tools for automatically extracting and generating the table of contents (ToC) of a PDF file. It uses the embedded font attributes and position of headings to deduce the basic outline of a PDF file.

pdf.tocgen works best for PDF files produced from a TeX document using pdftex (and its friends pdflatex, pdfxetex, etc.), but it is designed to work with any software-generated PDF file. Examples include output from troff/groff, Adobe InDesign, Microsoft Word, and probably more. You shouldn't expect it to work with scanned PDFs.

Please see the homepage for a detailed introduction.
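To make "font attributes and position of headings" more concrete, here is a minimal sketch of how such attributes can be read. It assumes a page dictionary shaped like the one PyMuPDF returns from page.get_text("dict"). The Span type, the iter_spans helper, and the sample data are hypothetical names made up for this illustration; this is not the code pdfxmeta actually runs.

```python
from typing import Iterator, NamedTuple


class Span(NamedTuple):
    """One run of text sharing a single font (hypothetical helper type)."""
    text: str
    font: str
    size: float
    top: float  # y coordinate of the upper edge of the span's bounding box


def iter_spans(page_dict: dict) -> Iterator[Span]:
    """Flatten a PyMuPDF-style page dictionary into individual spans."""
    for block in page_dict.get("blocks", []):
        # Image blocks have no "lines" key, so they yield nothing here.
        for line in block.get("lines", []):
            for span in line.get("spans", []):
                # bbox is (x0, y0, x1, y1); y0 is the top edge.
                yield Span(span["text"], span["font"], span["size"], span["bbox"][1])


# Hand-written stand-in for one page. With PyMuPDF installed, a real
# dictionary could come from: fitz.open("in.pdf")[0].get_text("dict")
sample_page = {
    "blocks": [
        {"lines": [{"spans": [
            {"text": "Preface", "font": "Times-Bold",
             "size": 19.92530059814453, "bbox": (72.0, 90.0, 160.0, 110.0)},
        ]}]},
        {"lines": [{"spans": [
            {"text": "This book is about...", "font": "Times-Roman",
             "size": 9.962599754333496, "bbox": (72.0, 130.0, 400.0, 140.0)},
        ]}]},
    ]
}

for span in iter_spans(sample_page):
    print(f"{span.font} {span.size:.2f} @ y={span.top}: {span.text}")
```

In a typeset document, headings usually use a font name and size that body text does not, which is what makes this information useful for picking them out.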

Installation

pdf.tocgen is written in Python 3. It is known to work with Python 3.8 under Linux, and Python 3.7 should be the minimum supported version. Use

$ pip install -U pdf.tocgen

to install the latest version systemwide, or use

$ pip install -U --user pdf.tocgen

to install it for the current user. I would recommend the latter approach to avoid messing up the package managers on your system.

Workflow

The design of pdf.tocgen is influenced by the Unix philosophy. I intentionally split pdf.tocgen into three separate programs. They work together, but each of them is useful on its own.

pdfxmeta: extract the metadata (font attributes, positions) of headings to build a recipe file.
pdftocgen: generate a table of contents from the recipe.
pdftocio: import the table of contents into the PDF document.

You should read the example on the homepage for a proper introduction, but the basic workflow is as follows.

First, use pdfxmeta to search for the metadata of headings:

$ pdfxmeta -p page in.pdf pattern >> recipe.toml
$ pdfxmeta -p page in.pdf pattern2 >> recipe.toml

Edit the recipe.toml file to pick out the attributes you need and specify the heading levels.

$ vim recipe.toml # edit

An example recipe would look like this:

[[filter]]
level = 1
font.name = "Times-Bold"
font.size = 19.92530059814453

[[filter]]
level = 2
font.name = "Times-Bold"
font.size = 11.9552001953125
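As a rough illustration of how a recipe like the one above could be applied, the sketch below assigns a heading level to a piece of text when its font name and size match a filter. The recipe is written out as the nested dictionaries a TOML parser would produce for the dotted keys font.name and font.size. The heading_level function and its tolerance parameter are assumptions made for this example, not pdftocgen's actual implementation.

```python
import math
from typing import List, Optional

# The example recipe, as a TOML parser would load its [[filter]] tables.
recipe = [
    {"level": 1, "font": {"name": "Times-Bold", "size": 19.92530059814453}},
    {"level": 2, "font": {"name": "Times-Bold", "size": 11.9552001953125}},
]


def heading_level(font_name: str, font_size: float, filters: List[dict],
                  tolerance: float = 1e-5) -> Optional[int]:
    """Return the level of the first matching filter, or None for body text.

    Sizes are compared with an absolute tolerance (an assumption of this
    sketch) because they are floating-point values read from the PDF.
    """
    for flt in filters:
        font = flt["font"]
        same_name = font["name"] == font_name
        same_size = math.isclose(font["size"], font_size, abs_tol=tolerance)
        if same_name and same_size:
            return flt["level"]
    return None


print(heading_level("Times-Bold", 19.92530059814453, recipe))
print(heading_level("Times-Bold", 11.9552001953125, recipe))
print(heading_level("Times-Roman", 9.9626, recipe))
```

This also shows why the long decimal sizes in the recipe matter: both levels use the same font name, so the size is the only attribute that tells them apart.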

Then pass the recipe to pdftocgen to generate a table of contents,

$ pdftocgen in.pdf < recipe.toml
"Preface" 5
    "Bottom-up Design" 5
    "Plan of the Book" 7
    "Examples" 9
    "Acknowledgements" 9
"Contents" 11
"The Extensible Language" 14
    "1.1 Design by Evolution" 14
    "1.2 Programming Bottom-Up" 16
    "1.3 Extensible Software" 18
    "1.4 Extending Lisp" 19
    "1.5 Why Lisp (or When)" 21
"Functions" 22
    "2.1 Functions as Data" 22
    "2.2 Defining Functions" 23
    "2.3 Functional Arguments" 26
    "2.4 Functions as Properties" 28
    "2.5 Scope" 29
    "2.6 Closures" 30
    "2.7 Local Functions" 34
    "2.8 Tail-Recursion" 35
    "2.9 Compilation" 37
    "2.10 Functions from Lists" 40
"Functional Programming" 41
    "3.1 Functional Design" 41
    "3.2 Imperative Outside-In" 46
    "3.3 Functional Interfaces" 48
    "3.4 Interactive Programming" 50
[--snip--]

which can be directly imported to the PDF file using pdftocio,

$ pdftocgen in.pdf < recipe.toml | pdftocio -o out.pdf in.pdf

Or if you want to edit the table of contents before importing it,

$ pdftocgen in.pdf < recipe.toml > toc
$ vim toc # edit
$ pdftocio in.pdf < toc
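Because the table of contents is plain text, it is also easy to process with your own scripts. The sketch below parses the format shown in the sample output above, assuming one entry per line, a quoted title followed by a page number, and four spaces of indentation per nesting level. The Entry type and parse_toc function are hypothetical names for this example. It is not the parser pdftocio uses, and it does not handle anything beyond what the sample output shows.

```python
import re
from typing import List, NamedTuple


class Entry(NamedTuple):
    level: int   # 1 for top-level headings, 2 for the next level, ...
    title: str
    page: int


# Leading spaces, a double-quoted title, one space, then the page number.
LINE = re.compile(r'^(?P<indent> *)"(?P<title>.*)" (?P<page>\d+)\s*$')


def parse_toc(text: str, indent_width: int = 4) -> List[Entry]:
    """Parse ToC text into entries; raise ValueError on unexpected lines."""
    entries = []
    for line in text.splitlines():
        if not line.strip():
            continue  # ignore blank lines
        match = LINE.match(line)
        if match is None:
            raise ValueError(f"unrecognised ToC line: {line!r}")
        level = len(match.group("indent")) // indent_width + 1
        entries.append(Entry(level, match.group("title"), int(match.group("page"))))
    return entries


sample = '''"Preface" 5
    "Bottom-up Design" 5
    "Plan of the Book" 7
"Contents" 11
'''

for entry in parse_toc(sample):
    print(entry.level, entry.page, entry.title)
```

A script like this could, for example, check that page numbers never decrease before the file is handed to pdftocio.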

Each of the three programs has additional features. Use the -h option to see all the options you can pass in.

Development

If you want to modify the source code or contribute anything, first install poetry, which is a dependency and package manager for Python used by pdf.tocgen. Then run

$ poetry install

in the root directory of this repository to set up development dependencies.

If you want to test the development version of pdf.tocgen, use the poetry run command:

$ poetry run pdfxmeta in.pdf "pattern"

Alternatively, you could also use the

$ poetry shell

command to open up a virtual environment and run the development version directly:

(pdf.tocgen) $ pdfxmeta in.pdf "pattern"

Before you send a patch or pull request, make sure the unit tests pass by running:

$ make test

License

pdf.tocgen is free software. The source code of pdf.tocgen is licensed under the GNU GPLv3 license.

pdf.tocgen is based on PyMuPDF, licensed under the GNU GPLv3 license, which is in turn based on MuPDF, licensed under the GNU AGPLv3 license. A copy of the AGPLv3 license is included in the repository.

If you want to make any derivatives based on this project, please follow the terms of the GNU GPLv3 license.

Release history

1.3.4    Nov 26, 2023
1.3.3    Apr 21, 2023
1.3.2    Apr 20, 2023
1.3.1    Apr 20, 2023
1.3.0    Nov 10, 2021
1.2.3    Jan 7, 2021
1.2.2    Oct 11, 2020
1.2.1    Aug 7, 2020
1.2.0    Aug 7, 2020
1.1.3    Aug 4, 2020
1.1.2    Aug 4, 2020
1.1.1    Aug 1, 2020
1.1.0    Jul 31, 2020
1.0.1    Jul 30, 2020
1.0.0    Jul 28, 2020
0.9.9    Jul 28, 2020 (this version)
