A CLI tool to automatically add bookmarks to PDFs

autoindex 📑

A Python project that automatically adds an index (bookmarks/outlines) to a PDF.

Installation

Using Pip

  • Run pip install autoindex

From Source

  • Clone the repo or download the zip
  • cd to the folder
  • Run pip install -r "requirements.txt"
  • Run python autoindex.py [OPTIONS]

Usage

autoindex works best with PDFs whose table of contents has clearly laid out entries, numerical page numbers, and no images. Nesting can be detected from differences in font size or from the indentation of the entries. In both cases, the thresholds used to detect child bookmarks have to be configured. The -d/--diagnose option helps with this: it prints the most common font sizes and line starting coordinates, which can be used to work out the threshold values.
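The idea behind the diagnostics and the font-size threshold can be sketched as follows. This is a minimal illustration, not autoindex's actual implementation: the sample lines, the helper names most_common and nesting_levels, and the rule "child if smaller than the largest size by at least the threshold" are all assumptions made for the example.

```python
from collections import Counter

# Hypothetical extracted TOC lines: (text, font size in points, left x-coordinate).
lines = [
    ("1 Introduction", 12.0, 72.0),
    ("1.1 Motivation", 10.0, 90.0),
    ("1.2 Scope", 10.0, 90.0),
    ("2 Methods", 12.0, 72.0),
    ("2.1 Data", 10.0, 90.0),
]


def most_common(values, n=3):
    """Return the n most frequent values with their counts."""
    return Counter(values).most_common(n)


def nesting_levels(lines, threshold):
    """Assumed rule: a line is a child (level 1) when its font size is
    smaller than the largest size by at least `threshold`."""
    largest = max(size for _, size, _ in lines)
    return [(text, 1 if largest - size >= threshold else 0)
            for text, size, _ in lines]


# Roughly the kind of summary a diagnose step could print.
print("font sizes:", most_common(size for _, size, _ in lines))
print("line starts:", most_common(x for _, _, x in lines))

# With sizes 12.0 and 10.0, any threshold up to 2.0 separates the two levels.
for text, level in nesting_levels(lines, threshold=1.5):
    print("  " * level + text)
```

Looking at the gap between the most common sizes (or starting coordinates) is what suggests a threshold value: it should be smaller than the gap between a header and its children, and larger than any noise within one level.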

Most PDFs have an offset between the page number printed on the page and the page position shown in the reader. This can be specified using the --offset option.
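As a small worked example of the offset arithmetic: the help text describes the offset as a value added to the page numbers from the table of contents. The function name pdf_page_index and the sample entries below are hypothetical, and the figure of 14 front-matter pages is an assumption chosen for illustration.

```python
def pdf_page_index(printed_page, offset):
    """Map a page number printed in the table of contents to the page
    position in the file by adding the offset."""
    return printed_page + offset


# Assumed example: 14 front-matter pages come before the page labelled "1".
toc_entries = [("Introduction", 1), ("Methods", 23)]
bookmarks = [(title, pdf_page_index(page, offset=14))
             for title, page in toc_entries]
print(bookmarks)  # [('Introduction', 15), ('Methods', 37)]
```

To find the offset for a given file, open the page labelled "1" in a reader and subtract 1 from the position the reader reports.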

Scanned PDFs are not supported yet.

Limitations

  • Multi-line bookmarks might not be extracted completely.
  • PDFs meant for printing have different offsets for text on odd and even pages, which can cause problems when detecting nesting.
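To illustrate the odd/even problem, the sketch below shows how a shifted text block makes raw x-coordinates ambiguous, and one possible workaround: measuring indents relative to each page's own left text edge. This is not something autoindex is documented to do; the function normalise_indents and the sample coordinates are assumptions.

```python
def normalise_indents(lines):
    """Subtract each page's smallest x-coordinate so that indents are
    measured relative to that page's left text edge."""
    left_edge = {}
    for page, x, _ in lines:
        left_edge[page] = min(x, left_edge.get(page, x))
    return [(page, x - left_edge[page], text) for page, x, text in lines]


# Hypothetical TOC lines: (page, left x-coordinate, text).
lines = [
    (5, 72.0, "1 Introduction"),
    (5, 90.0, "1.1 Motivation"),
    (6, 54.0, "2 Methods"),  # even page: text block shifted left by 18 pt
    (6, 72.0, "2.1 Data"),   # same raw x as the header on page 5
]

for page, indent, text in normalise_indents(lines):
    print(page, indent, text)
```

With raw coordinates, "2.1 Data" and "1 Introduction" both start at 72.0 and would be treated as the same level. After normalising, headers sit at 0.0 and children at 18.0 on both pages. The approach breaks down when a page contains only child entries, since there is then no header to mark the left edge.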

Options

Usage: autoindex.py [OPTIONS]

Options:
  -i, --input TEXT                input file name  [required]
  -o, --output TEXT               output file name. If not provided, defaults
                                  to the input file name suffixed with
                                  "-bookmarked"

  --toc-page-numbers, --toc INTEGER...
                                  range of pages (from, to) having the table
                                  of contents

  -d, --diagnose                  print the most common font sizes and line
                                  starting points to help choose values for
                                  fontsize/indent thresholds

  --nest-using-fontsize           flag to try and figure out nested bookmarks
                                  using font sizes

  --nest-using-indents            flag to try and figure out nested bookmarks
                                  using indents

  --offset INTEGER                offset to add to the page numbers from the
                                  table of contents

  --char-margin FLOAT             spacing between characters to be considered
                                  as a part of the same line

  --line-margin FLOAT             spacing between lines to be considered as a
                                  part of the same text box

  --header-fontsize-threshold FLOAT
                                  font size difference for a line to be
                                  considered as header

  --topic-fontsize-threshold FLOAT
                                  font size difference for lines to be
                                  considered as a part of the same parent
                                  header

  --header-indent-threshold FLOAT
                                  indent difference for a line to be
                                  considered as header

  --topic-indent-threshold FLOAT  indent difference for lines to be considered
                                  as a part of the same parent header

  --help                          Show this message and exit.
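The options above can be combined into a single invocation. The sketch below only assembles the argument list; build_command is a hypothetical helper, the file name and page numbers are made up, and it assumes that --toc takes its two integers (from, to) directly after the flag, as the help text suggests.

```python
import sys


def build_command(input_pdf, toc_pages, offset=0, nest_using_fontsize=False):
    """Assemble an argument list for running autoindex.py from a source checkout."""
    cmd = [
        sys.executable, "autoindex.py",
        "-i", input_pdf,
        "--toc", str(toc_pages[0]), str(toc_pages[1]),
        "--offset", str(offset),
    ]
    if nest_using_fontsize:
        cmd.append("--nest-using-fontsize")
    return cmd


cmd = build_command("book.pdf", toc_pages=(5, 7), offset=14,
                    nest_using_fontsize=True)
print(" ".join(cmd))
# To actually run it from the source folder:
#   import subprocess; subprocess.run(cmd, check=True)
```

Since no -o is passed here, the output name falls back to the documented default: the input file name suffixed with "-bookmarked".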

To Do

  • Detect nesting using indents
  • Output an intermediate YAML containing bookmarks that can be fixed before being added to the file
  • Add support for EPUB/DjVu
  • Expose as a web app
  • Add GUI/diagnostics to help choose configuration params
