Add navigable bookmarks to a PDF based on its heading structure.

bmrk

A simple CLI tool for adding structured bookmarks to PDFs.

bmrk analyses a PDF's text and font metadata to detect its heading structure, then writes a bookmarked copy for easier navigation in any PDF viewer.


Installation

pip install bmrk

For an isolated install that keeps bmrk available globally without polluting your Python environment:

# pipx
pipx install bmrk

# uv
uv tool install bmrk

To run bmrk once without installing it:

# pipx
pipx run bmrk paper.pdf paper_bookmarked.pdf

# uvx (uv's ephemeral tool runner)
uvx bmrk paper.pdf paper_bookmarked.pdf

From source

pip install git+https://github.com/AnvarAtayev/bmrk.git

With OCR support

For scanned PDFs that lack a text layer, install the optional OCR extra:

pip install "bmrk[ocr]"
# or
pipx install "bmrk[ocr]"
# or
uv tool install "bmrk[ocr]"

This pulls in ocrmypdf, which itself requires Tesseract and Ghostscript to be installed on your system:

# macOS
brew install tesseract ghostscript

# Debian/Ubuntu
sudo apt install tesseract-ocr ghostscript

# Windows -- download installers from:
#   https://github.com/UB-Mannheim/tesseract/wiki
#   https://www.ghostscript.com/releases/gsdnld.html

Then pass --ocr to bmrk:

bmrk scanned.pdf scanned_bookmarked.pdf --ocr
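
Before running with --ocr, it can help to confirm that both system tools are on your PATH. The following is a minimal sketch using only the Python standard library; it is not part of bmrk, and the executable names (tesseract, gs, gswin64c, gswin32c) are assumptions that may differ depending on your platform or installer.

```python
import shutil

# Assumed executable names; adjust them if your installer uses different ones.
TESSERACT_NAMES = ("tesseract",)
GHOSTSCRIPT_NAMES = ("gs", "gswin64c", "gswin32c")


def find_first(names, which=shutil.which):
    """Return the path of the first executable found on PATH, or None."""
    for name in names:
        path = which(name)
        if path:
            return path
    return None


def missing_ocr_deps(which=shutil.which):
    """Return the OCR system dependencies that were not found on PATH."""
    missing = []
    if find_first(TESSERACT_NAMES, which) is None:
        missing.append("tesseract")
    if find_first(GHOSTSCRIPT_NAMES, which) is None:
        missing.append("ghostscript")
    return missing
```

Calling missing_ocr_deps() returns an empty list when both tools are found. The which parameter exists only so the lookup can be replaced in tests.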

OCR in a dev environment

# 1. Clone the repo and sync all extras
git clone https://github.com/AnvarAtayev/bmrk.git
cd bmrk
uv sync --extra dev --extra ocr

# 2. Install system deps (macOS example)
brew install tesseract ghostscript

# 3. Run
uv run bmrk scanned.pdf out.pdf --ocr

Usage

bmrk [OPTIONS] <INPUT>.pdf [<OUTPUT>.pdf]

Basic

bmrk paper.pdf paper_bookmarked.pdf

Options

| Flag | Default | Description |
| --- | --- | --- |
| --threshold RATIO / -t | 1.05 | Font-size ratio above which text is treated as a heading. Raise to 1.15 for noisy PDFs; lower to 1.01 to catch bold same-size section titles. |
| --verbose / -v | off | Print detected headings and progress info. |
| --dry-run / -n | off | Detect and print headings only; do not write an output file. Useful for tuning --threshold. |
| --ocr | off | Run OCR before detection. Requires bmrk[ocr]. |
| --export-headings FILE | -- | Write detected heading structure to FILE (TSV). Edit and feed back in with --import-headings. |
| --import-headings FILE | -- | Use headings from FILE instead of running detection. Enables manual adjustments. |
| --cover-pages N | 0 | Skip the first N pages when detecting headings (e.g. cover page). |
| --max-depth N / -d | 3 | Maximum heading depth to include (1 = chapters only, 2 = + sections, 3 = + subsections). |

Inspect before writing

bmrk paper.pdf --dry-run --verbose

Manual heading adjustments

If the auto-detected bookmarks are not quite right, you can export the heading structure, edit it by hand, and import the corrected version back in.

Step 1 -- Export the detected headings

bmrk paper.pdf --export-headings headings.tsv

When OUTPUT is omitted, bmrk runs detection and exports the heading list without writing a PDF.

Step 2 -- Edit the TSV file

Open headings.tsv in any text editor or spreadsheet app. The format is tab-separated with three columns:

# bmrk heading export
# level	page	title
1	1	Introduction
2	3	Background
2	7	Methods
1	12	Results
3	14	Statistical Analysis
  • level -- heading depth (1 = top-level chapter, 2 = section, 3 = subsection, ...).
  • page -- 1-based page number where the heading appears.
  • title -- the bookmark text shown in the PDF viewer.
  • Lines starting with # are comments and are ignored on import.
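
If you would rather inspect the export from a script, the format described above can be read in a few lines of Python. This is a sketch under the stated format only, not bmrk's own parser; the Heading class and parse_headings function are hypothetical names introduced for illustration.

```python
from dataclasses import dataclass


@dataclass
class Heading:
    """One row of the export (hypothetical helper, not bmrk's HeadingEntry)."""
    level: int
    page: int
    title: str


def parse_headings(text):
    """Parse tab-separated 'level, page, title' rows, skipping comments."""
    headings = []
    for line in text.splitlines():
        # Blank lines and lines starting with '#' carry no heading.
        if not line.strip() or line.startswith("#"):
            continue
        # Split on the first two tabs only, so the title keeps any later tabs.
        level, page, title = line.split("\t", 2)
        headings.append(Heading(int(level), int(page), title.strip()))
    return headings


SAMPLE = (
    "# bmrk heading export\n"
    "# level\tpage\ttitle\n"
    "1\t1\tIntroduction\n"
    "2\t3\tBackground\n"
)
```

A row with fewer than three fields, or a non-numeric level or page, raises ValueError in this sketch, which is usually what you want while hand-editing.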

Common edits:

  • Remove a heading -- delete the line entirely.
  • Add a missing heading -- insert a new line with the correct level, page, and title.
  • Fix a title -- change the text in the third column.
  • Change nesting -- adjust the level number (e.g. change 2 to 1 to promote a section to a chapter).
  • Reorder headings -- rearrange lines; bookmarks are inserted in the order they appear in the file.
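
For larger documents, the same edits can be scripted instead of done by hand. The sketch below works on plain (level, page, title) tuples and writes them back in the export format shown earlier; apply_edits and to_tsv are hypothetical helpers, not bmrk functions.

```python
def apply_edits(rows, drop_titles=(), retitle=None, relevel=None):
    """Apply removals, title fixes and level changes to (level, page, title) rows.

    retitle and relevel are dicts keyed by the ORIGINAL title.
    """
    retitle = retitle or {}
    relevel = relevel or {}
    edited = []
    for level, page, title in rows:
        if title in drop_titles:
            continue  # remove a heading
        level = relevel.get(title, level)  # change nesting
        title = retitle.get(title, title)  # fix a title
        edited.append((level, page, title))
    return edited


def to_tsv(rows):
    """Serialise rows in the tab-separated layout of the export file."""
    lines = ["# bmrk heading export", "# level\tpage\ttitle"]
    lines += [f"{level}\t{page}\t{title}" for level, page, title in rows]
    return "\n".join(lines) + "\n"


rows = [
    (1, 1, "Introduction"),
    (2, 3, "Background"),
    (2, 7, "Methods"),
    (1, 12, "Results"),
    (3, 14, "Statistical Analysis"),
]
edited = apply_edits(
    rows,
    drop_titles={"Background"},
    relevel={"Statistical Analysis": 2},
)
```

Write to_tsv(edited) to a file and pass that file to --import-headings as described in Step 3.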

Step 3 -- Import and produce the bookmarked PDF

bmrk paper.pdf paper_bookmarked.pdf --import-headings headings.tsv

This skips detection entirely and uses your edited headings to write the bookmarked PDF.
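
To preview how a flat heading list turns into a nested outline, the levels can be folded into a tree with a stack. This is an illustrative sketch, not bmrk's bookmark writer; in particular, how bmrk treats a jump in levels (for example a level 3 directly after a level 1) is not stated here, and this sketch simply attaches such a heading to the nearest shallower ancestor.

```python
def build_outline(rows):
    """Nest flat (level, page, title) rows into a tree of dicts."""
    root = {"title": None, "page": None, "children": []}
    stack = [(0, root)]  # chain of currently open ancestors as (level, node)
    for level, page, title in rows:
        node = {"title": title, "page": page, "children": []}
        # Close every ancestor at the same or a deeper level.
        while stack[-1][0] >= level:
            stack.pop()
        stack[-1][1]["children"].append(node)
        stack.append((level, node))
    return root["children"]


rows = [
    (1, 1, "Introduction"),
    (2, 3, "Background"),
    (2, 7, "Methods"),
    (1, 12, "Results"),
    (3, 14, "Statistical Analysis"),
]
outline = build_outline(rows)
```

Because rows are processed in file order, reordering lines in the TSV changes which parent a heading ends up under.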

Tune for a noisy PDF

# More conservative -- only large headings
bmrk paper.pdf out.pdf --threshold 1.15

# More aggressive -- catches bold same-size section titles
bmrk paper.pdf out.pdf --threshold 1.01

Handle a cover page

# Skip page 1 (the cover) when detecting headings
bmrk report.pdf report_bookmarked.pdf --cover-pages 1

How it works

bmrk reads every text span in the PDF along with its font size and style, then uses three signals to find headings:

  1. Font size -- text larger than the body font by more than the --threshold ratio is treated as a heading. The largest size becomes H1, the next largest H2, and so on.
  2. Numbered prefixes -- lines like 1 Introduction or 2.3 Methods are headings, with depth inferred from the numbering.
  3. Bold/italic at body size -- some documents style section headings in bold or italic without changing the font size. These are picked up as the lowest heading level.
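
The three signals above can be sketched in a few lines of Python. This is an illustration of the idea only, not the code in detector.py: spans are simplified to (text, size, bold) tuples, the body size is assumed to be the size carrying the most characters, and the way the signals are combined (numbering overrides the level of a span already flagged by size or bold) is an assumption made for this example.

```python
import re
from collections import Counter

# Matches prefixes such as "1 Introduction" or "2.3 Methods".
NUMBERED = re.compile(r"^(\d+(?:\.\d+)*)\.?\s+\S")


def body_font_size(spans):
    """Estimate the body size as the font size carrying the most characters."""
    counts = Counter()
    for text, size, _bold in spans:
        counts[size] += len(text)
    return counts.most_common(1)[0][0]


def numbered_depth(text):
    """Depth implied by a numeric prefix, or None if there is no prefix."""
    match = NUMBERED.match(text)
    return match.group(1).count(".") + 1 if match else None


def detect(spans, threshold=1.05):
    """Return (level, text) pairs for spans that look like headings."""
    body = body_font_size(spans)
    # Signal 1: sizes above the threshold ratio, largest first.
    big = sorted({s for _t, s, _b in spans if s / body > threshold}, reverse=True)
    size_level = {size: rank + 1 for rank, size in enumerate(big)}
    lowest = len(big) + 1  # Signal 3: bold at body size gets the lowest level.
    found = []
    for text, size, bold in spans:
        if size not in size_level and not bold:
            continue  # ordinary body text
        depth = numbered_depth(text)  # Signal 2: numbering decides the depth.
        level = depth or size_level.get(size, lowest)
        found.append((level, text))
    return found


spans = [
    ("1 Introduction", 18.0, False),
    ("Body text body text body text body text", 10.0, False),
    ("1.1 Background", 14.0, False),
    ("More body text that is fairly long here", 10.0, False),
    ("Notation", 10.0, True),
    ("Appendix", 18.0, False),
]
```

Raising the threshold in this sketch drops the 14 pt heading once the ratio 1.4 no longer exceeds it, which mirrors the effect described for --threshold.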

After detection, bmrk cleans up the results (removes running page headers, deduplicates, merges chapter labels like Chapter 1 with the title that follows) and writes the final bookmark outline into the output PDF.
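
The clean-up pass can be pictured as a series of small filters over the detected rows. The sketch below is a simplified illustration rather than bmrk's implementation: the rule that a title repeated more than max_repeats times is a running header, the requirement that a chapter label and its title share a page, and the "Chapter 1: Title" merge format are all assumptions made for the example.

```python
import re
from collections import Counter

# Labels such as "Chapter 1" that should be merged with the following title.
CHAPTER_LABEL = re.compile(r"^chapter\s+\w+$", re.IGNORECASE)


def clean(rows, max_repeats=3, max_depth=3):
    """Clean (level, page, title) rows in four steps."""
    # 1. Remove running headers: titles that repeat too often.
    counts = Counter(title for _level, _page, title in rows)
    rows = [row for row in rows if counts[row[2]] <= max_repeats]

    # 2. Deduplicate adjacent identical titles.
    deduped = []
    for row in rows:
        if deduped and deduped[-1][2] == row[2]:
            continue
        deduped.append(row)

    # 3. Merge a chapter label with the title that follows on the same page.
    merged = []
    i = 0
    while i < len(deduped):
        level, page, title = deduped[i]
        nxt = deduped[i + 1] if i + 1 < len(deduped) else None
        if CHAPTER_LABEL.match(title) and nxt is not None and nxt[1] == page:
            merged.append((level, page, f"{title}: {nxt[2]}"))
            i += 2
        else:
            merged.append((level, page, title))
            i += 1

    # 4. Filter by maximum depth.
    return [row for row in merged if row[0] <= max_depth]


rows = [
    (1, 1, "Chapter 1"),
    (1, 1, "Getting Started"),
    (2, 2, "My Book"),
    (2, 3, "My Book"),
    (2, 4, "My Book"),
    (2, 5, "My Book"),
    (2, 6, "Setup"),
    (2, 6, "Setup"),
    (4, 7, "Tiny detail"),
]
cleaned = clean(rows)
```

Here the repeated "My Book" header is dropped, the duplicate "Setup" collapses to one entry, and the level 4 row is filtered out by the default depth of 3.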

flowchart LR
    A[PDF] --> B[Extract spans]
    B --> C[Pre-process]
    C --> D[Detect headings]
    D --> E[Clean up]
    E --> F[Write bookmarks]

    C -.- C1["Skip cover/TOC pages
    Exclude headers/footers
    Estimate body font size"]
    D -.- D1["1. Font size > body size
    2. Numbered prefixes
    3. Bold/italic at body size"]
    E -.- E1["Remove running headers
    Deduplicate adjacent titles
    Merge chapter labels
    Filter by max depth"]

Code structure

src/bmrk/
├── cli.py        # Typer CLI entry point
├── detector.py   # Heading detection logic and HeadingEntry dataclass
├── bookmarker.py # PDF bookmark writing

Limitations

  • Scanned/image PDFs -- bmrk cannot detect headings in PDFs without selectable text. Run OCR first with bmrk --ocr (requires bmrk[ocr]).
  • Existing bookmarks -- bmrk replaces any existing outline; it does not merge with pre-existing bookmarks.

Development

uv sync --extra dev

# Lint
uv run ruff check src/

# Test
uv run pytest

Contributing

Contributions are welcome. Report bugs and request features via GitHub Issues; submit code changes as a pull request against main.

Before opening a pull request, run the linter and the test suite to confirm nothing is broken:

uv sync --extra dev
uv run ruff check src/
uv run pytest

License

MIT
