Add navigable bookmarks to a PDF based on its heading structure.
bmrk
A simple CLI tool for adding structured bookmarks to PDFs.
bmrk analyses a PDF's text and font metadata to detect its heading structure, then writes a bookmarked copy for easier navigation in any PDF viewer.
Installation
pip install bmrk
For an isolated install that keeps bmrk available globally without polluting your Python environment:
# pipx
pipx install bmrk
# uv
uv tool install bmrk
To run bmrk once without installing it:
# pipx
pipx run bmrk paper.pdf paper_bookmarked.pdf
# uvx (uv's ephemeral tool runner)
uvx bmrk paper.pdf paper_bookmarked.pdf
From source
pip install git+https://github.com/AnvarAtayev/bmrk.git
With OCR support
For scanned PDFs that lack a text layer, install the optional OCR extra:
pip install "bmrk[ocr]"
# or
pipx install "bmrk[ocr]"
# or
uv tool install "bmrk[ocr]"
This pulls in ocrmypdf, which itself requires Tesseract and Ghostscript to be installed on your system:
# macOS
brew install tesseract ghostscript
# Debian/Ubuntu
sudo apt install tesseract-ocr ghostscript
# Windows -- download installers from:
# https://github.com/UB-Mannheim/tesseract/wiki
# https://www.ghostscript.com/releases/gsdnld.html
Then pass --ocr to bmrk:
bmrk scanned.pdf scanned_bookmarked.pdf --ocr
OCR in a dev environment
# 1. Clone the repo and sync all extras
git clone https://github.com/AnvarAtayev/bmrk.git
cd bmrk
uv sync --extra dev --extra ocr
# 2. Install system deps (macOS example)
brew install tesseract ghostscript
# 3. Run
uv run bmrk scanned.pdf out.pdf --ocr
Usage
bmrk [OPTIONS] <INPUT>.pdf [<OUTPUT>.pdf]
Basic
bmrk paper.pdf paper_bookmarked.pdf
Options
| Flag | Default | Description |
|---|---|---|
| --threshold RATIO / -t | 1.05 | Font-size ratio above which text is treated as a heading. Raise to 1.15 for noisy PDFs; lower to 1.01 to catch bold same-size section titles. |
| --verbose / -v | off | Print detected headings and progress info. |
| --dry-run / -n | off | Detect and print headings only; do not write an output file. Useful for tuning --threshold. |
| --ocr | off | Run OCR before detection. Requires bmrk[ocr]. |
| --export-headings FILE | -- | Write detected heading structure to FILE (TSV). Edit and feed back in with --import-headings. |
| --import-headings FILE | -- | Use headings from FILE instead of running detection. Enables manual adjustments. |
| --cover-pages N | 0 | Skip the first N pages when detecting headings (e.g. cover page). |
| --max-depth N / -d | 3 | Maximum heading depth to include (1 = chapters only, 2 = + sections, 3 = + subsections). |
Inspect before writing
bmrk paper.pdf --dry-run --verbose
Manual heading adjustments
If the auto-detected bookmarks are not quite right, you can export the heading structure, edit it by hand, and import the corrected version back in.
Step 1 -- Export the detected headings
bmrk paper.pdf --export-headings headings.tsv
When OUTPUT is omitted, bmrk runs detection and exports the heading list without writing a PDF.
Step 2 -- Edit the TSV file
Open headings.tsv in any text editor or spreadsheet app. The format is tab-separated with three columns:
# bmrk heading export
# level page title
1 1 Introduction
2 3 Background
2 7 Methods
1 12 Results
3 14 Statistical Analysis
- level -- heading depth (1 = top-level chapter, 2 = section, 3 = subsection, ...).
- page -- 1-based page number where the heading appears.
- title -- the bookmark text shown in the PDF viewer.
- Lines starting with # are comments and are ignored on import.
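To sanity-check an edited file before importing it, the format described above can be parsed in a few lines of Python. This is a minimal sketch, not bmrk's own importer: the `Heading` class and `parse_headings` function are hypothetical names, and the checks (exactly three tab-separated fields, level and page at least 1) are assumptions drawn from the column descriptions.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Heading:
    """Hypothetical record for one TSV row (not bmrk's HeadingEntry)."""
    level: int  # 1 = chapter, 2 = section, ...
    page: int   # 1-based page number
    title: str  # bookmark text


def parse_headings(text: str) -> List[Heading]:
    """Parse `level<TAB>page<TAB>title` rows, skipping blank and # lines."""
    headings = []
    for lineno, line in enumerate(text.splitlines(), start=1):
        if not line.strip() or line.lstrip().startswith("#"):
            continue
        parts = line.split("\t", 2)
        if len(parts) != 3:
            raise ValueError(f"line {lineno}: expected 3 tab-separated fields")
        level, page, title = int(parts[0]), int(parts[1]), parts[2].strip()
        if level < 1 or page < 1:
            raise ValueError(f"line {lineno}: level and page must be >= 1")
        headings.append(Heading(level, page, title))
    return headings


sample = (
    "# bmrk heading export\n"
    "# level\tpage\ttitle\n"
    "1\t1\tIntroduction\n"
    "2\t3\tBackground\n"
)
for heading in parse_headings(sample):
    print(heading)
```

A row whose columns are separated by spaces instead of tabs raises a `ValueError` here, which is a common slip when editing the file in a plain text editor.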
Common edits:
- Remove a heading -- delete the line entirely.
- Add a missing heading -- insert a new line with the correct level, page, and title.
- Fix a title -- change the text in the third column.
- Change nesting -- adjust the level number (e.g. change 2 to 1 to promote a section to a chapter).
- Reorder headings -- rearrange lines; bookmarks are inserted in the order they appear in the file.
Step 3 -- Import and produce the bookmarked PDF
bmrk paper.pdf paper_bookmarked.pdf --import-headings headings.tsv
This skips detection entirely and uses your edited headings to write the bookmarked PDF.
Tune for a noisy PDF
# More conservative -- only large headings
bmrk paper.pdf out.pdf --threshold 1.15
# More aggressive -- catches bold same-size section titles
bmrk paper.pdf out.pdf --threshold 1.01
Handle a cover page
# Skip page 1 (the cover) when detecting headings
bmrk report.pdf report_bookmarked.pdf --cover-pages 1
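When the same tuning has to be applied to many files, the invocations above can be assembled from a script. The sketch below only builds the argument list from flags documented in the Options table; `bmrk_command` is a hypothetical helper, not part of bmrk.

```python
from pathlib import Path
from typing import List, Optional


def bmrk_command(
    src: Path,
    dst: Path,
    threshold: Optional[float] = None,
    cover_pages: int = 0,
) -> List[str]:
    """Build a bmrk argument list using only the documented CLI flags."""
    cmd = ["bmrk", str(src), str(dst)]
    if threshold is not None:
        cmd += ["--threshold", str(threshold)]
    if cover_pages:
        cmd += ["--cover-pages", str(cover_pages)]
    return cmd


print(bmrk_command(Path("report.pdf"), Path("report_bookmarked.pdf"),
                   threshold=1.15, cover_pages=1))
```

The resulting list can be passed to `subprocess.run(cmd, check=True)` once bmrk is installed and on the PATH.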
How it works
bmrk reads every text span in the PDF along with its font size and style, then uses three signals to find headings:
- Font size -- text larger than the body font is a heading. The biggest text becomes H1, the next size H2, and so on.
- Numbered prefixes -- lines like 1 Introduction or 2.3 Methods are headings, with depth inferred from the numbering.
- Bold/italic at body size -- some documents style section headings in bold or italic without changing the font size. These are picked up as the lowest heading level.
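The first two signals can be illustrated with a short sketch. It is not bmrk's detector: the `(text, font_size)` span shape, the function names, and the choice of "body size = the size covering the most characters" are all assumptions made for the example. Only the default ratio of 1.05 comes from the --threshold option.

```python
import re
from collections import Counter
from typing import Dict, List, Optional, Tuple

# Hypothetical span representation: (text, font_size).
Span = Tuple[str, float]

# Matches prefixes such as "1 Introduction", "2.3 Methods", "4. Results".
NUMBERED = re.compile(r"^(\d+(?:\.\d+)*)\.?\s+\S")


def body_font_size(spans: List[Span]) -> float:
    """Assume the body size is the font size that covers the most characters."""
    weight: Counter = Counter()
    for text, size in spans:
        weight[round(size, 1)] += len(text)
    return weight.most_common(1)[0][0]


def size_levels(spans: List[Span], threshold: float = 1.05) -> Dict[float, int]:
    """Map each font size above body * threshold to a level (largest = 1)."""
    body = body_font_size(spans)
    sizes = {round(size, 1) for _, size in spans if size / body > threshold}
    return {size: rank for rank, size in enumerate(sorted(sizes, reverse=True), start=1)}


def numbered_depth(text: str) -> Optional[int]:
    """Infer depth from a numeric prefix: "2.3 Methods" has depth 2."""
    match = NUMBERED.match(text)
    return match.group(1).count(".") + 1 if match else None


spans = [
    ("A Study of Things", 20.0),
    ("2.3 Methods", 14.0),
    ("Body text that makes up most of the characters on the page.", 10.0),
]
print(size_levels(spans))
print(numbered_depth("2.3 Methods"))
```

A naive prefix pattern like this one also matches ordinary lines that start with a number, so a real detector would need additional checks before trusting it.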
After detection, bmrk cleans up the results (removes running page headers, deduplicates, merges chapter labels like Chapter 1 with the title that follows) and writes the final bookmark outline into the output PDF.
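The clean-up pass can be sketched along the same lines. This is an illustration under stated assumptions rather than bmrk's code: headings are hypothetical `(level, page, title)` tuples, a chapter label is merged only when the next heading is on the same page, and the ": " separator is invented for the example. Running-header removal is left out.

```python
import re
from typing import List, Tuple

# Hypothetical heading representation: (level, page, title).
Heading = Tuple[int, int, str]

CHAPTER_LABEL = re.compile(r"^chapter\s+\d+$", re.IGNORECASE)


def clean(headings: List[Heading], max_depth: int = 3) -> List[Heading]:
    """Merge chapter labels, drop adjacent duplicates, filter by depth."""
    out: List[Heading] = []
    i = 0
    while i < len(headings):
        level, page, title = headings[i]
        title = title.strip()
        # Merge a bare "Chapter N" label with the heading that follows it
        # (assumption: only when both sit on the same page).
        if CHAPTER_LABEL.match(title) and i + 1 < len(headings):
            _, next_page, next_title = headings[i + 1]
            if next_page == page:
                title = f"{title}: {next_title.strip()}"
                i += 1  # consume the merged heading
        i += 1
        if level > max_depth:
            continue  # deeper than --max-depth
        if out and out[-1][0] == level and out[-1][2] == title:
            continue  # adjacent duplicate
        out.append((level, page, title))
    return out


raw = [
    (1, 1, "Chapter 1"),
    (2, 1, "Introduction"),
    (2, 3, "Background"),
    (2, 3, "Background"),
    (4, 5, "Fine detail"),
]
for heading in clean(raw):
    print(heading)
```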
flowchart LR
A[PDF] --> B[Extract spans]
B --> C[Pre-process]
C --> D[Detect headings]
D --> E[Clean up]
E --> F[Write bookmarks]
C -.- C1["Skip cover/TOC pages
Exclude headers/footers
Estimate body font size"]
D -.- D1["1. Font size > body size
2. Numbered prefixes
3. Bold/italic at body size"]
E -.- E1["Remove running headers
Deduplicate adjacent titles
Merge chapter labels
Filter by max depth"]
Code structure
src/bmrk/
├── cli.py # Typer CLI entry point
├── detector.py # Heading detection logic and HeadingEntry dataclass
├── bookmarker.py # PDF bookmark writing
Limitations
- Scanned/image PDFs -- bmrk cannot detect headings in PDFs without selectable text. Run OCR first with bmrk --ocr (requires bmrk[ocr]).
- Existing bookmarks -- bmrk replaces any existing outline; it does not merge with pre-existing bookmarks.
Development
uv sync --extra dev
# Lint
uv run ruff check src/
# Test
uv run pytest
Contributing
Contributions are welcome. Submit bug reports and feature requests via GitHub Issues, and open pull requests against main.
Before opening a pull request, run the lint and test suite to confirm nothing is broken:
uv sync --extra dev
uv run ruff check src/
uv run pytest
License
MIT