Automatically create bookmarks in a PDF file
pdf_scout
This CLI tool automatically generates PDF bookmarks (also known as an 'outline' or a 'table of contents') for computer-generated PDF documents.
You can install it for your user account via pip, run it on a PDF, and uninstall it with the following commands:
pip install --user pdf_scout
pdf_scout ./my_document.pdf
pip uninstall pdf_scout
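To process several documents, the single-file invocation above can be wrapped in a short script. The following is a minimal sketch, not part of pdf_scout itself: the helper names `build_command` and `run_on_folder` are hypothetical, and it assumes only what the command above shows, namely that `pdf_scout` is on your PATH and takes one PDF path per call. Whether the tool signals failure through its exit code is not documented here, so treat the returned codes as a hint only.

```python
import subprocess
from pathlib import Path


def build_command(pdf_path):
    """Build the argument list for one pdf_scout invocation (one file per call)."""
    return ["pdf_scout", str(pdf_path)]


def run_on_folder(folder):
    """Run pdf_scout once per PDF in `folder`; return {filename: exit code}.

    Raises FileNotFoundError if the pdf_scout executable is not installed.
    """
    results = {}
    for pdf_path in sorted(Path(folder).glob("*.pdf")):
        # check=False: collect exit codes instead of stopping at the first error.
        completed = subprocess.run(build_command(pdf_path), check=False)
        results[pdf_path.name] = completed.returncode
    return results
```

For example, `run_on_folder("./judgments")` would invoke the tool on every `*.pdf` file directly inside a (hypothetical) `./judgments` folder.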
This project is a work in progress and will likely only generate suitable bookmarks for documents that conform to the following requirements:
- Single column of text (not multiple columns)
- Font size of header text > font size of body text
- Header text is justified or left-aligned
- Paragraph spacing for headers > body text paragraph spacing
- Consistent left margins on every page
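To make these requirements concrete, here is a minimal sketch of how a heading detector could rely on them. It is an illustration only and not pdf_scout's actual implementation: the `Line` record, the `find_headers` function and the tolerance value are all hypothetical, and the gap check is a simplification that compares against the most common line gap rather than the true body paragraph spacing.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass
class Line:
    text: str
    font_size: float  # point size of the line's dominant font
    left: float       # x-coordinate of the line's left edge
    gap_above: float  # vertical distance to the previous line


def find_headers(lines, margin_tolerance=1.0):
    """Return the lines that look like headers under the listed assumptions."""
    if not lines:
        return []
    # Treat the most common font size, gap and left edge as "body text".
    body_size = Counter(l.font_size for l in lines).most_common(1)[0][0]
    body_gap = Counter(l.gap_above for l in lines).most_common(1)[0][0]
    body_left = Counter(l.left for l in lines).most_common(1)[0][0]
    return [
        l for l in lines
        if l.font_size > body_size                       # larger font than body
        and l.gap_above > body_gap                       # more space above
        and abs(l.left - body_left) <= margin_tolerance  # starts at the left margin
    ]


sample = [
    Line("Introduction", 14.0, 72.0, 24.0),
    Line("This is body text.", 11.0, 72.0, 13.0),
    Line("More body text.", 11.0, 72.0, 13.0),
    Line("Background", 14.0, 72.0, 24.0),
    Line("Body again.", 11.0, 72.0, 13.0),
    Line("Still body.", 11.0, 72.0, 13.0),
    Line("Centred caption", 14.0, 200.0, 24.0),
]
headers = [l.text for l in find_headers(sample)]
```

In this sample, "Introduction" and "Background" are selected, while the centred caption is rejected because it does not start at the common left margin. This also shows why multi-column layouts or inconsistent margins would defeat such a heuristic.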
Supported document types
pdf_scout has been tested on and expressly supports the following classes of documents:
- Singapore State Court and Supreme Court Judgments (unreported)
- Singapore Law Reports
- OpenDoc-generated PDFs, such as the State Court Practice Directions 2021 and the Supreme Court Practice Directions 2021
It may support other types of documents as well. If a particular class of document isn't supported or does not work well, please open an issue and I will consider adding support for it.
Development
This project manages its dependencies using poetry and only supports Python ^3.9 (in poetry's caret notation, version 3.9 or later, below 4.0). After installing poetry and entering the project folder, run the following to install the dependencies:
poetry install
To open a virtualenv in the project folder with the dependencies, run:
poetry shell
To run a script directly, run:
poetry run python ./pdf_scout/app.py <INPUT_FILE_PATH>
Tests
There are snapshot tests. Input PDFs are not provided at the moment, so you will have to populate the /pdf folder manually using the relevant sources (you may want to consider using Clerkent to download the unreported versions of judgments). To run the tests:
poetry run pytest
To update the stored snapshots after an intended change in output:
poetry run pytest --snapshot-update
Static type-checking
poetry run mypy pdf_scout/app.py
Tips
- Processing a large PDF can take some time, so to iterate faster when debugging certain behaviour, extract the problematic part of the PDF as a separate file
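One way to extract the problematic pages is sketched below. This is an assumption-laden example rather than anything pdf_scout provides: it uses the third-party pypdf package (installed separately, for example with `pip install pypdf`), and the function names `page_indices` and `extract_pages` are hypothetical.

```python
def page_indices(first_page, last_page, total_pages):
    """Convert a 1-based inclusive page range into 0-based indices,
    clamped to the length of the document."""
    if first_page < 1 or last_page < first_page:
        raise ValueError("expected 1 <= first_page <= last_page")
    return list(range(first_page - 1, min(last_page, total_pages)))


def extract_pages(src, dst, first_page, last_page):
    """Copy pages first_page..last_page of `src` into a new file `dst`."""
    # pypdf is a third-party package; it is imported here so that the
    # helper above stays usable without it.
    from pypdf import PdfReader, PdfWriter

    reader = PdfReader(src)
    writer = PdfWriter()
    for index in page_indices(first_page, last_page, len(reader.pages)):
        writer.add_page(reader.pages[index])
    with open(dst, "wb") as handle:
        writer.write(handle)
```

For instance, `extract_pages("judgment.pdf", "judgment_excerpt.pdf", 40, 45)` would write pages 40 to 45 of a (hypothetical) `judgment.pdf` to a smaller file that can then be passed to pdf_scout. Note that headings on the extracted pages may be classified differently than in the full document if the tool's heuristics depend on document-wide statistics.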